From 0372856f653d6332919a24c277fd2fe16dedcced Mon Sep 17 00:00:00 2001
From: "Gao, Xiang"
Date: Fri, 30 Sep 2022 00:21:00 -0700
Subject: [PATCH] Rebase #1900 (#2009)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* hash update - bug fix for branches (#83865)

Hash updates for xla were failing because the current pinned hash is a branch, so the git command for getting the date couldn't find the branch due to not having a local version of the branch. Fixed by checking out the branch to make sure it exists locally.

Example of failure: https://github.com/pytorch/pytorch/runs/7913835742?check_suite_focus=true

Test plan: made it a pull request trigger and ran it, to get this: https://github.com/pytorch/pytorch/runs/7959221184?check_suite_focus=true

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83865
Approved by: https://github.com/zengk95

* [FSDP] Remove unneeded checks (#83150)

@awgu pointed out these checks aren't really doing anything, as they just make sure we're setting training state in certain ways throughout FSDP, which is sort of arbitrary. So, removing them to avoid confusion.

We still keep the checking around `_post_backward_called` because this is needed in `finalize_params` for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83150
Approved by: https://github.com/awgu

* [BE] Revert distributed change in https://github.com/pytorch/pytorch/pull/68779 (#83181)

https://github.com/pytorch/pytorch/issues/82641 points out a regression in how inputs / outputs are processed by DDP, blocking their HF use case. It was narrowed down to https://github.com/pytorch/pytorch/pull/68779, and reverting the distributed change there fixes the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83181
Approved by: https://github.com/kumpera

* Transpose scheduler small dim sizes better support (#1910)

* Optimize transpose copy on CPU using fbgemm transpose (#83327)

### Description
Optimize transpose copy on CPU using fbgemm transpose.

### Testing
single socket (28 cores):
```
before: torch.Size([10, 128, 10, 124]) -> torch.Size([10, 128, 124, 10]) fp32: 4.819e-05 ms; bf16: 4.846e-05 ms
        torch.Size([10, 128, 30, 124]) -> torch.Size([10, 128, 124, 30]) fp32: 0.000171 ms; bf16: 0.000129 ms

after:  torch.Size([10, 128, 10, 124]) -> torch.Size([10, 128, 124, 10]) fp32: 2.439e-05 ms; bf16: 2.152e-05 ms
        torch.Size([10, 128, 30, 124]) -> torch.Size([10, 128, 124, 30]) fp32: 0.000132 ms; bf16: 3.916e-05 ms
```

single core:
```
before: torch.Size([10, 128, 10, 124]) -> torch.Size([10, 128, 124, 10]) fp32: 0.00109 ms; bf16: 0.00103 ms
        torch.Size([10, 128, 30, 124]) -> torch.Size([10, 128, 124, 30]) fp32: 0.00339 ms; bf16: 0.00295 ms

after:  torch.Size([10, 128, 10, 124]) -> torch.Size([10, 128, 124, 10]) fp32: 0.000566 ms; bf16: 0.000382 ms
        torch.Size([10, 128, 30, 124]) -> torch.Size([10, 128, 124, 30]) fp32: 0.00282 ms; bf16: 0.000999 ms
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83327
Approved by: https://github.com/frank-wei

* Grouped grid welford (#1921)

Enables grouping of grid welford ops across iterations. Same functionality as the iteration grouping for GridReduction. This is intended to improve the outer-norm grid persistence in batchnorm-like fusions.

* [ONNX] Use `errors.SymbolicValueError` for more context (#83332)

Replace runtime errors in torch.onnx with `errors.SymbolicValueError` for more context around jit values.

- Extend `_unimplemented`, `_onnx_unsupported`, `_onnx_opset_unsupported`, `_onnx_opset_unsupported_detailed` errors to include JIT value information
- Replace plain RuntimeError with `errors.SymbolicValueError`
- Clean up: Use `_is_bool` to replace string comparison on jit types
- Clean up: Remove the todo `Remove type ignore after #81112` #77316

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83332
Approved by: https://github.com/AllenTiTaiWang, https://github.com/thiagocrepaldi, https://github.com/BowenBao

* [quant][fx] Add support for quantized matmul (#83885)

Summary: att, probably missed the op during migration to the reference flow

Test Plan: python test/test_quantization.py TestQuantizeFxOps.test_qmatmul

Reviewers:
Subscribers:
Tasks:
Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83885
Approved by: https://github.com/andrewor14

* Misc fixes/tuning for transpose scheduler (#1912)

* [nn] split rnn_utils test from test_nn.py (#83675)

Ref: https://github.com/pytorch/pytorch/issues/63085

Proposed folder structure:
```
-> test
  -> nn
    -> test_conv.py
    -> test_pooling.py
    -> .....
```

This PR: Moves test related RNN utilities to a different file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83675
Approved by: https://github.com/albanD

* [optim] rprop: handle complex params as independent real params (#83858)

Ref #65711

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83858
Approved by: https://github.com/albanD

* [xla hash update] update the pinned xla hash (#83899)

This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned xla hash.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83899
Approved by: https://github.com/pytorchbot

* [ROCm] More Sparse UTs enablement and more hipification mappings. (#78939)

Enables:
test_bmm_cuda_float64
test_bmm_deterministic_cuda_float64
test_csr_matvec_cuda_complex128
test_csr_matvec_cuda_complex64
test_csr_matvec_cuda_float32
test_csr_matvec_cuda_float64

To enable the above tests, some more hip mappings had to be added for the hipification process.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78939
Approved by: https://github.com/pruthvistony, https://github.com/malfet

* Normalize DLPack stride to 1 where shape < 2 (#83158)

Fixes #83069. Also move all the dlpack tests to a new file, `test_dlpack.py`.

The fix involves always allocating a "strides" int array when converting to DLPack and deleting the strides when the capsule destructor is called. Then the strides are copied from the tensor, and `strides[i]` is set to `1` where `shape[i] < 2`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83158
Approved by: https://github.com/ezyang

* Remove DBR quantization from the codebase (#83642)

Summary: DBR quantization is a no-go for now because it does not align well with PyTorch 2.0 plans and we do not want to build yet another tracing system. Deleting it from the codebase for now since there are no plans to develop this in the near future. We can bring it back at a later time if necessary.

Test plan: CI

Differential Revision: [D38839556](https://our.internmc.facebook.com/intern/diff/D38839556)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83642
Approved by: https://github.com/andrewor14, https://github.com/jerryzh168

* Refactored ops on size to be dispatcher ops (#83719)

An example of how the graph looks now.
``` def forward(self, x_1): size = torch.ops.math.size(x_1, 0) size_1 = torch.ops.math.size(x_1, 1); x_1 = None ones = torch.ops.aten.ones.default([1], device = device(type='cpu'), pin_memory = False) expand_sym_int = torch.ops.aten.expand.SymInt(ones, [size, size_1]); ones = size = size_1 = None cos_default = torch.ops.aten.cos.default(expand_sym_int); expand_sym_int = None return (cos_default,) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/83719 Approved by: https://github.com/ezyang * Fix stride issue with faketensors (#83822) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83822 Approved by: https://github.com/ezyang, https://github.com/ngimel * Nullary RNGOp (#1892) * [ROCm] restore MIOpen benchmark flag default to true (#82656) ### Description PR https://github.com/pytorch/pytorch/pull/77438 allowed MIOpen to support the benchmark flag. Previously, the benchmark flag was ignored by MIOpen such that benchmarking was always turned on. This commit restores the behavior that MIOpen benchmarking is by default turned on. ### Testing CI unit tests cover this capability. Torchvision models demonstrate the performance delta. Pull Request resolved: https://github.com/pytorch/pytorch/pull/82656 Approved by: https://github.com/ngimel * Update retry action to latest version (#83911) We're running into EPERM issues when trying to install nvidia tools, see failure example https://github.com/pytorch/pytorch/runs/7975726013?check_suite_focus=true. ``` WARNING: The nvidia-drm module will not be installed. As a result, DRM-KMS will not function with this installation of the NVIDIA driver. /home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1049 throw err; ^ Error: kill EPERM at process.kill (internal/process/per_thread.js:199:13) at killPid (/home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1059:17) at /home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1036:21 at Array.forEach () at /home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1034:23 at Array.forEach () at killAll (/home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1033:27) at /home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1024:13 at ChildProcess.onClose (/home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1080:17) at ChildProcess.emit (events.js:314:20) { errno: 'EPERM', code: 'EPERM', syscall: 'kill' } ``` The root issue probably lies elsewhere but this action is not helping/the errors seem to say it's unable to kill child processes. A more recent commit in that repo uses spawn instead of exec which might make a difference. Regardless, we should keep our actions up to date anyway. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83911 Approved by: https://github.com/malfet * [PyTorch] Remove unused sstream/string includes from c10/macros/Macros.h (#83353) Nothing in the rest of the header seems to use these. 
Differential Revision: [D38672680](https://our.internmc.facebook.com/intern/diff/D38672680/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83353 Approved by: https://github.com/malfet * [functorch] add linalg cross batch rule (#83759) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83759 Approved by: https://github.com/zou3519 * Improve DistanceKernel.cu (#83811) include device_sqrt replace reduce_agg by BlockReduce choose implementation by impl_fptr instead of error-prone copy-and-paste Pull Request resolved: https://github.com/pytorch/pytorch/pull/83811 Approved by: https://github.com/ngimel * reinplace pass: bugfix for output node replacement (#83845) Cleaned up some of the arg replacement logic to use tree_map, so it handles FX nodes that have nested containers. See the added test: when you write a function that returns a list, the `output` node in the FX graph shows up as having `node.args = tuple(immutable_list(...))` Pull Request resolved: https://github.com/pytorch/pytorch/pull/83845 Approved by: https://github.com/ezyang * reinplace pass: special handling for view_scatter ops (#83846) There is already special handling in the reinplacing pass for removing `{view}_scatter` ops, but there is another case that needs special handling. In this code: ``` def f(): a = torch.zeros(4, 4, 4) a[:, 2:] = torch.ones(4, 2, 4) return a ``` Tracing normally with `make_fx()` gives you: ``` def forward(self): zeros = torch.ops.aten.zeros.default([4, 4, 4], device = device(type='cpu'), pin_memory = False) ones = torch.ops.aten.ones.default([4, 2, 4], device = device(type='cpu'), pin_memory = False) slice_tensor = torch.ops.aten.slice.Tensor(zeros, 0, 0, 9223372036854775807) slice_tensor_1 = torch.ops.aten.slice.Tensor(slice_tensor, 1, 2, 9223372036854775807); slice_tensor = None copy__default = torch.ops.aten.copy_.default(slice_tensor_1, ones); slice_tensor_1 = ones = None return zeros ``` Functionalizing it gives you: ``` def forward(self): zeros = torch.ops.aten.zeros.default([4, 4, 4], device = device(type='cpu'), pin_memory = False) ones = torch.ops.aten.ones.default([4, 2, 4], device = device(type='cpu'), pin_memory = False) slice_tensor = torch.ops.aten.slice.Tensor(zeros, 0, 0, 9223372036854775807) slice_tensor_1 = torch.ops.aten.slice.Tensor(slice_tensor, 1, 2, 9223372036854775807); slice_tensor = None slice_tensor_2 = torch.ops.aten.slice.Tensor(zeros, 0, 0, 9223372036854775807) slice_scatter_default = torch.ops.aten.slice_scatter.default(slice_tensor_2, ones, 1, 2, 9223372036854775807); slice_tensor_2 = ones = None slice_scatter_default_1 = torch.ops.aten.slice_scatter.default(zeros, slice_scatter_default, 0, 0, 9223372036854775807); zeros = slice_scatter_default = None return slice_scatter_default_1 ``` Notice that there are not any functional ops to directly re-inplace! What actually happened is that functionalization turned the `copy_()` into a `copy()`, but the out-of-place `copy()` operator gets optimized away because it's a no-op (when the input and output metadata are the same, `out = copy(a, b)` just returns `b`). What we actually want is to replace this line: ``` slice_scatter_default = torch.ops.aten.slice_scatter.default(slice_tensor_2, ones, 1, 2, ...); ``` with this: ``` new_slice = torch.ops.aten.slice.Tensor(slice_tensor_2, 1, 2, ...); _ = torch.ops.aten.copy_.default(new_slice, ones) ``` In the above, we're taking a fresh slice of the "base" tensor, and performing a `copy_()` on the slice, adding back what functionalization removed. 
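For intuition, the slice + `copy_()` pair described above is just the ordinary eager-mode pattern of writing through a view of the base tensor. A minimal sketch of the intended semantics (an illustration only, not code from the re-inplacing pass):

```python
import torch

# Writing through a slice view mutates the base tensor; this is the
# mutation that the slice + copy_() replacement restores in the graph.
a = torch.zeros(4, 4, 4)
new_slice = a[:, 2:]                  # corresponds to aten.slice.Tensor
new_slice.copy_(torch.ones(4, 2, 4))  # corresponds to aten.copy_.default
assert torch.equal(a[:, 2:], torch.ones(4, 2, 4))
```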
We actually need to create a fresh "slice" node, because we're not guaranteed that one already exists in the graph (technically there should be one, but it might have been DCE'd by the time we hit re-inplacing).

I also updated the docs for re-inplacing to more closely match the order of the logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83846
Approved by: https://github.com/ezyang

* Move ATenNVRTC.h include from `jit_utils.h` to `jit_utils.cpp` (#83886)

In general, `.h` files should only include headers that are used in the header.

Fixes https://github.com/pytorch/pytorch/issues/83856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83886
Approved by: https://github.com/ngimel

* Allow None arguments for elementwise type promotion wrapper and fix clamp with None arguments (#83586)

Fixes https://github.com/pytorch/torchdynamo/issues/759

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83586
Approved by: https://github.com/ezyang, https://github.com/ngimel

* Enable NCCL_DESYNC_DEBUG when TORCH_DISTRIBUTED_DEBUG=DETAIL (#83881)

Automatically enable `NCCL_DESYNC_DEBUG` when `TORCH_DISTRIBUTED_DEBUG` is set to `DETAIL`, saving the user from setting two env variables.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83881
Approved by: https://github.com/malfet, https://github.com/rohan-varma, https://github.com/H-Huang

* Strenghten preconditions of linalg.cross (#83798)

This makes `linalg.cross` array API compliant (https://github.com/data-apis/array-api/issues/415) and fixes a few bugs.

Fixes https://github.com/pytorch/pytorch/issues/77629
Fixes https://github.com/pytorch/pytorch/issues/83756

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83798
Approved by: https://github.com/mruberry

* Fix view_func replay in no-grad mode (#83872)

Fixes https://github.com/pytorch/pytorch/issues/83828

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83872
Approved by: https://github.com/albanD

* [vulkan] Add VMA as a third_party subrepo (#83906)

The [VulkanMemoryAllocator](https://github.com/GPUOpen-LibrariesAndSDKs/VulkanMemoryAllocator) is a popular library for GPU memory allocation using Vulkan. The Vulkan backend has a dependency on it, but since it is only a single header file we currently include it by checking it into the repo under [aten/src/ATen/native/vulkan/api/vk_mem_alloc.h](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/vulkan/api/vk_mem_alloc.h). However, it is better to check it in as a third party submodule, since it allows better version tracking.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83906
Approved by: https://github.com/kimishpatel

* [torchgen] Add documentation for `autogen` keyword (#83610)

This is a follow up for #81437. This PR explains which operators can use `autogen` and what will be generated. Also talked about generated kernels and where to find them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83610
Approved by: https://github.com/albanD, https://github.com/bdhirsh

* remove assertEqualIgnoreTypes from test/distributions/test_distributions.py (#83709)

See https://github.com/pytorch/pytorch/issues/38095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83709
Approved by: https://github.com/kit1980

* [fix] edge case in `MaxPool1d` and add ErrorInputs (#83553)

Fixes #83224

cc @kshitij12345 @albanD!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83553
Approved by: https://github.com/albanD

* [complex] conv_transpose1d (#79694)

Reference: https://github.com/pytorch/pytorch/issues/71108

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79694
Approved by: https://github.com/ngimel

* Revert "Strenghten preconditions of linalg.cross (#83798)"

This reverts commit 7f0198e7390eff2f2f5fcb33ce36c99ec3b7f55e.

Reverted https://github.com/pytorch/pytorch/pull/83798 on behalf of https://github.com/janeyx99 due to Sorry, land race caused functorch issues https://hud.pytorch.org/pytorch/pytorch/commit/7f0198e7390eff2f2f5fcb33ce36c99ec3b7f55e

* Fix load_extra_only api for flatbuffers and enable flatbuffers in mobile for OSS properly (#83855)

The `_load_extra_only_for_mobile` API hasn't handled the flatbuffers logic yet. Update the API accordingly. Also found out that the mobile build in OSS doesn't build with flatbuffers; filed task T129996445 to track this.

Differential Revision: [D38890847](https://our.internmc.facebook.com/intern/diff/D38890847/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D38890847/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83855
Approved by: https://github.com/qihqi

* Prefer signal from land checks over PR signals (#83715)

# The problem

When a dev forks their branch from a red master build, their branch can fail CI checks for reasons unrelated to their changes, but the same checks would however pass in the land validation commit (which is rebased off of viable/strict).

Today, in the above scenario the `merge -l` command fails because mergebot sees the failing checks in the PR, which is not helpful when that same check passes in land validation.

# The solution

This PR changes the behavior so that:
1. If both the PR and land validation ran a workflow, only look at the results from land validation
2. If only the PR ran a specific workflow (e.g. for CLA Check or a nightly run) then continue to look at the result from the PR (which matches existing behavior)

### Bonus fixes

It also includes a few extra BE fixes:
- Replaces the tuple we used to pass workflow check results around with a named tuple so that it's easier to tell what data is being used
- Reduces the number of API calls to github by ~50% during merges. Before, we were pulling results from github every time and then filtering them down to the relevant category of checks (e.g. failed/pending/startup_failed). Now, our filters share the check results

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83715
Approved by: https://github.com/zengk95

* Don't introduce new overload for SymInt (#83628)

Previously, we introduced new SymInt overloads for every function we wanted. This led to a lot of boilerplate, and also a lot of confusion about how the overloads needed to be implemented.

This PR takes a simpler but more risky approach: just take the original function and change its ints to SymInts.

This is BC-breaking in the following ways:

* The C++ API for registering implementations for aten operators will change from int64_t to SymInt whenever you make this change. Code generated registrations in PyTorch do not change as codegen handles the translation automatically, but manual registrations will need to follow the change. Typically, if you now accept a SymInt where you previously only took int64_t, you have to convert it back manually. This will definitely break XLA; see the companion PR https://github.com/pytorch/xla/pull/3914

Note that not all dispatch keys get the automatic translation; all the composite keys and Meta keys are modified to take SymInt directly (because they should handle them directly), and so there are adjustments for this.

This is not BC-breaking in the following ways:

* The user facing C++ API remains compatible. Even if a function changes from int to SymInt, the default C++ binding still takes only ints (e.g., at::empty(IntArrayRef, ...)). To call with SymInts, you must call at::empty_symint instead. This involved adding two more signatures to CppSignatureGroup; in many cases I refactored code to iterate over all signatures in the group instead of hard-coding the two that previously existed.
* This is TorchScript compatible; internally we treat SymInts as ints so there is no change to what happens at runtime in TorchScript. In particular, it's OK to reference an empty schema by its old type (using int types); as long as you're not doing string equality (which you shouldn't be), these parse to the same underlying type.

Structure of the PR:

* The general strategy of this PR is that, even when you write `SymInt` inside `native_functions.yaml`, sometimes, we will treat it *as if* it were an `int`. This idea pervades the codegen changes, where we have a translation from SymInt to c10::SymInt or int64_t, and this is controlled by a symint kwarg which I added and then audited all call sites to decide which I wanted. Here are some of the major places where we pick one or the other:
  * The C++ FunctionSchema representation represents `SymInt` as `int`. There are a few places we do need to know that we actually have a SymInt and we consult `real_type()` to get the real type in this case. In particular:
    * When we do schema validation of C++ operator registration, we must compare against the true schema (as the C++ API will provide `c10::SymInt`, and this will only be accepted if the schema is `SymInt`). This is handled with cloneWithRealTypes before we check for schema differences.
    * In `toIValue` argument parsing, we parse against the true schema value. For backwards compatibility reasons, I do still accept ints in many places where Layout/SymInt/etc were expected. (Well, accepting int where SymInt is expected is not BC, it's just the right logic!)
  * In particular, because SymInt never shows up as type() in FunctionSchema, this means that we no longer need a dedicated Tag::SymInt. This is good, because SymInts never show up in mobile anyway.
* Changes to functorch/aten are mostly about tracking changes to the C++ API registration convention. Additionally, since SymInt overloads no longer exist, registrations for SymInt implementations are deleted. In many cases, the old implementations did not properly support SymInts; I did not add any new functionality with this PR, but I did try to annotate with TODOs where this is work to do. Finally, because the signature of the `native::` API changed from int to SymInt, I need to find alternative APIs for people who were directly calling these functions to call. Typically, I insert a new dispatch call when perf doesn't matter, or use the `at::compositeexplicitautograd` namespace to handle other cases.
* The change to `make_boxed_from_unboxed_functor.h` is so that we accept a plain IntList IValue anywhere a SymIntList is expected; these are read-only arguments so covariant typing is OK.
* I change how unboxing logic works slightly. Previously, we interpreted the C++ type for Layout/etc directly as IntType JIT type, which works well because the incoming IValue is tagged as an integer. Now, we interpret the C++ type for Layout as its true type, e.g., LayoutType (change to `jit_type.h`), but then we accept an int IValue for it anyway. This makes it symmetric with SymInt, where we interpret the C++ type as SymIntType, and then accept SymInt and int IValues for it.
* I renamed the `empty.names` overload to `empty_names` to make it less confusing (I kept mixing it up with the real empty overload).
* I deleted the `empty.SymInt` overload, which ended up killing a pile of functions. (This was originally a separate PR but the profiler expect test was giving me grief so I folded it in.)
* I deleted the LazyDynamicOpsTest tests. These were failing after these changes, and I couldn't figure out why they used to be passing: they make use of `narrow_copy` which didn't actually support SymInts; they were immediately converted to ints.
* I bashed LTC into working. The patches made here are not the end of the story. The big problem is that SymInt translates into Value, but what if you have a list of SymInt? This cannot be conveniently represented in the IR today, since variadic Values are not supported. To work around this, I translate SymInt[] into plain int[] (this is fine for tests because LTC dynamic shapes never actually worked); but this will need to be fixed for proper LTC SymInt support. The LTC codegen also looked somewhat questionable; I added comments based on my code reading.

Signed-off-by: Edward Z. Yang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83628
Approved by: https://github.com/albanD, https://github.com/bdhirsh

* Remove CoreMLMemoryObserver (#83703)

Summary: We added this observer to help us diagnose memory issues that have since been resolved. It should be safe to clean this up.

Test Plan: Diff just removed logging, so just build IG and confirm no errors.

Differential Revision: D38843701

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83703
Approved by: https://github.com/mcr229

* ci: Remove dead code related to android uploads (#83930)

These uploads actually never got triggered in nightlies, so removing it altogether. Someone can re-add it in the future if they feel these are important, but I can't find an instance of this running since we migrated, so I have a hard time believing anyone will miss it.
https://hud.pytorch.org/hud/pytorch/pytorch/nightly/1?per_page=50&name_filter=android Signed-off-by: Eli Uriegas Pull Request resolved: https://github.com/pytorch/pytorch/pull/83930 Approved by: https://github.com/atalman, https://github.com/malfet * [fx][pass infra] Adding error catching (#83933) Example: ``` ====================================================================== ERROR: test_pass_manager_error (fx.test_pass_infra.TestPassManager) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Users/angelayi/Projects/pytorch/torch/fx/passes/infra/pass_manager.py", line 285, in __call__ res = fn(module) File "/Users/angelayi/Projects/pytorch/test/fx/test_pass_infra.py", line 164, in pass_fail raise RuntimeError("bad") RuntimeError: bad The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/Users/angelayi/Projects/pytorch/test/fx/test_pass_infra.py", line 170, in test_pass_manager_error pm(traced_m) File "/Users/angelayi/Projects/pytorch/torch/fx/passes/infra/pass_manager.py", line 289, in __call__ raise RuntimeError(msg) from e RuntimeError: An error occured when running the 'pass_fail' pass after the following passes: ['replace_add_with_mul_pass', 'replace_mul_with_div_pass'] ``` Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/83933 Approved by: https://github.com/SherlockNoMad * Back out "Support regex-style matching for Any and Oneof (#82853)" (#83922) Reviewed By: hl475 Differential Revision: D38945806 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83922 Approved by: https://github.com/hl475 * Fix use-dict-literal lint (#83718) Fix use-dict-literal pylint suggestions by changing `dict()` to `{}`. This PR should do the change for every Python file except test/jit/test_list_dict.py, where I think the intent is to test the constructor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83718 Approved by: https://github.com/albanD * Revert "Optimize transpose copy on CPU using fbgemm transpose (#83327)" This reverts commit 04d8da88a6a1abf0da2b11096c85244bf38d3b2a. Reverted https://github.com/pytorch/pytorch/pull/83327 on behalf of https://github.com/weiwangmeta due to breaking internal builds/causing out-of-bounds errors/training accuracy * Add hypothesis to requirements.txt (#83740) Signed-off-by: Edward Z. 
Yang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83740
Approved by: https://github.com/zhxchen17, https://github.com/janeyx99, https://github.com/zou3519

* [fbia] Keep Track of full qualified name before and after remote sharding (#83889)

Summary: track qualname changes in embedding sharding & FX split, and compose target qualname in the end of the FBIA transform stage, so we can use the qualname mapping in the XL materialize stage.

Test Plan: CI/CD

with DISABLE_XLEBB_MATERIALIZATION = True https://fburl.com/fblearner/a8yljbux

with DISABLE_XLEBB_MATERIALIZATION = False https://fburl.com/fblearner/2nvi0dam

Reviewed By: lliu315gt

Differential Revision: D38772525

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83889
Approved by: https://github.com/houseroad

* add merge blocking to ci: sev template (#83940)

As in title, so that by default a ci: sev will block merges. The line can be removed to not block merges.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83940
Approved by: https://github.com/huydhn, https://github.com/janeyx99, https://github.com/malfet, https://github.com/seemethere

* Move nnapi code from ATen common code to specific library (#83748)

Summary: Currently we include nnapi code in all targets using ATen even if it's not used (actually there is no usage and it is being deprecated). Move it to `nnapi_backend_lib` for now.

Test Plan: Sandcastle.

Differential Revision: D38761095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83748
Approved by: https://github.com/salilsdesai, https://github.com/SS-JIA

* Task: T129772171 remove assertEqualIgnoreTypes from test/test_nn.py (#83870)

See https://github.com/pytorch/pytorch/issues/38095. Replaced assertEqualIgnoreType with assertEqual.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83870
Approved by: https://github.com/kit1980

* [Nested Tensor] Make offset copy and move assignment more explicit. (#83488)

Currently the nested tensor construction for the offset_ parameter takes in references and, in the chain of delegation, uses value. This could lead to unnecessary copies. Whenever a nested tensor impl is constructed it should take ownership of all its metadata. The only non-trivially copyable metadata associated with the class is `offsets_`.

The goal of this PR is to make sure that consumers of nested_tensor_impl constructors ensure that they are passing offsets as a temporary - either by explicitly copying a reference, or by constructing the offsets vector in the scope of construction.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83488
Approved by: https://github.com/albanD, https://github.com/bdhirsh

* Remove conj kernels for real dtypes (#80374)

`conj_physical_stub` is currently implemented for all dtypes despite it just being a plain copy for real dtypes. So, instead we should defer to the existing copy kernel in these cases.

On my build for one CUDA architecture, I see a 2.2 MB decrease in `libtorch_cuda.so` size.
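As a quick sanity check of the claim that conjugation is a plain copy for real dtypes, here is a small hedged sketch (an illustration only, not code from the PR):

```python
import torch

# For real dtypes the physical conjugate equals the input, so deferring
# to the copy kernel is observationally equivalent.
x = torch.randn(3, 4)  # float32
assert torch.equal(torch.conj_physical(x), x)

# Complex dtypes still need a real kernel: conjugation negates the imaginary part.
z = torch.randn(3, 4, dtype=torch.complex64)
assert torch.equal(torch.conj_physical(z), torch.complex(z.real, -z.imag))
```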
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80374 Approved by: https://github.com/ngimel, https://github.com/atalman * [BE][CUDA] Use packed_accessor64 (#83949) Not sure why we are ignoring those, but SoftMax.cu alone generates 100+ lines of warnings: ``` /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In function ‘at::Tensor at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::get_offsets(const at::Tensor&, const IntArrayRef&, int64_t)’: /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:261:69: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = long int; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto indices_accessor = indices.packed_accessor(); ^ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax(at::Tensor&, const at::Tensor&, int64_t) [with scalar_t = double; bool LogSoftMax = false; int64_t = long int]’: /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:607:924: required from here /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:423:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto values_accessor = values_2.packed_accessor(); ^~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:426:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto out_values_accessor = out_values_2.packed_accessor(); ^~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax(at::Tensor&, const at::Tensor&, int64_t) [with scalar_t = float; bool LogSoftMax = false; int64_t = long int]’: /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:607:1677: required from here /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:423:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead 
[-Wdeprecated-declarations] auto values_accessor = values_2.packed_accessor(); ^~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:426:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto out_values_accessor = out_values_2.packed_accessor(); ^~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax(at::Tensor&, const at::Tensor&, int64_t) [with scalar_t = double; bool LogSoftMax = true; int64_t = long int]’: /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:623:927: required from here /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:423:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto values_accessor = values_2.packed_accessor(); ^~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:426:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto out_values_accessor = out_values_2.packed_accessor(); ^~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax(at::Tensor&, const at::Tensor&, int64_t) [with scalar_t = float; bool LogSoftMax = true; int64_t = long int]’: /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:623:1679: required from here /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:423:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto values_accessor = values_2.packed_accessor(); ^~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here 
GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:426:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto out_values_accessor = out_values_2.packed_accessor(); ^~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax_backward(at::Tensor&, const at::Tensor&, const at::Tensor&, int64_t, c10::ScalarType) [with scalar_t = double; bool LogSoftMax = false; int64_t = long int]’: /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:641:977: required from here /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:542:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto values_accessor = values_2.packed_accessor(); ^~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:545:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto out_values_accessor = out_values_2.packed_accessor(); ^~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:548:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto grad_values_accessor = grad_values_2.packed_accessor(); ^~~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax_backward(at::Tensor&, const at::Tensor&, const at::Tensor&, int64_t, c10::ScalarType) [with scalar_t = float; bool LogSoftMax = false; int64_t = long int]’: /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:641:1775: required from here 
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:542:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto values_accessor = values_2.packed_accessor(); ^~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:545:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto out_values_accessor = out_values_2.packed_accessor(); ^~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:548:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto grad_values_accessor = grad_values_2.packed_accessor(); ^~~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax_backward(at::Tensor&, const at::Tensor&, const at::Tensor&, int64_t, c10::ScalarType) [with scalar_t = double; bool LogSoftMax = true; int64_t = long int]’: /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:661:980: required from here /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:542:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto values_accessor = values_2.packed_accessor(); ^~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:545:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto out_values_accessor = out_values_2.packed_accessor(); ^~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here 
GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:548:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto grad_values_accessor = grad_values_2.packed_accessor(); ^~~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax_backward(at::Tensor&, const at::Tensor&, const at::Tensor&, int64_t, c10::ScalarType) [with scalar_t = float; bool LogSoftMax = true; int64_t = long int]’: /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:661:1777: required from here /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:542:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto values_accessor = values_2.packed_accessor(); ^~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:545:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto out_values_accessor = out_values_2.packed_accessor(); ^~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:548:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto grad_values_accessor = grad_values_2.packed_accessor(); ^~~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘std::tuple at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::compute_pool_max(const at::Tensor&, const at::Tensor&, const IntArrayRef&, int64_t, int64_t) [with scalar_t = double; bool requireMxRows = true; at::IntArrayRef = c10::ArrayRef; int64_t = long int]’: /tmp/tmpxft_000040e0_00000000-6_SoftMax.cudafe1.stub.c:16:557: required from here 
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:347:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto values_accessor = ^~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘std::tuple at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::compute_pool_max(const at::Tensor&, const at::Tensor&, const IntArrayRef&, int64_t, int64_t) [with scalar_t = float; bool requireMxRows = true; at::IntArrayRef = c10::ArrayRef; int64_t = long int]’: /tmp/tmpxft_000040e0_00000000-6_SoftMax.cudafe1.stub.c:18:556: required from here /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:347:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto values_accessor = ^~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘std::tuple at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::compute_pool_max(const at::Tensor&, const at::Tensor&, const IntArrayRef&, int64_t, int64_t) [with scalar_t = double; bool requireMxRows = false; at::IntArrayRef = c10::ArrayRef; int64_t = long int]’: /tmp/tmpxft_000040e0_00000000-6_SoftMax.cudafe1.stub.c:20:557: required from here /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:347:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto values_accessor = ^~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘std::tuple at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::compute_pool_max(const at::Tensor&, const at::Tensor&, const IntArrayRef&, int64_t, int64_t) [with scalar_t = float; bool requireMxRows = false; at::IntArrayRef = c10::ArrayRef; int64_t = long int]’: /tmp/tmpxft_000040e0_00000000-6_SoftMax.cudafe1.stub.c:21:556: required from here /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:347:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto values_accessor = 
^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
GenericPackedTensorAccessor packed_accessor() const & {
^ ~~~~~~~~~~~~~
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83949
Approved by: https://github.com/ngimel

* Support returning symbolic strides from t.stride() in Python (#83842)

Signed-off-by: Edward Z. Yang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83842
Approved by: https://github.com/albanD, https://github.com/Chillee, https://github.com/bdhirsh

* Support the XPU backend untyped storage (#83952)

Simply adds the XPU backend to the untyped torch storage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83952
Approved by: https://github.com/ezyang

* Support NCCL Premul Sum (#81272)

This PR adds the support for https://docs.nvidia.com/deeplearning/nccl/archives/nccl_21212/user-guide/docs/api/ops.html?highlight=premul#c.ncclRedOpCreatePreMulSum.

The major changes include
- convert enum ReduceOp to struct
- add premul sum specific paths to init.cpp and Ops.cpp.

note:
- For pip wheels / conda binaries to support this, ~~I think https://github.com/pytorch/pytorch/pull/79132 would be needed~~ https://github.com/pytorch/pytorch/pull/82775 landed

The commit titled "add nccl premul" whose current hash is https://github.com/pytorch/pytorch/pull/81272/commits/cb99ad67447b5899ecf8c4c3d78deaafa1cc09b8 was authored by @mcarilli and @ptrblck.

cc @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81272
Approved by: https://github.com/kwen2501

* Test type promotion assertignoretypes (#83867)

See #38095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83867
Approved by: https://github.com/kit1980, https://github.com/mruberry

* [Profiler] record nn.Module's parameters (#83209)

Summary: Record nn.Module's parameters for detailed memory profiling:
- extend 'module_' in value cache & NNModuleInfo to save parameters
- python binding and unit test case

Test Plan: buck run mode/opt //caffe2/test:profiler -- -r test_nnmodule

Differential Revision: D38379717

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83209
Approved by: https://github.com/robieta

* [xla hash update] update the pinned xla hash (#83967)

This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned xla hash.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83967
Approved by: https://github.com/pytorchbot

* Fix `ir_utils::hasBlockSync` + misc fixes in transpose scheduler (#1924)

* Fix LTC build warnings (#83955)

Addresses the `Wc++98-compat-extra-semi` warning from https://github.com/llvm/torch-mlir/issues/1264 by removing the extraneous semicolon after autogen LTC native function definitions.

```
/home/runner/work/torch-mlir/torch-mlir/build/tools/torch-mlir/python/torch_mlir/csrc/base_lazy_backend/generated/LazyNativeFunctions.cpp:4241:6: warning: extra ';' outside of a function is incompatible with C++98 [-Wc++98-compat-extra-semi]
};
 ^
```

cc: @wconstab @desertfire @ke1337 @antoniojkim

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83955
Approved by: https://github.com/wconstab

* Strenghten preconditions of linalg.cross (#83798)

This makes `linalg.cross` array API compliant (https://github.com/data-apis/array-api/issues/415) and fixes a few bugs. A small usage sketch follows below.
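A hedged usage sketch of the strengthened contract (based on the array API issue linked above; the shapes here are illustrative): the inputs must broadcast against each other, and the dimension being reduced, `dim` (default `-1`), must have size exactly 3.

```python
import torch

# Cross product along the last dimension, which must have size 3;
# the remaining dimensions broadcast.
a = torch.randn(4, 3)
b = torch.randn(1, 3)            # broadcasts against `a`
out = torch.linalg.cross(a, b)   # shape (4, 3)
assert out.shape == (4, 3)

# With the stricter preconditions, a non-size-3 dimension is rejected up front.
try:
    torch.linalg.cross(torch.randn(4, 2), torch.randn(4, 2))
except RuntimeError as err:
    print("rejected:", err)
```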
Fixes https://github.com/pytorch/pytorch/issues/77629 Fixes https://github.com/pytorch/pytorch/issues/83756 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83798 Approved by: https://github.com/mruberry * Make linalg.inv composite of linalg.solve (#80074) The `getri` kernel calls inside `getrs` so we can do so explicitly ourselves and save ourselves from having to maintain an extra kernel. This way we just need to optimise `lu_factor` and `lu_solve` and `inv` will be as efficient as it can be, as it'll be choosing the best backend to perform the factorisation and the best backend (not necessarily the same) to perform the solve. Fixes https://github.com/pytorch/pytorch/issues/77498 The benchmarks: https://github.com/pytorch/pytorch/pull/80074#issuecomment-1164309071 Pull Request resolved: https://github.com/pytorch/pytorch/pull/80074 Approved by: https://github.com/IvanYashchuk, https://github.com/albanD, https://github.com/malfet * Support a stable double backward on linalg.det for real inputs (#80217) The complex case still fails. I do not know why. Fixes https://github.com/pytorch/pytorch/issues/62327 Fixes https://github.com/pytorch/pytorch/issues/53364 Pull Request resolved: https://github.com/pytorch/pytorch/pull/80217 Approved by: https://github.com/nikitaved, https://github.com/albanD, https://github.com/malfet * [LTC] Add custom lazy tensor save function (#83294) We need a custom `save` function for checkpointing a lazy model, similar to what exists in PyTorch/XLA: https://github.com/pytorch/xla/blob/3eb8a9d9eb4ebb0b064461c3704650241625654e/torch_xla/core/xla_model.py#L994 The purpose of this function is to move any lazy tensors to CPU before saving the checkpoint. The way I implemented it was to create a general structure visitor, adapted from a function that we use quite often in Cerebras internal repositories. If there is a better tool already available in PyTorch that does the same things, I'm open to suggestions. CC: @wconstab @Krovatkin @JackCaoG Pull Request resolved: https://github.com/pytorch/pytorch/pull/83294 Approved by: https://github.com/wconstab * move pooling test from test_nn to test/nn/test_pooling (#83915) Ref #63085 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83915 Approved by: https://github.com/albanD * [ONNX] Remove static None graph output (#82623) Fixes #82370 * Unify the export behavior regarding static None outputs. These are dropped for both traced graph and TorchScript graph export. * `Optional` outputs are not affected. Fixes #82370 Pull Request resolved: https://github.com/pytorch/pytorch/pull/82623 Approved by: https://github.com/AllenTiTaiWang, https://github.com/abock * [TorchTidy Fix] Don't try to collect strides for non-strided tensors (#83935) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83935 Approved by: https://github.com/robieta, https://github.com/slgong-fb * [WIP] Validating input_col for certain datapipes (#80267) Follow up from #79344. Currently WIP due to multiple test failures. 
Waiting for #80140 to land Pull Request resolved: https://github.com/pytorch/pytorch/pull/80267 Approved by: https://github.com/ejguan * support more symintnode operations (#83877) remove debug code Pull Request resolved: https://github.com/pytorch/pytorch/pull/83877 Approved by: https://github.com/ezyang * add arithmetic ops (#83878) arithmetic ops tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/83878 Approved by: https://github.com/ezyang * logical ops (#83879) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83879 Approved by: https://github.com/ezyang * strip SymIntNodes off in the mobile builds (#83938) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83938 Approved by: https://github.com/ezyang * [pthreadpool] Cap max thread count to fix TSAN issues (#83950) Summary: Cap the thread count to 64 unconditionally to solve this tsan issue which leads to harder to debug, flaky test failures. Test Plan: CI Reviewed By: kimishpatel Differential Revision: D38136212 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83950 Approved by: https://github.com/kimishpatel * Skip NCCL slimming for cxx11 libtorch builds (#83959) Fixes https://github.com/pytorch/pytorch/issues/83887 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83959 Approved by: https://github.com/atalman * add hud link to merge failure message (#83946) as in title, related to https://github.com/pytorch/test-infra/issues/568 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83946 Approved by: https://github.com/huydhn * Check all CUDA API calls for errors in benchmarks/cpp/nvfuser (#74920) (#81817) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74920 Test Plan: Sandcastle Differential Revision: D35194656 Pull Request resolved: https://github.com/pytorch/pytorch/pull/81817 Approved by: https://github.com/malfet * [frontend] Fix tensor list alias annotation (#84005) For issue https://github.com/pytorch/pytorch/issues/77920 and a retry of https://github.com/pytorch/pytorch/pull/83921 The current logic checks alias info before `[]` and after. If no alias info exists after `[]`, we overwrite the alias info before. This logic failed on argument like `Tensor(a!)[]`, dropping the alias info before `[]` on the floor. This PR adds a new alias info if it's missing after `[]`. This way we can keep the alias info before `[]`. 
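A small, purely illustrative check of the schema handling described above, assuming the `torch._C.parse_schema` binding and using `_foreach_zero_` as an example of a mutable tensor-list argument:

```python
import torch

# A schema whose tensor-list argument carries an alias annotation before the brackets.
schema = torch._C.parse_schema("_foreach_zero_(Tensor(a!)[] self) -> ()")

# The (a!) write annotation on the list should survive round-tripping rather than
# being dropped on the floor.
print(schema)
```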
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84005 Approved by: https://github.com/cccclai, https://github.com/bdhirsh * Suppress Anomaly mode warning message (#83966) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83966 Approved by: https://github.com/albanD * Support BF16 for fast layernorm (#83971) Fixes #83970 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83971 Approved by: https://github.com/ngimel * Map new CUDA error handling to HIP (#75032) (#83953) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/75032 Test Plan: Sandcastle Reviewed By: ezyang, malfet Differential Revision: D35253785 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83953 Approved by: https://github.com/ezyang, https://github.com/malfet * Improve Normalization.cuh (#83871) remove unused Ops replaced copy-and-paste by calling BlockReduce (+SumReduceOp +2D block indexing) and removing duplicate warpSum Pull Request resolved: https://github.com/pytorch/pytorch/pull/83871 Approved by: https://github.com/ngimel * Check all CUDA API calls for errors in test/ (#74921) (#83954) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74921 Test Plan: Sandcastle Reviewed By: ezyang, malfet, ngimel Differential Revision: D35194966 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83954 Approved by: https://github.com/ezyang * remove duplicate WarpReduceSum (#83757) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83757 Approved by: https://github.com/ngimel * Set python build-docs timeout to 30 minutes and cpp build-docs timeout to 180 minutes (#83957) Anything more means there's something wrong and we should just return. AFAIK the timeout doesn't include queuing time, only the job duration https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idtimeout-minutes ![Screen Shot 2022-08-23 at 18 31 57](https://user-images.githubusercontent.com/475357/186298046-5637384f-887c-4c6a-a946-c101b6c66741.png) This will help avoid having python build docs timeout after 6 hours. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83957 Approved by: https://github.com/ZainRizvi * [ROCm] Enable test_multiprocessing tests (#82356) Signed-off-by: Jagadish Krishnamoorthy Issue fixed in ROCm 5.2 user space. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82356 Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/huydhn * Pin conda to 4.13.0 (#83991) Recent update to conda 4.14.0 caused breakages in our docker builds: https://hud.pytorch.org/pytorch/pytorch/commit/754d7f05b6841e555cea5a4b2c505dd9e0baec1d This pins to prevent the errors: ``` Traceback (most recent call last): 2022-08-24T16:20:49.2412247Z File "/opt/conda/lib/python3.9/site-packages/conda/exceptions.py", line 1125, in __call__ 2022-08-24T16:20:49.2413036Z File "/opt/conda/lib/python3.9/site-packages/conda/cli/main.py", line 86, in main_subshell 2022-08-24T16:20:49.2413615Z File "/opt/conda/lib/python3.9/site-packages/conda/cli/conda_argparse.py", line 93, in do_call 2022-08-24T16:20:49.2414282Z File "/opt/conda/lib/python3.9/site-packages/conda/notices/core.py", line 75, in wrapper 2022-08-24T16:20:49.2415036Z File "/opt/conda/lib/python3.9/site-packages/conda/notices/core.py", line 39, in display_notices 2022-08-24T16:20:49.2415853Z File "/opt/conda/lib/python3.9/site-packages/conda/notices/http.py", line 36, in get_notice_responses 2022-08-24T16:20:49.2416661Z File "/opt/conda/lib/python3.9/site-packages/conda/notices/http.py", line 39, in 2022-08-24T16:20:49.2417399Z File "/opt/conda/lib/python3.9/concurrent/futures/_base.py", line 609, in result_iterator 2022-08-24T16:20:49.2418145Z File "/opt/conda/lib/python3.9/concurrent/futures/_base.py", line 446, in result 2022-08-24T16:20:49.2418831Z File "/opt/conda/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result 2022-08-24T16:20:49.2419543Z File "/opt/conda/lib/python3.9/concurrent/futures/thread.py", line 58, in run 2022-08-24T16:20:49.2420292Z File "/opt/conda/lib/python3.9/site-packages/conda/notices/http.py", line 42, in 2022-08-24T16:20:49.2421070Z File "/opt/conda/lib/python3.9/site-packages/conda/notices/cache.py", line 37, in wrapper 2022-08-24T16:20:49.2421712Z File "/opt/conda/lib/python3.9/site-packages/conda/notices/http.py", line 58, in get_channel_notice_response 2022-08-24T16:20:49.2422258Z File "/opt/conda/lib/python3.9/site-packages/requests/sessions.py", line 600, in get 2022-08-24T16:20:49.2422801Z File "/opt/conda/lib/python3.9/site-packages/requests/sessions.py", line 587, in request 2022-08-24T16:20:49.2423226Z File "/opt/conda/lib/python3.9/site-packages/requests/sessions.py", line 701, in send 2022-08-24T16:20:49.2423634Z File "/opt/conda/lib/python3.9/site-packages/requests/adapters.py", line 460, in send 2022-08-24T16:20:49.2424239Z File "/opt/conda/lib/python3.9/site-packages/requests/adapters.py", line 263, in cert_verify 2022-08-24T16:20:49.2424731Z OSError: Could not find a suitable TLS CA certificate bundle, invalid path: /opt/conda/lib/python3.9/site-packages/certifi/cacert.pem 2022-08-24T16:20:49.2424967Z 2022-08-24T16:20:49.2425110Z During handling of the above exception, another exception occurred: 2022-08-24T16:20:49.2425279Z 2022-08-24T16:20:49.2425377Z Traceback (most recent call last): 2022-08-24T16:20:49.2425610Z File "/opt/conda/bin/conda", line 13, in 2022-08-24T16:20:49.2425845Z sys.exit(main()) 2022-08-24T16:20:49.2426176Z File "/opt/conda/lib/python3.9/site-packages/conda/cli/main.py", line 129, in main 2022-08-24T16:20:49.2426614Z File "/opt/conda/lib/python3.9/site-packages/conda/exceptions.py", line 1413, in conda_exception_handler 2022-08-24T16:20:49.2427054Z File "/opt/conda/lib/python3.9/site-packages/conda/exceptions.py", line 1128, in __call__ 
2022-08-24T16:20:49.2427555Z File "/opt/conda/lib/python3.9/site-packages/conda/exceptions.py", line 1170, in handle_exception
2022-08-24T16:20:49.2427995Z File "/opt/conda/lib/python3.9/site-packages/conda/exceptions.py", line 1181, in handle_unexpected_exception
2022-08-24T16:20:49.2428471Z File "/opt/conda/lib/python3.9/site-packages/conda/exceptions.py", line 1251, in print_unexpected_error_report
2022-08-24T16:20:49.2428873Z ModuleNotFoundError: No module named 'conda.cli.main_info'
2022-08-24T16:20:55.5428691Z The command '/bin/sh -c bash ./install_conda.sh && rm install_conda.sh' returned a non-zero code: 1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83991
Approved by: https://github.com/malfet

* Deletes CCACHE_DISABLE and SCCACHE_DISABLE from nccl.cmake (#84007)

Looking through the code and online, it does not look like these variables actually change anything. Regardless, this change was instituted to fix https://github.com/pytorch/pytorch/issues/13362, but we are again running into similar issues even with the workaround: see https://github.com/pytorch/pytorch/issues/83790.

Thus, since
1. this change isn't preventing flakiness, and
2. these variables do not seem to be used anywhere in pytorch/pytorch nor mozilla/sccache,
we should remove this confusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84007
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/ZainRizvi

* Named pipe based watchdog timer (#83695)

Summary: This diff implements a named pipe based watchdog timer (`FileTimerClient` and `FileTimerServer`). This is similar to the existing `LocalTimerClient` and `LocalTimerServer` (https://fburl.com/code/j4b9pyya).

The motivation comes from the need to handle various timeout issues. The training process occasionally gets stuck, so we need a proper watchdog to monitor the liveness of the training processes. This timer allows the TorchElastic agent (as the watchdog) to monitor the progress of the training processes that it spawned. If a timeout occurs, the TorchElastic agent can take action to kill the stuck process and create a core dump for it.

`LocalTimerClient` and `LocalTimerServer` require a `multiprocessing.Queue()` to work, so they can only be used between `multiprocessing` parent and child processes. `FileTimerClient` and `FileTimerServer` do not have such a limitation.
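For context, here is a minimal sketch of the existing queue-based timer pattern that the file-based variant mirrors; the `FileTimer*` classes are expected to follow the same configure/expires flow with a named-pipe path in place of the queue, though exact signatures may differ:

```python
import multiprocessing as mp

import torch.distributed.elastic.timer as timer


def worker(queue):
    # Each worker registers a client and guards suspect sections with a deadline.
    timer.configure(timer.LocalTimerClient(queue))
    with timer.expires(after=60):
        pass  # work that must finish within 60 seconds


if __name__ == "__main__":
    q = mp.Queue()
    server = timer.LocalTimerServer(q, max_interval=0.25)
    server.start()  # the watchdog reaps workers whose timers expire

    p = mp.Process(target=worker, args=(q,))
    p.start()
    p.join()
    server.stop()
```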
Test Plan: ### Unit Test ``` buck test mode/opt caffe2/test/distributed/elastic/timer:file_based_timer_test ``` ``` RemoteExecution session id: reSessionID-06d70a77-043c-4d9d-b0f2-94c24460740a-tpx Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/844425186732666 ✓ ListingSuccess: caffe2/test/distributed/elastic/timer:file_based_timer_test : 12 tests discovered (2.177) ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_happy_path (file_based_local_timer_test.FileTimerTest) (2.463) ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_expired_timers (file_based_local_timer_test.FileTimerServerTest) (1.889) ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_send_request_release (file_based_local_timer_test.FileTimerServerTest) (1.700) ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_valid_timers (file_based_local_timer_test.FileTimerServerTest) (1.873) ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_watchdog_call_count (file_based_local_timer_test.FileTimerServerTest) (1.715) ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_watchdog_empty_queue (file_based_local_timer_test.FileTimerServerTest) (1.609) ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_exception_propagation (file_based_local_timer_test.FileTimerTest) (1.633) ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_multiple_clients_interaction (file_based_local_timer_test.FileTimerTest) (2.189) ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_get_timer_recursive (file_based_local_timer_test.FileTimerTest) (2.295) ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_no_client (file_based_local_timer_test.FileTimerTest) (1.753) ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_timer (file_based_local_timer_test.FileTimerTest) (2.151) ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_client_interaction (file_based_local_timer_test.FileTimerTest) (1.895) Summary Pass: 12 ListingSuccess: 1 Finished test run: https://www.internalfb.com/intern/testinfra/testrun/844425186732666 ``` Differential Revision: D38604238 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83695 Approved by: https://github.com/d4l3k * Enhance add_out_dense_sparse_cpu for hybrid sparse tensor (#23057) This is to improve the performance for hybrid sparse coo tensor on CPU path. This case is appeared at the DLRM terabyte test. With this fix, according to the previous performance test data, it got ~10x performance improvement on DLRM execution. 
without this, the DLRM will run as Finished training it 100/1000 of epoch 0, 2969.25 ms/it, loss 0.220505, accuracy 0.000 % with this, the DLRM will run as Finished training it 100/1000 of epoch 0, 270.71 ms/it, loss 0.220505, accuracy 0.000 % Pull Request resolved: https://github.com/pytorch/pytorch/pull/23057 Approved by: https://github.com/VitalyFedyunin, https://github.com/malfet * Pretty print stack trace with gm.print_readable() (#83706) Precondition: https://github.com/pytorch/torchdynamo/pull/899 Given following function ``` def my_relu(a): return a.relu() def func(a, b): d = torch.square(a + b) e = my_relu(d) f = d.sin() s = torch.stack([e, f]) s = s.sum() ``` Here are the possible result with various tracing frontend: dynamo, symbolic_trace, make_fx - joint graph with torchdynamo.optimize("aot_nop") Notice that it has a special stack for gradient addition node (for multiple uses of tensor) in backward Notice that "No stacktrace found for following nodes" are shown for nodes with stacktrace ``` def forward(self, primals, tangents): primals_1, primals_2, tangents_1, = fx_pytree.tree_flatten_spec([primals, tangents], self._in_spec) # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 41, in func, d = torch.square(a + b) add_tensor = torch.ops.aten.add.Tensor(primals_1, primals_2); primals_1 = primals_2 = None pow_tensor_scalar = torch.ops.aten.pow.Tensor_Scalar(add_tensor, 2) # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 38, in my_relu, return a.relu() relu_default = torch.ops.aten.relu.default(pow_tensor_scalar) detach_default = torch.ops.aten.detach.default(relu_default) # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 43, in func, f = d.sin() sin_default = torch.ops.aten.sin.default(pow_tensor_scalar) # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 44, in func, s = torch.stack([e, f]) stack_default = torch.ops.aten.stack.default([relu_default, sin_default]); relu_default = sin_default = None # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 45, in func, s = s.sum() sum_default = torch.ops.aten.sum.default(stack_default); stack_default = None # No stacktrace found for following nodes is_same_size_default = torch.ops.aten.is_same_size.default(sum_default, tangents_1) # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 45, in func, s = s.sum() expand_default = torch.ops.aten.expand.default(tangents_1, [2, 10, 10]); tangents_1 = None # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 44, in func, s = torch.stack([e, f]) unbind_int = torch.ops.aten.unbind.int(expand_default); expand_default = None getitem = unbind_int[0] getitem_1 = unbind_int[1]; unbind_int = None # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 43, in func, f = d.sin() cos_default = torch.ops.aten.cos.default(pow_tensor_scalar); pow_tensor_scalar = None mul_tensor = torch.ops.aten.mul.Tensor(getitem_1, cos_default); getitem_1 = cos_default = None # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 38, in my_relu, return a.relu() detach_default_1 = torch.ops.aten.detach.default(detach_default); detach_default = None threshold_backward_default = torch.ops.aten.threshold_backward.default(getitem, detach_default_1, 0); getitem = detach_default_1 = None # Gradient addition node due to mulitple use of tensor around:, File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 38, in my_relu, return a.relu() add_tensor_1 = torch.ops.aten.add.Tensor(mul_tensor, threshold_backward_default); mul_tensor = threshold_backward_default = None # 
File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 41, in func, d = torch.square(a + b) pow_tensor_scalar_1 = torch.ops.aten.pow.Tensor_Scalar(add_tensor, 1.0); add_tensor = None mul_scalar = torch.ops.aten.mul.Scalar(pow_tensor_scalar_1, 2.0); pow_tensor_scalar_1 = None mul_tensor_1 = torch.ops.aten.mul.Tensor(add_tensor_1, mul_scalar); add_tensor_1 = mul_scalar = None sum_sym_int = torch.ops.aten.sum.SymInt(mul_tensor_1, [0], True) view_sym_int = torch.ops.aten.view.SymInt(sum_sym_int, [10]); sum_sym_int = None return pytree.tree_unflatten([sum_default, mul_tensor_1, view_sym_int], self._out_spec) ``` - default symbolic_trace Notice that nodes without stacktrace are folded under same region ``` def forward(self, a, b): # No stacktrace found for following nodes add = a + b; a = b = None square = torch.square(add); add = None relu = square.relu() sin = square.sin(); square = None stack = torch.stack([relu, sin]); relu = sin = None sum_1 = stack.sum(); stack = None return sum_1 ``` - symbolic_trace with record_stack_traces=True ``` def forward(self, a, b): # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 41, in func, d = torch.square(a + b) add = a + b; a = b = None square = torch.square(add); add = None # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 38, in my_relu, return a.relu() relu = square.relu() # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 43, in func, f = d.sin() sin = square.sin(); square = None # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 44, in func, s = torch.stack([e, f]) stack = torch.stack([relu, sin]); relu = sin = None # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 45, in func, s = s.sum() sum_1 = stack.sum(); stack = None return sum_1 ``` - make_fx without decomposition ``` def forward(self, a_1, b_1): # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 41, in func, d = torch.square(a + b) add_tensor = torch.ops.aten.add.Tensor(a_1, b_1); a_1 = b_1 = None pow_tensor_scalar = torch.ops.aten.pow.Tensor_Scalar(add_tensor, 2); add_tensor = None # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 38, in my_relu, return a.relu() relu_default = torch.ops.aten.relu.default(pow_tensor_scalar) detach_default = torch.ops.aten.detach.default(relu_default) # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 43, in func, f = d.sin() sin_default = torch.ops.aten.sin.default(pow_tensor_scalar); pow_tensor_scalar = None # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 44, in func, s = torch.stack([e, f]) stack_default = torch.ops.aten.stack.default([relu_default, sin_default]); relu_default = sin_default = None # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 45, in func, s = s.sum() sum_default = torch.ops.aten.sum.default(stack_default); stack_default = None return sum_default ``` - make_fx with decomposition to prims ``` def forward(self, a_1, b_1): # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 41, in func, d = torch.square(a + b) broadcast_in_dim_default = torch.ops.prims.broadcast_in_dim.default(b_1, [10, 10], [1]); b_1 = None add_default = torch.ops.prims.add.default(a_1, broadcast_in_dim_default); a_1 = broadcast_in_dim_default = None mul_default = torch.ops.prims.mul.default(add_default, add_default); add_default = None # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 38, in my_relu, return a.relu() le_default = torch.ops.prims.le.default(mul_default, 0.0) where_default = torch.ops.prims.where.default(le_default, 0.0, mul_default); 
le_default = None # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 43, in func, f = d.sin() sin_default = torch.ops.prims.sin.default(mul_default); mul_default = None # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 44, in func, s = torch.stack([e, f]) cat_default = torch.ops.prims.cat.default([where_default, sin_default], 0); where_default = sin_default = None split_dim_default = torch.ops.prims.split_dim.default(cat_default, 0, 2); cat_default = None # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 45, in func, s = s.sum() convert_element_type_default = torch.ops.prims.convert_element_type.default(split_dim_default, torch.float32); split_dim_default = None sum_default = torch.ops.prims.sum.default(convert_element_type_default, [0, 1, 2]); convert_element_type_default = None return sum_default ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/83706 Approved by: https://github.com/Chillee, https://github.com/ezyang * Add comments for block_reduce.cuh (#83825) ~~Add warning for the BlockReduce result Remove redundant __syncthreads~~ Add comments for BlockReduce Pull Request resolved: https://github.com/pytorch/pytorch/pull/83825 Approved by: https://github.com/ngimel * Add docstring type guidelines for list & tuple to `CONTRIBUTING.md` (#83634) Minor followup to: https://github.com/pytorch/pytorch/pull/83536 For Google style docstrings, `list` and `tuple` should be completely lowercase. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83634 Approved by: https://github.com/ngimel * use condensed disabled tests file (#84017) follow up to https://github.com/pytorch/test-infra/pull/545 then we can get rid of the non condensed version Pull Request resolved: https://github.com/pytorch/pytorch/pull/84017 Approved by: https://github.com/huydhn, https://github.com/janeyx99 * Revert "Make linalg.inv composite of linalg.solve (#80074)" This reverts commit 4737b3361479f4104efaa3bfa2ea517eaacb60fb. Reverted https://github.com/pytorch/pytorch/pull/80074 on behalf of https://github.com/malfet due to Depends on the changes from https://github.com/pytorch/pytorch/pull/83628 * Revert "[xla hash update] update the pinned xla hash (#83967)" This reverts commit ce7a9f92e30b93ab6efff4135be005c9afd0533a. Reverted https://github.com/pytorch/pytorch/pull/83967 on behalf of https://github.com/malfet due to Depends on the changes from https://github.com/pytorch/pytorch/pull/83628 * Revert "Don't introduce new overload for SymInt (#83628)" This reverts commit 8fae7027b399e65e6071d335aa874497682c84d0. Reverted https://github.com/pytorch/pytorch/pull/83628 on behalf of https://github.com/malfet due to breaking internal builds, see https://www.internalfb.com/diff/D38984222 * [Quant] Vectorize scalar remainder in quantized kernel for normalization (#79673) ## Description This PR improves performance of quantized kernel for normalize by vectorizing scalar remainder. In the current implementation [here](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/quantized/cpu/kernels/QuantizedOpKernels.cpp), the computation is vectorized while the scalar remainder is handled in a `for` loop. The remainder is also vectorized to improve performance in this PR. This kernel is for contiguous (NCHW) memory layout. For channels-last memory layout, a fast path is added in this PR https://github.com/pytorch/pytorch/pull/70520 The improvement is beneficial for layer norm, group norm and instance norm as this kernel is used for them. ## Changes 1. 
Add an argument `size` to `Vectorized::loadu()` for vec256_qint and vec512_qint. 2. Load the remainder with the new `loadu` and do computation in the similar way as for vectorized part. ## Validation ### Test method: Run quantized group norm with group = 2. Op CPU time measured by `torch.profiler.profile` with warmup = 20, active = 200 ### Common environment: - Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz - OS: CentOS Linux 7 (Core) (x86_64) - Python version: 3.7.10 - Use JeMalloc memory allocator - MALLOC_CONF=oversize_threshold:1,background_thread:true,metadata_thp:auto - Using Intel OpenMP - KMP_AFFINITY=granularity=fine,compact,1,0 - KMP_BLOCKTIME=1 ### Case 1: AVX2 **Environment** - GCC version: (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3) - AVX2 enabled, AVX512 disabled, i.e., vec256 used **Run a single instance on a single core** Shape | New impl (us) | Old impl (us) | Fp32 (us) | New/old | New/fp32 | Comments -- | -- | -- | -- | -- | -- | -- (1, 2, 8, 5) | 3.73 | 3.75 | 4.51 | 99.41% | 82.75% | Remainder size = 8 (1, 2, 8, 6) | 3.76 | 4.00 | 4.53 | 93.93% | 82.95% | Remainder size = 16 (1, 2, 8, 7) | 3.74 | 4.01 | 4.52 | 93.34% | 82.84% | Remainder size = 24 (1, 2, 8, 8) | 3.90 | 3.96 | 4.49 | 98.49% | 87.00% | No remainder (1, 2, 8, 17) | 4.00 | 4.17 | 4.72 | 95.83% | 84.69% | Remainder size = 8 (1, 2, 8, 18) | 4.00 | 4.23 | 4.72 | 94.54% | 84.89% | Remainder size = 16 (1, 2, 8, 19) | 4.03 | 4.29 | 4.76 | 94.01% | 84.70% | Remainder size = 24 (1, 2, 8, 20) | 3.92 | 3.93 | 4.76 | 99.67% | 82.29% | No remainder (1, 2, 8, 33) | 4.10 | 4.18 | 5.06 | 97.92% | 81.00% | Remainder size = 8 (1, 2, 8, 34) | 4.07 | 4.23 | 5.06 | 96.40% | 80.53% | Remainder size = 16 (1, 2, 8, 35) | 4.11 | 4.42 | 5.09 | 93.03% | 80.72% | Remainder size = 24 (1, 2, 8, 36) | 4.03 | 4.06 | 5.11 | 99.24% | 78.83% | No remainder ![image](https://user-images.githubusercontent.com/12522207/173979129-e393e13f-71f5-4987-95ea-ac6e0c895bd7.png) **Run a single instance on two cores** Shape | New impl (us) | Old impl (us) | Fp32 (us) | New/old | New/fp32 | Comments -- | -- | -- | -- | -- | -- | -- (1, 4, 8, 5) | 5.09 | 5.24 | 5.52 | 97.17% | 92.29% | Remainder size = 8 (1, 4, 8, 6) | 5.22 | 5.50 | 5.56 | 94.95% | 93.86% | Remainder size = 16 (1, 4, 8, 7) | 5.04 | 5.60 | 5.51 | 89.97% | 91.44% | Remainder size = 24 (1, 4, 8, 8) | 5.30 | 5.29 | 5.56 | 100.23% | 95.27% | No remainder (1, 4, 8, 17) | 5.36 | 5.56 | 6.05 | 96.53% | 88.69% | Remainder size = 8 (1, 4, 8, 18) | 5.48 | 5.71 | 6.25 | 95.99% | 87.67% | Remainder size = 16 (1, 4, 8, 19) | 5.44 | 5.81 | 6.25 | 93.65% | 87.11% | Remainder size = 24 (1, 4, 8, 20) | 5.43 | 5.34 | 6.07 | 101.76% | 89.43% | No remainder (1, 4, 8, 33) | 5.52 | 5.58 | 6.51 | 98.89% | 84.75% | Remainder size = 8 (1, 4, 8, 34) | 5.50 | 5.71 | 6.63 | 96.22% | 82.95% | Remainder size = 16 (1, 4, 8, 35) | 5.50 | 6.16 | 6.40 | 89.33% | 85.95% | Remainder size = 24 (1, 4, 8, 36) | 5.37 | 5.48 | 6.54 | 97.94% | 81.98% | No remainder ![image](https://user-images.githubusercontent.com/12522207/173981377-6222e278-0948-4f52-809b-28899399ca65.png) ### Case 2: AVX512 **Environment** - GCC version: (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2) - AVX512 enabled, i.e., vec512 used **Run a single instance on a single core** Shape | New impl (us) | Old impl (us) | Fp32 (us) | New/old | New/fp32 | Comments -- | -- | -- | -- | -- | -- | -- (1, 2, 16, 5) | 3.66 | 3.94 | 4.52 | 92.79% | 80.93% | Remainder size = 16 (1, 2, 16, 6) | 3.77 | 4.28 | 4.60 | 88.15% | 81.90% | Remainder size = 32 (1, 2, 16, 7) | 3.85 | 4.41 | 4.57 | 
87.36% | 84.20% | Remainder size = 48 (1, 2, 16, 8) | 3.70 | 3.76 | 4.62 | 98.62% | 80.10% | No remainder (1, 2, 16, 17) | 3.91 | 4.06 | 4.97 | 96.43% | 78.71% | Remainder size = 16 (1, 2, 16, 18) | 3.82 | 4.34 | 5.01 | 88.19% | 76.30% | Remainder size = 32 (1, 2, 16, 19) | 3.86 | 4.56 | 5.05 | 84.63% | 76.28% | Remainder size = 48 (1, 2, 16, 20) | 3.80 | 3.87 | 5.08 | 98.14% | 74.73% | No remainder (1, 2, 16, 33) | 3.89 | 4.23 | 5.65 | 91.94% | 68.85% | Remainder size = 16 (1, 2, 16, 34) | 3.91 | 4.46 | 5.70 | 87.68% | 68.61% | Remainder size = 32 (1, 2, 16, 35) | 4.04 | 4.68 | 5.72 | 86.44% | 70.64% | Remainder size = 48 (1, 2, 16, 36) | 4.00 | 3.99 | 5.71 | 100.28% | 69.96% | No remainder ![image](https://user-images.githubusercontent.com/12522207/173982490-4687c5bc-50e8-49aa-9fe2-7967c738dbfb.png) **Run a single instance on two cores** Shape | New impl (us) | Old impl (us) | Fp32 (us) | New/old | New/fp32 | Comments -- | -- | -- | -- | -- | -- | -- (1, 4, 16, 5) | 5.43 | 5.53 | 5.92 | 98.12% | 91.60% | Remainder size = 16 (1, 4, 16, 6) | 5.35 | 5.85 | 6.05 | 91.53% | 88.54% | Remainder size = 32 (1, 4, 16, 7) | 5.31 | 6.04 | 6.18 | 87.97% | 85.93% | Remainder size = 48 (1, 4, 16, 8) | 5.30 | 5.27 | 6.30 | 100.66% | 84.16% | No remainder (1, 4, 16, 17) | 5.47 | 5.67 | 6.48 | 96.51% | 84.45% | Remainder size = 16 (1, 4, 16, 18) | 5.53 | 5.86 | 6.59 | 94.28% | 83.78% | Remainder size = 32 (1, 4, 16, 19) | 5.48 | 6.13 | 6.57 | 89.39% | 83.38% | Remainder size = 48 (1, 4, 16, 20) | 5.35 | 5.31 | 6.95 | 100.79% | 76.91% | No remainder (1, 4, 16, 33) | 5.62 | 5.77 | 7.31 | 97.28% | 76.80% | Remainder size = 16 (1, 4, 16, 34) | 5.56 | 5.85 | 7.06 | 95.03% | 78.71% | Remainder size = 32 (1, 4, 16, 35) | 5.67 | 6.10 | 7.09 | 93.03% | 79.98% | Remainder size = 48 (1, 4, 16, 36) | 5.50 | 5.39 | 7.20 | 102.15% | 76.42% | No remainder ![image](https://user-images.githubusercontent.com/12522207/173982748-5f003630-18a4-4c3d-a643-b8711892cc39.png) Pull Request resolved: https://github.com/pytorch/pytorch/pull/79673 Approved by: https://github.com/jerryzh168 * Increase timeout for linux binary builds (#84008) Increase timeout for linux binary builds This mitigates conda build issue: https://github.com/pytorch/pytorch/issues/84003 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84008 Approved by: https://github.com/malfet * [NVFuser] Upstream push 0811 (#83239) Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/ Code changes includes: - codegen improvements: 1. double support in expression evaluator - bug fixes: 1. dropout fix - rework RNG to support broadcasted dropout (Fixes #82784) 2. expand fix - Patch expand+reduction, expand+view, rework view analysis and guard - scheduler: 1. manual transpose schedule example 2. 
WIP transpose scheduler Commits that's in this PR from the devel branch: ``` b7435afcd22c917713c2f41a7237bc26e1183f14 Transpose scheduler, step 1 (#1854) 8a45dbf72034684eb8e18b1835b533e90b68f184 Add an example on how to manually schedule transpose (#1889) 83dbf56a9554b2efbd5416461d938fff477b0b27 Patch dropout fix (#1898) 69d3519a532250719b1aa8341b50e067b181b42d Expand+Reduction, Expand+View support, rework View analysis and guards (#1883) 15091c488e96343bdc49e3990acbf238a3b3da51 Rework RNG to correctly support broadcasted dropout (#1888) aafe2d048aaac596e503596a41303423619f3954 Make ExpressionEvaluator support Double (#1885) ``` RUN_TORCHBENCH: nvfuser Differential Revision: [D38657074](https://our.internmc.facebook.com/intern/diff/D38657074) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83239 Approved by: https://github.com/davidberard98 * [TorchTidy] Adding support for unique tensor identifiers (#80266) Pull Request resolved: https://github.com/pytorch/pytorch/pull/80266 Approved by: https://github.com/robieta * fix oneDNN channels_last path issue (#83653) Fix #82060(N>1 will call in OneDNN path) and #80837, those two issues are introduced by the definition of channels last is different between PyTorch FW side with ideep side, this PR will fix this gap which ideep will use the format flag given by FW side. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83653 Approved by: https://github.com/mingfeima, https://github.com/malfet * [caffe2] Remove last clang-for-cuda sources (#84021) Summary: We're no longer pursuing clang-for-cuda, so remove the last use-case. Test Plan: CI Reviewed By: pallab-zz Differential Revision: D38996710 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84021 Approved by: https://github.com/malfet * Revert "Support NCCL Premul Sum (#81272)" This reverts commit 432c508e71111f9d5382322e0e6b1bc1c66bf0ec. Reverted https://github.com/pytorch/pytorch/pull/81272 on behalf of https://github.com/weiwangmeta due to breaking internal builds * Revert "[TorchTidy] Adding support for unique tensor identifiers (#80266)" This reverts commit b6ba41921daf6365a762562641bfd846437c8529. Reverted https://github.com/pytorch/pytorch/pull/80266 on behalf of https://github.com/malfet due to Broke number of trunk jobs, see https://hud.pytorch.org/pytorch/pytorch/commit/b6ba41921daf6365a762562641bfd846437c8529 * NCCL: Re-enable parallel builds (#83696) Since #83173 was merged I have noticed some CI being slowed down by the nccl building step. e.g. if there are no C++ changes then sccache compiles everything else very quickly and nccl becomes the limiting factor. This re-enables parallel builds with some safeguards to protect against oversubscription. When `make` is the parent build system, we can use `$(MAKE)` and the `make` jobserver will coordinate job allocation with the sub-process. For other build systems, this calls `make` with the `-l` flag which should prevent it launching jobs when the system load average is already too high. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83696 Approved by: https://github.com/malfet * [fx+scripting] Adding num_iter_1 and num_iter_2 params LearningRate op (#83691) Summary: Adding num_iter_1 and num_iter_2 to learning rate op Test Plan: Exisiting unit tests Differential Revision: D38762710 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83691 Approved by: https://github.com/qxy11 * Fix dumb make_fx issue (#84011) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84011 Approved by: https://github.com/ezyang * [fx] add deferred weights (xl_weight) and tracing for xl_embedding_bag (#84016) Test Plan: added unit tests Reviewed By: jfix71 Differential Revision: D36152238 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84016 Approved by: https://github.com/jfix71 * Enable cache action for lint workflow (#84026) Cache all python dependencies using [GHA cache](https://docs.github.com/en/actions/using-workflows/caching-dependencies-to-speed-up-workflows). I'm doing this for lint workflow first and will slowly roll it out to other workflows. ### Testing Before caching, pip cache is not found. Dependencies installation continues as usual: ![Screen Shot 2022-08-24 at 16 36 15](https://user-images.githubusercontent.com/475357/186543554-9d7f5978-2c2d-4362-9535-c3b17e922da1.png) After caching https://github.com/pytorch/pytorch/runs/8006214772?check_suite_focus=true. The long hash at the end of the cache key is the hash of requirements files ![Screen Shot 2022-08-24 at 16 51 51](https://user-images.githubusercontent.com/475357/186543825-055ea025-3d42-42fc-877d-baec358de0ed.png) Note that the cache is in the runners themselves. This should be a transparent process. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84026 Approved by: https://github.com/seemethere, https://github.com/suo, https://github.com/malfet * switching the exact check to isinstance check (#84023) Simplifying a type check if an object is a SymIntNode in `is_symint_node` Pull Request resolved: https://github.com/pytorch/pytorch/pull/84023 Approved by: https://github.com/ezyang * Disable autocast cache during aotdispatch (#84035) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84035 Approved by: https://github.com/jansel * Make linalg.inv composite of linalg.solve (#80074) The `getri` kernel calls inside `getrs` so we can do so explicitly ourselves and save ourselves from having to maintain an extra kernel. This way we just need to optimise `lu_factor` and `lu_solve` and `inv` will be as efficient as it can be, as it'll be choosing the best backend to perform the factorisation and the best backend (not necessarily the same) to perform the solve. 
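A numerical sketch of the equivalence being exploited here (illustrative only, not the PR's actual code path): the inverse is just a solve against the identity, routed through an explicit LU factorization.

```python
import torch

A = torch.randn(4, 4, dtype=torch.float64)

# inv(A) expressed as a solve against the identity via lu_factor / lu_solve.
LU, pivots = torch.linalg.lu_factor(A)
inv_via_solve = torch.linalg.lu_solve(LU, pivots, torch.eye(4, dtype=torch.float64))

torch.testing.assert_close(inv_via_solve, torch.linalg.inv(A))
```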
Fixes https://github.com/pytorch/pytorch/issues/77498 The benchmarks: https://github.com/pytorch/pytorch/pull/80074#issuecomment-1164309071 Pull Request resolved: https://github.com/pytorch/pytorch/pull/80074 Approved by: https://github.com/IvanYashchuk, https://github.com/albanD, https://github.com/malfet * add qscheme check for quantization observer (#80126) Motivation: each quantization observer only supports a limit qschemes, we need to do this check at the initiation step, rather than at the running step, such as MinMaxObserver with set qscheme with **torch.per_channel_affine**, there will have a runtime error at the running the calibration step: ``` AttributeError: 'MinMaxObserver' object has no attribute 'ch_axis' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/80126 Approved by: https://github.com/jerryzh168 * [functorch] add batching rule for fill_.Tensor (#84015) I think this is what the theseus folks ran into, but will confirm with them later. Test Plan: - new manual test; the OpInfo for fill_ isn't sufficient and it is difficult to modify Pull Request resolved: https://github.com/pytorch/pytorch/pull/84015 Approved by: https://github.com/Chillee * fix `NoneType` object has no attribute `python_exit_status` (#83985) Fixes #83791 Prevents the Error when `_utils` has been cleared by Python before `__del__` is invoked. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83985 Approved by: https://github.com/NivekT * Decomposition - batch_norm, save_mean and save_variance always float32 (#84013) AMP error shown here - https://github.com/pytorch/torchdynamo/issues/835 Test missing Pull Request resolved: https://github.com/pytorch/pytorch/pull/84013 Approved by: https://github.com/ezyang * enable qlinear dynamic parallelization with fbgemm (#84033) Test Plan: CI Differential Revision: D39004891 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84033 Approved by: https://github.com/jerryzh168 * [quant][ao_migration] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules` (#78713) Context: In order to avoid the cluttering of the `torch.nn` namespace the quantized modules namespace is moved to `torch.ao.nn`. The list of the `nn.quantized` files that are being migrated: - [ ] `torch.nn.quantized` → `torch.ao.nn.quantized` - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional` - [X] [Current PR] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules` - [ ] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic` - [ ] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference` - [ ] `torch.nn.quantizable` → `torch.ao.nn.quantizable` - [ ] `torch.nn.qat` → `torch.ao.nn.qat` - [ ] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules` - [ ] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic` - [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic` - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules` - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat` - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized` - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules` - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic` Majority of the files are just moved to the new location. 
However, specific files need to be double checked: - Documentation @vkuzo - docs/source/conf.py - docs/source/quantization.rst - [quantize_fx](torch/ao/quantization/quantize_fx.py) @jerryzh168 - [common test routine](test/quantization/ao_migration/common.py) @HDCharles - JIT stuff @jamesr66a - torch/csrc/jit/passes/hoist_conv_packed_params.cpp - torch/csrc/jit/passes/quantization/helper.h - torch/csrc/jit/serialization/import_source.cpp Differential Revision: [D38926012](https://our.internmc.facebook.com/intern/diff/D38926012/) Differential Revision: [D38926012](https://our.internmc.facebook.com/intern/diff/D38926012) Pull Request resolved: https://github.com/pytorch/pytorch/pull/78713 Approved by: https://github.com/jerryzh168 * [quant][ao_migration] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic` (#78714) Context: In order to avoid the cluttering of the `torch.nn` namespace the quantized modules namespace is moved to `torch.ao.nn`. The list of the `nn.quantized` files that are being migrated: - [ ] `torch.nn.quantized` → `torch.ao.nn.quantized` - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional` - [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules` - [X] [Current PR] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic` - [ ] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference` - [ ] `torch.nn.quantizable` → `torch.ao.nn.quantizable` - [ ] `torch.nn.qat` → `torch.ao.nn.qat` - [ ] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules` - [ ] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic` - [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic` - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules` - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat` - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized` - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules` - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic` Majority of the files are just moved to the new location. However, specific files need to be double checked: - [Documentation](docs/source/quantization-support.rst) @vkuzo - [Public API test list](test/allowlist_for_publicAPI.json) @peterbell10 - [BC test](test/quantization/bc/test_backward_compatibility.py) @vkuzo - [IR emitter](torch/csrc/jit/frontend/ir_emitter.cpp) @jamesr66a - [JIT serialization](torch/csrc/jit/serialization/import_source.cpp) @IvanKobzarev @jamesr66a Differential Revision: [D36860660](https://our.internmc.facebook.com/intern/diff/D36860660/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36860660/)! Differential Revision: [D36860660](https://our.internmc.facebook.com/intern/diff/D36860660) Pull Request resolved: https://github.com/pytorch/pytorch/pull/78714 Approved by: https://github.com/jerryzh168 * [quant][ao_migration] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference` (#78715) Context: In order to avoid the cluttering of the `torch.nn` namespace the quantized modules namespace is moved to `torch.ao.nn`. 
The list of the `nn.quantized` files that are being migrated: - [ ] `torch.nn.quantized` → `torch.ao.nn.quantized` - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional` - [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules` - [X] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic` - [X] [Current PR] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference` - [ ] `torch.nn.quantizable` → `torch.ao.nn.quantizable` - [ ] `torch.nn.qat` → `torch.ao.nn.qat` - [ ] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules` - [ ] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic` - [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic` - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules` - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat` - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized` - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules` - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic` Majority of the files are just moved to the new location. However, specific files need to be double checked: - None Differential Revision: [D36860927](https://our.internmc.facebook.com/intern/diff/D36860927/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36860927/)! Differential Revision: [D36860927](https://our.internmc.facebook.com/intern/diff/D36860927) Pull Request resolved: https://github.com/pytorch/pytorch/pull/78715 Approved by: https://github.com/jerryzh168 * [quant][ao_migration] `torch.nn.quantizable` → `torch.ao.nn.quantizable`. (#78717) Context: In order to avoid the cluttering of the `torch.nn` namespace the quantized modules namespace is moved to `torch.ao.nn`. The list of the `nn.quantized` files that are being migrated: - [X] `torch.nn.quantized` → `torch.ao.nn.quantized` - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional` - [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules` - [X] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic` - [X] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference` - [X] [Current PR] `torch.nn.quantizable` → `torch.ao.nn.quantizable` - [ ] `torch.nn.qat` → `torch.ao.nn.qat` - [ ] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules` - [ ] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic` - [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic` - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules` - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat` - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized` - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules` - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic` Majority of the files are just moved to the new location. However, specific files need to be double checked: - `torch/ao/nn/__init__.py` → Changing the imports to lazy. Differential Revision: [D36861090](https://our.internmc.facebook.com/intern/diff/D36861090/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36861090/)! 
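To make the migration series above concrete, an import-level illustration (assuming a build that already ships the `torch.ao.nn` namespace; the old paths are kept as backward-compatible aliases during the transition):

```python
# Illustrative only: new code imports from torch.ao.nn, while the legacy
# torch.nn.quantized path keeps working during the transition.
import torch.ao.nn.quantized as new_ns
import torch.nn.quantized as old_ns

print(new_ns.Linear)
print(old_ns.Linear)  # expected to resolve to the same class through the BC alias
```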
Differential Revision: [D36861090](https://our.internmc.facebook.com/intern/diff/D36861090) Pull Request resolved: https://github.com/pytorch/pytorch/pull/78717 Approved by: https://github.com/jerryzh168 * [quant][ao_migration] `torch.nn.qat` → `torch.ao.nn.qat` (#78716) Context: In order to avoid the cluttering of the `torch.nn` namespace the quantized modules namespace is moved to `torch.ao.nn`. The list of the `nn.quantized` files that are being migrated: - [X] `torch.nn.quantized` → `torch.ao.nn.quantized` - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional` - [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules` - [X] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic` - [X] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference` - [X] `torch.nn.quantizable` → `torch.ao.nn.quantizable` - [X] [Current PR] `torch.nn.qat` → `torch.ao.nn.qat` - [X] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules` - [X] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic` - [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic` - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules` - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat` - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized` - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules` - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic` Majority of the files are just moved to the new location. However, specific files need to be double checked: - None Differential Revision: [D36861197](https://our.internmc.facebook.com/intern/diff/D36861197/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36861197/)! Differential Revision: [D36861197](https://our.internmc.facebook.com/intern/diff/D36861197) Pull Request resolved: https://github.com/pytorch/pytorch/pull/78716 Approved by: https://github.com/jerryzh168 * disable c10::SymIntNode tests on mobile (#84066) This fixes c++ tests' breaks where we were passing pointers and expected `is_symbolic` to return `true` Pull Request resolved: https://github.com/pytorch/pytorch/pull/84066 Approved by: https://github.com/albanD * [GHF][BE] Move merge rules to yaml (#84065) To allow comments Update `trymerge.yaml`, `revert.yaml` and `tryrebase.yaml` to use v4 setup-python action and install pyyaml Reformat json to yaml by running: ``` python -c "import yaml;print(yaml.dump(yaml.safe_load(open('.github/merge_rules.yaml')), sort_keys=False))" ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/84065 Approved by: https://github.com/b0noI, https://github.com/huydhn * run functorch decomps after functionalization when enabled (#83992) This is a short-to-midterm fix for https://github.com/pytorch/pytorch/issues/83923. By running functionalization before decomps, we guarantee that functionalization won't have to see any primtorch view/inplace ops like `broadcast_in_dim`. This will only really be a problem if there's a function in the decomposition table that decomposes a functional op into mutations. If that comes up later, we'll need to revisit https://github.com/pytorch/pytorch/issues/83923. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83992 Approved by: https://github.com/ezyang * functionalization: support inplace views on inputs (#83993) A version of this PR was sitting at https://github.com/pytorch/pytorch/pull/82601 but that PR some other cleanup that relies on being able to use functorch in pytorch/pytorch CI tests, which isn't ready yet. I pulled the change out here to unblock functionalization for some models run with inductor (see https://github.com/pytorch/torchdynamo/issues/964#issuecomment-1225971788). Pull Request resolved: https://github.com/pytorch/pytorch/pull/83993 Approved by: https://github.com/ezyang * [DataPipe] Reset Shuffler's iterator when NotStarted (#83535) This PR changes the behavior of `IterDataPipe` to always invoke `reset` for the state of `NotStarted`. The main reason is we normally put lazy initialization code into `reset` function. Even for the state of `NotStarted`, we should invoke `reset` to initialize those lazy variables. Otherwise, we have to manually determine if the state is `NotStarted` or `Iterating` in `__iter__` function and only manually invoke `reset` in the state of `NotStarted`. This PR also makes `Shuffler` is able to serialize with `buffer` and `rng_state`. The following part is removed: ~I am also add `_snapshot_state` into serialization state and during `__setstate__` only change the state to `Restored` if the original state is `Iterating`. Especially, for the case of deserializing/serializing `NotStarted` DataPipe (multiprocessing), we would invoke `set_seed` for `Shuffler`. We need the `DataPipe` remains as `NotStarted` to properly `reset`.~ I am listing all the expected behavior state transition below: - Initial state: `NotStarted` - `iter` -> Call `reset` and change the state to `Iterating` - serialize/deserialize -> Keep the state as `NotStarted` (will `reset` if `iter` is called afterwards) - Initial state: `Iterating` - `iter` -> Call `reset` and keep the state to `Iterating` - serialize/deserialize -> Change the state as `Restored` - Initial state: `Restored` - `iter` -> Only change the state to `Iterating` - serialize/deserialize -> Not allowed Pull Request resolved: https://github.com/pytorch/pytorch/pull/83535 Approved by: https://github.com/NivekT * [ONNX] Assign ONNXScopeName during function substituion (#82039) Previously only traced IR graph stores module typename and variable name in `scope` in `node`. This change enables such `scope` info for IR graph generated by torch script. Torch script produced IR graphs emit nodes for module object and module method call. This structured graph is flattened in `function_substition` pass prior to other ONNX conversion passes. This PR extends `function_substition` pass to record the module typename and variable name info in `scope`, while inlining the graph. Pull Request resolved: https://github.com/pytorch/pytorch/pull/82039 Approved by: https://github.com/justinchuby, https://github.com/abock * Torch cond operator, python dispatch, pyoperator (#83154) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/83154 Approved by: https://github.com/ezyang * [vulkan] use VMA at third-party (#83934) Remove the VMA checked in at `aten/src/ATen/native/vulkan/api/vk_mem_alloc.h`, and use the version checked into `fbsource/third_party` instead. Also change open source CMakeLists to look for VMA in third_party submodule directory. 
Note that I had to add an alternate VulkanMemoryAllocator target that uses `fb_xplat_cxx_library` instead of `oxx_static_library` to make it work with vulkan targets in `caffe2`. Before landing this diff, make sure https://github.com/pytorch/pytorch/pull/83906 is committed on open source, which adds VMA as a git submodule of pytorch. Differential Revision: [D38943217](https://our.internmc.facebook.com/intern/diff/D38943217/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D38943217/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/83934 Approved by: https://github.com/manuelcandales * [GHF] Land validation should not change default branch (#84084) This prevents a loophole, where somebody submits a PR that modifies merge rules and request land validation, so that their PR will be validated against those rules, rather than ones currently in trunk. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84084 Approved by: https://github.com/janeyx99, https://github.com/kit1980 * [ONNX] Add runtime type checking to `export` (#83673) This PR adds an internal wrapper on the [beartype](https://github.com/beartype/beartype) library to perform runtime type checking in `torch.onnx`. It uses beartype when it is found in the environment and is reduced to a no-op when beartype is not found. Setting the env var `TORCH_ONNX_EXPERIMENTAL_RUNTIME_TYPE_CHECK=ERRORS` will turn on the feature. setting `TORCH_ONNX_EXPERIMENTAL_RUNTIME_TYPE_CHECK=DISABLED` will disable all checks. When not set and `beartype` is installed, a warning message is emitted. Now when users call an api with invalid arguments e.g. ```python torch.onnx.export(conv, y, path, export_params=True, training=False) # traning should take TrainingModel, not bool ``` they get ``` Traceback (most recent call last): File "bisect_m1_error.py", line 63, in main() File "bisect_m1_error.py", line 59, in main reveal_error() File "bisect_m1_error.py", line 32, in reveal_error torch.onnx.export(conv, y, cpu_model_path, export_params=True, training=False) File "<@beartype(torch.onnx.utils.export) at 0x1281f5a60>", line 136, in export File "pytorch/venv/lib/python3.9/site-packages/beartype/_decor/_error/errormain.py", line 301, in raise_pep_call_exception raise exception_cls( # type: ignore[misc] beartype.roar.BeartypeCallHintParamViolation: @beartyped export() parameter training=False violates type hint , as False not instance of . ``` when `TORCH_ONNX_EXPERIMENTAL_RUNTIME_TYPE_CHECK` is not set and `beartype` is installed, a warning message is emitted. ``` >>> torch.onnx.export("foo", "bar", "f") :1: CallHintViolationWarning: Traceback (most recent call last): File "/home/justinchu/dev/pytorch/torch/onnx/_internal/_beartype.py", line 54, in _coerce_beartype_exceptions_to_warnings return beartyped(*args, **kwargs) File "<@beartype(torch.onnx.utils.export) at 0x7f1d4ab35280>", line 39, in export File "/home/justinchu/anaconda3/envs/pytorch/lib/python3.9/site-packages/beartype/_decor/_error/errormain.py", line 301, in raise_pep_call_exception raise exception_cls( # type: ignore[misc] beartype.roar.BeartypeCallHintParamViolation: @beartyped export() parameter model='foo' violates type hint typing.Union[torch.nn.modules.module.Module, torch.jit._script.ScriptModule, torch.jit.ScriptFunction], as 'foo' not , , or . 
Traceback (most recent call last): File "", line 1, in File "/home/justinchu/dev/pytorch/torch/onnx/_internal/_beartype.py", line 63, in _coerce_beartype_exceptions_to_warnings return func(*args, **kwargs) File "/home/justinchu/dev/pytorch/torch/onnx/utils.py", line 482, in export _export( File "/home/justinchu/dev/pytorch/torch/onnx/utils.py", line 1422, in _export with exporter_context(model, training, verbose): File "/home/justinchu/anaconda3/envs/pytorch/lib/python3.9/contextlib.py", line 119, in __enter__ return next(self.gen) File "/home/justinchu/dev/pytorch/torch/onnx/utils.py", line 177, in exporter_context with select_model_mode_for_export( File "/home/justinchu/anaconda3/envs/pytorch/lib/python3.9/contextlib.py", line 119, in __enter__ return next(self.gen) File "/home/justinchu/dev/pytorch/torch/onnx/utils.py", line 95, in select_model_mode_for_export originally_training = model.training AttributeError: 'str' object has no attribute 'training' ``` We see the error is caught right when the type mismatch happens, improving from what otherwise would become `AttributeError: 'str' object has no attribute 'training'` Pull Request resolved: https://github.com/pytorch/pytorch/pull/83673 Approved by: https://github.com/BowenBao * example program for paper intro (#83945) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83945 Approved by: https://github.com/jansel * New TORCH_UCC_BLOCKING_WAIT env variable (#81791) Cherry-pick of https://github.com/facebookresearch/torch_ucc/pull/95. I recommend waiting until https://github.com/pytorch/pytorch/pull/81583 is merged first, so the CI is checking if this PR compiles correctly. Marking this as a draft for now, will change to "ready for review" once https://github.com/pytorch/pytorch/pull/81583 merged. Pull Request resolved: https://github.com/pytorch/pytorch/pull/81791 Approved by: https://github.com/kwen2501 * Make graph_module.print_readable() discoverable (#83960) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83960 Approved by: https://github.com/ezyang * Fix FSDP not all outputs used in loss (#83195) There are a couple issues / assumptions within FSDP today that this PR attempts to fix: - In wait_for_post_backward, we assume that if a param required grad, its post backward was called, but this is not true, i.e. if its output did not participate in grad computation, it would not have called post backward. To fix this we simply removed those assertions. - There is a deeper issue where in `_finalize_params`, we could end up assigning a grad of the sharded shape to an unsharded parameter gradient field, which would raise a shape error. This can happen for example if a parameter's usage transitions from used --> unused. In this case, when the parameter was used, it would have had a gradient, then user could have possibly called `zero_grad()` and p.grad would not be `None`. This in `_prep_grad_for_backward`, we would assign a `_saved_grad_shard` to this gradient field which would be the sharded shape. In `_finalize_param`, our parameter would be unsharded (since post_backward was not called), but we'd try to assign, raising the shape issue. This issue is fixed by checking `_post_backward_called`. If this is False, we simply skip the assignment because there is no new gradient to update. - A final issue as mentioned above is that if post_backward is not called, we never reshard the full param. 
This is fixed by checking if we haven't resharded (basically if post_backward_called == False), and if so, performing a reshard. A few things to note:
- This logic may have to be revisited when non-recursive wrapping lands, as there are multiple FlatParams per FSDP unit
- This logic may not work when post_backward_hook fires but p.grad is None, i.e. the short-circuiting here: https://github.com/pytorch/pytorch/blob/f534b2c627da65bbee7ccc8f7e054da0ba48eb79/torch/distributed/fsdp/fully_sharded_data_parallel.py#L2884. As a quick fix, we could just move the `_post_backward_called` flag change to after this, or just perform a reshard before returning early. I am not sure how to repro a case where p.grad == None but we call the post-backward hook; https://github.com/pytorch/pytorch/issues/83197 might be a possibility, but I think it is fine to not support this yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83195 Approved by: https://github.com/awgu

* Silence namedtuple warning in dist (#84072)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84072 Approved by: https://github.com/awgu

* Don't introduce new overload for SymInt (#83628)

Previously, we introduced new SymInt overloads for every function we wanted. This led to a lot of boilerplate, and also a lot of confusion about how the overloads needed to be implemented. This PR takes a simpler but more risky approach: just take the original function and change its ints to SymInts.

This is BC-breaking in the following ways:
* The C++ API for registering implementations for aten operators will change from int64_t to SymInt whenever you make this change. Code-generated registrations in PyTorch do not change as codegen handles the translation automatically, but manual registrations will need to follow the change. Typically, if you now accept a SymInt where you previously only took int64_t, you have to convert it back manually. This will definitely break XLA, see companion PR https://github.com/pytorch/xla/pull/3914. Note that not all dispatch keys get the automatic translation; all the composite keys and Meta keys are modified to take SymInt directly (because they should handle them directly), and so there are adjustments for this.

This is not BC-breaking in the following ways:
* The user-facing C++ API remains compatible. Even if a function changes from int to SymInt, the default C++ binding still takes only ints (e.g., `at::empty(IntArrayRef, ...)`). To call with SymInts, you must call at::empty_symint instead (a minimal sketch of this calling convention appears at the end of this description). This involved adding two more signatures to CppSignatureGroup; in many cases I refactored code to iterate over all signatures in the group instead of hard-coding the two that previously existed.
* This is TorchScript compatible; internally we treat SymInts as ints so there is no change to what happens at runtime in TorchScript. In particular, it's OK to reference an empty schema by its old type (using int types); as long as you're not doing string equality (which you shouldn't be), these parse to the same underlying type.

Structure of the PR:
* The general strategy of this PR is that, even when you write `SymInt` inside `native_functions.yaml`, sometimes we will treat it *as if* it were an `int`. This idea pervades the codegen changes, where we have a translation from SymInt to c10::SymInt or int64_t, and this is controlled by a symint kwarg which I added and then audited all call sites to decide which I wanted.
Here are some of the major places where we pick one or the other: * The C++ FunctionSchema representation represents `SymInt` as `int`. There are a few places we do need to know that we actually have a SymInt and we consult `real_type()` to get the real type in this case. In particular: * When we do schema validation of C++ operator registration, we must compare against true schema (as the C++ API will provide `c10::SymInt`, and this will only be accepted if the schema is `SymInt`. This is handled with cloneWithRealTypes before we check for schema differences. * In `toIValue` argument parsing, we parse against the true schema value. For backwards compatibility reasons, I do still accept ints in many places where Layout/SymInt/etc were expected. (Well, accepting int where SymInt is expected is not BC, it's just the right logic!) * In particular, because SymInt never shows up as type() in FunctionSchema, this means that we no longer need a dedicated Tag::SymInt. This is good, because SymInts never show up in mobile anyway. * Changes to functorch/aten are mostly about tracking changes to the C++ API registration convention. Additionally, since SymInt overloads no longer exist, registrations for SymInt implementations are deleted. In many cases, the old implementations did not properly support SymInts; I did not add any new functionality with this PR, but I did try to annotate with TODOs where this is work to do. Finally, because the signature of `native::` API changed from int to SymInt, I need to find alternative APIs for people who were directly calling these functions to call. Typically, I insert a new dispatch call when perf doesn't matter, or use `at::compositeexplicitautograd` namespace to handle other caes. * The change to `make_boxed_from_unboxed_functor.h` is so that we accept a plain IntList IValue anywhere a SymIntList is expected; these are read-only arguments so covariant typing is OK. * I change how unboxing logic works slightly. Previously, we interpret the C++ type for Layout/etc directly as IntType JIT type, which works well because the incoming IValue is tagged as an integer. Now, we interpret the C++ type for Layout as its true type, e.g., LayoutType (change to `jit_type.h`), but then we accept an int IValue for it anyway. This makes it symmetric with SymInt, where we interpret the C++ type as SymIntType, and then accept SymInt and int IValues for it. * I renamed the `empty.names` overload to `empty_names` to make it less confusing (I kept mixing it up with the real empty overload) * I deleted the `empty.SymInt` overload, which ended up killing a pile of functions. (This was originally a separate PR but the profiler expect test was giving me grief so I folded it in.) * I deleted the LazyDynamicOpsTest tests. These were failing after these changes, and I couldn't figure out why they used to be passing: they make use of `narrow_copy` which didn't actually support SymInts; they were immediately converted to ints. * I bashed LTC into working. The patches made here are not the end of the story. The big problem is that SymInt translates into Value, but what if you have a list of SymInt? This cannot be conveniently represented in the IR today, since variadic Values are not supported. To work around this, I translate SymInt[] into plain int[] (this is fine for tests because LTC dynamic shapes never actually worked); but this will need to be fixed for proper LTC SymInt support. The LTC codegen also looked somewhat questionable; I added comments based on my code reading. 
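As a minimal sketch of the calling-convention point above (my own illustration, not code from this PR; it only assumes the `at::empty` / `at::empty_symint` pair described earlier):
```C++
#include <ATen/ATen.h>

// The default C++ binding is unchanged and still takes concrete ints.
at::Tensor concrete_empty() {
  return at::empty({16}, at::kFloat);
}

// Symbolic sizes go through the explicit *_symint entry point instead.
at::Tensor symbolic_empty(c10::SymInt n) {
  return at::empty_symint({n}, at::kFloat);
}
```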
Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/83628 Approved by: https://github.com/albanD, https://github.com/bdhirsh * Fix missing include for size_t (#84088) Fixes the following issue: ```C++ In file included from /home/gaoxiang/pytorch-ucc/c10/test/util/ConstexprCrc_test.cpp:1: In file included from /home/gaoxiang/pytorch-ucc/c10/util/ConstexprCrc.h:3: /home/gaoxiang/pytorch-ucc/c10/util/IdWrapper.h:42:10: error: unknown type name 'size_t'; did you mean 'std::size_t'? friend size_t hash_value(const concrete_type& v) { ^~~~~~ std::size_t /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/12.2.0/../../../../include/c++/12.2.0/x86_64-pc-linux-gnu/bits/c++config.h:298:26: note: 'std::size_t' declared here typedef __SIZE_TYPE__ size_t; ^ 1 error generated. [111/2069] Generating /home/gaoxiang/pytorch-ucc/torch/csrc/a...ch-ucc/torch/testing/_internal/generated/annotated_fn_args.py ninja: build stopped: subcommand failed. ``` This error happens with my GCC 12.2.0 + Clang 14.0.6. Full environment: ``` Collecting environment information... PyTorch version: 1.13.0a0+git14a53e6 Is debug build: True CUDA used to build PyTorch: 11.7 ROCM used to build PyTorch: N/A OS: Arch Linux (x86_64) GCC version: (GCC) 12.2.0 Clang version: 14.0.6 CMake version: version 3.24.1 Libc version: glibc-2.36 Python version: 3.10.6 (main, Aug 3 2022, 17:39:45) [GCC 12.1.1 20220730] (64-bit runtime) Python platform: Linux-5.19.3-arch1-1-x86_64-with-glibc2.36 Is CUDA available: True CUDA runtime version: 11.7.99 GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090 GPU 1: NVIDIA GeForce RTX 2080 Ti Nvidia driver version: 515.65.01 cuDNN version: Probably one of the following: /usr/lib/libcudnn.so.8.4.1 /usr/lib/libcudnn_adv_infer.so.8.4.1 /usr/lib/libcudnn_adv_train.so.8.4.1 /usr/lib/libcudnn_cnn_infer.so.8.4.1 /usr/lib/libcudnn_cnn_train.so.8.4.1 /usr/lib/libcudnn_ops_infer.so.8.4.1 /usr/lib/libcudnn_ops_train.so.8.4.1 HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True Versions of relevant libraries: [pip3] numpy==1.23.1 [pip3] torch==1.13.0a0+gitbcc6f6c [pip3] torch-ucc==1.0.0 [pip3] torchani==2.2 [pip3] torchvision==0.2.2.post3 [conda] Could not collect ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/84088 Approved by: https://github.com/ezyang * Fix small typo in cuda.rst (#84012) This fixes a very minor typo in the CUDA semantics doc. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84012 Approved by: https://github.com/malfet * Use size to check same tensor sizes in reduce_scatter and allgather (#84099) Summary: Previous code uses tensor.numel() to check if all tensors have the same size in order to switch between reduce_scatter_v v.s. reduce_scatter, same applies to allgather. However, if the user input tensor is zero in the last dimension (e.g., [648632,0]), then numel() returns zero and check_same_numel is always true. This patch fixes the check to use size rather than numel, to cover the above case. Differential Revision: D39044439 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84099 Approved by: https://github.com/kwen2501 * Separate kernel compilation API from kernel execution API (#1914) 1. Mostly mechanical changes to refactor some of KernelArgumentHolder in our stack instead of direct use of at::Tensor/IValue: Note: we are still holding a ref counted at::Tensor within kernel arg holder for tensor entries, simply because we want to forward it in case of aliased output. 
This is quite unsatisfying. But to properly strip framework Tensor from the codegen stack, we need quite some refactoring to abstract away the ownership of memory and allocator. That's for some future PRs.

2. Separate compilation from execution of kernels, currently using FusionExecutorCache::compileFusion and FusionExecutorCache::runFusionWithInputs. Note that the compilation API is still experimental. We currently kick off compilation into a separate thread. This part would need to be exposed & integrated into our python API.

TODO for follow-up PRs:
- trivial forwarding of input to outputs
- infer outputs should switch from meta tensor to fake tensor in order to preserve device
- segmented fusion should/could be compiled in parallel, since we can infer outputs without a compiled kernel.
- inputs_id_lookup should be refactored into KernelArgumentHolder, since we currently use args for passing inputs around.
- index mode is currently per fusion, which is not necessary and could be refactored to be per segmented fusion instead.
- bind kernel inputs should also try to bind cpu scalar with int type, since the runtime value can also be used in shape inference. Generally speaking, cpu scalar dtype should also be checked during validation.
- high water mark could be refactored into using the occupancy API after compilation, so we do not unnecessarily recompile when we don't have to.

* Use an unused variable (#84073)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84073 Approved by: https://github.com/huydhn

* Remove unreachable except block (#84070)

This was introduced because two PRs tried to fix an issue concurrently.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84070 Approved by: https://github.com/huydhn, https://github.com/janeyx99

* Upstream cherry pick fixes 0811 (#1934)

cherry-pick upstream CI fixes from #83067 & #83239

* [xla hash update] update the pinned xla hash (#84043)

This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned xla hash.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84043 Approved by: https://github.com/pytorchbot

* Made some minor cleanups to decompositions (#83814)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83814 Approved by: https://github.com/ngimel

* Fix preconditions of adaptive_avg_pooling2d (#84061)

Before, if the input had dimension `4`, the channel had to be of non-zero dimension. This was not what the errors advertised.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84061 Approved by: https://github.com/Chillee

* [composite compliance] cov, corrcoef (#82954)

Ref: #69991

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82954 Approved by: https://github.com/zou3519

* Enable -Wunused-local-typedefs (#83708)

I recently had a PR reverted because it triggered an unused-local-typedefs warning, so disabling these in the CMake build is counter-productive.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83708 Approved by: https://github.com/albanD

* Use C10_HAS_CPP_ATTRIBUTE to simplify nodiscard definition (#83976)

`C10_HAS_CPP_ATTRIBUTE` only expands to `__has_cpp_attribute` when it is defined, so we avoid the extra `#if defined(__has_cpp_attribute)` checks and double-nested `#if`s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83976 Approved by: https://github.com/albanD

* [functorch] add lstsq batch rule (#82325)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82325 Approved by: https://github.com/zou3519

* do not use deprecated functions (#1935)

* Map IterationDomains through view operations. (#1919)

* mac circleci workflows (#82780)

Add mac and ios workflows to circleci so they can be run on pull. m1 tests are not included because circleci doesn't have machines. Unsure how to get certain environment variables (specifically for arm64 ios builds that require env vars like `IOS_SIGN_KEY_2022` and `IOS_DEV_TEAM_ID`, which are stored in the org-member context that is not accessible by everyone). Doc regarding env vars: https://docs.google.com/document/d/1J_3Z9sfu2vlHMF1fjdJfeTuxPXC6dgqJs7aU0KpYSBU/edit#

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82780 Approved by: https://github.com/malfet, https://github.com/huydhn

* Add type hints to torch.save, torch.load (#83937)

I'll probably need help with this one. I'm not sure what the full type signature for `map_location` should be.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83937 Approved by: https://github.com/malfet, https://github.com/albanD

* Expose ProcessGroup::Work.wait() API to TorchScript (#83303)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83303 Approved by: https://github.com/rohan-varma

* Update proxy_tensor.py to support List input/output (#83302)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83302 Approved by: https://github.com/Chillee

* Make allreduce compatible with fx ProxyTensor (#84126)

land after #83122

This PR explores solutions for 2 issues:
1. Collective comm ops are inplace ops and do not return a tensor. With that, `make_fx` cannot include comm ops in the traced graph. The current solution is to make comm ops return a tuple of `(output_tensors, work_handle)`, so that [`proxy_call`](https://github.com/pytorch/pytorch/blob/90821aab100a436424113e2306eac63f5e247ee5/torch/fx/experimental/proxy_tensor.py#L170-L172) can handle that. It won't change the behavior of existing c10d Python/C++ APIs, so I directly added the code to `Ops.cpp`.
2. `make_fx` does not recognize `ProcessGroup::Work` and will ignore the `wait()` call on the work when tracing the graph. However, this might break correctness, as when running the traced function, it could consume a tensor before it's ready. The current solution is to create a `CommTensor` tensor subclass to explicitly call `wait()`. In this PR, I am only doing this in the test, as we will need more discussion to see if we can add this to c10d Python implementations.

kudos to @Chillee @wanchaol

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84126 Approved by: https://github.com/wanchaol

* Propagate permissive mapping information into indexing pass (#1929)

* [ONNX] Clean up patch functions (#83136)

Changes:
- Move namespace handling from `_new_node` to `_graph_op` for clarity
- Always require the `aten` namespace when creating aten ops.
Remove the `aten` argument supplied in `_aten_op` for clarity - Rename the `_ATTR_PATTERN` global - Improve types - Update `_add_attribute` to raise ValueErrors Pull Request resolved: https://github.com/pytorch/pytorch/pull/83136 Approved by: https://github.com/BowenBao * [Profiler][Minor] Extend Python bindings (#83622) Adding some fields which are needed for memory profiling. Differential Revision: [D38528382](https://our.internmc.facebook.com/intern/diff/D38528382/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83622 Approved by: https://github.com/Gamrix * [Profiler][Trivial] Add null handling to `AppendOnlyList::copy` memcpy path. (#83963) It is apparently undefined behavior to do pointer arithmetic on nullptr. In the case of AppendOnlyList, `next_` will only be null if `end_` is also null and thus the `memcpy` path will only be triggered if `n == 0`. Nonetheless, it is UB to `memcpy(0, 0, 0)` The extra null check is in a `C10_LIKELY` block so the extra cost should be negligible, and indeed after dusting off the component microbenchmarks there's no observable difference. Differential Revision: [D38969443](https://our.internmc.facebook.com/intern/diff/D38969443/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83963 Approved by: https://github.com/slgong-fb * Update Dynamo pin (#83829) As title Pull Request resolved: https://github.com/pytorch/pytorch/pull/83829 Approved by: https://github.com/ezyang * make job pass even if monitoring script fails (#84068) makes github slightly less confusing to look at when a test fails Pull Request resolved: https://github.com/pytorch/pytorch/pull/84068 Approved by: https://github.com/huydhn, https://github.com/malfet * [ONNX] Export node and value with scope name (#82040) Introduce `_jit_pass_onnx_assign_node_and_value_names` to parse and assign scoped name for nodes and values in exported onnx graph. Module layer information is obtained from `ONNXScopeName` captured in `scope` attribute in nodes. For nodes, the processed onnx node name are stored in attribute `onnx_name`. For values, the processed onnx output name are stored as `debugName`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/82040 Approved by: https://github.com/AllenTiTaiWang, https://github.com/justinchuby, https://github.com/abock * Add support to traverse all python collection objects (#84079) Fixes https://github.com/pytorch/data/issues/752 This PR makes `traverse` function supporting more collections data structures from Python. Please let me know if anyone has a better idea about how to elegantly check if the object is a collection then we can dive into this object to see wether there is any DataPipe wrapped. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84079 Approved by: https://github.com/NivekT * Read via FileAdapter when loading files in torch if not flatbuffer (#84028) Summary: This will optimize memory usage at the small cost of loading time when loading mobile models restoring the behavior before D36926217 (https://github.com/pytorch/pytorch/commit/fed12ff680813c0fab7dba7232f6b4cd8b33b8d3). Test Plan: Signals Reviewed By: qihqi Differential Revision: D38998858 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84028 Approved by: https://github.com/qihqi, https://github.com/cccclai * Enable cache action for windows and other minor workflows (#84093) Following up on https://github.com/pytorch/pytorch/pull/84026, these are the rest of pip dependencies that I can find. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84093 Approved by: https://github.com/malfet

* [Nested Tensor] do not use at::cuda::getDefaultCUDAStream() (#84134)

Use at::cuda::getCurrentCUDAStream(), not getDefaultCUDAStream(). Otherwise, add/remove padding kernels won't sync with the current stream, resulting in flaky unit tests in test_nestedtensor.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84134 Approved by: https://github.com/drisspg

* Fix a bug (#1936)

A bit uncomfortable not using an initialization list to initialize value_, but can't think of any other way to work around the c10::variant deprecation problem.

* [fx][pass] Fix type of exception (#84094)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84094 Approved by: https://github.com/SherlockNoMad

* [Profiler][Trivial] Cleanup ExperimentalConfig (#83890)

I'm trying to limit how much is in headers to make it easier to read the API surface. In a similar vein, we can replace `hasOptions` with `operator bool` so it just does the right thing in the check.

Differential Revision: [D38917366](https://our.internmc.facebook.com/intern/diff/D38917366/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83890 Approved by: https://github.com/slgong-fb

* [Profiler] Add `disabled` and `global` methods to ProfilerConfig. (#83891)

`ProfilerState::Disabled` and `ProfilerState::KINETO_ONDEMAND` have special semantics. The former is somewhat intuitive, but the degree of behavior branching on the latter (and why the branching is necessary) is less clear. By factoring the enum checks into methods, we can both clarify intent and future-proof in case we ever add other global profiling contexts.

Differential Revision: [D38917980](https://our.internmc.facebook.com/intern/diff/D38917980/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83891 Approved by: https://github.com/slgong-fb

* [DataPipe] Convert MapDataPipe.shuffle to IterDataPipe (#83202)

Fixes: https://github.com/pytorch/data/issues/718

This is an alternative PR against https://github.com/pytorch/pytorch/pull/82974

This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle`
- Lazily generating seed per iteration
- Each iterator has a new seed
- Convert `MapDataPipe.shuffle` to an `IterDataPipe`

## BC-breaking Note:

This PR changes the return type of `MapDataPipe.shuffle` from a `MapDataPipe` to an `IterDataPipe`.

### 1.12 Output as `MapDataPipe`
```
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
True
>>> isinstance(dp, IterDataPipe)
False
```

### This PR: Output as `IterDataPipe`
```
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
False
>>> isinstance(dp, IterDataPipe)
True
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83202 Approved by: https://github.com/NivekT

* [Prim] Implement group_norm_backward (#84037)

Test plan: CI, i.e.
`python3 test_decomp.py -v -k test_comprehensive_nn_functional_group_norm` plus:
```
#!/usr/bin/env python3.8
import torch

func = torch.ops.aten.native_group_norm_backward.default
decomp = torch._decomp.decomposition_table[func]
for args in (
        (torch.rand(1, 6, 3), torch.rand(1, 6, 3), torch.rand(1, 2), torch.rand(1, 2), torch.rand(6), 1, 6, 3, 2, [True, True, True]),
        (torch.rand(64, 768, 7, 7), torch.rand(64, 768, 7, 7), torch.rand(64, 1), torch.rand(64, 1), torch.rand(768), 64, 768, 49, 1, [True, True, True])):
    nrc = func(*args)
    drc = decomp(*args)
    for i in range(len(nrc)):
        print(i, torch.max(nrc[i] - drc[i]))
    print(all(torch.allclose(x, y) for (x, y) in zip(nrc, drc)))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84037 Approved by: https://github.com/Chillee, https://github.com/ngimel

* Revert "[xla hash update] update the pinned xla hash (#84043)"

This reverts commit ddedc294fbb4c13170811442b590a18e950dae67.

Reverted https://github.com/pytorch/pytorch/pull/84043 on behalf of https://github.com/malfet due to Depends on https://github.com/pytorch/pytorch/pull/83628

* Revert "Don't introduce new overload for SymInt (#83628)"

This reverts commit 9790d90e4b0288796ab44a6b4979db0a67580ba8.

Reverted https://github.com/pytorch/pytorch/pull/83628 on behalf of https://github.com/malfet due to Breaks internal builds, see D39076487

* [AOT Autograd] Redirect named_parameters to original mod (#84157)

Helps in comparing accuracy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84157 Approved by: https://github.com/Chillee

* [Nested Tensor] detach (#84078)

## Summary

Add detach op for nested tensors. Nested tensors are not part of the composite explicit dispatch key set and therefore need to be added manually.

The Detach test is failing only for dtype=torch.float32, torch.float16 and device=cuda. The chain of ops called is sum.backward() -> from_padded() -> unbind(). This populates the grad for a and b. Does this potentially indicate that the cuda implementation for one of these ops, likely from_padded(), is incorrect?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84078 Approved by: https://github.com/albanD

* Enforce explicit ProcessGroup passed into DefaultState (#84105)

Would prefer to enforce that users pass an explicit PG into these state objects when using comm hooks with FSDP, so that it is clear and easily debuggable which processes communication is taking place over.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84105 Approved by: https://github.com/mrshenli, https://github.com/zhaojuanmao

* _to_copy decomp (#84108)

Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84108 Approved by: https://github.com/Chillee

* [ONNX] Fix type annotations and enable type checking for all apis (#84091)

Enable runtime type checking for all torch.onnx public apis, symbolic functions and most helpers (minus two that do not have a checkable type: `_.JitType` does not exist) by adding the beartype decorator. Fix type annotations to make unit tests green.

Profile: export `torchvision.models.alexnet(pretrained=True)`
```
with runtime type checking: 21.314 / 10 passes
without runtime type checking: 20.797 / 10 passes
+ 2.48%
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84091 Approved by: https://github.com/BowenBao

* Add nvprims.var_mean (#83508)

This PR adds the nvfuser-specific primitive `var_mean`.
Interpretation `torch.var_mean` -> `torch.ops.nvprims.var_mean` is handled by `TorchRefsNvfuserCapabilityMode` context manager. I moved some helper code from `_prims/__init__.py` to `_prims_common`. Correctness is tested with OpInfo tests (see `PythonRefInfo("ops.nvprims.var_mean"`). Layer norm reference now uses `torch.var_mean` instead of `torch._refs.var_mean` to allow interception. Here's a simple comparison of performance with this PR and master (on 3080ti): ```py import torch from torch._prims.context import TorchRefsNvfuserCapabilityMode from torch.fx.experimental.proxy_tensor import make_fx from torch._prims.executor import execute def func(a): return torch.native_layer_norm(a, (1024,), None, None, 1e-6) a = torch.randn(10, 512, 1024, dtype=torch.float16, device="cuda") with TorchRefsNvfuserCapabilityMode(): gm = make_fx(func)(a) for _ in range(10): execute(gm, a, executor="strictly_nvfuser"); ``` run with `PYTORCH_NVFUSER_DUMP=dump_eff_bandwidth python script.py` ```py # WITH THIS PR # kernel1 run in 0.032768 ms, achieved: 641.25 GB/s # kernel1 run in 0.033792 ms, achieved: 621.818 GB/s # kernel1 run in 0.032768 ms, achieved: 641.25 GB/s # kernel1 run in 0.032608 ms, achieved: 644.396 GB/s # kernel1 run in 0.031744 ms, achieved: 661.935 GB/s # kernel1 run in 0.031744 ms, achieved: 661.935 GB/s # kernel1 run in 0.032768 ms, achieved: 641.25 GB/s # kernel1 run in 0.03072 ms, achieved: 684 GB/s # kernel1 run in 0.031744 ms, achieved: 661.935 GB/s # kernel1 run in 0.031744 ms, achieved: 661.935 GB/s # ON MASTER # kernel1 run in 0.05632 ms, achieved: 373.091 GB/s # kernel1 run in 0.044032 ms, achieved: 477.209 GB/s # kernel1 run in 0.044032 ms, achieved: 477.209 GB/s # kernel1 run in 0.044032 ms, achieved: 477.209 GB/s # kernel1 run in 0.043808 ms, achieved: 479.649 GB/s # kernel1 run in 0.043008 ms, achieved: 488.571 GB/s # kernel1 run in 0.044032 ms, achieved: 477.209 GB/s # kernel1 run in 0.043008 ms, achieved: 488.571 GB/s # kernel1 run in 0.043008 ms, achieved: 488.571 GB/s # kernel1 run in 0.043008 ms, achieved: 488.571 GB/s ``` So this PR gives about 35% improvement in performance using nvfuser executor with this specific normalized shape. Also this PR fixes https://github.com/pytorch/pytorch/issues/83506 (see the change in `torch/csrc/jit/python/pybind_utils.cpp`). Ref. https://github.com/pytorch/pytorch/issues/80187 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83508 Approved by: https://github.com/ngimel * [xla hash update] update the pinned xla hash (#84164) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned xla hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84164 Approved by: https://github.com/pytorchbot * Revert "Make allreduce compatible with fx ProxyTensor (#84126)" This reverts commit ec5b83f76847584013a9cd4177d389a408033614. Reverted https://github.com/pytorch/pytorch/pull/84126 on behalf of https://github.com/malfet due to Likely broke multigpu periodic jobs, see https://github.com/pytorch/pytorch/runs/8044611438?check_suite_focus=true * Fix softmax bwd sizes. (#1890) * Test `rand` in a fusion with zero tensor input (#1932) * Improve trivial reduction merge support (#1931) * Double support on all expression evaluators (#1937) * arange support (#1933) * Replace assertEqualIgnoreTypes from common_methods_invocations.py (#84076) This addresses TODO:38095 . 
More details at https://github.com/pytorch/pytorch/issues/38095

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84076 Approved by: https://github.com/kit1980

* Nvfuser to copy decomp to prim (#83782)

Conditionally decompose aten::_to_copy to nvprim::convert_element_type to allow fusion with type casting, which is introduced during the type promotion phase of torch decomposition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83782 Approved by: https://github.com/ngimel

* Tensor factories must set the output shape as its input (#1939)

* Revert "[xla hash update] update the pinned xla hash (#84164)"

This reverts commit c032b097e315177af5bc867eeee5452b7df32952.

Reverted https://github.com/pytorch/pytorch/pull/84164 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally

* Revert "Add nvprims.var_mean (#83508)"

This reverts commit 7e7694b6615fbf46abfab234615fa891c2819eb7.

Reverted https://github.com/pytorch/pytorch/pull/83508 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally

* Revert "[ONNX] Fix type annotations and enable type checking for all apis (#84091)"

This reverts commit 6446da17305960088dfae501d5c7358af068fa81.

Reverted https://github.com/pytorch/pytorch/pull/84091 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally

* Fix arange when step is negative (#1942)

* The device version of ceilDiv assumes positive inputs, so when step is negative, it gives an incorrect result. For example, I see FusionStandAloneArange results in a write error with compute-sanitizer when start = 0, stop = -1, step = -1.5 and dtype = kLong.

* Add full, full_like, zeros, zeros_like, ones, ones_like (#1943)

* Move detection of self mapping IDs to IterDomainGraph from (#1941)

* test the groups the same order as they are merged (#1949)

* Exclude unsupported data types (#1951)

* Exclude unsupported data types

* Some indexing cleanups, Add eye support (#1940)

* Fix detection of unmappable root domains (#1952)

ComputeAtRootDomainMap flags domains that should not be mapped due to reductions. Previously, checking if a domain potentially causes an invalid mapping was only done with one domain in each group of domains that were found to be mappable so far. That's not actually sufficient, as the unmappable domain set is created just once with no root mapping information. The fix is to check all consumer domains of a producer tensor. Another small fix is also done to address a different problem discovered after the first fix.

* Fill allocation with nan on tests (#1956)

* TVDomainGuard factory (#1953)

* Some cleanup (#1957)

* Remove unused variables (#1955)

* Improve the comments at the beginning of index_compute.h (#1946)

I just started to learn indexing, and the comment at the beginning of index_compute.h does not look good...

* Allow splitting inner-most ID to create virtual innermost ID in transpose scheduler (#1930)

* WAR on index mapping when exact and permissive maps differ (#1960)

* Fix dump effective bandwidth (#1962)

* Upstream push ci fixes (#1965)

Cherry-picking upstream build failure patches from PR pytorch#84626

Changes include:
1. added throw in stringify
2. Split fused_reduction.cu as its size exceeds the limit in MSVC
3. update bzl build for runtime header
4. Fix a bug originally reported in https://github.com/pytorch/pytorch/pull/84626
5.
Meta internal build fix

Co-authored-by: Naoya Maruyama

* View scheduling (#1928)

* Move scheduler vectorize utilities into their own file (#1959)

* Enable transpose scheduler (#1927)

* Minor fix (#1967)

* Fix canScheduleCompileTime check of transpose scheduler (#1969)

* Add a null scheduler that helps segmenting away no-op schedules (#1835)

Co-authored-by: Gao, Xiang

* Enable Transpose operation (#1882)

* Segment self mapping fusions (#1954)

* Add support for some empty fusion (#1981)

* Remove non-const functions, remove GpuLower instance on build, pass in ca_map. (#1987)

* Add support for uniform RNG (#1986)

* Minor cleanup (#1992)

* Fix missing thread predicates

Unlikely to matter, but should be necessary

* Minor cleanup lower_unroll.cpp (#1994)

* Move ConcretizedBroadcastDomains to shared_ptr in GpuLower. (#1988)

* Cleanup of lower_utils.cpp: Isolate out GpuLower usage (#1989)

* Minor build fix. (#1996)

* Improve divisible split detection (#1970)

* cleanup (#1997)

* Just fixes comments (#1998)

* Just fixes comments

* Fix build problem (#1999)

* More strict validation (#2000)

* Test util cleanup (#2003)

Don't clear the memory allocator cache as it shouldn't be necessary

* fix merge

* fix merge

* format

* Make inlining even more modular (#2004)

I don't like `TensorView::setComputeAt` and `TensorView::setMaxProducer`; they are private, and I cannot use them conveniently. It would be better if there were some public method of `TensorView` that allows directly setting the CA position of a TV with the necessary validation. So I added two public methods, `TensorView::inlineAt` and `TensorView::updateMaxProducerPosition`, and removed `TensorView::setComputeAt` and `TensorView::setMaxProducer`.

The `inlineAt` can be safely used publicly. It will not inline into disallowed dimensions, and the max producer position will be kept consistent. There are two ways of using `inlineAt`:

If you only want to set the CA position of a single tensor, then simply do
```C++
tv->inlineAt(pos, /*best_effort=*/true);
```
If you want to set the CA position of multiple tensors, then you can do
```C++
MaxPosCalculator calc;
for (auto tv : tensors) {
  tv->inlineAt(pos, /*best_effort=*/true, &calc);
}
```
In both cases, the max producer position will be updated at the end of the `inlineAt` call. Manually constructing the `MaxPosCalculator` object is mainly for performance reasons: we don't want to build unmappable dimensions every time we call `inlineAt`. If we want to inline multiple tensors, we should build it at the beginning and use it in all `inlineAt` calls.

Even though `inlineAt` always updates the max producer position automatically, there are still cases where we want to manually trigger an update of the max producer position, and the `updateMaxProducerPosition` is designed for such a purpose. It is mainly used for grouped reductions.

**With `inlineAt`, I can refactor inlining to make it even more modular:** There is no longer an `InlinePropagator`. Innermost inlining is now just a dumb for loop:
```C++
MaxPosCalculator calc;
for (auto tv : all_tvs) {
  tv->inlineAt(-1, /*best_effort=*/true, &calc);
}
```
For standard and best effort inlining, we first need to do a propagation to find the positions in each tensor mapped to the given reference tensor's given position. With the positions calculated, inlining is again a dumb for loop.
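A rough sketch of that last two-step flow (not code from this PR; `computeInlinePositions`, `reference_tv`, `reference_pos`, and `all_tvs` are hypothetical names for the propagation helper and its inputs):
```C++
// Step 1: propagate from the reference tensor to find, for every TensorView,
// the position that maps to reference_tv's position reference_pos.
std::unordered_map<TensorView*, int64_t> mapped_pos =
    computeInlinePositions(reference_tv, reference_pos, all_tvs);

// Step 2: inlining is again just a loop of best-effort inlineAt calls,
// sharing one MaxPosCalculator so unmappable dimensions are built only once.
MaxPosCalculator calc;
for (auto tv : all_tvs) {
  tv->inlineAt(mapped_pos.at(tv), /*best_effort=*/true, &calc);
}
```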
* Contiguous indexing for View operations (#1990) * Enable tests previously disabled due to an aliasing bug (#2005) * Enable tests previously disabled due to an aliasing bug The bug was fixed by #1792 * Add matmul benchmark (#2007) Co-authored-by: Catherine Lee Co-authored-by: Rohan Varma Co-authored-by: CaoE Co-authored-by: Naoya Maruyama Co-authored-by: Justin Chu Co-authored-by: Jerry Zhang Co-authored-by: Kshiteej K Co-authored-by: PyTorch MergeBot Co-authored-by: jpvillam Co-authored-by: mattip Co-authored-by: Vasiliy Kuznetsov Co-authored-by: Horace He Co-authored-by: Jeff Daily Co-authored-by: Jane Xu Co-authored-by: Scott Wolchok Co-authored-by: samdow Co-authored-by: chengscott <60510scott@gmail.com> Co-authored-by: Brian Hirsh Co-authored-by: Nikita Shulga Co-authored-by: Ivan Yashchuk Co-authored-by: Ke Wen Co-authored-by: lezcano Co-authored-by: soulitzer Co-authored-by: Stephen Jia Co-authored-by: Mengwei Liu Co-authored-by: Kaichen Liu Co-authored-by: Khushi Agrawal Co-authored-by: chenlai Co-authored-by: Zain Rizvi Co-authored-by: Edward Z. Yang Co-authored-by: John Detloff Co-authored-by: Eli Uriegas Co-authored-by: Angela Yi Co-authored-by: Shirong Wu Co-authored-by: Sergii Dymchenko Co-authored-by: Nan Xiao Co-authored-by: Hansong Zhang Co-authored-by: Ishan-Rajgarhia Co-authored-by: Driss Guessous Co-authored-by: Peter Bell Co-authored-by: Lu, Chengjun Co-authored-by: Masaki Kozuki Co-authored-by: Souranil Sen Co-authored-by: Seonglyong Gong Co-authored-by: Henry Tu Co-authored-by: Antonio Kim Co-authored-by: BowenBao Co-authored-by: John Clow Co-authored-by: Robert Co-authored-by: Nikolay Korovaiko Co-authored-by: Digant Desai Co-authored-by: Richard Barnes Co-authored-by: Larry Liu <8188269+larryliu0820@users.noreply.github.com> Co-authored-by: Sherlock Huang Co-authored-by: thomasw21 <24695242+thomasw21@users.noreply.github.com> Co-authored-by: Huy Do Co-authored-by: Jagadish Krishnamoorthy Co-authored-by: Bin Chen Co-authored-by: Chen, Jian Ping Co-authored-by: ProGamerGov Co-authored-by: Weiwen Xia Co-authored-by: atalman Co-authored-by: jjsjann123 Co-authored-by: XiaobingSuper Co-authored-by: Andrew Gallagher Co-authored-by: Mandar Deshpande Co-authored-by: Alex Beloi Co-authored-by: Richard Zou Co-authored-by: erjia Co-authored-by: Animesh Jain Co-authored-by: Jianyu Huang Co-authored-by: zaf Co-authored-by: Michael Voznesensky Co-authored-by: migeedz Co-authored-by: Christian Jauvin Co-authored-by: Min Si Co-authored-by: Christian Sarofeen Co-authored-by: Adam J. Stewart Co-authored-by: Shen Li Co-authored-by: S. 
Song <41357537+shmsong@users.noreply.github.com> Co-authored-by: Taylor Robie Co-authored-by: Ian Graves Co-authored-by: Natalia Gimelshein Co-authored-by: Ivan Yashchuk Co-authored-by: kuttire42 <64169153+kuttire42@users.noreply.github.com> Co-authored-by: Naoya Maruyama Co-authored-by: Ryan Spring --- .circleci/cimodel/data/dimensions.py | 1 + .../cimodel/data/simple/ios_definitions.py | 25 +- .../cimodel/data/simple/macos_definitions.py | 105 +- .circleci/cimodel/data/simple/nightly_ios.py | 8 +- .../simple/upload_test_stats_definition.py | 20 + .../cimodel/data/simple/util/versions.py | 14 +- .circleci/config.yml | 294 +- .circleci/docker/build.sh | 10 +- .circleci/docker/common/install_base.sh | 3 +- .circleci/docker/common/install_conda.sh | 6 +- .circleci/docker/common/install_ucc.sh | 48 + .circleci/docker/requirements-ci.txt | 10 + .circleci/docker/ubuntu-cuda/Dockerfile | 11 + .circleci/docker/ubuntu/Dockerfile | 11 + .circleci/generate_config_yml.py | 10 + .../job-specs/job-specs-custom.yml | 205 +- .github/ISSUE_TEMPLATE/ci-sev.md | 2 + .github/PULL_REQUEST_TEMPLATE.md | 9 +- .../actions/get-workflow-job-id/action.yml | 2 +- .github/actions/setup-win/action.yml | 5 + .github/ci_commit_pins/torchdynamo.txt | 2 +- .github/ci_commit_pins/vision.txt | 2 +- .github/ci_commit_pins/xla.txt | 2 +- .github/generated-ciflow-ruleset.json | 5 - .github/merge_rules.json | 230 - .github/merge_rules.yaml | 342 + .github/requirements-gha-cache.txt | 16 + .github/scale-config.yml | 2 +- .github/scripts/comment_on_pr.py | 34 + .../scripts/generate_binary_build_matrix.py | 2 +- .github/scripts/get_workflow_job_id.py | 6 +- .github/scripts/install_nvidia_utils_linux.sh | 4 +- .github/scripts/lint_test_ownership.py | 88 - .github/scripts/test_trymerge.py | 21 +- .github/scripts/trymerge.py | 244 +- .github/scripts/trymerge_explainer.py | 146 + .github/scripts/update_commit_hashes.py | 1 + .github/templates/common.yml.j2 | 2 + .../linux_binary_build_workflow.yml.j2 | 2 +- .../windows_binary_build_workflow.yml.j2 | 4 +- .../workflows/_android-full-build-test.yml | 36 - .github/workflows/_binary-build-linux.yml | 5 +- .github/workflows/_binary-test-linux.yml | 4 +- .github/workflows/_binary-upload.yml | 61 +- .github/workflows/_buck-build-test.yml | 9 +- .github/workflows/_docs.yml | 16 +- .github/workflows/_ios-build-test.yml | 7 + .github/workflows/_linux-test.yml | 3 +- .github/workflows/_mac-build.yml | 8 +- ...{_mac-test-arm64.yml => _mac-test-mps.yml} | 2 +- .github/workflows/_mac-test.yml | 51 +- .github/workflows/_rocm-test.yml | 1 + .github/workflows/_win-test.yml | 1 + .../workflows/cancel_redundant_workflows.yml | 23 - .github/workflows/docker-release.yml | 89 + .github/workflows/lint.yml | 16 +- .github/workflows/mac-mps.yml | 35 + .github/workflows/periodic.yml | 52 + .github/workflows/pr-labels.yml | 12 +- .github/workflows/pull.yml | 312 + .../workflows/push_nightly_docker_ghcr.yml | 39 + .github/workflows/revert.yml | 26 +- .github/workflows/stale_pull_requests.yml | 42 - .github/workflows/trunk.yml | 35 +- .github/workflows/trymerge.yml | 26 +- .github/workflows/tryrebase.yml | 27 +- .github/workflows/update-viablestrict.yml | 16 +- .github/workflows/update_pytorch_labels.yml | 2 +- .github/workflows/update_s3_htmls.yml | 2 +- .gitmodules | 3 + .jenkins/caffe2/test.sh | 2 +- .jenkins/pytorch/build.sh | 10 + .jenkins/pytorch/common_utils.sh | 2 + .jenkins/pytorch/macos-build.sh | 4 +- .jenkins/pytorch/macos-common.sh | 41 +- .jenkins/pytorch/macos-test.sh | 27 +- 
.jenkins/pytorch/multigpu-test.sh | 4 +- .jenkins/pytorch/test.sh | 9 +- .../win-test-helpers/build_pytorch.bat | 4 +- ...miniconda3.bat => activate_miniconda3.bat} | 16 +- .../win-test-helpers/setup_pytorch_env.bat | 8 +- .lintrunner.toml | 4 + BUILD.bazel | 1 + CMakeLists.txt | 103 +- CODEOWNERS | 14 +- CONTRIBUTING.md | 41 +- Dockerfile | 26 +- WORKSPACE | 5 +- aten/CMakeLists.txt | 2 +- aten/src/ATen/BatchingRegistrations.cpp | 5 - aten/src/ATen/Context.h | 4 + aten/src/ATen/DLConvertor.cpp | 19 +- aten/src/ATen/Dispatch.h | 16 +- aten/src/ATen/EmptyTensor.cpp | 8 + aten/src/ATen/ExpandUtils.h | 4 +- aten/src/ATen/FunctionalStorageImpl.cpp | 6 +- aten/src/ATen/NestedTensorImpl.cpp | 149 +- aten/src/ATen/NestedTensorImpl.h | 92 +- aten/src/ATen/Parallel.h | 1 + aten/src/ATen/SparseCsrTensorImpl.cpp | 3 + aten/src/ATen/SparseCsrTensorImpl.h | 1 + aten/src/ATen/TensorIterator.h | 4 + aten/src/ATen/TensorMeta.h | 1 + aten/src/ATen/TensorSubclassLikeUtils.h | 19 + aten/src/ATen/ThreadLocalState.cpp | 4 +- aten/src/ATen/ThreadLocalState.h | 2 +- aten/src/ATen/Utils.h | 53 - aten/src/ATen/autocast_mode.cpp | 15 +- aten/src/ATen/core/List.h | 17 +- aten/src/ATen/core/List_inl.h | 10 +- aten/src/ATen/core/NamedRegistrations.cpp | 1 - aten/src/ATen/core/PhiloxRNGEngine.h | 131 +- aten/src/ATen/core/PythonFallbackKernel.cpp | 4 +- aten/src/ATen/core/PythonFallbackKernel.h | 2 +- aten/src/ATen/core/TensorBase.h | 16 + aten/src/ATen/core/TorchDispatchModeTLS.cpp | 58 - aten/src/ATen/core/TorchDispatchUtils.cpp | 31 + ...DispatchModeTLS.h => TorchDispatchUtils.h} | 14 +- .../ATen/core/dispatch/DispatchKeyExtractor.h | 5 +- aten/src/ATen/core/dispatch/Dispatcher.h | 20 + aten/src/ATen/core/dispatch/OperatorEntry.cpp | 22 +- aten/src/ATen/core/dispatch/OperatorEntry.h | 6 + aten/src/ATen/core/function_schema.h | 15 +- aten/src/ATen/core/interned_strings.cpp | 1 + aten/src/ATen/core/interned_strings.h | 14 +- aten/src/ATen/core/ivalue_inl.h | 1 + aten/src/ATen/core/symbol.h | 1 + aten/src/ATen/cpu/vec/vec256/vec256_qint.h | 94 +- .../cpu/vec/vec256/vsx/vec256_float_vsx.h | 22 +- aten/src/ATen/cpu/vec/vec512/vec512_qint.h | 72 + aten/src/ATen/cpu/vec/vec_base.h | 3 +- aten/src/ATen/cuda/CUDABlas.cpp | 63 +- aten/src/ATen/cuda/CUDABlas.h | 18 +- aten/src/ATen/cuda/CUDAEvent.h | 23 + aten/src/ATen/cuda/CUDASparse.h | 15 +- aten/src/ATen/cuda/CUDASparseDescriptors.cpp | 8 +- aten/src/ATen/cuda/CUDASparseDescriptors.h | 15 +- aten/src/ATen/cuda/jiterator.h | 2 +- aten/src/ATen/cuda/jiterator_impl.h | 30 +- aten/src/ATen/cuda/llvm_complex.cpp | 6 +- aten/src/ATen/jit_macros.h | 7 - aten/src/ATen/jiterator_macros.h | 4 +- aten/src/ATen/mps/IndexKernels.h | 132 + aten/src/ATen/mps/MPSAllocator.mm | 4 +- aten/src/ATen/mps/MPSDevice.h | 9 + aten/src/ATen/mps/MPSDevice.mm | 43 +- aten/src/ATen/mps/MPSFallback.mm | 4 - aten/src/ATen/mps/MPSGuardImpl.mm | 3 - .../ATen/native/AdaptiveAveragePooling.cpp | 10 +- aten/src/ATen/native/BatchLinearAlgebra.cpp | 262 +- aten/src/ATen/native/Convolution.cpp | 27 +- aten/src/ATen/native/Correlation.cpp | 9 +- aten/src/ATen/native/Cross.cpp | 24 +- aten/src/ATen/native/DispatchStub.h | 1 + aten/src/ATen/native/Dropout.cpp | 7 +- aten/src/ATen/native/ForeachOpsKernels.cpp | 10 + aten/src/ATen/native/Integration.cpp | 6 +- aten/src/ATen/native/Linear.cpp | 3 - aten/src/ATen/native/LinearAlgebra.cpp | 42 +- aten/src/ATen/native/MaxPooling.cpp | 8 +- aten/src/ATen/native/Normalization.cpp | 9 +- aten/src/ATen/native/Onehot.cpp | 4 +- aten/src/ATen/native/Pool.h | 5 + 
aten/src/ATen/native/README.md | 22 + aten/src/ATen/native/RangeFactories.cpp | 4 + aten/src/ATen/native/ReduceAllOps.cpp | 15 + aten/src/ATen/native/ReduceOps.cpp | 74 +- aten/src/ATen/native/ReduceOpsUtils.h | 2 +- aten/src/ATen/native/SoftMax.cpp | 17 +- aten/src/ATen/native/Sorting.cpp | 6 +- aten/src/ATen/native/SpectralOps.cpp | 7 +- .../ATen/native/TensorAdvancedIndexing.cpp | 6 +- aten/src/ATen/native/TensorConversions.cpp | 12 +- aten/src/ATen/native/TensorFactories.cpp | 48 +- aten/src/ATen/native/TensorShape.cpp | 13 +- aten/src/ATen/native/TestOps.cpp | 29 + aten/src/ATen/native/UpSample.h | 16 +- .../ao_sparse/quantized/cpu/qnnpack_utils.h | 2 +- aten/src/ATen/native/cpu/CopyKernel.cpp | 35 +- aten/src/ATen/native/cpu/CopyKernel.h | 12 + aten/src/ATen/native/cpu/Loops.h | 9 - aten/src/ATen/native/cpu/UnaryOpsKernel.cpp | 20 +- aten/src/ATen/native/cpu/UpSampleKernel.cpp | 28 +- .../native/cuda/AdaptiveAveragePooling.cu | 11 +- aten/src/ATen/native/cuda/Blas.cpp | 4 +- aten/src/ATen/native/cuda/Copy.cu | 2 - aten/src/ATen/native/cuda/Copy.h | 10 + aten/src/ATen/native/cuda/CumminmaxKernel.cu | 29 + aten/src/ATen/native/cuda/CumprodKernel.cu | 23 + aten/src/ATen/native/cuda/CumsumKernel.cu | 25 + aten/src/ATen/native/cuda/DistanceKernel.cu | 138 +- aten/src/ATen/native/cuda/EmbeddingBag.cu | 11 +- .../ATen/native/cuda/ForeachPointwiseOp.cu | 31 +- .../ATen/native/cuda/FractionalMaxPool2d.cu | 14 +- aten/src/ATen/native/cuda/Indexing.cu | 5 + aten/src/ATen/native/cuda/JitLoops.cuh | 4 - .../ATen/native/cuda/LinearAlgebraStubs.cpp | 9 +- .../ATen/native/cuda/LogcumsumexpKernel.cu | 37 + aten/src/ATen/native/cuda/Loss.cu | 2 + aten/src/ATen/native/cuda/NLLLoss2d.cu | 12 +- aten/src/ATen/native/cuda/Normalization.cuh | 83 +- .../ATen/native/cuda/PersistentSoftmax.cuh | 8 +- aten/src/ATen/native/cuda/ScanKernels.cpp | 5 + .../cuda/{ScanKernels.cu => ScanUtils.cuh} | 89 +- aten/src/ATen/native/cuda/SoftMax.cu | 12 +- aten/src/ATen/native/cuda/TensorFactories.cu | 8 +- aten/src/ATen/native/cuda/TensorTopK.cu | 14 +- .../ATen/native/cuda/UnaryComplexKernels.cu | 39 +- .../ATen/native/cuda/UnarySpecialOpsKernel.cu | 9 +- aten/src/ATen/native/cuda/block_reduce.cuh | 43 +- aten/src/ATen/native/cuda/jit_utils.cpp | 258 +- aten/src/ATen/native/cuda/jit_utils.h | 1 - .../src/ATen/native/cuda/layer_norm_kernel.cu | 2 +- .../native/cuda/linalg/BatchLinearAlgebra.cpp | 323 +- .../cuda/linalg/BatchLinearAlgebraLib.cpp | 95 - .../cuda/linalg/BatchLinearAlgebraLib.h | 5 - .../ATen/native/cuda/reduction_template.cuh | 16 + aten/src/ATen/native/cudnn/Conv_v7.cpp | 8 +- aten/src/ATen/native/cudnn/Conv_v8.cpp | 16 +- aten/src/ATen/native/layer_norm.cpp | 12 +- aten/src/ATen/native/miopen/Conv_miopen.cpp | 227 + aten/src/ATen/native/mkldnn/Common.h | 46 + aten/src/ATen/native/mkldnn/Conv.cpp | 89 +- aten/src/ATen/native/mkldnn/ConvPrepack.cpp | 289 + aten/src/ATen/native/mkldnn/ConvPrepack.h | 49 + aten/src/ATen/native/mkldnn/Matmul.cpp | 14 +- aten/src/ATen/native/mkldnn/OpContext.cpp | 47 + aten/src/ATen/native/mkldnn/OpContext.h | 99 + aten/src/ATen/native/mkldnn/Pooling.cpp | 18 +- .../mkldnn/RegisterMkldnnOpContextClass.cpp | 60 + aten/src/ATen/native/mps/OperationUtils.mm | 7 +- .../ATen/native/mps/operations/Activation.mm | 39 +- .../ATen/native/mps/operations/BinaryOps.mm | 1 + .../ATen/native/mps/operations/BitwiseOps.mm | 336 + .../ATen/native/mps/operations/ConstantOps.mm | 13 +- .../ATen/native/mps/operations/Convolution.mm | 68 +- .../native/mps/operations/Distributions.mm | 17 - 
.../src/ATen/native/mps/operations/Indexing.h | 51 + .../ATen/native/mps/operations/Indexing.mm | 145 +- aten/src/ATen/native/mps/operations/Linear.mm | 9 +- .../native/mps/operations/LinearAlgebra.mm | 65 +- .../src/ATen/native/mps/operations/LossOps.mm | 6 - aten/src/ATen/native/mps/operations/Pad.mm | 304 + .../native/mps/operations/PointwiseOps.mm | 5 + .../src/ATen/native/mps/operations/Pooling.mm | 16 - .../ATen/native/mps/operations/ReduceOps.mm | 1 - aten/src/ATen/native/mps/operations/Repeat.mm | 13 +- aten/src/ATen/native/mps/operations/RnnOps.mm | 3 - .../native/mps/operations/ScatterGather.mm | 49 +- aten/src/ATen/native/mps/operations/Shape.mm | 286 - .../src/ATen/native/mps/operations/SoftMax.mm | 2 - .../native/mps/operations/TensorCompare.mm | 6 - .../native/mps/operations/TriangularOps.mm | 14 +- aten/src/ATen/native/mps/operations/View.mm | 26 +- aten/src/ATen/native/native_functions.yaml | 654 +- .../native/nested/NestedTensorBackward.cpp | 106 + .../ATen/native/nested/NestedTensorMath.cpp | 99 +- .../src/ATen/native/nested/NestedTensorMath.h | 53 +- .../NestedTensorTransformerFunctions.cpp | 25 +- .../nested/NestedTensorTransformerFunctions.h | 2 - .../cuda/NestedTensorTransformerFunctions.cpp | 1 + .../cuda/NestedTensorTransformerFunctions.cu | 6 +- aten/src/ATen/native/quantized/README.md | 3 +- .../ATen/native/quantized/cpu/QuantUtils.h | 20 + .../native/quantized/cpu/conv_serialization.h | 3 + .../cpu/kernels/QuantizedOpKernels.cpp | 51 +- aten/src/ATen/native/quantized/cpu/qconv.cpp | 27 +- .../quantized/cpu/qembeddingbag_prepack.cpp | 17 +- .../src/ATen/native/quantized/cpu/qlinear.cpp | 17 +- .../native/quantized/cpu/qlinear_dynamic.cpp | 23 +- .../fully-connected-sparse-operator-tester.h | 2 +- .../gemm-block-sparse-microkernel-tester.h | 2 +- .../ATen/native/sparse/SparseCsrTensor.cpp | 69 +- .../ATen/native/sparse/SparseTensorMath.cpp | 281 +- .../src/ATen/native/sparse/SparseTensorMath.h | 1 + aten/src/ATen/native/sparse/cuda/SoftMax.cu | 14 +- .../native/sparse/cuda/SparseBlasImpl.cpp | 16 +- .../native/sparse/cuda/SparseCUDABlas.cpp | 10 +- .../sparse/cuda/SparseCUDATensorMath.cu | 16 +- .../ATen/native/sparse/cuda/SparseMatMul.cu | 6 +- aten/src/ATen/native/tags.yaml | 6 + .../ATen/native/transformers/attention.cpp | 116 +- .../native/transformers/cuda/attention.cu | 10 +- .../ATen/native/transformers/transformer.cpp | 7 +- aten/src/ATen/native/ts_native_functions.yaml | 1 - aten/src/ATen/native/vulkan/api/Allocator.h | 2 +- aten/src/ATen/native/vulkan/api/Command.cpp | 76 + aten/src/ATen/native/vulkan/api/Command.h | 16 +- aten/src/ATen/native/vulkan/api/Common.h | 45 +- aten/src/ATen/native/vulkan/api/Context.cpp | 60 +- aten/src/ATen/native/vulkan/api/Context.h | 127 +- aten/src/ATen/native/vulkan/api/Resource.cpp | 50 +- aten/src/ATen/native/vulkan/api/Resource.h | 18 + aten/src/ATen/native/vulkan/api/Runtime.cpp | 11 + .../src/ATen/native/vulkan/api/vk_mem_alloc.h | 19558 ------------- aten/src/ATen/native/vulkan/glsl/add.glsl | 13 +- aten/src/ATen/native/vulkan/glsl/add_.glsl | 9 +- aten/src/ATen/native/vulkan/glsl/div.glsl | 13 +- aten/src/ATen/native/vulkan/glsl/div_.glsl | 9 +- aten/src/ATen/native/vulkan/glsl/mul.glsl | 13 +- aten/src/ATen/native/vulkan/glsl/mul_.glsl | 9 +- aten/src/ATen/native/vulkan/glsl/sub.glsl | 13 +- aten/src/ATen/native/vulkan/glsl/sub_.glsl | 9 +- .../src/ATen/native/vulkan/ops/Arithmetic.cpp | 149 +- aten/src/ATen/native/vulkan/ops/Batchnorm.cpp | 2 +- aten/src/ATen/native/vulkan/ops/Common.cpp | 36 - 
aten/src/ATen/native/vulkan/ops/Common.h | 52 +- aten/src/ATen/native/vulkan/ops/Concat.cpp | 4 +- .../ATen/native/vulkan/ops/Convolution.cpp | 1118 +- aten/src/ATen/native/vulkan/ops/Convolution.h | 150 +- aten/src/ATen/native/vulkan/ops/Copy.cpp | 193 +- aten/src/ATen/native/vulkan/ops/Copy.h | 32 +- aten/src/ATen/native/vulkan/ops/Glu.cpp | 2 +- aten/src/ATen/native/vulkan/ops/Gru.cpp | 328 +- aten/src/ATen/native/vulkan/ops/Gru.h | 128 +- aten/src/ATen/native/vulkan/ops/Lerp.cpp | 14 +- aten/src/ATen/native/vulkan/ops/Lstm.cpp | 264 +- aten/src/ATen/native/vulkan/ops/Lstm.h | 102 +- aten/src/ATen/native/vulkan/ops/Mm.cpp | 182 +- aten/src/ATen/native/vulkan/ops/Mm.h | 71 +- .../vulkan/ops/QuantizedConvolution.cpp | 648 - .../native/vulkan/ops/QuantizedConvolution.h | 44 - aten/src/ATen/native/vulkan/ops/Register.cpp | 306 +- aten/src/ATen/native/vulkan/ops/Shape.cpp | 2 +- aten/src/ATen/native/vulkan/ops/Slice.cpp | 8 +- aten/src/ATen/native/vulkan/ops/Tensor.h | 36 +- .../vulkan/ops/TransposeConvolution2d.cpp | 600 - .../vulkan/ops/TransposeConvolution2d.h | 125 - aten/src/ATen/native/vulkan/ops/Utils.cpp | 172 +- aten/src/ATen/native/vulkan/ops/Utils.h | 17 + .../native/vulkan/ops/VulkanOpContext.cpp | 34 - .../ATen/native/vulkan/ops/VulkanOpContext.h | 35 - .../native/vulkan/ops/VulkanPackedContext.h | 33 + aten/src/ATen/native/vulkan/ops/cumsum.cpp | 3 +- .../ATen/templates/DispatchKeyFunctions_inl.h | 5 - .../templates/RegisterDispatchDefinitions.ini | 24 + .../ATen/templates/RegisterDispatchKey.cpp | 27 +- aten/src/ATen/test/cpu_generator_test.cpp | 36 +- aten/src/ATen/test/cuda_generator_test.cu | 20 +- aten/src/ATen/test/vulkan_api_test.cpp | 523 +- .../ATen/test/vulkan_quantized_api_test.cpp | 2 + aten/src/ATen/test/xnnpack_test.cpp | 233 +- benchmarks/cpp/nvfuser/CMakeLists.txt | 5 +- .../cpp/nvfuser/batch_norm_channels_first.cpp | 4 - .../batch_norm_channels_first_backward.cpp | 4 - .../cpp/nvfuser/batch_norm_channels_last.cpp | 4 - .../batch_norm_channels_last_backward.cpp | 4 - benchmarks/cpp/nvfuser/bert.cpp | 24 +- benchmarks/cpp/nvfuser/broadcast.cpp | 10 +- benchmarks/cpp/nvfuser/gelu_backward.cpp | 9 +- benchmarks/cpp/nvfuser/heuristic_lookup.cpp | 14 +- benchmarks/cpp/nvfuser/instance_norm.cpp | 6 +- benchmarks/cpp/nvfuser/layer_norm.cpp | 8 +- .../cpp/nvfuser/layer_norm_backward.cpp | 9 +- benchmarks/cpp/nvfuser/lstm_cell.cpp | 4 +- benchmarks/cpp/nvfuser/matmul.cpp | 357 + benchmarks/cpp/nvfuser/reduction.cpp | 10 +- benchmarks/cpp/nvfuser/rms_norm.cpp | 2 - benchmarks/cpp/nvfuser/rms_norm_backward.cpp | 3 - benchmarks/cpp/nvfuser/scale_bias_relu.cpp | 18 +- benchmarks/cpp/nvfuser/shape_inference.cpp | 9 +- benchmarks/cpp/nvfuser/softmax.cpp | 6 +- benchmarks/cpp/nvfuser/softmax_backward.cpp | 34 +- benchmarks/cpp/nvfuser/softmax_dropout.cpp | 4 +- benchmarks/cpp/nvfuser/timm.cpp | 11 +- benchmarks/cpp/nvfuser/utils.cpp | 25 +- benchmarks/cpp/nvfuser/utils.h | 26 +- benchmarks/distributed/ddp/benchmark.py | 2 +- .../operator_benchmark/pt/qactivation_test.py | 14 +- .../operator_benchmark/pt/qarithmetic_test.py | 2 +- .../pt/qatembedding_ops_test.py | 2 +- benchmarks/operator_benchmark/pt/qcat_test.py | 2 +- .../operator_benchmark/pt/qconv_test.py | 2 +- .../pt/qembeddingbag_test.py | 2 +- .../operator_benchmark/pt/qlinear_test.py | 4 +- .../pt/quantization_test.py | 2 +- .../static_runtime/test_static_runtime.cc | 32 +- buckbuild.bzl | 63 +- build.bzl | 1 + build_variables.bzl | 25 +- c10/core/DispatchKeySet.cpp | 2 +- c10/core/SymInt.cpp | 109 +- c10/core/SymInt.h 
| 62 +- c10/core/SymIntArrayRef.h | 4 +- c10/core/SymIntNodeImpl.h | 5 +- c10/core/TensorImpl.cpp | 67 +- c10/core/TensorImpl.h | 63 +- c10/core/WrapDimMinimal.cpp | 7 +- c10/core/impl/GPUTrace.cpp | 22 + c10/core/impl/GPUTrace.h | 30 + c10/core/impl/PyInterpreter.cpp | 40 + c10/core/impl/PyInterpreter.h | 107 +- c10/core/impl/TorchDispatchModeTLS.cpp | 38 + c10/core/impl/TorchDispatchModeTLS.h | 20 + c10/cuda/CUDACachingAllocator.cpp | 112 +- c10/cuda/CUDACachingAllocator.h | 23 +- c10/cuda/CUDAStream.cpp | 9 + c10/cuda/impl/CUDAGuardImpl.h | 21 + c10/macros/Macros.h | 50 +- c10/test/core/SymInt_test.cpp | 3 +- c10/util/Exception.h | 4 + c10/util/IdWrapper.h | 1 + c10/util/SmallVector.cpp | 1 + c10/util/SmallVector.h | 1 + c10/util/hash.h | 8 + c10/util/logging_is_google_glog.h | 21 +- c10/util/strides.h | 14 +- c10/util/variant.h | 1 - caffe2/CMakeLists.txt | 98 +- caffe2/core/tensor.h | 8 + caffe2/quantization/server/dnnlowp.h | 2 + .../server/fully_connected_fake_lowp_op.h | 2 + caffe2/serialize/inline_container.cc | 8 +- caffe2/serialize/inline_container.h | 4 - caffe2/serialize/versions.h | 13 - caffe2/sgd/learning_rate_op.cc | 10 +- caffe2/utils/threadpool/ThreadPool.cc | 11 + cmake/Dependencies.cmake | 13 +- cmake/External/nccl.cmake | 44 +- cmake/External/ucc.cmake | 19 +- cmake/public/LoadHIP.cmake | 17 +- cmake/public/utils.cmake | 24 +- defs_gpu.bzl | 4 +- docker.Makefile | 47 +- docs/requirements.txt | 15 +- docs/source/amp.rst | 2 +- docs/source/backends.rst | 1 + docs/source/community/governance.rst | 29 +- docs/source/community/persons_of_interest.rst | 16 +- docs/source/conf.py | 53 +- docs/source/cuda.rst | 1 + docs/source/elastic/timer.rst | 11 + docs/source/index.rst | 1 + docs/source/masked.rst | 11 + docs/source/notes/cuda.rst | 2 +- docs/source/onnx.rst | 29 +- docs/source/optim.rst | 1 + docs/source/package.rst | 32 +- docs/source/quantization-support.rst | 47 +- docs/source/quantization.rst | 20 +- .../unittest/windows/scripts/environment.yml | 1 + .../codegen/gen_functorch_lagging_op_db.py | 58 - .../maml_omniglot/support/omniglot_loaders.py | 2 +- functorch/functorch/_src/aot_autograd.py | 531 +- functorch/functorch/_src/compile_utils.py | 10 + functorch/functorch/_src/compilers.py | 290 +- functorch/functorch/_src/fx_minifier.py | 383 +- functorch/functorch/_src/partitioners.py | 17 +- functorch/functorch/_src/python_key.py | 5 +- functorch/functorch/_src/vmap.py | 18 +- functorch/functorch/compile/__init__.py | 8 +- .../functorch/csrc/BatchRulesActivation.cpp | 5 + .../functorch/csrc/BatchRulesBinaryOps.cpp | 64 +- .../csrc/BatchRulesDecompositions.cpp | 22 + functorch/functorch/csrc/BatchRulesHelper.cpp | 44 + functorch/functorch/csrc/BatchRulesHelper.h | 3 + .../csrc/BatchRulesLinearAlgebra.cpp | 333 +- .../functorch/csrc/BatchRulesReduceOps.cpp | 5 - .../functorch/csrc/BatchedTensorImpl.cpp | 5 + functorch/functorch/csrc/BatchedTensorImpl.h | 5 + functorch/functorch/csrc/CompileCache.cpp | 2 +- functorch/functorch/csrc/Constants.h | 1 - functorch/functorch/csrc/CustomFunction.cpp | 3 +- functorch/functorch/csrc/DynamicLayer.cpp | 2 - functorch/functorch/csrc/Interpreter.h | 5 +- .../csrc/LegacyBatchingRegistrations.cpp | 110 +- .../functorch/csrc/PyTorchOperatorHacks.cpp | 90 - functorch/functorch/csrc/dim/arena.h | 328 + functorch/functorch/csrc/dim/dim.cpp | 3191 +++ functorch/functorch/csrc/dim/dim.h | 8 + functorch/functorch/csrc/dim/minpybind.h | 710 + .../csrc/dim/python_variable_simple.h | 49 + functorch/functorch/csrc/init.cpp | 25 +- 
functorch/functorch/dim/README.md | 759 + functorch/functorch/dim/__init__.py | 170 + functorch/functorch/dim/batch_tensor.py | 26 + functorch/functorch/dim/delayed_mul_tensor.py | 67 + functorch/functorch/dim/dim.py | 95 + functorch/functorch/dim/magic_trace.py | 34 + functorch/functorch/dim/op_properties.py | 282 + functorch/functorch/dim/reference.py | 557 + functorch/functorch/dim/tree_map.py | 12 + functorch/functorch/dim/wrap_type.py | 49 + functorch/functorch/experimental/cond.py | 137 + functorch/functorch/experimental/ops.py | 36 + .../colab/per_sample_grads_colab.ipynb | 4 +- functorch/notebooks/ensembling.ipynb | 2 +- functorch/notebooks/per_sample_grads.ipynb | 4 +- functorch/op_analysis/gen_data.py | 2 +- functorch/setup.py | 1 + functorch/test/attn_ft.py | 140 + functorch/test/attn_positional.py | 93 + functorch/test/common_utils.py | 158 +- functorch/test/discover_coverage.py | 4 - functorch/test/functorch_lagging_op_db.py | 574 - functorch/test/test_control_flow.py | 183 + functorch/test/test_dims.py | 594 + functorch/test/test_eager_transforms.py | 247 +- functorch/test/test_functionalize.py | 2 - functorch/test/test_minifier.py | 73 +- functorch/test/test_ops.py | 392 +- functorch/test/test_pythonkey.py | 192 +- functorch/test/test_vmap.py | 532 +- functorch/test/xfail_suggester.py | 3 + ios/TestApp/fastlane/Fastfile | 2 +- requirements.txt | 1 + scripts/onnx/test.sh | 1 + setup.py | 339 +- test/allowlist_for_publicAPI.json | 22 +- test/ao/sparsity/test_composability.py | 14 +- test/ao/sparsity/test_data_sparsifier.py | 99 +- test/cpp/api/CMakeLists.txt | 24 +- test/cpp/api/autograd.cpp | 139 +- test/cpp/api/dataloader.cpp | 2 - test/cpp/c10d/CMakeLists.txt | 13 + test/cpp/c10d/ProcessGroupGlooAsyncTest.cpp | 3 +- test/cpp/c10d/ProcessGroupNCCLTest.cpp | 8 +- test/cpp/c10d/ProcessGroupUCCTest.cpp | 35 + test/cpp/jit/CMakeLists.txt | 7 +- test/cpp/jit/test_flatbuffer.cpp | 227 +- test/cpp/jit/test_load_upgraders.cpp | 2 - test/cpp/jit/test_misc.cpp | 109 + test/cpp/lazy/test_ir.cpp | 10 +- test/cpp/lazy/test_lazy_ops.cpp | 2 + test/cpp/profiler/containers.cpp | 15 + test/cpp/tensorexpr/test_cuda.cpp | 819 +- .../_shard/checkpoint/test_checkpoint.py | 139 +- .../_shard/checkpoint/test_planner.py | 268 + .../_shard/checkpoint/test_utils.py | 4 +- .../sharded_tensor/ops/test_embedding.py | 18 +- .../sharded_tensor/ops/test_embedding_bag.py | 12 +- .../sharded_tensor/ops/test_tensor_ops.py | 8 + .../timer/file_based_local_timer_test.py | 266 + .../fsdp/test_checkpoint_wrapper.py | 8 +- test/distributed/fsdp/test_fsdp_comm_hooks.py | 193 +- test/distributed/fsdp/test_fsdp_misc.py | 151 +- .../distributed/fsdp/test_fsdp_optim_state.py | 245 +- test/distributed/fsdp/test_fsdp_state_dict.py | 20 +- test/distributed/fsdp/test_shard_utils.py | 26 +- test/distributed/fsdp/test_utils.py | 11 + test/distributed/test_c10d_nccl.py | 10 +- test/distributed/test_store.py | 2 +- test/distributions/test_distributions.py | 21 +- ..._compat-fx_backcompat_class_members.expect | 4 +- ...t-fx_backcompat_function_signatures.expect | 2 +- .../check_forward_backward_compatibility.py | 142 +- test/fx/quantization.py | 2 +- test/fx/test_pass_infra.py | 15 + test/fx/test_subgraph_rewriter.py | 167 +- test/fx/test_z3_gradual_types.py | 188 +- test/jit/test_backends.py | 11 +- test/jit/test_freezing.py | 117 +- test/jit/test_legacy_upgraders.py | 553 - test/jit/test_module_interface.py | 34 +- test/jit/test_tensor_creation_ops.py | 8 +- test/jit/test_upgraders.py | 13 - test/jit/test_with.py | 2 + 
test/mobile/test_lite_script_type.py | 2 +- .../test_quantize_fx_lite_script_module.py | 2 +- test/nn/test_packed_sequence.py | 392 + test/nn/test_pooling.py | 1429 + .../expect/TestOperators.test_acos.expect | 2 +- .../TestOperators.test_add_broadcast.expect | 2 +- ...stOperators.test_add_left_broadcast.expect | 2 +- ...tOperators.test_add_size1_broadcast.expect | 2 +- ...tors.test_add_size1_right_broadcast.expect | 2 +- ....test_add_size1_singleton_broadcast.expect | 2 +- .../TestOperators.test_addconstant.expect | 2 +- .../expect/TestOperators.test_addmm.expect | 2 +- .../expect/TestOperators.test_argmax.expect | 2 +- .../expect/TestOperators.test_asin.expect | 2 +- .../expect/TestOperators.test_at_op.expect | 8 +- .../expect/TestOperators.test_atan.expect | 2 +- .../TestOperators.test_avg_pool2d.expect | 2 +- .../expect/TestOperators.test_baddbmm.expect | 2 +- .../expect/TestOperators.test_basic.expect | 2 +- .../TestOperators.test_batchnorm.expect | 7 +- .../TestOperators.test_batchnorm_1d.expect | 7 +- ...stOperators.test_batchnorm_noaffine.expect | 7 +- ...tOperators.test_batchnorm_onnx_irv4.expect | 7 +- ...stOperators.test_batchnorm_training.expect | 9 +- .../expect/TestOperators.test_chunk.expect | 2 +- .../expect/TestOperators.test_clip.expect | 2 +- .../expect/TestOperators.test_clip_max.expect | 2 +- .../expect/TestOperators.test_clip_min.expect | 2 +- .../expect/TestOperators.test_concat2.expect | 2 +- .../expect/TestOperators.test_conv.expect | 2 +- .../TestOperators.test_conv_onnx_irv4.expect | 2 +- .../TestOperators.test_convtranspose.expect | 2 +- .../onnx/expect/TestOperators.test_cos.expect | 2 +- .../expect/TestOperators.test_dict.expect | 2 +- .../expect/TestOperators.test_dict_str.expect | 2 +- .../onnx/expect/TestOperators.test_dim.expect | 2 +- .../expect/TestOperators.test_dropout.expect | 2 +- .../TestOperators.test_dropout_default.expect | 2 +- ...TestOperators.test_dropout_training.expect | 2 +- .../onnx/expect/TestOperators.test_elu.expect | 2 +- .../TestOperators.test_embedding_bags.expect | 2 +- .../TestOperators.test_empty_like.expect | 2 +- .../expect/TestOperators.test_equal.expect | 2 +- .../onnx/expect/TestOperators.test_erf.expect | 2 +- .../onnx/expect/TestOperators.test_exp.expect | 2 +- .../expect/TestOperators.test_expand.expect | 2 +- .../expect/TestOperators.test_flatten.expect | 7 +- .../TestOperators.test_flatten2D.expect | 2 +- .../TestOperators.test_frobenius_norm.expect | 2 +- .../expect/TestOperators.test_full.expect | 2 +- .../TestOperators.test_full_like.expect | 2 +- .../expect/TestOperators.test_gather.expect | 2 +- test/onnx/expect/TestOperators.test_ge.expect | 2 +- .../expect/TestOperators.test_gelu.expect | 2 +- test/onnx/expect/TestOperators.test_gt.expect | 2 +- .../expect/TestOperators.test_hardtanh.expect | 2 +- .../TestOperators.test_implicit_expand.expect | 2 +- .../expect/TestOperators.test_index.expect | 2 +- .../expect/TestOperators.test_isnan.expect | 2 +- .../TestOperators.test_layer_norm_aten.expect | 2 +- test/onnx/expect/TestOperators.test_le.expect | 2 +- .../expect/TestOperators.test_linear.expect | 2 +- .../TestOperators.test_log_sigmoid.expect | 2 +- .../TestOperators.test_logsoftmax.expect | 2 +- test/onnx/expect/TestOperators.test_lt.expect | 2 +- .../onnx/expect/TestOperators.test_max.expect | 2 +- .../expect/TestOperators.test_maxpool.expect | 2 +- .../TestOperators.test_maxpool_indices.expect | 2 +- .../expect/TestOperators.test_mean.expect | 2 +- .../TestOperators.test_mean_dtype.expect | 2 +- 
.../expect/TestOperators.test_meshgrid.expect | 32 +- .../onnx/expect/TestOperators.test_min.expect | 2 +- test/onnx/expect/TestOperators.test_mm.expect | 2 +- .../expect/TestOperators.test_mul_bool.expect | 2 +- .../TestOperators.test_mul_fp_bool.expect | 2 +- .../expect/TestOperators.test_narrow.expect | 2 +- test/onnx/expect/TestOperators.test_ne.expect | 2 +- .../expect/TestOperators.test_nonzero.expect | 2 +- .../expect/TestOperators.test_norm_p1.expect | 2 +- .../expect/TestOperators.test_norm_p2.expect | 2 +- .../TestOperators.test_ones_like.expect | 2 +- .../onnx/expect/TestOperators.test_pad.expect | 12 +- .../expect/TestOperators.test_params.expect | 2 +- ...TestOperators.test_params_onnx_irv4.expect | 2 +- .../expect/TestOperators.test_permute2.expect | 2 +- .../onnx/expect/TestOperators.test_pow.expect | 2 +- .../expect/TestOperators.test_prelu.expect | 2 +- .../expect/TestOperators.test_prod.expect | 2 +- .../TestOperators.test_prod_dtype.expect | 2 +- .../expect/TestOperators.test_rand.expect | 2 +- .../expect/TestOperators.test_randn.expect | 2 +- ...rs.test_reduce_sum_negative_indices.expect | 2 +- .../TestOperators.test_reduced_mean.expect | 2 +- ...stOperators.test_reduced_mean_dtype.expect | 2 +- ...Operators.test_reduced_mean_keepdim.expect | 2 +- .../TestOperators.test_reduced_prod.expect | 2 +- ...stOperators.test_reduced_prod_dtype.expect | 2 +- ...Operators.test_reduced_prod_keepdim.expect | 2 +- .../TestOperators.test_reduced_sum.expect | 2 +- ...estOperators.test_reduced_sum_dtype.expect | 2 +- ...tOperators.test_reduced_sum_keepdim.expect | 2 +- .../TestOperators.test_reducemax.expect | 2 +- .../TestOperators.test_reducemin.expect | 2 +- .../TestOperators.test_remainder.expect | 2 +- .../expect/TestOperators.test_repeat.expect | 2 +- ...tOperators.test_repeat_dim_overflow.expect | 2 +- .../expect/TestOperators.test_rrelu.expect | 26 +- .../expect/TestOperators.test_rsqrt.expect | 2 +- .../expect/TestOperators.test_rsub.expect | 2 +- .../TestOperators.test_scatter_add.expect | 2 +- .../expect/TestOperators.test_selu.expect | 2 +- .../TestOperators.test_shape_value_map.expect | 12 +- .../expect/TestOperators.test_sign.expect | 2 +- .../onnx/expect/TestOperators.test_sin.expect | 2 +- .../expect/TestOperators.test_slice.expect | 2 +- .../expect/TestOperators.test_split.expect | 2 +- ...TestOperators.test_split_with_sizes.expect | 2 +- .../expect/TestOperators.test_sqrt.expect | 2 +- .../onnx/expect/TestOperators.test_std.expect | 2 +- .../onnx/expect/TestOperators.test_sum.expect | 2 +- .../TestOperators.test_sum_dtype.expect | 2 +- .../onnx/expect/TestOperators.test_tan.expect | 2 +- .../TestOperators.test_transpose.expect | 2 +- .../expect/TestOperators.test_type_as.expect | 2 +- .../expect/TestOperators.test_unfold.expect | 2 +- .../TestOperators.test_unsqueeze.expect | 2 +- ...erators.test_upsample_nearest_scale.expect | 2 +- ..._nearest_scale_default_scale_factor.expect | 2 +- ...perators.test_upsample_nearest_size.expect | 2 +- .../expect/TestOperators.test_view.expect | 7 +- .../TestOperators.test_view_flatten.expect | 7 +- .../TestOperators.test_zeros_like.expect | 2 +- test/onnx/internal/test_beartype.py | 86 + test/onnx/onnx_test_common.py | 6 + test/onnx/pytorch_test_common.py | 18 + test/onnx/test_autograd_funs.py | 212 + test/onnx/test_models.py | 12 +- test/onnx/test_models_onnxruntime.py | 28 +- test/onnx/test_onnx_opset.py | 2 +- test/onnx/test_pytorch_jit_onnx.py | 13 +- .../test_pytorch_onnx_caffe2_quantized.py | 4 +- 
test/onnx/test_pytorch_onnx_no_runtime.py | 23 +- test/onnx/test_pytorch_onnx_onnxruntime.py | 184 +- test/onnx/test_utility_funs.py | 124 +- test/onnx/test_verification.py | 33 + test/quantization/ao_migration/common.py | 22 +- .../ao_migration/test_ao_migration.py | 350 + .../bc/test_backward_compatibility.py | 4 +- .../experimental/apot_fx_graph_mode_ptq.py | 131 + .../experimental/apot_fx_graph_mode_qat.py | 94 + ...raph_mode_apot.py => quantization_util.py} | 152 +- test/quantization/core/test_docs.py | 39 +- .../core/test_quantized_functional.py | 2 +- .../core/test_quantized_module.py | 88 +- test/quantization/core/test_quantized_op.py | 147 +- test/quantization/core/test_utils.py | 65 + .../quantization/core/test_workflow_module.py | 8 +- test/quantization/dbr/test_quantize_dbr.py | 1619 -- test/quantization/eager/test_fuse_eager.py | 2 +- .../eager/test_numeric_suite_eager.py | 7 +- .../eager/test_quantize_eager_ptq.py | 26 +- .../eager/test_quantize_eager_qat.py | 12 +- test/quantization/fx/test_equalize_fx.py | 2 +- test/quantization/fx/test_model_report_fx.py | 114 +- test/quantization/fx/test_numeric_suite_fx.py | 6 +- test/quantization/fx/test_quantize_fx.py | 209 +- test/run_doctests.sh | 29 + test/run_test.py | 93 +- test/test_ao_sparsity.py | 8 +- test/test_autograd.py | 440 +- test/test_binary_ufuncs.py | 21 +- test/test_cpp_api_parity.py | 59 +- test/test_cpp_extensions_jit.py | 3 +- ...cpp_extensions_open_device_registration.py | 4 +- test/test_cuda.py | 44 +- test/test_cuda_trace.py | 96 + test/test_dataloader.py | 69 +- test/test_datapipe.py | 277 +- test/test_decomp.py | 13 +- test/test_dlpack.py | 193 + test/test_dynamic_shapes.py | 45 +- test/test_dynamo_cudagraphs.py | 192 - test/test_expanded_weights.py | 9 +- test/test_fake_tensor.py | 28 +- test/test_foreach.py | 14 +- test/test_function_schema.py | 6 + test/test_functionalization.py | 118 +- test/test_fx.py | 86 +- test/test_fx_passes.py | 371 +- test/test_fx_reinplace_pass.py | 112 +- test/test_jit.py | 15 +- test/test_jit_autocast.py | 98 +- test/test_jit_cuda_fuser.py | 154 +- test/test_jit_fuser_te.py | 37 + test/test_jiterator.py | 10 +- test/test_linalg.py | 98 +- test/test_maskedtensor.py | 245 + test/test_meta.py | 14 + test/test_mkldnn_fusion.py | 118 + test/test_module_init.py | 87 +- test/test_mps.py | 340 +- test/test_multiprocessing.py | 5 +- test/test_namedtensor.py | 12 +- test/test_namedtuple_return_api.py | 40 +- test/test_native_mha.py | 1 + test/test_nestedtensor.py | 313 +- test/test_nn.py | 2131 +- test/test_nnapi.py | 15 +- test/test_ops.py | 51 +- test/test_ops_jit.py | 191 +- test/test_optim.py | 148 +- test/test_overrides.py | 46 + test/test_per_overload_api.py | 8 + test/test_prims.py | 119 +- test/test_profiler.py | 338 +- test/test_profiler_tree.py | 185 +- test/test_proxy_tensor.py | 504 +- test/test_public_bindings.py | 12 +- test/test_python_dispatch.py | 49 +- test/test_pytree.py | 42 +- test/test_quantization.py | 16 +- test/test_reductions.py | 6 +- test/{jit => }/test_schema_check.py | 36 +- test/test_sort_and_select.py | 3 +- test/test_sparse.py | 123 +- test/test_sparse_csr.py | 168 +- test/test_spectral_ops.py | 2 +- test/test_tensor_creation_ops.py | 20 +- test/test_testing.py | 6 +- test/test_torch.py | 210 +- test/test_transformers.py | 66 +- test/test_type_promotion.py | 73 +- test/test_unary_ufuncs.py | 22 +- test/test_utils.py | 59 + third_party/VulkanMemoryAllocator | 1 + third_party/cpuinfo | 2 +- third_party/cpuinfo.BUILD | 55 - third_party/ideep | 2 +- 
third_party/nccl/nccl | 2 +- tools/autograd/context.py | 12 + tools/autograd/derivatives.yaml | 111 +- tools/autograd/gen_autograd.py | 12 +- tools/autograd/gen_autograd_functions.py | 44 +- tools/autograd/gen_python_functions.py | 5 - tools/autograd/gen_trace_type.py | 15 +- tools/autograd/gen_variable_type.py | 130 +- tools/autograd/load_derivatives.py | 175 +- tools/autograd/templates/VariableType.cpp | 7 +- tools/autograd/templates/python_enum_tag.cpp | 3 +- .../templates/python_variable_methods.cpp | 21 +- .../package/tool/summarize_jsons.py | 2 +- tools/onnx/update_default_opset_version.py | 150 +- tools/setup_helpers/cmake_utils.py | 2 +- tools/stats/import_test_stats.py | 39 +- tools/stats/print_test_stats.py | 6 +- tools/stats/upload_test_stats.py | 37 +- tools/target_definitions.bzl | 33 +- tools/test/test_codegen.py | 202 +- tools/test/test_codegen_model.py | 64 +- tools/test/test_selective_build.py | 21 + tools/testing/test_selections.py | 2 +- torch/CMakeLists.txt | 25 +- torch/_C/__init__.pyi.in | 25 +- torch/_C/_autograd.pyi | 91 +- torch/_C/_distributed_rpc.pyi | 5 +- torch/_C/_profiler.pyi | 111 + torch/__init__.py | 5 +- torch/_decomp/__init__.py | 14 +- torch/_decomp/decompositions.py | 264 +- .../dbr => torch/_dispatch}/__init__.py | 0 torch/_dispatch/_dispatcher.py | 50 + torch/_lazy/__init__.py | 13 + torch/_lazy/extract_compiled_graph.py | 2 +- torch/_masked/__init__.py | 2 +- torch/_meta_registrations.py | 33 +- torch/_namedtensor_internals.py | 2 + torch/_ops.py | 19 +- torch/_prims/__init__.py | 310 +- torch/_prims/context.py | 91 +- torch/_prims/executor.py | 19 +- torch/_prims/nvfuser_executor.py | 22 +- torch/_prims/nvfuser_prims.py | 280 + torch/_prims_common/__init__.py | 79 +- torch/_prims_common/wrappers.py | 49 +- torch/_refs/__init__.py | 720 +- torch/_refs/nn/functional/__init__.py | 53 + torch/_subclasses/fake_tensor.py | 150 +- torch/_subclasses/meta_utils.py | 38 +- torch/_tensor.py | 8 +- torch/_tensor_str.py | 15 +- torch/_torch_docs.py | 11 +- torch/ao/nn/__init__.py | 18 +- torch/ao/nn/qat/__init__.py | 1 + torch/ao/nn/qat/dynamic/__init__.py | 1 + torch/ao/nn/qat/dynamic/modules/__init__.py | 3 + torch/ao/nn/qat/dynamic/modules/linear.py | 25 + torch/ao/nn/qat/modules/__init__.py | 14 + torch/ao/nn/qat/modules/conv.py | 264 + torch/ao/nn/qat/modules/embedding_ops.py | 143 + torch/ao/nn/qat/modules/linear.py | 77 + torch/ao/nn/quantizable/__init__.py | 1 + torch/ao/nn/quantizable/modules/__init__.py | 9 + torch/ao/nn/quantizable/modules/activation.py | 454 + torch/ao/nn/quantizable/modules/rnn.py | 386 + torch/ao/nn/quantized/__init__.py | 38 + torch/ao/nn/quantized/_reference/__init__.py | 1 + .../quantized/_reference/modules/__init__.py | 20 + .../nn/quantized/_reference/modules/conv.py | 316 + .../nn/quantized/_reference/modules/linear.py | 55 + .../ao/nn/quantized/_reference/modules/rnn.py | 471 + .../nn/quantized/_reference/modules/sparse.py | 92 + .../nn/quantized/_reference/modules/utils.py | 154 + torch/ao/nn/quantized/dynamic/__init__.py | 1 + .../nn/quantized/dynamic/modules/__init__.py | 19 + torch/ao/nn/quantized/dynamic/modules/conv.py | 399 + .../ao/nn/quantized/dynamic/modules/linear.py | 127 + torch/ao/nn/quantized/dynamic/modules/rnn.py | 1054 + torch/ao/nn/quantized/functional.py | 616 + torch/ao/nn/quantized/modules/__init__.py | 136 + torch/ao/nn/quantized/modules/activation.py | 278 + torch/ao/nn/quantized/modules/batchnorm.py | 101 + torch/ao/nn/quantized/modules/conv.py | 937 + torch/ao/nn/quantized/modules/dropout.py | 27 + 
.../ao/nn/quantized/modules/embedding_ops.py | 295 + .../quantized/modules/functional_modules.py | 233 + torch/ao/nn/quantized/modules/linear.py | 302 + .../ao/nn/quantized/modules/normalization.py | 204 + torch/ao/nn/quantized/modules/rnn.py | 47 + torch/ao/nn/quantized/modules/utils.py | 113 + .../ao/nn/sparse/quantized/dynamic/linear.py | 8 +- torch/ao/nn/sparse/quantized/linear.py | 2 +- torch/ao/ns/_numeric_suite.py | 4 +- torch/ao/ns/_numeric_suite_dbr.py | 112 - torch/ao/ns/fx/mappings.py | 8 +- torch/ao/ns/fx/utils.py | 2 +- torch/ao/ns/fx/weight_utils.py | 6 +- torch/ao/quantization/_correct_bias.py | 2 +- torch/ao/quantization/_dbr/README.md | 259 - torch/ao/quantization/_dbr/auto_trace.py | 723 - .../quantization/_dbr/auto_trace_rewriter.py | 247 - torch/ao/quantization/_dbr/function_fusion.py | 101 - torch/ao/quantization/_dbr/fusion.py | 56 - torch/ao/quantization/_dbr/mappings.py | 178 - torch/ao/quantization/_dbr/model_utils.py | 163 - .../ao/quantization/_dbr/module_swap_utils.py | 79 - .../_dbr/qconfig_mapping_utils.py | 25 - .../quantization/_dbr/quantization_state.py | 986 - .../ao/quantization/_dbr/torchscript_utils.py | 15 - torch/ao/quantization/_dbr/utils.py | 751 - torch/ao/quantization/_quantize_dbr.py | 144 - .../quantization/backend_config/__init__.py | 18 +- .../_common_operator_config_utils.py | 565 +- .../backend_config/backend_config.py | 60 +- .../ao/quantization/backend_config/fbgemm.py | 114 + .../ao/quantization/backend_config/native.py | 372 +- .../backend_config/observation_type.py | 13 - .../ao/quantization/backend_config/qnnpack.py | 114 + .../quantization/backend_config/tensorrt.py | 93 +- torch/ao/quantization/backend_config/utils.py | 113 +- torch/ao/quantization/experimental/linear.py | 6 - .../ao/quantization/experimental/observer.py | 18 +- torch/ao/quantization/fuse_modules.py | 1 + .../ao/quantization/fuser_method_mappings.py | 6 +- torch/ao/quantization/fx/_equalize.py | 47 +- .../fx/_lower_to_native_backend.py | 60 +- .../quantization/fx/_model_report/README.md | 88 +- .../quantization/fx/_model_report/detector.py | 212 +- .../fx/_model_report/model_report.py | 197 +- .../fx/_model_report/model_report_observer.py | 2 +- .../_model_report/model_report_visualizer.py | 116 +- .../quantization/fx/backend_config_utils.py | 50 +- .../fx/common_quantization_patterns.py | 8 - torch/ao/quantization/fx/convert.py | 48 +- torch/ao/quantization/fx/custom_config.py | 43 +- torch/ao/quantization/fx/fuse.py | 35 +- torch/ao/quantization/fx/fusion_patterns.py | 2 +- torch/ao/quantization/fx/pattern_utils.py | 4 +- torch/ao/quantization/fx/prepare.py | 194 +- ...nfig_utils.py => qconfig_mapping_utils.py} | 20 +- .../quantization/fx/quantization_patterns.py | 9 +- torch/ao/quantization/fx/tracer.py | 2 +- torch/ao/quantization/observer.py | 34 +- .../ao/quantization/quantization_mappings.py | 13 +- torch/ao/quantization/quantize.py | 2 +- torch/ao/quantization/quantize_fx.py | 251 +- .../activation_sparsifier.py | 50 +- .../data_scheduler/base_data_scheduler.py | 10 +- .../data_sparsifier/base_data_sparsifier.py | 2 +- .../data_sparsifier/quantization_utils.py | 130 + torch/ao/sparsity/_mappings.py | 5 +- .../ao/sparsity/scheduler/lambda_scheduler.py | 1 + .../ao/sparsity/sparsifier/base_sparsifier.py | 8 +- torch/ao/sparsity/sparsifier/utils.py | 2 +- torch/autograd/__init__.py | 34 +- torch/autograd/anomaly_mode.py | 20 +- torch/autograd/forward_ad.py | 3 + torch/autograd/function.py | 3 + torch/autograd/functional.py | 6 + torch/autograd/grad_mode.py | 12 
+- torch/autograd/graph.py | 10 +- torch/autograd/profiler.py | 8 +- torch/autograd/profiler_legacy.py | 2 +- torch/backends/xeon/run_cpu.py | 16 +- torch/csrc/DynamicTypes.cpp | 2 + torch/csrc/Exceptions.cpp | 14 +- torch/csrc/Exceptions.h | 81 +- torch/csrc/Module.cpp | 2 + torch/csrc/Storage.cpp | 2 + torch/csrc/api/include/torch/nn/pimpl.h | 10 +- torch/csrc/autograd/FunctionsManual.cpp | 143 +- torch/csrc/autograd/FunctionsManual.h | 9 +- torch/csrc/autograd/TraceTypeManual.cpp | 4 +- torch/csrc/autograd/anomaly_mode.cpp | 8 +- torch/csrc/autograd/anomaly_mode.h | 12 +- torch/csrc/autograd/autograd.cpp | 6 + .../autograd_not_implemented_fallback.cpp | 3 +- torch/csrc/autograd/custom_function.cpp | 13 + torch/csrc/autograd/custom_function.h | 5 + torch/csrc/autograd/engine.cpp | 23 +- torch/csrc/autograd/engine.h | 170 +- torch/csrc/autograd/function.h | 44 + torch/csrc/autograd/functions/tensor.cpp | 19 +- torch/csrc/autograd/graph_task.h | 193 + torch/csrc/autograd/init.cpp | 234 +- torch/csrc/autograd/input_buffer.cpp | 1 + torch/csrc/autograd/input_buffer.h | 1 - torch/csrc/autograd/profiler_kineto.cpp | 457 +- torch/csrc/autograd/profiler_kineto.h | 294 +- torch/csrc/autograd/profiler_legacy.cpp | 21 +- torch/csrc/autograd/profiler_python.cpp | 72 +- torch/csrc/autograd/python_anomaly_mode.cpp | 2 +- torch/csrc/autograd/python_cpp_function.cpp | 39 +- torch/csrc/autograd/python_cpp_function.h | 8 + torch/csrc/autograd/python_function.cpp | 94 +- torch/csrc/autograd/python_hook.cpp | 140 +- torch/csrc/autograd/python_hook.h | 11 +- .../python_torch_functions_manual.cpp | 363 - torch/csrc/autograd/python_variable.cpp | 143 +- torch/csrc/autograd/python_variable.h | 2 + .../autograd/python_variable_indexing.cpp | 5 +- torch/csrc/autograd/saved_variable.cpp | 7 +- torch/csrc/autograd/variable.cpp | 9 +- torch/csrc/cuda/Module.cpp | 122 +- torch/csrc/cuda/shared/cudart.cpp | 22 +- torch/csrc/deploy/deploy.cpp | 4 + torch/csrc/deploy/interpreter/defs.bzl | 8 +- torch/csrc/distributed/c10d/Ops.cpp | 6 +- torch/csrc/distributed/c10d/ProcessGroup.hpp | 9 + .../distributed/c10d/ProcessGroupGloo.cpp | 4 - .../distributed/c10d/ProcessGroupNCCL.cpp | 411 +- .../distributed/c10d/ProcessGroupNCCL.hpp | 49 + .../csrc/distributed/c10d/ProcessGroupUCC.cpp | 84 +- torch/csrc/distributed/c10d/UCCUtils.cpp | 5 + torch/csrc/distributed/c10d/UCCUtils.hpp | 33 + torch/csrc/distributed/c10d/debug.h | 2 +- torch/csrc/distributed/c10d/init.cpp | 9 + torch/csrc/init_flatbuffer_module.cpp | 21 +- .../backends/coreml/objc/PTMCoreMLBackend.mm | 63 +- .../backends/coreml/objc/PTMCoreMLCompiler.h | 12 +- .../backends/coreml/objc/PTMCoreMLCompiler.mm | 143 +- .../coreml/objc/PTMCoreMLModelWrapper.h | 9 - .../coreml/observer/PTMCoreMLObserver.h | 47 - .../coreml/observer/PTMCoreMLObserver.mm | 8 - torch/csrc/jit/codegen/cuda/arith.cpp | 197 +- torch/csrc/jit/codegen/cuda/arith.h | 49 +- torch/csrc/jit/codegen/cuda/codegen.cpp | 275 +- torch/csrc/jit/codegen/cuda/compute_at.cpp | 21 +- torch/csrc/jit/codegen/cuda/compute_at.h | 2 +- .../csrc/jit/codegen/cuda/compute_at_map.cpp | 360 +- torch/csrc/jit/codegen/cuda/compute_at_map.h | 23 +- torch/csrc/jit/codegen/cuda/contiguity.cpp | 617 +- torch/csrc/jit/codegen/cuda/contiguity.h | 152 +- torch/csrc/jit/codegen/cuda/disjoint_set.h | 11 +- torch/csrc/jit/codegen/cuda/dispatch.cpp | 90 + torch/csrc/jit/codegen/cuda/dispatch.h | 24 + torch/csrc/jit/codegen/cuda/dynamic_type.h | 67 +- .../jit/codegen/cuda/evaluator_common.cpp | 234 +- 
.../csrc/jit/codegen/cuda/evaluator_common.h | 102 +- torch/csrc/jit/codegen/cuda/executor.cpp | 392 +- torch/csrc/jit/codegen/cuda/executor.h | 51 +- .../jit/codegen/cuda/executor_kernel_arg.cpp | 35 + .../jit/codegen/cuda/executor_kernel_arg.h | 258 +- .../csrc/jit/codegen/cuda/executor_utils.cpp | 423 +- torch/csrc/jit/codegen/cuda/executor_utils.h | 9 +- .../csrc/jit/codegen/cuda/expr_evaluator.cpp | 19 +- torch/csrc/jit/codegen/cuda/expr_evaluator.h | 13 +- torch/csrc/jit/codegen/cuda/fusion.cpp | 28 +- torch/csrc/jit/codegen/cuda/fusion.h | 8 +- .../jit/codegen/cuda/fusion_segmenter.cpp | 60 +- .../csrc/jit/codegen/cuda/fusion_segmenter.h | 14 +- torch/csrc/jit/codegen/cuda/graph_fuser.cpp | 12 +- .../jit/codegen/cuda/grouped_reduction.cpp | 18 +- .../csrc/jit/codegen/cuda/grouped_reduction.h | 4 + torch/csrc/jit/codegen/cuda/index_compute.cpp | 297 +- torch/csrc/jit/codegen/cuda/index_compute.h | 113 +- .../jit/codegen/cuda/inline_propagator.cpp | 385 - .../csrc/jit/codegen/cuda/inline_propagator.h | 118 - torch/csrc/jit/codegen/cuda/inlining.cpp | 306 + torch/csrc/jit/codegen/cuda/inlining.h | 100 + torch/csrc/jit/codegen/cuda/interface.cpp | 56 + torch/csrc/jit/codegen/cuda/ir_base_nodes.cpp | 4 +- torch/csrc/jit/codegen/cuda/ir_base_nodes.h | 1 + torch/csrc/jit/codegen/cuda/ir_builder.cpp | 4 + torch/csrc/jit/codegen/cuda/ir_cloner.cpp | 16 + torch/csrc/jit/codegen/cuda/ir_cloner.h | 4 + torch/csrc/jit/codegen/cuda/ir_graphviz.cpp | 38 + torch/csrc/jit/codegen/cuda/ir_graphviz.h | 4 + .../jit/codegen/cuda/ir_interface_nodes.h | 29 +- .../csrc/jit/codegen/cuda/ir_internal_nodes.h | 498 +- torch/csrc/jit/codegen/cuda/ir_iostream.cpp | 355 +- torch/csrc/jit/codegen/cuda/ir_iostream.h | 6 + torch/csrc/jit/codegen/cuda/ir_nodes.cpp | 670 +- torch/csrc/jit/codegen/cuda/ir_utils.cpp | 144 +- torch/csrc/jit/codegen/cuda/ir_utils.h | 9 + torch/csrc/jit/codegen/cuda/iter_visitor.cpp | 149 +- torch/csrc/jit/codegen/cuda/iter_visitor.h | 99 +- torch/csrc/jit/codegen/cuda/kernel.cpp | 17 +- torch/csrc/jit/codegen/cuda/kernel_cache.cpp | 379 +- torch/csrc/jit/codegen/cuda/kernel_cache.h | 90 +- .../codegen/cuda/kernel_expr_evaluator.cpp | 79 +- .../jit/codegen/cuda/kernel_expr_evaluator.h | 14 +- torch/csrc/jit/codegen/cuda/kernel_ir.cpp | 46 + torch/csrc/jit/codegen/cuda/kernel_ir.h | 68 +- torch/csrc/jit/codegen/cuda/lower2device.cpp | 19 +- torch/csrc/jit/codegen/cuda/lower2device.h | 27 +- .../jit/codegen/cuda/lower_alias_memory.cpp | 22 +- .../jit/codegen/cuda/lower_allocation.cpp | 15 +- .../codegen/cuda/lower_divisible_split.cpp | 121 + .../jit/codegen/cuda/lower_divisible_split.h | 29 + .../jit/codegen/cuda/lower_double_buffer.cpp | 3 +- .../csrc/jit/codegen/cuda/lower_expr_sort.cpp | 2 +- .../codegen/cuda/lower_fused_reduction.cpp | 12 +- torch/csrc/jit/codegen/cuda/lower_index.cpp | 263 +- torch/csrc/jit/codegen/cuda/lower_index.h | 23 +- .../jit/codegen/cuda/lower_index_compute.cpp | 191 +- .../jit/codegen/cuda/lower_index_compute.h | 12 + .../jit/codegen/cuda/lower_insert_syncs.cpp | 4 +- torch/csrc/jit/codegen/cuda/lower_loops.cpp | 4 +- .../csrc/jit/codegen/cuda/lower_predicate.cpp | 11 +- .../cuda/lower_predicate_elimination.cpp | 75 +- torch/csrc/jit/codegen/cuda/lower_shift.cpp | 125 +- torch/csrc/jit/codegen/cuda/lower_shift.h | 33 +- .../codegen/cuda/lower_sync_information.cpp | 20 +- .../codegen/cuda/lower_thread_predicate.cpp | 7 +- .../codegen/cuda/lower_trivial_broadcast.cpp | 6 +- .../codegen/cuda/lower_trivial_broadcast.h | 3 +- 
torch/csrc/jit/codegen/cuda/lower_unroll.cpp | 10 +- torch/csrc/jit/codegen/cuda/lower_utils.cpp | 372 +- torch/csrc/jit/codegen/cuda/lower_utils.h | 112 +- .../jit/codegen/cuda/lower_validation.cpp | 45 +- .../jit/codegen/cuda/lower_warp_reduce.cpp | 6 +- torch/csrc/jit/codegen/cuda/manager.cpp | 12 +- torch/csrc/jit/codegen/cuda/mutator.cpp | 123 +- .../jit/codegen/cuda/non_divisible_split.cpp | 13 +- torch/csrc/jit/codegen/cuda/nvfuser.cmake | 4 +- torch/csrc/jit/codegen/cuda/ops/alias.cpp | 39 +- torch/csrc/jit/codegen/cuda/ops/composite.cpp | 22 +- torch/csrc/jit/codegen/cuda/ops/composite.h | 2 + .../jit/codegen/cuda/ops/normalization.cpp | 58 + .../csrc/jit/codegen/cuda/ops/normalization.h | 11 + .../codegen/cuda/parallel_dimension_map.cpp | 2 +- .../jit/codegen/cuda/parallel_type_bitmap.cpp | 2 + torch/csrc/jit/codegen/cuda/parser.cpp | 169 +- .../cuda/python_frontend/python_bindings.cpp | 9 +- .../csrc/jit/codegen/cuda/reference_tensor.h | 27 - .../csrc/jit/codegen/cuda/root_domain_map.cpp | 197 +- torch/csrc/jit/codegen/cuda/root_domain_map.h | 17 +- .../codegen/cuda/runtime/fused_reduction.cu | 1855 +- .../cuda/runtime/fused_welford_helper.cu | 93 + .../cuda/runtime/fused_welford_impl.cu | 623 + .../csrc/jit/codegen/cuda/runtime/helpers.cu | 19 + .../codegen/cuda/runtime/random_numbers.cu | 34 +- torch/csrc/jit/codegen/cuda/runtime/tuple.cu | 173 + .../codegen/cuda/scheduler/all_schedulers.h | 5 +- .../cuda/scheduler/compile_time_info.h | 56 +- .../jit/codegen/cuda/scheduler/heuristic.h | 3 +- .../jit/codegen/cuda/scheduler/mma_utils.cpp | 8 +- .../codegen/cuda/scheduler/normalization.cpp | 8 +- .../jit/codegen/cuda/scheduler/pointwise.cpp | 277 +- .../jit/codegen/cuda/scheduler/pointwise.h | 138 + .../cuda/scheduler/pointwise_utils.cpp | 46 +- .../codegen/cuda/scheduler/pointwise_utils.h | 20 +- .../jit/codegen/cuda/scheduler/reduction.cpp | 8 +- .../cuda/scheduler/reduction_utils.cpp | 13 +- .../jit/codegen/cuda/scheduler/registry.cpp | 453 +- .../jit/codegen/cuda/scheduler/registry.h | 26 +- .../jit/codegen/cuda/scheduler/transpose.cpp | 1140 + .../jit/codegen/cuda/scheduler/transpose.h | 115 + .../cuda/scheduler/transpose_heuristic.h | 163 + .../csrc/jit/codegen/cuda/scheduler/utils.cpp | 965 +- torch/csrc/jit/codegen/cuda/scheduler/utils.h | 178 +- .../cuda/scheduler/vectorize_helper.cpp | 287 + .../codegen/cuda/scheduler/vectorize_helper.h | 14 +- torch/csrc/jit/codegen/cuda/tensor_view.cpp | 139 +- torch/csrc/jit/codegen/cuda/test/test_gpu.cpp | 967 +- .../cuda/test/test_gpu_fused_reduction.cpp | 312 + .../jit/codegen/cuda/test/test_gpu_rng.cu | 211 +- .../cuda/test/test_gpu_tensor_factories.cpp | 339 + .../codegen/cuda/test/test_gpu_transpose.cpp | 1028 + .../jit/codegen/cuda/test/test_gpu_utils.cpp | 273 + .../codegen/cuda/test/test_gpu_validator.h | 59 +- .../jit/codegen/cuda/test/test_gpu_view.cpp | 699 +- torch/csrc/jit/codegen/cuda/test/test_utils.h | 71 +- .../jit/codegen/cuda/tools/stringify_file.py | 10 +- .../csrc/jit/codegen/cuda/transform_iter.cpp | 6 +- torch/csrc/jit/codegen/cuda/type.cpp | 64 +- torch/csrc/jit/codegen/cuda/type.h | 17 +- .../csrc/jit/codegen/cuda/type_inference.cpp | 11 +- torch/csrc/jit/codegen/cuda/utils.cpp | 93 +- torch/csrc/jit/codegen/cuda/utils.h | 13 +- torch/csrc/jit/codegen/fuser/codegen.cpp | 2 +- .../jit/codegen/fuser/cuda/fused_kernel.cpp | 5 + torch/csrc/jit/frontend/builtin_functions.cpp | 100 - .../jit/frontend/function_schema_parser.cpp | 6 +- torch/csrc/jit/frontend/ir_emitter.cpp | 4 +- 
torch/csrc/jit/frontend/schema_matching.cpp | 4 - .../csrc/jit/frontend/schema_type_parser.cpp | 6 +- torch/csrc/jit/frontend/source_range.cpp | 3 + torch/csrc/jit/frontend/sugared_value.h | 8 - torch/csrc/jit/frontend/tracer.cpp | 7 +- torch/csrc/jit/ir/alias_analysis.cpp | 15 +- torch/csrc/jit/ir/ir.cpp | 1 + torch/csrc/jit/mobile/flatbuffer_loader.cpp | 209 +- torch/csrc/jit/mobile/flatbuffer_loader.h | 194 +- torch/csrc/jit/mobile/import.cpp | 53 +- .../jit/mobile/model_tracer/TracerRunner.cpp | 14 +- torch/csrc/jit/mobile/promoted_prim_ops.cpp | 15 + torch/csrc/jit/mobile/promoted_prim_ops.h | 6 + .../operator_upgraders/upgraders_guard.cpp | 12 - .../jit/operator_upgraders/upgraders_guard.h | 10 - torch/csrc/jit/passes/autocast.cpp | 41 + .../frozen_conv_add_relu_fusion_cuda.cpp | 16 +- .../jit/passes/hoist_conv_packed_params.cpp | 15 +- torch/csrc/jit/passes/mkldnn_rewrite.cpp | 234 + torch/csrc/jit/passes/mkldnn_rewrite.h | 34 + torch/csrc/jit/passes/normalize_ops.cpp | 1 + torch/csrc/jit/passes/onnx.cpp | 38 +- .../jit/passes/onnx/function_extraction.cpp | 8 +- .../jit/passes/onnx/function_substitution.cpp | 107 + torch/csrc/jit/passes/onnx/naming.cpp | 205 + torch/csrc/jit/passes/onnx/naming.h | 30 + .../autograd_function_process.cpp | 58 + .../autograd_function_process.h | 11 + .../jit/passes/onnx/shape_type_inference.cpp | 34 +- torch/csrc/jit/passes/quantization/helper.h | 4 +- .../jit/passes/symbolic_shape_analysis.cpp | 53 +- torch/csrc/jit/passes/tensorexpr_fuser.cpp | 37 +- torch/csrc/jit/passes/utils/memory_dag.cpp | 2 +- torch/csrc/jit/passes/vulkan_rewrite.cpp | 58 +- torch/csrc/jit/python/init.cpp | 123 +- torch/csrc/jit/python/pybind_utils.cpp | 8 +- torch/csrc/jit/python/pybind_utils.h | 17 +- torch/csrc/jit/python/python_ir.cpp | 12 +- torch/csrc/jit/python/script_init.cpp | 3 - torch/csrc/jit/runtime/graph_executor.cpp | 2 +- torch/csrc/jit/runtime/operator.h | 16 + torch/csrc/jit/runtime/register_prim_ops.cpp | 36 + .../serialized_shape_function_registry.cpp | 127 +- torch/csrc/jit/runtime/static/native_ops.cpp | 30 +- torch/csrc/jit/runtime/static/ops.cpp | 53 + torch/csrc/jit/runtime/static/passes.cpp | 10 +- torch/csrc/jit/runtime/symbolic_script.cpp | 15 +- .../jit/runtime/symbolic_shape_registry.cpp | 49 +- .../runtime/symbolic_shape_registry_util.cpp | 1 + torch/csrc/jit/serialization/export.cpp | 51 +- .../serialization/flatbuffer_serializer.cpp | 56 +- .../jit/serialization/flatbuffer_serializer.h | 68 +- .../flatbuffer_serializer_jit.cpp | 31 +- .../serialization/flatbuffer_serializer_jit.h | 5 +- torch/csrc/jit/serialization/import.cpp | 3 +- .../csrc/jit/serialization/import_source.cpp | 33 +- torch/csrc/jit/serialization/pickler.cpp | 15 +- torch/csrc/jit/serialization/python_print.cpp | 11 +- torch/csrc/jit/serialization/unpickler.cpp | 14 + .../jit/tensorexpr/external_functions.cpp | 31 + torch/csrc/jit/tensorexpr/kernel.cpp | 51 +- torch/csrc/jit/tensorexpr/kernel.h | 6 + torch/csrc/jit/tensorexpr/lowerings.cpp | 23 + .../csrc/jit/tensorexpr/operators/conv2d.cpp | 62 + torch/csrc/jit/tensorexpr/operators/conv2d.h | 14 +- torch/csrc/lazy/core/dynamic_ir.h | 5 +- torch/csrc/lazy/core/lazy_graph_executor.cpp | 9 +- torch/csrc/lazy/core/lazy_graph_executor.h | 4 +- torch/csrc/lazy/core/shape_inference.cpp | 14 +- torch/csrc/lazy/core/shape_inference.h | 4 +- torch/csrc/lazy/core/tensor_impl.cpp | 34 +- torch/csrc/lazy/core/tensor_impl.h | 4 +- torch/csrc/lazy/python/python_util.cpp | 4 +- torch/csrc/lazy/ts_backend/dynamic_ir.cpp | 14 +- 
torch/csrc/lazy/ts_backend/dynamic_ir.h | 8 +- .../lazy/ts_backend/ts_native_functions.cpp | 6 - torch/csrc/onnx/init.cpp | 17 +- torch/csrc/onnx/onnx.h | 2 + torch/csrc/profiler/api.cpp | 22 +- torch/csrc/profiler/api.h | 18 +- torch/csrc/profiler/collection.cpp | 418 +- torch/csrc/profiler/collection.h | 98 +- torch/csrc/profiler/containers.h | 19 + .../profiler/execution_graph_observer.cpp | 174 +- .../csrc/profiler/execution_graph_observer.h | 9 +- .../csrc/profiler/kineto_client_interface.cpp | 25 +- torch/csrc/profiler/kineto_shim.cpp | 2 +- torch/csrc/profiler/python/init.cpp | 218 + torch/csrc/profiler/python/init.h | 11 + torch/csrc/profiler/util.cpp | 21 +- torch/csrc/profiler/util.h | 53 + torch/csrc/tensor/python_tensor.cpp | 4 + torch/csrc/tensor/python_tensor.h | 6 +- torch/csrc/utils/out_types.cpp | 23 +- torch/csrc/utils/out_types.h | 4 +- torch/csrc/utils/python_arg_parser.cpp | 2 +- torch/csrc/utils/python_arg_parser.h | 63 +- torch/csrc/utils/python_dispatch.cpp | 67 + torch/csrc/utils/tensor_numpy.cpp | 7 +- torch/csrc/utils/torch_dispatch_mode.h | 8 +- torch/cuda/__init__.py | 1 + torch/cuda/_memory_viz.py | 188 + torch/cuda/jiterator.py | 2 +- torch/cuda/memory.py | 24 + .../distributed/_shard/checkpoint/__init__.py | 14 +- torch/distributed/_shard/checkpoint/api.py | 29 +- .../_shard/checkpoint/default_planner.py | 204 + .../distributed/_shard/checkpoint/metadata.py | 45 +- .../distributed/_shard/checkpoint/planner.py | 344 + .../_shard/checkpoint/planner_helpers.py | 199 + .../_shard/checkpoint/resharding.py | 117 +- .../_shard/checkpoint/state_dict_loader.py | 125 +- .../_shard/checkpoint/state_dict_saver.py | 15 +- torch/distributed/_shard/checkpoint/utils.py | 44 +- torch/distributed/_shard/partial_tensor.py | 1 + .../_shard/sharded_optim/__init__.py | 1 + .../_shard/sharded_tensor/__init__.py | 22 +- .../_shard/sharded_tensor/_ops/_common.py | 3 +- .../_shard/sharded_tensor/_ops/tensor_ops.py | 11 +- .../distributed/_shard/sharded_tensor/api.py | 4 +- torch/distributed/_shard/sharding_plan/api.py | 1 + .../chunk_sharding_spec_ops/_common.py | 270 +- .../chunk_sharding_spec_ops/embedding.py | 170 +- .../chunk_sharding_spec_ops/embedding_bag.py | 621 +- .../_checkpoint/checkpoint_wrapper.py | 40 +- .../algorithms/_comm_hooks/default_hooks.py | 85 +- .../algorithms/ddp_comm_hooks/__init__.py | 1 + .../ddp_comm_hooks/debugging_hooks.py | 1 + .../ddp_comm_hooks/default_hooks.py | 5 + .../ddp_comm_hooks/post_localSGD_hook.py | 1 + .../ddp_comm_hooks/powerSGD_hook.py | 2 + .../ddp_comm_hooks/quantization_hooks.py | 2 + torch/distributed/algorithms/join.py | 1 + .../algorithms/model_averaging/averagers.py | 55 +- .../hierarchical_model_averager.py | 71 +- torch/distributed/autograd/__init__.py | 1 + torch/distributed/distributed_c10d.py | 45 +- torch/distributed/elastic/timer/__init__.py | 1 + .../elastic/timer/file_based_local_timer.py | 313 + torch/distributed/fsdp/_optim_utils.py | 228 +- .../fsdp/{shard_utils.py => _shard_utils.py} | 113 +- torch/distributed/fsdp/_utils.py | 11 + .../fsdp/fully_sharded_data_parallel.py | 595 +- torch/distributed/fsdp/wrap.py | 2 +- torch/distributed/launch.py | 20 +- torch/distributed/nn/api/remote_module.py | 4 + torch/distributed/nn/functional.py | 1 + torch/distributed/optim/functional_rprop.py | 5 +- torch/distributed/optim/optimizer.py | 5 + .../optim/post_localSGD_optimizer.py | 63 +- torch/distributed/optim/utils.py | 1 + .../optim/zero_redundancy_optimizer.py | 1 + torch/distributed/pipeline/sync/pipe.py | 5 + 
torch/distributed/rpc/api.py | 35 +- torch/distributed/rpc/functions.py | 1 + torch/distributed/rpc/options.py | 1 + .../rpc/server_process_global_profiler.py | 3 +- torch/distributed/run.py | 9 +- torch/distributed/utils.py | 24 +- torch/distributions/bernoulli.py | 1 + torch/distributions/beta.py | 1 + torch/distributions/binomial.py | 1 + torch/distributions/categorical.py | 1 + torch/distributions/cauchy.py | 1 + torch/distributions/chi2.py | 1 + torch/distributions/continuous_bernoulli.py | 1 + torch/distributions/dirichlet.py | 1 + torch/distributions/exponential.py | 1 + torch/distributions/fishersnedecor.py | 1 + torch/distributions/gamma.py | 1 + torch/distributions/geometric.py | 1 + torch/distributions/gumbel.py | 1 + torch/distributions/half_cauchy.py | 1 + torch/distributions/half_normal.py | 1 + torch/distributions/independent.py | 8 +- torch/distributions/kumaraswamy.py | 1 + torch/distributions/laplace.py | 1 + torch/distributions/lkj_cholesky.py | 1 + torch/distributions/log_normal.py | 1 + torch/distributions/logistic_normal.py | 3 +- .../lowrank_multivariate_normal.py | 4 +- torch/distributions/mixture_same_family.py | 17 +- torch/distributions/multinomial.py | 1 + torch/distributions/multivariate_normal.py | 3 + torch/distributions/normal.py | 1 + torch/distributions/one_hot_categorical.py | 1 + torch/distributions/pareto.py | 1 + torch/distributions/poisson.py | 1 + torch/distributions/relaxed_bernoulli.py | 3 +- torch/distributions/relaxed_categorical.py | 3 +- torch/distributions/studentT.py | 1 + torch/distributions/uniform.py | 1 + torch/distributions/von_mises.py | 3 +- torch/distributions/weibull.py | 1 + torch/distributions/wishart.py | 3 +- torch/functional.py | 89 +- torch/futures/__init__.py | 2 + torch/fx/_symbolic_trace.py | 127 +- torch/fx/experimental/const_fold.py | 4 + torch/fx/experimental/meta_tracer.py | 4 +- .../constraint_generator.py | 214 +- torch/fx/experimental/proxy_tensor.py | 657 +- torch/fx/experimental/symbolic_shapes.py | 111 +- torch/fx/experimental/unification/core.py | 1 + torch/fx/experimental/unification/dispatch.py | 2 +- torch/fx/experimental/unification/match.py | 5 +- torch/fx/experimental/unification/more.py | 3 + .../unification/multipledispatch/core.py | 12 +- .../multipledispatch/dispatcher.py | 10 +- .../unification/multipledispatch/utils.py | 8 +- .../unification/multipledispatch/variadic.py | 1 + torch/fx/experimental/unification/utils.py | 1 + torch/fx/experimental/unification/variable.py | 5 +- torch/fx/graph.py | 57 +- torch/fx/graph_module.py | 12 +- torch/fx/interpreter.py | 20 +- torch/fx/operator_schemas.py | 2 +- torch/fx/passes/backends/nvfuser.py | 2 +- torch/fx/passes/infra/pass_manager.py | 16 +- torch/fx/passes/pass_manager.py | 45 +- torch/fx/passes/reinplace.py | 288 +- torch/fx/passes/splitter_base.py | 32 +- torch/fx/passes/tools_common.py | 67 +- torch/fx/passes/utils/__init__.py | 2 +- torch/fx/passes/utils/common.py | 18 +- torch/fx/passes/utils/matcher_utils.py | 233 + torch/fx/proxy.py | 21 +- torch/fx/subgraph_rewriter.py | 276 +- torch/fx/traceback.py | 62 + torch/hub.py | 16 +- torch/jit/_shape_functions.py | 100 +- torch/jit/quantized.py | 18 +- torch/library.py | 1 + torch/linalg/__init__.py | 8 +- torch/masked/__init__.py | 2 + torch/masked/maskedtensor/__init__.py | 8 + torch/masked/maskedtensor/binary.py | 189 + torch/masked/maskedtensor/core.py | 590 + torch/masked/maskedtensor/creation.py | 58 + torch/masked/maskedtensor/passthrough.py | 42 + torch/masked/maskedtensor/unary.py | 188 + 
torch/monitor/__init__.py | 1 + torch/nn/functional.py | 10 +- torch/nn/grad.py | 2 + torch/nn/init.py | 1 + torch/nn/intrinsic/qat/modules/conv_fused.py | 2 +- torch/nn/intrinsic/qat/modules/linear_relu.py | 3 +- .../quantized/dynamic/modules/linear_relu.py | 7 +- .../nn/intrinsic/quantized/modules/bn_relu.py | 10 +- .../intrinsic/quantized/modules/conv_relu.py | 14 +- .../quantized/modules/linear_relu.py | 7 +- torch/nn/modules/activation.py | 51 +- torch/nn/modules/batchnorm.py | 4 + torch/nn/modules/channelshuffle.py | 1 + torch/nn/modules/container.py | 4 +- torch/nn/modules/distance.py | 13 +- torch/nn/modules/fold.py | 3 + torch/nn/modules/lazy.py | 1 + torch/nn/modules/loss.py | 33 +- torch/nn/modules/module.py | 40 +- torch/nn/modules/padding.py | 9 + torch/nn/modules/pooling.py | 3 +- torch/nn/modules/rnn.py | 12 +- torch/nn/modules/sparse.py | 4 + torch/nn/modules/transformer.py | 102 +- torch/nn/modules/upsampling.py | 91 +- torch/nn/parallel/data_parallel.py | 1 + torch/nn/parallel/distributed.py | 26 +- torch/nn/qat/__init__.py | 6 + torch/nn/qat/dynamic/__init__.py | 6 + torch/nn/qat/dynamic/modules/linear.py | 35 +- torch/nn/qat/modules/__init__.py | 20 +- torch/nn/qat/modules/conv.py | 276 +- torch/nn/qat/modules/embedding_ops.py | 151 +- torch/nn/qat/modules/linear.py | 87 +- torch/nn/quantizable/modules/__init__.py | 6 +- torch/nn/quantizable/modules/activation.py | 464 +- torch/nn/quantizable/modules/rnn.py | 396 +- torch/nn/quantized/__init__.py | 1 + .../quantized/_reference/modules/__init__.py | 19 +- torch/nn/quantized/_reference/modules/conv.py | 335 +- .../nn/quantized/_reference/modules/linear.py | 67 +- torch/nn/quantized/_reference/modules/rnn.py | 494 +- .../nn/quantized/_reference/modules/sparse.py | 105 +- .../nn/quantized/_reference/modules/utils.py | 175 +- torch/nn/quantized/dynamic/__init__.py | 2 +- .../nn/quantized/dynamic/modules/__init__.py | 19 +- torch/nn/quantized/dynamic/modules/conv.py | 403 +- torch/nn/quantized/dynamic/modules/linear.py | 136 +- torch/nn/quantized/dynamic/modules/rnn.py | 1065 +- torch/nn/quantized/functional.py | 619 +- torch/nn/quantized/modules/__init__.py | 125 +- torch/nn/quantized/modules/activation.py | 295 +- torch/nn/quantized/modules/batchnorm.py | 115 +- torch/nn/quantized/modules/conv.py | 928 +- torch/nn/quantized/modules/dropout.py | 35 +- torch/nn/quantized/modules/embedding_ops.py | 303 +- .../quantized/modules/functional_modules.py | 239 +- torch/nn/quantized/modules/linear.py | 304 +- torch/nn/quantized/modules/normalization.py | 216 +- torch/nn/quantized/modules/rnn.py | 54 +- torch/nn/quantized/modules/utils.py | 88 +- torch/nn/utils/_deprecation_utils.py | 45 + .../conv_expanded_weights.py | 23 +- .../nn/utils/_expanded_weights/conv_utils.py | 57 +- torch/nn/utils/_per_sample_grad.py | 1 + torch/nn/utils/init.py | 1 + torch/nn/utils/memory_format.py | 15 +- torch/nn/utils/parametrizations.py | 15 +- torch/nn/utils/parametrize.py | 1 + torch/nn/utils/prune.py | 40 +- torch/nn/utils/rnn.py | 11 +- torch/nn/utils/stateless.py | 1 + torch/onnx/__init__.py | 3 + torch/onnx/_constants.py | 4 +- torch/onnx/_exporter_states.py | 13 + torch/onnx/_globals.py | 20 +- .../_dbr => onnx/_internal}/__init__.py | 0 torch/onnx/_internal/_beartype.py | 105 + torch/onnx/_onnx_supported_ops.py | 2 +- torch/onnx/_patch_torch.py | 130 +- torch/onnx/_type_utils.py | 239 + torch/onnx/errors.py | 73 +- torch/onnx/symbolic_caffe2.py | 28 +- torch/onnx/symbolic_helper.py | 601 +- torch/onnx/symbolic_opset10.py | 69 +- 
torch/onnx/symbolic_opset11.py | 134 +- torch/onnx/symbolic_opset12.py | 85 +- torch/onnx/symbolic_opset13.py | 58 +- torch/onnx/symbolic_opset14.py | 1 + torch/onnx/symbolic_opset16.py | 7 +- torch/onnx/symbolic_opset17.py | 19 + torch/onnx/symbolic_opset8.py | 45 +- torch/onnx/symbolic_opset9.py | 1072 +- torch/onnx/utils.py | 158 +- torch/onnx/verification.py | 104 +- torch/optim/adam.py | 40 +- torch/optim/adamax.py | 11 + torch/optim/adamw.py | 15 +- torch/optim/lr_scheduler.py | 65 +- torch/optim/lr_scheduler.pyi | 5 + torch/optim/rmsprop.py | 58 +- torch/optim/rprop.py | 48 +- torch/optim/sgd.py | 2 + torch/optim/swa_utils.py | 8 +- torch/overrides.py | 7 +- torch/profiler/__init__.py | 35 +- torch/profiler/_pattern_matcher.py | 207 +- torch/profiler/profiler.py | 2 +- torch/quantization/fx/_equalize.py | 4 +- torch/quasirandom.py | 1 + torch/serialization.py | 48 +- torch/sparse/__init__.py | 1 + torch/testing/_creation.py | 1 + .../testing/_internal/autocast_test_lists.py | 19 + .../_internal/check_kernel_launches.py | 7 +- torch/testing/_internal/common_cuda.py | 13 +- torch/testing/_internal/common_device_type.py | 9 +- .../_internal/common_methods_invocations.py | 22952 ++++++---------- torch/testing/_internal/common_modules.py | 32 +- torch/testing/_internal/common_nn.py | 35 + .../testing/_internal/common_quantization.py | 10 +- torch/testing/_internal/common_utils.py | 245 +- .../testing/_internal/composite_compliance.py | 4 +- .../_internal/distributed/distributed_test.py | 89 +- .../_internal/distributed/rpc/rpc_test.py | 2 +- .../_internal/distributed/rpc_utils.py | 2 +- .../_internal/jit_metaprogramming_utils.py | 1 + torch/testing/_internal/opinfo/__init__.py | 2 + torch/testing/_internal/opinfo/core.py | 2657 ++ .../_internal/opinfo/definitions/__init__.py | 18 + .../_internal/opinfo/definitions/_masked.py | 1132 + .../_internal/opinfo/definitions/fft.py | 715 + .../_internal/opinfo/definitions/linalg.py | 2282 ++ .../_internal/opinfo/definitions/special.py | 684 + torch/testing/_internal/opinfo/refs.py | 214 + torch/testing/_internal/opinfo/utils.py | 260 + torch/testing/_internal/opinfo_helper.py | 139 - torch/testing/_internal/schema_check_mode.py | 12 +- torch/utils/_cuda_trace.py | 76 + torch/utils/_pytree.py | 64 +- torch/utils/_zip.py | 11 +- torch/utils/bottleneck/__main__.py | 2 +- torch/utils/checkpoint.py | 38 +- torch/utils/cpp_extension.py | 105 +- torch/utils/data/_utils/collate.py | 2 + torch/utils/data/_utils/pin_memory.py | 13 +- torch/utils/data/dataloader.py | 3 +- torch/utils/data/datapipes/_hook_iterator.py | 2 +- torch/utils/data/datapipes/_typing.py | 4 +- torch/utils/data/datapipes/datapipe.py | 2 + torch/utils/data/datapipes/gen_pyi.py | 6 +- torch/utils/data/datapipes/iter/callable.py | 6 +- .../data/datapipes/iter/combinatorics.py | 17 +- torch/utils/data/datapipes/iter/combining.py | 16 +- torch/utils/data/datapipes/iter/filelister.py | 1 + torch/utils/data/datapipes/iter/fileopener.py | 1 + torch/utils/data/datapipes/iter/grouping.py | 3 + torch/utils/data/datapipes/iter/selecting.py | 3 + .../utils/data/datapipes/iter/streamreader.py | 1 + torch/utils/data/datapipes/iter/utils.py | 1 + torch/utils/data/datapipes/map/__init__.py | 2 +- torch/utils/data/datapipes/map/callable.py | 1 + .../utils/data/datapipes/map/combinatorics.py | 100 +- torch/utils/data/datapipes/map/combining.py | 2 + torch/utils/data/datapipes/map/grouping.py | 1 + torch/utils/data/datapipes/map/utils.py | 1 + torch/utils/data/datapipes/utils/common.py | 82 +- 
torch/utils/data/dataset.py | 15 +- torch/utils/data/distributed.py | 1 + torch/utils/data/graph.py | 20 +- torch/utils/data/sampler.py | 1 + torch/utils/dlpack.py | 3 +- torch/utils/hipify/cuda_to_hip_mappings.py | 66 +- torch/utils/hipify/hipify_python.py | 3 +- torch/utils/hooks.py | 32 +- torch/utils/tensorboard/_pytorch_graph.py | 4 +- torch/utils/tensorboard/summary.py | 11 +- torch/utils/throughput_benchmark.py | 17 +- torchgen/api/autograd.py | 104 +- torchgen/api/python.py | 99 +- torchgen/api/types.py | 29 +- torchgen/context.py | 1 + torchgen/dest/lazy_ir.py | 9 +- torchgen/gen.py | 314 +- torchgen/gen_backend_stubs.py | 42 +- torchgen/gen_functionalization_type.py | 2 + torchgen/model.py | 185 +- torchgen/native_function_generation.py | 164 +- torchgen/selective_build/selector.py | 2 +- .../gen_jit_shape_functions.py | 31 +- torchgen/utils.py | 37 +- 1604 files changed, 95410 insertions(+), 74365 deletions(-) create mode 100644 .circleci/cimodel/data/simple/upload_test_stats_definition.py create mode 100755 .circleci/docker/common/install_ucc.sh delete mode 100644 .github/generated-ciflow-ruleset.json delete mode 100644 .github/merge_rules.json create mode 100644 .github/merge_rules.yaml create mode 100644 .github/requirements-gha-cache.txt create mode 100644 .github/scripts/comment_on_pr.py delete mode 100755 .github/scripts/lint_test_ownership.py create mode 100644 .github/scripts/trymerge_explainer.py rename .github/workflows/{_mac-test-arm64.yml => _mac-test-mps.yml} (98%) delete mode 100644 .github/workflows/cancel_redundant_workflows.yml create mode 100644 .github/workflows/docker-release.yml create mode 100644 .github/workflows/mac-mps.yml create mode 100644 .github/workflows/pull.yml create mode 100644 .github/workflows/push_nightly_docker_ghcr.yml delete mode 100644 .github/workflows/stale_pull_requests.yml rename .jenkins/pytorch/win-test-helpers/installation-helpers/{install_miniconda3.bat => activate_miniconda3.bat} (65%) delete mode 100644 aten/src/ATen/core/TorchDispatchModeTLS.cpp create mode 100644 aten/src/ATen/core/TorchDispatchUtils.cpp rename aten/src/ATen/core/{TorchDispatchModeTLS.h => TorchDispatchUtils.h} (55%) create mode 100644 aten/src/ATen/mps/IndexKernels.h create mode 100644 aten/src/ATen/native/cpu/CopyKernel.h create mode 100644 aten/src/ATen/native/cuda/Copy.h create mode 100644 aten/src/ATen/native/cuda/CumminmaxKernel.cu create mode 100644 aten/src/ATen/native/cuda/CumprodKernel.cu create mode 100644 aten/src/ATen/native/cuda/CumsumKernel.cu create mode 100644 aten/src/ATen/native/cuda/LogcumsumexpKernel.cu rename aten/src/ATen/native/cuda/{ScanKernels.cu => ScanUtils.cuh} (84%) create mode 100644 aten/src/ATen/native/mkldnn/Common.h create mode 100644 aten/src/ATen/native/mkldnn/ConvPrepack.cpp create mode 100644 aten/src/ATen/native/mkldnn/ConvPrepack.h create mode 100644 aten/src/ATen/native/mkldnn/OpContext.cpp create mode 100644 aten/src/ATen/native/mkldnn/OpContext.h create mode 100644 aten/src/ATen/native/mkldnn/RegisterMkldnnOpContextClass.cpp create mode 100644 aten/src/ATen/native/mps/operations/BitwiseOps.mm create mode 100644 aten/src/ATen/native/mps/operations/Indexing.h create mode 100644 aten/src/ATen/native/mps/operations/Pad.mm delete mode 100644 aten/src/ATen/native/vulkan/api/vk_mem_alloc.h delete mode 100644 aten/src/ATen/native/vulkan/ops/QuantizedConvolution.cpp delete mode 100644 aten/src/ATen/native/vulkan/ops/QuantizedConvolution.h delete mode 100644 aten/src/ATen/native/vulkan/ops/TransposeConvolution2d.cpp delete 
mode 100644 aten/src/ATen/native/vulkan/ops/TransposeConvolution2d.h delete mode 100644 aten/src/ATen/native/vulkan/ops/VulkanOpContext.cpp delete mode 100644 aten/src/ATen/native/vulkan/ops/VulkanOpContext.h create mode 100644 aten/src/ATen/native/vulkan/ops/VulkanPackedContext.h create mode 100644 aten/src/ATen/templates/RegisterDispatchDefinitions.ini create mode 100644 benchmarks/cpp/nvfuser/matmul.cpp create mode 100644 c10/core/impl/GPUTrace.cpp create mode 100644 c10/core/impl/GPUTrace.h create mode 100644 c10/core/impl/TorchDispatchModeTLS.cpp create mode 100644 c10/core/impl/TorchDispatchModeTLS.h create mode 100644 docs/source/masked.rst delete mode 100644 functorch/codegen/gen_functorch_lagging_op_db.py create mode 100644 functorch/functorch/csrc/dim/arena.h create mode 100644 functorch/functorch/csrc/dim/dim.cpp create mode 100644 functorch/functorch/csrc/dim/dim.h create mode 100644 functorch/functorch/csrc/dim/minpybind.h create mode 100644 functorch/functorch/csrc/dim/python_variable_simple.h create mode 100644 functorch/functorch/dim/README.md create mode 100644 functorch/functorch/dim/__init__.py create mode 100644 functorch/functorch/dim/batch_tensor.py create mode 100644 functorch/functorch/dim/delayed_mul_tensor.py create mode 100644 functorch/functorch/dim/dim.py create mode 100644 functorch/functorch/dim/magic_trace.py create mode 100644 functorch/functorch/dim/op_properties.py create mode 100644 functorch/functorch/dim/reference.py create mode 100644 functorch/functorch/dim/tree_map.py create mode 100644 functorch/functorch/dim/wrap_type.py create mode 100644 functorch/functorch/experimental/cond.py create mode 100644 functorch/functorch/experimental/ops.py create mode 100644 functorch/test/attn_ft.py create mode 100644 functorch/test/attn_positional.py delete mode 100644 functorch/test/functorch_lagging_op_db.py create mode 100644 functorch/test/test_control_flow.py create mode 100644 functorch/test/test_dims.py create mode 100644 test/cpp/c10d/ProcessGroupUCCTest.cpp create mode 100644 test/distributed/_shard/checkpoint/test_planner.py create mode 100644 test/distributed/elastic/timer/file_based_local_timer_test.py delete mode 100644 test/jit/test_legacy_upgraders.py create mode 100644 test/nn/test_packed_sequence.py create mode 100644 test/nn/test_pooling.py create mode 100644 test/onnx/internal/test_beartype.py create mode 100644 test/onnx/test_autograd_funs.py create mode 100644 test/quantization/core/experimental/apot_fx_graph_mode_ptq.py create mode 100644 test/quantization/core/experimental/apot_fx_graph_mode_qat.py rename test/quantization/core/experimental/{fx_graph_mode_apot.py => quantization_util.py} (52%) delete mode 100644 test/quantization/dbr/test_quantize_dbr.py create mode 100755 test/run_doctests.sh mode change 100644 => 100755 test/run_test.py create mode 100644 test/test_cuda_trace.py create mode 100644 test/test_dlpack.py delete mode 100644 test/test_dynamo_cudagraphs.py create mode 100644 test/test_maskedtensor.py create mode 100644 test/test_mkldnn_fusion.py rename test/{jit => }/test_schema_check.py (94%) create mode 160000 third_party/VulkanMemoryAllocator delete mode 100644 third_party/cpuinfo.BUILD create mode 100644 torch/_C/_profiler.pyi rename {test/quantization/dbr => torch/_dispatch}/__init__.py (100%) create mode 100644 torch/_dispatch/_dispatcher.py create mode 100644 torch/_prims/nvfuser_prims.py create mode 100644 torch/ao/nn/qat/__init__.py create mode 100644 torch/ao/nn/qat/dynamic/__init__.py create mode 100644 
torch/ao/nn/qat/dynamic/modules/__init__.py create mode 100644 torch/ao/nn/qat/dynamic/modules/linear.py create mode 100644 torch/ao/nn/qat/modules/__init__.py create mode 100644 torch/ao/nn/qat/modules/conv.py create mode 100644 torch/ao/nn/qat/modules/embedding_ops.py create mode 100644 torch/ao/nn/qat/modules/linear.py create mode 100644 torch/ao/nn/quantizable/__init__.py create mode 100644 torch/ao/nn/quantizable/modules/__init__.py create mode 100644 torch/ao/nn/quantizable/modules/activation.py create mode 100644 torch/ao/nn/quantizable/modules/rnn.py create mode 100644 torch/ao/nn/quantized/__init__.py create mode 100644 torch/ao/nn/quantized/_reference/__init__.py create mode 100644 torch/ao/nn/quantized/_reference/modules/__init__.py create mode 100644 torch/ao/nn/quantized/_reference/modules/conv.py create mode 100644 torch/ao/nn/quantized/_reference/modules/linear.py create mode 100644 torch/ao/nn/quantized/_reference/modules/rnn.py create mode 100644 torch/ao/nn/quantized/_reference/modules/sparse.py create mode 100644 torch/ao/nn/quantized/_reference/modules/utils.py create mode 100644 torch/ao/nn/quantized/dynamic/__init__.py create mode 100644 torch/ao/nn/quantized/dynamic/modules/__init__.py create mode 100644 torch/ao/nn/quantized/dynamic/modules/conv.py create mode 100644 torch/ao/nn/quantized/dynamic/modules/linear.py create mode 100644 torch/ao/nn/quantized/dynamic/modules/rnn.py create mode 100644 torch/ao/nn/quantized/functional.py create mode 100644 torch/ao/nn/quantized/modules/__init__.py create mode 100644 torch/ao/nn/quantized/modules/activation.py create mode 100644 torch/ao/nn/quantized/modules/batchnorm.py create mode 100644 torch/ao/nn/quantized/modules/conv.py create mode 100644 torch/ao/nn/quantized/modules/dropout.py create mode 100644 torch/ao/nn/quantized/modules/embedding_ops.py create mode 100644 torch/ao/nn/quantized/modules/functional_modules.py create mode 100644 torch/ao/nn/quantized/modules/linear.py create mode 100644 torch/ao/nn/quantized/modules/normalization.py create mode 100644 torch/ao/nn/quantized/modules/rnn.py create mode 100644 torch/ao/nn/quantized/modules/utils.py delete mode 100644 torch/ao/ns/_numeric_suite_dbr.py delete mode 100644 torch/ao/quantization/_dbr/README.md delete mode 100644 torch/ao/quantization/_dbr/auto_trace.py delete mode 100644 torch/ao/quantization/_dbr/auto_trace_rewriter.py delete mode 100644 torch/ao/quantization/_dbr/function_fusion.py delete mode 100644 torch/ao/quantization/_dbr/fusion.py delete mode 100644 torch/ao/quantization/_dbr/mappings.py delete mode 100644 torch/ao/quantization/_dbr/model_utils.py delete mode 100644 torch/ao/quantization/_dbr/module_swap_utils.py delete mode 100644 torch/ao/quantization/_dbr/qconfig_mapping_utils.py delete mode 100644 torch/ao/quantization/_dbr/quantization_state.py delete mode 100644 torch/ao/quantization/_dbr/torchscript_utils.py delete mode 100644 torch/ao/quantization/_dbr/utils.py delete mode 100644 torch/ao/quantization/_quantize_dbr.py create mode 100644 torch/ao/quantization/backend_config/fbgemm.py create mode 100644 torch/ao/quantization/backend_config/qnnpack.py delete mode 100644 torch/ao/quantization/fx/common_quantization_patterns.py rename torch/ao/quantization/fx/{qconfig_utils.py => qconfig_mapping_utils.py} (95%) create mode 100644 torch/ao/sparsity/_experimental/data_sparsifier/quantization_utils.py create mode 100644 torch/csrc/autograd/graph_task.h delete mode 100644 torch/csrc/jit/backends/coreml/observer/PTMCoreMLObserver.h delete mode 100644 
torch/csrc/jit/backends/coreml/observer/PTMCoreMLObserver.mm delete mode 100644 torch/csrc/jit/codegen/cuda/inline_propagator.cpp delete mode 100644 torch/csrc/jit/codegen/cuda/inline_propagator.h create mode 100644 torch/csrc/jit/codegen/cuda/inlining.cpp create mode 100644 torch/csrc/jit/codegen/cuda/inlining.h create mode 100644 torch/csrc/jit/codegen/cuda/lower_divisible_split.cpp create mode 100644 torch/csrc/jit/codegen/cuda/lower_divisible_split.h delete mode 100644 torch/csrc/jit/codegen/cuda/reference_tensor.h create mode 100644 torch/csrc/jit/codegen/cuda/runtime/fused_welford_helper.cu create mode 100644 torch/csrc/jit/codegen/cuda/runtime/fused_welford_impl.cu create mode 100644 torch/csrc/jit/codegen/cuda/scheduler/transpose.cpp create mode 100644 torch/csrc/jit/codegen/cuda/scheduler/transpose.h create mode 100644 torch/csrc/jit/codegen/cuda/scheduler/transpose_heuristic.h create mode 100644 torch/csrc/jit/codegen/cuda/scheduler/vectorize_helper.cpp create mode 100644 torch/csrc/jit/codegen/cuda/test/test_gpu_tensor_factories.cpp create mode 100644 torch/csrc/jit/codegen/cuda/test/test_gpu_transpose.cpp create mode 100644 torch/csrc/jit/codegen/cuda/test/test_gpu_utils.cpp delete mode 100644 torch/csrc/jit/operator_upgraders/upgraders_guard.cpp delete mode 100644 torch/csrc/jit/operator_upgraders/upgraders_guard.h create mode 100644 torch/csrc/jit/passes/mkldnn_rewrite.cpp create mode 100644 torch/csrc/jit/passes/mkldnn_rewrite.h create mode 100644 torch/csrc/jit/passes/onnx/naming.cpp create mode 100644 torch/csrc/jit/passes/onnx/naming.h create mode 100644 torch/csrc/jit/passes/onnx/pattern_conversion/autograd_function_process.cpp create mode 100644 torch/csrc/jit/passes/onnx/pattern_conversion/autograd_function_process.h create mode 100644 torch/csrc/profiler/python/init.cpp create mode 100644 torch/csrc/profiler/python/init.h create mode 100644 torch/cuda/_memory_viz.py create mode 100644 torch/distributed/_shard/checkpoint/default_planner.py create mode 100644 torch/distributed/_shard/checkpoint/planner.py create mode 100644 torch/distributed/_shard/checkpoint/planner_helpers.py create mode 100644 torch/distributed/elastic/timer/file_based_local_timer.py rename torch/distributed/fsdp/{shard_utils.py => _shard_utils.py} (64%) create mode 100644 torch/fx/passes/utils/matcher_utils.py create mode 100644 torch/fx/traceback.py create mode 100644 torch/masked/__init__.py create mode 100644 torch/masked/maskedtensor/__init__.py create mode 100644 torch/masked/maskedtensor/binary.py create mode 100644 torch/masked/maskedtensor/core.py create mode 100644 torch/masked/maskedtensor/creation.py create mode 100644 torch/masked/maskedtensor/passthrough.py create mode 100644 torch/masked/maskedtensor/unary.py create mode 100644 torch/nn/utils/_deprecation_utils.py rename torch/{ao/quantization/_dbr => onnx/_internal}/__init__.py (100%) create mode 100644 torch/onnx/_internal/_beartype.py create mode 100644 torch/onnx/_type_utils.py create mode 100644 torch/onnx/symbolic_opset17.py create mode 100644 torch/testing/_internal/opinfo/__init__.py create mode 100644 torch/testing/_internal/opinfo/core.py create mode 100644 torch/testing/_internal/opinfo/definitions/__init__.py create mode 100644 torch/testing/_internal/opinfo/definitions/_masked.py create mode 100644 torch/testing/_internal/opinfo/definitions/fft.py create mode 100644 torch/testing/_internal/opinfo/definitions/linalg.py create mode 100644 torch/testing/_internal/opinfo/definitions/special.py create mode 100644 
torch/testing/_internal/opinfo/refs.py create mode 100644 torch/testing/_internal/opinfo/utils.py delete mode 100644 torch/testing/_internal/opinfo_helper.py create mode 100644 torch/utils/_cuda_trace.py diff --git a/.circleci/cimodel/data/dimensions.py b/.circleci/cimodel/data/dimensions.py index 7f9ebccbcc898..5841b3806b135 100644 --- a/.circleci/cimodel/data/dimensions.py +++ b/.circleci/cimodel/data/dimensions.py @@ -4,6 +4,7 @@ "102", "113", "116", + "117", ] ROCM_VERSIONS = [ diff --git a/.circleci/cimodel/data/simple/ios_definitions.py b/.circleci/cimodel/data/simple/ios_definitions.py index a01a2db8229fc..5dfb84d6e5da1 100644 --- a/.circleci/cimodel/data/simple/ios_definitions.py +++ b/.circleci/cimodel/data/simple/ios_definitions.py @@ -11,7 +11,7 @@ def __init__(self, name, custom_build_name=""): def render(self): extra_parts = [self.custom_build_name] if len(self.custom_build_name) > 0 else [] - return "_".join([self.name] + extra_parts) + return "-".join([self.name] + extra_parts).replace("_", "-") def get_platform(arch_variant_name): @@ -25,30 +25,25 @@ def __init__(self, xcode_version, arch_variant, is_org_member_context=True, extr self.is_org_member_context = is_org_member_context self.extra_props = extra_props - def gen_name_parts(self, with_version_dots): - - version_parts = self.xcode_version.render_dots_or_parts(with_version_dots) - build_variant_suffix = "_".join([self.arch_variant.render(), "build"]) - + def gen_name_parts(self): + version_parts = self.xcode_version.render_dots_or_parts("-") + build_variant_suffix = self.arch_variant.render() return [ - "pytorch", "ios", ] + version_parts + [ build_variant_suffix, ] def gen_job_name(self): - return "_".join(self.gen_name_parts(False)) + return "-".join(self.gen_name_parts()) def gen_tree(self): - platform_name = get_platform(self.arch_variant.name) - props_dict = { - "build_environment": "-".join(self.gen_name_parts(True)), + "name": self.gen_job_name(), + "build_environment": self.gen_job_name(), "ios_arch": self.arch_variant.name, "ios_platform": platform_name, - "name": self.gen_job_name(), } if self.is_org_member_context: @@ -63,16 +58,12 @@ def gen_tree(self): WORKFLOW_DATA = [ IOSJob(XCODE_VERSION, ArchVariant("x86_64"), is_org_member_context=False, extra_props={ "lite_interpreter": miniutils.quote(str(int(True)))}), - IOSJob(XCODE_VERSION, ArchVariant("x86_64", "full_jit"), is_org_member_context=False, extra_props={ - "lite_interpreter": miniutils.quote(str(int(False)))}), IOSJob(XCODE_VERSION, ArchVariant("arm64"), extra_props={ "lite_interpreter": miniutils.quote(str(int(True)))}), IOSJob(XCODE_VERSION, ArchVariant("arm64", "metal"), extra_props={ "use_metal": miniutils.quote(str(int(True))), "lite_interpreter": miniutils.quote(str(int(True)))}), - IOSJob(XCODE_VERSION, ArchVariant("arm64", "full_jit"), extra_props={ - "lite_interpreter": miniutils.quote(str(int(False)))}), - IOSJob(XCODE_VERSION, ArchVariant("arm64", "custom"), extra_props={ + IOSJob(XCODE_VERSION, ArchVariant("arm64", "custom-ops"), extra_props={ "op_list": "mobilenetv2.yaml", "lite_interpreter": miniutils.quote(str(int(True)))}), IOSJob(XCODE_VERSION, ArchVariant("x86_64", "coreml"), is_org_member_context=False, extra_props={ diff --git a/.circleci/cimodel/data/simple/macos_definitions.py b/.circleci/cimodel/data/simple/macos_definitions.py index 371c8b694cf3b..d0b4df0f906cb 100644 --- a/.circleci/cimodel/data/simple/macos_definitions.py +++ b/.circleci/cimodel/data/simple/macos_definitions.py @@ -1,3 +1,7 @@ +from collections import 
OrderedDict +from cimodel.lib.miniutils import quote + + class MacOsJob: def __init__(self, os_version, is_build=False, is_test=False, extra_props=tuple()): # extra_props is tuple type, because mutable data structures for argument defaults @@ -11,10 +15,14 @@ def gen_tree(self): non_phase_parts = ["pytorch", "macos", self.os_version, "py3"] extra_name_list = [name for name, exist in self.extra_props.items() if exist] - full_job_name_list = non_phase_parts + extra_name_list + [ - 'build' if self.is_build else None, - 'test' if self.is_test else None, - ] + full_job_name_list = ( + non_phase_parts + + extra_name_list + + [ + "build" if self.is_build else None, + "test" if self.is_test else None, + ] + ) full_job_name = "_".join(list(filter(None, full_job_name_list))) @@ -41,12 +49,93 @@ def gen_tree(self): "10_13", is_build=True, is_test=True, - extra_props=tuple({ - "lite_interpreter": True - }.items()), - ) + extra_props=tuple({"lite_interpreter": True}.items()), + ), ] +def get_new_workflow_jobs(): + return [ + OrderedDict( + { + "mac_build": OrderedDict( + { + "name": "macos-12-py3-x86-64-build", + "build-environment": "macos-12-py3-x86-64", + "xcode-version": quote("13.3.1"), + } + ) + } + ), + OrderedDict( + { + "mac_test": OrderedDict( + { + "name": "macos-12-py3-x86-64-test-1-2-default", + "build-environment": "macos-12-py3-x86-64", + "xcode-version": quote("13.3.1"), + "shard-number": quote("1"), + "num-test-shards": quote("2"), + "requires": ["macos-12-py3-x86-64-build"], + } + ) + } + ), + OrderedDict( + { + "mac_test": OrderedDict( + { + "name": "macos-12-py3-x86-64-test-2-2-default", + "build-environment": "macos-12-py3-x86-64", + "xcode-version": quote("13.3.1"), + "shard-number": quote("2"), + "num-test-shards": quote("2"), + "requires": ["macos-12-py3-x86-64-build"], + } + ) + } + ), + OrderedDict( + { + "mac_test": OrderedDict( + { + "name": "macos-12-py3-x86-64-test-1-1-functorch", + "build-environment": "macos-12-py3-x86-64", + "xcode-version": quote("13.3.1"), + "shard-number": quote("1"), + "num-test-shards": quote("1"), + "test-config": "functorch", + "requires": ["macos-12-py3-x86-64-build"], + } + ) + } + ), + OrderedDict( + { + "mac_build": OrderedDict( + { + "name": "macos-12-py3-x86-64-lite-interpreter-build-test", + "build-environment": "macos-12-py3-lite-interpreter-x86-64", + "xcode-version": quote("13.3.1"), + "build-generates-artifacts": "false", + } + ) + } + ), + OrderedDict( + { + "mac_build": OrderedDict( + { + "name": "macos-12-py3-arm64-build", + "build-environment": "macos-12-py3-arm64", + "xcode-version": quote("13.3.1"), + "python-version": quote("3.9.12"), + } + ) + } + ), + ] + + def get_workflow_jobs(): return [item.gen_tree() for item in WORKFLOW_DATA] diff --git a/.circleci/cimodel/data/simple/nightly_ios.py b/.circleci/cimodel/data/simple/nightly_ios.py index 941a61a73b91e..f75bcb4bfe218 100644 --- a/.circleci/cimodel/data/simple/nightly_ios.py +++ b/.circleci/cimodel/data/simple/nightly_ios.py @@ -15,7 +15,7 @@ def __init__(self, def get_phase_name(self): return "upload" if self.is_upload else "build" - def get_common_name_pieces(self, with_version_dots): + def get_common_name_pieces(self, sep): extra_name_suffix = [self.get_phase_name()] if self.is_upload else [] @@ -24,7 +24,7 @@ def get_common_name_pieces(self, with_version_dots): common_name_pieces = [ "ios", ] + extra_name + [ - ] + ios_definitions.XCODE_VERSION.render_dots_or_parts(with_version_dots) + [ + ] + ios_definitions.XCODE_VERSION.render_dots_or_parts(sep) + [ "nightly", 
self.variant, "build", @@ -33,14 +33,14 @@ def get_common_name_pieces(self, with_version_dots): return common_name_pieces def gen_job_name(self): - return "_".join(["pytorch"] + self.get_common_name_pieces(False)) + return "_".join(["pytorch"] + self.get_common_name_pieces(None)) def gen_tree(self): build_configs = BUILD_CONFIGS_FULL_JIT if self.is_full_jit else BUILD_CONFIGS extra_requires = [x.gen_job_name() for x in build_configs] if self.is_upload else [] props_dict = { - "build_environment": "-".join(["libtorch"] + self.get_common_name_pieces(True)), + "build_environment": "-".join(["libtorch"] + self.get_common_name_pieces(".")), "requires": extra_requires, "context": "org-member", "filters": {"branches": {"only": "nightly"}}, diff --git a/.circleci/cimodel/data/simple/upload_test_stats_definition.py b/.circleci/cimodel/data/simple/upload_test_stats_definition.py new file mode 100644 index 0000000000000..0d51add5551ce --- /dev/null +++ b/.circleci/cimodel/data/simple/upload_test_stats_definition.py @@ -0,0 +1,20 @@ +from typing import OrderedDict + + +def get_workflow_job(): + return [ + OrderedDict( + { + "upload_test_stats": OrderedDict( + { + "name": "upload test status", + "requires": [ + "macos-12-py3-x86-64-test-1-2-default", + "macos-12-py3-x86-64-test-2-2-default", + "macos-12-py3-x86-64-test-1-1-functorch", + ], + } + ) + } + ), + ] diff --git a/.circleci/cimodel/data/simple/util/versions.py b/.circleci/cimodel/data/simple/util/versions.py index 53d3a837248c1..518feb2e38691 100644 --- a/.circleci/cimodel/data/simple/util/versions.py +++ b/.circleci/cimodel/data/simple/util/versions.py @@ -1,3 +1,6 @@ +from typing import Optional + + class MultiPartVersion: def __init__(self, parts, prefix=""): self.parts = parts @@ -13,14 +16,11 @@ def prefixed_parts(self): else: return [self.prefix] - def render_dots(self): - return ".".join(self.prefixed_parts()) - - def render_dots_or_parts(self, with_dots): - if with_dots: - return [self.render_dots()] - else: + def render_dots_or_parts(self, sep: Optional[str] = None): + if sep is None: return self.prefixed_parts() + else: + return [sep.join(self.prefixed_parts())] class CudaVersion(MultiPartVersion): diff --git a/.circleci/config.yml b/.circleci/config.yml index 4ca08b1b7c181..0b742215880ad 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -570,6 +570,196 @@ jobs: paths: - miniconda3 + mac_build: + parameters: + build-environment: + type: string + description: Top-level label for what's being built/tested. + xcode-version: + type: string + default: "13.3.1" + description: What xcode version to build with. 
+ build-generates-artifacts: + type: boolean + default: true + description: if the build generates build artifacts + python-version: + type: string + default: "3.8" + macos: + xcode: << parameters.xcode-version >> + resource_class: medium + environment: + BUILD_ENVIRONMENT: << parameters.build-environment >> + AWS_REGION: us-east-1 + steps: + + - checkout + - run_brew_for_macos_build + + - run: + name: Install sccache + command: | + sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache + sudo chmod +x /usr/local/bin/sccache + echo "export SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${BASH_ENV}" + echo "export SCCACHE_S3_KEY_PREFIX=${GITHUB_WORKFLOW}" >> "${BASH_ENV}" + + set +x + echo "export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_S3_BUCKET_V4}" >> "${BASH_ENV}" + echo "export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_S3_BUCKET_V4}" >> "${BASH_ENV}" + set -x + + - run: + name: Get workflow job id + command: | + echo "export OUR_GITHUB_JOB_ID=${CIRCLE_WORKFLOW_JOB_ID}" >> "${BASH_ENV}" + + - run: + name: Build + command: | + set -x + + git submodule sync + git submodule update --init --recursive --depth 1 --jobs 0 + + export PATH="/usr/local/bin:$PATH" + export WORKSPACE_DIR="${HOME}/workspace" + mkdir -p "${WORKSPACE_DIR}" + MINICONDA_URL="https://repo.anaconda.com/miniconda/Miniconda3-py38_4.12.0-MacOSX-x86_64.sh" + if [ << parameters.python-version >> == 3.9.12 ]; then + MINICONDA_URL="https://repo.anaconda.com/miniconda/Miniconda3-py39_4.12.0-MacOSX-x86_64.sh" + fi + + # If a local installation of conda doesn't exist, we download and install conda + if [ ! -d "${WORKSPACE_DIR}/miniconda3" ]; then + mkdir -p "${WORKSPACE_DIR}" + curl --retry 3 ${MINICONDA_URL} -o "${WORKSPACE_DIR}"/miniconda3.sh + bash "${WORKSPACE_DIR}"/miniconda3.sh -b -p "${WORKSPACE_DIR}"/miniconda3 + fi + export PATH="${WORKSPACE_DIR}/miniconda3/bin:$PATH" + # shellcheck disable=SC1091 + source "${WORKSPACE_DIR}"/miniconda3/bin/activate + + echo "export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname "$(which conda)")/../"}" >> "${BASH_ENV}" + .jenkins/pytorch/macos-build.sh + + - when: + condition: << parameters.build-generates-artifacts >> + steps: + - run: + name: Archive artifacts into zip + command: | + zip -1 -r artifacts.zip dist/ build/.ninja_log build/compile_commands.json .pytorch-test-times.json + cp artifacts.zip /Users/distiller/workspace + + - persist_to_workspace: + root: /Users/distiller/workspace/ + paths: + - miniconda3 + - artifacts.zip + + - store_artifacts: + path: /Users/distiller/project/artifacts.zip + + mac_test: + parameters: + build-environment: + type: string + shard-number: + type: string + num-test-shards: + type: string + xcode-version: + type: string + test-config: + type: string + default: 'default' + + macos: + xcode: << parameters.xcode-version >> + environment: + GIT_DEFAULT_BRANCH: 'master' + BUILD_ENVIRONMENT: << parameters.build-environment >> + TEST_CONFIG: << parameters.test-config >> + SHARD_NUMBER: << parameters.shard-number >> + NUM_TEST_SHARDS: << parameters.num-test-shards >> + PYTORCH_RETRY_TEST_CASES: 1 + PYTORCH_OVERRIDE_FLAKY_SIGNAL: 1 + steps: + - checkout + - attach_workspace: + at: ~/workspace + - run_brew_for_macos_build + - run: + name: Test + no_output_timeout: "1h" + command: | + set -x + + git submodule sync --recursive + git submodule update --init --recursive + + mv ~/workspace/artifacts.zip . 
+ unzip artifacts.zip + + export IN_CI=1 + + COMMIT_MESSAGES=$(git cherry -v "origin/${GIT_DEFAULT_BRANCH:-master}") + + export PATH="/usr/local/bin:$PATH" + export WORKSPACE_DIR="${HOME}/workspace" + mkdir -p "${WORKSPACE_DIR}" + + export PATH="${WORKSPACE_DIR}/miniconda3/bin:$PATH" + source "${WORKSPACE_DIR}"/miniconda3/bin/activate + + # sanitize the input commit message and PR body here: + + # trim all new lines from commit messages to avoid issues with batch environment + # variable copying. see https://github.com/pytorch/pytorch/pull/80043#issuecomment-1167796028 + COMMIT_MESSAGES="${COMMIT_MESSAGES//[$'\n\r']}" + + # then trim all special characters like single and double quotes to avoid unescaped inputs to + # wreak havoc internally + export COMMIT_MESSAGES="${COMMIT_MESSAGES//[\'\"]}" + + python3 -mpip install dist/*.whl + .jenkins/pytorch/macos-test.sh + - run: + name: Copy files for uploading test stats + command: | + # copy into a parent folder test-reports because we can't use CIRCLEI_BUILD_NUM in path when persisting to workspace + mkdir -p test-reports/test-reports_${CIRCLE_BUILD_NUM}/test/test-reports + cp -r test/test-reports test-reports/test-reports_${CIRCLE_BUILD_NUM}/test/test-reports + - store_test_results: + path: test/test-reports + - persist_to_workspace: + root: /Users/distiller/project/ + paths: + - test-reports + + upload_test_stats: + machine: # executor type + image: ubuntu-2004:202010-01 # # recommended linux image - includes Ubuntu 20.04, docker 19.03.13, docker-compose 1.27.4 + steps: + - checkout + - attach_workspace: + at: ~/workspace + - run: + name: upload + command: | + set -ex + if [ -z ${AWS_ACCESS_KEY_FOR_OSSCI_ARTIFACT_UPLOAD} ]; then + echo "No credentials found, cannot upload test stats (are you on a fork?)" + exit 0 + fi + cp -r ~/workspace/test-reports/* ~/project + pip3 install requests==2.26 rockset==0.8.3 boto3==1.19.12 six==1.16.0 + export AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_FOR_OSSCI_ARTIFACT_UPLOAD} + export AWS_SECRET_ACCESS_KEY=${AWS_SECRET_KEY_FOR_OSSCI_ARTIFACT_UPLOAD} + # i dont know how to get the run attempt number for reruns so default to 1 + python3 -m tools.stats.upload_test_stats --workflow-run-id "${CIRCLE_WORKFLOW_JOB_ID}" --workflow-run-attempt 1 --head-branch << pipeline.git.branch >> --circleci pytorch_macos_10_13_py3_test: environment: BUILD_ENVIRONMENT: pytorch-macos-10.13-py3-test @@ -911,12 +1101,13 @@ jobs: cd ${PROJ_ROOT}/ios/TestApp/benchmark mkdir -p ../models if [ ${USE_COREML_DELEGATE} == 1 ]; then - pip install coremltools==5.0b5 - pip install six + pip install coremltools==5.0b5 protobuf==3.20.1 six==1.16.0 python coreml_backend.py else - python trace_model.py + cd "${PROJ_ROOT}" + python test/mobile/model_test/gen_test_model.py ios-test fi + cd "${PROJ_ROOT}/ios/TestApp/benchmark" if [ ${BUILD_LITE_INTERPRETER} == 1 ]; then echo "Setting up the TestApp for LiteInterpreter" ruby setup.rb --lite 1 @@ -924,10 +1115,10 @@ jobs: echo "Setting up the TestApp for Full JIT" ruby setup.rb fi - cd ${PROJ_ROOT}/ios/TestApp - instruments -s -devices - if [ ${BUILD_LITE_INTERPRETER} == 1 ]; then - if [ ${USE_COREML_DELEGATE} == 1 ]; then + cd "${PROJ_ROOT}/ios/TestApp" + # instruments -s -devices + if [ "${BUILD_LITE_INTERPRETER}" == 1 ]; then + if [ "${USE_COREML_DELEGATE}" == 1 ]; then fastlane scan --only_testing TestAppTests/TestAppTests/testCoreML else fastlane scan --only_testing TestAppTests/TestAppTests/testLiteInterpreter @@ -1241,4 +1432,93 @@ workflows: branches: only: - postnightly + - mac_build: + name: 
macos-12-py3-x86-64-build + build-environment: macos-12-py3-x86-64 + xcode-version: "13.3.1" + - mac_test: + name: macos-12-py3-x86-64-test-1-2-default + build-environment: macos-12-py3-x86-64 + xcode-version: "13.3.1" + shard-number: "1" + num-test-shards: "2" + requires: + - macos-12-py3-x86-64-build + - mac_test: + name: macos-12-py3-x86-64-test-2-2-default + build-environment: macos-12-py3-x86-64 + xcode-version: "13.3.1" + shard-number: "2" + num-test-shards: "2" + requires: + - macos-12-py3-x86-64-build + - mac_test: + name: macos-12-py3-x86-64-test-1-1-functorch + build-environment: macos-12-py3-x86-64 + xcode-version: "13.3.1" + shard-number: "1" + num-test-shards: "1" + test-config: functorch + requires: + - macos-12-py3-x86-64-build + - mac_build: + name: macos-12-py3-x86-64-lite-interpreter-build-test + build-environment: macos-12-py3-lite-interpreter-x86-64 + xcode-version: "13.3.1" + build-generates-artifacts: false + - mac_build: + name: macos-12-py3-arm64-build + build-environment: macos-12-py3-arm64 + xcode-version: "13.3.1" + python-version: "3.9.12" + - upload_test_stats: + name: upload test status + requires: + - macos-12-py3-x86-64-test-1-2-default + - macos-12-py3-x86-64-test-2-2-default + - macos-12-py3-x86-64-test-1-1-functorch + - pytorch_ios_build: + build_environment: ios-12-5-1-x86-64 + ios_arch: x86_64 + ios_platform: SIMULATOR + lite_interpreter: "1" + name: ios-12-5-1-x86-64 + - pytorch_ios_build: + build_environment: ios-12-5-1-arm64 + context: org-member + ios_arch: arm64 + ios_platform: OS + lite_interpreter: "1" + name: ios-12-5-1-arm64 + - pytorch_ios_build: + build_environment: ios-12-5-1-arm64-metal + context: org-member + ios_arch: arm64 + ios_platform: OS + lite_interpreter: "1" + name: ios-12-5-1-arm64-metal + use_metal: "1" + - pytorch_ios_build: + build_environment: ios-12-5-1-arm64-custom-ops + context: org-member + ios_arch: arm64 + ios_platform: OS + lite_interpreter: "1" + name: ios-12-5-1-arm64-custom-ops + op_list: mobilenetv2.yaml + - pytorch_ios_build: + build_environment: ios-12-5-1-x86-64-coreml + ios_arch: x86_64 + ios_platform: SIMULATOR + lite_interpreter: "1" + name: ios-12-5-1-x86-64-coreml + use_coreml: "1" + - pytorch_ios_build: + build_environment: ios-12-5-1-arm64-coreml + context: org-member + ios_arch: arm64 + ios_platform: OS + lite_interpreter: "1" + name: ios-12-5-1-arm64-coreml + use_coreml: "1" when: << pipeline.parameters.run_build >> diff --git a/.circleci/docker/build.sh b/.circleci/docker/build.sh index ee785bbc95039..b7fef829b798e 100755 --- a/.circleci/docker/build.sh +++ b/.circleci/docker/build.sh @@ -84,6 +84,8 @@ if [[ "$image" == *xenial* ]] || [[ "$image" == *bionic* ]]; then fi TRAVIS_DL_URL_PREFIX="https://s3.amazonaws.com/travis-python-archives/binaries/ubuntu/14.04/x86_64" +_UCX_COMMIT=31e74cac7bee0ef66bef2af72e7d86d9c282e5ab +_UCC_COMMIT=12944da33f911daf505d9bbc51411233d0ed85e1 # It's annoying to rename jobs every time you want to rewrite a # configuration, so we hardcode everything here rather than do it @@ -147,6 +149,8 @@ case "$image" in DB=yes VISION=yes KATEX=yes + UCX_COMMIT=${_UCX_COMMIT} + UCC_COMMIT=${_UCC_COMMIT} ;; pytorch-linux-bionic-cuda11.7-cudnn8-py3-gcc7) CUDA_VERSION=11.7.0 @@ -157,6 +161,8 @@ case "$image" in DB=yes VISION=yes KATEX=yes + UCX_COMMIT=${_UCX_COMMIT} + UCC_COMMIT=${_UCC_COMMIT} ;; pytorch-linux-xenial-py3-clang5-asan) ANACONDA_PYTHON_VERSION=3.7 @@ -262,7 +268,7 @@ case "$image" in ;; pytorch-linux-focal-py3.7-gcc7) ANACONDA_PYTHON_VERSION=3.7 - CMAKE_VERSION=3.12.4 # To 
make sure XNNPACK is enabled for the BACKWARDS_COMPAT_TEST used with this image + CMAKE_VERSION=3.16.9 # Required for precompiled header support GCC_VERSION=7 PROTOBUF=yes DB=yes @@ -375,6 +381,8 @@ docker build \ --build-arg "ROCM_VERSION=${ROCM_VERSION:-}" \ --build-arg "PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH:-gfx900;gfx906}" \ --build-arg "IMAGE_NAME=${IMAGE_NAME}" \ + --build-arg "UCX_COMMIT=${UCX_COMMIT}" \ + --build-arg "UCC_COMMIT=${UCC_COMMIT}" \ -f $(dirname ${DOCKERFILE})/Dockerfile \ -t "$tmp_tag" \ "$@" \ diff --git a/.circleci/docker/common/install_base.sh b/.circleci/docker/common/install_base.sh index 26ca9d79cedeb..6724031c0a447 100755 --- a/.circleci/docker/common/install_base.sh +++ b/.circleci/docker/common/install_base.sh @@ -67,7 +67,8 @@ install_ubuntu() { wget \ sudo \ vim \ - jq + jq \ + libtool # Should resolve issues related to various apt package repository cert issues # see: https://github.com/pytorch/pytorch/issues/65931 diff --git a/.circleci/docker/common/install_conda.sh b/.circleci/docker/common/install_conda.sh index 49afcb5aef423..3626d0cc33d4c 100755 --- a/.circleci/docker/common/install_conda.sh +++ b/.circleci/docker/common/install_conda.sh @@ -55,8 +55,10 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then # Ensure we run conda in a directory that jenkins has write access to pushd /opt/conda - # Track latest conda update - as_jenkins conda update -y -n base conda + # Prevent conda from updating to 4.14.0, which causes docker build failures + # See https://hud.pytorch.org/pytorch/pytorch/commit/754d7f05b6841e555cea5a4b2c505dd9e0baec1d + # Uncomment the below when resolved to track the latest conda update + # as_jenkins conda update -y -n base conda # Install correct Python version as_jenkins conda install -y python="$ANACONDA_PYTHON_VERSION" diff --git a/.circleci/docker/common/install_ucc.sh b/.circleci/docker/common/install_ucc.sh new file mode 100755 index 0000000000000..4d691ebb5e9ed --- /dev/null +++ b/.circleci/docker/common/install_ucc.sh @@ -0,0 +1,48 @@ +#!/bin/bash + +set -ex + +if [[ -d "/usr/local/cuda/" ]]; then + with_cuda=/usr/local/cuda/ +else + with_cuda=no +fi + +function install_ucx() { + set -ex + git clone --recursive https://github.com/openucx/ucx.git + pushd ucx + git checkout ${UCX_COMMIT} + git submodule update --init --recursive + + ./autogen.sh + ./configure --prefix=$UCX_HOME \ + --enable-mt \ + --with-cuda=$with_cuda \ + --enable-profiling \ + --enable-stats + time make -j + sudo make install + + popd + rm -rf ucx +} + +function install_ucc() { + set -ex + git clone --recursive https://github.com/openucx/ucc.git + pushd ucc + git checkout ${UCC_COMMIT} + git submodule update --init --recursive + + ./autogen.sh + ./configure --prefix=$UCC_HOME --with-ucx=$UCX_HOME --with-nccl=no --with-cuda=$with_cuda + time make -j + sudo make install + + popd + rm -rf ucc +} + +install_ucx +install_ucc diff --git a/.circleci/docker/requirements-ci.txt b/.circleci/docker/requirements-ci.txt index 451bd39467c37..5662eadc4f661 100644 --- a/.circleci/docker/requirements-ci.txt +++ b/.circleci/docker/requirements-ci.txt @@ -164,6 +164,16 @@ pytest-rerunfailures #Pinned versions: #test that import: +xdoctest==1.0.2 +#Description: runs doctests in pytest +#Pinned versions: 1.0.2 +#test that import: + +pygments==2.12.0 +#Description: support doctest highlighting +#Pinned versions: 2.12.0 +#test that import: the doctests + #PyYAML #Description: data serialization format #Pinned versions: diff --git a/.circleci/docker/ubuntu-cuda/Dockerfile 
b/.circleci/docker/ubuntu-cuda/Dockerfile index f7674987a0c3e..a3a623996ad02 100644 --- a/.circleci/docker/ubuntu-cuda/Dockerfile +++ b/.circleci/docker/ubuntu-cuda/Dockerfile @@ -62,6 +62,17 @@ RUN if [ -n "${VISION}" ]; then bash ./install_vision.sh; fi RUN rm install_vision.sh ENV INSTALLED_VISION ${VISION} +# (optional) Install UCC +ARG UCX_COMMIT +ARG UCC_COMMIT +ENV UCX_COMMIT $UCX_COMMIT +ENV UCC_COMMIT $UCC_COMMIT +ENV UCX_HOME /usr +ENV UCC_HOME /usr +ADD ./common/install_ucc.sh install_ucc.sh +RUN if [ -n "${UCX_COMMIT}" ] && [ -n "${UCC_COMMIT}" ]; then bash ./install_ucc.sh; fi +RUN rm install_ucc.sh + COPY ./common/install_openssl.sh install_openssl.sh ENV OPENSSL_ROOT_DIR /opt/openssl RUN bash ./install_openssl.sh diff --git a/.circleci/docker/ubuntu/Dockerfile b/.circleci/docker/ubuntu/Dockerfile index 22592534c20f0..e86baf0d6690e 100644 --- a/.circleci/docker/ubuntu/Dockerfile +++ b/.circleci/docker/ubuntu/Dockerfile @@ -58,6 +58,17 @@ RUN bash ./install_cuda.sh ${CUDA_VERSION} && rm install_cuda.sh ENV DESIRED_CUDA ${CUDA_VERSION} ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:$PATH +# (optional) Install UCC +ARG UCX_COMMIT +ARG UCC_COMMIT +ENV UCX_COMMIT $UCX_COMMIT +ENV UCC_COMMIT $UCC_COMMIT +ENV UCX_HOME /usr +ENV UCC_HOME /usr +ADD ./common/install_ucc.sh install_ucc.sh +RUN if [ -n "${UCX_COMMIT}" ] && [ -n "${UCC_COMMIT}" ]; then bash ./install_ucc.sh; fi +RUN rm install_ucc.sh + # (optional) Install protobuf for ONNX ARG PROTOBUF COPY ./common/install_protobuf.sh install_protobuf.sh diff --git a/.circleci/generate_config_yml.py b/.circleci/generate_config_yml.py index e068dd98fd8ea..c2e90a4b824fd 100755 --- a/.circleci/generate_config_yml.py +++ b/.circleci/generate_config_yml.py @@ -14,6 +14,9 @@ import cimodel.data.simple.mobile_definitions import cimodel.data.simple.nightly_ios import cimodel.data.simple.anaconda_prune_defintions +import cimodel.data.simple.macos_definitions +import cimodel.data.simple.ios_definitions +import cimodel.data.simple.upload_test_stats_definition import cimodel.lib.miniutils as miniutils import cimodel.lib.miniyaml as miniyaml @@ -70,6 +73,7 @@ def write(self, output_filehandle): for line in filter(None, lines): output_filehandle.write(line + "\n") + def _for_all_items(items, functor) -> None: if isinstance(items, list): for item in items: @@ -78,6 +82,7 @@ def _for_all_items(items, functor) -> None: item_type, item = next(iter(items.items())) functor(item_type, item) + def filter_master_only_jobs(items): def _is_main_or_master_item(item): filters = item.get('filters', None) @@ -116,6 +121,7 @@ def _do_filtering(items): _for_all_items(items, _save_requires_if_master) return _do_filtering(items) + def generate_required_docker_images(items): required_docker_images = set() @@ -131,11 +137,15 @@ def _requires_docker_image(item_type, item): _for_all_items(items, _requires_docker_image) return required_docker_images + def gen_build_workflows_tree(): build_workflows_functions = [ cimodel.data.simple.mobile_definitions.get_workflow_jobs, cimodel.data.simple.nightly_ios.get_workflow_jobs, cimodel.data.simple.anaconda_prune_defintions.get_workflow_jobs, + cimodel.data.simple.macos_definitions.get_new_workflow_jobs, + cimodel.data.simple.upload_test_stats_definition.get_workflow_job, + cimodel.data.simple.ios_definitions.get_workflow_jobs, ] build_jobs = [f() for f in build_workflows_functions] build_jobs.extend( diff --git a/.circleci/verbatim-sources/job-specs/job-specs-custom.yml b/.circleci/verbatim-sources/job-specs/job-specs-custom.yml 
index 180ea014db6d3..bb6155fb7ab50 100644 --- a/.circleci/verbatim-sources/job-specs/job-specs-custom.yml +++ b/.circleci/verbatim-sources/job-specs/job-specs-custom.yml @@ -95,6 +95,196 @@ paths: - miniconda3 + mac_build: + parameters: + build-environment: + type: string + description: Top-level label for what's being built/tested. + xcode-version: + type: string + default: "13.3.1" + description: What xcode version to build with. + build-generates-artifacts: + type: boolean + default: true + description: if the build generates build artifacts + python-version: + type: string + default: "3.8" + macos: + xcode: << parameters.xcode-version >> + resource_class: medium + environment: + BUILD_ENVIRONMENT: << parameters.build-environment >> + AWS_REGION: us-east-1 + steps: + + - checkout + - run_brew_for_macos_build + + - run: + name: Install sccache + command: | + sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache + sudo chmod +x /usr/local/bin/sccache + echo "export SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${BASH_ENV}" + echo "export SCCACHE_S3_KEY_PREFIX=${GITHUB_WORKFLOW}" >> "${BASH_ENV}" + + set +x + echo "export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_S3_BUCKET_V4}" >> "${BASH_ENV}" + echo "export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_S3_BUCKET_V4}" >> "${BASH_ENV}" + set -x + + - run: + name: Get workflow job id + command: | + echo "export OUR_GITHUB_JOB_ID=${CIRCLE_WORKFLOW_JOB_ID}" >> "${BASH_ENV}" + + - run: + name: Build + command: | + set -x + + git submodule sync + git submodule update --init --recursive --depth 1 --jobs 0 + + export PATH="/usr/local/bin:$PATH" + export WORKSPACE_DIR="${HOME}/workspace" + mkdir -p "${WORKSPACE_DIR}" + MINICONDA_URL="https://repo.anaconda.com/miniconda/Miniconda3-py38_4.12.0-MacOSX-x86_64.sh" + if [ << parameters.python-version >> == 3.9.12 ]; then + MINICONDA_URL="https://repo.anaconda.com/miniconda/Miniconda3-py39_4.12.0-MacOSX-x86_64.sh" + fi + + # If a local installation of conda doesn't exist, we download and install conda + if [ ! 
-d "${WORKSPACE_DIR}/miniconda3" ]; then + mkdir -p "${WORKSPACE_DIR}" + curl --retry 3 ${MINICONDA_URL} -o "${WORKSPACE_DIR}"/miniconda3.sh + bash "${WORKSPACE_DIR}"/miniconda3.sh -b -p "${WORKSPACE_DIR}"/miniconda3 + fi + export PATH="${WORKSPACE_DIR}/miniconda3/bin:$PATH" + # shellcheck disable=SC1091 + source "${WORKSPACE_DIR}"/miniconda3/bin/activate + + echo "export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname "$(which conda)")/../"}" >> "${BASH_ENV}" + .jenkins/pytorch/macos-build.sh + + - when: + condition: << parameters.build-generates-artifacts >> + steps: + - run: + name: Archive artifacts into zip + command: | + zip -1 -r artifacts.zip dist/ build/.ninja_log build/compile_commands.json .pytorch-test-times.json + cp artifacts.zip /Users/distiller/workspace + + - persist_to_workspace: + root: /Users/distiller/workspace/ + paths: + - miniconda3 + - artifacts.zip + + - store_artifacts: + path: /Users/distiller/project/artifacts.zip + + mac_test: + parameters: + build-environment: + type: string + shard-number: + type: string + num-test-shards: + type: string + xcode-version: + type: string + test-config: + type: string + default: 'default' + + macos: + xcode: << parameters.xcode-version >> + environment: + GIT_DEFAULT_BRANCH: 'master' + BUILD_ENVIRONMENT: << parameters.build-environment >> + TEST_CONFIG: << parameters.test-config >> + SHARD_NUMBER: << parameters.shard-number >> + NUM_TEST_SHARDS: << parameters.num-test-shards >> + PYTORCH_RETRY_TEST_CASES: 1 + PYTORCH_OVERRIDE_FLAKY_SIGNAL: 1 + steps: + - checkout + - attach_workspace: + at: ~/workspace + - run_brew_for_macos_build + - run: + name: Test + no_output_timeout: "1h" + command: | + set -x + + git submodule sync --recursive + git submodule update --init --recursive + + mv ~/workspace/artifacts.zip . + unzip artifacts.zip + + export IN_CI=1 + + COMMIT_MESSAGES=$(git cherry -v "origin/${GIT_DEFAULT_BRANCH:-master}") + + export PATH="/usr/local/bin:$PATH" + export WORKSPACE_DIR="${HOME}/workspace" + mkdir -p "${WORKSPACE_DIR}" + + export PATH="${WORKSPACE_DIR}/miniconda3/bin:$PATH" + source "${WORKSPACE_DIR}"/miniconda3/bin/activate + + # sanitize the input commit message and PR body here: + + # trim all new lines from commit messages to avoid issues with batch environment + # variable copying. 
see https://github.com/pytorch/pytorch/pull/80043#issuecomment-1167796028 + COMMIT_MESSAGES="${COMMIT_MESSAGES//[$'\n\r']}" + + # then trim all special characters like single and double quotes to avoid unescaped inputs to + # wreak havoc internally + export COMMIT_MESSAGES="${COMMIT_MESSAGES//[\'\"]}" + + python3 -mpip install dist/*.whl + .jenkins/pytorch/macos-test.sh + - run: + name: Copy files for uploading test stats + command: | + # copy into a parent folder test-reports because we can't use CIRCLEI_BUILD_NUM in path when persisting to workspace + mkdir -p test-reports/test-reports_${CIRCLE_BUILD_NUM}/test/test-reports + cp -r test/test-reports test-reports/test-reports_${CIRCLE_BUILD_NUM}/test/test-reports + - store_test_results: + path: test/test-reports + - persist_to_workspace: + root: /Users/distiller/project/ + paths: + - test-reports + + upload_test_stats: + machine: # executor type + image: ubuntu-2004:202010-01 # # recommended linux image - includes Ubuntu 20.04, docker 19.03.13, docker-compose 1.27.4 + steps: + - checkout + - attach_workspace: + at: ~/workspace + - run: + name: upload + command: | + set -ex + if [ -z ${AWS_ACCESS_KEY_FOR_OSSCI_ARTIFACT_UPLOAD} ]; then + echo "No credentials found, cannot upload test stats (are you on a fork?)" + exit 0 + fi + cp -r ~/workspace/test-reports/* ~/project + pip3 install requests==2.26 rockset==0.8.3 boto3==1.19.12 six==1.16.0 + export AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_FOR_OSSCI_ARTIFACT_UPLOAD} + export AWS_SECRET_ACCESS_KEY=${AWS_SECRET_KEY_FOR_OSSCI_ARTIFACT_UPLOAD} + # i dont know how to get the run attempt number for reruns so default to 1 + python3 -m tools.stats.upload_test_stats --workflow-run-id "${CIRCLE_WORKFLOW_JOB_ID}" --workflow-run-attempt 1 --head-branch << pipeline.git.branch >> --circleci pytorch_macos_10_13_py3_test: environment: BUILD_ENVIRONMENT: pytorch-macos-10.13-py3-test @@ -436,12 +626,13 @@ cd ${PROJ_ROOT}/ios/TestApp/benchmark mkdir -p ../models if [ ${USE_COREML_DELEGATE} == 1 ]; then - pip install coremltools==5.0b5 - pip install six + pip install coremltools==5.0b5 protobuf==3.20.1 six==1.16.0 python coreml_backend.py else - python trace_model.py + cd "${PROJ_ROOT}" + python test/mobile/model_test/gen_test_model.py ios-test fi + cd "${PROJ_ROOT}/ios/TestApp/benchmark" if [ ${BUILD_LITE_INTERPRETER} == 1 ]; then echo "Setting up the TestApp for LiteInterpreter" ruby setup.rb --lite 1 @@ -449,10 +640,10 @@ echo "Setting up the TestApp for Full JIT" ruby setup.rb fi - cd ${PROJ_ROOT}/ios/TestApp - instruments -s -devices - if [ ${BUILD_LITE_INTERPRETER} == 1 ]; then - if [ ${USE_COREML_DELEGATE} == 1 ]; then + cd "${PROJ_ROOT}/ios/TestApp" + # instruments -s -devices + if [ "${BUILD_LITE_INTERPRETER}" == 1 ]; then + if [ "${USE_COREML_DELEGATE}" == 1 ]; then fastlane scan --only_testing TestAppTests/TestAppTests/testCoreML else fastlane scan --only_testing TestAppTests/TestAppTests/testLiteInterpreter diff --git a/.github/ISSUE_TEMPLATE/ci-sev.md b/.github/ISSUE_TEMPLATE/ci-sev.md index 8178c68d978b7..2b6bbfc982c95 100644 --- a/.github/ISSUE_TEMPLATE/ci-sev.md +++ b/.github/ISSUE_TEMPLATE/ci-sev.md @@ -5,6 +5,8 @@ about: Tracking incidents for PyTorch's CI infra. > NOTE: Remember to label this issue with "`ci: sev`" +**MERGE BLOCKING** + ## Current Status *Status could be: preemptive, ongoing, mitigated, closed. Also tell people if they need to take action to fix it (i.e. rebase)*. 
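
A note on the CI renaming above: the dash-separated job names (for example `ios-12-5-1-x86-64` and the new `macos-12-py3-x86-64-*` jobs) all hinge on `MultiPartVersion.render_dots_or_parts`, which now takes an optional separator string instead of a `with_dots` boolean. Below is a minimal, illustrative sketch of the intended behaviour, not part of the patch itself; the body of `prefixed_parts` is an assumption (it lies outside the hunk shown), while `render_dots_or_parts` follows the diff, and the `[12, 5, 1]` value is simply the Xcode version the job names suggest.

```python
# Illustrative sketch only, not part of the patch.
from typing import List, Optional


class MultiPartVersion:
    def __init__(self, parts: List[int], prefix: str = "") -> None:
        self.parts = parts
        self.prefix = prefix

    def prefixed_parts(self) -> List[str]:
        # Assumed body (outside the hunk): the prefix is glued onto the first
        # component, e.g. ("cuda", [10, 2]) -> ["cuda10", "2"]; with an empty
        # prefix this is just the stringified parts.
        if self.parts:
            return [self.prefix + str(self.parts[0])] + [str(p) for p in self.parts[1:]]
        return [self.prefix]

    def render_dots_or_parts(self, sep: Optional[str] = None) -> List[str]:
        # sep=None keeps the old list-of-parts behaviour; any separator
        # collapses the parts into a single token.
        if sep is None:
            return self.prefixed_parts()
        return [sep.join(self.prefixed_parts())]


xcode = MultiPartVersion([12, 5, 1])
print(xcode.render_dots_or_parts())     # ['12', '5', '1']
print(xcode.render_dots_or_parts("-"))  # ['12-5-1'], joined into job names like ios-12-5-1-x86-64
print(xcode.render_dots_or_parts("."))  # ['12.5.1'], used for dotted build_environment strings
```

Passing `"-"` yields the dash-separated job names used throughout the new workflow list, `"."` keeps dotted versions for `build_environment` strings, and `None` preserves the old list-of-parts behaviour for callers such as `gen_job_name` in nightly_ios.py that join the pieces themselves.
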
diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md index fc203f1e0d6ce..7d428014cd79c 100644 --- a/.github/PULL_REQUEST_TEMPLATE.md +++ b/.github/PULL_REQUEST_TEMPLATE.md @@ -1,8 +1 @@ -### Description - - -### Issue - - -### Testing - +Fixes #ISSUE_NUMBER diff --git a/.github/actions/get-workflow-job-id/action.yml b/.github/actions/get-workflow-job-id/action.yml index 34863677407af..4dc6ba90c3961 100644 --- a/.github/actions/get-workflow-job-id/action.yml +++ b/.github/actions/get-workflow-job-id/action.yml @@ -15,7 +15,7 @@ outputs: runs: using: composite steps: - - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + - uses: nick-fields/retry@7d4a37704547a311dbb66ebdf5b23ec19374a767 id: get-job-id env: GITHUB_TOKEN: ${{ inputs.github-token }} diff --git a/.github/actions/setup-win/action.yml b/.github/actions/setup-win/action.yml index 12f287b230898..c5f1cac550f68 100644 --- a/.github/actions/setup-win/action.yml +++ b/.github/actions/setup-win/action.yml @@ -58,3 +58,8 @@ runs: uses: actions/setup-python@v2 with: python-version: "3.x" + cache: pip + cache-dependency-path: | + **/requirements.txt + **/.circleci/docker/requirements-ci.txt + **/.github/requirements-gha-cache.txt diff --git a/.github/ci_commit_pins/torchdynamo.txt b/.github/ci_commit_pins/torchdynamo.txt index 4bca66289a606..a0bcc4dd6c4bc 100644 --- a/.github/ci_commit_pins/torchdynamo.txt +++ b/.github/ci_commit_pins/torchdynamo.txt @@ -1 +1 @@ -a43631c54014b2e68a09b39658cbf515875394f6 +058b3581bde241ed72b4092d92e561dd9d82fff0 diff --git a/.github/ci_commit_pins/vision.txt b/.github/ci_commit_pins/vision.txt index 7406c775afed2..34c3bd4731e26 100644 --- a/.github/ci_commit_pins/vision.txt +++ b/.github/ci_commit_pins/vision.txt @@ -1 +1 @@ -1a1d509c8e6578584e7e9e4bd442654bf39149c8 +9c3e2bf46bc49997679785d76b7d0a9fea0223c7 diff --git a/.github/ci_commit_pins/xla.txt b/.github/ci_commit_pins/xla.txt index 6f0f5eab8182e..170afa2afb3c5 100644 --- a/.github/ci_commit_pins/xla.txt +++ b/.github/ci_commit_pins/xla.txt @@ -1 +1 @@ -73c64a55fb096f1e132029d3decbb6f4e532cc7b +9b2f7929c2dae841888a836449c25b04c8cf4045 diff --git a/.github/generated-ciflow-ruleset.json b/.github/generated-ciflow-ruleset.json deleted file mode 100644 index 7605e17918849..0000000000000 --- a/.github/generated-ciflow-ruleset.json +++ /dev/null @@ -1,5 +0,0 @@ -{ - "__comment": "@generated DO NOT EDIT MANUALLY, Generation script: .github/scripts/generate_ci_workflows.py", - "label_rules": {}, - "version": "v1" -} diff --git a/.github/merge_rules.json b/.github/merge_rules.json deleted file mode 100644 index 704e1a5d96509..0000000000000 --- a/.github/merge_rules.json +++ /dev/null @@ -1,230 +0,0 @@ -[ - { - "name": "ONNX exporter", - "patterns": [ - ".jenkins/caffe2/*", - "aten/src/ATen/core/interned_strings.h", - "docs/source/onnx.rst", - "docs/source/scripts/onnx/**", - "scripts/onnx/**", - "test/jit/test_export_modes.py", - "test/onnx/**", - "tools/onnx/**", - "torch/_C/__init__.pyi.in", - "torch/csrc/jit/passes/onnx.*", - "torch/csrc/jit/passes/onnx/**", - "torch/csrc/jit/serialization/export.*", - "torch/csrc/jit/serialization/onnx.*", - "torch/csrc/onnx/**", - "torch/onnx/**" - ], - "approved_by": ["BowenBao", "garymm"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "NVFuser", - "patterns": [ - "test/test_jit_cuda_fuser.py", - "torch/csrc/jit/codegen/fuser/cuda/**", - "torch/csrc/jit/codegen/cuda/**", - "benchmarks/cpp/nvfuser/**" - ], - "approved_by": 
["csarofeen", "ngimel", "jjsjann123"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "OSS CI", - "patterns": [".github/**", ".circleci/**", ".jenkins/**", "scripts/**", "tools/**"], - "approved_by": ["ezyang", "pytorch/pytorch-dev-infra"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "CI Pinned Hashes", - "patterns": [ - ".github/ci_commit_pins/vision.txt", - ".github/ci_commit_pins/torchdynamo.txt" - ], - "approved_by": ["pytorchbot", "ezyang", "pytorch/pytorch-dev-infra"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "XLA hash pin update", - "patterns": [".github/ci_commit_pins/xla.txt"], - "approved_by": ["pytorchbot", "ezyang", "pytorch/pytorch-dev-infra"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull / linux-bionic-py3_7-clang8-xla / build", - "pull / linux-bionic-py3_7-clang8-xla / test (xla, 1, 1, linux.2xlarge)" - ] - }, - { - "name": "Documentation", - "patterns": ["docs/**", "torch/*docs.py"], - "approved_by": ["mruberry", "ngimel", "janeyx99", "svekars"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "Mobile", - "patterns": ["ios/**", "android/**", "test/mobile/**"], - "approved_by": ["linbinyu", "kit1980", "IvanKobzarev", "dreiss"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "Linear Algebra", - "patterns": [ - "aten/src/ATen/native/cuda/linalg/**", - "aten/src/ATen/LinalgBackend.h", - "aten/src/ATen/native/**LinearAlgebra*", - "docs/source/linalg.rst", - "torch/linalg/**", - "torch/_linalg_utils.py", - "torch/**python_linalg_functions.*", - "torch/**linalg.h", - "tools/autograd/templates/python_linalg_functions.cpp", - "test/test_linalg.py" - ], - "approved_by": ["nikitaved", "mruberry", "pearu", "Lezcano", "IvanYashchuk"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "FFT", - "patterns": [ - "aten/src/ATen/native/cuda/*FFT*.h", - "aten/src/ATen/native/SpectralOps.cpp", - "aten/src/ATen/native/mkl/SpectralOps.cpp", - "aten/src/ATen/native/cuda/SpectralOps.*", - "docs/source/fft.rst", - "torch/fft/**", - "torch/csrc/api/include/torch/fft.h", - "torch/**python_fft_functions.*", - "tools/autograd/templates/python_fft_functions.cpp", - "test/cpp/api/fft.cpp" - ], - "approved_by": ["mruberry", "peterbell10"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "Sparse", - "patterns": [ - "benchmarks/sparse", - "c10/util/sparse_bitset.h", - "docs/source/sparse.rst", - "torch/**sparse/**", - "torch/**sparse*", - "torch/optim/sparse*", - "torch/ao/nn/sparse/**", - "torch/utils/benchmark/**sparse*", - "aten/src/ATen/native/ao_sparse/**", - "aten/src/ATen/native/sparse/**", - "aten/src/ATen/**Sparse*", - "aten/src/ATen/*Sparse*", - "torch/_masked/**", - "test/*_masked*", - "test/**sparse*" - ], - "approved_by": ["nikitaved", "cpuhrsch", "pearu", "IvanYashchuk"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "MPS", - "patterns": [ - "test/test_mps.py", - "aten/src/ATen/native/native_functions.yaml", - "aten/src/ATen/mps/**", - "aten/src/ATen/native/mps/**" - ], - "approved_by": ["kulinseth", "razarmehr"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "Distributions", - "patterns": [ - "torch/distributions/**", - 
"test/distributions/**" - ], - "approved_by": ["fritzo", "neerajprad", "alicanb", "vishwakftw"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "Distributed", - "patterns": [ - "docs/source/pipeline.rst", - "docs/source/distributed*", - "docs/source/rpc.rst", - "docs/source/rpc/**", - "docs/source/_static/img/rpc*", - "docs/source/_static/img/*distributed*", - "docs/source/elastic/**", - "benchmarks/distributed/**", - "torch/distributed/**", - "torch/nn/parallel/distributed*", - "torch/_C/_distributed*", - "torch/csrc/distributed/**", - "torch/testing/_internal/distributed/**", - "test/distributed/**", - "test/cpp/dist_autograd/**", - "test/cpp/rpc/**" - ], - "approved_by": ["mrshenli", "pritamdamania87", "d4l3k", "kiukchung", "pietern"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "superuser", - "patterns": ["*"], - "approved_by": ["pytorch/metamates"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - } -] diff --git a/.github/merge_rules.yaml b/.github/merge_rules.yaml new file mode 100644 index 0000000000000..5557926cc2116 --- /dev/null +++ b/.github/merge_rules.yaml @@ -0,0 +1,342 @@ +- name: Core Maintainers + patterns: + - '*' + approved_by: + - soumith + - gchanan + - ezyang + - dzhulgakov + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull + +- name: ONNX exporter + patterns: + - .jenkins/caffe2/* + - aten/src/ATen/core/interned_strings.h + - docs/source/onnx.rst + - docs/source/scripts/onnx/** + - scripts/onnx/** + - test/jit/test_export_modes.py + - test/onnx/** + - tools/onnx/** + - torch/_C/__init__.pyi.in + - torch/csrc/jit/passes/onnx.* + - torch/csrc/jit/passes/onnx/** + - torch/csrc/jit/serialization/export.* + - torch/csrc/jit/serialization/onnx.* + - torch/csrc/onnx/** + - torch/onnx/** + approved_by: + - BowenBao + - abock + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull + +- name: NVFuser + patterns: + - test/test_jit_cuda_fuser.py + - torch/csrc/jit/codegen/fuser/cuda/** + - torch/csrc/jit/codegen/cuda/** + - benchmarks/cpp/nvfuser/** + approved_by: + - csarofeen + - ngimel + - jjsjann123 + - ptrblck + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull + +- name: OSS CI + patterns: + - .github/** + - .circleci/** + - .jenkins/** + - scripts/** + - tools/** + approved_by: + - alband + - dagitses + - pytorch/pytorch-dev-infra + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull + +- name: OSS CI / pytorchbot + patterns: + - .github/ci_commit_pins/vision.txt + - .github/ci_commit_pins/torchdynamo.txt + approved_by: + - pytorchbot + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull + +- name: OSS CI / pytorchbot / XLA + patterns: + - .github/ci_commit_pins/xla.txt + approved_by: + - pytorchbot + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull / linux-bionic-py3_7-clang8-xla / build + - pull / linux-bionic-py3_7-clang8-xla / test (xla, 1, 1, linux.2xlarge) + +- name: Documentation + patterns: + - docs/** + - torch/*docs.py + approved_by: + - svekars + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull + +- name: Mobile + patterns: + - ios/** + - android/** + - test/mobile/** + approved_by: + - linbinyu + - IvanKobzarev + - dreiss + - raziel + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull + +- name: Linear Algebra + patterns: + - aten/src/ATen/native/cuda/linalg/** + - aten/src/ATen/LinalgBackend.h + - 
aten/src/ATen/native/**LinearAlgebra* + - docs/source/linalg.rst + - torch/linalg/** + - torch/_linalg_utils.py + - torch/**python_linalg_functions.* + - torch/**linalg.h + - tools/autograd/templates/python_linalg_functions.cpp + - test/test_linalg.py + approved_by: + - mruberry + - Lezcano + - IvanYashchuk + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull + +- name: FFT + patterns: + - aten/src/ATen/native/cuda/*FFT*.h + - aten/src/ATen/native/SpectralOps.cpp + - aten/src/ATen/native/mkl/SpectralOps.cpp + - aten/src/ATen/native/cuda/SpectralOps.* + - docs/source/fft.rst + - torch/fft/** + - torch/csrc/api/include/torch/fft.h + - torch/**python_fft_functions.* + - tools/autograd/templates/python_fft_functions.cpp + - test/cpp/api/fft.cpp + approved_by: + - mruberry + - peterbell10 + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull + +- name: Sparse + patterns: + - benchmarks/sparse + - c10/util/sparse_bitset.h + - docs/source/sparse.rst + - torch/**sparse/** + - torch/**sparse* + - torch/optim/sparse* + - torch/ao/nn/sparse/** + - torch/utils/benchmark/**sparse* + - aten/src/ATen/native/ao_sparse/** + - aten/src/ATen/native/sparse/** + - aten/src/ATen/**Sparse* + - aten/src/ATen/*Sparse* + - torch/_masked/** + - test/*_masked* + - test/**sparse* + approved_by: + - nikitaved + - cpuhrsch + - pearu + - IvanYashchuk + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull + +- name: MPS + patterns: + - test/test_mps.py + - aten/src/ATen/native/native_functions.yaml + - aten/src/ATen/mps/** + - aten/src/ATen/native/mps/** + approved_by: + - kulinseth + - alband + - malfet + - razarmehr + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull +- name: Distributions + patterns: + - torch/distributions/** + - test/distributions/** + approved_by: + - fritzo + - neerajprad + - alicanb + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull + +- name: Distributed + patterns: + - docs/source/pipeline.rst + - docs/source/distributed* + - docs/source/rpc.rst + - docs/source/rpc/** + - docs/source/_static/img/rpc* + - docs/source/_static/img/*distributed* + - docs/source/elastic/** + - benchmarks/distributed/** + - torch/distributed/** + - torch/nn/parallel/distributed* + - torch/_C/_distributed* + - torch/csrc/distributed/** + - torch/testing/_internal/distributed/** + - test/distributed/** + - test/cpp/dist_autograd/** + - test/cpp/rpc/** + approved_by: + - mrshenli + - pritamdamania87 + - zhaojuanmao + - rohan-varma + - wanchaol + - fduwjj + - H-Huang + - d4l3k + - aazzolini + - kwen2501 + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull + +- name: IDEEP + patterns: + - third_party/ideep + - caffe2/ideep/** + - caffe2/python/ideep/** + approved_by: + - XiaobingSuper + - yanbing-j + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull + +- name: oneDNN graph + patterns: + - torch/csrc/jit/codegen/onednn/** + - test/test_jit_llga_fuser.py + approved_by: + - sanchitintel + - chunyuan-w + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull + +- name: CPU ATen backend + patterns: + - aten/src/ATen/cpu/** + - aten/src/ATen/native/cpu/** + - aten/src/ATen/native/quantized/cpu/** + - aten/src/ATen/native/Convolution*.cpp + - aten/src/ATen/native/mkldnn/** + approved_by: + - mingfeima + - XiaobingSuper + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull + +- name: CPU frontend + patterns: + - torch/cpu/** + - torch/utils/mkldnn.py + - test/test_mkldnn.py + approved_by: + - leslie-fang-intel + - CaoE 
+ mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull + +- name: Autocast + patterns: + - torch/amp/** + - aten/src/ATen/autocast_mode.* + - torch/csrc/jit/passes/autocast.cpp + - test/test_autocast.py + approved_by: + - leslie-fang-intel + - CaoE + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull + +- name: Lazy Tensor + patterns: + - torch/csrc/lazy/** + - test/cpp/lazy/** + - test/lazy/** + - codegen/api/lazy.py + - codegen/dest/lazy_ir.py + - codegen/dest/lazy_ts_lowering.py + - codegen/gen_lazy_tensor.py + - aten/src/ATen/native/ts_native_functions.yaml + approved_by: + - alanwaketan + - JackCaoG + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull + +- name: superuser + patterns: + - '*' + approved_by: + - pytorch/metamates + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull diff --git a/.github/requirements-gha-cache.txt b/.github/requirements-gha-cache.txt new file mode 100644 index 0000000000000..4e757f9307fff --- /dev/null +++ b/.github/requirements-gha-cache.txt @@ -0,0 +1,16 @@ +# This file is to cache other dependencies not specified elsewhere in: +# requirements.txt +# requirements-flake8.txt +# docs/requirements.txt +# docs/cpp/requirements.txt +# functorch/docs/requirements.txt +# .circleci/docker/requirements-ci.txt +cffi==1.15.0 +dataclasses==0.6 +jinja2==3.0.1 +lintrunner==0.9.2 +ninja==1.10.0.post1 +pynvml==11.4.1 +requests==2.26 +rich==10.9.0 +rockset==0.8.10 diff --git a/.github/scale-config.yml b/.github/scale-config.yml index 931ca0ef5f1e2..1cf99b326ba81 100644 --- a/.github/scale-config.yml +++ b/.github/scale-config.yml @@ -65,5 +65,5 @@ runner_types: windows.8xlarge.nvidia.gpu: instance_type: p3.2xlarge os: windows - max_available: 50 + max_available: 100 disk_size: 256 diff --git a/.github/scripts/comment_on_pr.py b/.github/scripts/comment_on_pr.py new file mode 100644 index 0000000000000..06b2eefe09884 --- /dev/null +++ b/.github/scripts/comment_on_pr.py @@ -0,0 +1,34 @@ +from typing import Any +from trymerge import gh_post_pr_comment +from gitutils import get_git_remote_name, get_git_repo_dir, GitRepo +from trymerge_explainer import BOT_COMMANDS_WIKI +import os + + +def parse_args() -> Any: + from argparse import ArgumentParser + + parser = ArgumentParser("Comment on a PR") + parser.add_argument("pr_num", type=int) + parser.add_argument("action", type=str) + return parser.parse_args() + + +def main() -> None: + args = parse_args() + repo = GitRepo(get_git_repo_dir(), get_git_remote_name(), debug=True) + org, project = repo.gh_owner_and_name() + run_url = os.environ.get("GH_RUN_URL") + + job_link = f"[job]({run_url})" if run_url is not None else "job" + msg = ( + f"The {args.action} {job_link} was canceled. If you believe this is a mistake, " + + f"then you can re-trigger it through [pytorch-bot]({BOT_COMMANDS_WIKI})."
+ ) + + gh_post_pr_comment(org, project, args.pr_num, msg) + print(org, project, args.pr_num, msg) + + +if __name__ == "__main__": + main() diff --git a/.github/scripts/generate_binary_build_matrix.py b/.github/scripts/generate_binary_build_matrix.py index 4549a16f7a808..b1e3b46bda344 100644 --- a/.github/scripts/generate_binary_build_matrix.py +++ b/.github/scripts/generate_binary_build_matrix.py @@ -16,7 +16,7 @@ CUDA_ARCHES = ["10.2", "11.3", "11.6", "11.7"] -ROCM_ARCHES = ["5.0", "5.1.1"] +ROCM_ARCHES = ["5.1.1", "5.2"] def arch_type(arch_version: str) -> str: diff --git a/.github/scripts/get_workflow_job_id.py b/.github/scripts/get_workflow_job_id.py index 72aed91d55ca9..e3005a735250f 100644 --- a/.github/scripts/get_workflow_job_id.py +++ b/.github/scripts/get_workflow_job_id.py @@ -31,7 +31,9 @@ args = parser.parse_args() -PYTORCH_REPO = "https://api.github.com/repos/pytorch/pytorch" +# From https://docs.github.com/en/actions/learn-github-actions/environment-variables +PYTORCH_REPO = os.environ.get("GITHUB_REPOSITORY", "pytorch/pytorch") +PYTORCH_GITHUB_API = f"https://api.github.com/repos/{PYTORCH_REPO}" GITHUB_TOKEN = os.environ["GITHUB_TOKEN"] REQUEST_HEADERS = { "Accept": "application/vnd.github.v3+json", @@ -39,7 +41,7 @@ } response = requests.get( - f"{PYTORCH_REPO}/actions/runs/{args.workflow_run_id}/jobs?per_page=100", + f"{PYTORCH_GITHUB_API}/actions/runs/{args.workflow_run_id}/jobs?per_page=100", headers=REQUEST_HEADERS, ) diff --git a/.github/scripts/install_nvidia_utils_linux.sh b/.github/scripts/install_nvidia_utils_linux.sh index b854320c9eaa4..b5274fb5805fb 100755 --- a/.github/scripts/install_nvidia_utils_linux.sh +++ b/.github/scripts/install_nvidia_utils_linux.sh @@ -2,8 +2,9 @@ set -eou pipefail + DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID) \ -DRIVER_FN="NVIDIA-Linux-x86_64-510.60.02.run" +DRIVER_FN="NVIDIA-Linux-x86_64-515.57.run" YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo" install_nvidia_docker2_amzn2() { @@ -24,6 +25,7 @@ install_nvidia_driver_amzn2() { # ensure our kernel install is the same as our underlying kernel, # groupinstall "Development Tools" has a habit of mismatching kernel headers sudo yum install -y "kernel-devel-uname-r == $(uname -r)" + sudo modprobe backlight sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN" sudo /bin/bash /tmp/nvidia_driver -s --no-drm || (sudo cat /var/log/nvidia-installer.log && false) sudo rm -fv /tmp/nvidia_driver diff --git a/.github/scripts/lint_test_ownership.py b/.github/scripts/lint_test_ownership.py deleted file mode 100755 index 270019c0f5634..0000000000000 --- a/.github/scripts/lint_test_ownership.py +++ /dev/null @@ -1,88 +0,0 @@ -#!/usr/bin/env python3 -''' -Test ownership was introduced in https://github.com/pytorch/pytorch/issues/66232. - -This lint verifies that every Python test file (file that matches test_*.py or *_test.py in the test folder) -has valid ownership information in a comment header. Valid means: - - The format of the header follows the pattern "# Owner(s): ["list", "of owner", "labels"] - - Each owner label actually exists in PyTorch - - Each owner label starts with "module: " or "oncall: " or is in ACCEPTABLE_OWNER_LABELS - -This file is expected to run in the root directory of pytorch/pytorch. 
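For context on the `get_workflow_job_id.py` hunk above: deriving the API base from `GITHUB_REPOSITORY` (with a `pytorch/pytorch` fallback) lets the same script query whichever repository the workflow is actually running in, including forks. A minimal sketch of the resulting request; the `Authorization` header shape is assumed here since the diff truncates the headers dict:

```python
import os

import requests  # the script already depends on requests

# GITHUB_REPOSITORY is provided by the Actions runner as "owner/repo";
# the fallback keeps ad-hoc local runs pointed at pytorch/pytorch.
PYTORCH_REPO = os.environ.get("GITHUB_REPOSITORY", "pytorch/pytorch")
PYTORCH_GITHUB_API = f"https://api.github.com/repos/{PYTORCH_REPO}"


def list_workflow_jobs(workflow_run_id: int, token: str) -> dict:
    """Return the jobs of a workflow run from whichever repo the workflow runs in."""
    headers = {
        "Accept": "application/vnd.github.v3+json",
        "Authorization": f"token {token}",  # assumed header shape
    }
    response = requests.get(
        f"{PYTORCH_GITHUB_API}/actions/runs/{workflow_run_id}/jobs?per_page=100",
        headers=headers,
    )
    response.raise_for_status()
    return response.json()
```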
-''' -import boto3 # type: ignore[import] -import botocore # type: ignore[import] -import fnmatch -import json -import sys -from pathlib import Path -from typing import List, Any - - -# Team/owner labels usually start with "module: " or "oncall: ", but the following are acceptable exceptions -ACCEPTABLE_OWNER_LABELS = ["NNC", "high priority"] -GLOB_EXCEPTIONS = [ - "**/test/run_test.py" -] - -PYTORCH_ROOT = Path(__file__).resolve().parent.parent.parent -TEST_DIR = PYTORCH_ROOT / "test" -CURRENT_FILE_NAME = Path(__file__).resolve().relative_to(PYTORCH_ROOT) - -S3_RESOURCE_READ_ONLY = boto3.resource("s3", config=botocore.config.Config(signature_version=botocore.UNSIGNED)) - - -def get_all_test_files() -> List[Path]: - test_files = list(TEST_DIR.glob("**/test_*.py")) - test_files.extend(list(TEST_DIR.glob("**/*_test.py"))) - return [f for f in test_files if not any([fnmatch.fnmatch(str(f), g) for g in GLOB_EXCEPTIONS])] - - -def get_pytorch_labels() -> Any: - bucket = S3_RESOURCE_READ_ONLY.Bucket("ossci-metrics") - summaries = bucket.objects.filter(Prefix="pytorch_labels.json") - for summary in summaries: - labels = summary.get()["Body"].read() - return json.loads(labels) - - -# Returns a string denoting the error invalidating the label OR an empty string if nothing is wrong -def validate_label(label: str, pytorch_labels: List[str]) -> str: - if label not in pytorch_labels: - return f"{label} is not a PyTorch label (please choose from https://github.com/pytorch/pytorch/labels)" - if label.startswith("module:") or label.startswith("oncall:") or label in ACCEPTABLE_OWNER_LABELS: - return "" - return f"{label} is not an acceptable owner (please update to another label or edit ACCEPTABLE_OWNERS_LABELS " \ - "in {CURRENT_FILE_NAME}" - - -# Returns a string denoting the error invalidating the file OR an empty string if nothing is wrong -def validate_file(filename: Path, pytorch_labels: List[str]) -> str: - prefix = "# Owner(s): " - relative_name = Path(filename).relative_to(PYTORCH_ROOT) - with open(filename) as f: - for line in f.readlines(): - if line.startswith(prefix): - labels = json.loads(line[len(prefix):]) - labels_msgs = [validate_label(label, pytorch_labels) for label in labels] - file_msg = ", ".join([x for x in labels_msgs if x != ""]) - return f"{relative_name}: {file_msg}" if file_msg != "" else "" - return f"{relative_name}: missing a comment header with ownership information." - - -def main() -> None: - test_file_paths = get_all_test_files() - pytorch_labels = get_pytorch_labels() - - file_msgs = [validate_file(f, pytorch_labels) for f in test_file_paths] - err_msg = "\n".join([x for x in file_msgs if x != ""]) - if err_msg != "": - err_msg = err_msg + "\n\nIf you see files with missing ownership information above, " \ - "please add the following line\n\n# Owner(s): [\"\"]\n\nto the top of each test file. " \ - "The owner should be an existing pytorch/pytorch label." 
- print(err_msg) - sys.exit(1) - - -if __name__ == '__main__': - main() diff --git a/.github/scripts/test_trymerge.py b/.github/scripts/test_trymerge.py index af3faf8cd0948..572863098da72 100755 --- a/.github/scripts/test_trymerge.py +++ b/.github/scripts/test_trymerge.py @@ -18,9 +18,12 @@ gh_get_team_members, read_merge_rules, validate_revert, + filter_pending_checks, + filter_failed_checks, GitHubPR, MergeRule, MandatoryChecksMissingError, + WorkflowCheckState, main as trymerge_main) from gitutils import get_git_remote_name, get_git_repo_dir, GitRepo from typing import Any, List, Optional @@ -139,7 +142,7 @@ def commit_message(self, ref: str) -> str: class TestGitHubPR(TestCase): def test_merge_rules_valid(self) -> None: - "Test that merge_rules.json can be parsed" + "Test that merge_rules.yaml can be parsed" repo = DummyGitRepo() self.assertGreater(len(read_merge_rules(repo, "pytorch", "pytorch")), 1) @@ -337,5 +340,21 @@ def test_revert_rules(self, mock_gql: Any) -> None: repo = DummyGitRepo() self.assertIsNotNone(validate_revert(repo, pr, comment_id=1189459845)) + def test_checks_filter(self) -> None: + checks = [ + WorkflowCheckState(name="check0", status="SUCCESS", url="url0"), + WorkflowCheckState(name="check1", status="FAILURE", url="url1"), + WorkflowCheckState(name="check2", status="STARTUP_FAILURE", url="url2"), + WorkflowCheckState(name="check3", status=None, url="url3"), + ] + + checks_dict = {check.name : check for check in checks} + + pending_checks = filter_pending_checks(checks_dict) + failing_checks = filter_failed_checks(checks_dict) + + self.assertListEqual(failing_checks, [checks[1], checks[2]]) + self.assertListEqual(pending_checks, [checks[3]]) + if __name__ == "__main__": main() diff --git a/.github/scripts/trymerge.py b/.github/scripts/trymerge.py index 9e23869cb3804..e64d9c9ea16df 100755 --- a/.github/scripts/trymerge.py +++ b/.github/scripts/trymerge.py @@ -6,15 +6,43 @@ import re import time import urllib.parse -from datetime import datetime from dataclasses import dataclass -from urllib.request import urlopen, Request -from urllib.error import HTTPError -from typing import Iterable, Pattern, cast, Any, Callable, Dict, List, Optional, Tuple, Union -from gitutils import get_git_remote_name, get_git_repo_dir, patterns_to_regex, GitRepo +from datetime import datetime from functools import lru_cache +import yaml +from typing import ( + Any, + Callable, + Dict, + Iterable, + List, + Optional, + Pattern, + Tuple, + Union, + cast, + NamedTuple +) +from urllib.error import HTTPError +from urllib.request import Request, urlopen from warnings import warn +from gitutils import ( + GitRepo, + get_git_remote_name, + get_git_repo_dir, + patterns_to_regex, +) +from trymerge_explainer import ( + TryMergeExplainer, + get_land_check_troubleshooting_message, + get_revert_message, +) + +class WorkflowCheckState(NamedTuple): + status: Optional[str] + url: str + name: str GH_PR_REVIEWS_FRAGMENT = """ fragment PRReviews on PullRequestReviewConnection { @@ -373,7 +401,6 @@ RE_DIFF_REV = re.compile(r'^Differential Revision:.+?(D[0-9]+)', re.MULTILINE) CIFLOW_LABEL = re.compile(r"^ciflow/.+") CIFLOW_TRUNK_LABEL = re.compile(r"^ciflow/trunk") -BOT_COMMANDS_WIKI = 'https://github.com/pytorch/pytorch/wiki/Bot-commands' def _fetch_url(url: str, *, headers: Optional[Dict[str, str]] = None, @@ -465,12 +492,11 @@ def get_check_run_name_prefix(workflow_run: Any) -> str: else: return f'{workflow_run["workflow"]["name"]} / ' - def add_workflow_conclusions( checksuites: Any, 
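The new `test_checks_filter` case above exercises the `WorkflowCheckState` named tuple together with the `filter_pending_checks` / `filter_failed_checks` helpers that this patch adds further down in `trymerge.py`. A self-contained sketch of that behavior, condensed from the test and the helpers (not the exact module code):

```python
from typing import Callable, Dict, List, NamedTuple, Optional


class WorkflowCheckState(NamedTuple):
    # Same shape as the named tuple added to trymerge.py.
    status: Optional[str]  # None means the check has not reported a conclusion yet
    url: str
    name: str


def filter_checks_with_lambda(
    checks: Dict[str, WorkflowCheckState],
    status_filter: Callable[[Optional[str]], bool],
) -> List[WorkflowCheckState]:
    return [check for check in checks.values() if status_filter(check.status)]


def filter_pending_checks(checks: Dict[str, WorkflowCheckState]) -> List[WorkflowCheckState]:
    return filter_checks_with_lambda(checks, lambda status: status is None)


def filter_failed_checks(checks: Dict[str, WorkflowCheckState]) -> List[WorkflowCheckState]:
    return filter_checks_with_lambda(checks, lambda status: status in ["FAILURE", "STARTUP_FAILURE"])


checks = {
    check.name: check
    for check in [
        WorkflowCheckState(name="check0", status="SUCCESS", url="url0"),
        WorkflowCheckState(name="check1", status="FAILURE", url="url1"),
        WorkflowCheckState(name="check2", status="STARTUP_FAILURE", url="url2"),
        WorkflowCheckState(name="check3", status=None, url="url3"),
    ]
}
assert [c.name for c in filter_failed_checks(checks)] == ["check1", "check2"]
assert [c.name for c in filter_pending_checks(checks)] == ["check3"]
```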
get_next_checkruns_page: Callable[[List[Dict[str, Dict[str, Any]]], int, Any], Any], get_next_checksuites: Callable[[Any], Any] -) -> Dict[str, Tuple[str, str]]: +) -> Dict[str, WorkflowCheckState]: conclusions = {} def add_conclusions(edges: Any) -> None: @@ -484,14 +510,23 @@ def add_conclusions(edges: Any) -> None: # Do not override existing status with cancelled if workflow_conclusion == "CANCELLED" and workflow_name in conclusions: continue - conclusions[workflow_name] = (workflow_conclusion, node["url"]) + conclusions[workflow_name] = WorkflowCheckState( + name=workflow_name, + status=workflow_conclusion, + url=node["url"]) has_failing_check = False while checkruns is not None: for checkrun_node in checkruns["nodes"]: + if not isinstance(checkrun_node, dict): + warn(f"Expected dictionary, but got {type(checkrun_node)}") + continue if checkrun_node["conclusion"] == 'FAILURE': has_failing_check = True - conclusions[f'{get_check_run_name_prefix(workflow_run)}{checkrun_node["name"]}'] = ( - checkrun_node["conclusion"], checkrun_node["detailsUrl"] + checkrun_name = f'{get_check_run_name_prefix(workflow_run)}{checkrun_node["name"]}' + conclusions[checkrun_name] = WorkflowCheckState( + name=checkrun_name, + status=checkrun_node["conclusion"], + url=checkrun_node["detailsUrl"] ) if bool(checkruns["pageInfo"]["hasNextPage"]): checkruns = get_next_checkruns_page(edges, edge_idx, checkruns) @@ -499,7 +534,11 @@ def add_conclusions(edges: Any) -> None: checkruns = None # Github doesn't set conclusion to failure if a job is still pending if workflow_run is not None and has_failing_check: - conclusions[workflow_run["workflow"]["name"]] = ("FAILURE", node["url"]) + workflow_name = workflow_run["workflow"]["name"] + conclusions[workflow_name] = WorkflowCheckState( + name=workflow_name, + status="FAILURE", + url=node["url"]) add_conclusions(checksuites["edges"]) while bool(checksuites["pageInfo"]["hasNextPage"]): @@ -550,7 +589,7 @@ def __init__(self, org: str, project: str, pr_num: int) -> None: self.info = gh_get_pr_info(org, project, pr_num) self.changed_files: Optional[List[str]] = None self.labels: Optional[List[str]] = None - self.conclusions: Optional[Dict[str, Tuple[str, str]]] = None + self.conclusions: Optional[Dict[str, WorkflowCheckState]] = None self.comments: Optional[List[GitHubComment]] = None self._authors: Optional[List[Tuple[str, str]]] = None self._reviews: Optional[List[Tuple[str, str]]] = None @@ -678,7 +717,7 @@ def get_labels(self) -> List[str]: self.labels = labels return self.labels - def get_checkrun_conclusions(self) -> Dict[str, Tuple[str, str]]: + def get_checkrun_conclusions(self) -> Dict[str, WorkflowCheckState]: """ Returns dict of checkrun -> [conclusion, url] """ if self.conclusions is not None: return self.conclusions @@ -803,9 +842,15 @@ def has_internal_changes(self) -> bool: checks = self.get_checkrun_conclusions() if checks is None or checkrun_name not in checks: return False - return checks[checkrun_name][0] != "SUCCESS" - - def merge_ghstack_into(self, repo: GitRepo, force: bool, comment_id: Optional[int] = None) -> None: + return checks[checkrun_name].status != "SUCCESS" + + def merge_ghstack_into( + self, + repo: GitRepo, + force: bool, + comment_id: Optional[int] = None, + land_check_commit: Optional[str] = None + ) -> None: assert self.is_ghstack_pr() # For ghstack, cherry-pick commits based from origin orig_ref = f"{repo.remote}/{re.sub(r'/head$', '/orig', self.head_ref())}" @@ -826,7 +871,12 @@ def merge_ghstack_into(self, repo: GitRepo, force: bool, 
comment_id: Optional[in continue commit_msg = pr.gen_commit_message(filter_ghstack=True) # Raises exception if matching rule is not found - find_matching_merge_rule(pr, repo, force=force, skip_internal_checks=can_skip_internal_checks(self, comment_id)) + find_matching_merge_rule( + pr, + repo, + force=force, + skip_internal_checks=can_skip_internal_checks(self, comment_id), + land_check_commit=land_check_commit) repo.cherry_pick(rev) repo.amend_commit_message(commit_msg) @@ -846,10 +896,16 @@ def gen_commit_message(self, filter_ghstack: bool = False) -> str: def merge_into(self, repo: GitRepo, *, force: bool = False, dry_run: bool = False, - comment_id: Optional[int] = None) -> None: + comment_id: Optional[int] = None, + land_check_commit: Optional[str] = None) -> None: # Raises exception if matching rule is not found - find_matching_merge_rule(self, repo, force=force, skip_internal_checks=can_skip_internal_checks(self, comment_id)) - self.merge_changes(repo, force, comment_id) + find_matching_merge_rule( + self, + repo, + force=force, + skip_internal_checks=can_skip_internal_checks(self, comment_id), + land_check_commit=land_check_commit) + self.merge_changes(repo, force, comment_id, land_check_commit=land_check_commit) repo.push(self.default_branch(), dry_run) if not dry_run: @@ -859,6 +915,7 @@ def merge_changes(self, repo: GitRepo, force: bool = False, comment_id: Optional[int] = None, + land_check_commit: Optional[str] = None, branch: Optional[str] = None) -> None: branch_to_merge_into = self.default_branch() if branch is None else branch if repo.current_branch() != branch_to_merge_into: @@ -870,13 +927,14 @@ def merge_changes(self, repo._run_git("merge", "--squash", pr_branch_name) repo._run_git("commit", f"--author=\"{self.get_author()}\"", "-m", msg) else: - self.merge_ghstack_into(repo, force, comment_id=comment_id) + self.merge_ghstack_into(repo, force, comment_id=comment_id, land_check_commit=land_check_commit) def create_land_time_check_branch(self, repo: GitRepo, branch: str, force: bool = False, comment_id: Optional[int] = None,) -> str: + orig_branch = repo.current_branch() self.merge_changes(repo, branch=branch, force=force, comment_id=comment_id) land_check_branch = f'landchecks/{self.pr_num}' try: @@ -886,11 +944,9 @@ def create_land_time_check_branch(self, repo._run_git('checkout', "-b", land_check_branch) repo._run_git('push', '-u', 'origin', land_check_branch, '--force') commit = repo.get_commit('HEAD').commit_hash - gh_post_pr_comment(self.org, self.project, self.pr_num, - '@pytorchbot successfully started a merge and created land time checks.' 
+ - f' See merge status [here]({os.getenv("GH_RUN_URL")}) ' + - f'and [land check]({BOT_COMMANDS_WIKI}) ' - f'progress [here](https://hud.pytorch.org/{self.org}/{self.project}/commit/{commit}).') + # Important, return to original branch + if repo.current_branch() != orig_branch: + repo.checkout(orig_branch) return commit @@ -912,7 +968,7 @@ class MergeRule: def read_merge_rules(repo: Optional[GitRepo], org: str, project: str) -> List[MergeRule]: from pathlib import Path - repo_relative_rules_path = Path(".github") / "merge_rules.json" + repo_relative_rules_path = Path(".github") / "merge_rules.yaml" if repo is None: json_data = _fetch_url( f"https://api.github.com/repos/{org}/{project}/contents/{repo_relative_rules_path}", @@ -920,21 +976,22 @@ def read_merge_rules(repo: Optional[GitRepo], org: str, project: str) -> List[Me reader=json.load, ) content = base64.b64decode(json_data["content"]) - return cast(List[MergeRule], json.loads(content, object_hook=lambda x: MergeRule(**x))) + return [MergeRule(**x) for x in yaml.safe_load(content)] else: rules_path = Path(repo.repo_dir) / repo_relative_rules_path if not rules_path.exists(): print(f"{rules_path} does not exist, returning empty rules") return [] with open(rules_path) as fp: - rc = json.load(fp, object_hook=lambda x: MergeRule(**x)) - return cast(List[MergeRule], rc) + rc = yaml.safe_load(fp) + return [MergeRule(**x) for x in rc] def find_matching_merge_rule(pr: GitHubPR, repo: Optional[GitRepo] = None, force: bool = False, - skip_internal_checks: bool = False + skip_internal_checks: bool = False, + land_check_commit: Optional[str] = None, ) -> MergeRule: """Returns merge rule matching to this pr or raises an exception""" changed_files = pr.get_changed_files() @@ -984,21 +1041,27 @@ def find_matching_merge_rule(pr: GitHubPR, f"{', '.join(list(rule_approvers_set)[:5])}{', ...' if len(rule_approvers_set) > 5 else ''}") continue mandatory_checks = rule.mandatory_checks_name if rule.mandatory_checks_name is not None else [] - checks = pr.get_checkrun_conclusions() + checks = get_combined_checks_from_pr_and_land_validation(pr, land_check_commit) required_checks = filter(lambda x: force is False or "CLA Check" in x, mandatory_checks) [pending_checks, failed_checks] = categorize_checks(checks, required_checks) if len(failed_checks) > 0: if reject_reason_score < 30000: reject_reason_score = 30000 - reject_reason = ("Refusing to merge as mandatory check(s) " + - checks_to_str(failed_checks) + f" failed for rule {rule_name}") + reject_reason = ( + f"[View failures on hud](https://hud.pytorch.org/{pr.org}/{pr.project}/commit/{pr.last_commit()['oid']}). " + + f"Refusing to merge as mandatory check(s) {checks_to_str(failed_checks)} failed for " + + f"rule {rule_name}." + ) continue elif len(pending_checks) > 0: if reject_reason_score < 20000: reject_reason_score = 20000 - reject_reason = f"Refusing to merge as mandatory check(s) {checks_to_str(pending_checks)}" - reject_reason += f" are pending/not yet run for rule {rule_name}" + reject_reason = ( + f"[View pending jobs on hud](https://hud.pytorch.org/{pr.org}/{pr.project}/commit/{pr.last_commit()['oid']}). " + + f"Refusing to merge as mandatory check(s) {checks_to_str(pending_checks)} are pending/not yet run for " + + f"rule {rule_name}." 
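With `read_merge_rules` now reading `merge_rules.yaml`, each YAML entry is unpacked directly into the `MergeRule` dataclass via `yaml.safe_load`. A minimal sketch of that round trip; the field names are inferred from the YAML keys added earlier in this patch, and the real dataclass is the one defined in `trymerge.py`:

```python
from dataclasses import dataclass
from typing import List, Optional

import yaml  # trymerge.py now imports yaml for exactly this parsing


@dataclass
class MergeRule:
    # Field names inferred from the keys used in .github/merge_rules.yaml.
    name: str
    patterns: List[str]
    approved_by: List[str]
    mandatory_checks_name: Optional[List[str]] = None


rules_yaml = """
- name: Lazy Tensor
  patterns:
    - torch/csrc/lazy/**
    - test/lazy/**
  approved_by:
    - alanwaketan
    - JackCaoG
  mandatory_checks_name:
    - Facebook CLA Check
    - Lint
    - pull
"""

# yaml.safe_load yields a list of dicts; each dict maps one-to-one onto MergeRule.
rules = [MergeRule(**entry) for entry in yaml.safe_load(rules_yaml)]
assert rules[0].name == "Lazy Tensor"
assert "pull" in (rules[0].mandatory_checks_name or [])
```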
+ ) continue if not skip_internal_checks and pr.has_internal_changes(): raise RuntimeError("This PR has internal changes and must be landed via Phabricator") @@ -1008,7 +1071,7 @@ def find_matching_merge_rule(pr: GitHubPR, raise RuntimeError(reject_reason) -def get_land_checkrun_conclusions(org: str, project: str, commit: str) -> Dict[str, Tuple[str, str]]: +def get_land_checkrun_conclusions(org: str, project: str, commit: str) -> Dict[str, WorkflowCheckState]: def get_commit_next_check_runs(edges: List[Dict[str, Dict[str, Any]]], edge_idx: int, checkruns: Any) -> Any: rc = gh_graphql(GH_GET_COMMIT_NEXT_CHECK_RUNS, @@ -1037,18 +1100,44 @@ def get_commit_next_checksuites(checksuites: Any) -> Any: def checks_to_str(checks: List[Tuple[str, Optional[str]]]) -> str: return ", ".join(f"[{c[0]}]({c[1]})" if c[1] is not None else c[0] for c in checks) -def pr_get_checks_with_lambda(pr: GitHubPR, status_check: Callable[[Optional[str]], bool]) -> List[Tuple[str, str]]: - checks = pr.get_checkrun_conclusions() - return [(name, status[1]) for name, status in checks.items() if status_check(status[0])] +def get_combined_checks_from_pr_and_land_validation( + pr: GitHubPR, + land_check_commit: Optional[str] +) -> Dict[str, WorkflowCheckState]: + """ + Combines checks from both the PR and land validation to get a holistic view + of all checks. + + This helps us cover the corner case where certain workflows may have been + requested on the PR but are not part of land validation (e.g. nightly + builds) or are implicitly run on PRs but not on land validation branches + (like CLA Checks). + + At the same time, we prioritize the signal workflows which do run on land + validation. + E.g. if a workflow fails on the PR but passes on land validation then we'd + use the successful result from the land validation. + """ -def pr_get_pending_checks(pr: GitHubPR) -> List[Tuple[str, str]]: - return pr_get_checks_with_lambda(pr, lambda x: x is None) + pr_checks = pr.get_checkrun_conclusions() + land_validation_checks = get_land_checkrun_conclusions(pr.org, pr.project, land_check_commit) if land_check_commit else {} + # Merge the two checks together. 
Land validation check results (if any) overwrite pr check results + merged_checks = {**pr_checks, **land_validation_checks} # explanation: https://stackoverflow.com/a/26853961/21539 + return merged_checks -def pr_get_failed_checks(pr: GitHubPR) -> List[Tuple[str, str]]: - return pr_get_checks_with_lambda(pr, lambda x: x in ["FAILURE", "STARTUP_FAILURE"]) +def filter_checks_with_lambda( + checks: Dict[str, WorkflowCheckState], + status_filter: Callable[[Optional[str]], bool] +) -> List[WorkflowCheckState]: + return [check for check in checks.values() if status_filter(check.status)] +def filter_pending_checks(checks: Dict[str, WorkflowCheckState]) -> List[WorkflowCheckState]: + return filter_checks_with_lambda(checks, lambda x: x is None) + +def filter_failed_checks(checks: Dict[str, WorkflowCheckState]) -> List[WorkflowCheckState]: + return filter_checks_with_lambda(checks, lambda x: x in ["FAILURE", "STARTUP_FAILURE"]) def validate_revert(repo: GitRepo, pr: GitHubPR, *, comment_id: Optional[int] = None) -> Tuple[str, str]: @@ -1133,7 +1222,7 @@ def check_for_sev(org: str, project: str, force: bool) -> None: def validate_land_time_checks(org: str, project: str, commit: str) -> None: checks = get_land_checkrun_conclusions(org, project, commit) - if(len(checks) == 0): + if len(checks) == 0: raise MandatoryChecksMissingError("Refusing to merge as land check(s) are not yet run") [pending_checks, failed_checks] = categorize_checks(checks, checks) @@ -1146,19 +1235,17 @@ def validate_land_time_checks(org: str, project: str, commit: str) -> None: def has_label(labels: List[str], pattern: Pattern[str] = CIFLOW_LABEL) -> bool: return len(list(filter(pattern.match, labels))) > 0 -def categorize_checks(check_runs: Dict[str, Tuple[str, str]], +def categorize_checks(check_runs: Dict[str, WorkflowCheckState], required_checks: Iterable[str]) -> Tuple[List[Tuple[str, Optional[str]]], List[Tuple[str, Optional[str]]]]: pending_checks: List[Tuple[str, Optional[str]]] = [] failed_checks: List[Tuple[str, Optional[str]]] = [] for checkname in required_checks: if checkname not in check_runs: pending_checks.append((checkname, None)) - elif check_runs[checkname][0] is None: - pending_checks.append((checkname, check_runs[checkname][1])) - elif (check_runs[checkname][0].upper() != 'SUCCESS' - and check_runs[checkname][0].upper() != 'SKIPPED' - and check_runs[checkname][0].upper() != 'NEUTRAL'): - failed_checks.append((checkname, check_runs[checkname][1])) + elif check_runs[checkname].status is None: + pending_checks.append((checkname, check_runs[checkname].url)) + elif (str(check_runs[checkname].status).upper() not in ['SUCCESS', 'SKIPPED', 'NEUTRAL']): + failed_checks.append((checkname, check_runs[checkname].url)) return (pending_checks, failed_checks) def merge(pr_num: int, repo: GitRepo, @@ -1174,16 +1261,24 @@ def merge(pr_num: int, repo: GitRepo, org, project = repo.gh_owner_and_name() pr = GitHubPR(org, project, pr_num) initial_commit_sha = pr.last_commit()['oid'] + explainer = TryMergeExplainer(force, on_green, land_checks, pr.get_labels(), pr.pr_num, org, project) + on_green, land_checks = explainer.get_flags() + land_check_commit = None + check_for_sev(org, project, force) + if force or can_skip_internal_checks(pr, comment_id): # do not wait for any pending signals if PR is closed as part of co-development process + gh_post_pr_comment(org, project, pr.pr_num, explainer.get_merge_message()) return pr.merge_into(repo, dry_run=dry_run, force=force, comment_id=comment_id) - if (datetime.utcnow() - 
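`get_combined_checks_from_pr_and_land_validation` combines the two result sets with `{**pr_checks, **land_validation_checks}`, relying on the fact that keys from the second dict win. Reduced to plain status strings for illustration:

```python
# Later keys win in a dict-unpacking merge, so a check that failed on the PR but
# passed on the land-validation branch is reported with the land-validation result,
# while PR-only workflows (e.g. CLA checks) are still carried over.
pr_checks = {"pull / linux-test": "FAILURE", "Facebook CLA Check": "SUCCESS"}
land_validation_checks = {"pull / linux-test": "SUCCESS"}

merged_checks = {**pr_checks, **land_validation_checks}
assert merged_checks["pull / linux-test"] == "SUCCESS"
assert merged_checks["Facebook CLA Check"] == "SUCCESS"
```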
pr.last_pushed_at()).days > stale_pr_days: - raise RuntimeError("This PR is too stale; the last push date was more than 3 days ago. Please rebase and try again.") if land_checks: land_check_commit = pr.create_land_time_check_branch(repo, 'viable/strict', force=force, comment_id=comment_id) + gh_post_pr_comment(org, project, pr.pr_num, explainer.get_merge_message(land_check_commit)) + if (datetime.utcnow() - pr.last_pushed_at()).days > stale_pr_days: + raise RuntimeError("This PR is too stale; the last push date was more than 3 days ago. Please rebase and try again.") + start_time = time.time() last_exception = '' elapsed_time = 0.0 @@ -1197,26 +1292,27 @@ def merge(pr_num: int, repo: GitRepo, raise RuntimeError("New commits were pushed while merging. Please rerun the merge command.") try: find_matching_merge_rule(pr, repo) - pending = pr_get_pending_checks(pr) - failing = pr_get_failed_checks(pr) + checks = get_combined_checks_from_pr_and_land_validation(pr, land_check_commit) + pending = filter_pending_checks(checks) + failing = filter_failed_checks(checks) # HACK until GitHub will be better about surfacing those - startup_failures = pr_get_checks_with_lambda(pr, lambda x: x == "STARTUP_FAILURE") + startup_failures = filter_checks_with_lambda(checks, lambda status: status == "STARTUP_FAILURE") if len(startup_failures) > 0: raise RuntimeError(f"{len(failing)} STARTUP failures reported, please check workflows syntax! " + - ' ,'.join(f"[{x[0]}]({x[1]})" for x in startup_failures[:5])) + ' ,'.join(f"[{x.name}]({x.url})" for x in startup_failures[:5])) # END of HACK if (not mandatory_only and on_green) and len(failing) > 0: raise RuntimeError(f"{len(failing)} additional jobs have failed, first few of them are: " + - ' ,'.join(f"[{x[0]}]({x[1]})" for x in failing[:5])) + ' ,'.join(f"[{x.name}]({x.url})" for x in failing[:5])) if (not mandatory_only and on_green) and len(pending) > 0: raise MandatoryChecksMissingError(f"Still waiting for {len(pending)} additional jobs to finish, " + - f"first few of them are: {' ,'.join(x[0] for x in pending[:5])}") - if land_checks: + f"first few of them are: {' ,'.join(x.name for x in pending[:5])}") + if land_checks and land_check_commit is not None: validate_land_time_checks(org, project, land_check_commit) - return pr.merge_into(repo, dry_run=dry_run, force=force, comment_id=comment_id) + return pr.merge_into(repo, dry_run=dry_run, force=force, comment_id=comment_id, land_check_commit=land_check_commit) except MandatoryChecksMissingError as ex: last_exception = str(ex) print(f"Merge of https://github.com/{org}/{project}/pull/{pr_num} failed due to: {ex}. Retrying in 5 min") @@ -1233,28 +1329,21 @@ def main() -> None: repo = GitRepo(get_git_repo_dir(), get_git_remote_name()) org, project = repo.gh_owner_and_name() pr = GitHubPR(org, project, args.pr_num) - land_checks = args.land_checks and not has_label(pr.get_labels(), CIFLOW_TRUNK_LABEL) def handle_exception(e: Exception, msg: str = "Merge failed") -> None: - msg += f" due to {e}" + msg += f"\nReason: {e}" run_url = os.getenv("GH_RUN_URL") if run_url is not None: - msg += f"\nRaised by {run_url}" - if land_checks: - msg += (" If you believe this is an error, you can use the old behavior with `@pytorchbot merge -g`" + - ' (optionally with the "ciflow/trunk" to get land signals)' + - ' or use `@pytorchbot merge -f "some reason here"`.' 
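The reworked `categorize_checks` above sorts required checks into pending and failed buckets based on `WorkflowCheckState.status`. A condensed restatement with illustrative data (the named tuple here mirrors the one added to `trymerge.py`):

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class WorkflowCheckState(NamedTuple):  # mirrors the trymerge.py named tuple
    status: Optional[str]
    url: str
    name: str


def categorize(
    check_runs: Dict[str, WorkflowCheckState], required_checks: List[str]
) -> Tuple[List[Tuple[str, Optional[str]]], List[Tuple[str, Optional[str]]]]:
    pending: List[Tuple[str, Optional[str]]] = []
    failed: List[Tuple[str, Optional[str]]] = []
    for name in required_checks:
        if name not in check_runs:
            pending.append((name, None))  # required but never reported
        elif check_runs[name].status is None:
            pending.append((name, check_runs[name].url))  # reported, still running
        elif str(check_runs[name].status).upper() not in ["SUCCESS", "SKIPPED", "NEUTRAL"]:
            failed.append((name, check_runs[name].url))  # anything else counts as failed
    return pending, failed


check_runs = {
    "Lint": WorkflowCheckState(status="SUCCESS", url="u1", name="Lint"),
    "pull": WorkflowCheckState(status=None, url="u2", name="pull"),
    "Facebook CLA Check": WorkflowCheckState(status="FAILURE", url="u3", name="Facebook CLA Check"),
}
pending, failed = categorize(check_runs, ["Facebook CLA Check", "Lint", "pull", "trunk"])
assert pending == [("pull", "u2"), ("trunk", None)]
assert failed == [("Facebook CLA Check", "u3")]
```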
+ - f" For more information, see the [bot wiki]({BOT_COMMANDS_WIKI}).") + msg += f"\nRaised by [workflow job]({run_url})" + if args.land_checks: + msg += get_land_check_troubleshooting_message() gh_post_pr_comment(org, project, args.pr_num, msg, dry_run=args.dry_run) import traceback traceback.print_exc() - if not land_checks: - msg = f"@pytorchbot successfully started a {'revert' if args.revert else 'merge'} job." - msg += f" Check the current status [here]({os.getenv('GH_RUN_URL')})" - gh_post_pr_comment(org, project, args.pr_num, msg, dry_run=args.dry_run) if args.revert: try: + gh_post_pr_comment(org, project, args.pr_num, get_revert_message(org, project, pr.pr_num), args.dry_run) try_revert(repo, pr, dry_run=args.dry_run, comment_id=args.comment_id, reason=args.reason) except Exception as e: handle_exception(e, f"Reverting PR {args.pr_num} failed") @@ -1269,14 +1358,13 @@ def handle_exception(e: Exception, msg: str = "Merge failed") -> None: return try: - on_green = args.on_green or has_label(pr.get_labels(), CIFLOW_LABEL) merge(args.pr_num, repo, dry_run=args.dry_run, force=args.force, comment_id=args.comment_id, - on_green=on_green, + on_green=args.on_green, mandatory_only=args.on_mandatory, - land_checks=land_checks) + land_checks=args.land_checks) except Exception as e: handle_exception(e) diff --git a/.github/scripts/trymerge_explainer.py b/.github/scripts/trymerge_explainer.py new file mode 100644 index 0000000000000..e59307f10854c --- /dev/null +++ b/.github/scripts/trymerge_explainer.py @@ -0,0 +1,146 @@ +import os +import re +from typing import List, Pattern, Tuple, Optional + + +BOT_COMMANDS_WIKI = "https://github.com/pytorch/pytorch/wiki/Bot-commands" + +CIFLOW_LABEL = re.compile(r"^ciflow/.+") +CIFLOW_TRUNK_LABEL = re.compile(r"^ciflow/trunk") + +OFFICE_HOURS_LINK = "https://github.com/pytorch/pytorch/wiki/Dev-Infra-Office-Hours" +CONTACT_US = f"Please reach out to the [PyTorch DevX Team]({OFFICE_HOURS_LINK}) with feedback or questions!" +ALTERNATIVES = ( + "If this is not the intended behavior, feel free to use some " + + f"of the other merge options in the [wiki]({BOT_COMMANDS_WIKI})." +) +LAND_CHECK_ROLLOUT = "https://github.com/pytorch/test-infra/blob/main/torchci/lib/bot/rolloutUtils.ts#L1-L34" + + +def has_label(labels: List[str], pattern: Pattern[str] = CIFLOW_LABEL) -> bool: + return len(list(filter(pattern.match, labels))) > 0 + + +class TryMergeExplainer(object): + force: bool + on_green: bool + land_checks: bool + labels: List[str] + pr_num: int + org: str + project: str + + has_trunk_label: bool + has_ciflow_label: bool + + def __init__( + self, + force: bool, + on_green: bool, + land_checks: bool, + labels: List[str], + pr_num: int, + org: str, + project: str, + ): + self.force = force + self.on_green = on_green + self.land_checks = land_checks + self.labels = labels + self.pr_num = pr_num + self.org = org + self.project = project + self.get_flags() + + def get_flags(self) -> Tuple[bool, bool]: + self.has_trunk_label = has_label(self.labels, CIFLOW_TRUNK_LABEL) + self.has_ciflow_label = has_label(self.labels, CIFLOW_LABEL) + should_check_land_branch = self.land_checks and not self.has_trunk_label + should_check_green = self.on_green or self.has_ciflow_label + + return (should_check_green, should_check_land_branch) + + def _get_flag_msg(self) -> str: + if self.force: + return " the force (-f) flag." + elif self.on_green: + return " the green (-g) flag." + elif self.land_checks: + return ( + " the land checks (-l) flag." 
+ + " If you did not specify this flag yourself, " + + f" you are likely enrolled in the [land checks rollout]({LAND_CHECK_ROLLOUT})." + ) + else: + return "out a flag." + + def _get_land_check_progress(self, commit: Optional[str]) -> str: + if commit is not None: + return ( + " and land check " + + f"progress [here](https://hud.pytorch.org/{self.org}/{self.project}/commit/{commit})" + ) + else: + return "" + + def _get_flag_explanation_message(self) -> str: + if self.force: + return "This means your change will be merged **immediately**, bypassing any CI checks (ETA: 1-5 minutes)." + elif self.on_green: + return "This means that your change will be merged once all checks on your PR have passed (ETA: 0-4 Hours)." + elif self.land_checks: + if self.has_trunk_label: + land_check_msg_suffix = "have passed since you have added the `ciflow/trunk` label to your PR (ETA 0-4 Hours)." + else: + land_check_msg_suffix = ( + "and the land checks have passed (**ETA 4 Hours**). " + ) + land_check_msg_suffix += "If you need to coordinate lands between different changes and cannot risk a land race, " + land_check_msg_suffix += "please add the `ciflow/trunk` label to your PR and wait for signal to complete, " + land_check_msg_suffix += "and then land your changes in proper order." + land_check_msg_suffix += ( + " Having `trunk`, `pull`, and `Lint` pre-run on a " + ) + land_check_msg_suffix += ( + "PR will bypass land checks and the ETA should be immediate." + ) + + return ( + "This means that your change will be merged once all checks on your PR " + + land_check_msg_suffix + ) + else: + return "This means that your change will be merged once all checks on your PR have passed (ETA: 0-4 Hours)." + + def get_merge_message(self, commit: Optional[str] = None) -> str: + message_prefix = "@pytorchbot successfully started a merge job." + progress_links = f"Check the current status [here]({os.getenv('GH_RUN_URL')}){self._get_land_check_progress(commit)}." + flag_message = f"The merge job was triggered with{self._get_flag_msg()}" + explanation_message = self._get_flag_explanation_message() + + msg = message_prefix + " " + msg += progress_links + "\n" + msg += flag_message + " " + msg += explanation_message + " " + msg += ALTERNATIVES + "\n" + msg += CONTACT_US + return msg + + +def get_revert_message(org: str, project: str, pr_num: int) -> str: + msg = ( + "@pytorchbot successfully started a revert job." + + f" Check the current status [here]({os.getenv('GH_RUN_URL')}).\n" + ) + msg += CONTACT_US + return msg + + +def get_land_check_troubleshooting_message() -> str: + return ( + " If you believe this is an error, you can use the old behavior with `@pytorchbot merge -g`" + + " (optionally with the `ciflow/trunk` to get land checks)" + + ' or use `@pytorchbot merge -f "some reason here"`.' + + f" For more information, see the [bot wiki]({BOT_COMMANDS_WIKI}). 
\n" + + CONTACT_US + ) diff --git a/.github/scripts/update_commit_hashes.py b/.github/scripts/update_commit_hashes.py index 5dad5877ca4ae..4b638cf11c90c 100644 --- a/.github/scripts/update_commit_hashes.py +++ b/.github/scripts/update_commit_hashes.py @@ -136,6 +136,7 @@ def main() -> None: ) with open(f".github/ci_commit_pins/{args.repo_name}.txt", "r+") as f: old_hash = f.read().strip() + subprocess.run(f"git checkout {old_hash}".split(), cwd=args.repo_name) f.seek(0) f.truncate() f.write(f"{hash}\n") diff --git a/.github/templates/common.yml.j2 b/.github/templates/common.yml.j2 index f0f3e3a430f7d..b80b82f5d610d 100644 --- a/.github/templates/common.yml.j2 +++ b/.github/templates/common.yml.j2 @@ -1,5 +1,7 @@ {%- set upload_artifact_s3_action = "seemethere/upload-artifact-s3@v5" -%} {%- set download_artifact_s3_action = "seemethere/download-artifact-s3@v4" -%} +{%- set upload_artifact_action = "actions/upload-artifact@v3" -%} +{%- set download_artifact_action = "actions/download-artifact@v3" -%} {# squid_proxy is an private ELB that only available for GHA custom runners #} {%- set squid_proxy = "http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -%} diff --git a/.github/templates/linux_binary_build_workflow.yml.j2 b/.github/templates/linux_binary_build_workflow.yml.j2 index 2879da9dad9c2..072da90789ef3 100644 --- a/.github/templates/linux_binary_build_workflow.yml.j2 +++ b/.github/templates/linux_binary_build_workflow.yml.j2 @@ -78,7 +78,7 @@ jobs: !{{ upload.binary_env(config) }} steps: !{{ common.setup_rocm_linux() }} - - uses: !{{ common.download_artifact_s3_action }} + - uses: !{{ common.download_artifact_action }} name: Download Build Artifacts with: name: !{{ config["build_name"] }} diff --git a/.github/templates/windows_binary_build_workflow.yml.j2 b/.github/templates/windows_binary_build_workflow.yml.j2 index 6b0cbbd187403..9f68df06b704f 100644 --- a/.github/templates/windows_binary_build_workflow.yml.j2 +++ b/.github/templates/windows_binary_build_workflow.yml.j2 @@ -72,7 +72,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: !{{ common.upload_artifact_s3_action }} + - uses: !{{ common.upload_artifact_action }} if: always() with: name: !{{ config["build_name"] }} @@ -93,7 +93,7 @@ jobs: steps: !{{ common.setup_ec2_windows() }} !{{ set_runner_specific_vars() }} - - uses: !{{ common.download_artifact_s3_action }} + - uses: !{{ common.download_artifact_action }} name: Download Build Artifacts with: name: !{{ config["build_name"] }} diff --git a/.github/workflows/_android-full-build-test.yml b/.github/workflows/_android-full-build-test.yml index efc66846db7a3..02c6a9d890212 100644 --- a/.github/workflows/_android-full-build-test.yml +++ b/.github/workflows/_android-full-build-test.yml @@ -19,23 +19,6 @@ on: If this is set, our linter will use this to make sure that every other job with the same `sync-tag` is identical. 
- secrets: - SONATYPE_NEXUS_USERNAME: - description: nexus user - required: true - SONATYPE_NEXUS_PASSWORD: - description: nexus pass - required: true - ANDROID_SIGN_KEY: - description: android key - required: true - ANDROID_SIGN_PASS: - description: android pass - required: true - SCRIBE_GRAPHQL_ACCESS_TOKEN: - description: token for writing to scribe/scuba - required: true - env: GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }} @@ -160,25 +143,6 @@ jobs: mkdir -p "${GITHUB_WORKSPACE}/build_android_artifacts" docker cp "${ID_X86_32}:/var/lib/jenkins/workspace/android/artifacts.tgz" "${GITHUB_WORKSPACE}/build_android_artifacts/" - - name: Publish android snapshot - if: ${{ github.event_name == 'push' && github.event.ref == 'refs/heads/nightly' }} - env: - SONATYPE_NEXUS_USERNAME: ${{ secrets.SONATYPE_NEXUS_USERNAME }} - SONATYPE_NEXUS_PASSWORD: ${{ secrets.SONATYPE_NEXUS_PASSWORD }} - ANDROID_SIGN_KEY: ${{ secrets.ANDROID_SIGN_KEY }} - ANDROID_SIGN_PASS: ${{ secrets.ANDROID_SIGN_PASS }} - ID_X86_32: ${{ steps.build-x86_32.outputs.container_id }} - run: | - set -eux - (echo "./.circleci/scripts/publish_android_snapshot.sh" | docker exec \ - -e BUILD_ENVIRONMENT="pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-publish-snapshot" \ - -e SONATYPE_NEXUS_USERNAME \ - -e SONATYPE_NEXUS_PASSWORD \ - -e ANDROID_SIGN_KEY \ - -e ANDROID_SIGN_PASS \ - -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ - -u jenkins -i "${ID_X86_32}" bash) 2>&1 - - name: Store PyTorch Android Build Artifacts on S3 uses: seemethere/upload-artifact-s3@v5 with: diff --git a/.github/workflows/_binary-build-linux.yml b/.github/workflows/_binary-build-linux.yml index b1b88a5b32f80..dc69e3a82258f 100644 --- a/.github/workflows/_binary-build-linux.yml +++ b/.github/workflows/_binary-build-linux.yml @@ -63,7 +63,7 @@ on: jobs: build: runs-on: linux.4xlarge - timeout-minutes: 240 + timeout-minutes: 270 env: PYTORCH_ROOT: ${{ inputs.PYTORCH_ROOT }} BUILDER_ROOT: ${{ inputs.BUILDER_ROOT }} @@ -209,10 +209,9 @@ jobs: # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 with: name: ${{ inputs.build_name }} - retention-days: 14 if-no-files-found: error path: ${{ runner.temp }}/artifacts/* diff --git a/.github/workflows/_binary-test-linux.yml b/.github/workflows/_binary-test-linux.yml index 5c29288b82462..e8749c59d58c7 100644 --- a/.github/workflows/_binary-test-linux.yml +++ b/.github/workflows/_binary-test-linux.yml @@ -139,7 +139,7 @@ jobs: rm -rf "${GITHUB_WORKSPACE}" mkdir "${GITHUB_WORKSPACE}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: name: ${{ inputs.build_name }} @@ -172,7 +172,7 @@ jobs: working-directory: builder - name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + uses: nick-fields/retry@7d4a37704547a311dbb66ebdf5b23ec19374a767 if: ${{ inputs.GPU_ARCH_TYPE == 'cuda' }} with: timeout_minutes: 10 diff --git a/.github/workflows/_binary-upload.yml b/.github/workflows/_binary-upload.yml index cf47de9ccf212..eddc3abc7f2db 100644 --- a/.github/workflows/_binary-upload.yml +++ b/.github/workflows/_binary-upload.yml @@ -70,7 +70,9 @@ on: description: Conda PyTorchBot token jobs: build: - runs-on: linux.2xlarge + runs-on: ubuntu-22.04 + container: + image: continuumio/miniconda3:4.12.0 env: PYTORCH_ROOT: /pytorch BUILDER_ROOT: /builder @@ -86,40 +88,20 @@ jobs: LIBTORCH_VARIANT: ${{ inputs.LIBTORCH_VARIANT }} DESIRED_DEVTOOLSET: ${{ inputs.DESIRED_DEVTOOLSET }} DESIRED_PYTHON: ${{ inputs.DESIRED_PYTHON }} - # Needed for conda builds - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" ANACONDA_USER: pytorch - AWS_DEFAULT_REGION: us-east-1 BINARY_ENV_FILE: /tmp/env GITHUB_TOKEN: ${{ secrets.github-token }} PR_NUMBER: ${{ github.event.pull_request.number }} PYTORCH_FINAL_PACKAGE_DIR: /artifacts SHA1: ${{ github.event.pull_request.head.sha || github.sha }} steps: - - name: List the env - shell: bash - run: env - name: Checkout PyTorch uses: pytorch/pytorch/.github/actions/checkout-pytorch@master - - name: Setup Linux - uses: ./.github/actions/setup-linux - - name: Chown workspace - uses: ./.github/actions/chown-workspace - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: ./.github/actions/setup-ssh with: - github-secret: ${{ secrets.github-token }} + no-sudo: true - - name: Download Build Artifacts with S3 - uses: seemethere/download-artifact-s3@v4 - if: ${{ inputs.use_s3 }} - with: - name: ${{ inputs.build_name }} - path: "${{ runner.temp }}/artifacts/" - - - name: Download Build Artifacts without S3 + - name: Download Build Artifacts uses: actions/download-artifact@v2 - if: ${{ !inputs.use_s3 }} with: name: ${{ inputs.build_name }} path: "${{ runner.temp }}/artifacts/" @@ -144,35 +126,4 @@ jobs: AWS_SECRET_ACCESS_KEY: ${{ secrets.aws-pytorch-uploader-secret-access-key }} ANACONDA_API_TOKEN: ${{ secrets.conda-pytorchbot-token }} run: | - docker run --rm -i \ - -e ANACONDA_API_TOKEN \ - -e AWS_ACCESS_KEY_ID \ - -e AWS_SECRET_ACCESS_KEY \ - -e DRY_RUN \ - -e PACKAGE_TYPE \ - -e PKG_DIR=/artifacts \ - -e UPLOAD_CHANNEL \ - -e UPLOAD_SUBFOLDER \ - -v "${RUNNER_TEMP}/artifacts:/artifacts" \ - -v "${GITHUB_WORKSPACE}:/v" \ - -w /v \ - 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ - bash -c '.circleci/scripts/binary_upload.sh' - - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: 
.github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af + bash .circleci/scripts/binary_upload.sh diff --git a/.github/workflows/_buck-build-test.yml b/.github/workflows/_buck-build-test.yml index ae7f7517e2eda..221ca9adcd442 100644 --- a/.github/workflows/_buck-build-test.yml +++ b/.github/workflows/_buck-build-test.yml @@ -28,7 +28,7 @@ jobs: activate-environment: build - name: Install dependencies - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + uses: nick-fields/retry@7d4a37704547a311dbb66ebdf5b23ec19374a767 with: timeout_minutes: 10 max_attempts: 5 @@ -46,16 +46,17 @@ jobs: typing_extensions - name: Install Buck - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + uses: nick-fields/retry@7d4a37704547a311dbb66ebdf5b23ec19374a767 with: timeout_minutes: 10 max_attempts: 5 command: | - wget https://github.com/facebook/buck/releases/download/v2021.01.12.01/buck.2021.01.12.01_all.deb + sudo apt update -q + wget -q https://github.com/facebook/buck/releases/download/v2021.01.12.01/buck.2021.01.12.01_all.deb sudo apt install ./buck.2021.01.12.01_all.deb - name: Download third party libraries and generate wrappers - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + uses: nick-fields/retry@7d4a37704547a311dbb66ebdf5b23ec19374a767 with: timeout_minutes: 10 max_attempts: 5 diff --git a/.github/workflows/_docs.yml b/.github/workflows/_docs.yml index de28790f8c5e9..a70605f2f5aa4 100644 --- a/.github/workflows/_docs.yml +++ b/.github/workflows/_docs.yml @@ -38,10 +38,16 @@ jobs: build-docs: # Don't run on forked repos. 
if: github.repository_owner == 'pytorch' - runs-on: [self-hosted, linux.2xlarge] + runs-on: [self-hosted, linux.4xlarge] strategy: matrix: - docs_type: [cpp, python] + include: + - docs_type: cpp + # Nightly cpp docs take about 150m to finish, and the number is stable + timeout-minutes: 180 + - docs_type: python + # It takes less than 30m to finish python docs unless there are issues + timeout-minutes: 30 steps: # [see note: pytorch repo ref] - name: Checkout PyTorch @@ -76,6 +82,8 @@ jobs: echo "password ${GITHUB_PYTORCHBOT_TOKEN}" >> "${RUNNER_TEMP}/.netrc" - name: Build ${{ matrix.docs_type }} docs + timeout-minutes: ${{ matrix.timeout-minutes }} + id: build-docs env: WITH_PUSH: ${{ github.event_name == 'schedule' || startsWith(github.event.ref, 'refs/tags/v') }} DOCKER_IMAGE: ${{ inputs.docker-image }} @@ -118,7 +126,7 @@ jobs: - name: Upload Python Docs Preview uses: seemethere/upload-artifact-s3@v5 - if: ${{ github.event_name == 'pull_request' && matrix.docs_type == 'python' }} + if: ${{ github.event_name == 'pull_request' && matrix.docs_type == 'python' && steps.build-docs.outcome == 'success' }} with: retention-days: 14 s3-bucket: doc-previews @@ -128,7 +136,7 @@ jobs: - name: Upload C++ Docs Preview uses: seemethere/upload-artifact-s3@v5 - if: ${{ github.event_name == 'pull_request' && matrix.docs_type == 'cpp' }} + if: ${{ github.event_name == 'pull_request' && matrix.docs_type == 'cpp' && steps.build-docs.outcome == 'success' }} with: retention-days: 14 if-no-files-found: error diff --git a/.github/workflows/_ios-build-test.yml b/.github/workflows/_ios-build-test.yml index 189e21d210e59..56443419ef1d8 100644 --- a/.github/workflows/_ios-build-test.yml +++ b/.github/workflows/_ios-build-test.yml @@ -140,6 +140,7 @@ jobs: scripts/build_ios.sh - name: Run Build Test + timeout-minutes: 5 run: | PROFILE=PyTorch_CI_2022 # run the ruby build script @@ -190,3 +191,9 @@ jobs: else bundle exec fastlane scan --only_testing TestAppTests/TestAppTests/testFullJIT fi + - name: Dump Simulator Tests On a Failure + if: | + failure() && inputs.ios-platform == 'SIMULATOR' + run: | + echo "Simulator Tests Logs:" + cat /Users/runner/Library/Logs/scan/*.log diff --git a/.github/workflows/_linux-test.yml b/.github/workflows/_linux-test.yml index aa81647c53fcf..f4cd8376883eb 100644 --- a/.github/workflows/_linux-test.yml +++ b/.github/workflows/_linux-test.yml @@ -53,7 +53,7 @@ jobs: docker-image: ${{ inputs.docker-image }} - name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + uses: nick-fields/retry@7d4a37704547a311dbb66ebdf5b23ec19374a767 if: contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') with: timeout_minutes: 10 @@ -178,6 +178,7 @@ jobs: - name: Stop monitoring script if: always() && steps.monitor-script.outputs.monitor-script-pid shell: bash + continue-on-error: true env: MONITOR_SCRIPT_PID: ${{ steps.monitor-script.outputs.monitor-script-pid }} run: | diff --git a/.github/workflows/_mac-build.yml b/.github/workflows/_mac-build.yml index f17bd649c7131..316656b6ec9b2 100644 --- a/.github/workflows/_mac-build.yml +++ b/.github/workflows/_mac-build.yml @@ -27,6 +27,12 @@ on: description: | If this is set, our linter will use this to make sure that every other job with the same `sync-tag` is identical. + python_version: + required: false + type: string + default: "3.8" + description: | + The python version to be used. 
Will be 3.8 by default secrets: MACOS_SCCACHE_S3_ACCESS_KEY_ID: @@ -68,7 +74,7 @@ jobs: uses: conda-incubator/setup-miniconda@v2 with: auto-update-conda: true - python-version: 3.8 + python-version: ${{ inputs.python_version }} activate-environment: build miniconda-version: 4.7.12 diff --git a/.github/workflows/_mac-test-arm64.yml b/.github/workflows/_mac-test-mps.yml similarity index 98% rename from .github/workflows/_mac-test-arm64.yml rename to .github/workflows/_mac-test-mps.yml index 14502a32ad684..fa189307358a6 100644 --- a/.github/workflows/_mac-test-arm64.yml +++ b/.github/workflows/_mac-test-mps.yml @@ -41,7 +41,7 @@ jobs: - name: Install PyTorch env: ENV_NAME: conda-test-env-${{ github.run_id }} - PY_VERS: 3.8 + PY_VERS: 3.9 shell: arch -arch arm64 bash {0} run: | # shellcheck disable=SC1090 diff --git a/.github/workflows/_mac-test.yml b/.github/workflows/_mac-test.yml index e919bef85a67a..36a0149795dc7 100644 --- a/.github/workflows/_mac-test.yml +++ b/.github/workflows/_mac-test.yml @@ -18,6 +18,11 @@ on: description: | If this is set, our linter will use this to make sure that every other job with the same `sync-tag` is identical. + arch: + required: true + type: string + description: | + Contains the architecture to run the tests with secrets: AWS_OSSCI_METRICS_V2_ACCESS_KEY_ID: @@ -27,15 +32,15 @@ on: required: true description: secret acess key for test stats upload -# For setup-miniconda, see https://github.com/conda-incubator/setup-miniconda/issues/179 -defaults: - run: - shell: bash -e -l {0} - jobs: test: # Don't run on forked repos. if: github.repository_owner == 'pytorch' + # For setup-miniconda, see https://github.com/conda-incubator/setup-miniconda/issues/179 + # Also ensure that we always run with the right architecture + defaults: + run: + shell: arch -arch ${{ inputs.arch }} bash -e -l {0} strategy: matrix: ${{ fromJSON(inputs.test-matrix) }} fail-fast: false @@ -57,7 +62,6 @@ jobs: - name: Start monitoring script id: monitor-script - shell: bash run: | python3 -m pip install psutil==5.9.1 python3 -m pip install pynvml==11.4.1 @@ -70,7 +74,8 @@ jobs: name: ${{ inputs.build-environment }} use-gha: true - - name: Setup miniconda + - name: Setup miniconda for x86 + if: inputs.build-environment == 'macos-12-py3-x86-64' uses: conda-incubator/setup-miniconda@v2 with: auto-update-conda: true @@ -78,6 +83,16 @@ jobs: activate-environment: build miniconda-version: 4.7.12 + - name: Setup miniconda for arm64 + if: inputs.build-environment == 'macos-12-py3-arm64' + run: | + # Conda is already installed and setup for bash here + # Cleanup lingering conda environment and create + # a new one for this run + conda env remove -n build + conda create -n build python=3.9.12 + conda list + - name: Install macOS homebrew dependencies run: | # Install dependencies @@ -87,6 +102,12 @@ jobs: id: parse-ref run: .github/scripts/parse_ref.py + - name: Pre-process arm64 wheels + if: inputs.build-environment == 'macos-12-py3-arm64' + run: | + # As wheels are cross-compiled they are reported as x86_64 ones + ORIG_WHLNAME=$(ls -1 dist/*.whl); ARM_WHLNAME=${ORIG_WHLNAME/x86_64/arm64}; mv "${ORIG_WHLNAME}" "${ARM_WHLNAME}" + - name: Test id: test run: | @@ -103,10 +124,21 @@ jobs: # wreak havoc internally export COMMIT_MESSAGES="${COMMIT_MESSAGES//[\'\"]}" export PR_BODY="${PR_BODY//[\'\"]}" + arch + + # This is a no-op for x86 + conda activate build python3 -mpip install dist/*.whl .jenkins/pytorch/macos-test.sh + - name: Cleanup miniconda for arm64 + if: inputs.build-environment == 
'macos-12-py3-arm64' + run: | + # Cleanup conda env + conda deactivate + conda env remove -n build + - name: Get workflow job id id: get-job-id uses: ./.github/actions/get-workflow-job-id @@ -115,8 +147,8 @@ jobs: github-token: ${{ secrets.GITHUB_TOKEN }} - name: Stop monitoring script - if: always() && steps.monitor-script.outputs.monitor-script-pid - shell: bash + if: always() && ${{ steps.monitor-script.outputs.monitor-script-pid }} + continue-on-error: true env: MONITOR_SCRIPT_PID: ${{ steps.monitor-script.outputs.monitor-script-pid }} run: | @@ -148,7 +180,6 @@ jobs: AWS_ACCESS_KEY_ID: ${{ secrets.AWS_OSSCI_METRICS_V2_ACCESS_KEY_ID }} AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_OSSCI_METRICS_V2_SECRET_ACCESS_KEY }} GHA_WORKFLOW_JOB_ID: ${{ steps.get-job-id.outputs.job-id }} - shell: bash run: | set -x python3 -m pip install -r requirements.txt diff --git a/.github/workflows/_rocm-test.yml b/.github/workflows/_rocm-test.yml index b5550fdda7f0a..f65e1464998a3 100644 --- a/.github/workflows/_rocm-test.yml +++ b/.github/workflows/_rocm-test.yml @@ -179,6 +179,7 @@ jobs: - name: Stop monitoring script if: always() && steps.monitor-script.outputs.monitor-script-pid shell: bash + continue-on-error: true env: MONITOR_SCRIPT_PID: ${{ steps.monitor-script.outputs.monitor-script-pid }} run: | diff --git a/.github/workflows/_win-test.yml b/.github/workflows/_win-test.yml index 560c0fe84e1d4..243bd7563639a 100644 --- a/.github/workflows/_win-test.yml +++ b/.github/workflows/_win-test.yml @@ -124,6 +124,7 @@ jobs: - name: Stop monitoring script if: always() && steps.monitor-script.outputs.monitor-script-pid shell: bash + continue-on-error: true env: MONITOR_SCRIPT_PID: ${{ steps.monitor-script.outputs.monitor-script-pid }} run: | diff --git a/.github/workflows/cancel_redundant_workflows.yml b/.github/workflows/cancel_redundant_workflows.yml deleted file mode 100644 index c6755dd25f37d..0000000000000 --- a/.github/workflows/cancel_redundant_workflows.yml +++ /dev/null @@ -1,23 +0,0 @@ -name: Cancel redundant workflows -on: - workflow_run: - types: - - requested - # NOTE: Make sure to add to this list as you add more workflows running on 'pull_request' - workflows: - - Lint - - Test tools - - TorchBench CI (pytorch-linux-py3.7-cu102) - - clang-format -jobs: - cancel: - # We do not want to cancel reruns on master - if: github.event.workflow_run.head_branch != 'master' - runs-on: ubuntu-18.04 - steps: - - name: Cancel duplicate workflow runs - uses: potiuk/cancel-workflow-runs@a81b3c4d59c61e27484cfacdc13897dd908419c9 - with: - cancelMode: duplicates - token: ${{ secrets.GITHUB_TOKEN }} - sourceRunId: ${{ github.event.workflow_run.id }} diff --git a/.github/workflows/docker-release.yml b/.github/workflows/docker-release.yml new file mode 100644 index 0000000000000..6c4dbe5ef773d --- /dev/null +++ b/.github/workflows/docker-release.yml @@ -0,0 +1,89 @@ +name: Build Official Docker Images + +on: + workflow_dispatch: + pull_request: + paths: + - Dockerfile + - docker.Makefile + push: + branches: + - nightly + tags: + # Release candidate tags look like: v1.11.0-rc1 + - v[0-9]+.[0-9]+.[0-9]+-rc[0-9]+ + - ciflow/nightly/* + +concurrency: + group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} + cancel-in-progress: true + +env: + BUILD_PROGRESS: plain + BUILD_TYPE: official + DOCKER_ORG: pytorch + DOCKER_REGISTRY: ghcr.io + NO_BUILD_SUFFIX: true + USE_BUILDX: 1 + WITH_PUSH: ${{ github.event_name == 'push' && (github.event.ref == 
'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + +jobs: + build: + if: ${{ github.repository == 'pytorch/pytorch' }} + runs-on: [self-hosted, linux.2xlarge] + timeout-minutes: 240 + strategy: + matrix: + include: + # nvidia specific images don't exist for arm64 so only build the runtime image + - image_type: runtime + platform: linux/arm64,linux/amd64 + - image_type: devel + platform: linux/amd64 + env: + BUILD_IMAGE_TYPE: ${{ matrix.image_type }} + BUILD_PLATFORMS: ${{ matrix.platform }} + steps: + # [see note: pytorch repo ref] + # deep clone (fetch-depth 0) required for git merge-base + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Setup SSH (Click me for login details) + uses: ./.github/actions/setup-ssh + with: + github-secret: ${{ secrets.GITHUB_TOKEN }} + - name: Login to GitHub Container Registry + if: ${{ env.WITH_PUSH == 'true' }} + uses: docker/login-action@v2 + with: + registry: ghcr.io + username: pytorch + password: ${{ secrets.GHCR_PAT }} + # Setup multi-arch image builds + - name: Set up QEMU + uses: docker/setup-qemu-action@v2 + env: + QEMU_BINARY_PATH: ${{ runner.temp }}/bin + - name: Set up Docker Buildx + uses: docker/setup-buildx-action@v2 + - name: Setup job specific variables + run: | + set -eou pipefail + # To get QEMU binaries in our PATh + echo "${RUNNER_TEMP}/bin" >> "${GITHUB_PATH}" + # Generate PyTorch version to use + echo "PYTORCH_VERSION=$(python3 .github/scripts/generate_pytorch_version.py)" >> "${GITHUB_ENV}" + - name: Setup nightly specific variables + if: ${{ github.event.ref == 'refs/heads/nightly' }} + run: | + # Use nightly image if building for nightly + echo "DOCKER_IMAGE=pytorch-nightly" >> "${GITHUB_ENV}" + - name: Run docker build / push + # WITH_PUSH is used here to determine whether or not to add the --push flag + run: | + make -f docker.Makefile "${BUILD_IMAGE_TYPE}-image" + - name: Teardown Linux + uses: ./.github/actions/teardown-linux + if: always() diff --git a/.github/workflows/lint.yml b/.github/workflows/lint.yml index 002f25561c358..f283a6b5c3a8d 100644 --- a/.github/workflows/lint.yml +++ b/.github/workflows/lint.yml @@ -14,19 +14,23 @@ jobs: lintrunner: runs-on: ubuntu-18.04 steps: + - name: Checkout PyTorch + uses: csarofeen/pytorch/.github/actions/checkout-pytorch@master + with: + submodules: false + fetch-depth: 1 + - name: Setup Python uses: actions/setup-python@v2 with: python-version: 3.8 architecture: x64 - - - name: Checkout PyTorch - uses: csarofeen/pytorch/.github/actions/checkout-pytorch@master - with: - submodules: false + cache: pip + cache-dependency-path: | + **/.github/requirements-gha-cache.txt - name: Install lintrunner - run: pip install lintrunner==0.9.* + run: pip install lintrunner==0.9.2 - name: Initialize lint dependencies run: lintrunner init diff --git a/.github/workflows/mac-mps.yml b/.github/workflows/mac-mps.yml new file mode 100644 index 0000000000000..8fc2dd8336bff --- /dev/null +++ b/.github/workflows/mac-mps.yml @@ -0,0 +1,35 @@ +name: Mac MPS + +on: + push: + tags: + - ciflow/mps/* + workflow_dispatch: + +concurrency: + group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }} + cancel-in-progress: true + +jobs: + macos-12-py3-arm64-build: + name: macos-12-py3-arm64 + uses: 
./.github/workflows/_mac-build.yml + with: + sync-tag: macos-12-py3-arm64-build + build-environment: macos-12-py3-arm64 + xcode-version: "13.3.1" + runner-type: macos-12-xl + build-generates-artifacts: true + # To match the one pre-installed in the m1 runners + python_version: 3.9.12 + secrets: + MACOS_SCCACHE_S3_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }} + MACOS_SCCACHE_S3_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }} + + macos-12-py3-arm64-mps-test: + name: macos-12-py3-arm64-mps + uses: ./.github/workflows/_mac-test-mps.yml + needs: macos-12-py3-arm64-build + with: + sync-tag: macos-12-py3-arm64-mps-test + build-environment: macos-12-py3-arm64 diff --git a/.github/workflows/periodic.yml b/.github/workflows/periodic.yml index 0e3e565deb914..7fbd04f8f161f 100644 --- a/.github/workflows/periodic.yml +++ b/.github/workflows/periodic.yml @@ -120,6 +120,58 @@ jobs: { config: "default", shard: 4, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, ]} + linux-bionic-cuda11_7-py3_7-gcc7-debug-build: + name: linux-bionic-cuda11.7-py3.7-gcc7-debug + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-bionic-cuda11.7-py3.7-gcc7-debug + docker-image-name: pytorch-linux-bionic-cuda11.7-cudnn8-py3-gcc7 + build-with-debug: true + + linux-bionic-cuda11_7-py3_7-gcc7-debug-test: + name: linux-bionic-cuda11.7-py3.7-gcc7-debug + uses: ./.github/workflows/_linux-test.yml + needs: linux-bionic-cuda11_7-py3_7-gcc7-debug-build + with: + build-environment: linux-bionic-cuda11.7-py3.7-gcc7-debug + docker-image: ${{ needs.linux-bionic-cuda11_7-py3_7-gcc7-debug-build.outputs.docker-image }} + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "default", shard: 2, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "default", shard: 3, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "default", shard: 4, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, + ]} + + libtorch-linux-bionic-cuda11_7-py3_7-gcc7-build: + name: libtorch-linux-bionic-cuda11.7-py3.7-gcc7 + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: libtorch-linux-bionic-cuda11.7-py3.7-gcc7 + docker-image-name: pytorch-linux-bionic-cuda11.7-cudnn8-py3-gcc7 + build-generates-artifacts: false + + win-vs2019-cuda11_7-py3-build: + name: win-vs2019-cuda11.7-py3 + uses: ./.github/workflows/_win-build.yml + with: + build-environment: win-vs2019-cuda11.7-py3 + cuda-version: "11.7" + + win-vs2019-cuda11_7-py3-test: + name: win-vs2019-cuda11.7-py3 + uses: ./.github/workflows/_win-test.yml + needs: win-vs2019-cuda11_7-py3-build + with: + build-environment: win-vs2019-cuda11.7-py3 + cuda-version: "11.7" + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 2, runner: "windows.8xlarge.nvidia.gpu" }, + { config: "default", shard: 2, num_shards: 2, runner: "windows.8xlarge.nvidia.gpu" }, + { config: "force_on_cpu", shard: 1, num_shards: 1, runner: "windows.4xlarge" }, + ]} + ios-12-5-1-x86-64-coreml: name: ios-12-5-1-x86-64-coreml uses: ./.github/workflows/_ios-build-test.yml diff --git a/.github/workflows/pr-labels.yml b/.github/workflows/pr-labels.yml index 7313d0b8e9682..41c91b05b1c29 100644 --- a/.github/workflows/pr-labels.yml +++ b/.github/workflows/pr-labels.yml @@ -11,14 +11,18 @@ jobs: runs-on: ubuntu-latest steps: + - name: Checkout repository + uses: actions/checkout@v2 + - name: Set up python uses: actions/setup-python@v2 + with: + cache: 
pip + cache-dependency-path: | + **/.github/requirements-gha-cache.txt - name: Install requests - run: pip3 install requests==2.26 - - - name: Checkout repository - uses: actions/checkout@v2 + run: pip install requests==2.26 - name: Process commit and find merger responsible for labeling id: commit diff --git a/.github/workflows/pull.yml b/.github/workflows/pull.yml new file mode 100644 index 0000000000000..b5d545844fc5b --- /dev/null +++ b/.github/workflows/pull.yml @@ -0,0 +1,312 @@ +name: pull + +on: + pull_request: + push: + branches: + - master + - main + - release/* + - landchecks/* + workflow_dispatch: + +concurrency: + group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} + cancel-in-progress: true + +jobs: + linux-focal-py3_7-gcc7-build: + name: linux-focal-py3.7-gcc7 + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-focal-py3.7-gcc7 + docker-image-name: pytorch-linux-focal-py3.7-gcc7 + + linux-focal-py3_7-gcc7-test: + name: linux-focal-py3.7-gcc7 + uses: ./.github/workflows/_linux-test.yml + needs: linux-focal-py3_7-gcc7-build + with: + build-environment: linux-focal-py3.7-gcc7 + docker-image: ${{ needs.linux-focal-py3_7-gcc7-build.outputs.docker-image }} + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 2, runner: "linux.2xlarge" }, + { config: "default", shard: 2, num_shards: 2, runner: "linux.2xlarge" }, + { config: "distributed", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, + { config: "functorch", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, + { config: "docs_test", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, + { config: "jit_legacy", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, + { config: "backwards_compat", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, + ]} + + linux-docs: + name: linux-docs + uses: ./.github/workflows/_docs.yml + needs: linux-focal-py3_7-gcc7-build + with: + build-environment: linux-focal-py3.7-gcc7 + docker-image: ${{ needs.linux-focal-py3_7-gcc7-build.outputs.docker-image }} + + linux-focal-py3_7-gcc7-no-ops: + name: linux-focal-py3.7-gcc7-no-ops + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-focal-py3.7-gcc7-no-ops + docker-image-name: pytorch-linux-focal-py3.7-gcc7 + + linux-focal-py3_7-gcc7-pch: + name: linux-focal-py3.7-gcc7-pch + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-focal-py3.7-gcc7-pch + docker-image-name: pytorch-linux-focal-py3.7-gcc7 + + linux-focal-py3_7-clang7-asan-build: + name: linux-focal-py3.7-clang7-asan + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-focal-py3.7-clang7-asan + docker-image-name: pytorch-linux-focal-py3-clang7-asan + + linux-focal-py3_7-clang7-asan-test: + name: linux-focal-py3.7-clang7-asan + uses: ./.github/workflows/_linux-test.yml + needs: linux-focal-py3_7-clang7-asan-build + with: + build-environment: linux-focal-py3.7-clang7-asan + docker-image: ${{ needs.linux-focal-py3_7-clang7-asan-build.outputs.docker-image }} + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 5, runner: "linux.2xlarge" }, + { config: "default", shard: 2, num_shards: 5, runner: "linux.2xlarge" }, + { config: "default", shard: 3, num_shards: 5, runner: "linux.2xlarge" }, + { config: "default", shard: 4, num_shards: 5, runner: "linux.2xlarge" }, + { config: "default", shard: 5, num_shards: 5, runner: "linux.2xlarge" }, + ]} + + 
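Each `test-matrix` block above is a JSON document that the reusable test workflow presumably expands into one job per `include` entry. A minimal sketch of what a single entry boils down to on the runner, assuming the shard numbers are handed through to `.jenkins/pytorch/test.sh` (the exported variable is illustrative):

```
# One matrix entry, e.g. { config: "default", shard: 3, num_shards: 5 },
# roughly corresponds to a sharded invocation like this:
export NUM_TEST_SHARDS=5   # illustrative; normally provided by the test workflow
python test/run_test.py --exclude-jit-executor --exclude-distributed-tests \
    --shard 3 "$NUM_TEST_SHARDS" --verbose
```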
linux-focal-py3_7-clang10-onnx-build: + name: linux-focal-py3.7-clang10-onnx + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-focal-py3.7-clang10-onnx + docker-image-name: pytorch-linux-focal-py3-clang10-onnx + + linux-focal-py3_7-clang10-onnx-test: + name: linux-focal-py3.7-clang10-onnx + uses: ./.github/workflows/_linux-test.yml + needs: linux-focal-py3_7-clang10-onnx-build + with: + build-environment: linux-focal-py3.7-clang10-onnx + docker-image: ${{ needs.linux-focal-py3_7-clang10-onnx-build.outputs.docker-image }} + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 2, runner: "linux.2xlarge" }, + { config: "default", shard: 2, num_shards: 2, runner: "linux.2xlarge" }, + ]} + + linux-bionic-py3_7-clang9-build: + name: linux-bionic-py3.7-clang9 + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-bionic-py3.7-clang9 + docker-image-name: pytorch-linux-bionic-py3.7-clang9 + + linux-bionic-py3_7-clang9-test: + name: linux-bionic-py3.7-clang9 + uses: ./.github/workflows/_linux-test.yml + needs: linux-bionic-py3_7-clang9-build + with: + build-environment: linux-bionic-py3.7-clang9 + docker-image: ${{ needs.linux-bionic-py3_7-clang9-build.outputs.docker-image }} + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 2, runner: "linux.2xlarge" }, + { config: "default", shard: 2, num_shards: 2, runner: "linux.2xlarge" }, + { config: "crossref", shard: 1, num_shards: 2, runner: "linux.2xlarge" }, + { config: "crossref", shard: 2, num_shards: 2, runner: "linux.2xlarge" }, + { config: "dynamo", shard: 1, num_shards: 2, runner: "linux.2xlarge" }, + { config: "dynamo", shard: 2, num_shards: 2, runner: "linux.2xlarge" }, + { config: "functorch", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, + ]} + + linux-bionic-cuda11_3-py3_7-clang9-build: + name: linux-bionic-cuda11.3-py3.7-clang9 + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-bionic-cuda11.3-py3.7-clang9 + docker-image-name: pytorch-linux-bionic-cuda11.3-cudnn8-py3-clang9 + + linux-vulkan-bionic-py3_7-clang9-build: + name: linux-vulkan-bionic-py3.7-clang9 + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-vulkan-bionic-py3.7-clang9 + docker-image-name: pytorch-linux-bionic-py3.7-clang9 + + linux-vulkan-bionic-py3_7-clang9-test: + name: linux-vulkan-bionic-py3.7-clang9 + uses: ./.github/workflows/_linux-test.yml + needs: linux-vulkan-bionic-py3_7-clang9-build + with: + build-environment: linux-vulkan-bionic-py3.7-clang9 + docker-image: ${{ needs.linux-vulkan-bionic-py3_7-clang9-build.outputs.docker-image }} + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, + ]} + + linux-bionic-cuda11_6-py3_10-gcc7-build: + name: linux-bionic-cuda11.6-py3.10-gcc7 + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-bionic-cuda11.6-py3.10-gcc7 + docker-image-name: pytorch-linux-bionic-cuda11.6-cudnn8-py3-gcc7 + + linux-bionic-cuda11_6-py3_10-gcc7-test: + name: linux-bionic-cuda11.6-py3.10-gcc7 + uses: ./.github/workflows/_linux-test.yml + needs: linux-bionic-cuda11_6-py3_10-gcc7-build + with: + build-environment: linux-bionic-cuda11.6-py3.10-gcc7 + docker-image: ${{ needs.linux-bionic-cuda11_6-py3_10-gcc7-build.outputs.docker-image }} + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "default", shard: 2, num_shards: 4, runner: 
"linux.4xlarge.nvidia.gpu" }, + { config: "default", shard: 3, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "default", shard: 4, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "distributed", shard: 1, num_shards: 2, runner: "linux.8xlarge.nvidia.gpu" }, + { config: "distributed", shard: 2, num_shards: 2, runner: "linux.8xlarge.nvidia.gpu" }, + { config: "functorch", shard: 1, num_shards: 1, runner: "linux.4xlarge.nvidia.gpu" }, + ]} + + linux-xenial-py3-clang5-mobile-build: + name: linux-xenial-py3-clang5-mobile-build + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-xenial-py3-clang5-mobile-build + docker-image-name: pytorch-linux-xenial-py3-clang5-asan + build-generates-artifacts: false + + linux-jammy-cuda-11_6-cudnn8-py3_8-clang12-build: + name: linux-jammy-cuda11.6-cudnn8-py3.8-clang12 + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-jammy-cuda11.6-cudnn8-py3.8-clang12 + docker-image-name: pytorch-linux-jammy-cuda11.6-cudnn8-py3.8-clang12 + + linux-xenial-py3-clang5-mobile-custom-build-static: + name: linux-xenial-py3-clang5-mobile-custom-build-static + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-xenial-py3-clang5-mobile-custom-build-static + docker-image-name: pytorch-linux-xenial-py3-clang5-android-ndk-r19c + build-generates-artifacts: false + + linux-bionic-py3_7-clang8-xla-build: + name: linux-bionic-py3_7-clang8-xla + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-bionic-py3_7-clang8-xla + docker-image-name: xla_base + + linux-bionic-py3_7-clang8-xla-test: + name: linux-bionic-py3_7-clang8-xla + uses: ./.github/workflows/_linux-test.yml + needs: linux-bionic-py3_7-clang8-xla-build + with: + build-environment: linux-bionic-py3_7-clang8-xla + docker-image: ${{ needs.linux-bionic-py3_7-clang8-xla-build.outputs.docker-image }} + test-matrix: | + { include: [ + { config: "xla", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, + ]} + + win-vs2019-cpu-py3-build: + name: win-vs2019-cpu-py3 + uses: ./.github/workflows/_win-build.yml + with: + build-environment: win-vs2019-cpu-py3 + cuda-version: cpu + + win-vs2019-cpu-py3-test: + name: win-vs2019-cpu-py3 + uses: ./.github/workflows/_win-test.yml + needs: win-vs2019-cpu-py3-build + with: + build-environment: win-vs2019-cpu-py3 + cuda-version: cpu + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 2, runner: "windows.4xlarge" }, + { config: "default", shard: 2, num_shards: 2, runner: "windows.4xlarge" }, + { config: "functorch", shard: 1, num_shards: 1, runner: "windows.4xlarge" }, + ]} + + win-vs2019-cuda11_6-py3-build: + if: github.event_name == 'pull_request' + name: win-vs2019-cuda11.6-py3 + uses: ./.github/workflows/_win-build.yml + with: + build-environment: win-vs2019-cuda11.6-py3 + cuda-version: "11.6" + sync-tag: win-cuda-build + + linux-xenial-cuda11_3-py3_7-gcc7-bazel-test: + name: linux-xenial-cuda11.3-py3.7-gcc7-bazel-test + uses: ./.github/workflows/_bazel-build-test.yml + with: + build-environment: linux-xenial-cuda11.3-py3.7-gcc7-bazel-test + docker-image-name: pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7 + + linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single: + name: linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single + uses: ./.github/workflows/_android-build-test.yml + with: + build-environment: linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single + docker-image-name: 
pytorch-linux-xenial-py3-clang5-android-ndk-r19c + + linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit: + name: linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit + uses: ./.github/workflows/_android-build-test.yml + with: + build-environment: linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit + docker-image-name: pytorch-linux-xenial-py3-clang5-android-ndk-r19c + + linux-focal-py3_7-gcc7-mobile-lightweight-dispatch-build: + name: linux-focal-py3.7-gcc7-mobile-lightweight-dispatch-build + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-focal-py3.7-gcc7-mobile-lightweight-dispatch-build + docker-image-name: pytorch-linux-focal-py3.7-gcc7 + build-generates-artifacts: false + + linux-bionic-cuda11_6-py3_10-gcc7-deploy-build: + name: linux-bionic-cuda11_6-py3_10-gcc7-deploy + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-bionic-cuda11.6-py3.10-gcc7-deploy + docker-image-name: pytorch-linux-bionic-cuda11.6-cudnn8-py3-gcc7 + + deploy-linux-bionic-cuda11_6-py3_10-gcc7-test: + name: linux-bionic-cuda11_6-py3_10-gcc7-deploy + uses: ./.github/workflows/_linux-test.yml + needs: linux-bionic-cuda11_6-py3_10-gcc7-deploy-build + with: + build-environment: linux-bionic-cuda11.6-py3.10-gcc7-deploy + docker-image: ${{ needs.linux-bionic-cuda11_6-py3_10-gcc7-deploy-build.outputs.docker-image }} + test-matrix: | + { include: [ + { config: "deploy", shard: 1, num_shards: 1, runner: "linux.4xlarge.nvidia.gpu" }, + ]} + + linux-focal-rocm5_2-py3_7-build: + # don't run build twice on master + if: github.event_name == 'pull_request' + name: linux-focal-rocm5.2-py3.7 + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-focal-rocm5.2-py3.7 + docker-image-name: pytorch-linux-focal-rocm5.2-py3.7 + sync-tag: rocm-build diff --git a/.github/workflows/push_nightly_docker_ghcr.yml b/.github/workflows/push_nightly_docker_ghcr.yml new file mode 100644 index 0000000000000..bdcc6e05dc593 --- /dev/null +++ b/.github/workflows/push_nightly_docker_ghcr.yml @@ -0,0 +1,39 @@ +name: docker-release-builds +on: + schedule: + # Push the nightly docker daily at 1 PM UTC + - cron: '0 13 * * *' + # Trigger when we modify something related to these images + pull_request: + paths: + - .github/scripts/build_publish_nightly_docker.sh + - .github/workflows/push_nightly_docker_ghcr.yml + - Dockerfile + - docker.Makefile + # Have the ability to trigger this job manually using the API as well + workflow_dispatch: + +jobs: + docker-release-build: + if: ${{ github.repository == 'pytorch/pytorch' }} + runs-on: linux.2xlarge + env: + GHCR_PAT: ${{ secrets.GHCR_PAT }} + WITH_PUSH: ${{ github.event_name == 'schedule' }} + steps: + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + - uses: nick-fields/retry@7d4a37704547a311dbb66ebdf5b23ec19374a767 + name: Build and upload nightly docker + with: + timeout_minutes: 10 + max_attempts: 3 + command: | + set -ex + bash .github/scripts/build_publish_nightly_docker.sh + +concurrency: + group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} + cancel-in-progress: true diff --git a/.github/workflows/revert.yml b/.github/workflows/revert.yml index 1fbdacc82071e..d207840f383b4 100644 --- a/.github/workflows/revert.yml 
+++ b/.github/workflows/revert.yml @@ -8,18 +8,24 @@ jobs: do_revert: name: try_revert_pr_${{ github.event.client_payload.pr_num }} runs-on: linux.20_04.4x + env: + GH_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }} steps: - - name: Setup Python - uses: actions/setup-python@v2 - with: - python-version: 3.8 - architecture: x64 - name: Checkout repo uses: actions/checkout@v2 + id: checkout with: fetch-depth: 0 token: ${{ secrets.MERGEBOT_TOKEN }} + - name: Setup Python + uses: actions/setup-python@v4 + with: + python-version: 3.8 + architecture: x64 + cache: 'pip' + - run: pip install pyyaml==6.0 + - name: Setup committer id run: | git config --global user.email "pytorchmergebot@users.noreply.github.com" @@ -30,7 +36,6 @@ jobs: PR_NUM: ${{ github.event.client_payload.pr_num }} COMMENT_ID: ${{ github.event.client_payload.comment_id }} REASON: ${{ github.event.client_payload.reason }} - GH_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }} run: | set -ex if [ -n "${COMMENT_ID}" ]; then @@ -46,5 +51,14 @@ jobs: python3 .github/scripts/trymerge.py --revert "${PR_NUM}" fi fi + - name: Comment on Canceled + if: ${{ cancelled() && steps.checkout.outcome == 'success' }} + continue-on-error: true + env: + GITHUB_TOKEN: ${{ secrets.MERGEBOT_TOKEN }} + PR_NUM: ${{ github.event.client_payload.pr_num }} + run: | + set -ex + python3 .github/scripts/comment_on_pr.py "${PR_NUM}" "revert" concurrency: try-revert diff --git a/.github/workflows/stale_pull_requests.yml b/.github/workflows/stale_pull_requests.yml deleted file mode 100644 index a65e52c27a7c4..0000000000000 --- a/.github/workflows/stale_pull_requests.yml +++ /dev/null @@ -1,42 +0,0 @@ -name: 'Close stale pull requests' -on: - schedule: - # TODO: Reduce frequency once we work through the backlog of pull requests - - cron: '0 * * * *' - workflow_dispatch: - -jobs: - stale: - if: ${{ github.repository == 'pytorch/pytorch' }} - runs-on: ubuntu-18.04 - steps: - - uses: actions/stale@v4.1.0 - with: - stale-pr-message: > - Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as `Stale`.
- Feel free to remove the `Stale` label if you feel this was a mistake. <br>
- `Stale` pull requests will automatically be closed 30 days after being marked `Stale` <br>
- exempt-pr-labels: "no-stale,open source,high priority" - days-before-pr-stale: 60 - days-before-pr-close: 90 - days-before-issue-stale: -1 - days-before-issue-close: -1 - ascending: true - stale-open-source: - if: ${{ github.repository == 'pytorch/pytorch' }} - runs-on: ubuntu-18.04 - steps: - - uses: actions/stale@v4.1.0 - with: - stale-pr-message: > - Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as `Stale`. <br>
- Feel free to remove the `Stale` label if you feel this was a mistake. <br>
- If you are unable to remove the `Stale` label please contact a maintainer in order to do so. <br>
- `Stale` pull requests will automatically be closed 30 days after being marked `Stale` <br>
- exempt-pr-labels: "no-stale,high priority" - only-labels: "open source" - days-before-pr-stale: 60 - days-before-pr-close: 90 - days-before-issue-stale: -1 - days-before-issue-close: -1 - ascending: true diff --git a/.github/workflows/trunk.yml b/.github/workflows/trunk.yml index a31111ecf885f..c4298bcb7acae 100644 --- a/.github/workflows/trunk.yml +++ b/.github/workflows/trunk.yml @@ -60,8 +60,10 @@ jobs: docker-image: ${{ needs.linux-bionic-cuda10_2-py3_9-gcc7-build.outputs.docker-image }} test-matrix: | { include: [ - { config: "default", shard: 1, num_shards: 2, runner: "linux.4xlarge.nvidia.gpu" }, - { config: "default", shard: 2, num_shards: 2, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "default", shard: 1, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "default", shard: 2, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "default", shard: 3, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "default", shard: 4, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, { config: "functorch", shard: 1, num_shards: 1, runner: "linux.4xlarge.nvidia.gpu" }, { config: "slow", shard: 1, num_shards: 2, runner: "linux.4xlarge.nvidia.gpu" }, { config: "slow", shard: 2, num_shards: 2, runner: "linux.4xlarge.nvidia.gpu" }, @@ -94,12 +96,6 @@ jobs: with: build-environment: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build docker-image-name: pytorch-linux-xenial-py3-clang5-android-ndk-r19c - secrets: - SONATYPE_NEXUS_USERNAME: ${{ secrets.SONATYPE_NEXUS_USERNAME }} - SONATYPE_NEXUS_PASSWORD: ${{ secrets.SONATYPE_NEXUS_PASSWORD }} - ANDROID_SIGN_KEY: ${{ secrets.ANDROID_SIGN_KEY }} - ANDROID_SIGN_PASS: ${{ secrets.ANDROID_SIGN_PASS }} - SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }} linux-bionic-py3_7-clang9-slow-build: name: linux-bionic-py3.7-clang9-slow @@ -157,6 +153,7 @@ jobs: { config: "default", shard: 2, num_shards: 2, runner: "macos-12" }, { config: "functorch", shard: 1, num_shards: 1, runner: "macos-12" }, ]} + arch: x86_64 secrets: AWS_OSSCI_METRICS_V2_ACCESS_KEY_ID: ${{ secrets.AWS_OSSCI_METRICS_V2_ACCESS_KEY_ID }} AWS_OSSCI_METRICS_V2_SECRET_ACCESS_KEY: ${{ secrets.AWS_OSSCI_METRICS_V2_SECRET_ACCESS_KEY }} @@ -177,20 +174,40 @@ jobs: name: macos-12-py3-arm64 uses: ./.github/workflows/_mac-build.yml with: + sync-tag: macos-12-py3-arm64-build build-environment: macos-12-py3-arm64 xcode-version: "13.3.1" runner-type: macos-12-xl build-generates-artifacts: true + # To match the one pre-installed in the m1 runners + python_version: 3.9.12 secrets: MACOS_SCCACHE_S3_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }} MACOS_SCCACHE_S3_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }} macos-12-py3-arm64-mps-test: + name: macos-12-py3-arm64-mps + uses: ./.github/workflows/_mac-test-mps.yml + needs: macos-12-py3-arm64-build + with: + sync-tag: macos-12-py3-arm64-mps-test + build-environment: macos-12-py3-arm64 + + macos-12-py3-arm64-test: name: macos-12-py3-arm64 - uses: ./.github/workflows/_mac-test-arm64.yml + uses: ./.github/workflows/_mac-test.yml needs: macos-12-py3-arm64-build with: build-environment: macos-12-py3-arm64 + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 2, runner: "macos-m1-12" }, + { config: "default", shard: 2, num_shards: 2, runner: "macos-m1-12" }, + ]} + arch: arm64 + secrets: + AWS_OSSCI_METRICS_V2_ACCESS_KEY_ID: ${{ secrets.AWS_OSSCI_METRICS_V2_ACCESS_KEY_ID }} + AWS_OSSCI_METRICS_V2_SECRET_ACCESS_KEY: ${{ 
secrets.AWS_OSSCI_METRICS_V2_SECRET_ACCESS_KEY }} win-vs2019-cuda11_6-py3-build: name: win-vs2019-cuda11.6-py3 diff --git a/.github/workflows/trymerge.yml b/.github/workflows/trymerge.yml index 8db7b0c97c5c9..9ba29af660023 100644 --- a/.github/workflows/trymerge.yml +++ b/.github/workflows/trymerge.yml @@ -8,18 +8,24 @@ jobs: do_merge: name: try_merge_pr_${{ github.event.client_payload.pr_num }} runs-on: linux.20_04.4x + env: + GH_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }} steps: - - name: Setup Python - uses: actions/setup-python@v2 - with: - python-version: 3.8 - architecture: x64 - name: Checkout repo + id: checkout uses: actions/checkout@v2 with: fetch-depth: 0 token: ${{ secrets.MERGEBOT_TOKEN }} + - name: Setup Python + uses: actions/setup-python@v4 + with: + python-version: 3.8 + cache: 'pip' + architecture: x64 + - run: pip install pyyaml==6.0 + - name: Setup committer id run: | git config --global user.email "pytorchmergebot@users.noreply.github.com" @@ -28,7 +34,6 @@ jobs: env: GITHUB_TOKEN: ${{ secrets.MERGEBOT_TOKEN }} PR_NUM: ${{ github.event.client_payload.pr_num }} - GH_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }} FORCE: ${{ github.event.client_payload.force}} ON_GREEN: ${{ github.event.client_payload.on_green}} LAND_CHECKS: ${{ github.event.client_payload.land_checks }} @@ -50,6 +55,15 @@ jobs: else python3 .github/scripts/trymerge.py "${PR_NUM}" fi + - name: Comment on Canceled + if: ${{ cancelled() && steps.checkout.outcome == 'success' }} + continue-on-error: true + env: + GITHUB_TOKEN: ${{ secrets.MERGEBOT_TOKEN }} + PR_NUM: ${{ github.event.client_payload.pr_num }} + run: | + set -ex + python3 .github/scripts/comment_on_pr.py "${PR_NUM}" "merge" # We want newer merge commands to supercede old ones concurrency: diff --git a/.github/workflows/tryrebase.yml b/.github/workflows/tryrebase.yml index 748127ff2d626..fed9000c420e9 100644 --- a/.github/workflows/tryrebase.yml +++ b/.github/workflows/tryrebase.yml @@ -7,19 +7,24 @@ on: jobs: do_rebase: runs-on: ubuntu-20.04 + env: + GH_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }} steps: - - name: Setup Python - uses: actions/setup-python@v2 - with: - python-version: 3.8 - architecture: x64 - - name: Checkout repo + id: checkout uses: actions/checkout@v2 with: fetch-depth: 0 token: ${{ secrets.MERGEBOT_TOKEN }} + - name: Setup Python + uses: actions/setup-python@v4 + with: + python-version: 3.8 + architecture: x64 + cache: 'pip' + - run: pip install pyyaml==6.0 + - name: Setup committer id run: | git config --global user.email "pytorchmergebot@users.noreply.github.com" @@ -29,7 +34,6 @@ jobs: env: GITHUB_TOKEN: ${{ secrets.MERGEBOT_TOKEN }} PR_NUM: ${{ github.event.client_payload.pr_num }} - GH_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }} BRANCH: ${{ github.event.client_payload.branch }} run: | set -ex @@ -38,3 +42,12 @@ jobs: else python3 .github/scripts/tryrebase.py "${PR_NUM}" fi + - name: Comment on Canceled + if: ${{ cancelled() && steps.checkout.outcome == 'success' }} + continue-on-error: true + env: + GITHUB_TOKEN: ${{ secrets.MERGEBOT_TOKEN }} + PR_NUM: ${{ github.event.client_payload.pr_num }} + run: | + set -ex + python3 .github/scripts/comment_on_pr.py "${PR_NUM}" "rebase" diff --git a/.github/workflows/update-viablestrict.yml b/.github/workflows/update-viablestrict.yml index 872d8f5c14285..c7bbb17b4fe96 100644 --- 
a/.github/workflows/update-viablestrict.yml +++ b/.github/workflows/update-viablestrict.yml @@ -13,18 +13,22 @@ jobs: do_update_viablestrict: runs-on: ubuntu-20.04 steps: - - name: Setup Python - uses: actions/setup-python@v2 - with: - python-version: 3.8 - architecture: x64 - - name: Checkout repo uses: actions/checkout@v2 with: fetch-depth: 0 token: ${{ secrets.MERGEBOT_TOKEN }} + - name: Setup Python + uses: actions/setup-python@v2 + with: + python-version: 3.8 + architecture: x64 + cache: pip + cache-dependency-path: | + **/.circleci/docker/requirements-ci.txt + **/.github/requirements-gha-cache.txt + - name: Install Python Packages run: | pip3 install rockset==0.8.10 diff --git a/.github/workflows/update_pytorch_labels.yml b/.github/workflows/update_pytorch_labels.yml index f19347070ecef..31bbab78e2f9a 100644 --- a/.github/workflows/update_pytorch_labels.yml +++ b/.github/workflows/update_pytorch_labels.yml @@ -10,7 +10,7 @@ concurrency: jobs: update-labels-in-S3: - runs-on: ubuntu-18.04 + runs-on: ubuntu-22.04 if: ${{ github.repository == 'pytorch/pytorch' }} steps: - name: Checkout PyTorch diff --git a/.github/workflows/update_s3_htmls.yml b/.github/workflows/update_s3_htmls.yml index 5f3ff056c5a4a..d68b58911bed2 100644 --- a/.github/workflows/update_s3_htmls.yml +++ b/.github/workflows/update_s3_htmls.yml @@ -8,7 +8,7 @@ on: jobs: update-html: - runs-on: ubuntu-18.04 + runs-on: ubuntu-22.04 if: ${{ github.repository == 'pytorch/pytorch' }} strategy: matrix: diff --git a/.gitmodules b/.gitmodules index 538967d317641..32c0c205948a3 100644 --- a/.gitmodules +++ b/.gitmodules @@ -148,3 +148,6 @@ [submodule "third_party/nlohmann"] path = third_party/nlohmann url = https://github.com/nlohmann/json.git +[submodule "third_party/VulkanMemoryAllocator"] + path = third_party/VulkanMemoryAllocator + url = https://github.com/GPUOpen-LibrariesAndSDKs/VulkanMemoryAllocator.git diff --git a/.jenkins/caffe2/test.sh b/.jenkins/caffe2/test.sh index 6016911941b5d..0204907ee865d 100755 --- a/.jenkins/caffe2/test.sh +++ b/.jenkins/caffe2/test.sh @@ -173,7 +173,7 @@ fi ############## if [[ "$BUILD_ENVIRONMENT" == *onnx* ]]; then pip install -q --user --no-use-pep517 "git+https://github.com/pytorch/vision.git@$(cat .github/ci_commit_pins/vision.txt)" - pip install -q --user ninja flatbuffers==2.0 numpy==1.21.5 onnxruntime==1.11.0 + pip install -q --user ninja flatbuffers==2.0 numpy==1.21.5 onnxruntime==1.12.1 beartype==0.10.4 # numba requires numpy <= 1.20, onnxruntime requires numpy >= 1.21. # We don't actually need it for our tests, but it's imported if it's present, so uninstall. pip uninstall -q --yes numba diff --git a/.jenkins/pytorch/build.sh b/.jenkins/pytorch/build.sh index d442a4ebd41c2..a215459fcc7e1 100755 --- a/.jenkins/pytorch/build.sh +++ b/.jenkins/pytorch/build.sh @@ -45,6 +45,12 @@ fi if [[ "$BUILD_ENVIRONMENT" == *cuda11* ]]; then # enable split torch_cuda build option in CMake export BUILD_SPLIT_CUDA=ON + if [[ "$BUILD_ENVIRONMENT" != *cuda11.3* && "$BUILD_ENVIRONMENT" != *clang* ]]; then + # TODO: there is a linking issue when building with UCC using clang, + # disable it for now and to be fix later. 
+ export USE_UCC=1 + export USE_SYSTEM_UCC=1 + fi fi if [[ ${BUILD_ENVIRONMENT} == *"caffe2"* || ${BUILD_ENVIRONMENT} == *"onnx"* ]]; then @@ -169,6 +175,10 @@ if [[ "${BUILD_ENVIRONMENT}" == *no-ops* ]]; then export USE_PER_OPERATOR_HEADERS=0 fi +if [[ "${BUILD_ENVIRONMENT}" == *-pch* ]]; then + export USE_PRECOMPILED_HEADERS=1 +fi + if [[ "${BUILD_ENVIRONMENT}" == *linux-focal-py3.7-gcc7-build* ]]; then export USE_GLOO_WITH_OPENSSL=ON fi diff --git a/.jenkins/pytorch/common_utils.sh b/.jenkins/pytorch/common_utils.sh index 0584ddab9e2a0..7b592d57c280b 100644 --- a/.jenkins/pytorch/common_utils.sh +++ b/.jenkins/pytorch/common_utils.sh @@ -117,6 +117,8 @@ function clone_pytorch_xla() { pushd xla # pin the xla hash so that we don't get broken by changes to xla git checkout "$(cat ../.github/ci_commit_pins/xla.txt)" + git submodule sync + git submodule update --init --recursive popd fi } diff --git a/.jenkins/pytorch/macos-build.sh b/.jenkins/pytorch/macos-build.sh index db33e2dedf95b..d40ec521520ba 100755 --- a/.jenkins/pytorch/macos-build.sh +++ b/.jenkins/pytorch/macos-build.sh @@ -35,11 +35,11 @@ fi cross_compile_arm64() { # Cross compilation for arm64 - USE_DISTRIBUTED=1 CMAKE_OSX_ARCHITECTURES=arm64 MACOSX_DEPLOYMENT_TARGET=11.0 USE_MKLDNN=OFF USE_QNNPACK=OFF BUILD_TEST=OFF python setup.py bdist_wheel + USE_DISTRIBUTED=1 CMAKE_OSX_ARCHITECTURES=arm64 MACOSX_DEPLOYMENT_TARGET=11.0 USE_MKLDNN=OFF USE_QNNPACK=OFF WERROR=1 BUILD_TEST=OFF python setup.py bdist_wheel } compile_x86_64() { - USE_DISTRIBUTED=1 python setup.py bdist_wheel + USE_DISTRIBUTED=1 WERROR=1 python setup.py bdist_wheel } build_lite_interpreter() { diff --git a/.jenkins/pytorch/macos-common.sh b/.jenkins/pytorch/macos-common.sh index 3dc7d0f17e167..4df378d505ecb 100755 --- a/.jenkins/pytorch/macos-common.sh +++ b/.jenkins/pytorch/macos-common.sh @@ -7,19 +7,34 @@ source "$(dirname "${BASH_SOURCE[0]}")/common.sh" sysctl -a | grep machdep.cpu -# NOTE: mkl 2021.3.0+ cmake requires sub-command PREPEND, may break the build -retry conda install -y \ - mkl=2021.2.0 \ - mkl-include=2021.2.0 \ - numpy=1.18.5 \ - pyyaml=5.3 \ - setuptools=46.0.0 \ - cmake=3.19 \ - cffi \ - ninja \ - typing_extensions \ - dataclasses \ - pip +if [[ ${BUILD_ENVIRONMENT} = *arm64* ]]; then + # We use different versions here as the arm build/tests runs on python 3.9 + # while the x86 one runs on python 3.8 + retry conda install -y \ + numpy=1.22.3 \ + pyyaml=6.0 \ + setuptools=61.2.0 \ + cmake=3.22.1 \ + cffi \ + ninja \ + typing_extensions \ + dataclasses \ + pip +else + # NOTE: mkl 2021.3.0+ cmake requires sub-command PREPEND, may break the build + retry conda install -y \ + mkl=2021.2.0 \ + mkl-include=2021.2.0 \ + numpy=1.18.5 \ + pyyaml=5.3 \ + setuptools=46.0.0 \ + cmake=3.19 \ + cffi \ + ninja \ + typing_extensions \ + dataclasses \ + pip +fi # The torch.hub tests make requests to GitHub. 
# diff --git a/.jenkins/pytorch/macos-test.sh b/.jenkins/pytorch/macos-test.sh index 1b15fab1ed205..a30e16ba942ee 100755 --- a/.jenkins/pytorch/macos-test.sh +++ b/.jenkins/pytorch/macos-test.sh @@ -5,14 +5,20 @@ source "$(dirname "${BASH_SOURCE[0]}")/macos-common.sh" conda install -y six -pip install -q hypothesis "expecttest==0.1.3" "librosa>=0.6.2" "numba<=0.49.1" psutil "scipy==1.6.3" +if [[ ${BUILD_ENVIRONMENT} = *arm64* ]]; then + pip install hypothesis "expecttest==0.1.3" "librosa>=0.6.2" "numba==0.56.0" psutil "scipy==1.9.0" +else + pip install hypothesis "expecttest==0.1.3" "librosa>=0.6.2" "numba<=0.49.1" psutil "scipy==1.6.3" +fi # TODO move this to docker # Pin unittest-xml-reporting to freeze printing test summary logic, related: https://github.com/pytorch/pytorch/issues/69014 pip install "unittest-xml-reporting<=3.2.0,>=2.0.0" \ pytest \ pytest-xdist \ - pytest-rerunfailures + pytest-rerunfailures \ + "xdoctest==1.0.2" \ + "pygments==2.12.0" if [ -z "${CI}" ]; then rm -rf "${WORKSPACE_DIR}"/miniconda3/lib/python3.6/site-packages/torch* @@ -32,14 +38,15 @@ if [ -z "${CI}" ]; then 7z x "${IMAGE_COMMIT_TAG}".7z -o"${WORKSPACE_DIR}/miniconda3/lib/python3.6/site-packages" fi -# Test that OpenMP is enabled -pushd test -if [[ ! $(python -c "import torch; print(int(torch.backends.openmp.is_available()))") == "1" ]]; then - echo "Build should have OpenMP enabled, but torch.backends.openmp.is_available() is False" - exit 1 +# Test that OpenMP is enabled for non-arm64 build +if [[ ${BUILD_ENVIRONMENT} != *arm64* ]]; then + pushd test + if [[ ! $(python -c "import torch; print(int(torch.backends.openmp.is_available()))") == "1" ]]; then + echo "Build should have OpenMP enabled, but torch.backends.openmp.is_available() is False" + exit 1 + fi + popd fi -popd - setup_test_python() { # The CircleCI worker hostname doesn't resolve to an address. 
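The arm64 branch above pins newer numba/scipy for the Python 3.9 M1 runners and skips the OpenMP assertion for that build. A quick, purely illustrative way to check both conditions by hand (not part of the CI scripts):

```
# Which architecture is this interpreter running on? (arm64 vs x86_64)
python -c "import platform; print(platform.machine())"
# Does the installed torch report OpenMP support (the assertion skipped on arm64)?
python -c "import torch; print(int(torch.backends.openmp.is_available()))"
```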
@@ -165,7 +172,7 @@ test_jit_hooks() { test_dynamo() { pushd ../torchdynamo - pytest tests + pytest test popd } diff --git a/.jenkins/pytorch/multigpu-test.sh b/.jenkins/pytorch/multigpu-test.sh index d75d701e8e18b..bbd1c370a638e 100755 --- a/.jenkins/pytorch/multigpu-test.sh +++ b/.jenkins/pytorch/multigpu-test.sh @@ -7,7 +7,7 @@ # shellcheck source=./common.sh source "$(dirname "${BASH_SOURCE[0]}")/common.sh" -echo "Testing pytorch (distributed only)" +echo "Testing pytorch" if [ -n "${CI}" ]; then # TODO move this to docker # Pin unittest-xml-reporting to freeze printing test summary logic, related: https://github.com/pytorch/pytorch/issues/69014 @@ -48,4 +48,6 @@ time python test/run_test.py --verbose -i distributed/_shard/sharded_tensor/ops/ time python test/run_test.py --verbose -i distributed/_shard/sharded_optim/test_sharded_optim time python test/run_test.py --verbose -i distributed/_shard/test_partial_tensor time python test/run_test.py --verbose -i distributed/_shard/test_replicated_tensor +# Other tests +time python test/run_test.py --verbose -i test_cuda_primary_ctx assert_git_not_dirty diff --git a/.jenkins/pytorch/test.sh b/.jenkins/pytorch/test.sh index b476d25250791..51bf9c8f98fc4 100755 --- a/.jenkins/pytorch/test.sh +++ b/.jenkins/pytorch/test.sh @@ -163,7 +163,9 @@ test_python_shard() { echo "NUM_TEST_SHARDS must be defined to run a Python test shard" exit 1 fi + time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests --shard "$1" "$NUM_TEST_SHARDS" --verbose + assert_git_not_dirty } @@ -178,6 +180,8 @@ test_dynamo_shard() { echo "NUM_TEST_SHARDS must be defined to run a Python test shard" exit 1 fi + # Temporarily disable test_fx for dynamo pending the investigation on TTS + # regression in https://github.com/pytorch/torchdynamo/issues/784 time python test/run_test.py \ --exclude-jit-executor \ --exclude-distributed-tests \ @@ -194,6 +198,9 @@ test_dynamo_shard() { test_profiler_tree \ test_overrides \ test_python_dispatch \ + test_fx \ + test_package \ + test_vmap \ --shard "$1" "$NUM_TEST_SHARDS" \ --verbose assert_git_not_dirty @@ -592,7 +599,7 @@ test_vec256() { test_dynamo() { pushd ../torchdynamo - pytest tests + pytest test popd } diff --git a/.jenkins/pytorch/win-test-helpers/build_pytorch.bat b/.jenkins/pytorch/win-test-helpers/build_pytorch.bat index b954430734b02..7edeca96ed8d0 100644 --- a/.jenkins/pytorch/win-test-helpers/build_pytorch.bat +++ b/.jenkins/pytorch/win-test-helpers/build_pytorch.bat @@ -29,7 +29,9 @@ call %INSTALLER_DIR%\install_sccache.bat if errorlevel 1 exit /b if not errorlevel 0 exit /b -call %INSTALLER_DIR%\install_miniconda3.bat +:: Miniconda has been installed as part of the Windows AMI with all the dependencies. 
+:: We just need to activate it here +call %INSTALLER_DIR%\activate_miniconda3.bat if errorlevel 1 exit /b if not errorlevel 0 exit /b diff --git a/.jenkins/pytorch/win-test-helpers/installation-helpers/install_miniconda3.bat b/.jenkins/pytorch/win-test-helpers/installation-helpers/activate_miniconda3.bat similarity index 65% rename from .jenkins/pytorch/win-test-helpers/installation-helpers/install_miniconda3.bat rename to .jenkins/pytorch/win-test-helpers/installation-helpers/activate_miniconda3.bat index 54b954a0503f1..e6660a17b3890 100644 --- a/.jenkins/pytorch/win-test-helpers/installation-helpers/install_miniconda3.bat +++ b/.jenkins/pytorch/win-test-helpers/installation-helpers/activate_miniconda3.bat @@ -4,24 +4,32 @@ if "%BUILD_ENVIRONMENT%"=="" ( set CONDA_PARENT_DIR=C:\Jenkins ) -if "%REBUILD%"=="" set INSTALL_FRESH_CONDA=1 -if NOT "%BUILD_ENVIRONMENT%"=="" set INSTALL_FRESH_CONDA=1 + +:: Be conservative here when rolling out the new AMI with conda. This will try +:: to install conda as before if it couldn't find the conda installation. This +:: can be removed eventually after we gain enough confidence in the AMI +if not exist %CONDA_PARENT_DIR%\Miniconda3 ( + set INSTALL_FRESH_CONDA=1 +) if "%INSTALL_FRESH_CONDA%"=="1" ( - IF EXIST %CONDA_PARENT_DIR%\Miniconda3 ( rd /s /q %CONDA_PARENT_DIR%\Miniconda3 ) curl --retry 3 -k https://repo.anaconda.com/miniconda/Miniconda3-latest-Windows-x86_64.exe --output %TMP_DIR_WIN%\Miniconda3-latest-Windows-x86_64.exe if errorlevel 1 exit /b if not errorlevel 0 exit /b + %TMP_DIR_WIN%\Miniconda3-latest-Windows-x86_64.exe /InstallationType=JustMe /RegisterPython=0 /S /AddToPath=0 /D=%CONDA_PARENT_DIR%\Miniconda3 if errorlevel 1 exit /b if not errorlevel 0 exit /b ) +:: Activate conda so that we can use its commands, i.e. conda, python, pip call %CONDA_PARENT_DIR%\Miniconda3\Scripts\activate.bat %CONDA_PARENT_DIR%\Miniconda3 + if "%INSTALL_FRESH_CONDA%"=="1" ( - call conda install -y -q python=%PYTHON_VERSION% numpy"<1.23" cffi pyyaml boto3 libuv + call conda install -y -q numpy"<1.23" cffi pyyaml boto3 libuv if errorlevel 1 exit /b if not errorlevel 0 exit /b + call conda install -y -q -c conda-forge cmake=3.22.3 if errorlevel 1 exit /b if not errorlevel 0 exit /b diff --git a/.jenkins/pytorch/win-test-helpers/setup_pytorch_env.bat b/.jenkins/pytorch/win-test-helpers/setup_pytorch_env.bat index 90725b7666a33..79e8aedfab75c 100644 --- a/.jenkins/pytorch/win-test-helpers/setup_pytorch_env.bat +++ b/.jenkins/pytorch/win-test-helpers/setup_pytorch_env.bat @@ -7,10 +7,10 @@ set PATH=C:\Program Files\CMake\bin;C:\Program Files\7-Zip;C:\ProgramData\chocol :: Install Miniconda3 set INSTALLER_DIR=%SCRIPT_HELPERS_DIR%\installation-helpers -call :retry %INSTALLER_DIR%\install_miniconda3.bat -:retry -call %* || (powershell -nop -c "& {sleep 1}" && call %*) || (powershell -nop -c "& {sleep 2}" && call %*) +:: Miniconda has been installed as part of the Windows AMI with all the dependencies. 
+:: We just need to activate it here +call %INSTALLER_DIR%\activate_miniconda3.bat if errorlevel 1 exit /b if not errorlevel 0 exit /b @@ -36,7 +36,7 @@ popd ======= :: Pin unittest-xml-reporting to freeze printing test summary logic, related: https://github.com/pytorch/pytorch/issues/69014 -pip install "ninja==1.10.0.post1" future "hypothesis==5.35.1" "expecttest==0.1.3" "librosa>=0.6.2" "scipy==1.6.3" psutil pillow "unittest-xml-reporting<=3.2.0,>=2.0.0" pytest pytest-xdist pytest-rerunfailures +pip install "ninja==1.10.0.post1" future "hypothesis==5.35.1" "expecttest==0.1.3" "librosa>=0.6.2" "scipy==1.6.3" psutil pillow "unittest-xml-reporting<=3.2.0,>=2.0.0" pytest pytest-xdist pytest-rerunfailures "xdoctest==1.0.2" "pygments==2.12.0" if errorlevel 1 exit /b if not errorlevel 0 exit /b diff --git a/.lintrunner.toml b/.lintrunner.toml index 02b02d1aaf06e..b2fa676f8e13c 100644 --- a/.lintrunner.toml +++ b/.lintrunner.toml @@ -102,6 +102,7 @@ exclude_patterns = [ 'torch/distributed/elastic/agent/server/api.py', 'torch/testing/_internal/**', 'torch/distributed/fsdp/fully_sharded_data_parallel.py', + 'torch/distributed/distributed_c10d.py', # TODO(suo): these exclusions were added just to get lint clean on master. # Follow up to do more target suppressions and remove them. 'torch/distributed/fsdp/flatten_params_wrapper.py', @@ -718,7 +719,10 @@ include_patterns = [ 'torch/_refs/**/*.py', 'torch/_subclasses/**/*.py', 'torch/_*.py', + 'torch/testing/_internal/opinfo/**/*.py', 'torchgen/**/*.py', + 'functorch/functorch/_src/aot_autograd.py', + 'functorch/functorch/_src/compilers.py', ] command = [ 'python3', diff --git a/BUILD.bazel b/BUILD.bazel index 823a59bb63b75..4c0791bffbb4a 100644 --- a/BUILD.bazel +++ b/BUILD.bazel @@ -1877,6 +1877,7 @@ test_suite( "aten/src/ATen/templates/LazyIr.h", "aten/src/ATen/templates/LazyNonNativeIr.h", "aten/src/ATen/templates/RegisterDispatchKey.cpp", + "aten/src/ATen/templates/RegisterDispatchDefinitions.ini", "aten/src/ATen/native/native_functions.yaml", "aten/src/ATen/native/tags.yaml", "aten/src/ATen/native/ts_native_functions.yaml", diff --git a/CMakeLists.txt b/CMakeLists.txt index 38a430ee7287c..9b6fedca3f719 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -43,7 +43,7 @@ set(CMAKE_C_STANDARD 11 CACHE STRING "The C standard whose features are reques if(DEFINED GLIBCXX_USE_CXX11_ABI) if(${GLIBCXX_USE_CXX11_ABI} EQUAL 1) set(CXX_STANDARD_REQUIRED ON) - set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -D_GLIBCXX_USE_CXX11_ABI=1") + string(APPEND CMAKE_CXX_FLAGS " -D_GLIBCXX_USE_CXX11_ABI=1") else() # Please note this is required in order to ensure compatibility between gcc 9 and gcc 7 # This could be removed when all Linux PyTorch binary builds are compiled by the same toolchain again @@ -186,7 +186,7 @@ cmake_dependent_option( INSTALL_TEST "Install test binaries if BUILD_TEST is on" ON "BUILD_TEST" OFF) option(USE_CPP_CODE_COVERAGE "Compile C/C++ with code coverage flags" OFF) -option(COLORIZE_OUTPUT "Colorize output during compilation" ON) +option(USE_COLORIZE_OUTPUT "Colorize output during compilation" ON) option(USE_ASAN "Use Address Sanitizer" OFF) option(USE_TSAN "Use Thread Sanitizer" OFF) option(USE_CUDA "Use CUDA" ON) @@ -209,8 +209,8 @@ cmake_dependent_option( USE_STATIC_CUDNN "Use cuDNN static libraries" OFF "USE_CUDNN" OFF) cmake_dependent_option( - BUILD_NVFUSER_BENCHMARK "Build C++ binaries for nvfuser benchmarks" ON - "USE_CUDA;BUILD_TEST" OFF) + BUILD_NVFUSER_BENCHMARK "Build C++ binaries for nvfuser benchmarks" OFF + "USE_CUDA" OFF) 
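Two build-option changes above are easy to miss: `COLORIZE_OUTPUT` is renamed to `USE_COLORIZE_OUTPUT`, and `BUILD_NVFUSER_BENCHMARK` now defaults to OFF. A hedged sketch of opting back in locally, assuming the usual `USE_*`/`BUILD_*` environment passthrough to CMake; the direct `-D` form is the unambiguous equivalent:

```
# Environment-variable style, matching the `VAR=1 python setup.py ...` builds above
# (assumes the USE_*/BUILD_* environment passthrough to CMake):
USE_COLORIZE_OUTPUT=1 BUILD_NVFUSER_BENCHMARK=1 python setup.py develop
# Direct CMake configure equivalent from a build directory (illustrative):
cmake -DUSE_COLORIZE_OUTPUT=ON -DBUILD_NVFUSER_BENCHMARK=ON -DUSE_CUDA=ON ..
```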
cmake_dependent_option( USE_EXPERIMENTAL_CUDNN_V8_API "Use experimental cuDNN v8 API" ON "USE_CUDNN" OFF) @@ -799,22 +799,22 @@ if(NOT MSVC) # Details at http://eigen.tuxfamily.org/bz/show_bug.cgi?id=1459 string(APPEND CMAKE_CXX_FLAGS " -Wall") string(APPEND CMAKE_CXX_FLAGS " -Wextra") - string(APPEND CMAKE_CXX_FLAGS " -Werror=return-type") + append_cxx_flag_if_supported("-Werror=return-type" CMAKE_CXX_FLAGS) if(NOT USE_CUDNN) # Temporary fix to ignore non virtual dtor error if cudnn is used. A # separate PR to cudnn_frontend is needed to address this later on - string(APPEND CMAKE_CXX_FLAGS " -Werror=non-virtual-dtor") + append_cxx_flag_if_supported("-Werror=non-virtual-dtor" CMAKE_CXX_FLAGS) endif() - string(APPEND CMAKE_CXX_FLAGS " -Wno-missing-field-initializers") - string(APPEND CMAKE_CXX_FLAGS " -Wno-type-limits") - string(APPEND CMAKE_CXX_FLAGS " -Wno-array-bounds") - string(APPEND CMAKE_CXX_FLAGS " -Wno-unknown-pragmas") - string(APPEND CMAKE_CXX_FLAGS " -Wno-unused-parameter") - string(APPEND CMAKE_CXX_FLAGS " -Wno-unused-function") - string(APPEND CMAKE_CXX_FLAGS " -Wno-unused-result") - string(APPEND CMAKE_CXX_FLAGS " -Wno-strict-overflow") - string(APPEND CMAKE_CXX_FLAGS " -Wno-strict-aliasing") - string(APPEND CMAKE_CXX_FLAGS " -Wno-error=deprecated-declarations") + append_cxx_flag_if_supported("-Wno-missing-field-initializers" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-type-limits" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-array-bounds" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-unknown-pragmas" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-unused-parameter" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-unused-function" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-unused-result" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-strict-overflow" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-strict-aliasing" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-error=deprecated-declarations" CMAKE_CXX_FLAGS) if("${CMAKE_CXX_COMPILER_ID}" MATCHES "Clang") string(APPEND CMAKE_CXX_FLAGS " -Wno-range-loop-analysis") string(APPEND CMAKE_CXX_FLAGS " -Wno-pass-failed") @@ -855,32 +855,31 @@ if(NOT MSVC) endif() endif() - string(APPEND CMAKE_CXX_FLAGS " -Wno-error=pedantic") - string(APPEND CMAKE_CXX_FLAGS " -Wno-error=redundant-decls") - string(APPEND CMAKE_CXX_FLAGS " -Wno-error=old-style-cast") + append_cxx_flag_if_supported("-Wno-error=pedantic" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-error=redundant-decls" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-error=old-style-cast" CMAKE_CXX_FLAGS) # These flags are not available in GCC-4.8.5. Set only when using clang. 
# Compared against https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/Option-Summary.html if("${CMAKE_CXX_COMPILER_ID}" MATCHES "Clang") - string(APPEND CMAKE_CXX_FLAGS " -Wconstant-conversion") - string(APPEND CMAKE_CXX_FLAGS " -Wno-invalid-partial-specialization") - string(APPEND CMAKE_CXX_FLAGS " -Wno-typedef-redefinition") - string(APPEND CMAKE_CXX_FLAGS " -Wno-unknown-warning-option") - string(APPEND CMAKE_CXX_FLAGS " -Wno-unused-private-field") - string(APPEND CMAKE_CXX_FLAGS " -Wno-inconsistent-missing-override") - string(APPEND CMAKE_CXX_FLAGS " -Wno-aligned-allocation-unavailable") - string(APPEND CMAKE_CXX_FLAGS " -Wno-c++14-extensions") - string(APPEND CMAKE_CXX_FLAGS " -Wno-constexpr-not-const") - string(APPEND CMAKE_CXX_FLAGS " -Wno-missing-braces") - string(APPEND CMAKE_CXX_FLAGS " -Qunused-arguments") - if(${COLORIZE_OUTPUT}) - string(APPEND CMAKE_CXX_FLAGS " -fcolor-diagnostics") + append_cxx_flag_if_supported("-Wconstant-conversion" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-invalid-partial-specialization" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-typedef-redefinition" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-unused-private-field" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-inconsistent-missing-override" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-aligned-allocation-unavailable" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-c++14-extensions" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-constexpr-not-const" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-missing-braces" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Qunused-arguments" CMAKE_CXX_FLAGS) + if(${USE_COLORIZE_OUTPUT}) endif() endif() - if("${CMAKE_CXX_COMPILER_ID}" STREQUAL "GNU" AND CMAKE_CXX_COMPILER_VERSION VERSION_GREATER 4.9) - if(${COLORIZE_OUTPUT}) - string(APPEND CMAKE_CXX_FLAGS " -fdiagnostics-color=always") - endif() + + if(${USE_COLORIZE_OUTPUT}) + append_cxx_flag_if_supported("-fcolor-diagnostics" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-fdiagnostics-color=always" CMAKE_CXX_FLAGS) endif() + if((APPLE AND (NOT ("${CLANG_VERSION_STRING}" VERSION_LESS "9.0"))) OR(CMAKE_COMPILER_IS_GNUCXX AND(CMAKE_CXX_COMPILER_VERSION VERSION_GREATER 7.0 AND NOT APPLE))) @@ -895,21 +894,15 @@ if(NOT MSVC) endif() endif(WERROR) if(NOT APPLE) - string(APPEND CMAKE_CXX_FLAGS " -Wno-unused-but-set-variable") - string(APPEND CMAKE_CXX_FLAGS " -Wno-maybe-uninitialized") + append_cxx_flag_if_supported("-Wno-unused-but-set-variable" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-maybe-uninitialized" CMAKE_CXX_FLAGS) endif() string(APPEND CMAKE_CXX_FLAGS_DEBUG " -fno-omit-frame-pointer -O0") string(APPEND CMAKE_LINKER_FLAGS_DEBUG " -fno-omit-frame-pointer -O0") - string(APPEND CMAKE_CXX_FLAGS " -fno-math-errno") - string(APPEND CMAKE_CXX_FLAGS " -fno-trapping-math") - check_cxx_compiler_flag("-Werror=format" HAS_WERROR_FORMAT) - if(HAS_WERROR_FORMAT) - string(APPEND CMAKE_CXX_FLAGS " -Werror=format") - endif() - check_cxx_compiler_flag("-Werror=cast-function-type" HAS_WERROR_CAST_FUNCTION_TYPE) - if(HAS_WERROR_CAST_FUNCTION_TYPE) - string(APPEND CMAKE_CXX_FLAGS " -Werror=cast-function-type") - endif() + append_cxx_flag_if_supported("-fno-math-errno" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-fno-trapping-math" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Werror=format" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Werror=cast-function-type" CMAKE_CXX_FLAGS) check_cxx_compiler_flag("-Werror=sign-compare" 
HAS_WERROR_SIGN_COMPARE) # This doesn't work globally so we use the test on specific # target_compile_options @@ -970,20 +963,20 @@ if(APPLE) if(USE_MPS) string(APPEND CMAKE_CXX_FLAGS " -DUSE_MPS -fno-objc-arc") string(APPEND CMAKE_SHARED_LINKER_FLAGS " -weak_framework Foundation -weak_framework MetalPerformanceShaders -weak_framework MetalPerformanceShadersGraph -weak_framework Metal") + # To suppress MPSGraph availability warnings + append_cxx_flag_if_supported("-Wno-unguarded-availability-new" CMAKE_CXX_FLAGS) endif() - string(APPEND CMAKE_CXX_FLAGS " -Wno-unused-private-field") - string(APPEND CMAKE_CXX_FLAGS " -Wno-missing-braces") - string(APPEND CMAKE_CXX_FLAGS " -Wno-c++14-extensions") - string(APPEND CMAKE_CXX_FLAGS " -Wno-constexpr-not-const") + append_cxx_flag_if_supported("-Wno-unused-private-field" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-missing-braces" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-c++14-extensions" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-constexpr-not-const" CMAKE_CXX_FLAGS) endif() if(EMSCRIPTEN) string(APPEND CMAKE_CXX_FLAGS " -Wno-implicit-function-declaration -DEMSCRIPTEN -s DISABLE_EXCEPTION_CATCHING=0") endif() -if(CMAKE_COMPILER_IS_GNUCXX AND CMAKE_CXX_COMPILER_VERSION VERSION_GREATER 7.0.0) - string(APPEND CMAKE_CXX_FLAGS " -Wno-stringop-overflow") -endif() +append_cxx_flag_if_supported("-Wno-stringop-overflow" CMAKE_CXX_FLAGS) if(ANDROID AND (NOT ANDROID_DEBUG_SYMBOLS)) if(CMAKE_COMPILER_IS_GNUCXX) diff --git a/CODEOWNERS b/CODEOWNERS index 1bb8efe9de0b9..a467820fb0759 100644 --- a/CODEOWNERS +++ b/CODEOWNERS @@ -1,5 +1,11 @@ +# IMPORTANT: +# This file is ONLY used to subscribe for the notifications for a +# PRs related to a specific files. People in this file are +# notrequire the approval for your changes. + # This is a comment. # Each line is a file pattern followed by one or more owners. +# For module labels => owners mapping, please see https://github.com/pytorch/pytorch/issues/24422. /torch/utils/cpp_extension.py @fmassa @soumith @ezyang @@ -8,7 +14,7 @@ /torch/csrc/autograd/ @albanD @soulitzer /torch/autograd/ @albanD @soulitzer /tools/autograd/ @albanD @soulitzer -/torch/nn/ @albanD @jbschlosser +/torch/nn/ @albanD @jbschlosser @saketh-are /torch/optim/ @albanD /test/test_public_bindings.py @albanD /test/allowlist_for_publicAPI.json @albanD @anjali411 @@ -77,3 +83,9 @@ test/test_type_promotion.py @mruberry @ngimel test/test_mps.py @kulinseth aten/src/ATen/mps/ @kulinseth aten/src/ATen/native/mps/ @kulinseth + +# Profiler +torch/csrc/autograd/profiler* @robieta +torch/autograd/profiler* @robieta +torch/csrc/profiler/ @robieta +torch/profiler/ @robieta diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 7b4a1246d002d..a007cedbdcac1 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -17,6 +17,7 @@ - [C++ Unit Testing](#c-unit-testing) - [Run Specific CI Jobs](#run-specific-ci-jobs) - [Writing documentation](#writing-documentation) + - [Docstring type formatting](#docstring-type-formatting) - [Building documentation](#building-documentation) - [Tips](#tips) - [Building C++ Documentation](#building-c-documentation) @@ -447,9 +448,47 @@ If you're interested in adding new developer docs, please read this [page on the The rest of this section is about user-facing documentation. -PyTorch uses [Google style](http://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html) +PyTorch uses [Google style](https://www.sphinx-doc.org/en/master/usage/extensions/example_google.html) for formatting docstrings. 
Each line inside a docstrings block must be limited to 80 characters so that it fits into Jupyter documentation popups. + +### Docstring type formatting + +In addition to the standard Google Style docstring formatting rules, the following guidelines should be followed for docstring types (docstring types are the type information contained in the round brackets after the variable name): + +* The "`Callable`", "`Any`", "`Iterable`", "`Iterator`", "`Generator`" types should have their first letter capitalized. + +* The "`list`" and "`tuple`" types should be completely lowercase. + +* Types should not be made plural. For example: `tuple of int` should be used instead of `tuple of ints`. + +* The only acceptable delimiter words for types are `or` and `of`. No other non-type words should be used other than `optional`. + +* The word `optional` should only be used after the types, and it is only used if the user does not have to specify a value for the variable. Default values are listed after the variable description. Example: + + ``` + my_var (int, optional): Variable description. Default: 1 + ``` + +* Basic Python types should match their type name so that the [Intersphinx](https://www.sphinx-doc.org/en/master/usage/extensions/intersphinx.html) extension can correctly identify them. For example: + * Use `str` instead of `string`. + * Use `bool` instead of `boolean`. + * Use `dict` instead of `dictionary`. + +* Square brackets should be used for the dictionary type. For example: + + ``` + my_var (dict[str, int]): Variable description. + ``` + +* If a variable has two different possible types, then the word `or` should be used without a comma. Otherwise variables with 3 or more types should use commas to separate the types. Example: + + ``` + x (type1 or type2): Variable description. + y (type1, type2, or type3): Variable description. + ``` + + ### Building documentation To build the documentation: diff --git a/Dockerfile b/Dockerfile index 1bd522a624067..815a9108ce946 100644 --- a/Dockerfile +++ b/Dockerfile @@ -11,8 +11,7 @@ ARG BASE_IMAGE=ubuntu:18.04 ARG PYTHON_VERSION=3.8 FROM ${BASE_IMAGE} as dev-base -RUN --mount=type=cache,id=apt-dev,target=/var/cache/apt \ - apt-get update && apt-get install -y --no-install-recommends \ +RUN apt-get update && apt-get install -y --no-install-recommends \ build-essential \ ca-certificates \ ccache \ @@ -28,9 +27,16 @@ ENV PATH /opt/conda/bin:$PATH FROM dev-base as conda ARG PYTHON_VERSION=3.8 +# Automatically set by buildx +ARG TARGETPLATFORM +# translating Docker's TARGETPLATFORM into miniconda arches +RUN case ${TARGETPLATFORM} in \ + "linux/arm64") MINICONDA_ARCH=aarch64 ;; \ + *) MINICONDA_ARCH=x86_64 ;; \ + esac && \ + curl -fsSL -v -o ~/miniconda.sh -O "https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-${MINICONDA_ARCH}.sh" COPY requirements.txt . 
-RUN curl -fsSL -v -o ~/miniconda.sh -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \ - chmod +x ~/miniconda.sh && \ +RUN chmod +x ~/miniconda.sh && \ ~/miniconda.sh -b -p /opt/conda && \ rm ~/miniconda.sh && \ /opt/conda/bin/conda install -y python=${PYTHON_VERSION} cmake conda-build pyyaml numpy ipython && \ @@ -57,15 +63,21 @@ ARG CUDA_VERSION=11.3 ARG CUDA_CHANNEL=nvidia ARG INSTALL_CHANNEL=pytorch-nightly ENV CONDA_OVERRIDE_CUDA=${CUDA_VERSION} -RUN /opt/conda/bin/conda install -c "${INSTALL_CHANNEL}" -c "${CUDA_CHANNEL}" -y python=${PYTHON_VERSION} pytorch torchvision torchtext "cudatoolkit=${CUDA_VERSION}" && \ +# Automatically set by buildx +RUN /opt/conda/bin/conda install -c "${INSTALL_CHANNEL}" -y python=${PYTHON_VERSION} +ARG TARGETPLATFORM +# On arm64 we can only install wheel packages +RUN case ${TARGETPLATFORM} in \ + "linux/arm64") pip install --extra-index-url https://download.pytorch.org/whl/cpu/ torch torchvision torchtext ;; \ + *) /opt/conda/bin/conda install -c "${INSTALL_CHANNEL}" -c "${CUDA_CHANNEL}" -y "python=${PYTHON_VERSION}" pytorch torchvision torchtext "cudatoolkit=${CUDA_VERSION}" ;; \ + esac && \ /opt/conda/bin/conda clean -ya RUN /opt/conda/bin/pip install torchelastic FROM ${BASE_IMAGE} as official ARG PYTORCH_VERSION LABEL com.nvidia.volumes.needed="nvidia_driver" -RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \ - apt-get update && apt-get install -y --no-install-recommends \ +RUN apt-get update && apt-get install -y --no-install-recommends \ ca-certificates \ libjpeg-dev \ libpng-dev && \ diff --git a/WORKSPACE b/WORKSPACE index d26dfca5a3336..61abbdac2b239 100644 --- a/WORKSPACE +++ b/WORKSPACE @@ -88,6 +88,7 @@ new_local_repository( name = "fbgemm", build_file = "//third_party:fbgemm/BUILD.bazel", path = "third_party/fbgemm", + repo_mapping = {"@cpuinfo" : "@org_pytorch_cpuinfo"} ) new_local_repository( @@ -103,8 +104,8 @@ new_local_repository( ) new_local_repository( - name = "cpuinfo", - build_file = "//third_party:cpuinfo.BUILD", + name = "org_pytorch_cpuinfo", + build_file = "//third_party:cpuinfo/BUILD.bazel", path = "third_party/cpuinfo", ) diff --git a/aten/CMakeLists.txt b/aten/CMakeLists.txt index 9c3757f346cda..777b13b4dcf09 100644 --- a/aten/CMakeLists.txt +++ b/aten/CMakeLists.txt @@ -55,7 +55,7 @@ set(TH_CPU_INCLUDE list(APPEND ATen_CPU_INCLUDE ${TH_CPU_INCLUDE}) if(USE_VULKAN) - list(APPEND ATen_CPU_INCLUDE ${CMAKE_BINARY_DIR}/vulkan) + list(APPEND ATen_CPU_INCLUDE ${CMAKE_BINARY_DIR}/vulkan ${CMAKE_CURRENT_SOURCE_DIR}/../third_party/VulkanMemoryAllocator) endif() # Find the HIP package, set the HIP paths, load the HIP CMake. 
diff --git a/aten/src/ATen/BatchingRegistrations.cpp b/aten/src/ATen/BatchingRegistrations.cpp index a269f82fa8176..28de70636700c 100644 --- a/aten/src/ATen/BatchingRegistrations.cpp +++ b/aten/src/ATen/BatchingRegistrations.cpp @@ -189,10 +189,6 @@ Tensor expand_symint_batching_rule(const Tensor& self, SymIntArrayRef psize, boo return self.expand(asIntArrayRefSlow(psize), implicit); } -Tensor sum_symint_batching_rule(const Tensor& input_t, c10::SymIntArrayRef dim, bool keepdim, optional opt_dtype) { - return input_t.sum(c10::asIntArrayRefSlow(dim), keepdim, opt_dtype); -} - std::vector chunk_batching_rule(const Tensor& self, int64_t chunks, int64_t dim) { auto self_physical = MultiBatchVmapTransform::logicalToPhysical(self); auto dim_physical = self_physical.getPhysicalDim(dim); @@ -1100,7 +1096,6 @@ TORCH_LIBRARY_IMPL(aten, Batched, m) { m.impl("_new_zeros_with_same_feature_meta", _new_zeros_with_same_feature_meta_batching_rule); m.impl("sum.dim_IntList", sum_batching_rule); - m.impl("sum.SymInt", sum_symint_batching_rule); m.impl("is_complex", native::is_complex); // inplace operations diff --git a/aten/src/ATen/Context.h b/aten/src/ATen/Context.h index 8f3928376473d..b21f32b9021a2 100644 --- a/aten/src/ATen/Context.h +++ b/aten/src/ATen/Context.h @@ -253,7 +253,11 @@ class TORCH_API Context { bool deterministic_cudnn = false; bool _deterministic_algorithms = false; bool _deterministic_algorithms_warn_only = false; +#ifdef USE_ROCM + bool benchmark_cudnn = true; +#else bool benchmark_cudnn = false; +#endif Float32MatmulPrecision float32_matmul_precision = at::Float32MatmulPrecision::HIGHEST; int benchmark_limit_cudnn = 10; diff --git a/aten/src/ATen/DLConvertor.cpp b/aten/src/ATen/DLConvertor.cpp index fb3f3596e1fe0..54df9d631d14f 100644 --- a/aten/src/ATen/DLConvertor.cpp +++ b/aten/src/ATen/DLConvertor.cpp @@ -215,11 +215,22 @@ void deleter(DLManagedTensor* arg) { // This function returns a shared_ptr to memory managed DLpack tensor // constructed out of ATen tensor DLManagedTensor* toDLPack(const Tensor& src) { + // create a new tensor with possibly normalized strides + // gh-83069 + auto shape = src.sizes(); + auto strides = src.strides().vec(); + for (int i=0; ihandle = src; + atDLMTensor->handle = view; atDLMTensor->tensor.manager_ctx = atDLMTensor; atDLMTensor->tensor.deleter = &deleter; - atDLMTensor->tensor.dl_tensor.data = src.data_ptr(); + atDLMTensor->tensor.dl_tensor.data = view.data_ptr(); int64_t device_id = 0; if (src.is_cuda()) { device_id = src.get_device(); @@ -229,10 +240,10 @@ DLManagedTensor* toDLPack(const Tensor& src) { atDLMTensor->tensor.dl_tensor.dtype = getDLDataType(src); atDLMTensor->tensor.dl_tensor.shape = // NOLINTNEXTLINE(cppcoreguidelines-pro-type-const-cast) - const_cast(src.sizes().data()); + const_cast(view.sizes().data()); atDLMTensor->tensor.dl_tensor.strides = // NOLINTNEXTLINE(cppcoreguidelines-pro-type-const-cast) - const_cast(src.strides().data()); + const_cast(view.strides().data()); atDLMTensor->tensor.dl_tensor.byte_offset = 0; return &(atDLMTensor->tensor); } diff --git a/aten/src/ATen/Dispatch.h b/aten/src/ATen/Dispatch.h index 08d41126a1619..3ae552f96d17c 100644 --- a/aten/src/ATen/Dispatch.h +++ b/aten/src/ATen/Dispatch.h @@ -8,6 +8,10 @@ #include #include +#ifdef __CUDACC__ +#include // For CUDA_VERSION +#endif + #ifdef TEMPLATE_SELECTIVE_BUILD #include #else @@ -72,10 +76,20 @@ TORCH_API void record_kernel_function_dtype(std::string name); }) #endif +// Workaround for C10_UNUSED because CUDA 10.2 and below fails to handle unused +// 
attribute in the type aliasing context. Keep name long and verbose to avoid +// macro collisions. +#if defined(__CUDACC__) && CUDA_VERSION < 11000 +#define C10_UNUSED_DISPATCH_CUDA_WORKAROUND +#else +#define C10_UNUSED_DISPATCH_CUDA_WORKAROUND C10_UNUSED +#endif + #define AT_PRIVATE_CASE_TYPE_USING_HINT(enum_type, HINT, ...) \ case enum_type: { \ AT_PRIVATE_CHECK_SELECTIVE_BUILD(enum_type); \ - using HINT = c10::impl::ScalarTypeToCPPTypeT; \ + using HINT C10_UNUSED_DISPATCH_CUDA_WORKAROUND = \ + c10::impl::ScalarTypeToCPPTypeT; \ return __VA_ARGS__(); \ } diff --git a/aten/src/ATen/EmptyTensor.cpp b/aten/src/ATen/EmptyTensor.cpp index caf2a4e653c86..ff91aa0bd14d6 100644 --- a/aten/src/ATen/EmptyTensor.cpp +++ b/aten/src/ATen/EmptyTensor.cpp @@ -62,6 +62,14 @@ size_t computeStorageNbytes( size_t itemsize_bytes, size_t storage_offset ) { + TORCH_CHECK( + sizes.size() == strides.size(), + "dimensionality of sizes (", + sizes.size(), + ") must match dimensionality of strides (", + strides.size(), + ")"); + // Ignore overflow checks on mobile #ifndef C10_MOBILE // size of the underlying storage is 1 bigger than the offset diff --git a/aten/src/ATen/ExpandUtils.h b/aten/src/ATen/ExpandUtils.h index 7a81076a7dd01..a54853b259e73 100644 --- a/aten/src/ATen/ExpandUtils.h +++ b/aten/src/ATen/ExpandUtils.h @@ -446,7 +446,7 @@ static inline Tensor sum_to( } auto sizes = tensor.sym_sizes(); - c10::SmallVector reduce_dims; + c10::SmallVector reduce_dims; const int64_t leading_dims = sizes.size() - shape.size(); for (const auto i : c10::irange(leading_dims)) { reduce_dims.push_back(i); @@ -458,7 +458,7 @@ static inline Tensor sum_to( } if (!reduce_dims.empty()) { - tensor = tensor.sum_symint(reduce_dims, /*keepdim=*/true); + tensor = tensor.sum(reduce_dims, /*keepdim=*/true); } if (always_return_non_view) { diff --git a/aten/src/ATen/FunctionalStorageImpl.cpp b/aten/src/ATen/FunctionalStorageImpl.cpp index 2fad6bfad6064..7f136759ef6af 100644 --- a/aten/src/ATen/FunctionalStorageImpl.cpp +++ b/aten/src/ATen/FunctionalStorageImpl.cpp @@ -2,6 +2,7 @@ #include #include +#include #include #include @@ -94,9 +95,8 @@ FunctionalStorageImpl::FunctionalStorageImpl(const Tensor& value) c10::StorageImpl::use_byte_size_t(), value.numel() * value.dtype().itemsize(), DataPtr{nullptr, value.device()}, - // Using a null allocator, since FunctionalTensorImpl's aren't resizeable. 
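A minimal standalone sketch of the stride normalization that the DLConvertor.cpp hunk above applies before exporting to DLPack, assuming the gh-83069 behavior of forcing the stride to 1 on any dimension of extent 0 or 1; `normalize_strides` is a hypothetical helper for illustration, not the code in the patch:

```
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: force the stride to 1 on any dimension of extent 0 or 1
// before handing the tensor to DLPack, so consumers never see an arbitrary
// stride on a dimension they cannot step over anyway.
std::vector<std::int64_t> normalize_strides(
    const std::vector<std::int64_t>& shape,
    std::vector<std::int64_t> strides) {
  for (std::size_t i = 0; i < shape.size(); ++i) {
    if (shape[i] < 2) {
      strides[i] = 1;
    }
  }
  return strides;
}

int main() {
  // A [1, 3] view can legitimately carry stride 0 (or anything) on dim 0.
  auto s = normalize_strides({1, 3}, {0, 1});
  assert(s[0] == 1 && s[1] == 1);
  return 0;
}
```

Exporting a canonical stride for such dimensions is why the hunk builds a separate `view` and points the DLPack shape/stride/data fields at it rather than at `src` directly.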
- nullptr, - /*resizeable=*/false + GetAllocator(kMeta), + /*resizeable=*/true ), alias_(Alias(value)) {} diff --git a/aten/src/ATen/NestedTensorImpl.cpp b/aten/src/ATen/NestedTensorImpl.cpp index 077e9e742fc77..1e98d5ad7a957 100644 --- a/aten/src/ATen/NestedTensorImpl.cpp +++ b/aten/src/ATen/NestedTensorImpl.cpp @@ -4,7 +4,26 @@ #include #include #include +#include +#include +namespace { +inline void validate_nested_tensor_metadata( + const at::Tensor& nested_sizes, + const at::Tensor& nested_strides, + const std::vector& offsets) { + TORCH_INTERNAL_ASSERT(nested_sizes.is_contiguous()); + int64_t size_dim = nested_sizes.dim(); + TORCH_INTERNAL_ASSERT(size_dim == 0 || size_dim == 2); + TORCH_INTERNAL_ASSERT(nested_strides.is_contiguous()); + TORCH_INTERNAL_ASSERT(nested_strides.dim() == size_dim); + TORCH_INTERNAL_ASSERT(nested_sizes.sizes() == nested_strides.sizes()); + TORCH_INTERNAL_ASSERT( + (size_dim == 0 && (int64_t)offsets.empty()) || + (size_dim == 2 && nested_sizes.size(0) == (int64_t)offsets.size())); +} + +} // namespace namespace at { namespace native { @@ -99,10 +118,7 @@ inline std::vector construct_offsets(const at::Tensor& sizes) { // correct Autograd key which is AutogradNestedTensor c10::DispatchKeySet generate_nested_key_set(at::Tensor buffer) { c10::DispatchKeySet key_set = - (c10::DispatchKeySet(DispatchKey::NestedTensor) | - c10::DispatchKeySet( - buffer.is_cuda() ? BackendComponent::CUDABit - : BackendComponent::CPUBit)); + c10::DispatchKeySet(DispatchKey::NestedTensor) | c10::DispatchKeySet{buffer.key_set().highestBackendKey()}; // Add AutogradNestedTensor specific keys key_set = key_set | inplace_or_view_ks | autograd_nested; @@ -110,36 +126,50 @@ c10::DispatchKeySet generate_nested_key_set(at::Tensor buffer) { } NestedTensorImpl::NestedTensorImpl( - at::Tensor buffer, + Storage storage, + c10::DispatchKeySet key_set, + const caffe2::TypeMeta data_type, at::Tensor nested_size_tensor, at::Tensor nested_stride_tensor, - const std::vector& offsets) - : TensorImpl( - generate_nested_key_set(buffer), - buffer.dtype(), - buffer.device()), - buffer_(std::move(buffer)), + std::vector&& offsets) + : TensorImpl(std::move(storage), key_set, data_type), nested_size_tensor_(std::move(nested_size_tensor)), nested_stride_tensor_(std::move(nested_stride_tensor)), - offsets_(offsets), - opt_sizes_(construct_opt_sizes(nested_size_tensor_)) -{ + offsets_(std::move(offsets)), + opt_sizes_(construct_opt_sizes(nested_size_tensor_)) { TORCH_WARN_ONCE( "The PyTorch API of nested tensors is in prototype stage and will change " "in the near future."); - TORCH_INTERNAL_ASSERT(buffer_.is_cuda() || buffer_.is_cpu(), "NestedTensorImpl buffer must be either CUDA or CPU but got ", buffer_); - TORCH_INTERNAL_ASSERT(nested_size_tensor_.is_contiguous()); - int64_t size_dim = nested_size_tensor_.dim(); - TORCH_INTERNAL_ASSERT(size_dim == 0 || size_dim == 2); - TORCH_INTERNAL_ASSERT(nested_stride_tensor_.is_contiguous()); - TORCH_INTERNAL_ASSERT(nested_stride_tensor_.dim() == size_dim); - TORCH_INTERNAL_ASSERT(nested_stride_tensor_.sizes() == nested_size_tensor_.sizes()); - TORCH_INTERNAL_ASSERT((size_dim == 0 && (int64_t)offsets_.empty()) - || (size_dim == 2 && nested_size_tensor_.size(0) == (int64_t)offsets_.size())); + auto storage_device = storage_.device(); + TORCH_INTERNAL_ASSERT( + storage_device.is_cpu() || storage_device.is_cuda(), + "NestedTensorImpl storage must be either CUDA or CPU but got ", + storage_device); + validate_nested_tensor_metadata(nested_size_tensor_, nested_stride_tensor_, 
offsets_); refresh_dim(); set_sizes_strides_policy(c10::TensorImpl::SizesStridesPolicy::CustomSizes); } +NestedTensorImpl::NestedTensorImpl( + at::Tensor buffer, + at::Tensor nested_size_tensor, + at::Tensor nested_stride_tensor, + std::vector&& offsets) + : NestedTensorImpl( + buffer.storage(), + generate_nested_key_set(buffer), + buffer.dtype(), + nested_size_tensor, + nested_stride_tensor, + std::move(offsets)) { + + TORCH_INTERNAL_ASSERT( + buffer.dim() == 1, + "NestedTensorImpl buffer is required to be 1 dimensional but got a buffer with ", + buffer.dim(), + " dimensions."); +} + // assume contiguous, `nested_stride_tensor` and `offsets` // can be infered from `nested_size_tensor` NestedTensorImpl::NestedTensorImpl( @@ -152,6 +182,23 @@ NestedTensorImpl::NestedTensorImpl( construct_offsets(nested_size_tensor)) {} +NestedTensorImpl::NestedTensorImpl( + c10::TensorImpl::ImplType impl_type, + const at::Tensor& base_tensor, + at::Tensor nested_size_tensor, + at::Tensor nested_stride_tensor, + std::vector&& offsets) + : TensorImpl(impl_type, Storage(base_tensor.storage()), base_tensor.key_set(), base_tensor.dtype()), + nested_size_tensor_(std::move(nested_size_tensor)), + nested_stride_tensor_(std::move(nested_stride_tensor)), + offsets_(std::move(offsets)), + opt_sizes_(construct_opt_sizes(nested_size_tensor_)) { + TORCH_INTERNAL_ASSERT(base_tensor.is_nested()); + validate_nested_tensor_metadata(nested_size_tensor_, nested_stride_tensor_, offsets_); + refresh_dim(); + set_sizes_strides_policy(c10::TensorImpl::SizesStridesPolicy::CustomSizes); +} + void NestedTensorImpl::refresh_dim() { const auto my_dim = nested_size_tensor_.dim() ? nested_size_tensor_.sizes()[1] + 1 : 1; sizes_and_strides_.resize(my_dim); @@ -187,8 +234,13 @@ int64_t NestedTensorImpl::numel_custom() const { return static_cast(num_elements); } + +c10::SymInt NestedTensorImpl::sym_numel_custom() const { + return NestedTensorImpl::numel_custom(); +} + bool NestedTensorImpl::is_contiguous_custom(MemoryFormat) const { - TORCH_CHECK(false, "is_contiguous is disabled."); + return nested_tensor_impl_is_contiguous(this); } IntArrayRef NestedTensorImpl::sizes_custom() const { TORCH_CHECK(false, "Internal error: NestedTensorImpl doesn't support sizes. Please file an issue on https://github.com/pytorch/nestedtensor"); @@ -200,6 +252,9 @@ c10::SymIntArrayRef NestedTensorImpl::sym_sizes_custom() const { c10::SymIntArrayRef NestedTensorImpl::sym_sizes() const { return sym_sizes_custom(); } +c10::SymIntArrayRef NestedTensorImpl::sym_strides_custom() const { + TORCH_CHECK(false, "Internal error: NestedTensorImpl doesn't support strides. Please file an issue on https://github.com/pytorch/nestedtensor"); +} IntArrayRef NestedTensorImpl::strides_custom() const { TORCH_CHECK(false, "Internal error: NestedTensorImpl doesn't support strides. 
Please file an issue on https://github.com/pytorch/nestedtensor"); @@ -209,5 +264,51 @@ const char* NestedTensorImpl::tensorimpl_type_name() const { return "NestedTensorImpl"; } + +template +c10::intrusive_ptr NestedTensorImpl::shallow_copy_and_detach_core( + VariableVersion&& version_counter, + bool allow_tensor_metadata_change) const { + if (key_set_.has(DispatchKey::Python) && + !c10::impl::tls_is_dispatch_key_excluded(DispatchKey::Python)) { + auto r = pyobj_interpreter_.load(std::memory_order_acquire)->detach(this); + if (r) { + r->set_version_counter(std::forward(version_counter)); + r->set_allow_tensor_metadata_change(allow_tensor_metadata_change); + return r; + } + // otherwise just copy the TensorImpl and not the PyObject. Since + // the interpreter is dead no one can call us out on it + } + auto impl = c10::make_intrusive( + storage_, + key_set_, + data_type_, + nested_size_tensor_, + nested_stride_tensor_, + std::vector(offsets_)); + + copy_tensor_metadata( + /*src_impl=*/this, + /*dest_impl=*/impl.get(), + /*version_counter=*/std::forward(version_counter), + /*allow_tensor_metadata_change=*/allow_tensor_metadata_change); + return impl; +} + +c10::intrusive_ptr NestedTensorImpl::shallow_copy_and_detach( + const c10::VariableVersion& version_counter, + bool allow_tensor_metadata_change) const { + return shallow_copy_and_detach_core( + version_counter, allow_tensor_metadata_change); +} + +c10::intrusive_ptr NestedTensorImpl::shallow_copy_and_detach( + c10::VariableVersion&& version_counter, + bool allow_tensor_metadata_change) const { + return shallow_copy_and_detach_core( + std::move(version_counter), allow_tensor_metadata_change); +} + } // namespace native } // namespace at diff --git a/aten/src/ATen/NestedTensorImpl.h b/aten/src/ATen/NestedTensorImpl.h index 47f6c1516b9d5..f1fb8273c2902 100644 --- a/aten/src/ATen/NestedTensorImpl.h +++ b/aten/src/ATen/NestedTensorImpl.h @@ -1,8 +1,11 @@ #pragma once #include #include +#include +#include #include #include +#include #include #include #include @@ -11,15 +14,31 @@ namespace at { namespace native { struct TORCH_API NestedTensorImpl : public c10::TensorImpl { + explicit NestedTensorImpl( + Storage storage, + c10::DispatchKeySet key_set, + const caffe2::TypeMeta data_type, + at::Tensor nested_size_tensor, + at::Tensor nested_stride_tensor, + std::vector&& offsets); + explicit NestedTensorImpl( at::Tensor buffer, at::Tensor nested_size_tensor, at::Tensor nested_stride_tensor, - const std::vector& offsets); + std::vector&& offsets); // assume contiguous, `nested_stride_tensor` and `offsets` // can be infered from `nested_size_tensor` explicit NestedTensorImpl(at::Tensor buffer, at::Tensor nested_size_tensor); + // This constructor is used creating view tensors from nested tensors + explicit NestedTensorImpl( + c10::TensorImpl::ImplType impl_type, + const at::Tensor& base_tensor, + at::Tensor nested_size_tensor, + at::Tensor nested_stride_tensor, + std::vector&& offsets); + // TODO: don't expose private implementation details like this; in // particular, resizing this tensor will mess up our dim() and // callers cannot fix it. @@ -53,9 +72,25 @@ struct TORCH_API NestedTensorImpl : public c10::TensorImpl { " is irregular and does not have a size."); return *optional_size; } + /** + * Return a view of the nested tensor as a 1 dimensional contiguous tensor. + * + * The buffer tensor created by this function shares the same storage_impl as + * the original nested tensor, and therefore can be seen as a view. 
+ * + * @return A newly constructed view tensor + */ + at::Tensor get_buffer() const { + auto buffer_key_set_ = generate_buffer_key_set(); + const auto buffer_size = get_buffer_size(); + auto buffer_tensor_impl = c10::make_intrusive( + c10::TensorImpl::VIEW, Storage(storage_), buffer_key_set_, data_type_); + buffer_tensor_impl->set_sizes_contiguous(c10::makeArrayRef(buffer_size)); + return Tensor(buffer_tensor_impl); + } - const at::Tensor& get_buffer() const { - return buffer_; + int64_t get_buffer_size() const { + return storage_.nbytes() / data_type_.itemsize(); } protected: @@ -64,6 +99,7 @@ struct TORCH_API NestedTensorImpl : public c10::TensorImpl { // TODO: numel_custom and is_contiguous_custom can be profitably overridden // with real implementations int64_t numel_custom() const override; + c10::SymInt sym_numel_custom() const override; bool is_contiguous_custom(MemoryFormat) const override; int64_t size_custom(int64_t d) const override { return this->size(d); @@ -75,16 +111,32 @@ struct TORCH_API NestedTensorImpl : public c10::TensorImpl { c10::SymIntArrayRef sym_sizes_custom() const override; c10::SymIntArrayRef sym_sizes() const override; IntArrayRef strides_custom() const override; + c10::SymIntArrayRef sym_strides_custom() const override; // this one is real int64_t dim_custom() const override; + c10::intrusive_ptr shallow_copy_and_detach( + const c10::VariableVersion& version_counter, + bool allow_tensor_metadata_change) const override; + + c10::intrusive_ptr shallow_copy_and_detach( + c10::VariableVersion&& version_counter, + bool allow_tensor_metadata_change) const override; + + void shallow_copy_from(const c10::intrusive_ptr& impl) override { + copy_tensor_metadata( + /*src_impl=*/impl.get(), + /*dest_impl=*/this, + /*version_counter=*/version_counter(), + /*allow_tensor_metadata_change=*/allow_tensor_metadata_change()); + } + private: // Must be called after any changes to our dim() to sync the state // to TensorImpl. void refresh_dim(); - at::Tensor buffer_; const at::Tensor nested_size_tensor_, nested_stride_tensor_; // The starting positions of the underlying tensors in contiguous buffer // i.e. the buffer memory offsets to get the underlying tensors @@ -103,6 +155,38 @@ struct TORCH_API NestedTensorImpl : public c10::TensorImpl { // TODO: maybe we can remove this metadata since // we can compute it from `nested_size_tensor_` std::vector opt_sizes_; + + template + c10::intrusive_ptr shallow_copy_and_detach_core( + VariableVersion&& version_counter, + bool allow_tensor_metadata_change) const; + + /** + * Generates a non-nested key_set from a nested tensor. + * + * For many nested tensor kernel implementations a buffer tensor + * is generated and redispatched to a non-nested kernel this function + * generates the key set used by that buffer tensor + * + * @return A newly constructed view tensor + */ + inline c10::DispatchKeySet generate_buffer_key_set() const { + auto buffer_key_set = this->key_set(); + const bool Autograd = buffer_key_set.has_any(c10::autograd_dispatch_keyset); + // Remove nested tensor specific keys + buffer_key_set = buffer_key_set - + c10::DispatchKeySet{ + c10::DispatchKey::NestedTensor, + c10::DispatchKey::AutogradNestedTensor}; + + // Add dense tensor specific keys + buffer_key_set = + buffer_key_set | c10::DispatchKeySet{c10::DispatchKey::Dense}; + buffer_key_set = Autograd + ? 
c10::DispatchKeySet{c10::DispatchKey::Autograd} | buffer_key_set + : buffer_key_set; + return buffer_key_set; + } }; inline NestedTensorImpl* get_nested_tensor_impl_or_null( diff --git a/aten/src/ATen/Parallel.h b/aten/src/ATen/Parallel.h index 6c99fcd422cb6..4693997624e98 100644 --- a/aten/src/ATen/Parallel.h +++ b/aten/src/ATen/Parallel.h @@ -2,6 +2,7 @@ #include #include #include +#include namespace at { diff --git a/aten/src/ATen/SparseCsrTensorImpl.cpp b/aten/src/ATen/SparseCsrTensorImpl.cpp index dab45065fa71e..69fc013211f96 100644 --- a/aten/src/ATen/SparseCsrTensorImpl.cpp +++ b/aten/src/ATen/SparseCsrTensorImpl.cpp @@ -160,6 +160,9 @@ void SparseCsrTensorImpl::set_member_tensors( IntArrayRef SparseCsrTensorImpl::strides_custom() const { TORCH_CHECK(false, "Sparse ", at::sparse_csr::layoutToString(layout_, /*upper=*/true), " tensors do not have strides"); } +SymIntArrayRef SparseCsrTensorImpl::sym_strides_custom() const { + TORCH_CHECK(false, "Sparse ", at::sparse_csr::layoutToString(layout_, /*upper=*/true), " tensors do not have strides"); +} void SparseCsrTensorImpl::set_size(int64_t dim, int64_t new_size) { TORCH_CHECK(false, "Sparse ", at::sparse_csr::layoutToString(layout_, /*upper=*/true), " tensors do not have set_size."); } diff --git a/aten/src/ATen/SparseCsrTensorImpl.h b/aten/src/ATen/SparseCsrTensorImpl.h index 878c465962b86..1f84fb422fde9 100644 --- a/aten/src/ATen/SparseCsrTensorImpl.h +++ b/aten/src/ATen/SparseCsrTensorImpl.h @@ -76,6 +76,7 @@ struct TORCH_API SparseCsrTensorImpl : public TensorImpl { protected: IntArrayRef strides_custom() const override; + SymIntArrayRef sym_strides_custom() const override; public: void set_size(int64_t dim, int64_t new_size) override; diff --git a/aten/src/ATen/TensorIterator.h b/aten/src/ATen/TensorIterator.h index fdf86cbba6afe..59f52d9dbd2ed 100644 --- a/aten/src/ATen/TensorIterator.h +++ b/aten/src/ATen/TensorIterator.h @@ -473,6 +473,10 @@ struct TORCH_API TensorIteratorBase : public impl::MetaBase { } bool has_contiguous_first_dim() const { + if (ndim() == 0) { + return true; + } + int num_tensors = ntensors(); for (const auto i : c10::irange(num_tensors)) { if (strides(i)[0] != element_size(i)) { diff --git a/aten/src/ATen/TensorMeta.h b/aten/src/ATen/TensorMeta.h index 97124611ca13d..07631c3552fd2 100644 --- a/aten/src/ATen/TensorMeta.h +++ b/aten/src/ATen/TensorMeta.h @@ -71,6 +71,7 @@ namespace impl { struct TORCH_API MetaBase { virtual const Tensor& maybe_get_output(int64_t output_idx) = 0; + // Note: [set_output_*] // See: https://github.com/pytorch/pytorch/issues/69813 // Whenever defining the output properties in the META function of a // structured kernel (what was usually done with `set_output`), use one of diff --git a/aten/src/ATen/TensorSubclassLikeUtils.h b/aten/src/ATen/TensorSubclassLikeUtils.h index 5c01ce9790407..b533b49a9ca4d 100644 --- a/aten/src/ATen/TensorSubclassLikeUtils.h +++ b/aten/src/ATen/TensorSubclassLikeUtils.h @@ -1,5 +1,6 @@ #pragma once #include +#include namespace at { @@ -39,16 +40,22 @@ constexpr auto kTensorSubclassLike = DispatchKeySet(BackendComponent::MetaBit); inline bool isTensorSubclassLike(const Tensor& tensor) { + if (c10::impl::dispatch_mode_enabled()) + return true; auto key_set = tensor.unsafeGetTensorImpl()->key_set(); return !(key_set & kTensorSubclassLike).empty(); } inline bool areAnyTensorSubclassLike(TensorList tensors) { + if (c10::impl::dispatch_mode_enabled()) + return true; return std::any_of(tensors.begin(), tensors.end(), isTensorSubclassLike); } inline bool 
areAnyOptionalTensorSubclassLike( const c10::List>& tensors) { + if (c10::impl::dispatch_mode_enabled()) + return true; return std::any_of( tensors.begin(), tensors.end(), [](const optional& opt_tensor) { return ( @@ -56,4 +63,16 @@ inline bool areAnyOptionalTensorSubclassLike( }); } +// Helper function to deal testing truthfulness of a scalar tensor +// in a Composite Compliant manner. +// NOTE: This function expects a scalar tensor of boolean dtype. +// Eg. +// Non-Composite Compliant Pattern : (t == 0).all().item() +// Composite Compliant Patter : is_salar_tensor_true((t == 0).all()) +inline bool is_scalar_tensor_true(const Tensor& t) { + TORCH_INTERNAL_ASSERT(t.dim() == 0) + TORCH_INTERNAL_ASSERT(t.scalar_type() == kBool) + return at::equal(t, t.new_ones({}, t.options())); +} + } // namespace at diff --git a/aten/src/ATen/ThreadLocalState.cpp b/aten/src/ATen/ThreadLocalState.cpp index 8315ddad97b20..fb589beaba894 100644 --- a/aten/src/ATen/ThreadLocalState.cpp +++ b/aten/src/ATen/ThreadLocalState.cpp @@ -19,7 +19,7 @@ ThreadLocalState::ThreadLocalState() saved_tensors_default_hooks_ = at::SavedTensorDefaultHooks::get_stack(); - torch_dispatch_mode_state_ = at::impl::TorchDispatchModeTLS::get_state(); + torch_dispatch_mode_state_ = c10::impl::TorchDispatchModeTLS::get_state(); } void ThreadLocalState::set_grad_mode(bool enabled) { @@ -33,7 +33,7 @@ void ThreadLocalState::setThreadLocalState( // restore the dispatch key set TLS at the same time. c10::AutogradState::set_tls_state(state.autograd_tls_); - at::impl::TorchDispatchModeTLS::set_state(state.torch_dispatch_mode_state_); + c10::impl::TorchDispatchModeTLS::set_state(state.torch_dispatch_mode_state_); at::impl::PythonTorchFunctionTLS::set_state(state.python_torch_function_state_); diff --git a/aten/src/ATen/ThreadLocalState.h b/aten/src/ATen/ThreadLocalState.h index a21ee6a674f3c..a0067fb8aaebe 100644 --- a/aten/src/ATen/ThreadLocalState.h +++ b/aten/src/ATen/ThreadLocalState.h @@ -9,8 +9,8 @@ #include #include -#include #include +#include namespace at { diff --git a/aten/src/ATen/Utils.h b/aten/src/ATen/Utils.h index bbc235182f1e2..61c9c58fa437a 100644 --- a/aten/src/ATen/Utils.h +++ b/aten/src/ATen/Utils.h @@ -26,59 +26,6 @@ namespace at { TORCH_API int _crash_if_asan(int); -// TODO: This unwrapping code is ONLY used for TH bindings; once TH goes -// away, we can delete this function -static inline TensorImpl* checked_dense_tensor_unwrap( - const Tensor& expr, - const char* name, - int pos, - const char* api, - bool allowNull, - DeviceType device_type, - ScalarType scalar_type) { - if (allowNull && !expr.defined()) { - return nullptr; - } - if (expr.layout() != Layout::Strided) { - AT_ERROR( - "Expected dense tensor but got ", - expr.layout(), - " for argument #", - pos, - " '", - name, - "' in call to ", - api); - } - if (expr.device().type() != device_type) { - AT_ERROR( - "Expected object of device type ", - device_type, - " but got device type ", - expr.device().type(), - " for argument #", - pos, - " '", - name, - "' in call to ", - api); - } - if (expr.scalar_type() != scalar_type) { - AT_ERROR( - "Expected object of scalar type ", - scalar_type, - " but got scalar type ", - expr.scalar_type(), - " for argument #", - pos, - " '", - name, - "' in call to ", - api); - } - return expr.unsafeGetTensorImpl(); -} - // Converts a TensorList (i.e. ArrayRef to vector of TensorImpl*) // NB: This is ONLY used by legacy TH bindings, and ONLY used by cat. // Once cat is ported entirely to ATen this can be deleted! 
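A small libtorch program showing the composite-compliant pattern that `is_scalar_tensor_true` above encodes: stay inside dispatched tensor ops rather than pulling a host scalar out with `.item()`. This assumes a local libtorch build, and `scalar_tensor_true` is an illustrative stand-in, not the helper itself:

```
#include <torch/torch.h>
#include <iostream>

bool scalar_tensor_true(const torch::Tensor& t) {
  TORCH_CHECK(t.dim() == 0 && t.scalar_type() == torch::kBool);
  // Comparing against a freshly created scalar ones tensor keeps the check
  // inside the tensor API, so tensor subclasses and modes can observe it.
  return torch::equal(t, torch::ones({}, t.options()));
}

int main() {
  auto t = torch::zeros({4});
  std::cout << std::boolalpha << scalar_tensor_true((t == 0).all()) << "\n";  // true
  return 0;
}
```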
diff --git a/aten/src/ATen/autocast_mode.cpp b/aten/src/ATen/autocast_mode.cpp index da0a87b02d1d0..396b9746754cf 100644 --- a/aten/src/ATen/autocast_mode.cpp +++ b/aten/src/ATen/autocast_mode.cpp @@ -499,7 +499,6 @@ TORCH_LIBRARY_IMPL(aten, AutocastCPU, m) { KERNEL_CPU(ADD_NS(addbmm), "addbmm", Tensor (const Tensor &, const Tensor &, const Tensor &, const Scalar&, const Scalar&), lower_precision_fp) KERNEL_CPU(ADD_NS(linear), "linear", Tensor (const Tensor &, const Tensor &, const c10::optional &), lower_precision_fp) KERNEL_CPU(ADD_NS(_convolution), "_convolution.deprecated", Tensor (const Tensor &, const Tensor &, const c10::optional&, IntArrayRef, IntArrayRef, IntArrayRef, bool, IntArrayRef, int64_t, bool, bool, bool), lower_precision_fp) - KERNEL_CPU(ADD_NS(_convolution), "_convolution", Tensor (const Tensor &, const Tensor &, const c10::optional&, IntArrayRef, IntArrayRef, IntArrayRef, bool, IntArrayRef, int64_t, bool, bool, bool, bool), lower_precision_fp) KERNEL_CPU(ADD_NS(matmul), "matmul", Tensor (const Tensor &, const Tensor &), lower_precision_fp) KERNEL_CPU(ADD_NS(conv_tbc), "conv_tbc", Tensor(const Tensor &, const Tensor &, const Tensor &, int64_t), lower_precision_fp) @@ -545,10 +544,24 @@ TORCH_LIBRARY_IMPL(aten, AutocastCPU, m) { KERNEL_CPU(ADD_NS(replication_pad2d), "replication_pad2d", Tensor(const Tensor &, IntArrayRef), fp32) KERNEL_CPU(ADD_NS(replication_pad3d), "replication_pad3d", Tensor(const Tensor &, IntArrayRef), fp32) KERNEL_CPU(ADD_NS(mse_loss), "mse_loss", Tensor(const Tensor &, const Tensor &, int64_t), fp32) + KERNEL_CPU(ADD_NS(cosine_embedding_loss), "cosine_embedding_loss", Tensor (const Tensor &, const Tensor &, const Tensor &, double, int64_t), fp32) + KERNEL_CPU(ADD_NS(nll_loss), "nll_loss", Tensor (const Tensor &, const Tensor &, const c10::optional&, int64_t, int64_t), fp32) + KERNEL_CPU(ADD_NS(nll_loss2d), "nll_loss2d", Tensor (const Tensor &, const Tensor &, const c10::optional&, int64_t, int64_t), fp32) + KERNEL_CPU(ADD_NS(hinge_embedding_loss), "hinge_embedding_loss", Tensor (const Tensor &, const Tensor &, double, int64_t), fp32) + KERNEL_CPU(ADD_NS(poisson_nll_loss), "poisson_nll_loss", Tensor (const Tensor &, const Tensor &, bool, bool, double, int64_t), fp32) + KERNEL_CPU(ADD_NS(smooth_l1_loss), "smooth_l1_loss", Tensor (const Tensor &, const Tensor &, int64_t, double), fp32) + KERNEL_CPU(ADD_NS(cross_entropy_loss), "cross_entropy_loss", Tensor(const Tensor &, const Tensor &, const c10::optional &, int64_t, int64_t, double), fp32) + KERNEL_CPU(ADD_NS(l1_loss), "l1_loss", Tensor (const Tensor &, const Tensor &, int64_t), fp32) + KERNEL_CPU(ADD_NS(huber_loss), "huber_loss", Tensor (const Tensor &, const Tensor &, int64_t, double), fp32) + KERNEL_CPU(ADD_NS(margin_ranking_loss), "margin_ranking_loss", Tensor (const Tensor &, const Tensor &, const Tensor &, double, int64_t), fp32) + KERNEL_CPU(ADD_NS(soft_margin_loss), "soft_margin_loss", Tensor (const Tensor &, const Tensor &, int64_t), fp32) + KERNEL_CPU(ADD_NS(triplet_margin_loss), "triplet_margin_loss", Tensor (const Tensor &, const Tensor &, const Tensor &, double, double, double, bool, int64_t), fp32) + KERNEL_CPU(ADD_NS(multi_margin_loss), "multi_margin_loss", Tensor (const Tensor &, const Tensor &, const Scalar&, const Scalar&, const c10::optional&, int64_t), fp32) KERNEL_CPU(ADD_NS(ctc_loss), "ctc_loss.IntList", Tensor(const Tensor &, const Tensor &, IntArrayRef, IntArrayRef, int64_t, int64_t, bool), fp32) KERNEL_CPU(ADD_NS(ctc_loss), "ctc_loss.Tensor", Tensor(const Tensor &, const 
Tensor &, const Tensor &, const Tensor &, int64_t, int64_t, bool), fp32) KERNEL_CPU(ADD_NS(kl_div), "kl_div", Tensor(const Tensor &, const Tensor &, int64_t, bool), fp32) KERNEL_CPU(ADD_NS(multilabel_margin_loss), "multilabel_margin_loss", Tensor(const Tensor &, const Tensor &, int64_t), fp32) + KERNEL_CPU(ADD_NS(binary_cross_entropy_with_logits), "binary_cross_entropy_with_logits", Tensor (const Tensor &, const Tensor &, const c10::optional&, const c10::optional&, int64_t), fp32) KERNEL_CPU(ADD_NS(fft_fft), "fft_fft", Tensor(const Tensor &, c10::optional, int64_t, c10::optional), fp32) KERNEL_CPU(ADD_NS(fft_ifft), "fft_ifft", Tensor(const Tensor &, c10::optional, int64_t, c10::optional), fp32) KERNEL_CPU(ADD_NS(fft_fft2), "fft_fft2", Tensor(const Tensor &, at::OptionalIntArrayRef, at::IntArrayRef, c10::optional), fp32) diff --git a/aten/src/ATen/core/List.h b/aten/src/ATen/core/List.h index 0785a6941affd..fe75bf37cb7fa 100644 --- a/aten/src/ATen/core/List.h +++ b/aten/src/ATen/core/List.h @@ -69,7 +69,11 @@ struct ListElementConstReferenceTraits> { template class ListElementReference final { public: - operator T() const; + operator std::conditional_t< + std::is_reference::type>::value, + const T&, + T>() const; ListElementReference& operator=(T&& new_value) &&; @@ -106,9 +110,14 @@ class ListElementReference final { // this wraps vector::iterator to make sure user code can't rely // on it being the type of the underlying vector. -template -class ListIterator final : public std::iterator { -public: +template +class ListIterator final : public std::iterator< + std::random_access_iterator_tag, + T, + std::ptrdiff_t, + T*, + ListElementReference> { + public: explicit ListIterator() = default; ~ListIterator() = default; diff --git a/aten/src/ATen/core/List_inl.h b/aten/src/ATen/core/List_inl.h index 14b68b7360623..e8acb89bf3cb5 100644 --- a/aten/src/ATen/core/List_inl.h +++ b/aten/src/ATen/core/List_inl.h @@ -118,9 +118,13 @@ namespace detail { namespace impl { -template -ListElementReference::operator T() const { - return c10::detail::list_element_to(*iterator_); +template +ListElementReference::operator std::conditional_t< + std::is_reference::type>::value, + const T&, + T>() const { + return iterator_->template to(); } template diff --git a/aten/src/ATen/core/NamedRegistrations.cpp b/aten/src/ATen/core/NamedRegistrations.cpp index a9ae2f12c4dd0..bb675939b27c6 100644 --- a/aten/src/ATen/core/NamedRegistrations.cpp +++ b/aten/src/ATen/core/NamedRegistrations.cpp @@ -467,7 +467,6 @@ TORCH_LIBRARY_IMPL(aten, Named, m) { m.impl("sum.IntList_out", CppFunction::makeFallthrough()); m.impl("sum.dim_DimnameList", CppFunction::makeFallthrough()); m.impl("sum.dim_IntList", CppFunction::makeFallthrough()); - m.impl("sum.SymInt", CppFunction::makeFallthrough()); m.impl("t", CppFunction::makeFallthrough()); m.impl("tan", CppFunction::makeFallthrough()); m.impl("tan.out", CppFunction::makeFallthrough()); diff --git a/aten/src/ATen/core/PhiloxRNGEngine.h b/aten/src/ATen/core/PhiloxRNGEngine.h index d075d7dd6fbff..a702de8998d93 100644 --- a/aten/src/ATen/core/PhiloxRNGEngine.h +++ b/aten/src/ATen/core/PhiloxRNGEngine.h @@ -71,66 +71,67 @@ class philox_engine { C10_HOST_DEVICE inline explicit philox_engine(uint64_t seed = 67280421310721, uint64_t subsequence = 0, uint64_t offset = 0) { - key[0] = static_cast(seed); - key[1] = static_cast(seed >> 32); - counter = detail::UINT4(0); - counter[2] = static_cast(subsequence); - counter[3] = static_cast(subsequence >> 32); - STATE = 0; + + reset_state(seed, 
subsequence); incr_n(offset); } + C10_HOST_DEVICE inline void reset_state(uint64_t seed = 67280421310721, + uint64_t subsequence = 0) { + key_[0] = static_cast(seed); + key_[1] = static_cast(seed >> 32); + counter_ = detail::UINT4(0); + counter_[2] = static_cast(subsequence); + counter_[3] = static_cast(subsequence >> 32); + STATE = 0; + } + /** - * Produces a unique 32-bit pseudo random number on every invocation + * Produces a unique 32-bit pseudo random number on every invocation. Bookeeps state to avoid waste. */ - C10_HOST_DEVICE inline uint32_t operator()() { + C10_HOST_DEVICE inline uint32_t operator()(int32_t n_rounds = 10) { // 10 here to preserve back-compat behavior if(STATE == 0) { - detail::UINT4 counter_ = counter; - detail::UINT2 key_ = key; - - counter_ = single_round(counter_, key_); - key_[0] += (kPhilox10A); key_[1] += (kPhilox10B); - counter_ = single_round(counter_, key_); - key_[0] += (kPhilox10A); key_[1] += (kPhilox10B); - counter_ = single_round(counter_, key_); - key_[0] += (kPhilox10A); key_[1] += (kPhilox10B); - counter_ = single_round(counter_, key_); - key_[0] += (kPhilox10A); key_[1] += (kPhilox10B); - counter_ = single_round(counter_, key_); - key_[0] += (kPhilox10A); key_[1] += (kPhilox10B); - counter_ = single_round(counter_, key_); - key_[0] += (kPhilox10A); key_[1] += (kPhilox10B); - counter_ = single_round(counter_, key_); - key_[0] += (kPhilox10A); key_[1] += (kPhilox10B); - counter_ = single_round(counter_, key_); - key_[0] += (kPhilox10A); key_[1] += (kPhilox10B); - counter_ = single_round(counter_, key_); - key_[0] += (kPhilox10A); key_[1] += (kPhilox10B); - - output = single_round(counter_, key_); + detail::UINT4 counter = counter_; + detail::UINT2 key = key_; + output_ = rand(counter, key, n_rounds); incr(); } - uint32_t ret = output[STATE]; + uint32_t ret = output_[STATE]; STATE = (STATE + 1) & 3; return ret; } + inline float randn(uint32_t n_rounds) { + #ifdef __CUDA_ARCH__ + AT_ASSERT(false, "Unsupported invocation of randn on CUDA"); + #endif + reset_state(); // Reset state for randn - a little wasteful, but easier to ensure correctness. + detail::UINT4 counter = counter_; + detail::UINT2 key = key_; + detail::UINT4 i = rand(counter, key, n_rounds); + detail::FLOAT2 prenorm; + prenorm[0] = 1 - uint32_to_uniform_float(i[0]); // uint32_to_uniform_float returns [0,1), we need (0,1] to avoid passing 0 to log. 
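As a standalone sanity check of the mapping relied on here (same constant as the `uint32_to_uniform_float` helper in this hunk; the function is re-declared below only for illustration): 31 random bits are scaled into [0, 1), and the caller flips the value to (0, 1] before taking a log.

```
#include <cassert>
#include <cmath>
#include <cstdint>

// Chosen so that INT32_MAX * kScale stays below 1.0f after float rounding.
constexpr float kScale = 4.6566127342e-10f;

float uint32_to_uniform_float(std::uint32_t value) {
  return static_cast<float>(value & 0x7FFFFFFF) * kScale;
}

int main() {
  float lo = uint32_to_uniform_float(0);            // 0.0f, allowed in [0, 1)
  float hi = uint32_to_uniform_float(0xFFFFFFFFu);  // just below 1.0f
  assert(lo == 0.0f && hi < 1.0f);
  float shifted = 1.0f - lo;                        // in (0, 1]: safe input for log
  assert(shifted > 0.0f && std::isfinite(std::log(shifted)));
  return 0;
}
```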
+ prenorm[1] = 1 - uint32_to_uniform_float(i[1]); + detail::FLOAT2 ret = normalize_pair_uniform(prenorm); + return ret[0]; + } + /** * Function that Skips N 128 bit numbers in a subsequence */ C10_HOST_DEVICE inline void incr_n(uint64_t n) { uint32_t nlo = static_cast(n); uint32_t nhi = static_cast(n >> 32); - counter[0] += nlo; + counter_[0] += nlo; // if overflow in x has occurred, carry over to nhi - if (counter[0] < nlo) { + if (counter_[0] < nlo) { nhi++; // if overflow in nhi has occurred during carry over, // propagate that overflow to y and exit to increment z // otherwise return - counter[1] += nhi; + counter_[1] += nhi; if(nhi != 0) { - if (nhi <= counter[1]) { + if (nhi <= counter_[1]) { return; } } @@ -138,34 +139,34 @@ class philox_engine { // if overflow in y has occurred during addition, // exit to increment z // otherwise return - counter[1] += nhi; - if (nhi <= counter[1]) { + counter_[1] += nhi; + if (nhi <= counter_[1]) { return; } } - if (++counter[2]) + if (++counter_[2]) return; - ++counter[3]; + ++counter_[3]; } /** * Function that Skips one 128 bit number in a subsequence */ C10_HOST_DEVICE inline void incr() { - if (++counter[0]) + if (++counter_[0]) return; - if (++counter[1]) + if (++counter_[1]) return; - if (++counter[2]) { + if (++counter_[2]) { return; } - ++counter[3]; + ++counter_[3]; } private: - detail::UINT4 counter; - detail::UINT4 output; - detail::UINT2 key; + detail::UINT4 counter_; + detail::UINT4 output_; + detail::UINT2 key_; uint32_t STATE; C10_HOST_DEVICE inline uint32_t mulhilo32(uint32_t a, uint32_t b, @@ -192,12 +193,46 @@ class philox_engine { ret[3] = lo0; return ret; } + + C10_HOST_DEVICE constexpr float uint32_to_uniform_float(uint32_t value) { + // maximum value such that `MAX_INT * scale < 1.0` (with float rounding) + constexpr float scale = 4.6566127342e-10; + return static_cast(value & 0x7FFFFFFF) * scale; + } + + + + C10_HOST_DEVICE inline detail::UINT4 rand(detail::UINT4& counter, detail::UINT2& key, uint32_t n_rounds) { + for (uint32_t round = 0; round < (n_rounds - 1); round++) { + counter = single_round(counter, key); + key[0] += (kPhilox10A); key[1] += (kPhilox10B); + } + return single_round(counter, key); + } + + inline detail::FLOAT2 normalize_pair_uniform(detail::FLOAT2 in) { + // TODO(voz) We use std:: below, and thus need a separate impl for CUDA. 
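For reference, a standalone sketch of the textbook Box-Muller transform that `normalize_pair_uniform` in this hunk is patterned after; this is the standard form, in which both uniforms feed the transform, and it is not a copy of the helper itself:

```
#include <cassert>
#include <cmath>
#include <utility>

// Standard Box-Muller: two uniforms in (0, 1] map to two independent
// standard-normal samples.
std::pair<float, float> box_muller(float u1, float u2) {
  const float two_pi = 2.0f * static_cast<float>(M_PI);
  const float mag = std::sqrt(-2.0f * std::log(u1));
  return {mag * std::cos(two_pi * u2), mag * std::sin(two_pi * u2)};
}

int main() {
  auto [z0, z1] = box_muller(0.5f, 0.25f);
  assert(std::isfinite(z0) && std::isfinite(z1));
  return 0;
}
```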
+ float u1 = in[0]; + float u2 = in[1]; + + constexpr float two_pi = 2.0 * M_PI; + + float mag = std::sqrt(-2.0 * std::log(u1)); + + detail::FLOAT2 ret; + + ret[0] = mag * std::cos(two_pi); + ret[1] = mag * std::sin(two_pi); + return ret; + } + + static const uint32_t kPhilox10A = 0x9E3779B9; static const uint32_t kPhilox10B = 0xBB67AE85; static const uint32_t kPhiloxSA = 0xD2511F53; static const uint32_t kPhiloxSB = 0xCD9E8D57; }; -typedef philox_engine Philox4_32_10; +typedef philox_engine Philox4_32; } // namespace at diff --git a/aten/src/ATen/core/PythonFallbackKernel.cpp b/aten/src/ATen/core/PythonFallbackKernel.cpp index 37b46ae15a3c0..210aa6fa568fe 100644 --- a/aten/src/ATen/core/PythonFallbackKernel.cpp +++ b/aten/src/ATen/core/PythonFallbackKernel.cpp @@ -1,4 +1,4 @@ -#include +#include #include #include @@ -51,7 +51,7 @@ void pythonFallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) { // If Torch Dispatch Mode is active, use its PyInterpreter for dispatch - const auto& maybe_torch_dispatch_mode_state = at::impl::TorchDispatchModeTLS::get_state(); + const auto& maybe_torch_dispatch_mode_state = c10::impl::TorchDispatchModeTLS::get_state(); if (maybe_torch_dispatch_mode_state) { maybe_torch_dispatch_mode_state->pyinterpreter()->dispatch(op, stack); return; diff --git a/aten/src/ATen/core/PythonFallbackKernel.h b/aten/src/ATen/core/PythonFallbackKernel.h index 94cd4e81291a3..f38bdd2ada90a 100644 --- a/aten/src/ATen/core/PythonFallbackKernel.h +++ b/aten/src/ATen/core/PythonFallbackKernel.h @@ -1,5 +1,5 @@ #pragma once - +#include namespace at { namespace impl { diff --git a/aten/src/ATen/core/TensorBase.h b/aten/src/ATen/core/TensorBase.h index ca9c8b5f245a0..e6dd73658efc7 100644 --- a/aten/src/ATen/core/TensorBase.h +++ b/aten/src/ATen/core/TensorBase.h @@ -48,6 +48,7 @@ inline bool variable_excluded_from_dispatch() { return c10::impl::tls_local_dispatch_key_set().excluded_.isSupersetOf(c10::autograd_dispatch_keyset); #endif } + } // NOTE: [Tensor vs. TensorBase] @@ -161,6 +162,14 @@ class TORCH_API TensorBase { return impl_->sym_size(dim); } + c10::SymInt sym_stride(int64_t dim) const { + const auto sizes = this->sym_strides(); + const auto ndim = static_cast(sizes.size()); + // false is passed to maybe_wrap_dim so behavior is identical to array access (but with wrapping) + return sizes[c10::maybe_wrap_dim(dim, ndim, /*wrap_scalar=*/false)]; + + } + int64_t size(int64_t dim) const { return impl_->size(dim); } @@ -225,6 +234,9 @@ class TORCH_API TensorBase { c10::SymIntArrayRef sym_sizes() const { return impl_->sym_sizes(); } + c10::SymIntArrayRef sym_strides() const { + return impl_->sym_strides(); + } IntArrayRef strides() const { return impl_->strides(); } @@ -286,6 +298,10 @@ class TORCH_API TensorBase { return impl_->numel(); } + c10::SymInt sym_numel() const { + return impl_->sym_numel(); + } + // Length of one array element in bytes. This is the traditional // Numpy naming. 
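A small sketch of the negative-index wrapping that the new `sym_stride` accessor above leans on via `c10::maybe_wrap_dim`; the `wrap_dim` helper here is illustrative, not the c10 utility:

```
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// dim -1 maps to ndim - 1, and anything outside [-ndim, ndim) is rejected.
std::int64_t wrap_dim(std::int64_t dim, std::int64_t ndim) {
  if (dim < -ndim || dim >= ndim) {
    throw std::out_of_range("dimension out of range");
  }
  return dim < 0 ? dim + ndim : dim;
}

int main() {
  std::vector<std::int64_t> strides{12, 4, 1};  // e.g. a contiguous [2, 3, 4] tensor
  assert(strides[wrap_dim(-1, 3)] == 1);        // last dimension
  assert(strides[wrap_dim(0, 3)] == 12);        // first dimension
  return 0;
}
```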
size_t itemsize() const { diff --git a/aten/src/ATen/core/TorchDispatchModeTLS.cpp b/aten/src/ATen/core/TorchDispatchModeTLS.cpp deleted file mode 100644 index d224b08d5b54b..0000000000000 --- a/aten/src/ATen/core/TorchDispatchModeTLS.cpp +++ /dev/null @@ -1,58 +0,0 @@ -#include -#include -#include - -namespace at { namespace impl { - -thread_local std::shared_ptr torchDispatchModeState; - -void TorchDispatchModeTLS::set_state(std::shared_ptr state) { - if (state) { - c10::impl::tls_set_dispatch_key_included(DispatchKey::Python, true); - c10::impl::tls_set_dispatch_key_included(DispatchKey::PythonTLSSnapshot, true); - } else { - TorchDispatchModeTLS::reset_state(); - } - torchDispatchModeState = std::move(state); -} - -const std::shared_ptr& TorchDispatchModeTLS::get_state() { - return torchDispatchModeState; -} - -void TorchDispatchModeTLS::reset_state() { - torchDispatchModeState.reset(); - c10::impl::tls_set_dispatch_key_included(DispatchKey::Python, false); - c10::impl::tls_set_dispatch_key_included(DispatchKey::PythonTLSSnapshot, false); -} - -bool dispatch_mode_enabled() { - return static_cast(at::impl::TorchDispatchModeTLS::get_state()); -} - -bool tensor_has_dispatch(const at::Tensor& t) { - DispatchKeySet key_set({DispatchKey::Python, DispatchKey::PythonTLSSnapshot}); - return t.key_set().has_any(key_set); -} - -bool tensorlist_has_dispatch(const at::TensorList& li) { - for (const auto& t: li) { - if (tensor_has_dispatch(t)) { - return true; - } - } - return false; -} - -bool tensorlist_has_dispatch(const c10::List>& li) { - for (auto i : c10::irange(li.size())) { - auto t = li.get(i); - if (t && tensor_has_dispatch(*t)) { - return true; - } - } - return false; -} - -} // namespace impl -} // namespace at diff --git a/aten/src/ATen/core/TorchDispatchUtils.cpp b/aten/src/ATen/core/TorchDispatchUtils.cpp new file mode 100644 index 0000000000000..323019b3bbbb3 --- /dev/null +++ b/aten/src/ATen/core/TorchDispatchUtils.cpp @@ -0,0 +1,31 @@ +#include + +namespace at { +namespace impl { + +bool tensor_has_dispatch(const at::Tensor& t) { + DispatchKeySet key_set({DispatchKey::Python, DispatchKey::PythonTLSSnapshot}); + return t.key_set().has_any(key_set); +} + +bool tensorlist_has_dispatch(const at::TensorList& li) { + for (const auto& t: li) { + if (tensor_has_dispatch(t)) { + return true; + } + } + return false; +} + +bool tensorlist_has_dispatch(const c10::List>& li) { + for (auto i : c10::irange(li.size())) { + auto t = li.get(i); + if (t && tensor_has_dispatch(*t)) { + return true; + } + } + return false; +} + +} // namespace impl +} // namespace at diff --git a/aten/src/ATen/core/TorchDispatchModeTLS.h b/aten/src/ATen/core/TorchDispatchUtils.h similarity index 55% rename from aten/src/ATen/core/TorchDispatchModeTLS.h rename to aten/src/ATen/core/TorchDispatchUtils.h index 9ae015e6582f7..08c009c81b478 100644 --- a/aten/src/ATen/core/TorchDispatchModeTLS.h +++ b/aten/src/ATen/core/TorchDispatchUtils.h @@ -1,25 +1,17 @@ #pragma once -#include #include #include #include #include +#include namespace at { namespace impl { -struct TORCH_API TorchDispatchModeTLS { - static void set_state(std::shared_ptr state); - static const std::shared_ptr& get_state(); - static void reset_state(); -}; - -bool dispatch_mode_enabled(); bool tensor_has_dispatch(const at::Tensor& t); bool tensorlist_has_dispatch(const at::TensorList& li); bool tensorlist_has_dispatch(const c10::List>& li); +using c10::impl::dispatch_mode_enabled; - -} // namespace impl -} // namespace at +}} diff --git 
a/aten/src/ATen/core/dispatch/DispatchKeyExtractor.h b/aten/src/ATen/core/dispatch/DispatchKeyExtractor.h index d5345b28e7149..76082c5b01a4b 100644 --- a/aten/src/ATen/core/dispatch/DispatchKeyExtractor.h +++ b/aten/src/ATen/core/dispatch/DispatchKeyExtractor.h @@ -114,6 +114,9 @@ namespace detail { * they have been registered as fallthrough. The set of excluded backends * varies from operator, as some operators may have overridden the * fallthrough with custom behavior. + * + * Note - this should maintain identical impl to the py dispatcher key extraction logic + * at pytorch/torch/dispatcher.py */ struct TORCH_API DispatchKeyExtractor final { public: @@ -142,7 +145,7 @@ struct TORCH_API DispatchKeyExtractor final { // no safe toTensorRef method, alas) ks = ks | ivalue.unsafeToTensorImpl()->key_set(); } else if (C10_UNLIKELY(ivalue.isTensorList())) { - for (const at::Tensor tensor : ivalue.toTensorList()) { + for (const at::Tensor& tensor : ivalue.toTensorList()) { ks = ks | tensor.key_set(); } } diff --git a/aten/src/ATen/core/dispatch/Dispatcher.h b/aten/src/ATen/core/dispatch/Dispatcher.h index 83d1738da423b..bc40bc5b62e0c 100644 --- a/aten/src/ATen/core/dispatch/Dispatcher.h +++ b/aten/src/ATen/core/dispatch/Dispatcher.h @@ -162,6 +162,7 @@ class TORCH_API Dispatcher final { // Invoke an operator via the boxed calling convention using an IValue stack void callBoxed(const OperatorHandle& op, Stack* stack) const; + void callBoxedForDispatchKey(const OperatorHandle& op, DispatchKey dk, Stack* stack) const; // TODO: This will only be useful if we write a backend fallback that plumbs dispatch keys (currently there are none) // See Note [Plumbing Keys Through The Dispatcher] @@ -332,6 +333,9 @@ class TORCH_API OperatorHandle { return operatorDef_->op.hasKernelForDispatchKey(k); } + bool hasComputedKernelForDispatchKey(DispatchKey k) const { + return operatorDef_->op.hasComputedKernelForDispatchKey(k); + } std::string dumpComputedTable() const { return operatorDef_->op.dumpComputedTable(); @@ -376,6 +380,10 @@ class TORCH_API OperatorHandle { callBoxed(&stack); } + void callBoxedForDispatchKey(DispatchKey dk, Stack& stack) const { + c10::Dispatcher::singleton().callBoxedForDispatchKey(*this, dk, &stack); + } + void redispatchBoxed(DispatchKeySet ks, Stack* stack) const { c10::Dispatcher::singleton().redispatchBoxed(*this, ks, stack); } @@ -620,6 +628,18 @@ inline void Dispatcher::callBoxed(const OperatorHandle& op, Stack* stack) const kernel.callBoxed(op, dispatchKeySet, stack); } +// NB: this doesn't count as a "true" dispatcher jump, so no instrumentation +inline void Dispatcher::callBoxedForDispatchKey(const OperatorHandle& op, DispatchKey dk, Stack* stack) const { + // note: this doesn't need the mutex because write operations on the list keep iterators intact. + const auto& entry = op.operatorDef_->op; + // We still compute this as we're obligated to pass it on to the internal + // kernel, if it is a boxed fallback + auto dispatchKeySet = entry.dispatchKeyExtractor().getDispatchKeySetBoxed(stack); + const auto& kernel = entry.kernelForDispatchKey(dk); + kernel.callBoxed(op, dispatchKeySet, stack); +} + + inline void Dispatcher::redispatchBoxed(const OperatorHandle& op, DispatchKeySet dispatchKeySet, Stack* stack) const { // note: this doesn't need the mutex because write operations on the list keep iterators intact. 
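A toy model of the key extraction described in the DispatchKeyExtractor comment above, with made-up key names and a plain bitmask standing in for `DispatchKeySet`:

```
#include <cassert>
#include <cstdint>
#include <vector>

// Each tensor argument contributes a set of dispatch keys; the sets are OR-ed
// together, and keys registered as fallthrough for this operator are masked
// out before the highest-priority remaining key is chosen.
using KeySet = std::uint64_t;

KeySet extract(const std::vector<KeySet>& tensor_args, KeySet fallthrough_for_op) {
  KeySet ks = 0;
  for (KeySet arg : tensor_args) {
    ks |= arg;                      // union across all tensor arguments
  }
  return ks & ~fallthrough_for_op;  // drop keys this operator falls through
}

int main() {
  const KeySet kCPU = 1u << 0, kAutograd = 1u << 1, kPython = 1u << 2;
  // Two plain CPU tensors plus one tensor carrying a Python-mode key.
  KeySet ks = extract({kCPU, kCPU | kAutograd, kCPU | kPython}, /*fallthrough_for_op=*/kPython);
  assert(ks == (kCPU | kAutograd));
  return 0;
}
```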
const auto& entry = op.operatorDef_->op; diff --git a/aten/src/ATen/core/dispatch/OperatorEntry.cpp b/aten/src/ATen/core/dispatch/OperatorEntry.cpp index afcf552fdecda..5c1c42bb62260 100644 --- a/aten/src/ATen/core/dispatch/OperatorEntry.cpp +++ b/aten/src/ATen/core/dispatch/OperatorEntry.cpp @@ -198,10 +198,24 @@ bool OperatorEntry::hasKernelForAnyDispatchKey(DispatchKeySet ks) const { bool OperatorEntry::hasKernelForDispatchKey(DispatchKey k) const { TORCH_INTERNAL_ASSERT(kernels_.find(DispatchKey::Undefined) == kernels_.end()); - for (auto& kv : kernels_) { - if (k == kv.first) return true; - } - return false; + auto it = kernels_.find(k); + if (it == kernels_.end()) return false; + return it->second.size() > 0; +} + +const KernelFunction& OperatorEntry::kernelForDispatchKey(DispatchKey k) const { + auto it = kernels_.find(k); + TORCH_CHECK(it != kernels_.end() && it->second.size(), "no kernel for ", k, " on ", name_); + auto jt = it->second.begin(); + TORCH_INTERNAL_ASSERT(jt->kernel.isValid()) + return jt->kernel; +} + +bool OperatorEntry::hasComputedKernelForDispatchKey(DispatchKey k) const { + TORCH_CHECK(!isAliasDispatchKey(k), "Alias keys do not have runtime kernel registrations."); + const auto dispatch_ix = getDispatchTableIndexForDispatchKey(k); + TORCH_INTERNAL_ASSERT(dispatch_ix >= 0 && dispatch_ix < c10::num_runtime_entries, toString(k), dispatch_ix); + return dispatchTable_[dispatch_ix].isValid(); } const AnnotatedKernel* OperatorEntry::getKernelForDispatchKey(DispatchKey dispatch_key) const{ diff --git a/aten/src/ATen/core/dispatch/OperatorEntry.h b/aten/src/ATen/core/dispatch/OperatorEntry.h index 834f7b32f3947..1d9f1495f3c74 100644 --- a/aten/src/ATen/core/dispatch/OperatorEntry.h +++ b/aten/src/ATen/core/dispatch/OperatorEntry.h @@ -206,6 +206,12 @@ class TORCH_API OperatorEntry final { bool hasKernelForAnyDispatchKey(DispatchKeySet ks) const; // Returns true if kernel_ has entry for a particular key. bool hasKernelForDispatchKey(DispatchKey k) const; + // Retrieves the kernel entry at a particular key. Symmetric with + // hasKernelForDispatchKey. To get the AnnotatedKernel, see + // getKernelForDispatchKey (private) + const KernelFunction& kernelForDispatchKey(DispatchKey k) const; + // Returns true if the "computed table" has an entry for a particular key. + bool hasComputedKernelForDispatchKey(DispatchKey k) const; // Returns all the operator tags added at the time of registration const std::vector& getTags() const; diff --git a/aten/src/ATen/core/function_schema.h b/aten/src/ATen/core/function_schema.h index 77fdb20f6516a..16083820a1d81 100644 --- a/aten/src/ATen/core/function_schema.h +++ b/aten/src/ATen/core/function_schema.h @@ -550,15 +550,24 @@ inline std::ostream& operator<<(std::ostream& out, const Argument& arg) { bool is_opt = type->kind() == OptionalType::Kind; auto unopt_type = is_opt ? type->castRaw()->getElementType() : type; - if (unopt_type->kind() == ListType::Kind && arg.N()) { + if (unopt_type->kind() == ListType::Kind) { // sized lists get size N from arg, not type auto list = unopt_type->cast(); - out << list->getElementType()->str() << "[" << *arg.N() << "]"; + out << list->getElementType()->str(); + if (arg.alias_info() && !arg.alias_info()->containedTypes().empty()){ + out << arg.alias_info()->containedTypes()[0]; + } + std::string N = ""; + if (arg.N()) { + N = std::to_string(*arg.N()); + } + out << "[" << N << "]"; } else { out << unopt_type->str(); } - if (arg.alias_info()) { + // print alias info if it has beforeSets. 
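A condensed sketch of the sized-list printing rule introduced in the `function_schema.h` hunk above, with the argument machinery reduced to an element-type string and an optional static length; alias annotations are omitted:

```
#include <cassert>
#include <optional>
#include <sstream>
#include <string>

// A sized list prints its element type plus "[N]" when a static length is
// attached to the argument, and "[]" when it is not.
std::string print_list_type(const std::string& elem, std::optional<int> N) {
  std::ostringstream out;
  out << elem << "[" << (N ? std::to_string(*N) : "") << "]";
  return out.str();
}

int main() {
  assert(print_list_type("int", 3) == "int[3]");
  assert(print_list_type("int", std::nullopt) == "int[]");
  return 0;
}
```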
+ if (arg.alias_info() && !arg.alias_info()->beforeSets().empty()) { out << *arg.alias_info(); } diff --git a/aten/src/ATen/core/interned_strings.cpp b/aten/src/ATen/core/interned_strings.cpp index 0ad87c21c837f..ff9361f462a1a 100644 --- a/aten/src/ATen/core/interned_strings.cpp +++ b/aten/src/ATen/core/interned_strings.cpp @@ -141,6 +141,7 @@ bool Symbol::is_aten() const { return ns() == namespaces::aten; } bool Symbol::is_cuda() const { return ns() == namespaces::cuda; } bool Symbol::is_prim() const { return ns() == namespaces::prim; } bool Symbol::is_prims() const { return ns() == namespaces::prims; } +bool Symbol::is_nvprims() const { return ns() == namespaces::nvprims; } bool Symbol::is_onnx() const { return ns() == namespaces::onnx; } bool Symbol::is_user() const { return ns() == namespaces::user; } bool Symbol::is_caffe2() const { return ns() == namespaces::_caffe2; } diff --git a/aten/src/ATen/core/interned_strings.h b/aten/src/ATen/core/interned_strings.h index 8a195128b4d2c..0e16e8812a436 100644 --- a/aten/src/ATen/core/interned_strings.h +++ b/aten/src/ATen/core/interned_strings.h @@ -15,6 +15,7 @@ namespace c10 { #define FORALL_NS_SYMBOLS(_) \ _(namespaces, prim) \ _(namespaces, prims) \ + _(namespaces, nvprims) \ _(namespaces, aten) \ _(namespaces, cuda) \ _(namespaces, onnx) \ @@ -48,13 +49,16 @@ namespace c10 { _(prim, oneDNNFusionGuard) \ _(prim, FunctionalGraph) \ _(prim, add_optional) \ - _(prim, view_copy) \ + _(prim, expand_copy) \ + _(prim, expand_as_copy) \ + _(prim, flatten_copy) \ + _(prim, permute_copy) \ _(prim, reshape_copy) \ _(prim, squeeze_copy) \ + _(prim, t_copy) \ + _(prim, transpose_copy) \ _(prim, unsqueeze_copy) \ - _(prim, flatten_copy) \ - _(prim, expand_copy) \ - _(prim, expand_as_copy) \ + _(prim, view_copy) \ _(prim, DifferentiableGraph) \ _(prim, TensorExprGroup) \ _(prim, TensorExprDynamicGroup) \ @@ -221,6 +225,8 @@ namespace c10 { _(cuda, _current_device) \ _(cuda, synchronize) \ _(aten, has_torch_function) \ + _(aten, is_autocast_enabled) \ + _(aten, is_autocast_cpu_enabled) \ FORALL_ATEN_BASE_SYMBOLS(_) \ _(onnx, Add) \ _(onnx, Concat) \ diff --git a/aten/src/ATen/core/ivalue_inl.h b/aten/src/ATen/core/ivalue_inl.h index 301b448b834eb..00361c80a01cf 100644 --- a/aten/src/ATen/core/ivalue_inl.h +++ b/aten/src/ATen/core/ivalue_inl.h @@ -506,6 +506,7 @@ struct TORCH_API TupleElements { TORCH_CHECK(idx < inlineSize_, "TupleElements: invalid index Index = ", idx, "; Length = ", inlineSize_); return elementsInline_[idx]; } else { + TORCH_CHECK(idx < elementsVector_.size(), "TupleElements: invalid index Index = ", idx, "; Length = ", elementsVector_.size()); return elementsVector_.at(idx); } } diff --git a/aten/src/ATen/core/symbol.h b/aten/src/ATen/core/symbol.h index c06c261c3dd3c..04d480b51e317 100644 --- a/aten/src/ATen/core/symbol.h +++ b/aten/src/ATen/core/symbol.h @@ -82,6 +82,7 @@ struct TORCH_API Symbol { bool is_cuda() const; bool is_prim() const; bool is_prims() const; + bool is_nvprims() const; bool is_onnx() const; bool is_user() const; bool is_caffe2() const; diff --git a/aten/src/ATen/cpu/vec/vec256/vec256_qint.h b/aten/src/ATen/cpu/vec/vec256/vec256_qint.h index f92e1bd22811c..0ee43b53e6358 100644 --- a/aten/src/ATen/cpu/vec/vec256/vec256_qint.h +++ b/aten/src/ATen/cpu/vec/vec256/vec256_qint.h @@ -257,6 +257,19 @@ struct Vectorized : public Vectorizedqi { return Vectorized(ptr); } + static Vectorized loadu(const void* ptr, int64_t count) { + __at_align__ value_type tmp_values[size()]; + // Ensure uninitialized memory does not change 
the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. + for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy( + tmp_values, reinterpret_cast(ptr), count * sizeof(value_type)); + return _mm256_loadu_si256((const __m256i*)tmp_values); + } + float_vec_return_type dequantize( Vectorized scale, Vectorized /*zero_point*/, @@ -417,10 +430,12 @@ struct Vectorized : public Vectorizedqi { // This is needed because the compiler emits awful code for the default // constructor for moving the enum // NOLINTNEXTLINE(clang-diagnostic-deprecated-copy) - #pragma clang diagnostic push - #pragma clang diagnostic ignored "-Wdeprecated-copy" + C10_CLANG_DIAGNOSTIC_PUSH() + #if C10_CLANG_HAS_WARNING("-Wdeprecated-copy") + C10_CLANG_DIAGNOSTIC_IGNORE("-Wdeprecated-copy") + #endif Vectorized(const Vectorized& other) : Vectorizedqi(other.vals) { } - #pragma clang diagnostic pop + C10_CLANG_DIAGNOSTIC_POP() void store(void* ptr, int count = size()) const { if (count != size()) { @@ -434,6 +449,19 @@ struct Vectorized : public Vectorizedqi { return Vectorized(ptr); } + static Vectorized loadu(const void* ptr, int64_t count) { + __at_align__ value_type tmp_values[size()]; + // Ensure uninitialized memory does not change the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. + for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy( + tmp_values, reinterpret_cast(ptr), count * sizeof(value_type)); + return _mm256_loadu_si256((const __m256i*)tmp_values); + } + private: __m256i cvtepi8_epi32(__m128i epi8_vals) const { return _mm256_cvtepi8_epi32(epi8_vals); @@ -580,10 +608,12 @@ struct Vectorized : public Vectorizedqi { } // NOLINTNEXTLINE(clang-diagnostic-deprecated-copy) - #pragma clang diagnostic push - #pragma clang diagnostic ignored "-Wdeprecated-copy" + C10_CLANG_DIAGNOSTIC_PUSH() + #if C10_CLANG_HAS_WARNING("-Wdeprecated-copy") + C10_CLANG_DIAGNOSTIC_IGNORE("-Wdeprecated-copy") + #endif Vectorized(const Vectorized& other) : Vectorizedqi(other.vals) { } - #pragma clang diagnostic pop + C10_CLANG_DIAGNOSTIC_POP() void store(void* ptr, int count = size()) const { if (count != size()) { @@ -597,6 +627,19 @@ struct Vectorized : public Vectorizedqi { return Vectorized(ptr); } + static Vectorized loadu(const void* ptr, int64_t count) { + __at_align__ value_type tmp_values[size()]; + // Ensure uninitialized memory does not change the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. 
+ for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy( + tmp_values, reinterpret_cast(ptr), count * sizeof(value_type)); + return _mm256_loadu_si256((const __m256i*)tmp_values); + } + private: __m256i cvtepu8_epi32(__m128i epu8_vals) const { return _mm256_cvtepu8_epi32(epu8_vals); @@ -816,6 +859,19 @@ struct Vectorized : public VectorizedQuantizedConverter< return Vectorized(ptr); } + static Vectorized loadu(const void* ptr, int64_t count) { + __at_align__ value_type tmp_values[size()]; + // Ensure uninitialized memory does not change the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. + for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy( + tmp_values, reinterpret_cast(ptr), count * sizeof(value_type)); + return Vectorized(tmp_values); + } + static Vectorized quantize( const float_vec_return_type& rhs, float scale, @@ -948,6 +1004,19 @@ struct Vectorized : public VectorizedQuantizedConverter< return Vectorized(ptr); } + static Vectorized loadu(const void* ptr, int64_t count) { + __at_align__ value_type tmp_values[size()]; + // Ensure uninitialized memory does not change the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. + for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy( + tmp_values, reinterpret_cast(ptr), count * sizeof(value_type)); + return Vectorized(tmp_values); + } + static Vectorized quantize( const float_vec_return_type& rhs, float scale, @@ -1068,6 +1137,19 @@ struct Vectorized : public VectorizedQuantizedConverter< return Vectorized(ptr); } + static Vectorized loadu(const void* ptr, int64_t count) { + __at_align__ value_type tmp_values[size()]; + // Ensure uninitialized memory does not change the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. 
+ for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy( + tmp_values, reinterpret_cast(ptr), count * sizeof(value_type)); + return Vectorized(tmp_values); + } + static Vectorized quantize( const float_vec_return_type& rhs, float scale, diff --git a/aten/src/ATen/cpu/vec/vec256/vsx/vec256_float_vsx.h b/aten/src/ATen/cpu/vec/vec256/vsx/vec256_float_vsx.h index 0267c40e1ea45..77cf3695ab912 100644 --- a/aten/src/ATen/cpu/vec/vec256/vsx/vec256_float_vsx.h +++ b/aten/src/ATen/cpu/vec/vec256/vsx/vec256_float_vsx.h @@ -551,27 +551,7 @@ class Vectorized { } Vectorized C10_ALWAYS_INLINE pow(const Vectorized& exp) const { - auto x = *this; - auto sign_bit = (*this) & sign_mask; - // |b| - auto exp_abs = exp.abs(); - auto exp_trunc = exp.trunc(); - Vectorized odd_mask; - odd_mask._vecb0 = (vec_signed(exp._vec0) & vi_1) != vi_0; - odd_mask._vecb1 = (vec_signed(exp._vec1) & vi_1) != vi_0; - // using ln fuction - auto temp = (abs().log() * exp).exp(); - - // is odd or even check from Sleef - auto is_int = (exp == exp_trunc) | (exp_abs >= vcheck); - auto is_odd = odd_mask & is_int & (exp_abs < vcheck); - // if even then then pow result should be absolute - auto temp_sign = temp | sign_bit; // copy_sign - auto out = blendv(temp, temp_sign, is_odd); - // x<0 and y != N, then NAN - auto out1 = blendv(out, v_nan, ((exp.floor() != exp) & (x < zero))); - // y = 0 then 1 - return blendv(out1, one, (exp_abs == zero)); + return {Sleef_powf4_u10vsx(_vec0, exp._vec0), Sleef_powf4_u10vsx(_vec1, exp._vec1)}; } Vectorized fmod(const Vectorized& b) const { diff --git a/aten/src/ATen/cpu/vec/vec512/vec512_qint.h b/aten/src/ATen/cpu/vec/vec512/vec512_qint.h index 0f3474eaa2ade..87cf44283c0be 100644 --- a/aten/src/ATen/cpu/vec/vec512/vec512_qint.h +++ b/aten/src/ATen/cpu/vec/vec512/vec512_qint.h @@ -268,6 +268,18 @@ struct Vectorized : public Vectorizedqi { return Vectorized(ptr); } + static Vectorized loadu(const void* ptr, int64_t count) { + __at_align__ value_type tmp_values[size()]; + // Ensure uninitialized memory does not change the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. + for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy(tmp_values, reinterpret_cast(ptr), count * sizeof(value_type)); + return loadu(tmp_values); + } + float_vec_return_type dequantize( Vectorized scale, Vectorized zero_point, @@ -447,6 +459,18 @@ struct Vectorized : public Vectorizedqi { return Vectorized(ptr); } + static Vectorized loadu(const void* ptr, int64_t count) { + __at_align__ value_type tmp_values[size()]; + // Ensure uninitialized memory does not change the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. 
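The `loadu(ptr, count)` overloads added throughout these quantized `Vectorized` specializations all follow the same recipe: zero a full-width stack buffer with a loop, `memcpy` only `count` valid elements into it, then hand the buffer to the ordinary full-width load. A minimal stand-alone sketch of that pattern, using plain arrays instead of AVX registers and a hypothetical `partial_load` name:

```
#include <array>
#include <cstdint>
#include <cstring>

// Padded partial load: lanes past `count` are well-defined zeros rather than
// uninitialized memory, so full-width math on the result stays deterministic.
template <typename T, std::size_t N>
std::array<T, N> partial_load(const void* ptr, std::int64_t count) {
  std::array<T, N> tmp;
  // Zero with a loop, mirroring the codegen note in the comment above.
  for (std::size_t i = 0; i < N; ++i) {
    tmp[i] = T(0);
  }
  std::memcpy(tmp.data(), ptr, static_cast<std::size_t>(count) * sizeof(T));
  return tmp;  // stands in for the full-width vector load of the temp buffer
}

int main() {
  const std::int8_t src[3] = {1, 2, 3};
  const auto v = partial_load<std::int8_t, 32>(src, 3);  // lanes 3..31 are zero
  return (v[0] == 1 && v[31] == 0) ? 0 : 1;
}
```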
+ for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy(tmp_values, reinterpret_cast(ptr), count * sizeof(value_type)); + return loadu(tmp_values); + } + private: __m512i cvtepi8_epi32(__m128i epi8_vals) const { return _mm512_cvtepi8_epi32(epi8_vals); @@ -611,6 +635,18 @@ struct Vectorized : public Vectorizedqi { return Vectorized(ptr); } + static Vectorized loadu(const void* ptr, int64_t count) { + __at_align__ value_type tmp_values[size()]; + // Ensure uninitialized memory does not change the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. + for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy(tmp_values, reinterpret_cast(ptr), count * sizeof(value_type)); + return loadu(tmp_values); + } + private: __m512i cvtepu8_epi32(__m128i epu8_vals) const { return _mm512_cvtepu8_epi32(epu8_vals); @@ -833,6 +869,18 @@ struct Vectorized : public VectorizedQuantizedConverter< return Vectorized(ptr); } + static Vectorized loadu(const void* ptr, int64_t count) { + __at_align__ value_type tmp_values[size()]; + // Ensure uninitialized memory does not change the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. + for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy(tmp_values, reinterpret_cast(ptr), count * sizeof(value_type)); + return loadu(tmp_values); + } + static Vectorized quantize( const float_vec_return_type& rhs, float scale, @@ -965,6 +1013,18 @@ struct Vectorized : public VectorizedQuantizedConverter< return Vectorized(ptr); } + static Vectorized loadu(const void* ptr, int64_t count) { + __at_align__ value_type tmp_values[size()]; + // Ensure uninitialized memory does not change the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. + for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy(tmp_values, reinterpret_cast(ptr), count * sizeof(value_type)); + return loadu(tmp_values); + } + static Vectorized quantize( const float_vec_return_type& rhs, float scale, @@ -1085,6 +1145,18 @@ struct Vectorized : public VectorizedQuantizedConverter< return Vectorized(ptr); } + static Vectorized loadu(const void* ptr, int64_t count) { + __at_align__ value_type tmp_values[size()]; + // Ensure uninitialized memory does not change the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. 
+ for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy(tmp_values, reinterpret_cast(ptr), count * sizeof(value_type)); + return loadu(tmp_values); + } + static Vectorized quantize( const float_vec_return_type& rhs, float scale, diff --git a/aten/src/ATen/cpu/vec/vec_base.h b/aten/src/ATen/cpu/vec/vec_base.h index 3bf1010efd682..635ec8c82e5dc 100644 --- a/aten/src/ATen/cpu/vec/vec_base.h +++ b/aten/src/ATen/cpu/vec/vec_base.h @@ -33,6 +33,7 @@ #include #include #include +#include // These macros helped us unify vec_base.h #ifdef CPU_CAPABILITY_AVX512 @@ -975,7 +976,7 @@ inline void convert(const src_T *src, dst_T *dst, int64_t n) { #endif for (const auto i : c10::irange(n)) { (void)i; //Suppress unused variable warning - *dst = c10::static_cast_with_inter_type::apply(*src); + *dst = c10::convert(c10::load(src)); src++; dst++; } diff --git a/aten/src/ATen/cuda/CUDABlas.cpp b/aten/src/ATen/cuda/CUDABlas.cpp index e99017289d68b..866f53ee7f87f 100644 --- a/aten/src/ATen/cuda/CUDABlas.cpp +++ b/aten/src/ATen/cuda/CUDABlas.cpp @@ -1162,7 +1162,7 @@ void vdot>(CUDABLAS_DOT_ARGTYPES(c10::complex)) { reinterpret_cast(result))); } -// This guards blocks use of getrsBatched, geqrfBatched, getrfBatched, getriBatched on platforms other than cuda +// This guards blocks use of getrsBatched, geqrfBatched, getrfBatched on platforms other than cuda #ifdef CUDART_VERSION template <> @@ -1323,67 +1323,6 @@ void getrfBatched>( batchsize)); } -template <> -void getriBatched( - int n, double** dA_array, int ldda, int* ipiv_array, double** dC_array, int lddc, int* info_array, int batchsize) { - auto handle = at::cuda::getCurrentCUDABlasHandle(); - TORCH_CUDABLAS_CHECK(cublasDgetriBatched( - handle, n, dA_array, ldda, ipiv_array, dC_array, lddc, info_array, batchsize)); -} - -template <> -void getriBatched( - int n, float** dA_array, int ldda, int* ipiv_array, float** dC_array, int lddc, int* info_array, int batchsize) { - auto handle = at::cuda::getCurrentCUDABlasHandle(); - TORCH_CUDABLAS_CHECK(cublasSgetriBatched( - handle, n, dA_array, ldda, ipiv_array, dC_array, lddc, info_array, batchsize)); -} - -template <> -void getriBatched>( - int n, - c10::complex** dA_array, - int ldda, - int* ipiv_array, - c10::complex** dC_array, - int lddc, - int* info_array, - int batchsize) { - auto handle = at::cuda::getCurrentCUDABlasHandle(); - TORCH_CUDABLAS_CHECK(cublasZgetriBatched( - handle, - n, - reinterpret_cast(dA_array), - ldda, - ipiv_array, - reinterpret_cast(dC_array), - lddc, - info_array, - batchsize)); -} - -template <> -void getriBatched>( - int n, - c10::complex** dA_array, - int ldda, - int* ipiv_array, - c10::complex** dC_array, - int lddc, - int* info_array, - int batchsize) { - auto handle = at::cuda::getCurrentCUDABlasHandle(); - TORCH_CUDABLAS_CHECK(cublasCgetriBatched( - handle, - n, - reinterpret_cast(dA_array), - ldda, - ipiv_array, - reinterpret_cast(dC_array), - lddc, - info_array, - batchsize)); -} template <> void gelsBatched(CUDABLAS_GELS_BATCHED_ARGTYPES(double)) { diff --git a/aten/src/ATen/cuda/CUDABlas.h b/aten/src/ATen/cuda/CUDABlas.h index 10e589ecd6c9d..96c7fc8184228 100644 --- a/aten/src/ATen/cuda/CUDABlas.h +++ b/aten/src/ATen/cuda/CUDABlas.h @@ -227,7 +227,7 @@ void vdot>(CUDABLAS_DOT_ARGTYPES(c10::complex)); template <> void vdot>(CUDABLAS_DOT_ARGTYPES(c10::complex)); -// This guards blocks use of getrsBatched, geqrfBatched, getrfBatched, getriBatched on platforms other than cuda +// This guards blocks use of getrsBatched, geqrfBatched, getrfBatched on 
platforms other than cuda #ifdef CUDART_VERSION #define CUDABLAS_GETRS_ARGTYPES(Dtype) \ @@ -287,22 +287,6 @@ TORCH_CUDA_CU_API void getrfBatched>(CUDABLAS_GETRF_ARGTYPE template<> TORCH_CUDA_CU_API void getrfBatched>(CUDABLAS_GETRF_ARGTYPES(c10::complex)); -#define CUDABLAS_GETRI_ARGTYPES(Dtype) \ - int n, Dtype** dA_array, int ldda, int* ipiv_array, Dtype** dC_array, int lddc, int* info_array, int batchsize - -template -void getriBatched(CUDABLAS_GETRI_ARGTYPES(Dtype)) { - TORCH_CHECK(false, "at::cuda::blas::getriBatched: not implemented for ", typeid(Dtype).name()); -} -template<> -TORCH_CUDA_CU_API void getriBatched(CUDABLAS_GETRI_ARGTYPES(float)); -template<> -TORCH_CUDA_CU_API void getriBatched(CUDABLAS_GETRI_ARGTYPES(double)); -template<> -TORCH_CUDA_CU_API void getriBatched>(CUDABLAS_GETRI_ARGTYPES(c10::complex)); -template<> -TORCH_CUDA_CU_API void getriBatched>(CUDABLAS_GETRI_ARGTYPES(c10::complex)); - #define CUDABLAS_GELS_BATCHED_ARGTYPES(Dtype) \ cublasHandle_t handle, cublasOperation_t trans, int m, int n, int nrhs, Dtype** dA_array, int ldda, Dtype** dC_array, int lddc, int* info, int *devInfoArray, int batchSize diff --git a/aten/src/ATen/cuda/CUDAEvent.h b/aten/src/ATen/cuda/CUDAEvent.h index f07daeb979b9e..205fad8c11214 100644 --- a/aten/src/ATen/cuda/CUDAEvent.h +++ b/aten/src/ATen/cuda/CUDAEvent.h @@ -2,6 +2,7 @@ #include #include +#include #include #include #include @@ -45,6 +46,10 @@ struct TORCH_CUDA_CPP_API CUDAEvent { try { if (is_created_) { CUDAGuard guard(device_index_); + const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace(); + if (C10_UNLIKELY(interp)) { + interp->trace_gpu_event_deletion(reinterpret_cast(event_)); + } cudaEventDestroy(event_); } } catch (...) { /* No throw */ } @@ -113,6 +118,13 @@ struct TORCH_CUDA_CPP_API CUDAEvent { " does not match recording stream's device ", stream.device_index(), "."); CUDAGuard guard(device_index_); AT_CUDA_CHECK(cudaEventRecord(event_, stream)); + const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace(); + if (C10_UNLIKELY(interp)) { + interp->trace_gpu_event_record( + reinterpret_cast(event_), + reinterpret_cast(stream.stream()) + ); + } was_recorded_ = true; } @@ -122,6 +134,13 @@ struct TORCH_CUDA_CPP_API CUDAEvent { if (is_created_) { CUDAGuard guard(stream.device_index()); AT_CUDA_CHECK(cudaStreamWaitEvent(stream, event_, 0)); + const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace(); + if (C10_UNLIKELY(interp)) { + interp->trace_gpu_event_wait( + reinterpret_cast(event_), + reinterpret_cast(stream.stream()) + ); + } } } @@ -164,6 +183,10 @@ struct TORCH_CUDA_CPP_API CUDAEvent { device_index_ = device_index; CUDAGuard guard(device_index_); AT_CUDA_CHECK(cudaEventCreateWithFlags(&event_, flags_)); + const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace(); + if (C10_UNLIKELY(interp)) { + interp->trace_gpu_event_creation(reinterpret_cast(event_)); + } is_created_ = true; } diff --git a/aten/src/ATen/cuda/CUDASparse.h b/aten/src/ATen/cuda/CUDASparse.h index ecb7127dfa322..d309cd5d8e311 100644 --- a/aten/src/ATen/cuda/CUDASparse.h +++ b/aten/src/ATen/cuda/CUDASparse.h @@ -4,13 +4,26 @@ // cuSparse Generic API added in CUDA 10.1 // Windows support added in CUDA 11.0 -// ROCm is not enabled #if defined(CUDART_VERSION) && defined(CUSPARSE_VERSION) && ((CUSPARSE_VERSION >= 10300) || (CUSPARSE_VERSION >= 11000 && defined(_WIN32))) #define AT_USE_CUSPARSE_GENERIC_API() 1 #else #define AT_USE_CUSPARSE_GENERIC_API() 0 #endif +// hipSparse Generic API 
ROCm 5.2 +#if defined(USE_ROCM) && ROCM_VERSION >= 50200 +#define AT_USE_HIPSPARSE_GENERIC_52_API() 1 +#else +#define AT_USE_HIPSPARSE_GENERIC_52_API() 0 +#endif + +// hipSparse Generic API ROCm 5.1 +#if defined(USE_ROCM) && ROCM_VERSION >= 50100 +#define AT_USE_HIPSPARSE_GENERIC_API() 1 +#else +#define AT_USE_HIPSPARSE_GENERIC_API() 0 +#endif + // cuSparse Generic API spsv function was added in CUDA 11.3.0 #if defined(CUDART_VERSION) && defined(CUSPARSE_VERSION) && (CUSPARSE_VERSION >= 11500) #define AT_USE_CUSPARSE_GENERIC_SPSV() 1 diff --git a/aten/src/ATen/cuda/CUDASparseDescriptors.cpp b/aten/src/ATen/cuda/CUDASparseDescriptors.cpp index 3065babf89b6f..6319e214ac987 100644 --- a/aten/src/ATen/cuda/CUDASparseDescriptors.cpp +++ b/aten/src/ATen/cuda/CUDASparseDescriptors.cpp @@ -9,7 +9,7 @@ namespace at { namespace cuda { namespace sparse { -#if AT_USE_CUSPARSE_GENERIC_API() +#if AT_USE_CUSPARSE_GENERIC_API() || AT_USE_HIPSPARSE_GENERIC_API() namespace { @@ -53,6 +53,7 @@ cusparseIndexType_t getCuSparseIndexType(const c10::ScalarType& scalar_type) { } } +#if AT_USE_HIPSPARSE_GENERIC_52_API() || AT_USE_CUSPARSE_GENERIC_API() CuSparseDnMatDescriptor::CuSparseDnMatDescriptor(const Tensor& input, int64_t batch_offset) { TORCH_INTERNAL_ASSERT_DEBUG_ONLY(input.layout() == kStrided); IntArrayRef input_strides = input.strides(); @@ -105,6 +106,7 @@ CuSparseDnMatDescriptor::CuSparseDnMatDescriptor(const Tensor& input, int64_t ba descriptor_.reset(raw_descriptor); } +#endif // AT_USE_HIPSPARSE_GENERIC_52_API() || AT_USE_CUSPARSE_GENERIC_API() CuSparseDnVecDescriptor::CuSparseDnVecDescriptor(const Tensor& input) { // cuSPARSE doesn't support batched vectors @@ -175,7 +177,7 @@ CuSparseSpMatCsrDescriptor::CuSparseSpMatCsrDescriptor(const Tensor& input, int6 value_type // data type of values )); -#if defined(CUDA_VERSION) && CUDA_VERSION >= 11000 +#if AT_USE_HIPSPARSE_GENERIC_52_API() || (defined(CUDA_VERSION) && CUDA_VERSION >= 11000) if (ndim == 3 && batch_offset == -1) { int batch_count = at::native::cuda_int_cast(at::native::batchCount(input), "batch_count"); @@ -204,7 +206,7 @@ CuSparseSpMatCsrDescriptor::CuSparseSpMatCsrDescriptor(const Tensor& input, int6 descriptor_.reset(raw_descriptor); } -#endif // AT_USE_CUSPARSE_GENERIC_API() +#endif // AT_USE_CUSPARSE_GENERIC_API() || AT_USE_HIPSPARSE_GENERIC_API() } // namespace sparse } // namespace cuda diff --git a/aten/src/ATen/cuda/CUDASparseDescriptors.h b/aten/src/ATen/cuda/CUDASparseDescriptors.h index 40078b65df647..60c9ff0ffa88a 100644 --- a/aten/src/ATen/cuda/CUDASparseDescriptors.h +++ b/aten/src/ATen/cuda/CUDASparseDescriptors.h @@ -40,6 +40,11 @@ class CuSparseDescriptor { #if defined(USE_ROCM) // hipSPARSE doesn't define this using cusparseMatDescr = std::remove_pointer::type; +using cusparseDnMatDescr = std::remove_pointer::type; +using cusparseDnVecDescr = std::remove_pointer::type; +using cusparseSpMatDescr = std::remove_pointer::type; +using cusparseSpMatDescr = std::remove_pointer::type; +using cusparseSpGEMMDescr = std::remove_pointer::type; #if AT_USE_HIPSPARSE_TRIANGULAR_SOLVE() using bsrsv2Info = std::remove_pointer::type; using bsrsm2Info = std::remove_pointer::type; @@ -92,15 +97,17 @@ class TORCH_CUDA_CPP_API CuSparseBsrsm2Info #endif // AT_USE_HIPSPARSE_TRIANGULAR_SOLVE -#if AT_USE_CUSPARSE_GENERIC_API() +#if AT_USE_CUSPARSE_GENERIC_API() || AT_USE_HIPSPARSE_GENERIC_API() cusparseIndexType_t getCuSparseIndexType(const c10::ScalarType& scalar_type); +#if AT_USE_HIPSPARSE_GENERIC_52_API() || AT_USE_CUSPARSE_GENERIC_API() 
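The `CuSparse*Descriptor` classes whose compilation is being gated above are thin RAII wrappers around C-style create/destroy handle APIs. A generic sketch of that wrapper shape follows; the `fake_*` API is hypothetical and only stands in for the cusparseCreate*/cusparseDestroy* pairs:

```
#include <memory>

// Hypothetical C-style handle API; these names are invented for the sketch.
struct fake_descr { int payload = 0; };
fake_descr* fake_create() { return new fake_descr(); }
void fake_destroy(fake_descr* d) { delete d; }

// RAII wrapper: the handle is released exactly once, when the C++ object dies.
class DescriptorWrapper {
 public:
  DescriptorWrapper() : descr_(fake_create(), &fake_destroy) {}
  fake_descr* raw() const { return descr_.get(); }

 private:
  std::unique_ptr<fake_descr, void (*)(fake_descr*)> descr_;
};

int main() {
  DescriptorWrapper d;
  return d.raw() != nullptr ? 0 : 1;
}
```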
class TORCH_CUDA_CPP_API CuSparseDnMatDescriptor : public CuSparseDescriptor { public: explicit CuSparseDnMatDescriptor(const Tensor& input, int64_t batch_offset = -1); }; +#endif //AT_USE_HIPSPARSE_GENERIC_52_API() || AT_USE_CUSPARSE_GENERIC_API() class TORCH_CUDA_CPP_API CuSparseDnVecDescriptor : public CuSparseDescriptor { @@ -116,7 +123,7 @@ class TORCH_CUDA_CPP_API CuSparseSpMatCsrDescriptor public: explicit CuSparseSpMatCsrDescriptor(const Tensor& input, int64_t batch_offset = -1); -#if defined(CUDA_VERSION) && CUDA_VERSION >= 11000 +#if defined(USE_ROCM) || (defined(CUDA_VERSION) && CUDA_VERSION >= 11000) std::tuple get_size() { int64_t rows, cols, nnz; TORCH_CUDASPARSE_CHECK(cusparseSpMatGetSize( @@ -190,7 +197,7 @@ class TORCH_CUDA_CPP_API CuSparseSpSMDescriptor }; #endif -#if defined(CUDA_VERSION) && CUDA_VERSION >= 11000 +#if (defined(USE_ROCM) && ROCM_VERSION >= 50200) || (defined(CUDA_VERSION) && CUDA_VERSION >= 11000) class TORCH_CUDA_CPP_API CuSparseSpGEMMDescriptor : public CuSparseDescriptor { public: @@ -202,7 +209,7 @@ class TORCH_CUDA_CPP_API CuSparseSpGEMMDescriptor }; #endif -#endif // AT_USE_CUSPARSE_GENERIC_API() +#endif // AT_USE_CUSPARSE_GENERIC_API() || AT_USE_HIPSPARSE_GENERIC_API() } // namespace sparse } // namespace cuda diff --git a/aten/src/ATen/cuda/jiterator.h b/aten/src/ATen/cuda/jiterator.h index 41a6f719a9e33..ac2c4d7cecf3f 100644 --- a/aten/src/ATen/cuda/jiterator.h +++ b/aten/src/ATen/cuda/jiterator.h @@ -33,7 +33,7 @@ TORCH_CUDA_CPP_API c10::SmallVector CompileAndLaunchKernel( const c10::SmallVector& tensors, const c10::SmallVector& extra_args, bool return_by_ref) { - TORCH_CHECK(false, "Jiterator is not supported on ROCm"); + TORCH_CHECK(false, "Jiterator is not supported"); } }} // namespace at::cuda diff --git a/aten/src/ATen/cuda/jiterator_impl.h b/aten/src/ATen/cuda/jiterator_impl.h index 7144b6d8eeaf9..5ba251055ad2a 100644 --- a/aten/src/ATen/cuda/jiterator_impl.h +++ b/aten/src/ATen/cuda/jiterator_impl.h @@ -27,6 +27,16 @@ namespace native { _(7) \ _(8) +#define AT_FOR_8_CASES_WITH_COMMA(_) \ + _(1) , \ + _(2) , \ + _(3) , \ + _(4) , \ + _(5) , \ + _(6) , \ + _(7) , \ + _(8) + c10::SmallVector get_extra_args_typenames(const c10::SmallVector& extra_args) { c10::SmallVector args_typenames(extra_args.size()); for (auto i = 0; i < extra_args.size(); ++i) { @@ -83,9 +93,9 @@ static std::unique_ptr> make_unique_offset_calculator( template struct OffsetCalculatorVariant { -#define DEFINE_CASE(index) std::unique_ptr>, +#define DEFINE_CASE(index) std::unique_ptr> using OffsetCalculatorTypes = c10::variant< - AT_FOR_8_CASES(DEFINE_CASE) + AT_FOR_8_CASES_WITH_COMMA(DEFINE_CASE) >; #undef DEFINE_CASE @@ -113,9 +123,9 @@ struct OffsetCalculatorVariant { struct ArrayVariant { // works for up to 8 input + 8 outputs -#define DEFINE_CASE(index) at::detail::Array, at::detail::Array, +#define DEFINE_CASE(index) at::detail::Array, at::detail::Array using ArrayTypes = c10::variant< - AT_FOR_8_CASES(DEFINE_CASE) + AT_FOR_8_CASES_WITH_COMMA(DEFINE_CASE) >; #undef DEFINE_CASE @@ -149,9 +159,9 @@ struct ArrayVariant { }; struct TrivialOffsetCalculatorVariant { -#define DEFINE_CASE(index) TrivialOffsetCalculator, +#define DEFINE_CASE(index) TrivialOffsetCalculator using TrivialOffsetCalculatorTypes = c10::variant< - AT_FOR_8_CASES(DEFINE_CASE) + AT_FOR_8_CASES_WITH_COMMA(DEFINE_CASE) >; #undef DEFINE_CASE @@ -177,9 +187,9 @@ struct TrivialOffsetCalculatorVariant { }; struct LoadWithCastVariant { -#define DEFINE_CASE(index) std::unique_ptr>, +#define DEFINE_CASE(index) 
std::unique_ptr> using LoadWithCastPtr = c10::variant< - AT_FOR_8_CASES(DEFINE_CASE) + AT_FOR_8_CASES_WITH_COMMA(DEFINE_CASE) >; #undef DEFINE_CASE @@ -206,9 +216,9 @@ struct LoadWithCastVariant { }; struct StoreWithCastVariant { -#define DEFINE_CASE(index) std::unique_ptr>, +#define DEFINE_CASE(index) std::unique_ptr> using StoreWithCastPtr = c10::variant< - AT_FOR_8_CASES(DEFINE_CASE) + AT_FOR_8_CASES_WITH_COMMA(DEFINE_CASE) >; #undef DEFINE_CASE diff --git a/aten/src/ATen/cuda/llvm_complex.cpp b/aten/src/ATen/cuda/llvm_complex.cpp index 55e39e2802721..d88bdc4ce6579 100644 --- a/aten/src/ATen/cuda/llvm_complex.cpp +++ b/aten/src/ATen/cuda/llvm_complex.cpp @@ -834,7 +834,7 @@ complex::type> pow(const complex<_Tp>& __x, const complex<_Up>& __y) { typedef complex::type> result_type; - return _VSTD::pow(result_type(__x), result_type(__y)); + return std::pow(result_type(__x), result_type(__y)); } template @@ -847,7 +847,7 @@ typename enable_if pow(const complex<_Tp>& __x, const _Up& __y) { typedef complex::type> result_type; - return _VSTD::pow(result_type(__x), result_type(__y)); + return std::pow(result_type(__x), result_type(__y)); } template @@ -860,7 +860,7 @@ typename enable_if pow(const _Tp& __x, const complex<_Up>& __y) { typedef complex::type> result_type; - return _VSTD::pow(result_type(__x), result_type(__y)); + return std::pow(result_type(__x), result_type(__y)); } // __sqr, computes pow(x, 2) diff --git a/aten/src/ATen/jit_macros.h b/aten/src/ATen/jit_macros.h index ca765f03afbff..9af826549021a 100644 --- a/aten/src/ATen/jit_macros.h +++ b/aten/src/ATen/jit_macros.h @@ -3,12 +3,5 @@ #include // AT_USE_JITERATOR(), controls whether we jit some elementwise kernels -// Currently unsupported on ROCm GPUs -#if !AT_ROCM_ENABLED() #define AT_USE_JITERATOR() true #define jiterator_stringify(...) std::string(#__VA_ARGS__); -#else -#define AT_USE_JITERATOR() false -#define jiterator_stringify(...) \ - static_assert(false, "Jiterator is not supported on ROCm"); -#endif // USE_ROCM diff --git a/aten/src/ATen/jiterator_macros.h b/aten/src/ATen/jiterator_macros.h index 63a7dfa2eb967..3aa4c7ebb0af0 100644 --- a/aten/src/ATen/jiterator_macros.h +++ b/aten/src/ATen/jiterator_macros.h @@ -25,8 +25,8 @@ // These `,`s confuse the preprocessor into thinking we are passing // multiple arguments to the macro. #define jiterator_code(...) __VA_ARGS__ -#if defined(__CUDACC__) -// CPU and CUDA case +#if defined(__CUDACC__) || defined(__HIPCC__) +// CPU and CUDA and ROCm case #define stringify_code(...) 
#__VA_ARGS__ #define jiterator_also_stringify_as(code, str_name) \ code /* define the function */ \ diff --git a/aten/src/ATen/mps/IndexKernels.h b/aten/src/ATen/mps/IndexKernels.h new file mode 100644 index 0000000000000..b789cdc184161 --- /dev/null +++ b/aten/src/ATen/mps/IndexKernels.h @@ -0,0 +1,132 @@ +#pragma once + +namespace at { +namespace mps { + +static const char * indexing_metal_shaders = R"INDEX_METAL( +#include +using namespace metal; + +constant uint32_t num_indices [[function_constant(0)]]; + +struct IndexAB { + // Allow up to 16 indices + metal::array indexArray [[ id(0) ]]; +}; + +template +kernel void index_select( + constant const IndexAB & indexAB [[buffer(0)]], + constant const void * indexSizes [[buffer(1)]], + constant const void * indexStrides [[buffer(2)]], + constant const uint3 * offsets [[buffer(3)]], + constant const void * inputData [[buffer(4)]], + device void * outputData [[buffer(5)]], + uint thread_index [[thread_position_in_grid]]) { + + constant const int64_t * index_sizes = (constant const int64_t *)indexSizes; + constant const int64_t * index_strides = (constant const int64_t *)indexStrides; + int64_t offset = 0; + for (uint32_t i = 0; i < num_indices; i++) { + int64_t index = ((constant const int64_t*)(indexAB.indexArray[i]))[offsets[thread_index].z / sizeof(int64_t)]; + if (index < 0) { + index += index_sizes[i]; + } + offset += index * index_strides[i]; + } + device T * out = (device T*)((device char*)outputData + offsets[thread_index].x); + constant const T * in = (constant const T*)((constant const char*)inputData + offsets[thread_index].y + offset); + *out = *in; +} + +template +[[host_name("index_select_float")]] +kernel void index_select(constant const IndexAB & indexAB [[buffer(0)]], + constant const void * indexSizes [[buffer(1)]], + constant const void * indexStrides [[buffer(2)]], + constant const uint3 * offsets [[buffer(3)]], + constant const void * inputData [[buffer(4)]], + device void * outputData [[buffer(5)]], + uint thread_index [[thread_position_in_grid]]); +template +[[host_name("index_select_half")]] +kernel void index_select(constant const IndexAB & indexAB [[buffer(0)]], + constant const void * indexSizes [[buffer(1)]], + constant const void * indexStrides [[buffer(2)]], + constant const uint3 * offsets [[buffer(3)]], + constant const void * inputData [[buffer(4)]], + device void * outputData [[buffer(5)]], + uint thread_index [[thread_position_in_grid]]); +template +[[host_name("index_select_long")]] +kernel void index_select(constant const IndexAB & indexAB [[buffer(0)]], + constant const void * indexSizes [[buffer(1)]], + constant const void * indexStrides [[buffer(2)]], + constant const uint3 * offsets [[buffer(3)]], + constant const void * inputData [[buffer(4)]], + device void * outputData [[buffer(5)]], + uint thread_index [[thread_position_in_grid]]); +template +[[host_name("index_select_int")]] +kernel void index_select(constant const IndexAB & indexAB [[buffer(0)]], + constant const void * indexSizes [[buffer(1)]], + constant const void * indexStrides [[buffer(2)]], + constant const uint3 * offsets [[buffer(3)]], + constant const void * inputData [[buffer(4)]], + device void * outputData [[buffer(5)]], + uint thread_index [[thread_position_in_grid]]); +template +[[host_name("index_select_short")]] +kernel void index_select(constant const IndexAB & indexAB [[buffer(0)]], + constant const void * indexSizes [[buffer(1)]], + constant const void * indexStrides [[buffer(2)]], + constant const uint3 * offsets 
[[buffer(3)]], + constant const void * inputData [[buffer(4)]], + device void * outputData [[buffer(5)]], + uint thread_index [[thread_position_in_grid]]); +template +[[host_name("index_select_char")]] +kernel void index_select(constant const IndexAB & indexAB [[buffer(0)]], + constant const void * indexSizes [[buffer(1)]], + constant const void * indexStrides [[buffer(2)]], + constant const uint3 * offsets [[buffer(3)]], + constant const void * inputData [[buffer(4)]], + device void * outputData [[buffer(5)]], + uint thread_index [[thread_position_in_grid]]); +template +[[host_name("index_select_uchar")]] +kernel void index_select(constant const IndexAB & indexAB [[buffer(0)]], + constant const void * indexSizes [[buffer(1)]], + constant const void * indexStrides [[buffer(2)]], + constant const uint3 * offsets [[buffer(3)]], + constant const void * inputData [[buffer(4)]], + device void * outputData [[buffer(5)]], + uint thread_index [[thread_position_in_grid]]); + +template +[[host_name("index_select_bool")]] +kernel void index_select(constant const IndexAB & indexAB [[buffer(0)]], + constant const void * indexSizes [[buffer(1)]], + constant const void * indexStrides [[buffer(2)]], + constant const uint3 * offsets [[buffer(3)]], + constant const void * inputData [[buffer(4)]], + device void * outputData [[buffer(5)]], + uint thread_index [[thread_position_in_grid]]); + +kernel void kernel_index_offsets(constant const packed_uint3 * strides [[buffer(0)]], + device uint3 * data_offsets [[buffer(1)]], + constant const uint * iter_shape [[buffer(2)]], + constant const uint & num_dimensions [[buffer(3)]], + constant const uint & num_offsets [[buffer(4)]], + uint thread_index [[thread_position_in_grid]]) { + uint32_t idx = thread_index; + for (uint32_t dim = 0; dim < num_dimensions; dim++) { + uint32_t remainder = idx % iter_shape[dim]; + idx /= iter_shape[dim]; + for (uint32_t offset = 0; offset < num_offsets; offset++) + data_offsets[thread_index][offset] += remainder * strides[dim][offset]; + } +} +)INDEX_METAL"; +} +} diff --git a/aten/src/ATen/mps/MPSAllocator.mm b/aten/src/ATen/mps/MPSAllocator.mm index 2433acbc050b2..28275c782b794 100644 --- a/aten/src/ATen/mps/MPSAllocator.mm +++ b/aten/src/ATen/mps/MPSAllocator.mm @@ -336,7 +336,7 @@ DataPtr allocate(const size_t nbytes) const override { DeleterFnPtr raw_deleter() const override { return &Delete; } bool is_shared(void* ptr) const { return _getAllocImpl().isSharedBuffer(ptr); } - bool is_shared_storge_supported() const { return m_has_unified_memory; } + bool is_shared_storage_supported() const { return m_has_unified_memory; } private: bool m_has_unified_memory; @@ -375,7 +375,7 @@ static bool isEnvVarEnabled(const char *envvar) { at::Allocator* getMPSSharedAllocator() { auto& sa = _getSharedAllocator(); - if (sa.is_shared_storge_supported()) { + if (sa.is_shared_storage_supported()) { return &sa; } diff --git a/aten/src/ATen/mps/MPSDevice.h b/aten/src/ATen/mps/MPSDevice.h index d957c5440a06d..77e93ea1234a4 100644 --- a/aten/src/ATen/mps/MPSDevice.h +++ b/aten/src/ATen/mps/MPSDevice.h @@ -11,9 +11,15 @@ #include #include typedef id MTLDevice_t; +typedef id MTLLibrary_t; +typedef id MTLFunction_t; +typedef MTLFunctionConstantValues* MTLFunctionConstantValues_t; #else typedef void* MTLDevice; typedef void* MTLDevice_t; +typedef void* MTLLibrary_t; +typedef void* MTLFunction_t; +typedef void* MTLFunctionConstantValues_t; #endif using namespace std; @@ -48,11 +54,14 @@ class TORCH_API MPSDevice { return _mtl_device; } + MTLFunction_t 
metalIndexingFunction(const std::string &kernel, MTLFunctionConstantValues_t constantValues); + ~MPSDevice(); private: static MPSDevice* _device; MTLDevice_t _mtl_device; + MTLLibrary_t _mtl_indexing_library; MPSDevice(); }; diff --git a/aten/src/ATen/mps/MPSDevice.mm b/aten/src/ATen/mps/MPSDevice.mm index 2775100666494..007dfbea54bf5 100644 --- a/aten/src/ATen/mps/MPSDevice.mm +++ b/aten/src/ATen/mps/MPSDevice.mm @@ -3,6 +3,7 @@ #include #include +#include namespace at { namespace mps { @@ -10,6 +11,15 @@ static std::unique_ptr mps_device; static c10::once_flag mpsdev_init; +static inline MTLLanguageVersion getMetalLanguageVersion(const id& device) { + // MPS Advanced Indexing needs at least Metal 2.0 (support for Argument Buffers and function constants) + // host_name attribute needs at least Metal 2.2 + MTLLanguageVersion languageVersion = MTLLanguageVersion2_2; + + TORCH_CHECK([device supportsFamily:MTLGPUFamilyMac2], "Missing Metal support for MTLGPUFamilyMac2"); + return languageVersion; +} + MPSDevice* MPSDevice::getInstance() { c10::call_once(mpsdev_init, [] { mps_device = std::unique_ptr(new MPSDevice()); @@ -17,12 +27,41 @@ return mps_device.get(); } +id MPSDevice::metalIndexingFunction(const std::string& kernel, MTLFunctionConstantValues* constantValues) { + TORCH_INTERNAL_ASSERT_DEBUG_ONLY(_mtl_device); + NSError* error = nil; + if (!_mtl_indexing_library) { + MTLCompileOptions *options = [MTLCompileOptions new]; + [options setLanguageVersion: getMetalLanguageVersion(_mtl_device)]; + [options setFastMathEnabled: YES]; + _mtl_indexing_library = [_mtl_device newLibraryWithSource: [NSString stringWithCString: mps::indexing_metal_shaders encoding:NSASCIIStringEncoding] + options: options + error: &error]; + TORCH_CHECK(_mtl_indexing_library, "Failed to create indexing library, error: ", [[error description] UTF8String]); + } + + id indexFunction = nil; + if (constantValues) { + indexFunction = [[_mtl_indexing_library newFunctionWithName: [NSString stringWithUTF8String: kernel.c_str()] + constantValues: constantValues + error: &error] autorelease]; + } else { + indexFunction = [[_mtl_indexing_library newFunctionWithName: [NSString stringWithUTF8String: kernel.c_str()]] autorelease]; + } + + TORCH_CHECK(indexFunction, "Failed to create specialized function state object: ", kernel, ", error: ", [[error description] UTF8String]); + + return indexFunction; +} + MPSDevice::~MPSDevice() { [_mtl_device release]; + [_mtl_indexing_library release]; _mtl_device = nil; + _mtl_indexing_library = nil; } -MPSDevice::MPSDevice(): _mtl_device(nil) { +MPSDevice::MPSDevice(): _mtl_device(nil), _mtl_indexing_library(nil) { // Check that MacOS 12.3+ version of MPS framework is available // Create the MPSGraph and check method introduced in 12.3+ // which is used by MPS backend. @@ -45,7 +84,7 @@ break; } } - assert(_mtl_device); + TORCH_INTERNAL_ASSERT_DEBUG_ONLY(_mtl_device); } at::Allocator* getMPSSharedAllocator(); diff --git a/aten/src/ATen/mps/MPSFallback.mm b/aten/src/ATen/mps/MPSFallback.mm index c4488330be75a..4f9e635dce05a 100644 --- a/aten/src/ATen/mps/MPSFallback.mm +++ b/aten/src/ATen/mps/MPSFallback.mm @@ -35,10 +35,6 @@ void mps_error_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) // These ops are not supported via MPS backend currently, and we fallback to run on CPU. // For the rest of unsupported ops the user needs to pass 'PYTORCH_ENABLE_MPS_FALLBACK=1' // to fallback on CPU, otherwise we will error out. 
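The `metalIndexingFunction` added above compiles the embedded Metal source at most once and then resolves kernels by name on every call. Setting the Objective-C++/Metal specifics aside, the control flow is the familiar lazily-initialized-resource pattern; a plain C++ sketch with made-up types (`Library`, `Function`, `compile_library`) rather than real Metal API:

```
#include <mutex>
#include <stdexcept>
#include <string>

// Made-up stand-ins for the Metal objects; not real Metal/MPS API.
struct Function { std::string name; };
struct Library {
  Function function_named(const std::string& name) const { return Function{name}; }
};
static Library compile_library(const std::string& /*source*/) { return Library{}; }

// Compile the shader source at most once, then resolve kernels by name on
// each call -- the same control flow as the function added above.
Function indexing_function(const std::string& kernel_name) {
  static std::once_flag once;
  static Library lib;
  std::call_once(once, [] { lib = compile_library("/* metal source */"); });
  if (kernel_name.empty()) {
    throw std::runtime_error("empty kernel name");
  }
  return lib.function_named(kernel_name);
}

int main() {
  return indexing_function("index_select_float").name.empty() ? 1 : 0;
}
```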
- m.impl("bitwise_and.Tensor_out", torch::CppFunction::makeFromBoxedFunction<&mps_fallback>()); - m.impl("bitwise_or.Tensor_out", torch::CppFunction::makeFromBoxedFunction<&mps_fallback>()); - m.impl("bitwise_xor.Tensor_out", torch::CppFunction::makeFromBoxedFunction<&mps_fallback>()); - m.impl("bitwise_not.out", torch::CppFunction::makeFromBoxedFunction<&mps_fallback>()); m.impl("bitwise_left_shift.Tensor_out", torch::CppFunction::makeFromBoxedFunction<&mps_fallback>()); m.impl("bitwise_right_shift.Tensor_out", torch::CppFunction::makeFromBoxedFunction<&mps_fallback>()); m.impl("embedding_renorm_", torch::CppFunction::makeFromBoxedFunction<&mps_fallback>()); diff --git a/aten/src/ATen/mps/MPSGuardImpl.mm b/aten/src/ATen/mps/MPSGuardImpl.mm index c2987fdaa3e73..2aedeccf82cb9 100644 --- a/aten/src/ATen/mps/MPSGuardImpl.mm +++ b/aten/src/ATen/mps/MPSGuardImpl.mm @@ -9,9 +9,6 @@ void MPSGuardImpl::createEvent( mpsEvent_t* event, const EventFlag flag) const { - id mtl_device = MPSDevice::getInstance()->device(); - // when static casting we already create an _event object. - auto mps_event = static_cast(*event); } void MPSGuardImpl::destroyEvent( diff --git a/aten/src/ATen/native/AdaptiveAveragePooling.cpp b/aten/src/ATen/native/AdaptiveAveragePooling.cpp index cf4321a1d2d60..855d54eadba88 100644 --- a/aten/src/ATen/native/AdaptiveAveragePooling.cpp +++ b/aten/src/ATen/native/AdaptiveAveragePooling.cpp @@ -16,16 +16,16 @@ namespace { IntArrayRef output_size) { TORCH_CHECK(output_size.size() == 2, "adaptive_avg_pool2d: output_size must be 2"); - int64_t ndim = input.ndimension(); - for (const auto i : c10::irange(1, ndim)) { + int64_t ndim = input.dim(); + TORCH_CHECK((ndim == 3 || ndim == 4), + "adaptive_avg_pool2d(): Expected 3D or 4D tensor, but got ", input.sizes()); + for (const auto i : {-2, -1}) { TORCH_CHECK(input.size(i) > 0, "adaptive_avg_pool2d(): Expected input to have non-zero size for non-batch dimensions, " - "but input has sizes ", input.sizes(), " with dimension ", i, " being " + "but input has sizes ", input.sizes(), " with dimension ", i + ndim, " being " "empty"); } - TORCH_CHECK((ndim == 3 || ndim == 4), - "adaptive_avg_pool2d(): Expected 3D or 4D tensor, but got ", input.sizes()); TORCH_CHECK(input.dtype() == output.dtype(), "expected dtype ", input.dtype(), " for `output` but got dtype ", output.dtype()); diff --git a/aten/src/ATen/native/BatchLinearAlgebra.cpp b/aten/src/ATen/native/BatchLinearAlgebra.cpp index b2dc974f5a3b8..56c66171a9616 100644 --- a/aten/src/ATen/native/BatchLinearAlgebra.cpp +++ b/aten/src/ATen/native/BatchLinearAlgebra.cpp @@ -27,12 +27,6 @@ extern "C" void cgetrf_(int *m, int *n, std::complex *a, int *lda, int *i extern "C" void dgetrf_(int *m, int *n, double *a, int *lda, int *ipiv, int *info); extern "C" void sgetrf_(int *m, int *n, float *a, int *lda, int *ipiv, int *info); -// getri -extern "C" void zgetri_(int *n, std::complex *a, int *lda, int *ipiv, std::complex *work, int *lwork, int *info); -extern "C" void cgetri_(int *n, std::complex *a, int *lda, int *ipiv, std::complex *work, int *lwork, int *info); -extern "C" void dgetri_(int *n, double *a, int *lda, int *ipiv, double *work, int *lwork, int *info); -extern "C" void sgetri_(int *n, float *a, int *lda, int *ipiv, float *work, int *lwork, int *info); - // potrs extern "C" void zpotrs_(char *uplo, int *n, int *nrhs, std::complex *a, int *lda, std::complex *b, int *ldb, int *info); extern "C" void cpotrs_(char *uplo, int *n, int *nrhs, std::complex *a, int *lda, std::complex *b, int *ldb, 
int *info); @@ -454,6 +448,18 @@ TORCH_META_FUNC(_linalg_solve_ex)(const Tensor& A, set_output_contiguous(3, shape.slice(0, ndim - 2), A.options().dtype(kInt)); } +TORCH_META_FUNC(linalg_inv_ex)(const Tensor& A, bool check_errors) { + at::native::squareCheckInputs(A, "linalg.inv"); + at::native::checkFloatingOrComplex(A, "linalg.inv", /*allow_low_precision_dtypes*/false); + + auto shape = A.sizes(); + + auto result_strides = at::native::batched_matrix_contiguous_strides(shape, /*f-contig*=*/true); + set_output_strided(0, shape, result_strides, A.options(), {}); + set_output_contiguous( + 1, shape.slice(0, shape.size() - 2), A.options().dtype(ScalarType::Int)); // info +} + TORCH_META_FUNC(linalg_lu_factor_ex)(const Tensor& A, bool pivot, bool check_errors) { TORCH_CHECK(A.dim() >= 2, "torch.lu_factor: Expected tensor with 2 or more dimensions. Got size: ", A.sizes(), " instead"); @@ -682,31 +688,12 @@ namespace native { // Define the per-batch functions to be used in the main implementation of the batched // linear algebra operations -template -void lapackGetri(int n, scalar_t *a, int lda, int *ipiv, scalar_t *work, int lwork, int *info); - template void lapackCholeskySolve(char uplo, int n, int nrhs, scalar_t *a, int lda, scalar_t *b, int ldb, int *info); template void lapackSymeig(char jobz, char uplo, int n, scalar_t *a, int lda, value_t *w, scalar_t *work, int lwork, value_t *rwork, int *info); -template<> void lapackGetri>(int n, c10::complex *a, int lda, int *ipiv, c10::complex *work, int lwork, int *info) { - zgetri_(&n, reinterpret_cast*>(a), &lda, ipiv, reinterpret_cast*>(work), &lwork, info); -} - -template<> void lapackGetri>(int n, c10::complex *a, int lda, int *ipiv, c10::complex *work, int lwork, int *info) { - cgetri_(&n, reinterpret_cast*>(a), &lda, ipiv, reinterpret_cast*>(work), &lwork, info); -} - -template<> void lapackGetri(int n, double *a, int lda, int *ipiv, double *work, int lwork, int *info) { - dgetri_(&n, a, &lda, ipiv, work, &lwork, info); -} - -template<> void lapackGetri(int n, float *a, int lda, int *ipiv, float *work, int lwork, int *info) { - sgetri_(&n, a, &lda, ipiv, work, &lwork, info); -} - template<> void lapackLu>(int m, int n, c10::complex *a, int lda, int *ipiv, int *info) { zgetrf_(&m, &n, reinterpret_cast*>(a), &lda, ipiv, info); } @@ -1513,223 +1500,37 @@ bool _requires_fw_or_bw_grad(const Tensor& input) { || input._fw_grad(/*level */ 0).defined()); } -// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ inverse ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -/* -Computes the inverse of n-by-n matrix 'self' -This is an in-place routine, it overwrites the content of 'self'. -'infos_lu' and 'infos_getri' are int Tensors containing error codes for each matrix in the batched input. -'infos_lu' is for holding lapackLU errors, and 'infos_getri' is for holding lapackGetri errors. -For more information see LAPACK's documentation for GETRI and GETRF routines. 
-*/ -template -static void apply_inverse(Tensor& self, Tensor& infos_lu, Tensor& infos_getri) { -#if !AT_BUILD_WITH_LAPACK() - AT_ERROR("inverse: LAPACK library not found in compilation"); -#else - using value_t = typename c10::scalar_value_type::type; - auto self_data = self.data_ptr(); - auto self_matrix_stride = matrixStride(self); - auto batch_size = batchCount(self); - auto n = self.size(-2); - auto lda = std::max(1, n); - - auto ipiv = at::empty({lda}, self.options().dtype(kInt)); - auto ipiv_data = ipiv.data_ptr(); - auto infos_lu_data = infos_lu.data_ptr(); - auto infos_getri_data = infos_getri.data_ptr(); - - // NOLINTNEXTLINE(cppcoreguidelines-init-variables) - int info; - // Run once, first to get the optimum work size - // Since we deal with batches of matrices with the same dimensions, doing this outside - // the loop saves (batch_size - 1) workspace queries which would provide the same result - // and (batch_size - 1) calls to allocate and deallocate workspace using at::empty() - int lwork = -1; - scalar_t wkopt; - lapackGetri(n, self_data, lda, ipiv_data, &wkopt, lwork, &info); - lwork = std::max(1, real_impl(wkopt)); - Tensor work = at::empty({lwork}, self.options()); - auto work_data = work.data_ptr(); - - for (const auto i : c10::irange(batch_size)) { - scalar_t* self_working_ptr = &self_data[i * self_matrix_stride]; - int* info_lu_working_ptr = &infos_lu_data[i]; - lapackLu(n, n, self_working_ptr, lda, ipiv_data, info_lu_working_ptr); - - // now compute the actual inverse - int* info_getri_working_ptr = &infos_getri_data[i]; - lapackGetri(n, self_working_ptr, lda, ipiv_data, work_data, lwork, info_getri_working_ptr); - } -#endif -} - -Tensor inverse(const Tensor &self) { - if (self.numel() == 0) { - return at::empty_like(self); - } - return at::linalg_inv(self); -} - -Tensor& inverse_out(const Tensor &self, Tensor &result) { - at::linalg_inv_out(result, self); - return result; -} - -// This is a type dispatching helper function for 'apply_inverse' -Tensor& _linalg_inv_out_helper_cpu(Tensor &result, Tensor& infos_lu, Tensor& infos_getri) { - // This function calculates the inverse matrix in-place - // result should be in column major order and contain matrices to invert - // the content of result is overwritten by 'apply_inverse' - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES(result.scalar_type(), "linalg_inv_out_cpu", [&]{ - apply_inverse(result, infos_lu, infos_getri); - }); - return result; -} - -// Computes the inverse matrix of 'input', it is saved to 'result' in-place -// LAPACK/MAGMA/cuSOLVER error codes are saved in 'infos' tensors, they are not checked here -static Tensor& linalg_inv_out_info(Tensor& result, Tensor& infos_lu, Tensor& infos_getri, const Tensor& input) { - squareCheckInputs(input, "linalg.inv"); - checkSameDevice("linalg.inv", result, input); - checkLinalgCompatibleDtype("linalg.inv", result, input); - - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(infos_lu.scalar_type() == kInt); - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(infos_getri.scalar_type() == kInt); - - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(infos_lu.device() == input.device()); - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(infos_getri.device() == input.device()); - - bool result_input_same_type = (result.scalar_type() == input.scalar_type()); - bool result_equal_expected_shape = result.sizes().equals(input.sizes()); - bool is_batched_column_major = false; - if (result.dim() >= 2) { - is_batched_column_major = result.mT().is_contiguous(); - } - - // if result is not empty and not in batched column major format - bool copy_needed = 
(result.numel() != 0 && !is_batched_column_major); - copy_needed |= !result_input_same_type; // or result does not have the same dtype as input - copy_needed |= (result.numel() != 0 && !result_equal_expected_shape); // or result does not have the expected shape - // we have to allocate a temporary tensor - - // similar conditions for infos_lu and infos_getri tensors - auto expected_info_shape = IntArrayRef(input.sizes().cbegin(), input.sizes().cend() - 2); // input.shape[:-2] - copy_needed |= (infos_lu.numel() != 0 && !infos_lu.is_contiguous()); - copy_needed |= (infos_lu.numel() != 0 && !(infos_lu.sizes().equals(expected_info_shape))); - - copy_needed |= (infos_getri.numel() != 0 && !infos_getri.is_contiguous()); - copy_needed |= (infos_getri.numel() != 0 && !(infos_getri.sizes().equals(expected_info_shape))); - - if (copy_needed) { - Tensor result_tmp = at::empty(input.sizes(), input.options()); - result_tmp.transpose_(-2, -1); - Tensor infos_lu_tmp = at::zeros({expected_info_shape}, input.options().dtype(kInt)); - Tensor infos_getri_tmp = at::zeros({expected_info_shape}, input.options().dtype(kInt)); - - result_tmp = linalg_inv_out_info(result_tmp, infos_lu_tmp, infos_getri_tmp, input); - - at::native::resize_output(result, result_tmp.sizes()); - result.copy_(result_tmp); - at::native::resize_output(infos_lu, infos_lu_tmp.sizes()); - infos_lu.copy_(infos_lu_tmp); - at::native::resize_output(infos_getri, infos_getri_tmp.sizes()); - infos_getri.copy_(infos_getri_tmp); - return result; - } - // else use result's storage directly - - // if result has no elements we can modify it - if (result.numel() == 0) { - at::native::resize_as_(result, input.mT(), MemoryFormat::Contiguous); - result.transpose_(-2, -1); - } - - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(result.sizes().equals(input.sizes())); - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(result.scalar_type() == input.scalar_type()); - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(result.device() == input.device()); - - // result tensor must be in batched column major order (Fortran contiguous) - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(result.mT().is_contiguous()); - - // if info has no elements we can modify it - if (infos_lu.numel() == 0) { - infos_lu.resize_(expected_info_shape); - infos_lu.fill_(0); - } - if (infos_getri.numel() == 0) { - infos_getri.resize_(expected_info_shape); - infos_getri.fill_(0); +// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ linalg.inv ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +TORCH_IMPL_FUNC(linalg_inv_ex_out)(const Tensor& A, bool check_errors, const Tensor& result, const Tensor& info) { + // Fill result with the identity + result.zero_(); + result.diagonal(0, -2, -1).fill_(1.); + at::linalg_solve_ex_out(const_cast(result), const_cast(info), A, result, /*left*/true); + if (check_errors) { + at::_linalg_check_errors(info, "linalg.inv_ex", A.dim() == 2); } - - // info tensors must be contiguous - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(infos_lu.is_contiguous()); - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(infos_lu.sizes().equals(expected_info_shape)); - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(infos_getri.is_contiguous()); - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(infos_getri.sizes().equals(expected_info_shape)); - - // _linalg_inv_out_helper_ (apply_inverse) performs calculations in-place and result must be a copy of input - result.copy_(input); - - // TODO: Replace this helper with DECLARE/DEFINE_DISPATCH - result = at::_linalg_inv_out_helper_(result, infos_lu, infos_getri); - return result; } -// Computes the inverse matrix of 'input', it is saved to 'result' in-place -Tensor& 
linalg_inv_out(const Tensor &input, Tensor &result) { - auto info_shape = IntArrayRef(input.sizes().cbegin(), input.sizes().cend() - 2); // input.shape[:-2] - auto infos_lu = at::zeros({info_shape}, input.options().dtype(kInt)); - auto infos_getri = at::zeros({info_shape}, input.options().dtype(kInt)); - result = linalg_inv_out_info(result, infos_lu, infos_getri, input); - - // Now check LAPACK/MAGMA/cuSOLVER error codes - at::_linalg_check_errors(infos_lu, "linalg.inv", result.dim() == 2); - at::_linalg_check_errors(infos_getri, "linalg.inv", result.dim() == 2); +Tensor& linalg_inv_out(const Tensor& A, Tensor& result) { + auto info = at::empty({0}, A.options().dtype(kInt)); + at::linalg_inv_ex_out(result, info, A); + at::_linalg_check_errors(info, "linalg.inv", A.dim() == 2); return result; } -// Computes the inverse matrix of 'input' -Tensor linalg_inv(const Tensor &input) { +Tensor linalg_inv(const Tensor& A) { Tensor result, info; - std::tie(result, info) = at::linalg_inv_ex(input, /*check_errors=*/false); - - // we pass check_errors=false above and do the check here - // so that the name of the function is correct in the error message - at::_linalg_check_errors(info, "torch.linalg.inv", input.dim() == 2); + std::tie(result, info) = at::linalg_inv_ex(A); + at::_linalg_check_errors(info, "linalg.inv", A.dim() == 2); return result; } -std::tuple linalg_inv_ex_out(const Tensor& input, bool check_errors, Tensor& inverse, Tensor& info) { - squareCheckInputs(input, "linalg.inv_ex"); - ScalarType info_output_type = ScalarType::Int; - TORCH_CHECK( - info.scalar_type() == info_output_type, - "torch.linalg.inv_ex: ", - "Expected info to have ", info_output_type, " dtype, but got info with dtype ", info.scalar_type()); - - // provided `info` tensor is used to save the information about the LU decomposition of `input` - // in addition current implementation requires a separate tensor - // for saving the information about the inversion process after the LU decomposition - auto expected_info_shape = IntArrayRef(input.sizes().cbegin(), input.sizes().cend() - 2); // input.shape[:-2] - auto info_inversion = at::zeros({expected_info_shape}, input.options().dtype(kInt)); - - linalg_inv_out_info(inverse, info, info_inversion, input); - - if (check_errors) { - at::_linalg_check_errors(info, "torch.linalg.inv_ex", input.dim() == 2); - } - - return std::tuple(inverse, info); +Tensor& inverse_out(const Tensor& A, Tensor& result) { + return at::linalg_inv_out(result, A); } -std::tuple linalg_inv_ex(const Tensor& input, bool check_errors) { - squareCheckInputs(input, "linalg.inv_ex"); - Tensor inverse = at::empty(input.sizes(), input.options(), MemoryFormat::Contiguous); - inverse.transpose_(-2, -1); // make `inverse` tensor with batched column major format - auto info_shape = IntArrayRef(input.sizes().cbegin(), input.sizes().cend() - 2); // input.shape[:-2] - Tensor info = at::zeros({info_shape}, input.options().dtype(kInt)); - std::tie(inverse, info) = at::native::linalg_inv_ex_out(input, check_errors, inverse, info); - return std::make_tuple(inverse, info); +Tensor inverse(const Tensor& A) { + return at::linalg_inv(A); } // ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ cholesky_solve ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -2001,6 +1802,7 @@ TORCH_IMPL_FUNC(_linalg_solve_ex_out)(const Tensor& A, // Possible optimization: Compute the LU factorization of A^T if A is contiguous // Then we solve A^T X = B with adjoint=True // This saves a copy as A doesn't need to be copied into an F-contig matrix in lu_factor + // This 
optimization makes functorch's batching rule difficult. See NOTE [ solve_ex Batch Rule Contiguity ] const bool use_A_T = A.is_contiguous() && !A.is_complex(); at::linalg_lu_factor_ex_out(const_cast(LU), const_cast(pivots), diff --git a/aten/src/ATen/native/Convolution.cpp b/aten/src/ATen/native/Convolution.cpp index 9f2d8efbd6181..7128b5c8aea1d 100644 --- a/aten/src/ATen/native/Convolution.cpp +++ b/aten/src/ATen/native/Convolution.cpp @@ -713,6 +713,7 @@ at::Tensor complex_convolution( IntArrayRef stride, IntArrayRef padding, IntArrayRef dilation, + bool transposed, IntArrayRef output_padding, int64_t groups) { check_input_same_type_as_parameters(input, weight, bias); @@ -730,15 +731,15 @@ at::Tensor complex_convolution( // conv(W, x, b) = a - b + i(c - a - b) Tensor a, b, c; if (!bias.defined()) { - a = at::convolution(i_r, w_r, bias, stride, padding, dilation, false, output_padding, groups); - b = at::convolution(i_i, w_i, bias, stride, padding, dilation, false, output_padding, groups); - c = at::convolution(i_r + i_i, w_r + w_i, bias, stride, padding, dilation, false, output_padding, groups); + a = at::convolution(i_r, w_r, bias, stride, padding, dilation, transposed, output_padding, groups); + b = at::convolution(i_i, w_i, bias, stride, padding, dilation, transposed, output_padding, groups); + c = at::convolution(i_r + i_i, w_r + w_i, bias, stride, padding, dilation, transposed, output_padding, groups); } else { Tensor b_r, b_i; std::tie(b_r, b_i) = complex_to_real(bias.resolve_conj()); - a = at::convolution(i_r, w_r, b_r, stride, padding, dilation, false, output_padding, groups); - b = at::convolution(i_i, w_i, Tensor(), stride, padding, dilation, false, output_padding, groups); - c = at::convolution(i_r + i_i, w_r + w_i, b_r + b_i, stride, padding, dilation, false, output_padding, groups); + a = at::convolution(i_r, w_r, b_r, stride, padding, dilation, transposed, output_padding, groups); + b = at::convolution(i_i, w_i, Tensor(), stride, padding, dilation, transposed, output_padding, groups); + c = at::convolution(i_r + i_i, w_r + w_i, b_r + b_i, stride, padding, dilation, transposed, output_padding, groups); } auto i = c10::Scalar(c10::complex(0, 1)); @@ -791,7 +792,7 @@ at::Tensor conv1d( std::tie(input, is_batched) = batchify(input_, /*num_spatial_dims=*/ 1, "conv1d"); Tensor output; if (at::isComplexType(input_.scalar_type())) { - output = complex_convolution(input, weight, bias, stride, padding, dilation, {0}, groups); + output = complex_convolution(input, weight, bias, stride, padding, dilation, false, {0}, groups); } else { output = at::convolution(input, weight, bias, stride, padding, dilation, false, {0}, groups); } @@ -810,7 +811,7 @@ at::Tensor conv2d( std::tie(input, is_batched) = batchify(input_, /*num_spatial_dims=*/ 2, "conv2d"); Tensor output; if (at::isComplexType(input_.scalar_type())) { - output = complex_convolution(input, weight, bias, stride, padding, dilation, {{0, 0}}, groups); + output = complex_convolution(input, weight, bias, stride, padding, dilation, false, {{0, 0}}, groups); } else { output = at::convolution(input, weight, bias, stride, padding, dilation, false, {{0, 0}}, groups); } @@ -829,7 +830,7 @@ at::Tensor conv3d( std::tie(input, is_batched) = batchify(input_, /*num_spatial_dims=*/ 3, "conv3d"); Tensor output; if (at::isComplexType(input_.scalar_type())) { - output = complex_convolution(input, weight, bias, stride, padding, dilation, {{0, 0, 0}}, groups); + output = complex_convolution(input, weight, bias, stride, padding, dilation, false, {{0, 
0, 0}}, groups); } else { output = at::convolution(input, weight, bias, stride, padding, dilation, false, {{0, 0, 0}}, groups); } @@ -979,8 +980,14 @@ at::Tensor conv_transpose1d( Tensor input; bool is_batched; std::tie(input, is_batched) = batchify(input_, /*num_spatial_dims=*/ 1, "conv_transpose1d"); - auto output = at::convolution( + Tensor output; + if (at::isComplexType(input_.scalar_type())) { + output = complex_convolution( input, weight, bias, stride, padding, dilation, true, output_padding, groups); + } else { + output = at::convolution( + input, weight, bias, stride, padding, dilation, true, output_padding, groups); + } return is_batched ? output : output.squeeze(0); } diff --git a/aten/src/ATen/native/Correlation.cpp b/aten/src/ATen/native/Correlation.cpp index 0bd27195df766..a00aeb4dc9257 100644 --- a/aten/src/ATen/native/Correlation.cpp +++ b/aten/src/ATen/native/Correlation.cpp @@ -1,5 +1,6 @@ #include #include +#include namespace at { namespace native { @@ -47,7 +48,7 @@ Tensor cov( " != ", num_observations); TORCH_CHECK( - num_observations == 0 || w.min().ge(0).item(), + num_observations == 0 || at::is_scalar_tensor_true(w.min().ge(0)), "cov(): fweights cannot be negative"); } @@ -70,7 +71,7 @@ Tensor cov( " != ", num_observations); TORCH_CHECK( - num_observations == 0 || aw.min().ge(0).item(), + num_observations == 0 || at::is_scalar_tensor_true(aw.min().ge(0)), "cov(): aweights cannot be negative"); w = w.defined() ? w * aw : aw; } @@ -81,7 +82,7 @@ Tensor cov( : at::scalar_tensor(num_observations, in.options().dtype(kLong)); TORCH_CHECK( - !w.defined() || w_sum.ne(0).item(), + !w.defined() || at::is_scalar_tensor_true(w_sum.ne(0)), "cov(): weights sum to zero, can't be normalized"); const auto avg = (w.defined() ? in * w : in).sum(OBSERVATIONS_DIM) / w_sum; @@ -95,7 +96,7 @@ Tensor cov( norm_factor = w_sum - correction; } - if (norm_factor.le(0).item()) { + if (at::is_scalar_tensor_true(norm_factor.le(0))) { TORCH_WARN("cov(): degrees of freedom is <= 0"); norm_factor.zero_(); } diff --git a/aten/src/ATen/native/Cross.cpp b/aten/src/ATen/native/Cross.cpp index 4b3e43da1147b..9b268c6e3d542 100644 --- a/aten/src/ATen/native/Cross.cpp +++ b/aten/src/ATen/native/Cross.cpp @@ -9,17 +9,20 @@ namespace at { namespace meta { -TORCH_PRECOMPUTE_META_FUNC(linalg_cross) -(const Tensor & input, const Tensor & other, const int64_t dimension) { - auto out_size = infer_size(input.sizes(), other.sizes()); - Tensor input_broadcasted = input.expand(out_size); - Tensor other_broadcasted = other.expand(out_size); +TORCH_META_FUNC(linalg_cross) +(const Tensor & input, const Tensor & other, int64_t dim) { + auto x_d = input.dim(); + auto y_d = other.dim(); + // This is to avoid things like + // linalg.cross(torch.randn(2, 3), torch.randn(5, 2, 3), dim=2) + TORCH_CHECK(x_d == y_d, "linalg.cross: inputs must have the same number of dimensions."); + TORCH_CHECK(input.size(dim) == 3 && other.size(dim) == 3, "linalg.cross: inputs dimension ", dim, " must have length 3. Got ", input.size(dim), " and ", other.size(dim)); - int64_t dim = maybe_wrap_dim(dimension, input.dim()); // default dim = -1 - TORCH_CHECK(input_broadcasted.size(dim) == 3, "dimension ", dimension, " does not have size 3"); + // Broadcast the batch dimension of input and other. 
+ // Since the non-batch dimensions agree, this is the same as broadcast all the inputs + auto out_size = infer_size(input.sizes(), other.sizes()); set_output_raw_strided(0, out_size, {}, input.options()); - return TORCH_PRECOMPUTE_STRUCT(linalg_cross)().set_dim(dim); } } @@ -56,8 +59,9 @@ Tensor & cross_out(const Tensor & input, const Tensor & other, const c10::option TORCH_IMPL_FUNC(linalg_cross_out) -(const Tensor & input, const Tensor & other, const int64_t dim, const Tensor & out) { - auto out_size = infer_size(input.sizes(), other.sizes()); +(const Tensor & input, const Tensor & other, int64_t dim, const Tensor & out) { + dim = maybe_wrap_dim(dim, input.dim()); + auto out_size = out.sizes(); Tensor input_broadcasted = input.expand(out_size); Tensor other_broadcasted = other.expand(out_size); diff --git a/aten/src/ATen/native/DispatchStub.h b/aten/src/ATen/native/DispatchStub.h index 6e71b5bb5881b..bcbf41fd9d0ff 100644 --- a/aten/src/ATen/native/DispatchStub.h +++ b/aten/src/ATen/native/DispatchStub.h @@ -114,6 +114,7 @@ struct TORCH_API DispatchStubImpl { std::atomic cpu_dispatch_ptr; void* cuda_dispatch_ptr; void* hip_dispatch_ptr; + void* mps_dispatch_ptr; #else std::atomic cpu_dispatch_ptr{nullptr}; void* cuda_dispatch_ptr = nullptr; diff --git a/aten/src/ATen/native/Dropout.cpp b/aten/src/ATen/native/Dropout.cpp index 36e1b92ad1bdb..2514f15d00e1a 100644 --- a/aten/src/ATen/native/Dropout.cpp +++ b/aten/src/ATen/native/Dropout.cpp @@ -109,7 +109,7 @@ native_dropout_cpu(const Tensor& input, double p, c10::optional train) { return std::make_tuple(output, mask); } -Tensor native_dropout_backward_cpu(const Tensor& grad, const Tensor& mask, double scale) { +Tensor native_dropout_backward(const Tensor& grad, const Tensor& mask, double scale) { Tensor result = grad * mask * scale; return result; } @@ -117,7 +117,10 @@ Tensor native_dropout_backward_cpu(const Tensor& grad, const Tensor& mask, doubl Tensor dropout(const Tensor& input, double p, bool train) { auto result = [&]() { NoNamesGuard guard; - if (train && is_fused_kernel_acceptable(input, p)) { + // TODO: we can remove this is_nested() code smell in the future + // if we find a way to support _dropout for nested tensor + // e.g. make it an op (at::_dropout) to use dispatcher? + if (input.is_nested() || (train && is_fused_kernel_acceptable(input, p))) { return std::get<0>(at::native_dropout(input, p, train)); } return _dropout(input, p, train); diff --git a/aten/src/ATen/native/ForeachOpsKernels.cpp b/aten/src/ATen/native/ForeachOpsKernels.cpp index 7d6dec7ad24a7..f5665be248e46 100644 --- a/aten/src/ATen/native/ForeachOpsKernels.cpp +++ b/aten/src/ATen/native/ForeachOpsKernels.cpp @@ -199,6 +199,9 @@ FOREACH_POINTWISE_OP_SCALAR(addcmul); FOREACH_POINTWISE_OP_SCALARLIST(addcdiv); FOREACH_POINTWISE_OP_SCALARLIST(addcmul); +// NOTE(crcrpar): It didn't seem feasible to use `self[i]` as both the first and the last +// arguments of `maximum_out` and `minimum_out` so I tentatively embarrassingly get and copy +// the result to `self[i]`. 
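For readers skimming the patch, an editorial sketch (not part of the change itself) of what the in-place slow path defined by the macro right below amounts to for `maximum`, per the NOTE above: compute out-of-place, then copy the result back. The function name is illustrative and the `check_foreach_api_restrictions` validation is omitted.

```
// Editorial sketch only: roughly what foreach_tensor_maximum_slow_ does.
#include <ATen/ATen.h>
#include <c10/util/irange.h>

void foreach_maximum_inplace_sketch(at::TensorList self, at::TensorList other) {
  for (const auto i : c10::irange(self.size())) {
    const auto tmp = at::maximum(self[i], other[i]);  // out-of-place result
    self[i].copy_(tmp, /*non_blocking=*/true);        // copy back into self[i]
  }
}
```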
#define FOREACH_MAXIMUM_MINIMUM_OP(NAME) \ std::vector foreach_tensor_##NAME##_slow(TensorList tensors1, TensorList tensors2) { \ check_foreach_api_restrictions(tensors1, tensors2); \ @@ -211,6 +214,13 @@ std::vector foreach_tensor_##NAME##_slow(TensorList tensors1, TensorList \ return result; \ } \ +void foreach_tensor_##NAME##_slow_(TensorList self, TensorList other) { \ + check_foreach_api_restrictions(self, other); \ + for (const auto i : c10::irange(self.size())) { \ + const auto tmp = at::NAME(self[i], other[i]); \ + self[i].copy_(tmp, /* non_blocking */ true); \ + } \ +} FOREACH_MAXIMUM_MINIMUM_OP(maximum) FOREACH_MAXIMUM_MINIMUM_OP(minimum) diff --git a/aten/src/ATen/native/Integration.cpp b/aten/src/ATen/native/Integration.cpp index d1795a70b7222..7ca01bae18a57 100644 --- a/aten/src/ATen/native/Integration.cpp +++ b/aten/src/ATen/native/Integration.cpp @@ -34,10 +34,10 @@ Tensor do_trapezoid(const Tensor& y, double dx, int64_t dim) { } Tensor zeros_like_except(const Tensor& y, int64_t dim) { - auto sizes = y.sizes().vec(); + auto sizes = y.sym_sizes().vec(); dim = maybe_wrap_dim(dim, y.dim()); sizes.erase(sizes.begin() + dim); - return at::zeros(sizes, y.options()); + return at::zeros_symint(sizes, y.options()); } Tensor do_cumulative_trapezoid(const Tensor& y, const Tensor& dx, int64_t dim) { @@ -111,7 +111,7 @@ Tensor trapezoid(const Tensor& y, const Tensor& x, int64_t dim) { Tensor trapezoid(const Tensor& y, const Scalar& dx, int64_t dim) { // see above - if (y.size(dim) == 0) { + if (y.sym_size(dim) == 0) { return zeros_like_except(y, dim); } TORCH_CHECK(y.scalar_type() != kBool, "trapezoid: received a bool input for `y`, but bool is not supported") diff --git a/aten/src/ATen/native/Linear.cpp b/aten/src/ATen/native/Linear.cpp index a002369fc547d..6137a2f0f153f 100644 --- a/aten/src/ATen/native/Linear.cpp +++ b/aten/src/ATen/native/Linear.cpp @@ -26,9 +26,6 @@ Tensor linear(const Tensor& input, const Tensor& weight, const c10::optional batch_norm_cpu_update_stats_template( auto _var_sum = at::empty({n_input}, input.options().dtype(dtype)); auto _mean_a = _mean.accessor(); auto _var_sum_a = _var_sum.accessor(); + auto momentum_ = static_cast(momentum); batch_norm_cpu_collect_stats_stub(kCPU, _mean, _var_sum, input); @@ -195,11 +196,11 @@ std::tuple batch_norm_cpu_update_stats_template( save_var_transform_a[f] = VarTransform{}(_var_sum_a[f] / n, eps); if (running_mean.defined()) { - running_mean_a[f] = momentum * _mean_a[f] + (1 - momentum) * running_mean_a[f]; + running_mean_a[f] = momentum_ * _mean_a[f] + (1 - momentum_) * running_mean_a[f]; } if (running_var.defined()) { - accscalar_t unbiased_var = _var_sum_a[f] / (n - 1); - running_var_a[f] = momentum * unbiased_var + (1 - momentum) * running_var_a[f]; + accscalar_t unbiased_var = _var_sum_a[f] / (n - 1); + running_var_a[f] = momentum_ * unbiased_var + (1 - momentum_) * running_var_a[f]; } } }); @@ -523,7 +524,7 @@ std::tuple _batch_norm_impl_index( && cudnn_enabled ); - if (use_miopen) { + if (use_miopen && input.suggest_memory_format() != MemoryFormat::ChannelsLast && input.suggest_memory_format() != MemoryFormat::ChannelsLast3d) { return std::tuple_cat( at::miopen_batch_norm( input.contiguous(), weight.contiguous(), bias.contiguous(), diff --git a/aten/src/ATen/native/Onehot.cpp b/aten/src/ATen/native/Onehot.cpp index 7455e27a1701e..a0c061062174b 100644 --- a/aten/src/ATen/native/Onehot.cpp +++ b/aten/src/ATen/native/Onehot.cpp @@ -23,14 +23,14 @@ Tensor one_hot(const Tensor &self, int64_t num_classes) { } // non-empty 
tensor - if (self.device().type() != at::kCUDA) { + if (self.device().type() != at::kCUDA && self.device().type() != at::kMPS) { //for cuda, rely on device assert thrown by scatter TORCH_CHECK(self.min().item().toLong() >= 0, "Class values must be non-negative."); } if (num_classes == -1) { num_classes = self.max().item().toLong() + 1; } else { - if (self.device().type() != at::kCUDA) { + if (self.device().type() != at::kCUDA && self.device().type() != at::kMPS) { //rely on device asserts from scatter to avoid sync here TORCH_CHECK(num_classes > self.max().item().toLong(), "Class values must be smaller than num_classes."); } else { diff --git a/aten/src/ATen/native/Pool.h b/aten/src/ATen/native/Pool.h index 0f3885524a79a..1106c5db0134f 100644 --- a/aten/src/ATen/native/Pool.h +++ b/aten/src/ATen/native/Pool.h @@ -58,6 +58,11 @@ template static inline T pooling_output_shape( T inputSize, T kernelSize, T pad, T stride, T dilation, bool ceil_mode) { TORCH_CHECK(stride != 0, "stride should not be zero"); + TORCH_CHECK(pad >= 0, "pad must be non-negative, but got pad: ", pad); + TORCH_CHECK(pad <= kernelSize / 2, "pad should be at most half of kernel size, but got pad=", pad, " and kernel_size=", kernelSize) return pooling_output_shape_pad_lr( inputSize, kernelSize, pad, pad, stride, dilation, ceil_mode); } diff --git a/aten/src/ATen/native/README.md b/aten/src/ATen/native/README.md index 043e93e332a69..cfce94a36c0e4 100644 --- a/aten/src/ATen/native/README.md +++ b/aten/src/ATen/native/README.md @@ -476,6 +476,28 @@ as `Tensor &`, which 1) allowed changing which `TensorImpl` the `Tensor` itself was not necessary to allow the underlying data to change. (This was like using `T * const` when we wanted `const T*`.) +### `autogen` + +``` +- func: my_op_(Tensor(a!) self) -> Tensor(a!) +... + autogen: my_op, my_op.out +``` + +The `autogen` keyword is used to specify which native functions the codegen system should generate +implementations for. +* For an in-place variant of a native function (op name ends with an `_`), we will generate a functional +variant and an out= variant. +* If a functional variant is given, we generate an out= variant. +* We don't support `autogen` for view ops, ops that bypass the dispatcher, or composite ops. + +We also generate kernels for generated ops, which merely copy and return the result from the base ops. +These generated kernels can be found in `/aten/src/ATen/CompositeViewCopyKernels.cpp`. + +Also note that new operators added to `native_functions.yaml` that satisfy the requirements +mentioned above should include the `autogen` keyword, since functionalization depends on it. We will +enforce this in codegen.
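To make the "copy and return" behaviour of the generated kernels concrete, here is a rough, hand-written sketch of what an autogen'd out= kernel for the hypothetical `my_op` above boils down to. The real code is emitted by the codegen into the file mentioned above and may differ in details; `my_op_out_sketch` is an illustrative name, and `clone` stands in for calling the base op.

```
// Editorial sketch only: the general shape of a generated out= kernel.
#include <ATen/ATen.h>

at::Tensor& my_op_out_sketch(const at::Tensor& self, at::Tensor& out) {
  at::Tensor result = self.clone();  // stand-in for calling the base op my_op(self)
  out.resize_(result.sizes());       // give the user-supplied tensor the right shape
  out.copy_(result);                 // copy the base op's result into it
  return out;
}
```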
+ ## Writing an implementation in C++ diff --git a/aten/src/ATen/native/RangeFactories.cpp b/aten/src/ATen/native/RangeFactories.cpp index b4eff5ed9e21f..038da93456edb 100644 --- a/aten/src/ATen/native/RangeFactories.cpp +++ b/aten/src/ATen/native/RangeFactories.cpp @@ -142,6 +142,10 @@ Tensor& range_out(const Scalar& start, const Scalar& end, const Scalar& step, Te return result; } +Tensor& range_out_no_step(const Scalar& start, const Scalar& end, Tensor& result) { + return range_out(start, end, /*step = */ 1, result); +} + Tensor& arange_out(const Scalar& start, const Scalar& end, const Scalar& step, Tensor& result) { AT_DISPATCH_ALL_TYPES_AND2(kHalf, kBFloat16, result.scalar_type(), "arange_cpu", [&]() { using accscalar_t = at::acc_type; diff --git a/aten/src/ATen/native/ReduceAllOps.cpp b/aten/src/ATen/native/ReduceAllOps.cpp index 43538c2347627..31764734b67ab 100644 --- a/aten/src/ATen/native/ReduceAllOps.cpp +++ b/aten/src/ATen/native/ReduceAllOps.cpp @@ -1,4 +1,5 @@ #include +#include #include #include @@ -17,6 +18,13 @@ Tensor min(const Tensor &self) { return result; } +Tensor& min_unary_out(const Tensor &self, Tensor& out) { + Tensor tmp_output = at::min(self); + at::native::resize_output(out, tmp_output.sizes()); + out.copy_(tmp_output); + return out; +} + Tensor max(const Tensor &self) { TORCH_CHECK(self.numel() > 0, "max(): Expected reduction dim to be specified for input.numel() == 0. Specify the reduction dim with the 'dim' argument."); @@ -25,6 +33,13 @@ Tensor max(const Tensor &self) { return result; } +Tensor& max_unary_out(const Tensor &self, Tensor& out) { + Tensor tmp_output = at::max(self); + at::native::resize_output(out, tmp_output.sizes()); + out.copy_(tmp_output); + return out; +} + // DEPRECATED: Use at::aminmax instead std::tuple _aminmax_all(const Tensor &self) { TORCH_WARN_ONCE("_aminmax is deprecated as of PyTorch 1.11 and will be removed in a future release. Use aminmax instead." 
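The unary out= reductions added in ReduceAllOps.cpp above follow a simple compute-then-copy pattern. A minimal standalone sketch of the same idea, using plain `resize_` where the patch uses `at::native::resize_output`:

```
// Editorial sketch: the max_unary_out pattern, written as a standalone program.
#include <ATen/ATen.h>
#include <iostream>

int main() {
  at::Tensor x = at::randn({4, 5});
  at::Tensor out = at::empty({0}, x.options());  // destination, resized below
  at::Tensor tmp = at::max(x);                   // full reduction -> 0-dim tensor
  out.resize_(tmp.sizes());                      // match the 0-dim result shape
  out.copy_(tmp);                                // copy into the caller's tensor
  std::cout << "max = " << out << "\n";
  return 0;
}
```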
diff --git a/aten/src/ATen/native/ReduceOps.cpp b/aten/src/ATen/native/ReduceOps.cpp index 52ddcd83774ff..803892a781a4b 100644 --- a/aten/src/ATen/native/ReduceOps.cpp +++ b/aten/src/ATen/native/ReduceOps.cpp @@ -1079,16 +1079,12 @@ Tensor sum(const Tensor& self, DimnameList dim, bool keepdim, c10::optional opt_dtype) { - return at::sum(input_t, c10::asIntArrayRefSlow(dim), keepdim, opt_dtype); -} - Tensor& sum_out(const Tensor& self, DimnameList dim, bool keepdim, optional opt_dtype, Tensor& result) { return at::sum_out(result, self, dimnames_to_positions(self, dim), keepdim, opt_dtype); } -Tensor& nansum_out(const Tensor& self, IntArrayRef dim, +Tensor& nansum_out(const Tensor& self, at::OptionalIntArrayRef dim, bool keepdim, optional opt_dtype, Tensor& result) { TORCH_CHECK(!c10::isComplexType(self.scalar_type()), "nansum does not support complex inputs"); // For integral types, use existing sum as @@ -1107,7 +1103,7 @@ Tensor& nansum_out(const Tensor& self, IntArrayRef dim, return result; } -Tensor nansum(const Tensor& self, IntArrayRef dim, bool keepdim, c10::optional opt_dtype) { +Tensor nansum(const Tensor& self, at::OptionalIntArrayRef dim, bool keepdim, c10::optional opt_dtype) { ScalarType dtype = get_dtype_from_self(self, opt_dtype, true); Tensor result = create_reduction_result(self, dim, keepdim, dtype); return at::native::nansum_out(self, dim, keepdim, dtype, result); @@ -1239,7 +1235,7 @@ Tensor& mean_out(const Tensor& self, DimnameList dim, // TODO(@heitorschueroff) implement custom kernels for nanmean Tensor& nanmean_out( const Tensor& self, - IntArrayRef dim, + at::OptionalIntArrayRef dim, bool keepdim, c10::optional opt_dtype, Tensor& result) { @@ -1254,7 +1250,7 @@ Tensor& nanmean_out( Tensor nanmean( const Tensor& self, - IntArrayRef dim, + at::OptionalIntArrayRef dim, bool keepdim, optional opt_dtype) { TORCH_CHECK( @@ -1603,7 +1599,7 @@ static Tensor& std_var_out( if (at::isComplexType(self.scalar_type())) { // For complex, calculate variance of real and imaginary components - // seperately then add to get overall variance. + // separately then add to get overall variance. ScalarType dtype = c10::toRealValueType(get_dtype_from_result(result, {})); Tensor real_in = at::real(self); Tensor real_out = at::empty({0}, self.options().dtype(dtype)); @@ -1674,7 +1670,7 @@ static std::tuple std_var_mean_out( result1.scalar_type(), " and ", result2.scalar_type(), "."); if (at::isComplexType(self.scalar_type())) { - // For complex, calculate for real and imaginary components seperately then combine as: + // For complex, calculate for real and imaginary components separately then combine as: // variance = var_real + var_imag // mean = mean_real + j * mean_imag ScalarType dtype = c10::toRealValueType(get_dtype_from_result(result1, {})); @@ -1729,25 +1725,31 @@ static std::tuple std_var_mean_out( } std::tuple var_mean( - const Tensor& self, IntArrayRef dim, bool unbiased, bool keepdim) { - return at::var_mean(self, /*dim=*/at::OptionalIntArrayRef(dim), - /*correction=*/int64_t{unbiased ? 1 : 0}, keepdim); + const Tensor& self, at::OptionalIntArrayRef dim, bool unbiased, bool keepdim) { + return at::var_mean( + self, /*dim=*/at::OptionalIntArrayRef(dim), + /*correction=*/c10::make_optional({unbiased ? 1 : 0}), + keepdim); } std::tuple std_mean( - const Tensor& self, IntArrayRef dim, bool unbiased, bool keepdim) { - return at::std_mean(self, /*dim=*/at::OptionalIntArrayRef(dim), - /*correction=*/int64_t{unbiased ? 
1 : 0}, keepdim); + const Tensor& self, at::OptionalIntArrayRef dim, bool unbiased, bool keepdim) { + return at::std_mean( + self, /*dim=*/at::OptionalIntArrayRef(dim), + /*correction=*/c10::make_optional({unbiased ? 1 : 0}), + keepdim); } std::tuple std_mean(const Tensor& self, bool unbiased) { return at::std_mean( - self, /*dim=*/c10::nullopt, /*correction=*/int64_t{unbiased ? 1 : 0}); + self, /*dim=*/c10::nullopt, + /*correction=*/c10::make_optional({unbiased ? 1 : 0})); } std::tuple var_mean(const Tensor& self, bool unbiased) { return at::var_mean( - self, /*dim=*/c10::nullopt, /*correction=*/int64_t{unbiased ? 1 : 0}); + self, /*dim=*/c10::nullopt, + /*correction=*/c10::make_optional({unbiased ? 1 : 0})); } std::tuple var_mean_out( @@ -1782,32 +1784,37 @@ std::tuple std_mean( Tensor var(const Tensor& self, bool unbiased) { return at::var( - self, /*dim=*/c10::nullopt, /*correction=*/int64_t{unbiased ? 1 : 0}); + self, /*dim=*/c10::nullopt, + /*correction=*/c10::make_optional({unbiased ? 1 : 0})); } -Tensor var(const Tensor& self, IntArrayRef dim, bool unbiased, bool keepdim) { - return at::var(self, /*dim=*/at::OptionalIntArrayRef(dim), - /*correction=*/int64_t{unbiased ? 1 : 0}, keepdim); +Tensor var(const Tensor& self, at::OptionalIntArrayRef dim, bool unbiased, bool keepdim) { + return at::var( + self, /*dim=*/at::OptionalIntArrayRef(dim), + /*correction=*/c10::make_optional({unbiased ? 1 : 0}), + keepdim); } -Tensor& var_out(const Tensor& self, IntArrayRef dim, bool unbiased, bool keepdim, Tensor& result) { - return at::var_out(result, self, /*dim=*/at::OptionalIntArrayRef(dim), - /*correction=*/int64_t{unbiased ? 1 : 0}, keepdim); +Tensor& var_out(const Tensor& self, at::OptionalIntArrayRef dim, bool unbiased, bool keepdim, Tensor& result) { + return at::var_out( + result, self, /*dim=*/at::OptionalIntArrayRef(dim), + /*correction=*/c10::make_optional({unbiased ? 1 : 0}), + keepdim); } Tensor std(const Tensor& self, bool unbiased) { return at::std( - self, /*dim=*/c10::nullopt, /*correction=*/int64_t{unbiased ? 1 : 0}); + self, /*dim=*/c10::nullopt, /*correction=*/c10::make_optional({unbiased ? 1 : 0})); } -Tensor std(const Tensor& self, IntArrayRef dim, bool unbiased, bool keepdim) { - return at::std(self, /*dim=*/at::OptionalIntArrayRef(dim), - /*correction=*/int64_t{unbiased ? 1 : 0}, keepdim); +Tensor std(const Tensor& self, at::OptionalIntArrayRef dim, bool unbiased, bool keepdim) { + return at::std(self, dim, + /*correction=*/c10::make_optional({unbiased ? 1 : 0}), keepdim); } -Tensor& std_out(const Tensor& self, IntArrayRef dim, bool unbiased, bool keepdim, Tensor& result) { - return at::std_out(result, self, /*dim=*/at::OptionalIntArrayRef(dim), - /*correction=*/int64_t{unbiased ? 1 : 0}, keepdim); +Tensor& std_out(const Tensor& self, at::OptionalIntArrayRef opt_dim, bool unbiased, bool keepdim, Tensor& result) { + return at::std_out(result, self, opt_dim, + /*correction=*/c10::make_optional({unbiased ? 1 : 0}), keepdim); } Tensor std(const Tensor& self, at::OptionalIntArrayRef dim, @@ -1966,9 +1973,6 @@ bool cpu_equal(const Tensor& self, const Tensor& other) { at::NoNamesGuard guard; TORCH_CHECK(self.device() == other.device(), "Cannot compare two tensors on " "different devices. 
Got: ", self.device(), " and ", other.device()); - TORCH_CHECK(self.dtype() == other.dtype(), - "Expected object of scalar type ", self.dtype(), " but got scalar type ", - other.dtype(), " for argument 'other'"); if (!self.is_same_size(other)) { return false; } diff --git a/aten/src/ATen/native/ReduceOpsUtils.h b/aten/src/ATen/native/ReduceOpsUtils.h index 7c73c85d4c2ff..9db9802ea788b 100644 --- a/aten/src/ATen/native/ReduceOpsUtils.h +++ b/aten/src/ATen/native/ReduceOpsUtils.h @@ -159,7 +159,7 @@ static void resize_reduction_result( } inline Tensor create_reduction_result( - const Tensor& self, IntArrayRef dim, bool keepdim, ScalarType dtype + const Tensor& self, at::OptionalIntArrayRef dim, bool keepdim, ScalarType dtype ) { DimMask mask = make_dim_mask(dim, self.dim()); auto shape = shape_from_dim_mask(self, mask, keepdim); diff --git a/aten/src/ATen/native/SoftMax.cpp b/aten/src/ATen/native/SoftMax.cpp index 43c7874e43722..21a94d5ed9235 100644 --- a/aten/src/ATen/native/SoftMax.cpp +++ b/aten/src/ATen/native/SoftMax.cpp @@ -131,7 +131,18 @@ void host_softmax( Tensor output, const Tensor& input, const int64_t dim, - bool* mask = nullptr) { + bool* mask = nullptr, + const c10::optional mask_type_ = NULL) { + + if (MaskedSoftMax) { + TORCH_CHECK(mask_type_.has_value(), "Mask Type should be defined"); + int64_t mask_type = mask_type_.value(); + TORCH_CHECK((mask_type == 0) || (mask_type == 1), "Mask Type should be 0 (src_mask) or 1 (src_key_padding_mask)"); + + // TODO: Add support for TxT src_mask + TORCH_CHECK(mask_type != 0, "src_mask not currently supported on CPU"); + } + int64_t outer_size = 1; int64_t dim_size = input.size(dim); int64_t inner_size = 1; @@ -541,7 +552,7 @@ Tensor log_softmax(const Tensor& self, Dimname dim, optional dtype) return at::log_softmax(self, dimname_to_position(self, dim), dtype); } -Tensor masked_softmax_cpu(const Tensor& input_, const Tensor& mask_, const c10::optional dim_) { +Tensor masked_softmax_cpu(const Tensor& input_, const Tensor& mask_, const c10::optional dim_, const c10::optional mask_type_) { TORCH_CHECK( input_.sizes() == mask_.sizes(), "Mask shape should match input shape"); TORCH_CHECK( @@ -564,7 +575,7 @@ Tensor masked_softmax_cpu(const Tensor& input_, const Tensor& mask_, const c10:: scalar_t, false /* LogSoftMax */, true /* MaskedSoftMax */>( - output, input, dim, mask.data_ptr()); + output, input, dim, mask.data_ptr(), mask_type_); }); return output; } diff --git a/aten/src/ATen/native/Sorting.cpp b/aten/src/ATen/native/Sorting.cpp index 18820973fd847..66b9daf7fad8c 100644 --- a/aten/src/ATen/native/Sorting.cpp +++ b/aten/src/ATen/native/Sorting.cpp @@ -226,9 +226,9 @@ Tensor quantile_compute( // NOTE: this check is only performed when running on the CPU to avoid // synchronizing an accelerator with the CPU if (self.device().is_cpu()) { - TORCH_CHECK( - q.ge(0).logical_and_(q.le(1)).all().item(), - "quantile() q values must be in the range [0, 1]"); + auto all_q_in_range = q.ge(0).logical_and_(q.le(1)).all(); + TORCH_CHECK(at::is_scalar_tensor_true(all_q_in_range), + "quantile() q values must be in the range [0, 1]"); } // Flatten input if no dim provided else move dim to reduce as last dimension. 
diff --git a/aten/src/ATen/native/SpectralOps.cpp b/aten/src/ATen/native/SpectralOps.cpp index d6389608a9e36..c2e5bda454ea4 100644 --- a/aten/src/ATen/native/SpectralOps.cpp +++ b/aten/src/ATen/native/SpectralOps.cpp @@ -1,5 +1,6 @@ #include #include +#include #include #include #include @@ -1100,8 +1101,8 @@ Tensor istft(const Tensor& self, const int64_t n_fft, const optional ho y = y.slice(2, start, end, 1); window_envelop = window_envelop.slice(2, start, end, 1); - const auto window_envelop_lowest = window_envelop.abs().min().item().toDouble(); - if (window_envelop_lowest < 1e-11) { + const auto window_envelop_lowest = window_envelop.abs().min().lt(1e-11); + if (at::is_scalar_tensor_true(window_envelop_lowest)) { std::ostringstream ss; REPR(ss) << "window overlap add min: " << window_envelop_lowest; AT_ERROR(ss.str()); @@ -1121,7 +1122,7 @@ Tensor istft(const Tensor& self, const int64_t n_fft, const optional ho } return y; - #undef REPR +#undef REPR } Tensor istft(const Tensor& self, const int64_t n_fft, const optional hop_lengthOpt, diff --git a/aten/src/ATen/native/TensorAdvancedIndexing.cpp b/aten/src/ATen/native/TensorAdvancedIndexing.cpp index 951d9eeb18fa3..647955ad4dfc7 100644 --- a/aten/src/ATen/native/TensorAdvancedIndexing.cpp +++ b/aten/src/ATen/native/TensorAdvancedIndexing.cpp @@ -521,9 +521,9 @@ AdvancedIndex::AdvancedIndex(const Tensor& src, TensorList indices_list) } } - // For CUDA tensors, force all index tensors to have the same striding to - // simplify the CUDA kernel. - if (indices.size() >= 2 && this->src.device().type() == kCUDA) { + // For CUDA/MPS tensors, force all index tensors to have the same striding to + // simplify the CUDA/MPS kernel. + if (indices.size() >= 2 && (this->src.device().type() == kCUDA || this->src.device().type() == kMPS)) { if (!all_strides_match(indices)) { for (auto & indice : indices) { indice = indice.contiguous(); diff --git a/aten/src/ATen/native/TensorConversions.cpp b/aten/src/ATen/native/TensorConversions.cpp index 02ccd133c7ee0..819516f673971 100644 --- a/aten/src/ATen/native/TensorConversions.cpp +++ b/aten/src/ATen/native/TensorConversions.cpp @@ -400,10 +400,10 @@ Tensor sparse_compressed_to_dense( if (self.dim() > 3) { // Flatten batch dims auto n_batch_dim = self.dim() - 2; - crow_indices = crow_indices.flatten(0, n_batch_dim); - col_indices = col_indices.flatten(0, n_batch_dim); - values = values.flatten(0, n_batch_dim); - dense = dense.flatten(0, n_batch_dim); + crow_indices = crow_indices.flatten(0, n_batch_dim - 1); + col_indices = col_indices.flatten(0, n_batch_dim - 1); + values = values.flatten(0, n_batch_dim - 1); + dense = dense.flatten(0, n_batch_dim - 1); } // At this point everything has 3d shape either the batch dim was inserted, @@ -427,8 +427,8 @@ Tensor sparse_compressed_to_dense( dense[batch].index_add_(0, offsets, values[batch]); } - // untile the result, NOTE: The final reshape uses the original self.sizes() - // which will squeeze out the extra batch dim if we put one in + // un-tile the result, NOTE: The final reshape uses the original + // self.sizes() which will squeeze out the extra batch dim if we put one in return dense .unflatten( 1, {self.size(-2) / blocksize[0], self.size(-1) / blocksize[1]}) diff --git a/aten/src/ATen/native/TensorFactories.cpp b/aten/src/ATen/native/TensorFactories.cpp index 7d112b9f415d4..c9cc522e06b83 100644 --- a/aten/src/ATen/native/TensorFactories.cpp +++ b/aten/src/ATen/native/TensorFactories.cpp @@ -101,8 +101,12 @@ Tensor arange( return at::arange_out(result, start, 
end, step); } +Tensor& arange_start_out(const Scalar& start, const Scalar& end, Tensor& result) { + return at::arange_out(result, start, end, /*step=*/1); +} + Tensor& arange_out(const Scalar& end, Tensor& result) { - return at::arange_out(result, /*start=*/0, end); + return at::arange_out(result, /*start=*/0, end, /*step=*/1); } Tensor& arange_out(Tensor& result, const Scalar& start, const Scalar& end) { @@ -1019,12 +1023,12 @@ Tensor tril_indices_cpu( // // 3. sequential RAM + transpose: create an n X 2 Tensor, fill the Tensor // sequentially, and then transpose it. - AT_DISPATCH_ALL_TYPES_AND(kBFloat16, result.scalar_type(), "tril_indices", [&]() -> void { + AT_DISPATCH_INDEX_TYPES(result.scalar_type(), "tril_indices", [&]() -> void { // fill the Tensor with correct values - scalar_t* result_data = result.data_ptr(); + index_t* result_data = result.data_ptr(); int64_t i = 0; - scalar_t r = std::max(0, -offset), c = 0; + index_t r = std::max(0, -offset), c = 0; while (i < tril_size) { result_data[i] = r; result_data[tril_size + i++] = c; @@ -1057,14 +1061,14 @@ Tensor triu_indices_cpu( // create an empty Tensor with correct size auto result = at::native::empty_cpu({2, triu_size}, dtype_opt, layout_opt, device_opt, pin_memory_opt); - AT_DISPATCH_ALL_TYPES_AND(kBFloat16, result.scalar_type(), "triu_indices", [&]() -> void { + AT_DISPATCH_INDEX_TYPES(result.scalar_type(), "triu_indices", [&]() -> void { // fill the Tensor with correct values - scalar_t* result_data = result.data_ptr(); + index_t* result_data = result.data_ptr(); int64_t i = 0; // not typing std::max with scalar_t as it could be an unsigned type // NOTE: no need to check if the returned value of std::max overflows - // scalar_t, as i and triu_size act as a guard. - scalar_t c = std::max(0, offset), r = 0; + // index_t, as i and triu_size act as a guard. 
+ index_t c = std::max(0, offset), r = 0; while (i < triu_size) { result_data[i] = r; result_data[triu_size + i++] = c; @@ -1091,19 +1095,19 @@ Tensor zeros(IntArrayRef size, c10::optional layout, c10::optional device, c10::optional pin_memory) { - // See [Note: hacky wrapper removal for TensorOptions] - TensorOptions options = TensorOptions().dtype(dtype).layout(layout).device(device).pinned_memory(pin_memory); - - auto result = at::empty(size, options); - return result.zero_(); + return at::zeros_symint(c10::SymIntArrayRef::fromIntArrayRef(size), dtype, layout, device, pin_memory); } -Tensor zeros_symint(c10::SymIntArrayRef size, +Tensor zeros_symint(SymIntArrayRef size, c10::optional dtype, c10::optional layout, c10::optional device, c10::optional pin_memory) { - return at::zeros(asIntArrayRefSlow(size), dtype, layout, device, pin_memory); + // See [Note: hacky wrapper removal for TensorOptions] + TensorOptions options = TensorOptions().dtype(dtype).layout(layout).device(device).pinned_memory(pin_memory); + + auto result = at::empty_symint(size, options); + return result.zero_(); } Tensor _efficientzerotensor(IntArrayRef size, @@ -1143,7 +1147,7 @@ Tensor zeros_like( TORCH_CHECK( !(optional_memory_format.has_value()), "memory format option is only supported by strided tensors"); - auto res = at::empty({0}, options); // to be resized + auto res = at::empty({0}, self.options().merge_in(options)); // to be resized if (self.is_sparse()) { res.sparse_resize_and_clear_( @@ -1186,11 +1190,12 @@ Tensor bartlett_window(int64_t window_length, Tensor bartlett_window( int64_t window_length, bool periodic, - c10::optional dtype, + c10::optional dtype_opt, c10::optional layout, c10::optional device, c10::optional pin_memory) { // See [Note: hacky wrapper removal for TensorOptions] + ScalarType dtype = c10::dtype_or_default(dtype_opt); TensorOptions options = TensorOptions().dtype(dtype).layout(layout).device(device).pinned_memory(pin_memory); window_function_checks("bartlett_window", options, window_length); @@ -1224,11 +1229,12 @@ Tensor blackman_window(int64_t window_length, Tensor blackman_window( int64_t window_length, bool periodic, - c10::optional dtype, + c10::optional dtype_opt, c10::optional layout, c10::optional device, c10::optional pin_memory) { // See [Note: hacky wrapper removal for TensorOptions] + ScalarType dtype = c10::dtype_or_default(dtype_opt); TensorOptions options = TensorOptions().dtype(dtype).layout(layout).device(device).pinned_memory(pin_memory); window_function_checks("blackman_window", options, window_length); @@ -1294,11 +1300,12 @@ Tensor hamming_window( bool periodic, double alpha, double beta, - c10::optional dtype, + c10::optional dtype_opt, c10::optional layout, c10::optional device, c10::optional pin_memory) { // See [Note: hacky wrapper removal for TensorOptions] + ScalarType dtype = c10::dtype_or_default(dtype_opt); TensorOptions options = TensorOptions().dtype(dtype).layout(layout).device(device).pinned_memory(pin_memory); window_function_checks("hamming_window", options, window_length); @@ -1370,11 +1377,12 @@ Tensor kaiser_window( int64_t window_length, bool periodic, double beta, - c10::optional dtype, + c10::optional dtype_opt, c10::optional layout, c10::optional device, c10::optional pin_memory) { // See [Note: hacky wrapper removal for TensorOptions] + ScalarType dtype = c10::dtype_or_default(dtype_opt); TensorOptions options = TensorOptions().dtype(dtype).layout(layout).device(device).pinned_memory(pin_memory); window_function_checks("kaiser_window", 
options, window_length); diff --git a/aten/src/ATen/native/TensorShape.cpp b/aten/src/ATen/native/TensorShape.cpp index bbecb346ce3ef..85fb9fb627efb 100644 --- a/aten/src/ATen/native/TensorShape.cpp +++ b/aten/src/ATen/native/TensorShape.cpp @@ -123,11 +123,11 @@ TORCH_PRECOMPUTE_META_FUNC(cat)(ITensorListRef tensors, int64_t dim) { size_t size_at_dim = 0; for (const auto i : c10::irange(materialized.size())) { const Tensor& t = materialized[i]; + all_same_dtype = all_same_dtype && out_dtype == t.scalar_type(); if (!at::native::cat_should_skip_tensor(t)) { at::native::check_cat_shape_except_dim(materialized[valid], t, dim, i); size_at_dim += t.size(dim); all_contiguous = all_contiguous && t.is_contiguous(memory_format); - all_same_dtype = all_same_dtype && out_dtype == t.scalar_type(); all_same_sizes_and_stride = all_same_sizes_and_stride && t.sizes() == materialized[valid].get().sizes() && t.strides() == materialized[valid].get().strides(); @@ -2188,17 +2188,18 @@ std::vector dsplit(const Tensor& self, int64_t split_size) { std::vector split_with_sizes(const Tensor& self, IntArrayRef split_sizes, int64_t dim) { TORCH_CHECK(self.dim() != 0, "split expects at least a 1-dimensional tensor"); - int64_t dim_size = self.size(dim); - int64_t num_splits = split_sizes.size(); - std::vector splits(num_splits); + const int64_t dim_size = self.size(dim); + const int64_t num_splits = split_sizes.size(); int64_t start_idx = 0; + std::vector splits; + splits.reserve(num_splits); for (const auto i : c10::irange(num_splits)) { auto length = split_sizes[i]; TORCH_CHECK(length >= 0, "split_with_sizes expects split_sizes have only non-negative ", "entries, but got split_sizes=", split_sizes); - splits[i] = self.narrow(dim, start_idx, length); + splits.push_back(at::native::slice(self, dim, start_idx, start_idx + length, 1)); start_idx += length; } TORCH_CHECK(start_idx == dim_size, @@ -3249,7 +3250,7 @@ Tensor diagonal_backward(const Tensor & grad, IntArrayRef input_sizes, int64_t o Tensor movedim(const Tensor& self, IntArrayRef src, IntArrayRef dst) { TORCH_CHECK(src.size() == dst.size(), "movedim: Invalid source or destination dims: source (", - src, " dims ) should contain the same number of dims as destination (", dst, " dims)"); + src, " dims) should contain the same number of dims as destination (", dst, " dims)"); size_t self_dim = self.dim(); DimVector normalized_src(src.size()); diff --git a/aten/src/ATen/native/TestOps.cpp b/aten/src/ATen/native/TestOps.cpp index 9a3a5b10cb269..a8c30f5c3ba61 100644 --- a/aten/src/ATen/native/TestOps.cpp +++ b/aten/src/ATen/native/TestOps.cpp @@ -1,6 +1,7 @@ // Copyright 2004-present Facebook. All Rights Reserved. #include +#include #include #include @@ -74,5 +75,33 @@ Tensor _test_warn_in_autograd(const Tensor &self) { return self.clone(); } +// Test registration of per-dispatch-key derivatives in derivatives.yaml. +// See derivatives.yaml for dummy registrations. 
+ +Tensor _test_autograd_multiple_dispatch_fullcoverage(const Tensor &self) { + return self.clone(); +} + +Tensor _test_autograd_multiple_dispatch_ntonly(const Tensor &self, bool b) { + return self.clone(); +} + +// Test derivative dispatch registration for view_copy ops +Tensor _test_autograd_multiple_dispatch_view(const Tensor &self) { + return self.view(-1); +} + } // namespace native + +namespace functionalization { + +// view_copy ops must have a functional inverse registered +Tensor FunctionalInverses::_test_autograd_multiple_dispatch_view_copy_inverse(const at::Tensor& base, const at::Tensor& mutated_view, bool reapply_views) { + TORCH_INTERNAL_ASSERT(false, + "Attempted to call _test_autograd_multiple_dispatch_view_copy_inverse() during the functionalization pass. ", + "This function is for testing only and should never be called."); + return Tensor(); +} + +} // namespace functionalization } // namespace at diff --git a/aten/src/ATen/native/UpSample.h b/aten/src/ATen/native/UpSample.h index 6b248352de6ad..f3dd836444d13 100644 --- a/aten/src/ATen/native/UpSample.h +++ b/aten/src/ATen/native/UpSample.h @@ -2,11 +2,11 @@ #include -#include +#include #include +#include #include - /** * Note [compute_scales_value] * Note [area_pixel_compute_scale] @@ -288,7 +288,8 @@ static inline scalar_t area_pixel_compute_source_index( if (align_corners) { return scale * dst_index; } else { - scalar_t src_idx = scale * (dst_index + 0.5) - 0.5; + scalar_t src_idx = scale * (dst_index + static_cast(0.5)) - + static_cast(0.5); // [Note] Follow Opencv resize logic: // We allow negative src_idx here and later will use // dx = src_idx - floorf(src_idx) @@ -301,7 +302,8 @@ static inline scalar_t area_pixel_compute_source_index( // where we should and then remove this cubic flag. // This matters in cubic mode, as we might need [-1, 0, 1, 2] // to interpolate and the weights can be affected. - return (!cubic && src_idx < 0) ? scalar_t(0) : src_idx; + return (!cubic && src_idx < static_cast(0)) ? scalar_t(0) + : src_idx; } } @@ -445,8 +447,10 @@ static inline void compute_source_index_and_lambda( lambda0 = static_cast(1); lambda1 = static_cast(0); } else { - const scalar_t real_input_index = area_pixel_compute_source_index( - ratio, output_index, align_corners, /*cubic=*/false); + using accscalar_t = at::acc_type; + const accscalar_t real_input_index = + area_pixel_compute_source_index( + ratio, output_index, align_corners, /*cubic=*/false); input_index0 = static_cast(real_input_index); int64_t offset = (input_index0 < input_size - 1) ? 1 : 0; input_index1 = input_index0 + offset; diff --git a/aten/src/ATen/native/ao_sparse/quantized/cpu/qnnpack_utils.h b/aten/src/ATen/native/ao_sparse/quantized/cpu/qnnpack_utils.h index 098b862297fd5..6ac89681899c5 100644 --- a/aten/src/ATen/native/ao_sparse/quantized/cpu/qnnpack_utils.h +++ b/aten/src/ATen/native/ao_sparse/quantized/cpu/qnnpack_utils.h @@ -19,7 +19,7 @@ struct TORCH_API PackedLinearWeightQnnp PackedLinearWeightQnnp(const at::Tensor& weight, const c10::optional& bias, const int64_t out_features_block_size /* block sparsity size across output_features */, const int64_t in_features_block_size /* block sparsity size across input_features */); explicit PackedLinearWeightQnnp(const BCSRSerializationType& serialized); c10::optional orig_bias_; - // Seperate copy of bias exist so that we can fill in zeros when + // Separate copy of bias exist so that we can fill in zeros when // optional bias does not exist. 
This is to compy with qnnpack operator that // expects bias to be present. // In case bias is present bias_ is just a reference to orig_bias_ diff --git a/aten/src/ATen/native/cpu/CopyKernel.cpp b/aten/src/ATen/native/cpu/CopyKernel.cpp index de1841d989c3b..27df65c7b0485 100644 --- a/aten/src/ATen/native/cpu/CopyKernel.cpp +++ b/aten/src/ATen/native/cpu/CopyKernel.cpp @@ -13,9 +13,6 @@ namespace native { inline namespace CPU_CAPABILITY { void neg_kernel(TensorIteratorBase &iter); void conj_kernel(TensorIteratorBase &iter); -} // namespace CPU_CAPABILITY - -namespace { void float_bfloat16_copy_kernel(TensorIteratorBase &iter, bool requires_neg) { auto strides_out = iter.strides(0); @@ -246,22 +243,20 @@ void copy_kernel(TensorIterator& iter, bool /*non_blocking*/) { AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND4(ScalarType::ComplexHalf, ScalarType::Half, ScalarType::Bool, ScalarType::BFloat16, dtype, "copy_", [&] { using dest_t = scalar_t; AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND4(ScalarType::ComplexHalf, ScalarType::Half, ScalarType::Bool, ScalarType::BFloat16, iter.dtype(1), "copy_", [&] { - // Note (@zasdfgbnm): - // - // The code below can not be simplified as - // cpu_kernel(iter, c10::static_cast_with_inter_type::apply); - // - // because this would force the compiler to instantiate the inline function and generate a function call in the loop - // instead of inlining it, making all the optimizations like vectorization impossible. - // You can verify this by looking the the symbols of `libtorch_cpu.so`: - // - // readelf -Ws libtorch_cpu.so | grep static_cast_with_inter_type - // - // If done correctly, the above command should have no output. - // - // See: https://github.com/pytorch/pytorch/issues/31271 - cpu_kernel(iter, [](scalar_t src) -> dest_t { - return c10::static_cast_with_inter_type::apply(src); }); + if (iter.has_contiguous_first_dim()) { + TORCH_INTERNAL_ASSERT(iter.ninputs() == 1); + TORCH_INTERNAL_ASSERT(iter.noutputs() == 1); + + iter.for_each([](char **data, const int64_t *strides, int64_t size) { + auto src = reinterpret_cast(data[1]); + auto dst = reinterpret_cast(data[0]); + at::vec::convert(src, dst, size); + }); + } else { + cpu_kernel(iter, [](scalar_t x) -> dest_t { + return c10::convert(x); + }); + } }); }); @@ -274,7 +269,7 @@ void copy_kernel(TensorIterator& iter, bool /*non_blocking*/) { } } -} // anonymous namespace +} // namespace CPU_CAPABILITY REGISTER_DISPATCH(copy_stub, ©_kernel); diff --git a/aten/src/ATen/native/cpu/CopyKernel.h b/aten/src/ATen/native/cpu/CopyKernel.h new file mode 100644 index 0000000000000..9d2affd6101ab --- /dev/null +++ b/aten/src/ATen/native/cpu/CopyKernel.h @@ -0,0 +1,12 @@ +#pragma once + +namespace at { +struct TensorIteratorBase; + +namespace native { +inline namespace CPU_CAPABILITY { + +void direct_copy_kernel(TensorIteratorBase &iter); +void copy_kernel(TensorIterator& iter, bool /*non_blocking*/); + +}}} // namespace at::native::CPU_CAPABILITY diff --git a/aten/src/ATen/native/cpu/Loops.h b/aten/src/ATen/native/cpu/Loops.h index 4f64a64b51a4c..2558736ddc0fc 100644 --- a/aten/src/ATen/native/cpu/Loops.h +++ b/aten/src/ATen/native/cpu/Loops.h @@ -36,11 +36,6 @@ #include #include -#ifndef _MSC_VER -#pragma GCC diagnostic push -#pragma GCC diagnostic ignored "-Wunused-but-set-parameter" -#endif - namespace at { namespace native { inline namespace CPU_CAPABILITY { using namespace vec; @@ -398,7 +393,3 @@ void cpu_serial_kernel_vec(TensorIteratorBase& iter, func_t&& op, vec_func_t&& v } }}} // namespace at::native:: - -#ifndef _MSC_VER 
-#pragma GCC diagnostic pop -#endif diff --git a/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp b/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp index a53587e56da4b..8a0534fd3da5f 100644 --- a/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp +++ b/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp @@ -13,6 +13,7 @@ #include #include #include +#include #include #include #include @@ -203,13 +204,18 @@ static void angle_kernel(TensorIteratorBase& iter) { // NB: Ignores the negative bit on tensors void conj_kernel(TensorIteratorBase& iter) { - AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND4( - kBool, kBFloat16, kHalf, kComplexHalf, iter.common_dtype(), "conj_cpu", [&]() { - cpu_kernel_vec( - iter, - [=](scalar_t a) -> scalar_t { return conj_impl(a); }, - [=](Vectorized a) { return a.conj(); }); - }); + AT_DISPATCH_SWITCH(iter.common_dtype(), "conj_cpu", + AT_DISPATCH_CASE_ALL_TYPES_AND3(kBool, kBFloat16, kHalf, [&] { + // conj is a no-op for non-complex types + direct_copy_kernel(iter); + }) + AT_DISPATCH_CASE_COMPLEX_TYPES_AND(kComplexHalf, [&] { + cpu_kernel_vec( + iter, + [=](scalar_t a) -> scalar_t { return conj_impl(a); }, + [=](Vectorized a) { return a.conj(); }); + }) + ); } static void bitwise_not_kernel(TensorIteratorBase& iter) { diff --git a/aten/src/ATen/native/cpu/UpSampleKernel.cpp b/aten/src/ATen/native/cpu/UpSampleKernel.cpp index cfc9318623725..1c08fee0acac7 100644 --- a/aten/src/ATen/native/cpu/UpSampleKernel.cpp +++ b/aten/src/ATen/native/cpu/UpSampleKernel.cpp @@ -767,7 +767,6 @@ struct HelperInterpNearest : public HelperInterpBase { AT_DISPATCH_FLOATING_TYPES_AND( ScalarType::BFloat16, scalar_type, "compute_indices_weights_nearest", [&] { - scalar_t scale = area_pixel_compute_scale(input_size, output_size, align_corners, opt_scale); auto input_index_ptr = output[0].data_ptr(); @@ -778,10 +777,11 @@ struct HelperInterpNearest : public HelperInterpBase { // index_f32 = (output_index) * scale // input_index = floor(index_f32) // Same as OpenCV INTER_NEAREST - + using accscalar_t = at::acc_type; for (const auto i : c10::irange(output_size)) { - const scalar_t real_input_index = area_pixel_compute_source_index( - scale, i, /*align_corners=*/true, /*cubic=*/false); + const accscalar_t real_input_index = + area_pixel_compute_source_index( + scale, i, /*align_corners=*/true, /*cubic=*/false); input_index = static_cast(floorf(real_input_index)); input_index_ptr[i] = static_cast(std::min(input_index, input_size - 1)) * stride; } @@ -818,7 +818,6 @@ struct HelperInterpNearestExact : public HelperInterpNearest { AT_DISPATCH_FLOATING_TYPES( scalar_type, "compute_indices_weights_nearest", [&] { - scalar_t scale = area_pixel_compute_scale(input_size, output_size, align_corners, opt_scale); auto input_index_ptr = output[0].data_ptr(); @@ -829,10 +828,11 @@ struct HelperInterpNearestExact : public HelperInterpNearest { // index_f32 = (output_index + 0.5) * scale - 0.5 // input_index = round(index_f32) // Same as Pillow and Scikit-Image/Scipy ndi.zoom - + using accscalar_t = at::acc_type; for (const auto i : c10::irange(output_size)) { - const scalar_t real_input_index = area_pixel_compute_source_index( - scale, i, /*align_corners=*/align_corners, /*cubic=*/false); + const accscalar_t real_input_index = + area_pixel_compute_source_index( + scale, i, /*align_corners=*/align_corners, /*cubic=*/false); input_index = static_cast(floorf(real_input_index + 0.5)); input_index_ptr[i] = static_cast(std::min(input_index, input_size - 1)) * stride; } @@ -865,10 +865,8 @@ struct HelperInterpLinear : public HelperInterpBase { 
std::vector output; HelperInterpLinear::init_indices_weights( scalar_type, output, output_size, ndims, reshape_dim, HelperInterpLinear::interp_size); - AT_DISPATCH_FLOATING_TYPES_AND( ScalarType::BFloat16, scalar_type, "compute_indices_weights_linear", [&] { - scalar_t scale = area_pixel_compute_scale(input_size, output_size, align_corners, opt_scale); auto input_index0_ptr = output[0].data_ptr(); @@ -970,7 +968,6 @@ struct HelperInterpCubic : public HelperInterpBase { AT_DISPATCH_FLOATING_TYPES_AND( ScalarType::BFloat16, scalar_type, "compute_indices_weights_cubic", [&] { - scalar_t scale = area_pixel_compute_scale(input_size, output_size, align_corners, opt_scale); int64_t input_index; @@ -980,11 +977,11 @@ struct HelperInterpCubic : public HelperInterpBase { int64_t * idx_ptr; scalar_t * wt_ptr; - + using accscalar_t = at::acc_type; for (const auto i : c10::irange(output_size)) { - - const scalar_t real_input_index = area_pixel_compute_source_index( - scale, i, align_corners, /*cubic=*/true); + const accscalar_t real_input_index = + area_pixel_compute_source_index( + scale, i, align_corners, /*cubic=*/true); input_index = static_cast(floorf(real_input_index)); get_cubic_upsample_coefficients(coeffs, real_input_index - input_index); @@ -1184,7 +1181,6 @@ void _separable_upsample_generic_Nd_kernel_impl_single_dim( int interp_size = F::interp_size; auto input_scalar_type = input.scalar_type(); - if (interp_size == 1 && input_scalar_type == at::ScalarType::Byte) { // nearest also supports uint8 tensor, but we have to use float // with compute_indices_weights diff --git a/aten/src/ATen/native/cuda/AdaptiveAveragePooling.cu b/aten/src/ATen/native/cuda/AdaptiveAveragePooling.cu index 55b0d3322e04b..2f9e57ed121ee 100644 --- a/aten/src/ATen/native/cuda/AdaptiveAveragePooling.cu +++ b/aten/src/ATen/native/cuda/AdaptiveAveragePooling.cu @@ -442,10 +442,14 @@ namespace { output_arg{ output, "output", 2 }; checkAllSameGPU(__func__, {input_arg, output_arg}); - for (int64_t i = 1; i < input.ndimension(); i++) { + TORCH_CHECK(output_size.size() == 2, "adaptive_avg_pool2d: output_size must be 2"); + int64_t ndim = input.dim(); + TORCH_CHECK((ndim == 3 || ndim == 4), + "adaptive_avg_pool2d(): Expected 3D or 4D tensor, but got ", input.sizes()); + for (const auto i : {-2, -1}) { TORCH_CHECK(input.size(i) > 0, "adaptive_avg_pool2d(): Expected input to have non-zero size for non-batch dimensions, " - "but input has sizes ", input.sizes(), " with dimension ", i, " being " + "but input has sizes ", input.sizes(), " with dimension ", i + ndim, " being " "empty"); } @@ -538,9 +542,6 @@ namespace { break; } case at::MemoryFormat::Contiguous: { - TORCH_CHECK((input.ndimension() == 3 || input.ndimension() == 4), - "adaptive_avg_pool2d(): Expected 3D or 4D tensor, but got ", - input.sizes()); int64_t grid_x = input.size(-3); if (input.ndimension() == 4) { input_ = input.contiguous(); diff --git a/aten/src/ATen/native/cuda/Blas.cpp b/aten/src/ATen/native/cuda/Blas.cpp index 3ca9814175c59..3ff971fdfaff8 100644 --- a/aten/src/ATen/native/cuda/Blas.cpp +++ b/aten/src/ATen/native/cuda/Blas.cpp @@ -171,8 +171,8 @@ Tensor& addmm_out_cuda_impl(Tensor& result, const Tensor& self, const Tensor& ma scalar_type == at::ScalarType::Half || scalar_type == at::ScalarType::BFloat16) && mat2_sizes[0] > 1 && mat2_sizes[1] > 1 && - mat2_sizes[0] < 65535 && mat2_sizes[1] < 65535 && - mat1_sizes[0] < 65535 && mat1_sizes[1] < 65535 && + mat2_sizes[0] < 65535*32 && mat2_sizes[1] < 65535*32 && + mat1_sizes[0] < 65535*32 && mat1_sizes[1] < 
65535*32 && // avoid leaing dim >> rows bugs ((mat1.strides()[0]==1 && mat1.strides()[1]==mat1_sizes[0]) || (mat1.strides()[1] == 1 && mat1.strides()[0] == mat1_sizes[1]) || (scalar_type != at::ScalarType::Half && scalar_type != at::ScalarType::BFloat16)) && ((mat2.strides()[0]==1 && mat2.strides()[1]==mat2_sizes[0]) || (mat2.strides()[1] == 1 && mat2.strides()[0] == mat2_sizes[1]) || (scalar_type != at::ScalarType::Half && scalar_type != at::ScalarType::BFloat16)); diff --git a/aten/src/ATen/native/cuda/Copy.cu b/aten/src/ATen/native/cuda/Copy.cu index 4fb647e329d3c..9ed3c5a8bbcc4 100644 --- a/aten/src/ATen/native/cuda/Copy.cu +++ b/aten/src/ATen/native/cuda/Copy.cu @@ -23,7 +23,6 @@ namespace native { void neg_kernel_cuda(TensorIteratorBase &iter); void conj_kernel_cuda(TensorIteratorBase &iter); -namespace { void direct_copy_kernel_cuda(TensorIteratorBase &iter) { ScalarType dtype = iter.dtype(0); if (isQIntType(dtype)) { @@ -43,7 +42,6 @@ void neg_conj_kernel_cuda(TensorIteratorBase &iter) { gpu_kernel(iter, [] GPU_LAMBDA(scalar_t x) { return -std::conj(x); }); }); } -} // namespace (anonymous) using namespace at::cuda; diff --git a/aten/src/ATen/native/cuda/Copy.h b/aten/src/ATen/native/cuda/Copy.h new file mode 100644 index 0000000000000..5639567d66668 --- /dev/null +++ b/aten/src/ATen/native/cuda/Copy.h @@ -0,0 +1,10 @@ +#pragma once + +namespace at { +struct TensorIteratorBase; + +namespace native { + +void direct_copy_kernel_cuda(TensorIteratorBase &iter); + +}} // namespace at::native diff --git a/aten/src/ATen/native/cuda/CumminmaxKernel.cu b/aten/src/ATen/native/cuda/CumminmaxKernel.cu new file mode 100644 index 0000000000000..ea73273e2d4b7 --- /dev/null +++ b/aten/src/ATen/native/cuda/CumminmaxKernel.cu @@ -0,0 +1,29 @@ +#define TORCH_ASSERT_NO_OPERATORS +#include +#include + +#include +#include + +#include +#include + +namespace at { namespace native { + +void launch_cummax_cuda_kernel(const TensorBase& self, const TensorBase& values, const TensorBase& indices, int64_t dim) { + AT_DISPATCH_ALL_TYPES_AND3(at::ScalarType::Bool, at::ScalarType::Half, at::ScalarType::BFloat16, + self.scalar_type(), "cummax_cuda", [&]() { + scalar_t init = self.is_floating_point() ? (-1*std::numeric_limits::infinity()) : std::numeric_limits::lowest(); + scan_dim_with_indices(self, values, indices, dim, init, std::greater_equal()); + }); +} + +void launch_cummin_cuda_kernel(const TensorBase& self, const TensorBase& values, const TensorBase& indices, int64_t dim) { + AT_DISPATCH_ALL_TYPES_AND3(at::ScalarType::Bool, at::ScalarType::Half, at::ScalarType::BFloat16, + self.scalar_type(), "cummin_cuda", [&]() { + scalar_t init = self.is_floating_point() ? 
std::numeric_limits::infinity() : std::numeric_limits::max(); + scan_dim_with_indices(self, values, indices, dim, init, std::less_equal()); + }); +} + +}} // namespace at::native diff --git a/aten/src/ATen/native/cuda/CumprodKernel.cu b/aten/src/ATen/native/cuda/CumprodKernel.cu new file mode 100644 index 0000000000000..d1f3233abb130 --- /dev/null +++ b/aten/src/ATen/native/cuda/CumprodKernel.cu @@ -0,0 +1,23 @@ +#define TORCH_ASSERT_NO_OPERATORS +#include +#include + +#include +#include + +namespace at { namespace native { + +void launch_cumprod_cuda_kernel(const TensorBase& result, const TensorBase& self, int64_t dim) { + AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND2( + ScalarType::Half, ScalarType::BFloat16, self.scalar_type(), "cumprod_cuda", [&]() { + scalar_t init = 1; + scan_dim( + self, + result, + dim, + init, + std::multiplies()); + }); +} + +}} // namespace at::native diff --git a/aten/src/ATen/native/cuda/CumsumKernel.cu b/aten/src/ATen/native/cuda/CumsumKernel.cu new file mode 100644 index 0000000000000..85866b3f0f325 --- /dev/null +++ b/aten/src/ATen/native/cuda/CumsumKernel.cu @@ -0,0 +1,25 @@ +#define TORCH_ASSERT_NO_OPERATORS +#include +#include + +#include +#include + +namespace at { namespace native { + +void launch_cumsum_cuda_kernel(const TensorBase& result, const TensorBase& self, int64_t dim) { + AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND2( + ScalarType::Half, ScalarType::BFloat16, + self.scalar_type(), "cumsum_cuda", + [&]() { + scalar_t init = 0; + scan_dim( + self, + result, + dim, + init, + std::plus()); + }); +} + +}} // namespace at::native diff --git a/aten/src/ATen/native/cuda/DistanceKernel.cu b/aten/src/ATen/native/cuda/DistanceKernel.cu index a9130bd3e8083..2ae4cd592e6bc 100644 --- a/aten/src/ATen/native/cuda/DistanceKernel.cu +++ b/aten/src/ATen/native/cuda/DistanceKernel.cu @@ -6,6 +6,8 @@ #include #include +#include +#include #include #ifndef AT_PER_OPERATOR_HEADERS @@ -21,20 +23,7 @@ namespace at { namespace native { namespace { -static const int forward_threads = 256; - -template -static __forceinline__ __device__ scalar_t device_sqrt(scalar_t val); - -template <> -__forceinline__ __device__ float device_sqrt(float val) { - return ::sqrtf(val); -} - -template <> -__forceinline__ __device__ double device_sqrt(double val) { - return ::sqrt(val); -} +constexpr int kCUDANumThreads = 256; template struct dists { @@ -92,27 +81,16 @@ struct dists { }; template -__device__ static inline scalar_t reduce_agg(scalar_t agg) { - for (int offset = warpSize / 2; offset > 0; offset /= 2) { - F::agg(agg, WARP_SHFL_DOWN(agg, offset)); - } - - __shared__ scalar_t shared[forward_threads]; - int lane = threadIdx.x % warpSize; - int warp_id = threadIdx.x / warpSize; - if (lane == 0) { - shared[warp_id] = agg; - } +struct DistReduceOp { + __forceinline__ __device__ scalar_t combine(scalar_t a, scalar_t b) const { + F::agg(a, b); + return a; + } - __syncthreads(); - agg = (threadIdx.x < blockDim.x / warpSize) ? 
shared[lane] : 0.0; - if (warp_id == 0) { - for (int offset = blockDim.x / warpSize / 2; offset > 0; offset /= 2) { - F::agg(agg, WARP_SHFL_DOWN(agg, offset)); + __forceinline__ __device__ scalar_t warp_shfl_down(scalar_t data, int offset) const { + return WARP_SHFL_DOWN(data, offset); } - } - return agg; -} +}; template __global__ static void pdist_kernel_cuda_impl(scalar_t * result, const scalar_t * self, const int64_t n, const int64_t m, const scalar_t p, @@ -133,7 +111,9 @@ __global__ static void pdist_kernel_cuda_impl(scalar_t * result, const scalar_t F::inc(agg, std::abs(*a - *b), p); } - agg = reduce_agg(agg); + __shared__ scalar_t agg_smem[kCUDANumThreads]; + scalar_t agg_init{0.0}; + agg = cuda_utils::BlockReduce(agg, DistReduceOp{}, agg_init, agg_smem); if (threadIdx.x == 0) { result[k] = F::finish(agg, p); } @@ -222,7 +202,9 @@ __global__ static void cdist_kernel_cuda_impl(scalar_t * result, const scalar_t for (; a < end; a += stride, b += stride) { F::inc(agg, std::abs(*a - *b), p); } - agg = reduce_agg(agg); + __shared__ scalar_t agg_smem[kCUDANumThreads]; + scalar_t agg_init{0.0}; + agg = cuda_utils::BlockReduce(agg, DistReduceOp{}, agg_init, agg_smem); if (threadIdx.x == 0) { result[blockIdx.x] = F::finish(agg, p); } @@ -236,31 +218,27 @@ void cdist_kernel_impl(Tensor& result, const Tensor& x1, const Tensor& x2, doubl const int64_t l1_size = r1 * m; const int64_t l2_size = r2 * m; const dim3 grid(result.numel()); - const dim3 block(forward_threads); + const dim3 block(kCUDANumThreads); AT_DISPATCH_FLOATING_TYPES(x1.scalar_type(), "cdist_cuda", [&] { + auto impl_fptr = cdist_kernel_cuda_impl::p>; if (p == 0.0) { - cdist_kernel_cuda_impl::zero><<>>(result.data_ptr(), x1.data_ptr(), x2.data_ptr(), p, r2, m, r_size, l1_size, l2_size); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = cdist_kernel_cuda_impl::zero>; } else if (p == 1.0) { - cdist_kernel_cuda_impl::one><<>>(result.data_ptr(), x1.data_ptr(), x2.data_ptr(), p, r2, m, r_size, l1_size, l2_size); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = cdist_kernel_cuda_impl::one>; } else if (p == 2.0) { - cdist_kernel_cuda_impl::two><<>>(result.data_ptr(), x1.data_ptr(), x2.data_ptr(), p, r2, m, r_size, l1_size, l2_size); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = cdist_kernel_cuda_impl::two>; } else if (std::isinf(p)) { - cdist_kernel_cuda_impl::inf><<>>(result.data_ptr(), x1.data_ptr(), x2.data_ptr(), p, r2, m, r_size, l1_size, l2_size); - C10_CUDA_KERNEL_LAUNCH_CHECK(); - } else { - cdist_kernel_cuda_impl::p><<>>(result.data_ptr(), x1.data_ptr(), x2.data_ptr(), p, r2, m, r_size, l1_size, l2_size); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = cdist_kernel_cuda_impl::inf>; } + impl_fptr<<>>(result.data_ptr(), x1.data_ptr(), x2.data_ptr(), p, r2, m, r_size, l1_size, l2_size); + C10_CUDA_KERNEL_LAUNCH_CHECK(); }); } void pdist_forward_kernel_impl(Tensor& result, const Tensor& self, double p) { const dim3 grid(result.numel()); - const dim3 block(forward_threads); + const dim3 block(kCUDANumThreads); int64_t n = self.size(0); int64_t m = self.size(1); // https://github.com/pytorch/pytorch/issues/15511 demonstrated we need to do @@ -269,22 +247,18 @@ void pdist_forward_kernel_impl(Tensor& result, const Tensor& self, double p) { const double n2_squared_minus_1 = n2 * n2 - 1; AT_DISPATCH_FLOATING_TYPES(self.scalar_type(), "pdist_cuda", [&] { + auto impl_fptr = pdist_kernel_cuda_impl::p>; if (p == 0.0) { - pdist_kernel_cuda_impl::zero><<>>(result.data_ptr(), self.data_ptr(), n, m, p, n2, n2_squared_minus_1); - 
C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = pdist_kernel_cuda_impl::zero>; } else if (p == 1.0) { - pdist_kernel_cuda_impl::one><<>>(result.data_ptr(), self.data_ptr(), n, m, p, n2, n2_squared_minus_1); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = pdist_kernel_cuda_impl::one>; } else if (p == 2.0) { - pdist_kernel_cuda_impl::two><<>>(result.data_ptr(), self.data_ptr(), n, m, p, n2, n2_squared_minus_1); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = pdist_kernel_cuda_impl::two>; } else if (std::isinf(p)) { - pdist_kernel_cuda_impl::inf><<>>(result.data_ptr(), self.data_ptr(), n, m, p, n2, n2_squared_minus_1); - C10_CUDA_KERNEL_LAUNCH_CHECK(); - } else { - pdist_kernel_cuda_impl::p><<>>(result.data_ptr(), self.data_ptr(), n, m, p, n2, n2_squared_minus_1); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = pdist_kernel_cuda_impl::inf>; } + impl_fptr<<>>(result.data_ptr(), self.data_ptr(), n, m, p, n2, n2_squared_minus_1); + C10_CUDA_KERNEL_LAUNCH_CHECK(); }); } @@ -311,22 +285,18 @@ void pdist_backward_kernel_impl(Tensor& result, const Tensor& grad, const Tensor Tensor buffer = at::empty({n - 1, result.size(0), result.size(1)}, result.options()); AT_DISPATCH_FLOATING_TYPES(self.scalar_type(), "pdist_cuda_backward", [&] { + auto impl_fptr = pdist_backward_kernel_cuda_impl::p>; if (p == 1.0) { - pdist_backward_kernel_cuda_impl::one><<>>(buffer.data_ptr(), grad.data_ptr(), self.data_ptr(), dist.data_ptr(), grad.stride(0), n, m, dist.numel(), p, n2, n2_squared_minus_1); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = pdist_backward_kernel_cuda_impl::one>; } else if (p < 2.0) { - pdist_backward_kernel_cuda_impl::lt_two><<>>(buffer.data_ptr(), grad.data_ptr(), self.data_ptr(), dist.data_ptr(), grad.stride(0), n, m, dist.numel(), p, n2, n2_squared_minus_1); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = pdist_backward_kernel_cuda_impl::lt_two>; } else if (p == 2.0) { - pdist_backward_kernel_cuda_impl::two><<>>(buffer.data_ptr(), grad.data_ptr(), self.data_ptr(), dist.data_ptr(), grad.stride(0), n, m, dist.numel(), p, n2, n2_squared_minus_1); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = pdist_backward_kernel_cuda_impl::two>; } else if (std::isinf(p)) { - pdist_backward_kernel_cuda_impl::inf><<>>(buffer.data_ptr(), grad.data_ptr(), self.data_ptr(), dist.data_ptr(), grad.stride(0), n, m, dist.numel(), p, n2, n2_squared_minus_1); - C10_CUDA_KERNEL_LAUNCH_CHECK(); - } else { - pdist_backward_kernel_cuda_impl::p><<>>(buffer.data_ptr(), grad.data_ptr(), self.data_ptr(), dist.data_ptr(), grad.stride(0), n, m, dist.numel(), p, n2, n2_squared_minus_1); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = pdist_backward_kernel_cuda_impl::inf>; } + impl_fptr<<>>(buffer.data_ptr(), grad.data_ptr(), self.data_ptr(), dist.data_ptr(), grad.stride(0), n, m, dist.numel(), p, n2, n2_squared_minus_1); + C10_CUDA_KERNEL_LAUNCH_CHECK(); }); at::sum_out(result, buffer, 0); @@ -364,32 +334,20 @@ void cdist_backward_kernel_impl(Tensor& result, const Tensor& grad, const Tensor Tensor buffer = at::empty({batch, r2, r1, m}, result.options()); AT_DISPATCH_FLOATING_TYPES(result.scalar_type(), "cdist_cuda_backward", [&] { + auto impl_fptr = cdist_backward_kernel_cuda_impl::p>; if (p == 1.0) { - cdist_backward_kernel_cuda_impl::one><<>>(buffer.data_ptr(), - grad.data_ptr(), x1.data_ptr(), x2.data_ptr(), dist.data_ptr(), - p, r1, r2, m, count, r_size, l1_size, l2_size); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = cdist_backward_kernel_cuda_impl::one>; } else if (p < 2.0) { - 
cdist_backward_kernel_cuda_impl::lt_two><<>>(buffer.data_ptr(), - grad.data_ptr(), x1.data_ptr(), x2.data_ptr(), dist.data_ptr(), - p, r1, r2, m, count, r_size, l1_size, l2_size); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = cdist_backward_kernel_cuda_impl::lt_two>; } else if (p == 2.0) { - cdist_backward_kernel_cuda_impl::two><<>>(buffer.data_ptr(), - grad.data_ptr(), x1.data_ptr(), x2.data_ptr(), dist.data_ptr(), - p, r1, r2, m, count, r_size, l1_size, l2_size); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = cdist_backward_kernel_cuda_impl::two>; } else if (std::isinf(p)) { - cdist_backward_kernel_cuda_impl::inf><<>>(buffer.data_ptr(), - grad.data_ptr(), x1.data_ptr(), x2.data_ptr(), dist.data_ptr(), - p, r1, r2, m, count, r_size, l1_size, l2_size); - C10_CUDA_KERNEL_LAUNCH_CHECK(); - } else { - cdist_backward_kernel_cuda_impl::p><<>>(buffer.data_ptr(), + impl_fptr = cdist_backward_kernel_cuda_impl::inf>; + } + impl_fptr<<>>(buffer.data_ptr(), grad.data_ptr(), x1.data_ptr(), x2.data_ptr(), dist.data_ptr(), p, r1, r2, m, count, r_size, l1_size, l2_size); - C10_CUDA_KERNEL_LAUNCH_CHECK(); - } + C10_CUDA_KERNEL_LAUNCH_CHECK(); }); at::sum_out(result, buffer, 1); diff --git a/aten/src/ATen/native/cuda/EmbeddingBag.cu b/aten/src/ATen/native/cuda/EmbeddingBag.cu index 7ac3a7151b79c..2cd76cbe34d1a 100644 --- a/aten/src/ATen/native/cuda/EmbeddingBag.cu +++ b/aten/src/ATen/native/cuda/EmbeddingBag.cu @@ -26,6 +26,7 @@ #include #include #include +#include #include @@ -457,14 +458,6 @@ Tensor _embedding_bag_dense_backward_cuda(const Tensor &grad_, const Tensor &ind } } -template -__inline__ __device__ -static scalar_t warpReduceSum(scalar_t val) { - for (int offset = C10_WARP_SIZE/2; offset > 0; offset /= 2) - val += WARP_SHFL_DOWN(val, offset); - return val; -} - template __global__ static void _embedding_bag_per_sample_weights_backward_kernel( const scalar_t* grad, int64_t grad_stride0, int64_t grad_stride1, @@ -495,7 +488,7 @@ __global__ static void _embedding_bag_per_sample_weights_backward_kernel( weight[weight_stride0 * embedding_idx + weight_stride1 * feature_idx]; } } - result = warpReduceSum(result); + result = cuda_utils::WarpReduceSum(result); if (thread_in_warp == 0) { output[sample_idx] = result; } diff --git a/aten/src/ATen/native/cuda/ForeachPointwiseOp.cu b/aten/src/ATen/native/cuda/ForeachPointwiseOp.cu index 95d0854bd3dec..3b04b68b0f391 100644 --- a/aten/src/ATen/native/cuda/ForeachPointwiseOp.cu +++ b/aten/src/ATen/native/cuda/ForeachPointwiseOp.cu @@ -188,7 +188,7 @@ std::vector foreach_tensor_##NAME##_cuda(TensorList tensors1, TensorList tensor_lists.emplace_back(std::move(vec_res)); \ \ AT_DISPATCH_ALL_TYPES_AND2(kHalf, kBFloat16, tensors1[0].scalar_type(), "foreach_maximum_minimum_op_cuda", [&]() { \ - using opmath_t = at::opmath_type; \ + using opmath_t = at::opmath_type; \ auto op = [] GPU_LAMBDA (opmath_t a, opmath_t b) -> opmath_t { \ opmath_t c = a OP b ? 
a : b; \ if (_isnan(a)) { \ @@ -196,12 +196,37 @@ std::vector foreach_tensor_##NAME##_cuda(TensorList tensors1, TensorList } \ return c;}; \ multi_tensor_apply<3>(tensor_lists, \ - PointwiseOpListFunctor(), \ - op); \ + BinaryOpListAlphaFunctor(), \ + op, \ + opmath_t(1)); \ }); \ \ return tensor_lists[2]; \ } \ + \ +void foreach_tensor_##NAME##_cuda_(TensorList self, TensorList other) { \ + check_foreach_api_restrictions(self, other); \ + if (!can_use_fast_route({self, other}) || has_bool_tensor(self)) { \ + return at::native::foreach_tensor_##NAME##_slow_(self, other); \ + } \ + \ + AT_DISPATCH_ALL_TYPES_AND2(kHalf, kBFloat16, self[0].scalar_type(), "foreach_maximum_minimum_op_cuda_", \ + [&]() { \ + using opmath_t = at::opmath_type; \ + std::vector> tensor_lists{self.vec(), other.vec()}; \ + auto op = [] GPU_LAMBDA (opmath_t a, opmath_t b) -> opmath_t { \ + opmath_t c = a OP b ? a : b; \ + if (_isnan(a)) { \ + c = a; \ + } \ + return c; \ + }; \ + multi_tensor_apply<2>(tensor_lists, \ + BinaryOpListAlphaFunctor(), \ + op, \ + opmath_t(1)); \ + }); \ +} \ FOREACH_MAXIMUM_MINIMUM_OP(maximum, >) FOREACH_MAXIMUM_MINIMUM_OP(minimum, <) diff --git a/aten/src/ATen/native/cuda/FractionalMaxPool2d.cu b/aten/src/ATen/native/cuda/FractionalMaxPool2d.cu index 46ea4eadf1feb..24db8776cd49a 100644 --- a/aten/src/ATen/native/cuda/FractionalMaxPool2d.cu +++ b/aten/src/ATen/native/cuda/FractionalMaxPool2d.cu @@ -185,10 +185,10 @@ TORCH_IMPL_FUNC(fractional_max_pool2d_out_cuda) ( AT_DISPATCH_FLOATING_TYPES_AND_HALF(input.scalar_type(), "fractional_max_pool2d_out_cuda_frame", [&] { - auto devInput = input_.packed_accessor(); - auto devOutput = output_.packed_accessor(); - auto devIndices = indices_.packed_accessor(); - auto devSamples = randomSamples.packed_accessor(); + auto devInput = input_.packed_accessor64(); + auto devOutput = output_.packed_accessor64(); + auto devIndices = indices_.packed_accessor64(); + auto devSamples = randomSamples.packed_accessor64(); fractional_max_pool2d_out_cuda_frame <<>>( devOutput, devIndices, devInput, devSamples, @@ -253,12 +253,12 @@ TORCH_IMPL_FUNC(fractional_max_pool2d_backward_cuda)( gradInput_.size(0)); dim3 block(outputPlaneSize > 128 ? 128 : outputPlaneSize); - auto devIndices = indices_.packed_accessor(); + auto devIndices = indices_.packed_accessor64(); AT_DISPATCH_FLOATING_TYPES_AND_HALF(gradOutput.scalar_type(), "fractional_max_pool2d_backward_out_cuda_frame", [&] { - auto devGradInput = gradInput_.packed_accessor(); - auto devGradOutput = gradOutput_.packed_accessor(); + auto devGradInput = gradInput_.packed_accessor64(); + auto devGradOutput = gradOutput_.packed_accessor64(); fractional_max_pool2d_backward_out_cuda_frame <<>>( devGradInput, devGradOutput, devIndices); diff --git a/aten/src/ATen/native/cuda/Indexing.cu b/aten/src/ATen/native/cuda/Indexing.cu index 4720da4bd1124..6ea88069ca2ef 100644 --- a/aten/src/ATen/native/cuda/Indexing.cu +++ b/aten/src/ATen/native/cuda/Indexing.cu @@ -1256,6 +1256,11 @@ Tensor & masked_fill__cuda(Tensor& self, const Tensor & mask, const Scalar& valu Tensor & masked_fill__cuda(Tensor& self, const Tensor & mask, const Tensor & value) { TORCH_CHECK(value.dim() == 0, "masked_fill_ only supports a 0-dimensional value tensor, but got tensor " "with ", value.dim(), " dimension(s)."); + // We hit this function if either of the input tensor lives on CUDA. + // It is ok, if `value` is `CPU` tensor but we should not allow `self` or + // `mask` to be CPU tensor. 
Check for `self` and `mask` being on same device + // exists in `masked_fill__cuda` (Scalar version). + TORCH_CHECK(!self.device().is_cpu(), "masked_fill_: Expected inputs to be on same device") return masked_fill__cuda(self, mask, value.item()); } diff --git a/aten/src/ATen/native/cuda/JitLoops.cuh b/aten/src/ATen/native/cuda/JitLoops.cuh index bb37a6acc2e14..6f350c550ce93 100644 --- a/aten/src/ATen/native/cuda/JitLoops.cuh +++ b/aten/src/ATen/native/cuda/JitLoops.cuh @@ -12,11 +12,7 @@ #include -#if !AT_ROCM_ENABLED() #include -#else -#error Jiterator not supported on ROCm -#endif namespace at { namespace native { diff --git a/aten/src/ATen/native/cuda/LinearAlgebraStubs.cpp b/aten/src/ATen/native/cuda/LinearAlgebraStubs.cpp index 913e30b77c0ff..cb6cacb3630fb 100644 --- a/aten/src/ATen/native/cuda/LinearAlgebraStubs.cpp +++ b/aten/src/ATen/native/cuda/LinearAlgebraStubs.cpp @@ -35,8 +35,7 @@ namespace native { namespace { cuda::detail::LinalgDispatch disp = {_symeig_helper_cuda, _cholesky_solve_helper_cuda, - legacy_lstsq_cuda, - _linalg_inv_out_helper_cuda}; + legacy_lstsq_cuda}; at::DynamicLibrary& getTorchLinalgLibrary() { static at::DynamicLibrary lib("libtorch_cuda_linalg.so", nullptr, true); @@ -177,12 +176,6 @@ void registerLinalgDispatch(const LinalgDispatch& disp_) { } }} //namespace cuda::detail -Tensor& _linalg_inv_out_helper_cuda(Tensor &result, Tensor& infos_lu, Tensor& infos_getri) { - getTorchLinalgLibrary(); - TORCH_CHECK(disp.inv_out_helper != _linalg_inv_out_helper_cuda, "Can't find _linalg_inv_out_helper_cuda"); - return disp.inv_out_helper(result, infos_lu, infos_getri); -} - std::tuple legacy_lstsq_cuda(const Tensor &B, const Tensor &A) { getTorchLinalgLibrary(); TORCH_CHECK(disp.legacy_lstsq != legacy_lstsq_cuda, "Can't find legacy_lstsq_cuda"); diff --git a/aten/src/ATen/native/cuda/LogcumsumexpKernel.cu b/aten/src/ATen/native/cuda/LogcumsumexpKernel.cu new file mode 100644 index 0000000000000..28b3236caa2de --- /dev/null +++ b/aten/src/ATen/native/cuda/LogcumsumexpKernel.cu @@ -0,0 +1,37 @@ +#define TORCH_ASSERT_NO_OPERATORS +#include +#include +#include + +#include +#include + +#include +#include + +namespace at { namespace native { + +void launch_logcumsumexp_cuda_kernel(const TensorBase& result, const TensorBase& self, int64_t dim) { + AT_DISPATCH_FLOATING_TYPES_AND2( + ScalarType::Half, ScalarType::BFloat16, + self.scalar_type(), "logcumsumexp_cuda", + [&]() { + using opmath_t = at::opmath_type; + scalar_t init = -std::numeric_limits::infinity(); + auto log_add_exp = [] C10_HOST_DEVICE (const scalar_t x_, const scalar_t y_) -> scalar_t { + const opmath_t x{x_}, y{y_}; + auto min = at::_isnan(y) ? y : std::min(x, y); //std::min returns first arg if one of the args is nan + auto max = at::_isnan(y) ? y : std::max(x, y); //std::max returns first arg if one of the args is nan + if (min != max || ::isfinite(min)) { + // nan will be propagated here + return ::log1p(std::exp(min - max)) + max; + } else { + // special case to correctly handle infinite inputs + return x; + } + }; + scan_dim(self, result, dim, init, log_add_exp); + }); +} + +}} // namespace at::native diff --git a/aten/src/ATen/native/cuda/Loss.cu b/aten/src/ATen/native/cuda/Loss.cu index 2b5d17b9547ed..fcb3229198ab7 100644 --- a/aten/src/ATen/native/cuda/Loss.cu +++ b/aten/src/ATen/native/cuda/Loss.cu @@ -211,6 +211,7 @@ __global__ void nll_loss_forward_reduce_cuda_kernel_1d( // If the only element was omited, we get 0. 
See the discussion in // https://github.com/pytorch/pytorch/pull/64572#issuecomment-926504162 *output = scalar_t{0}; + *total_weight = scalar_t{0}; } } @@ -280,6 +281,7 @@ void nll_loss_forward_out_cuda_template( if (reduction == Reduction::None && n_dims == 2) { at::native::resize_output(output, {batch_size}); + total_weight.zero_(); if (batch_size == 0) { // This guards from unnecessary operations and launching CUDA kernel with // 0 blocks. diff --git a/aten/src/ATen/native/cuda/NLLLoss2d.cu b/aten/src/ATen/native/cuda/NLLLoss2d.cu index 2246c836f3dca..a2027587d1c5e 100644 --- a/aten/src/ATen/native/cuda/NLLLoss2d.cu +++ b/aten/src/ATen/native/cuda/NLLLoss2d.cu @@ -268,9 +268,9 @@ void nll_loss2d_forward_out_cuda_template( 0, at::cuda::getCurrentCUDAStream()>>>( count, - input.packed_accessor(), - target.packed_accessor(), - output.packed_accessor(), + input.packed_accessor64(), + target.packed_accessor64(), + output.packed_accessor64(), optional_data(weight_), ignore_index); C10_CUDA_KERNEL_LAUNCH_CHECK(); @@ -403,9 +403,9 @@ void nll_loss2d_backward_out_cuda_template( 0, at::cuda::getCurrentCUDAStream()>>>( count, - target.packed_accessor(), - grad_output.packed_accessor(), - grad_input.packed_accessor(), + target.packed_accessor64(), + grad_output.packed_accessor64(), + grad_input.packed_accessor64(), optional_data(weight_), ignore_index); C10_CUDA_KERNEL_LAUNCH_CHECK(); diff --git a/aten/src/ATen/native/cuda/Normalization.cuh b/aten/src/ATen/native/cuda/Normalization.cuh index a9b11e76db680..cc79284fea4db 100644 --- a/aten/src/ATen/native/cuda/Normalization.cuh +++ b/aten/src/ATen/native/cuda/Normalization.cuh @@ -6,6 +6,7 @@ #include #include #include +#include #include #include #include @@ -60,26 +61,10 @@ struct Float2 { v2 += a.v2; return *this; } -}; - -template -struct SumOp { - __device__ SumOp(const PTA& t) : tensor(t) {} - __device__ __forceinline__ accscalar_t operator()(int batch, int plane, int n) { - return static_cast(tensor[batch][plane][n]); - } - const PTA& tensor; -}; - -template -struct VarOp { - __device__ VarOp(accscalar_t m, const PTA& t) : mean(m), tensor(t) {} - __device__ __forceinline__ accscalar_t operator()(int batch, int plane, int n) { - accscalar_t val = tensor[batch][plane][n]; - return (val - mean) * (val - mean); + __device__ friend Float2 operator+(Float2 a, const Float2& b) { + a += b; + return a; } - const accscalar_t mean; - const PTA& tensor; }; template @@ -96,21 +81,25 @@ struct GradOp { const PTA& grad_output; }; -// Sum across all threads within a warp -template -static __device__ __forceinline__ T warpSum(T val) { - for (int i = 0; i < getMSB(C10_WARP_SIZE); ++i) { - val += WARP_SHFL_XOR(val, 1 << i, C10_WARP_SIZE); - } - return val; -} +template +struct SumReduceOp { + __device__ __forceinline__ acc_t combine(acc_t a, acc_t b) const { return a + b; } + + __device__ __forceinline__ acc_t warp_shfl_down(acc_t data, int offset) const { + return WARP_SHFL_DOWN(data, offset); + } +}; template -static __device__ __forceinline__ Float2 warpSum(Float2 value) { - value.v1 = warpSum(value.v1); - value.v2 = warpSum(value.v2); - return value; -} +struct SumReduceOp> { + using acc_t = Float2; + + __device__ __forceinline__ acc_t combine(acc_t a, acc_t b) const { return a + b; } + + __device__ __forceinline__ acc_t warp_shfl_down(acc_t data, int offset) const { + return {WARP_SHFL_DOWN(data.v1, offset), WARP_SHFL_DOWN(data.v2, offset)}; + } +}; // Sum across (batch, x/y/z) applying Op() pointwise // this works by first having each thread sum it's part 
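The functors above (SumReduceOp here, DistReduceOp in DistanceKernel.cu) only have to supply `combine` and `warp_shfl_down`; `cuda_utils::BlockReduce` then provides the two-stage warp/shared-memory reduction that the deleted `warpSum`/`reduce_agg` code used to hand-roll. Below is a minimal, self-contained sketch of the same pattern, assuming a 1-D block whose size is a multiple of the warp size; the functor, kernel, and `BlockReduce` stand-in are illustrative simplifications, with plain `__shfl_down_sync` in place of the `WARP_SHFL_DOWN` macro.

```
// nvcc block_reduce_sketch.cu && ./a.out
#include <cuda_runtime.h>
#include <cfloat>
#include <cstdio>

constexpr int kWarpSize = 32;
constexpr int kThreads = 256;

// Same interface the patch's reduce-op functors expose: combine() + warp_shfl_down().
struct MaxOp {
  __device__ float combine(float a, float b) const { return fmaxf(a, b); }
  __device__ float warp_shfl_down(float v, int offset) const {
    return __shfl_down_sync(0xffffffff, v, offset);
  }
};

// Simplified stand-in for cuda_utils::BlockReduce (1-D blocks, blockDim.x a
// multiple of the warp size). The result is only valid in thread 0.
template <typename T, class ReduceOp>
__device__ T BlockReduce(T val, const ReduceOp& op, T identity, T* shared) {
  const int lid = threadIdx.x % kWarpSize;
  const int wid = threadIdx.x / kWarpSize;
  for (int off = kWarpSize / 2; off > 0; off /= 2)
    val = op.combine(val, op.warp_shfl_down(val, off));
  if (lid == 0) shared[wid] = val;  // one partial result per warp
  __syncthreads();
  val = (threadIdx.x < blockDim.x / kWarpSize) ? shared[lid] : identity;
  if (wid == 0)  // first warp reduces the per-warp partials
    for (int off = kWarpSize / 2; off > 0; off /= 2)
      val = op.combine(val, op.warp_shfl_down(val, off));
  return val;
}

__global__ void block_max(const float* in, int n, float* out) {
  __shared__ float smem[kThreads / kWarpSize];
  float v = -FLT_MAX;
  for (int i = threadIdx.x; i < n; i += blockDim.x) v = fmaxf(v, in[i]);
  v = BlockReduce(v, MaxOp{}, -FLT_MAX, smem);
  if (threadIdx.x == 0) *out = v;
}

int main() {
  const int n = 1000;
  float h[n], *d_in, *d_out, result;
  for (int i = 0; i < n; ++i) h[i] = static_cast<float>(i);
  cudaMalloc(&d_in, n * sizeof(float));
  cudaMalloc(&d_out, sizeof(float));
  cudaMemcpy(d_in, h, n * sizeof(float), cudaMemcpyHostToDevice);
  block_max<<<1, kThreads>>>(d_in, n, d_out);
  cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
  printf("max = %f\n", result);  // expect 999
  cudaFree(d_in);
  cudaFree(d_out);
  return 0;
}
```

The real `cuda_utils::BlockReduce` additionally takes a block-indexing policy (`Block1D`/`Block2D`, added later in this patch) so that 2-D thread blocks such as the batch-norm reduction can reuse the same helper.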
@@ -130,37 +119,13 @@ __device__ scalar_t reduce(Op op, PTA tensor, int plane) { sum += op(batch, plane, x); } } - - // first warpSum to get one value per thread to - // one value per warp - sum = warpSum(sum); - - // this writes each warps item into shared memory - // there are at most C10_WARP_SIZE items left because - // there are at most C10_WARP_SIZE**2 threads at the beginning __shared__ scalar_t shared[C10_WARP_SIZE]; - __syncthreads(); - int tid = threadIdx.x + threadIdx.y * blockDim.x; - if (tid % C10_WARP_SIZE == 0) { - shared[tid / C10_WARP_SIZE] = sum; - } - if (tid >= blockDim.x * blockDim.y / C10_WARP_SIZE && tid < C10_WARP_SIZE) { - // zero out the other entries in shared - shared[tid] = (scalar_t)0; - } - __syncthreads(); - // now have a second warpSum to reduce the intermediate values - // from shared memory to a single number. The very first - // thread writes it to shared memory. - - if (tid / C10_WARP_SIZE == 0) { - sum = warpSum(shared[tid]); - if (tid == 0) { + SumReduceOp reduce_op; + sum = cuda_utils::BlockReduce, cuda_utils::Block2D>(sum, reduce_op, 0, shared); + if (threadIdx.x == 0 && threadIdx.y == 0) { shared[0] = sum; - } } __syncthreads(); - // Everyone picks it up, should be broadcast into the whole grad_input return shared[0]; } diff --git a/aten/src/ATen/native/cuda/PersistentSoftmax.cuh b/aten/src/ATen/native/cuda/PersistentSoftmax.cuh index 9958d4c9b8144..5d3bea36e37a3 100644 --- a/aten/src/ATen/native/cuda/PersistentSoftmax.cuh +++ b/aten/src/ATen/native/cuda/PersistentSoftmax.cuh @@ -90,7 +90,7 @@ __global__ void softmax_warp_forward(output_t *dst, const input_t *src, int batc dst += idx_offset; if (is_transformer_mask) { - mask += (idx_offset / head_chunk_size) * stride + local_idx; + mask += ((first_batch * stride) / head_chunk_size) * stride + local_idx; } else { mask += idx_offset; } @@ -117,13 +117,14 @@ __global__ void softmax_warp_forward(output_t *dst, const input_t *src, int batc acc_t max_value[WARP_BATCH]; #pragma unroll for (int i = 0; i < WARP_BATCH; ++i) { + int batch_element_count = (i >= local_batches) ? 0 : element_count; bool is_meaningful_max = false; max_value[i] = elements[i][0]; #pragma unroll for (int it = 0; it < WARP_ITERATIONS; ++it) { if (is_masked) { int idx = it*WARP_SIZE; - if ((idx + local_idx) < element_count) { + if ((idx + local_idx) < batch_element_count) { if (!is_transformer_mask) { idx += i*element_count; } @@ -147,6 +148,7 @@ __global__ void softmax_warp_forward(output_t *dst, const input_t *src, int batc acc_t sum[WARP_BATCH] { 0.0f }; #pragma unroll for (int i = 0; i < WARP_BATCH; ++i) { + int batch_element_count = (i >= local_batches) ? 
0 : element_count; #pragma unroll for (int it = 0; it < WARP_ITERATIONS; ++it) { if (!is_masked) { @@ -158,7 +160,7 @@ __global__ void softmax_warp_forward(output_t *dst, const input_t *src, int batc } } else { int idx = it*WARP_SIZE; - bool valid = (idx + local_idx) < element_count; + bool valid = (idx + local_idx) < batch_element_count; if (!is_transformer_mask) { idx += i*element_count; } diff --git a/aten/src/ATen/native/cuda/ScanKernels.cpp b/aten/src/ATen/native/cuda/ScanKernels.cpp index 206543384a996..69f86c006950c 100644 --- a/aten/src/ATen/native/cuda/ScanKernels.cpp +++ b/aten/src/ATen/native/cuda/ScanKernels.cpp @@ -89,6 +89,11 @@ Tensor _logcumsumexp_cuda(const Tensor& self, int64_t dim) { } void cumsum_cuda_kernel(const Tensor& result, const Tensor& self, int64_t dim) { + if (self.is_floating_point() || self.is_complex()) { + // See Note [Writing Nondeterministic Operations] + // Issue reporting nondeterministic behavior: https://github.com/pytorch/pytorch/issues/75240 + globalContext().alertNotDeterministic("cumsum_cuda_kernel"); + } auto result_ = contiguous_out_arg(result); launch_cumsum_cuda_kernel(*result_, self, dim); if (!result.is_same(*result_)) { diff --git a/aten/src/ATen/native/cuda/ScanKernels.cu b/aten/src/ATen/native/cuda/ScanUtils.cuh similarity index 84% rename from aten/src/ATen/native/cuda/ScanKernels.cu rename to aten/src/ATen/native/cuda/ScanUtils.cuh index 44982208c086a..ba27a245172b5 100644 --- a/aten/src/ATen/native/cuda/ScanKernels.cu +++ b/aten/src/ATen/native/cuda/ScanUtils.cuh @@ -1,18 +1,15 @@ -#define TORCH_ASSERT_NO_OPERATORS -#include -#include -#include -#include -#include +#pragma once #include -#include -#include - +#include #include +#include -#include +#include +#include +#include -namespace at { namespace native { +namespace at { +namespace native { template constexpr inline integer ceil_div(integer n, integer m) { @@ -158,7 +155,7 @@ __global__ void tensor_kernel_scan_outer_dim_with_indices(scalar_t *self_, scala } } -void check_fits_in_unsigned(int64_t val, const char* name) { +inline void check_fits_in_unsigned(int64_t val, const char* name) { constexpr auto umax = std::numeric_limits::max(); TORCH_CHECK( val >= 0 && val <= umax, name, " must fit in a 32-bit uint32_t value"); @@ -224,22 +221,6 @@ void scan_dim_with_indices(const TensorBase& self, const TensorBase& values, con } } -void launch_cummax_cuda_kernel(const TensorBase& self, const TensorBase& values, const TensorBase& indices, int64_t dim) { - AT_DISPATCH_ALL_TYPES_AND3(at::ScalarType::Bool, at::ScalarType::Half, at::ScalarType::BFloat16, - self.scalar_type(), "cummax_cuda", [&]() { - scalar_t init = self.is_floating_point() ? (-1*std::numeric_limits::infinity()) : std::numeric_limits::lowest(); - scan_dim_with_indices(self, values, indices, dim, init, std::greater_equal()); - }); -} - -void launch_cummin_cuda_kernel(const TensorBase& self, const TensorBase& values, const TensorBase& indices, int64_t dim) { - AT_DISPATCH_ALL_TYPES_AND3(at::ScalarType::Bool, at::ScalarType::Half, at::ScalarType::BFloat16, - self.scalar_type(), "cummin_cuda", [&]() { - scalar_t init = self.is_floating_point() ? 
std::numeric_limits::infinity() : std::numeric_limits::max(); - scan_dim_with_indices(self, values, indices, dim, init, std::less_equal()); - }); -} - // TODO: The implementation of `tensor_kernel_scan_outer_dim` and // `tensor_kernel_scan_innermost_dim` is similar to // `tensor_kernel_scan_outer_dim_with_indices` @@ -468,54 +449,4 @@ void scan_dim(const TensorBase& self, const TensorBase& result, } } -void launch_logcumsumexp_cuda_kernel(const TensorBase& result, const TensorBase& self, int64_t dim) { - AT_DISPATCH_FLOATING_TYPES_AND2( - ScalarType::Half, ScalarType::BFloat16, - self.scalar_type(), "logcumsumexp_cuda", - [&]() { - using accscalar_t = acc_type; - scalar_t init = -std::numeric_limits::infinity(); - auto log_add_exp = [] C10_HOST_DEVICE (const scalar_t x, const scalar_t y) -> scalar_t { - scalar_t min = at::_isnan(y) ? y : std::min(x,y); //std::min returns first arg if one of the args is nan - scalar_t max = at::_isnan(y) ? y : std::max(x,y); //std::max returns first arg if one of the args is nan - if (min != max || ::isfinite(static_cast(min))) { - // nan will be propagated here - return ::log1p(std::exp(min - max)) + max; - } else { - // special case to correctly handle infinite inputs - return x; - } - }; - scan_dim(self, result, dim, init, log_add_exp); - }); -} - -void launch_cumsum_cuda_kernel(const TensorBase& result, const TensorBase& self, int64_t dim) { - AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND2( - ScalarType::Half, ScalarType::BFloat16, - self.scalar_type(), "cumsum_cuda", - [&]() { - scalar_t init = 0; - scan_dim( - self, - result, - dim, - init, - std::plus()); - }); -} - -void launch_cumprod_cuda_kernel(const TensorBase& result, const TensorBase& self, int64_t dim) { - AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND2( - ScalarType::Half, ScalarType::BFloat16, self.scalar_type(), "cumprod_cuda", [&]() { - scalar_t init = 1; - scan_dim( - self, - result, - dim, - init, - std::multiplies()); - }); -} - -}} // namespace at::native +}} // namespace at::native diff --git a/aten/src/ATen/native/cuda/SoftMax.cu b/aten/src/ATen/native/cuda/SoftMax.cu index feeb08ae34b34..c53276e619be2 100644 --- a/aten/src/ATen/native/cuda/SoftMax.cu +++ b/aten/src/ATen/native/cuda/SoftMax.cu @@ -957,18 +957,24 @@ TORCH_IMPL_FUNC(softmax_backward_cuda_out) host_softmax_backward(tmp, output, dim, half_to_float, grad_input); } -Tensor masked_softmax_cuda(const Tensor& input_, const Tensor& mask_, const c10::optional dim_) { +Tensor masked_softmax_cuda(const Tensor& input_, const Tensor& mask_, const c10::optional dim_, const c10::optional mask_type_) { Tensor output = at::empty_like(input_, input_.options()); TORCH_CHECK(mask_.scalar_type() == ScalarType::Bool, "Mask should be a boolean tensor"); + TORCH_CHECK(mask_type_.has_value(), "Mask Type should be defined"); + int64_t mask_type = mask_type_.value(); + TORCH_CHECK((mask_type == 0) || (mask_type == 1), "Mask Type should be 0 (src_mask) or 1 (src_key_padding_mask)"); + // If input is [B, H, T, T] and mask is [B, T] // we have special fast kernel - bool is_BxT_mask = (input_.dim() == 4 && mask_.dim() == 2 && input_.size(0) == mask_.size(0) && input_.size(2) == mask_.size(1) && input_.size(3) == mask_.size(1)); + // mask_type == 1 => mask_ is a src_key_padding_mask + bool is_BxT_mask = (mask_type == 1) && (input_.dim() == 4 && mask_.dim() == 2 && input_.size(0) == mask_.size(0) && input_.size(2) == mask_.size(1) && input_.size(3) == mask_.size(1)); // If input is [B, H, T, T] and mask is [T, T] // expand mask to [B, H, T, T] and treat it like 
regular mask // TODO We should have special fast kernel for TxT mask as well - bool is_TxT_mask = input_.dim() == 4 && mask_.dim() == 2 && input_.size(3) == mask_.size(1) && input_.size(2) == mask_.size(0) && mask_.size(0) == mask_.size(1); + // mask_type == 0 => mask_ is a src_mask + bool is_TxT_mask = (mask_type == 0) && input_.dim() == 4 && mask_.dim() == 2 && input_.size(3) == mask_.size(1) && input_.size(2) == mask_.size(0) && mask_.size(0) == mask_.size(1); TORCH_CHECK(mask_.sizes() == input_.sizes() || is_BxT_mask || is_TxT_mask, "Mask shape should match input. mask: ", mask_.sizes(), " input: ", input_.sizes()); auto input = input_.dim() == 0 ? input_.view(1) : input_; diff --git a/aten/src/ATen/native/cuda/TensorFactories.cu b/aten/src/ATen/native/cuda/TensorFactories.cu index 6e05908b2ccea..03711b194a983 100644 --- a/aten/src/ATen/native/cuda/TensorFactories.cu +++ b/aten/src/ATen/native/cuda/TensorFactories.cu @@ -294,10 +294,10 @@ Tensor tril_indices_cuda( cuda::getApplyGrid(tril_size, dim_grid, tensor.get_device()), "unable to get dim grid"); - AT_DISPATCH_ALL_TYPES_AND(at::ScalarType::Half, tensor.scalar_type(), "tril_indices_cuda", [&] { + AT_DISPATCH_INDEX_TYPES(tensor.scalar_type(), "tril_indices_cuda", [&] { tril_indices_kernel<<< dim_grid, dim_block, 0, at::cuda::getCurrentCUDAStream()>>>( - tensor.data_ptr(), + tensor.data_ptr(), trapezoid_row_offset, m_first_row, col, @@ -372,10 +372,10 @@ Tensor triu_indices_cuda( cuda::getApplyGrid(triu_size, dim_grid, tensor.get_device()), "unable to get dim grid"); - AT_DISPATCH_ALL_TYPES_AND(at::ScalarType::Half, tensor.scalar_type(), "triu_indices_cuda", [&] { + AT_DISPATCH_INDEX_TYPES(tensor.scalar_type(), "triu_indices_cuda", [&] { triu_indices_kernel<<< dim_grid, dim_block, 0, at::cuda::getCurrentCUDAStream()>>>( - tensor.data_ptr(), + tensor.data_ptr(), std::max(0, offset), m_first_row, col, diff --git a/aten/src/ATen/native/cuda/TensorTopK.cu b/aten/src/ATen/native/cuda/TensorTopK.cu index 1caf3ec576086..631d887c1a01f 100644 --- a/aten/src/ATen/native/cuda/TensorTopK.cu +++ b/aten/src/ATen/native/cuda/TensorTopK.cu @@ -405,13 +405,23 @@ __global__ void computeBlockwiseWithinKCounts( int current_bit, bool largest, // outputs: - uint32_t* withinKCounts // size: num_slices * blocks_per_slice == num_blocks + uint32_t* withinKCounts, // size: num_slices * blocks_per_slice == num_blocks + uint32_t num_blocks ) { // This kernel should be launched with the same number of blocks as the `radixFindKthValues` kernel. int tidx = threadIdx.x; uint32_t block_idx = getLinearBlockId(); uint32_t slice_idx = block_idx / blocks_per_slice; + // The grid is computed from `getGridFromTiles`, when there are lots of + // elements, we will use both blockIdx.x and blockIdx.y, and maybe blockIdx.z + // when this is the case, the number of blocks that we are launching can be + // more than the number of blocks we need. So we need to check the range of + // `block_idx`. 
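The early return added just below is the whole fix: when a large linear block count is folded into a 2-D/3-D grid (capped at 65535 per axis), the grid is rounded up, so the flattened block index can exceed the number of blocks that actually have work. A standalone sketch of the pattern follows, with illustrative names (`linear_block_id`, `tiled_kernel`, `grid_from_tiles`) rather than the real `getGridFromTiles` helper.

```
#include <cuda_runtime.h>
#include <cstddef>

// Flatten (blockIdx.x, y, z) back into the linear block index the work was
// partitioned by.
__device__ unsigned int linear_block_id() {
  return blockIdx.x + blockIdx.y * gridDim.x +
         blockIdx.z * gridDim.x * gridDim.y;
}

__global__ void tiled_kernel(float* data, unsigned int num_blocks) {
  const unsigned int block_idx = linear_block_id();
  // The grid was rounded up to respect the per-axis limit, so trailing blocks
  // may have no work assigned; bail out instead of indexing past the buffer.
  if (block_idx >= num_blocks) {
    return;
  }
  const size_t i = static_cast<size_t>(block_idx) * blockDim.x + threadIdx.x;
  data[i] += 1.0f;  // illustrative work
}

// Host side: split num_blocks across grid axes, capping each axis at 65535.
// The product of the axes can exceed num_blocks, hence the in-kernel guard.
inline dim3 grid_from_tiles(unsigned long long num_blocks) {
  const unsigned long long cap = 65535;
  unsigned long long x = num_blocks, y = 1, z = 1;
  if (x > cap) { y = (x + cap - 1) / cap; x = cap; }
  if (y > cap) { z = (y + cap - 1) / cap; y = cap; }
  return dim3(static_cast<unsigned>(x), static_cast<unsigned>(y),
              static_cast<unsigned>(z));
}
```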
+ if (block_idx >= num_blocks) { + return; + } + Bitwise desired = doLdg(desires + slice_idx); Bitwise desired_digit = at::cuda::Bitfield::getBitfield(desired, current_bit, RADIX_BITS); @@ -702,7 +712,7 @@ void launch( C10_CUDA_KERNEL_LAUNCH_CHECK(); #if CUB_SUPPORTS_SCAN_BY_KEY() computeBlockwiseWithinKCounts<<>>( - desired, counts, blocks_per_slice, current_bit, largest, withinKCounts); + desired, counts, blocks_per_slice, current_bit, largest, withinKCounts, num_blocks); C10_CUDA_KERNEL_LAUNCH_CHECK(); #endif desiredMask = at::cuda::Bitfield::setBitfield(desiredMask, RADIX_MASK, current_bit, RADIX_BITS); diff --git a/aten/src/ATen/native/cuda/UnaryComplexKernels.cu b/aten/src/ATen/native/cuda/UnaryComplexKernels.cu index 0589c3ba4f0dd..a04194b1117e5 100644 --- a/aten/src/ATen/native/cuda/UnaryComplexKernels.cu +++ b/aten/src/ATen/native/cuda/UnaryComplexKernels.cu @@ -1,6 +1,7 @@ #define TORCH_ASSERT_NO_OPERATORS #include #include +#include #include #include #include @@ -58,22 +59,10 @@ void angle_kernel_cuda(TensorIteratorBase& iter) { } } -// We manually overload conj because std::conj does not work types other than c10::complex. -template -__host__ __device__ static inline scalar_t conj_wrapper(scalar_t v) { - return v; -} - -template -__host__ __device__ static inline c10::complex conj_wrapper(c10::complex v) { - return std::conj(v); -} - // NB: Ignores the negative bit on tensors const char conj_name[] = "conj_kernel"; void conj_kernel_cuda(TensorIteratorBase& iter) { - auto common_dtype = iter.common_dtype(); - if (common_dtype == kComplexHalf) { + auto conj_chalf = [&] { using scalar_t = c10::complex; #if AT_USE_JITERATOR() static const auto conj_string = jiterator_stringify( @@ -85,17 +74,23 @@ void conj_kernel_cuda(TensorIteratorBase& iter) { jitted_gpu_kernel(iter, conj_string); #else gpu_kernel(iter, [] GPU_LAMBDA(scalar_t a) -> scalar_t { - return conj_wrapper(a); + return std::conj(a); }); #endif - } else { - AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3( - kBool, kBFloat16, kHalf, iter.common_dtype(), "conj_cuda", [&]() { - gpu_kernel(iter, [] GPU_LAMBDA(scalar_t a) -> scalar_t { - return conj_wrapper(a); - }); - }); - } + }; + + AT_DISPATCH_SWITCH(iter.common_dtype(), "conj_cuda", + AT_DISPATCH_CASE_ALL_TYPES_AND3(kBool, kBFloat16, kHalf, [&] { + // Conj is a no-op for non-complex types + direct_copy_kernel_cuda(iter); + }) + AT_DISPATCH_CASE_COMPLEX_TYPES([&] { + gpu_kernel(iter, [] GPU_LAMBDA(scalar_t a) -> scalar_t { + return std::conj(a); + }); + }) + AT_DISPATCH_CASE(kComplexHalf, conj_chalf) + ); } REGISTER_DISPATCH(angle_stub, &angle_kernel_cuda); diff --git a/aten/src/ATen/native/cuda/UnarySpecialOpsKernel.cu b/aten/src/ATen/native/cuda/UnarySpecialOpsKernel.cu index 0cb0d9f238cf5..2481fd6028960 100644 --- a/aten/src/ATen/native/cuda/UnarySpecialOpsKernel.cu +++ b/aten/src/ATen/native/cuda/UnarySpecialOpsKernel.cu @@ -151,7 +151,9 @@ void sigmoid_kernel_cuda(TensorIteratorBase& iter) { } else { AT_DISPATCH_FLOATING_TYPES_AND2(at::ScalarType::Half, at::ScalarType::BFloat16, common_dtype, "sigmoid_cuda", [&]() { gpu_kernel(iter, []GPU_LAMBDA(scalar_t a) -> scalar_t { - return scalar_t{1} / (scalar_t{1} + std::exp(-a)); + using opmath_t = at::opmath_type; + const auto one = opmath_t{1}; + return static_cast(one/(one + std::exp(-opmath_t{a}))); }); }); } @@ -179,8 +181,9 @@ void sinc_kernel_cuda(TensorIteratorBase& iter) { return scalar_t(1); } else { // NVCC says constexpr var is not accessible from device - scalar_t product = c10::detail::pi() * a; - return 
std::sin(product) / product; + using opmath_t = at::opmath_type; + opmath_t product = c10::detail::pi() * opmath_t{a}; + return static_cast(std::sin(product) / product); } }); }); diff --git a/aten/src/ATen/native/cuda/block_reduce.cuh b/aten/src/ATen/native/cuda/block_reduce.cuh index e01cd0b060f53..fa75c71f8acaf 100644 --- a/aten/src/ATen/native/cuda/block_reduce.cuh +++ b/aten/src/ATen/native/cuda/block_reduce.cuh @@ -29,24 +29,43 @@ __inline__ __device__ T WarpReduceSum(T val) { return val; } +struct Block1D { + static __forceinline__ __device__ int Tid() { return threadIdx.x; } + + static __forceinline__ __device__ int Warps() { + return blockDim.x / C10_WARP_SIZE; + } +}; + +struct Block2D { + static __forceinline__ __device__ int Tid() { + return threadIdx.x + threadIdx.y * blockDim.x; + } + + static __forceinline__ __device__ int Warps() { + return blockDim.x * blockDim.y / C10_WARP_SIZE; + } +}; + // Sums `val` across all threads in a block. // +// Warning: the return value is only valid for thread 0. // Assumptions: -// - Thread blocks are an 1D set of threads (indexed with `threadIdx.x` only) // - The size of each block should be a multiple of `C10_WARP_SIZE` // - `shared` should be a pointer to shared memory with size of, at least, // `sizeof(T) * number_of_warps` -template +template __inline__ __device__ T BlockReduceSum(T val, T* shared) { - const int lid = threadIdx.x % C10_WARP_SIZE; - const int wid = threadIdx.x / C10_WARP_SIZE; + const int tid = B::Tid(); + const int lid = tid % C10_WARP_SIZE; + const int wid = tid / C10_WARP_SIZE; val = WarpReduceSum(val); - __syncthreads(); + __syncthreads(); // prevent races when BlockReduces are called in a row. if (lid == 0) { shared[wid] = val; } __syncthreads(); - val = (threadIdx.x < blockDim.x / C10_WARP_SIZE) ? shared[lid] : T(0); + val = (tid < B::Warps()) ? shared[lid] : T(0); if (wid == 0) { val = WarpReduceSum(val); } @@ -62,19 +81,19 @@ __inline__ __device__ T WarpReduce(T val, const ReduceOp& op) { return val; } -template +template __inline__ __device__ T BlockReduce(T val, const ReduceOp& op, const T& identity_element, T* shared) { - const int lid = threadIdx.x % C10_WARP_SIZE; - const int wid = threadIdx.x / C10_WARP_SIZE; + const int tid = B::Tid(); + const int lid = tid % C10_WARP_SIZE; + const int wid = tid / C10_WARP_SIZE; val = WarpReduce(val, op); - __syncthreads(); + __syncthreads(); // prevent races when BlockReduces are called in a row. if (lid == 0) { shared[wid] = val; } __syncthreads(); - val = (threadIdx.x < blockDim.x / C10_WARP_SIZE) ? shared[lid] - : identity_element; + val = (tid < B::Warps()) ? shared[lid] : identity_element; if (wid == 0) { val = WarpReduce(val, op); } diff --git a/aten/src/ATen/native/cuda/jit_utils.cpp b/aten/src/ATen/native/cuda/jit_utils.cpp index 673ea9f476e46..f86b624b84f7c 100644 --- a/aten/src/ATen/native/cuda/jit_utils.cpp +++ b/aten/src/ATen/native/cuda/jit_utils.cpp @@ -7,6 +7,7 @@ #include #include #include +#include #include #include #include @@ -40,7 +41,148 @@ namespace at { namespace cuda { namespace jit { +// hiprtc already includes some traits, so this removes duplicate definitions of +// integral_constant, is_same, is_integral, enable_if, is_floating_point, is_arithmetic. +// Copied from aten/src/ATen/cuda/llvm_basic.cpp, then modified as above. +// If not compiling for ROCm, return the original get_traits_string(). 
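A few hunks up, the sigmoid and sinc kernels were changed to do their arithmetic in `at::opmath_type` (float when the storage type is half or bfloat16) and only cast back on the final store, which keeps `exp` and the division out of reduced precision. A standalone sketch of that pattern for `__half` using plain CUDA types; the kernel is illustrative, not the ATen implementation.

```
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// sigmoid(x) = 1 / (1 + exp(-x)), computed in float ("opmath") precision even
// though the tensors are stored as half.
__global__ void sigmoid_half(const __half* in, __half* out, int n) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    const float x = __half2float(in[i]);       // widen on load
    const float y = 1.0f / (1.0f + expf(-x));  // math in float
    out[i] = __float2half(y);                  // narrow on store
  }
}
```

This mirrors how the patch widens with `opmath_t{a}` before calling `std::exp` and narrows the result with `static_cast<scalar_t>`.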
+std::string get_traits_string_but_hiprtc_safe() { +#ifdef USE_ROCM + return R"ESCAPE( +namespace std { + +template +_Tp&& __declval(int); +template +_Tp __declval(long); +template +decltype(__declval<_Tp>(0)) declval() noexcept; + +template struct remove_const {typedef _Tp type;}; +template struct remove_const {typedef _Tp type;}; +template using remove_const_t = typename remove_const<_Tp>::type; + +template struct remove_volatile {typedef _Tp type;}; +template struct remove_volatile {typedef _Tp type;}; +template using remove_volatile_t = typename remove_volatile<_Tp>::type; + +template struct remove_cv +{typedef typename remove_volatile::type>::type type;}; +template using remove_cv_t = typename remove_cv<_Tp>::type; + +template struct __libcpp_is_floating_point : public false_type {}; +template <> struct __libcpp_is_floating_point : public true_type {}; +template <> struct __libcpp_is_floating_point : public true_type {}; +template <> struct __libcpp_is_floating_point : public true_type {}; + +template +inline constexpr bool is_arithmetic_v = is_arithmetic<_Tp>::value; + +template +struct __numeric_type +{ + static void __test(...); + static float __test(float); + static double __test(char); + static double __test(int); + static double __test(unsigned); + static double __test(long); + static double __test(unsigned long); + static double __test(long long); + static double __test(unsigned long long); + static double __test(double); + static long double __test(long double); + + typedef decltype(__test(declval<_Tp>())) type; + static const bool value = !is_same::value; +}; + +template <> +struct __numeric_type +{ + static const bool value = true; +}; + +// __promote + +template ::value && + __numeric_type<_A2>::value && + __numeric_type<_A3>::value> +class __promote_imp +{ +public: + static const bool value = false; +}; + +template +class __promote_imp<_A1, _A2, _A3, true> +{ +private: + typedef typename __promote_imp<_A1>::type __type1; + typedef typename __promote_imp<_A2>::type __type2; + typedef typename __promote_imp<_A3>::type __type3; +public: + typedef decltype(__type1() + __type2() + __type3()) type; + static const bool value = true; +}; + +template +class __promote_imp<_A1, _A2, void, true> +{ +private: + typedef typename __promote_imp<_A1>::type __type1; + typedef typename __promote_imp<_A2>::type __type2; +public: + typedef decltype(__type1() + __type2()) type; + static const bool value = true; +}; + +template +class __promote_imp<_A1, void, void, true> +{ +public: + typedef typename __numeric_type<_A1>::type type; + static const bool value = true; +}; + +template +class __promote : public __promote_imp<_A1, _A2, _A3> {}; + +} // namespace std +)ESCAPE"; +#else + return get_traits_string(); +#endif +} + +#ifdef USE_ROCM +const std::string jit_preamble = R"ESCAPE( +#pragma clang force_cuda_host_device begin +)ESCAPE"; +const std::string jit_epilogue = R"ESCAPE( +#pragma clang force_cuda_host_device end +)ESCAPE"; +#else +const std::string jit_preamble; +const std::string jit_epilogue; +#endif + const std::string jit_common_types = R"ESCAPE( + #ifdef __HIPCC__ + #define ERROR_UNSUPPORTED_CAST ; + // corresponds to aten/src/ATen/native/cuda/thread_constants.h + #define CUDA_OR_ROCM_NUM_THREADS 256 + // corresponds to aten/src/ATen/cuda/detail/OffsetCalculator.cuh + #define MAX_DIMS 16 + #ifndef __forceinline__ + #define __forceinline__ inline __attribute__((always_inline)) + #endif + #else + //TODO use _assert_fail, because assert is disabled in non-debug builds + #define 
ERROR_UNSUPPORTED_CAST assert(false); + #define CUDA_OR_ROCM_NUM_THREADS 128 + #define MAX_DIMS 25 + #endif #define POS_INFINITY __int_as_float(0x7f800000) #define INFINITY POS_INFINITY #define NEG_INFINITY __int_as_float(0xff800000) @@ -54,11 +196,9 @@ const std::string jit_common_types = R"ESCAPE( static_assert(sizeof(int64_t) == 8, "expected size does not match"); static_assert(sizeof(uint32_t) == 4, "expected size does not match"); static_assert(sizeof(int8_t) == 1, "expected size does not match"); - constexpr int num_threads = 128; + constexpr int num_threads = CUDA_OR_ROCM_NUM_THREADS; constexpr int thread_work_size = 4; // TODO: make template substitution once we decide where those vars live constexpr int block_work_size = thread_work_size * num_threads; - //TODO use _assert_fail, because assert is disabled in non-debug builds - #define ERROR_UNSUPPORTED_CAST assert(false); ${traits_string} ${cmath_string} @@ -146,15 +286,22 @@ struct alignas(2) Half { Half() = default; inline __host__ __device__ Half(float value){ +#ifdef __HIPCC__ + x = __half_as_short(__float2half(value)); +#else asm("{ cvt.rn.f16.f32 %0, %1;}\n" : "=h"(x) : "f"(value)); +#endif } inline __host__ __device__ operator float() const{ +#ifdef __HIPCC__ + return __half2float(*reinterpret_cast(&x)); +#else float val; asm("{ cvt.f32.f16 %0, %1;}\n" : "=f"(val) : "h"(x)); // do we need const cast here? //asm("{ cvt.f32.f16 %0, %1;}\n" : "=f"(val) : "h"(__HALF_TO_CUS(x))); return val; +#endif } - }; } )ESCAPE"; @@ -201,9 +348,18 @@ struct alignas(2) BFloat16 { } inline __host__ __device__ operator float() const{ +#ifdef __HIPCC__ + union + { + uint32_t int32; + float fp32; + } u = {uint32_t(x) << 16}; + return u.fp32; +#else float val; asm("{ mov.b32 %0, {0,%1};}\n" : "=f"(val) : "h"(x)); //do we need const cast here? 
return val; +#endif } }; @@ -450,7 +606,7 @@ const std::string offset_calc_template = R"ESCAPE( } #pragma unroll - for (int dim = 0; dim < 25; ++dim) { + for (int dim = 0; dim < MAX_DIMS; ++dim) { if (dim == dims) { break; } @@ -469,9 +625,9 @@ const std::string offset_calc_template = R"ESCAPE( } int dims; - IntDivider sizes_[25]; + IntDivider sizes_[MAX_DIMS]; // NOTE: this approach will not support nInputs == 0 - ${index_type} strides_[25][NARGS]; + ${index_type} strides_[MAX_DIMS][NARGS]; }; @@ -501,7 +657,7 @@ const std::string jit_code_template = R"ESCAPE( int idx = blockIdx.x; int remaining = numel - block_work_size * idx; - auto thread_idx = threadIdx.x; + int thread_idx = threadIdx.x; #pragma unroll for (int j = 0; j < thread_work_size; j++){ @@ -592,7 +748,7 @@ const std::string jit_vectorized_code_template = R"ESCAPE( constexpr int vec_size = ${vec_size}; using scalar_t = ${scalar_type}; int remaining = N - block_work_size * blockIdx.x; - auto thread_idx = threadIdx.x; + int thread_idx = threadIdx.x; int idx = blockIdx.x; ${declare_load_arrays} ${declare_store_arrays} @@ -651,6 +807,49 @@ const std::string jit_vectorized_code_template = R"ESCAPE( } )ESCAPE"; +static void replace_all(std::string& s, const std::string& to_replace, const std::string& replace_with) { + std::ostringstream oss; + std::size_t pos = 0; + std::size_t prev_pos = pos; + + while (true) { + prev_pos = pos; + pos = s.find(to_replace, pos); + if (pos == std::string::npos) + break; + oss << s.substr(prev_pos, pos - prev_pos); + oss << replace_with; + pos += to_replace.size(); + } + + oss << s.substr(prev_pos); + s = oss.str(); +} + +// hipify replaces certain device math functions, e.g., std::max -> ::max +// See torch/utils/hipify/cuda_to_hip_mappings.py. +// Replace them back. Search for " ::" to avoid duplicate replacements. +static std::string unhipify_math_functions(const std::string &original) { + static std::vector> mappings = { + {" std::max", " ::max"}, + {" std::min", " ::min"}, + {" std::ceil", " ::ceil"}, + {" std::floor", " ::floor"}, + {" std::exp", " ::exp"}, + {" std::log", " ::log"}, + {" std::pow", " ::pow"}, + {" std::fabs", " ::fabs"}, + {" std::fmod", " ::fmod"}, + {" std::remainder", " ::remainder"}, + {" std::frexp", " ::frexp"} + }; + std::string ret = original; + for (const auto& mapping : mappings) { + replace_all(ret, mapping.second, mapping.first); + } + return ret; +} + // The following is copied from fused_kernel.cpp // TODO: refactor codegenOutputQuery into its own file // that can be included by both files @@ -668,7 +867,12 @@ void codegenOutputQuery( int& nvrtc_major, int& nvrtc_minor, bool& compile_to_sass) { - +#ifdef USE_ROCM + AT_CUDA_NVRTC_CHECK(nvrtc().nvrtcVersion(&nvrtc_major, &nvrtc_minor)); + cuda_major = prop->major; + cuda_minor = prop->minor; + compile_to_sass = false; +#else AT_CUDA_NVRTC_CHECK(nvrtc().nvrtcVersion(&nvrtc_major, &nvrtc_minor)); TORCH_CHECK( nvrtc_major >= 6, "NVRTC versions less than 6 are not supported. 
Is: ", nvrtc_major); @@ -711,6 +915,7 @@ void codegenOutputQuery( // compile to sass is not allowed prior to CUDA 11.1 compile_to_sass = false; #endif +#endif } // TODO: another copy paste from jit, refactor so it's usable from both @@ -764,7 +969,7 @@ constexpr int thread_work_size = THREAD_WORK_SIZE; std::string generate_code( int nInputs, int nOutputs, - const std::string& func, + const std::string& func_, const std::string& name, const std::string& f_inputs_type, const std::string& compute_type, @@ -776,6 +981,7 @@ std::string generate_code( bool vectorized, int vec_size, bool return_by_ref) { + std::string func = func_; at::jit::TemplateEnv env; env.s("index_type", "unsigned int"); @@ -887,11 +1093,16 @@ std::string generate_code( f_inputs_type == "std::complex" || result_type == "std::complex" || f_inputs_type == "std::complex" || result_type == "std::complex") { // complex depends on complex and Half dtypes. - env.s("traits_string", get_traits_string()); + env.s("traits_string", get_traits_string_but_hiprtc_safe()); env.s("complex_body_string", get_complex_body_string()); env.s("complex_math_string", get_complex_math_string()); +#ifdef USE_ROCM + // unhipify math functions, but only if std::complex is used. + func = unhipify_math_functions(func); + env.s("functor", func); +#endif } else if (dynamic_casting) { - env.s("traits_string", get_traits_string()); + env.s("traits_string", get_traits_string_but_hiprtc_safe()); env.s("complex_body_string", get_complex_body_string()); env.s("complex_math_string", ""); } else { @@ -948,7 +1159,8 @@ std::string generate_code( } env.s("store_outputs", store_outputs.str()); - static auto cuda_template = at::jit::CodeTemplate(jit_common_types + offset_calc_template + jit_code_template); + static auto cuda_template = at::jit::CodeTemplate( + jit_preamble + jit_common_types + offset_calc_template + jit_code_template + jit_epilogue); const auto code = cuda_template.format(env); return code; } @@ -1014,7 +1226,8 @@ std::string generate_code( } env.s("store_unrolled_outputs", store_unrolled_outputs.str()); - static auto cuda_template = at::jit::CodeTemplate(jit_common_types + jit_vectorized_code_template); + static auto cuda_template = at::jit::CodeTemplate( + jit_preamble + jit_common_types + jit_vectorized_code_template + jit_epilogue); const auto code = cuda_template.format(env); return code; } @@ -1114,7 +1327,7 @@ std::string generate_reduction_code( std::string generate_reduction_code( int nOutputs, - const std::string& func, + const std::string& func_, const std::string& name, const int vt0, const std::string& f_inputs_type, @@ -1124,6 +1337,7 @@ std::string generate_reduction_code( bool vectorized, int vec_size, int max_threads_codegen) { + std::string func = func_; at::jit::TemplateEnv env; env.s("index_type", "unsigned int"); env.s("scalar_type", f_inputs_type); @@ -1149,10 +1363,14 @@ std::string generate_reduction_code( f_inputs_type == "std::complex" || f_inputs_type == "std::complex" ) { // complex depends on complex and Half dtypes. - env.s("traits_string", get_traits_string()); + env.s("traits_string", get_traits_string_but_hiprtc_safe()); env.s("complex_body_string", get_complex_body_string()); env.s("complex_math_string", get_complex_math_string()); env.s("complex", std::to_string(1)); +#ifdef USE_ROCM + // unhipify math functions, but only if std::complex is used. 
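The call that follows undoes hipify's source-level rewrite of `std::` math calls inside the user-supplied functor string before it is compiled with hiprtc. Below is a hedged, self-contained sketch of that round trip; it is a simplified stand-in for the `replace_all`/`unhipify_math_functions` pair added earlier in this file, with a fewer-entry mapping table and a made-up functor string.

```
#include <iostream>
#include <string>
#include <utility>
#include <vector>

// Replace every occurrence of `from` in `s` with `to`.
static void replace_all(std::string& s, const std::string& from, const std::string& to) {
  for (std::size_t pos = s.find(from); pos != std::string::npos;
       pos = s.find(from, pos + to.size())) {
    s.replace(pos, from.size(), to);
  }
}

// Map hipify's " ::max"-style rewrites back to " std::max".
static std::string unhipify_math_functions(std::string s) {
  static const std::vector<std::pair<std::string, std::string>> mappings = {
      {" ::max", " std::max"}, {" ::min", " std::min"},
      {" ::exp", " std::exp"}, {" ::log", " std::log"}, {" ::pow", " std::pow"},
  };
  for (const auto& m : mappings) {
    replace_all(s, m.first, m.second);
  }
  return s;
}

int main() {
  // A functor body as it might look after hipify rewrote the std:: calls.
  std::string func = "T c = ::max(a, b) + ::exp(a);";
  std::cout << unhipify_math_functions(func) << "\n";
  // prints: T c = std::max(a, b) + std::exp(a);
}
```

Matching on the leading space (" ::max") is what the real mapping table relies on to avoid replacing text it has already rewritten.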
+ func = unhipify_math_functions(func); +#endif } else { env.s("traits_string", ""); env.s("complex_body_string", ""); @@ -1168,7 +1386,7 @@ std::string generate_reduction_code( env.s("functor", func); env.s("output_vec_size", std::to_string(vec_size)); static auto cuda_template = at::jit::CodeTemplate( - jit_common_types + offset_calc_template + get_reduction_template()); + jit_preamble + jit_common_types + offset_calc_template + get_reduction_template() + jit_epilogue); const auto code = cuda_template.format(env); return code; } @@ -1312,6 +1530,9 @@ NvrtcFunction jit_pwise_function( AT_CUDA_NVRTC_CHECK(nvrtc.nvrtcCreateProgram( &program, code.c_str(), nullptr, 0, nullptr, nullptr)); +#ifdef USE_ROCM + std::vector args = {"--std=c++14"}; +#else // Constructs nvrtc build arguments // CUDA 11.1 allows going directly to SASS (sm_) instead of PTX (compute_) // which gives better backwards compatibility to work on older driver, @@ -1326,6 +1547,7 @@ NvrtcFunction jit_pwise_function( // NOLINTNEXTLINE(cppcoreguidelines-init-variables) std::vector args = { "--std=c++14", compute.c_str(), "-default-device"}; +#endif #ifndef NDEBUG // Add line info to generated kernels diff --git a/aten/src/ATen/native/cuda/jit_utils.h b/aten/src/ATen/native/cuda/jit_utils.h index 13aa723db2756..8206f67316e11 100644 --- a/aten/src/ATen/native/cuda/jit_utils.h +++ b/aten/src/ATen/native/cuda/jit_utils.h @@ -8,7 +8,6 @@ #include #include #include -#include namespace at { namespace cuda { namespace jit { diff --git a/aten/src/ATen/native/cuda/layer_norm_kernel.cu b/aten/src/ATen/native/cuda/layer_norm_kernel.cu index 96d700c761ebf..ae09f0aaad8f8 100644 --- a/aten/src/ATen/native/cuda/layer_norm_kernel.cu +++ b/aten/src/ATen/native/cuda/layer_norm_kernel.cu @@ -684,7 +684,7 @@ void LayerNormKernelImplInternal( auto can_vectorize = [&](const T * ptr, int alignment){uint64_t addr = reinterpret_cast(ptr); return addr % alignment == 0;}; constexpr int num_vec_elems = vec_size; constexpr int alignment = num_vec_elems * sizeof(T); - if ((std::is_same::value || std::is_same::value) && + if ((std::is_same::value || std::is_same::value || std::is_same::value) && N <= 1ULL << std::numeric_limits::digits && N % num_vec_elems == 0 && can_vectorize(X_data, alignment) && can_vectorize(Y_data, alignment)) { launch_vectorized_layer_norm_kernel(static_cast(N), M, eps, X_data, gamma_data, beta_data, Y_data, mean_data, rstd_data); diff --git a/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp b/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp index 320c799f23bce..061e7e86de8bb 100644 --- a/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp +++ b/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp @@ -24,7 +24,6 @@ #include #else #include -#include #include #include #include @@ -115,20 +114,6 @@ void magmaLuNoPivBatched( magma_int_t m, magma_int_t n, scalar_t** dA_array, magma_int_t ldda, magma_int_t* info_array, magma_int_t batchsize, const MAGMAQueue& magma_queue); -template -inline magma_int_t magmaGetriOptimalBlocksize(magma_int_t n); - -template -void magmaGetri( - magma_int_t n, scalar_t* dA, magma_int_t ldda, magma_int_t* ipiv, scalar_t* dwork, - magma_int_t lwork, magma_int_t* info); - -template -void magmaGetriBatched( - magma_int_t n, scalar_t** dA_array, magma_int_t ldda, - magma_int_t** ipiv_array, scalar_t** dinvA_array, magma_int_t lddia, - magma_int_t* info_array, magma_int_t batchsize, const MAGMAQueue& magma_queue); - template void magmaCholeskySolve( magma_uplo_t uplo, magma_int_t n, magma_int_t 
nrhs, scalar_t* dA, magma_int_t ldda, @@ -400,154 +385,6 @@ void magmaLuNoPivBatched>( AT_CUDA_CHECK(cudaGetLastError()); } -template<> -inline magma_int_t magmaGetriOptimalBlocksize(magma_int_t n) { - return magma_get_dgetri_nb(n); -} - -template<> -inline magma_int_t magmaGetriOptimalBlocksize(magma_int_t n) { - return magma_get_sgetri_nb(n); -} - -template <> -inline magma_int_t magmaGetriOptimalBlocksize>( - magma_int_t n) { - return magma_get_zgetri_nb(n); -} - -template <> -inline magma_int_t magmaGetriOptimalBlocksize>( - magma_int_t n) { - return magma_get_cgetri_nb(n); -} - -template<> -void magmaGetri( - magma_int_t n, double* dA, magma_int_t ldda, magma_int_t* ipiv, double* dwork, - magma_int_t lwork, magma_int_t* info) { - MagmaStreamSyncGuard guard; - magma_dgetri_gpu(n, dA, ldda, ipiv, dwork, lwork, info); - AT_CUDA_CHECK(cudaGetLastError()); -} - -template<> -void magmaGetri( - magma_int_t n, float* dA, magma_int_t ldda, magma_int_t* ipiv, float* dwork, - magma_int_t lwork, magma_int_t* info) { - MagmaStreamSyncGuard guard; - magma_sgetri_gpu(n, dA, ldda, ipiv, dwork, lwork, info); - AT_CUDA_CHECK(cudaGetLastError()); -} - -template <> -void magmaGetri>( - magma_int_t n, - c10::complex* dA, - magma_int_t ldda, - magma_int_t* ipiv, - c10::complex* dwork, - magma_int_t lwork, - magma_int_t* info) { - MagmaStreamSyncGuard guard; - magma_zgetri_gpu( - n, - reinterpret_cast(dA), - ldda, - ipiv, - reinterpret_cast(dwork), - lwork, - info); - AT_CUDA_CHECK(cudaGetLastError()); -} - -template <> -void magmaGetri>( - magma_int_t n, - c10::complex* dA, - magma_int_t ldda, - magma_int_t* ipiv, - c10::complex* dwork, - magma_int_t lwork, - magma_int_t* info) { - MagmaStreamSyncGuard guard; - magma_cgetri_gpu( - n, - reinterpret_cast(dA), - ldda, - ipiv, - reinterpret_cast(dwork), - lwork, - info); - AT_CUDA_CHECK(cudaGetLastError()); -} - -template<> -void magmaGetriBatched( - magma_int_t n, double** dA_array, magma_int_t ldda, - magma_int_t** ipiv_array, double** dinvA_array, magma_int_t lddia, - magma_int_t* info_array, magma_int_t batchsize, const MAGMAQueue& magma_queue) { - magma_dgetri_outofplace_batched(n, dA_array, ldda, ipiv_array, dinvA_array, lddia, info_array, batchsize, magma_queue.get_queue()); - AT_CUDA_CHECK(cudaGetLastError()); -} - -template<> -void magmaGetriBatched( - magma_int_t n, float** dA_array, magma_int_t ldda, - magma_int_t** ipiv_array, float** dinvA_array, magma_int_t lddia, - magma_int_t* info_array, magma_int_t batchsize, const MAGMAQueue& magma_queue) { - magma_sgetri_outofplace_batched(n, dA_array, ldda, ipiv_array, dinvA_array, lddia, info_array, batchsize, magma_queue.get_queue()); - AT_CUDA_CHECK(cudaGetLastError()); -} - -template <> -void magmaGetriBatched>( - magma_int_t n, - c10::complex** dA_array, - magma_int_t ldda, - magma_int_t** ipiv_array, - c10::complex** dinvA_array, - magma_int_t lddia, - magma_int_t* info_array, - magma_int_t batchsize, - const MAGMAQueue& magma_queue) { - magma_zgetri_outofplace_batched( - n, - reinterpret_cast(dA_array), - ldda, - ipiv_array, - reinterpret_cast(dinvA_array), - lddia, - info_array, - batchsize, - magma_queue.get_queue()); - AT_CUDA_CHECK(cudaGetLastError()); -} - -template <> -void magmaGetriBatched>( - magma_int_t n, - c10::complex** dA_array, - magma_int_t ldda, - magma_int_t** ipiv_array, - c10::complex** dinvA_array, - magma_int_t lddia, - magma_int_t* info_array, - magma_int_t batchsize, - const MAGMAQueue& magma_queue) { - magma_cgetri_outofplace_batched( - n, - reinterpret_cast(dA_array), - 
ldda, - ipiv_array, - reinterpret_cast(dinvA_array), - lddia, - info_array, - batchsize, - magma_queue.get_queue()); - AT_CUDA_CHECK(cudaGetLastError()); -} - template<> void magmaCholeskySolve( magma_uplo_t uplo, magma_int_t n, magma_int_t nrhs, double* dA, magma_int_t ldda, @@ -1319,156 +1156,6 @@ void ldl_solve_kernel( REGISTER_CUDA_DISPATCH(ldl_factor_stub, &ldl_factor_kernel) REGISTER_CUDA_DISPATCH(ldl_solve_stub, &ldl_solve_kernel) -// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ inverse ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -/* -Computes the inverse of n-by-n matrix 'self', it is saved to 'self_inv'. -'infos' is an int Tensor containing error codes for each matrix in the batched input. -'infos_lu' is for holding magmaLU errors, and 'infos_getri' is for holding magmaGetri errors -For more information see MAGMA's documentation for GETRI and GETRF routines. -*/ -template -static void apply_batched_inverse(Tensor& self, Tensor& self_inv, Tensor& infos_lu, Tensor& infos_getri) { -#if !AT_MAGMA_ENABLED() -AT_ERROR("inverse: MAGMA library not found in " - "compilation. Please rebuild with MAGMA."); -#else - auto self_data = self.data_ptr(); - auto self_mat_stride = matrixStride(self); - auto self_inv_data = self_inv.data_ptr(); - auto self_inv_mat_stride = matrixStride(self_inv); - - auto infos_lu_data = infos_lu.data_ptr(); - auto infos_getri_data = infos_getri.data_ptr(); - - magma_int_t batch_size = magma_int_cast(batchCount(self), "batchCount"); - // MAGMA does not work with batch_size == 0, let's return early in this case - if (batch_size == 0) { - return; - } - - magma_int_t n = magma_int_cast(self.size(-2), "self.size(-2)"); - magma_int_t lda = std::max(1, n); - - magma_int_t* ipiv_data; - magma_int_t** ipiv_array; - scalar_t** self_array; - scalar_t** self_inv_array; - - ALLOCATE_ARRAY(ipiv_data, magma_int_t, batch_size * lda); - ALLOCATE_ARRAY(ipiv_array, magma_int_t*, batch_size); - ALLOCATE_ARRAY(self_array, scalar_t*, batch_size); - ALLOCATE_ARRAY(self_inv_array, scalar_t*, batch_size); - - // Set up the created arrays - for (int64_t i = 0; i < batch_size; i++) { - self_array[i] = &self_data[i * self_mat_stride]; - self_inv_array[i] = &self_inv_data[i * self_inv_mat_stride]; - ipiv_array[i] = &ipiv_data[i * n]; - } - // magmaLuBatched leaves ipiv_data values unwritten for singular matrices. - // Initialize to avoid memory access violations inside magma kernels (gh-51930). 
- std::fill_n(ipiv_data, batch_size * n, 1); - - MAGMAQueue magma_queue(self.get_device()); - magmaLuBatched( - n, n, self_array, lda, ipiv_array, infos_lu_data, - batch_size, magma_queue); - - constexpr int64_t batch_limit = 65535; - // Compute as many batches of 65535 possible - // The number of "mini"-batches are floor(batch_size / batch_limit) - // and these cover floor(batch_size / batch_limit) * batch_limit matrix solves - int64_t mini_batches = batch_size / batch_limit, mini_idx; - for (mini_idx = 0; mini_idx < mini_batches * batch_limit; mini_idx += batch_limit) { - scalar_t** self_array_cur = &self_array[mini_idx]; - scalar_t** self_inv_array_cur = &self_inv_array[mini_idx]; - magma_int_t** ipiv_array_cur = &ipiv_array[mini_idx]; - magma_int_t* info_array_cur_getri = &infos_getri_data[mini_idx]; - - magmaGetriBatched( - n, self_array_cur, lda, ipiv_array_cur, self_inv_array_cur, - lda, info_array_cur_getri, batch_limit, magma_queue); - } - - // Compute whatever is left = batch_size - floor(batch_size / batch_limit) * batch_limit - // which concisely is equal to batch_size % batch_limit - if (batch_size % batch_limit != 0) { - magmaGetriBatched( - n, &self_array[mini_idx], lda, &ipiv_array[mini_idx], &self_inv_array[mini_idx], - lda, &infos_getri_data[mini_idx], batch_size % batch_limit, magma_queue); - } -#endif -} - -template -static void apply_single_inverse(Tensor& self, Tensor& info_lu, Tensor& info_getri) { -#if !AT_MAGMA_ENABLED() -AT_ERROR("inverse: MAGMA library not found in " - "compilation. Please rebuild with MAGMA."); -#else - auto self_data = self.data_ptr(); - magma_int_t n = magma_int_cast(self.size(-2), "self.size(-2)"); - magma_int_t lda = std::max(1, n); - magma_int_t lwork = n * magmaGetriOptimalBlocksize(n); - - // magmaLu and magmaGetri requires info argument to live on CPU - // but info_lu and info_getri tensors are on the same device as self - magma_int_t info_lu_cpu = 0; - magma_int_t info_getri_cpu = 0; - - Tensor ipiv = at::empty({lda}, at::kInt); - Tensor dwork = at::empty({lwork}, self.options()); - magmaLu(n, n, self_data, lda, ipiv.data_ptr(), &info_lu_cpu); - magmaGetri( - n, self_data, lda, ipiv.data_ptr(), dwork.data_ptr(), lwork, &info_getri_cpu); - info_lu.fill_(info_lu_cpu); - info_getri.fill_(info_getri_cpu); -#endif -} - - -// This is a type dispatching helper function for 'apply_batched_inverse' and 'singleCheckErrors' -Tensor& _linalg_inv_out_helper_cuda_legacy(Tensor& result, Tensor& infos_lu, Tensor& infos_getri) { - // assuming result is in column major order and contains the matrices to invert - if (result.dim() > 2) { - auto input_working_copy = cloneBatchedColumnMajor(result); - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES(result.scalar_type(), "linalg_inv_out_cuda", [&]{ - apply_batched_inverse( - input_working_copy, result, infos_lu, infos_getri); - }); - } else { - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES(result.scalar_type(), "linalg_inv_out_cuda", [&]{ - apply_single_inverse(result, infos_lu, infos_getri); - }); - } - return result; -} - -// This is a MAGMA/cuSOLVER dispatching helper function -Tensor& _linalg_inv_out_helper_cuda(Tensor &result, Tensor& infos_lu, Tensor& infos_getri) { - // This function calculates the inverse matrix in-place - // result should be in column major order and contain matrices to invert -#ifdef USE_CUSOLVER - auto preferred_backend = at::globalContext().linalgPreferredBackend(); - switch (preferred_backend) { - case at::LinalgBackend::Cusolver: - return _linalg_inv_out_helper_cuda_lib(result, infos_lu, 
infos_getri); // cusolver or cublas - case at::LinalgBackend::Magma: - return _linalg_inv_out_helper_cuda_legacy(result, infos_lu, infos_getri); // magma-cuda - default: - if (batchCount(result) <= 2 || !use_magma_) { - return _linalg_inv_out_helper_cuda_lib(result, infos_lu, infos_getri); // cusolver or cublas - } else { - return _linalg_inv_out_helper_cuda_legacy(result, infos_lu, infos_getri); // magma-cuda - } - } -#else - return _linalg_inv_out_helper_cuda_legacy(result, infos_lu, infos_getri); // magma-cuda -#endif - return result; -} - // ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ cholesky_solve ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ template @@ -1928,7 +1615,12 @@ static void lu_factor(const Tensor& input, const Tensor& pivots, const Tensor& i const auto preferred_backend = at::globalContext().linalgPreferredBackend(); #ifdef USE_CUSOLVER const auto lu_factor_cusolver = [batch_size, m, n](const Tensor& input, const Tensor& pivots, const Tensor& infos, bool compute_pivots) { - if (batch_size == 1 || m != n || m >= 512) { + // In CUDA 10.2, lu_factor_looped_cusolver does not finish the computations when the input + // matrix is exactly singular. The returned pivots contain garbage. This breaks linalg.det + // Now, batched_cublas does not handle rectangular matrices, so we still dispatch to + // looped_cusolver even if m != n. + constexpr bool looped_correct = CUSOLVER_VERSION >= 11100; + if (m != n || (looped_correct && (batch_size == 1 || m >= 512))) { lu_factor_looped_cusolver(input, pivots, infos, compute_pivots); } else { lu_factor_batched_cublas(input, pivots, infos, compute_pivots); @@ -3254,8 +2946,7 @@ struct DispatchInitializer { DispatchInitializer() { cuda::detail::LinalgDispatch disp{ _symeig_helper_cuda, _cholesky_solve_helper_cuda, - legacy_lstsq_cuda, - _linalg_inv_out_helper_cuda}; + legacy_lstsq_cuda}; cuda::detail::registerLinalgDispatch(disp); }; } initializer; diff --git a/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.cpp b/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.cpp index d3109d866a592..d80b93b3da098 100644 --- a/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.cpp +++ b/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.cpp @@ -471,101 +471,6 @@ inline static Tensor column_major_identity_matrix_like(const Tensor& self) { return at::ones(size_slice, self.options()).diag_embed().mT(); } -template -inline static void _apply_single_inverse_helper(scalar_t* self_ptr, scalar_t* self_inv_ptr, int* ipiv_ptr, int* info_getrf_ptr, int* info_getrs_ptr, int n, int lda) { - // self_inv_ptr should already be an identity matrix - - auto handle = at::cuda::getCurrentCUDASolverDnHandle(); - at::cuda::solver::getrf(handle, n, n, self_ptr, lda, ipiv_ptr, info_getrf_ptr); - at::cuda::solver::getrs(handle, n, n, self_ptr, lda, ipiv_ptr, self_inv_ptr, lda, info_getrs_ptr, CUBLAS_OP_N); -} - -template -static void apply_batched_inverse_lib(Tensor& self, Tensor& self_inv, Tensor& infos_getrf, Tensor& infos_getrs) { - const int batch_size = cuda_int_cast(batchCount(self), "batchCount"); - const int n = cuda_int_cast(self.size(-2), "self.size(-2)"); - const int lda = std::max(1, n); - - auto self_data = self.data_ptr(); - auto self_mat_stride = matrixStride(self); - auto self_inv_data = self_inv.data_ptr(); - auto self_inv_mat_stride = matrixStride(self_inv); - - auto infos_getrf_data = infos_getrf.data_ptr(); - auto infos_getrs_data = infos_getrs.data_ptr(); - - auto& allocator = *::c10::cuda::CUDACachingAllocator::get(); - - // Heuristic: For small batch size or large 
matrix size, we use for-loop to iterate over the batches instead of - // calling the batched cublas routine. - if (batch_size <= 8 || /* batch_size > 8 && */ n >= 512) { - for (int64_t i = 0; i < batch_size; i++) { - auto dataPtr = allocator.allocate(sizeof(int) * lda); - int* pivot = reinterpret_cast(dataPtr.get()); - - int* infos_getrf_working_ptr = &infos_getrf_data[i]; - int* infos_getrs_working_ptr = &infos_getrs_data[i]; - - _apply_single_inverse_helper( - &self_data[i * self_mat_stride], &self_inv_data[i * self_inv_mat_stride], pivot, infos_getrf_working_ptr, infos_getrs_working_ptr, n, lda); - } - } else { - // cublas batched kernels require input be "device array of device pointers" - Tensor self_array = at::arange( - reinterpret_cast(self_data), - reinterpret_cast(&self_data[(batch_size-1) * self_mat_stride]) + 1, - static_cast(self_mat_stride * sizeof(scalar_t)), self.options().dtype(at::kLong)); - Tensor self_inv_array = at::arange( - reinterpret_cast(self_inv_data), - reinterpret_cast(&self_inv_data[(batch_size-1) * self_inv_mat_stride]) + 1, - static_cast(self_inv_mat_stride * sizeof(scalar_t)), self.options().dtype(at::kLong)); - - auto dataPtr = allocator.allocate(sizeof(int)*batch_size*lda); - int* ipiv_array = reinterpret_cast(dataPtr.get()); - - at::cuda::blas::getrfBatched(n, reinterpret_cast(self_array.data_ptr()), lda, - ipiv_array, infos_getrf_data, batch_size); - - at::cuda::blas::getriBatched(n, reinterpret_cast(self_array.data_ptr()), lda, - ipiv_array, reinterpret_cast(self_inv_array.data_ptr()), lda, infos_getrs_data, batch_size); - } -} - -template -static void apply_single_inverse_lib(const Tensor& self, Tensor& self_inv, Tensor& infos_getrf, Tensor& infos_getrs) { - int n = cuda_int_cast(self.size(-2), "self.size(-2)"); - int lda = std::max(1, n); - - Tensor ipiv = at::empty({lda}, self.options().dtype(at::kInt)); - - _apply_single_inverse_helper( - self.data_ptr(), self_inv.data_ptr(), ipiv.data_ptr(), infos_getrf.data_ptr(), infos_getrs.data_ptr(), n, lda); -} - -// This is a type dispatching helper function for 'apply_batched_inverse_lib' and 'apply_single_inverse_lib' -Tensor& _linalg_inv_out_helper_cuda_lib(Tensor& result, Tensor& infos_getrf, Tensor& infos_getrs) { - // assuming result is in column major order and contains the matrices to invert - Tensor input_working_copy = cloneBatchedColumnMajor(result); - - // for getrf + getrs (cusolver path) - // result should be filled with identity matrices - result.zero_(); - result.diagonal(/*offset=*/0, /*dim1=*/-2, /*dim2=*/-1).fill_(1); - - if (result.dim() > 2) { - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES(result.scalar_type(), "linalg_inv_out_cuda", [&]{ - apply_batched_inverse_lib( - input_working_copy, result, infos_getrf, infos_getrs); - }); - } else { - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES(result.scalar_type(), "linalg_inv_out_cuda", [&]{ - apply_single_inverse_lib(input_working_copy, result, infos_getrf, infos_getrs); - }); - } - - return result; -} - // call cusolver gesvd function to calculate svd template inline static void apply_svd_cusolver_gesvd(const Tensor& A, const Tensor& U, const Tensor& S, const Tensor& V, diff --git a/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.h b/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.h index adee8cc9eb4ea..65b4f9b577178 100644 --- a/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.h +++ b/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.h @@ -59,10 +59,6 @@ void lu_solve_batched_cublas(const Tensor& LU, const Tensor& pivots, const 
Tenso #ifdef USE_CUSOLVER -// entrance of calculations of `inverse` using cusolver getrf + getrs, cublas getrfBatched + getriBatched -Tensor _inverse_helper_cuda_lib(const Tensor& self); -Tensor& _linalg_inv_out_helper_cuda_lib(Tensor& result, Tensor& infos_getrf, Tensor& infos_getrs); - // entrance of calculations of `svd` using cusolver gesvdj and gesvdjBatched void svd_cusolver(const Tensor& A, const bool full_matrices, const bool compute_uv, const c10::optional& driver, const Tensor& U, const Tensor& S, const Tensor& V, const Tensor& info); @@ -91,7 +87,6 @@ struct LinalgDispatch { std::tuple (*symeig_helper)(const Tensor& self, bool eigenvectors, bool upper); Tensor (*cholesky_solve_helper)(const Tensor& self, const Tensor& A, bool upper); std::tuple (*legacy_lstsq)(const Tensor &B, const Tensor &A); - Tensor& (*inv_out_helper)(Tensor &result, Tensor& infos_lu, Tensor& infos_getri); }; C10_EXPORT void registerLinalgDispatch(const LinalgDispatch&); }} // namespace cuda::detail diff --git a/aten/src/ATen/native/cuda/reduction_template.cuh b/aten/src/ATen/native/cuda/reduction_template.cuh index 4d9d559d8ec8a..a38edb538256d 100644 --- a/aten/src/ATen/native/cuda/reduction_template.cuh +++ b/aten/src/ATen/native/cuda/reduction_template.cuh @@ -4,11 +4,22 @@ namespace cuda { const std::string reduction_template_0 = R"ESCAPE( #define C10_HOST_DEVICE __host__ __device__ #define C10_DEVICE __device__ + #if defined(__clang__) && defined(__HIP__) + #ifndef __forceinline__ + #define __forceinline__ inline __attribute__((always_inline)) + #endif + // until ROCm support for kernel asserts is restored + #define assert(expr) (static_cast(0)) + #endif template __device__ __forceinline__ T WARP_SHFL_DOWN(T value, unsigned int delta, int width = warpSize, unsigned int mask = 0xffffffff) { + #if defined(__clang__) && defined(__HIP__) + return __shfl_down(value, delta, width); + #else return __shfl_down_sync(mask, value, delta, width); + #endif } @@ -17,8 +28,13 @@ const std::string reduction_template_0 = R"ESCAPE( __device__ __forceinline__ std::complex WARP_SHFL_DOWN(std::complex value, unsigned int delta, int width = warpSize, unsigned int mask = 0xffffffff) { return std::complex( + #if defined(__clang__) && defined(__HIP__) + __shfl_down(value.real(), delta, width), + __shfl_down(value.imag(), delta, width)); + #else __shfl_down_sync(mask, value.real(), delta, width), __shfl_down_sync(mask, value.imag(), delta, width)); + #endif } #endif diff --git a/aten/src/ATen/native/cudnn/Conv_v7.cpp b/aten/src/ATen/native/cudnn/Conv_v7.cpp index 5225fff3bc234..63968fd2072f9 100644 --- a/aten/src/ATen/native/cudnn/Conv_v7.cpp +++ b/aten/src/ATen/native/cudnn/Conv_v7.cpp @@ -131,7 +131,7 @@ struct Workspace { // Sometimes cuDNN returns a workspace size > 2^63, this could makes the allocation of // workspace fail with some 64bit indexing error instead of an OOM error. In such case, // we manually fail with OOM. 
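Context for the exception-type rename in the hunks below: the surrounding cuDNN workspace code guards absurd size requests up front and otherwise shrinks and retries on OOM. A minimal, self-contained sketch of that pattern is shown here, assuming stand-in names — the allocator, the size cap, and the `OutOfMemoryError` type are illustrative only, not the real c10/CUDA APIs:

```
// Sketch only: mimics the "guard huge workspace sizes, then halve-and-retry
// on OOM" pattern; all names here are stand-ins for illustration.
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <stdexcept>

struct OutOfMemoryError : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Hypothetical raw allocator: pretend only 1 MiB is available.
void* allocate_raw(std::size_t size) {
  constexpr std::size_t cap = std::size_t{1} << 20;
  if (size > cap) throw OutOfMemoryError("allocation failed");
  return std::malloc(size);
}

void* allocate_workspace_with_retry(std::size_t size) {
  // Up-front guard, analogous to the `size < 1_TiB` check in the diff:
  // fail with an OOM-style error instead of letting a bogus huge size
  // surface later as a confusing 64-bit indexing failure.
  constexpr std::size_t one_tib = std::size_t{1} << 40;
  if (size >= one_tib) throw OutOfMemoryError("Not enough memory for workspace!");

  while (size > 0) {
    try {
      return allocate_raw(size);
    } catch (const OutOfMemoryError&) {
      size /= 2;  // shrink and retry, like the workspace halving loop in Conv_v8.cpp
    }
  }
  throw OutOfMemoryError("workspace allocation failed");
}

int main() {
  void* ws = allocate_workspace_with_retry(std::size_t{8} << 20);  // ask for 8 MiB
  std::cout << "got workspace at " << ws << "\n";
  std::free(ws);
}
```

In the patch itself the same idea appears as `TORCH_CHECK_WITH(OutOfMemoryError, size < 1_TiB, ...)` plus `catch (c10::OutOfMemoryError&)` blocks that halve `max_workspace_size`; the hunks below only rename the caught exception type.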
- TORCH_CHECK_WITH(CUDAOutOfMemoryError, size < 1_TiB, "Not enough memory for workspace!"); + TORCH_CHECK_WITH(OutOfMemoryError, size < 1_TiB, "Not enough memory for workspace!"); data = c10::cuda::CUDACachingAllocator::raw_alloc(size); } Workspace(const Workspace&) = delete; @@ -505,7 +505,7 @@ class AlgoIterator { try { f(algoPerf); return; - } catch (c10::CUDAOutOfMemoryError &e) { + } catch (c10::OutOfMemoryError &e) { cudaGetLastError(); // clear CUDA error } } @@ -516,7 +516,7 @@ class AlgoIterator { f(algoPerf); cache.insert(args.params, algoPerf); return; - } catch (c10::CUDAOutOfMemoryError &e) { + } catch (c10::OutOfMemoryError &e) { cudaGetLastError(); // clear CUDA error } catch (c10::CuDNNError &e) { cudaGetLastError(); // clear CUDA error @@ -530,7 +530,7 @@ inline Tensor allocate_workspace(size_t size, const Tensor &other) { // Sometimes cuDNN returns a workspace size > 2^63, this could makes the allocation of // workspace fail with some 64bit indexing error instead of an OOM error. In such case, // we manually fail with OOM. - TORCH_CHECK_WITH(CUDAOutOfMemoryError, size < 1_TiB, "Not enough memory for workspace!"); + TORCH_CHECK_WITH(OutOfMemoryError, size < 1_TiB, "Not enough memory for workspace!"); return at::empty({static_cast(size)}, other.options().dtype(kByte)); } diff --git a/aten/src/ATen/native/cudnn/Conv_v8.cpp b/aten/src/ATen/native/cudnn/Conv_v8.cpp index 7d5664b12cf51..2ad8d4ffe37c1 100644 --- a/aten/src/ATen/native/cudnn/Conv_v8.cpp +++ b/aten/src/ATen/native/cudnn/Conv_v8.cpp @@ -332,13 +332,13 @@ void generate_and_filter_plans(const cudnnHandle_t handle, cudnn_frontend::Opera valid_plans.emplace_back(std::move(plan)); } }); - TORCH_CHECK_WITH(CUDAOutOfMemoryError, max_workspace_size < 1_TiB, "Not enough memory for workspace!"); + TORCH_CHECK_WITH(OutOfMemoryError, max_workspace_size < 1_TiB, "Not enough memory for workspace!"); bool remove_invalid = false; while (max_workspace_size) { try { workspace_ptr = c10::cuda::CUDACachingAllocator::get()->allocate(max_workspace_size); break; - } catch (c10::CUDAOutOfMemoryError &e) { + } catch (c10::OutOfMemoryError &e) { max_workspace_size /= 2; cudaGetLastError(); // clear CUDA error remove_invalid = true; @@ -449,7 +449,7 @@ void try_plans(cudnn_frontend::executionPlans_t& plans, const CacheKey& key, con benchmark_cache.emplace(key, plan); return; } catch (cudnn_frontend::cudnnException &e) {} catch (CuDNNError &e) {} - catch (c10::CUDAOutOfMemoryError &e) { + catch (c10::OutOfMemoryError &e) { cudaGetLastError(); // clear CUDA error } } @@ -463,7 +463,7 @@ void try_plans_fused(cudnn_frontend::executionPlans_t& plans, const CacheKeyFuse benchmark_cache_fused.emplace(key, plan); return; } catch (cudnn_frontend::cudnnException &e) {} catch (CuDNNError &e) {} - catch (c10::CUDAOutOfMemoryError &e) { + catch (c10::OutOfMemoryError &e) { cudaGetLastError(); // clear CUDA error } } @@ -484,7 +484,7 @@ void try_configs(cudnn_frontend::EngineConfigList& configs, const std::string& o benchmark_cache.emplace(key, plan); return; } catch (cudnn_frontend::cudnnException &e) {} catch(CuDNNError &e) {} - catch (c10::CUDAOutOfMemoryError &e) { + catch (c10::OutOfMemoryError &e) { cudaGetLastError(); // clear CUDA error } } @@ -505,7 +505,7 @@ void try_configs_fused(cudnn_frontend::EngineConfigList& configs, const std::str benchmark_cache_fused.emplace(key, plan); return; } catch (cudnn_frontend::cudnnException &e) {} catch(CuDNNError &e) {} - catch (c10::CUDAOutOfMemoryError &e) { + catch (c10::OutOfMemoryError &e) { 
cudaGetLastError(); // clear CUDA error } } @@ -525,7 +525,7 @@ void run_single_conv(const cudnnBackendDescriptorType_t operation, try { run_conv_plan(handle, x, y, w, *search); return; - } catch(c10::CUDAOutOfMemoryError &e) { + } catch(c10::OutOfMemoryError &e) { cudaGetLastError(); // clear CUDA error } } @@ -561,7 +561,7 @@ void run_fused_conv(const Tensor& x, const Tensor& y, const Tensor& w, const Ten try { run_conv_plan_fused(handle, x, y, w, z, b, *search); return; - } catch(c10::CUDAOutOfMemoryError &e) { + } catch(c10::OutOfMemoryError &e) { cudaGetLastError(); // clear CUDA error } } diff --git a/aten/src/ATen/native/layer_norm.cpp b/aten/src/ATen/native/layer_norm.cpp index f9ef7e097e017..0b0245896dfa4 100644 --- a/aten/src/ATen/native/layer_norm.cpp +++ b/aten/src/ATen/native/layer_norm.cpp @@ -206,7 +206,17 @@ std::tuple math_native_layer_norm( const int normalized_ndim = normalized_shape.size(); // NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions) const int axis = input_ndim - normalized_ndim; - at::Tensor input_reshaped = input.view({1, M, -1}); + + // Properly handle zero-size inputs: the view(1, M, -1) call below breaks on this. + if (input.numel() == 0) { + auto result_type = c10::promoteTypes(input.scalar_type(), kFloat); + return std::make_tuple( + at::empty_like(input), + at::empty_like(input, c10::TensorOptions().dtype(result_type)), + at::empty_like(input, c10::TensorOptions().dtype(result_type)) + ); + } + at::Tensor input_reshaped = input.reshape({1, M, -1}); // Unlike Batch Normalization, which applies scalar scale and bias for each // entire channel/plane with the affine option, Layer Normalization applies // per-element scale and bias. E.g. For input {N, C, H, W}, weight for diff --git a/aten/src/ATen/native/miopen/Conv_miopen.cpp b/aten/src/ATen/native/miopen/Conv_miopen.cpp index 61eb209d5adc1..be92f5a311a55 100644 --- a/aten/src/ATen/native/miopen/Conv_miopen.cpp +++ b/aten/src/ATen/native/miopen/Conv_miopen.cpp @@ -102,6 +102,20 @@ std::tuple miopen_depthwise_convolution_backwa AT_ERROR("miopen_depthwise_convolution_backward: ATen not compiled with MIOpen support"); } + +at::Tensor miopen_convolution_add_relu( + const at::Tensor& input, const at::Tensor& weight, const at::Tensor& z, + const c10::optional& alpha, const c10::optional& bias, IntArrayRef stride, + IntArrayRef padding, IntArrayRef dilation, int64_t groups) { + AT_ERROR("miopen_convolution_add_relu: ATen not compiled with MIOpen support"); +} + +at::Tensor miopen_convolution_relu( + const at::Tensor& input, const at::Tensor& weight, const c10::optional& bias, + IntArrayRef stride, IntArrayRef padding, IntArrayRef dilation, int64_t groups) { + AT_ERROR("miopen_convolution_relu: ATen not compiled with MIOpen support"); +} + }} #else // AT_ROCM_ENABLED @@ -1449,6 +1463,219 @@ Tensor miopen_convolution_transpose( return output_t; } +// MIOpen fused convolution bias activation forward +void raw_miopen_convolution_relu_out( + const Tensor& output, + const Tensor& input, + const Tensor& weight, + const Tensor& bias, + IntArrayRef stride, + IntArrayRef padding, + IntArrayRef dilation, + int64_t groups, + bool benchmark, + bool deterministic) { + + auto dataType = getMiopenDataType(input); + miopenConvolutionMode_t c_mode = miopenConvolution; + + ConvolutionArgs args{ input, output, weight }; + args.handle = getMiopenHandle(); + setConvolutionParams(&args.params, args.handle, input, weight, padding, stride, dilation, groups, deterministic); + args.idesc.set(input); + 
args.wdesc.set(weight, input.suggest_memory_format(), 0); + args.odesc.set(output); + args.cdesc.set(dataType, c_mode, input.dim() - 2, args.params.padding, args.params.stride, args.params.dilation, args.params.groups); + + TensorDescriptor bdesc; + bdesc.set(bias.expand({1, bias.size(0)}), output.dim()); + + // Create the fusion plan + miopenFusionPlanDescriptor_t fusePlanDesc; + miopenFusionOpDescriptor_t convoOp; + miopenFusionOpDescriptor_t biasOp; + miopenFusionOpDescriptor_t activOp; + MIOPEN_CHECK(miopenCreateFusionPlan(&fusePlanDesc, miopenVerticalFusion, args.idesc.desc())); + MIOPEN_CHECK(miopenCreateOpConvForward(fusePlanDesc, &convoOp, args.cdesc.desc(), args.wdesc.desc())); + MIOPEN_CHECK(miopenCreateOpBiasForward(fusePlanDesc, &biasOp, bdesc.desc())); + MIOPEN_CHECK(miopenCreateOpActivationForward(fusePlanDesc, &activOp, miopenActivationRELU)); + + // compile fusion plan + MIOPEN_CHECK(miopenCompileFusionPlan(args.handle, fusePlanDesc)); + + // Set the Args + float alpha = static_cast(1); + float beta = static_cast(0); + float activ_alpha = static_cast(0); + float activ_beta = static_cast(0); + float activ_gamma = static_cast(0); + miopenOperatorArgs_t fusionArgs; + MIOPEN_CHECK(miopenCreateOperatorArgs(&fusionArgs)); + MIOPEN_CHECK(miopenSetOpArgsConvForward(fusionArgs, convoOp, &alpha, &beta, weight.data_ptr())); + MIOPEN_CHECK(miopenSetOpArgsBiasForward(fusionArgs, biasOp, &alpha, &beta, bias.data_ptr())); + MIOPEN_CHECK(miopenSetOpArgsActivForward(fusionArgs, activOp, &alpha, &beta, activ_alpha, activ_beta, activ_gamma)); + + miopenExecuteFusionPlan(args.handle, fusePlanDesc, args.idesc.desc(), input.data_ptr(), args.odesc.desc(), output.data_ptr(), fusionArgs); + + // Cleanup + miopenDestroyFusionPlan(fusePlanDesc); +} + +static at::Tensor self_or_new_memory_format(at::Tensor& self, at::MemoryFormat memory_format) { + if (self.is_contiguous(memory_format)) { + return self; + } + return at::empty_like(self, self.options(), memory_format); +} + +Tensor miopen_convolution_add_relu( + const Tensor& input, + const Tensor& weight, + const Tensor& z, + const c10::optional& alpha, + const c10::optional& bias, + IntArrayRef stride, + IntArrayRef padding, + IntArrayRef dilation, + int64_t groups) { + + // MIOpen does not support fusion of add, the alpha2 * z step of the below cuDNN function: + // y = act ( alpha1 * conv(x) + alpha2 * z + bias ) + + auto memory_format = input.suggest_memory_format(); + + auto& ctx = at::globalContext(); + bool benchmark = ctx.benchmarkCuDNN(); + + TensorArg input_arg { input, "input", 1 }, + weight_arg { weight, "weight", 2 }; + auto output = miopen_convolution_forward( + "miopen_convolution_add_relu", + input_arg, + weight_arg, + padding, + stride, + dilation, + groups, + benchmark, + false // deterministic + ); + + auto contig_output = self_or_new_memory_format(output, memory_format); + + if (!output.is_same(contig_output)) { + contig_output.copy_(output); + } + + auto _alpha = alpha.has_value() ? alpha.value().to() : 1.0; + auto _bias = bias.has_value() + ? 
bias.value() + : at::native::zeros( + {contig_output.size(1)}, + optTypeMetaToScalarType(contig_output.options().dtype_opt()), + contig_output.options().layout_opt(), + contig_output.options().device_opt(), + contig_output.options().pinned_memory_opt()); + + at::Tensor alpha_mul_z_add_bias = at::native::reshape_bias(input.dim(), _bias).add(z, _alpha); + contig_output.add_(alpha_mul_z_add_bias); + contig_output.relu_(); + + return contig_output; +} + +Tensor miopen_convolution_relu( + const Tensor& input, + const Tensor& weight, + const c10::optional& bias, + IntArrayRef stride, + IntArrayRef padding, + IntArrayRef dilation, + int64_t groups) { + + auto memory_format = input.suggest_memory_format(); + + auto& ctx = at::globalContext(); + bool benchmark = ctx.benchmarkCuDNN(); + + // MIOpen currently only supports MemoryFormat::Contiguous and fp32 and 2d + if (input.suggest_memory_format() == at::MemoryFormat::Contiguous + && input.scalar_type() == at::kFloat + && input.ndimension() == 4) { + + // FuseFrozenConvAddRelu performs some tensor shape checking + Tensor output_t = at::detail::empty_cuda( + conv_output_size( + input.sizes(), weight.sizes(), padding, stride, dilation), + input.options().memory_format(input.suggest_memory_format())); + if (output_t.numel() == 0) { + return output_t; + } + + auto _bias = bias.has_value() + ? bias.value() + : at::native::zeros( + {output_t.size(1)}, + optTypeMetaToScalarType(output_t.options().dtype_opt()), + output_t.options().layout_opt(), + output_t.options().device_opt(), + output_t.options().pinned_memory_opt()); + + raw_miopen_convolution_relu_out( + output_t, + input, + weight, + _bias, + stride, + padding, + dilation, + groups, + benchmark, // benchmark + false // deterministic + ); + + return output_t; + } + else { + // fallback + + TensorArg input_arg { input, "input", 1 }, + weight_arg { weight, "weight", 2 }; + auto output = miopen_convolution_forward( + "miopen_convolution_relu", + input_arg, + weight_arg, + padding, + stride, + dilation, + groups, + benchmark, + false // deterministic + ); + + auto contig_output = self_or_new_memory_format(output, memory_format); + + if (!output.is_same(contig_output)) { + contig_output.copy_(output); + } + + auto _bias = bias.has_value() + ? 
bias.value() + : at::native::zeros( + {contig_output.size(1)}, + optTypeMetaToScalarType(contig_output.options().dtype_opt()), + contig_output.options().layout_opt(), + contig_output.options().device_opt(), + contig_output.options().pinned_memory_opt()); + + at::Tensor reshaped_bias = at::native::reshape_bias(input.dim(), _bias); + contig_output.add_(reshaped_bias); + contig_output.relu_(); + + return contig_output; + } +} + REGISTER_CUDA_DISPATCH(miopen_convolution_backward_stub, &miopen_convolution_backward); REGISTER_CUDA_DISPATCH(miopen_convolution_transpose_backward_stub, &miopen_convolution_transpose_backward); REGISTER_CUDA_DISPATCH(miopen_depthwise_convolution_backward_stub, &miopen_depthwise_convolution_backward); diff --git a/aten/src/ATen/native/mkldnn/Common.h b/aten/src/ATen/native/mkldnn/Common.h new file mode 100644 index 0000000000000..4e048ebce7597 --- /dev/null +++ b/aten/src/ATen/native/mkldnn/Common.h @@ -0,0 +1,46 @@ +#pragma once + +#include +#include + +#if AT_MKLDNN_ENABLED() + +#include + +namespace at { +namespace native { +namespace mkldnn { + +struct ContextConv final { + ideep::tensor weight_packed_; + c10::optional at_bias_; + std::vector padding_; + std::vector stride_; + std::vector dilation_; + int64_t groups_; + ideep::attr_t attr_; + + ContextConv() = delete; + + ContextConv( + ideep::tensor&& weight_packed, + c10::optional at_bias, + std::vector padding, + std::vector stride, + std::vector dilation, + int64_t groups, + ideep::attr_t attr) + : weight_packed_(std::move(weight_packed)), + at_bias_(std::move(at_bias)), + padding_(padding), + stride_(stride), + dilation_(dilation), + groups_(groups), + attr_(attr) {} +}; + +} // namespace mkldnn +} // namespace native +} // namespace at + +#endif // AT_MKLDNN_ENABLED() diff --git a/aten/src/ATen/native/mkldnn/Conv.cpp b/aten/src/ATen/native/mkldnn/Conv.cpp index 0096a1cda6743..7f285e2cfcb8b 100644 --- a/aten/src/ATen/native/mkldnn/Conv.cpp +++ b/aten/src/ATen/native/mkldnn/Conv.cpp @@ -155,9 +155,17 @@ static void check_shape_forward(const Tensor& input, // but weight/bias and grad_weight/grad_bias are always CPU tensor. // +static inline at::MemoryFormat mkldnn_convolution_memory_format(int64_t dims, bool is_channels_last) { + auto memory_format = at::MemoryFormat::Contiguous; + if (is_channels_last) { + memory_format = dims == 4 ? at::MemoryFormat::ChannelsLast : at::MemoryFormat::ChannelsLast3d; + } + return memory_format; +} + Tensor mkldnn_convolution( - const Tensor& input, - const Tensor& weight, + const Tensor& input_t, + const Tensor& weight_t, const c10::optional& bias_opt, IntArrayRef padding, IntArrayRef stride, @@ -167,15 +175,18 @@ Tensor mkldnn_convolution( c10::MaybeOwned bias_maybe_owned = at::borrow_from_optional_tensor(bias_opt); const Tensor& bias = *bias_maybe_owned; - if (input.scalar_type() == ScalarType::BFloat16) { + if (input_t.scalar_type() == ScalarType::BFloat16) { TORCH_CHECK(mkldnn_bf16_device_check(), "mkldnn_convolution: bf16 path needs the cpu support avx512bw, avx512vl and avx512dq"); } - check_shape_forward(input, weight, bias, padding, stride, dilation, groups); + check_shape_forward(input_t, weight_t, bias, padding, stride, dilation, groups); - bool is_channels_last = input.suggest_memory_format() == at::MemoryFormat::ChannelsLast; + bool is_channels_last = mkldnn_conv_use_channels_last(input_t, weight_t); + auto memory_format = mkldnn_convolution_memory_format(input_t.ndimension(), is_channels_last); + auto input = input_t.is_mkldnn() ? 
input_t : input_t.contiguous(memory_format); + auto weight = weight_t.is_mkldnn() ? weight_t : weight_t.contiguous(memory_format); auto output_sizes = conv_output_size(input.sizes(), weight.sizes(), padding, stride, dilation); auto output = at::empty({0}, input.options()); @@ -184,12 +195,12 @@ Tensor mkldnn_convolution( ideep::tensor y; if (is_channels_last) { - output.resize_(output_sizes, input.suggest_memory_format()); + output.resize_(output_sizes, memory_format); y = itensor_from_tensor(output); } if (bias.defined()) { const ideep::tensor b = itensor_from_tensor(bias); - ideep::convolution_forward::compute( + ideep::convolution_forward::compute_v3( x, w, b, @@ -199,9 +210,10 @@ Tensor mkldnn_convolution( {dilation.begin(), dilation.end()}, {padding.begin(), padding.end()}, {padding.begin(), padding.end()}, - groups); + groups, + is_channels_last); } else { - ideep::convolution_forward::compute( + ideep::convolution_forward::compute_v3( x, w, {output_sizes.cbegin(), output_sizes.cend()}, @@ -210,7 +222,8 @@ Tensor mkldnn_convolution( {dilation.begin(), dilation.end()}, {padding.begin(), padding.end()}, {padding.begin(), padding.end()}, - groups); + groups, + is_channels_last); } if (input.is_mkldnn()) { @@ -224,10 +237,15 @@ Tensor mkldnn_convolution( } Tensor mkldnn_convolution_backward_input( - IntArrayRef input_size, const Tensor& grad_output, const Tensor& weight, - IntArrayRef padding, IntArrayRef stride, IntArrayRef dilation, int64_t groups, bool bias_defined) -{ - bool is_channels_last = grad_output.suggest_memory_format() == at::MemoryFormat::ChannelsLast; + IntArrayRef input_size, + const Tensor& grad_output, + const Tensor& weight, + IntArrayRef padding, + IntArrayRef stride, + IntArrayRef dilation, + int64_t groups, + bool bias_defined, + bool is_channels_last) { auto grad_input = at::empty({0}, grad_output.options()); auto grad_y = itensor_from_tensor(grad_output); @@ -235,10 +253,11 @@ Tensor mkldnn_convolution_backward_input( ideep::tensor grad_x; if (is_channels_last) { - grad_input.resize_(input_size, grad_output.suggest_memory_format()); + auto memory_format = mkldnn_convolution_memory_format(grad_output.ndimension(), is_channels_last); + grad_input.resize_(input_size, memory_format); grad_x = itensor_from_tensor(grad_input); } - ideep::convolution_backward_data::compute( + ideep::convolution_backward_data::compute_v2( grad_y, w, input_size.vec(), @@ -247,7 +266,8 @@ Tensor mkldnn_convolution_backward_input( dilation.vec(), padding.vec(), padding.vec(), - groups); + groups, + is_channels_last); if (grad_output.is_mkldnn()) { return MKLDNNTensor(grad_x, grad_output.options()); @@ -260,17 +280,21 @@ Tensor mkldnn_convolution_backward_input( } std::tuple mkldnn_convolution_backward_weights( - IntArrayRef weight_size, const Tensor& grad_output, const Tensor& input, - IntArrayRef padding, IntArrayRef stride, IntArrayRef dilation, int64_t groups, bool bias_defined) -{ - bool is_channels_last = grad_output.suggest_memory_format() == at::MemoryFormat::ChannelsLast; - + IntArrayRef weight_size, + const Tensor& grad_output, + const Tensor& input, + IntArrayRef padding, + IntArrayRef stride, + IntArrayRef dilation, + int64_t groups, + bool bias_defined, + bool is_channels_last) { const ideep::tensor grad_y = itensor_from_tensor(grad_output); const ideep::tensor x = itensor_from_tensor(input); ideep::tensor grad_w, grad_b; if (bias_defined) { - ideep::convolution_backward_weights::compute( + ideep::convolution_backward_weights::compute_v2( x, grad_y, weight_size.vec(), @@ -280,9 
+304,10 @@ std::tuple mkldnn_convolution_backward_weights( dilation.vec(), padding.vec(), padding.vec(), - groups); + groups, + is_channels_last); } else { - ideep::convolution_backward_weights::compute( + ideep::convolution_backward_weights::compute_v2( x, grad_y, weight_size.vec(), @@ -291,7 +316,8 @@ std::tuple mkldnn_convolution_backward_weights( dilation.vec(), padding.vec(), padding.vec(), - groups); + groups, + is_channels_last); } if (!is_channels_last) { @@ -306,20 +332,23 @@ std::tuple mkldnn_convolution_backward_weights( } std::tuple mkldnn_convolution_backward( - const Tensor& input, const Tensor& grad_output_t, const Tensor& weight, + const Tensor& input_t, const Tensor& grad_output_t, const Tensor& weight_t, IntArrayRef padding, IntArrayRef stride, IntArrayRef dilation, int64_t groups, std::array output_mask) { - auto memory_format = input.suggest_memory_format(); + bool is_channels_last = mkldnn_conv_use_channels_last(input_t, weight_t); + auto memory_format = mkldnn_convolution_memory_format(input_t.ndimension(), is_channels_last); Tensor grad_output = grad_output_t.is_mkldnn() ? grad_output_t : grad_output_t.contiguous(memory_format); + Tensor input = input_t.is_mkldnn() ? input_t : input_t.contiguous(memory_format); + Tensor weight = weight_t.is_mkldnn() ? weight_t : weight_t.contiguous(memory_format); Tensor grad_input, grad_weight, grad_bias; if (output_mask[0]) { grad_input = mkldnn_convolution_backward_input( - input.sizes(), grad_output, weight, padding, stride, dilation, groups, output_mask[2]); + input.sizes(), grad_output, weight, padding, stride, dilation, groups, output_mask[2], is_channels_last); } if (output_mask[1] || output_mask[2]) { std::tie(grad_weight, grad_bias) = mkldnn_convolution_backward_weights( - weight.sizes(), grad_output, input, padding, stride, dilation, groups, output_mask[2]); + weight.sizes(), grad_output, input, padding, stride, dilation, groups, output_mask[2], is_channels_last); } return std::make_tuple(grad_input, grad_weight, grad_bias); diff --git a/aten/src/ATen/native/mkldnn/ConvPrepack.cpp b/aten/src/ATen/native/mkldnn/ConvPrepack.cpp new file mode 100644 index 0000000000000..7670b259b4aec --- /dev/null +++ b/aten/src/ATen/native/mkldnn/ConvPrepack.cpp @@ -0,0 +1,289 @@ +#include + +#include +#include +#include +#include +#include +#include +#include +#include + +#if AT_MKLDNN_ENABLED() + +namespace at { +namespace native { +namespace mkldnn { +namespace internal { +namespace convolution { + +c10::intrusive_ptr createConvPrePackOpContext( + Tensor weight, + c10::optional bias, + std::vector stride, + std::vector padding, + std::vector dilation, + int64_t groups, + std::vector input_size, + std::string attr) { + auto it = fusion_attr_map.find(attr); + TORCH_CHECK(it != fusion_attr_map.end(), "Fusion behavior undefined."); + ideep::attr_t op_attr = it->second; + + return mkldnn::MkldnnConvOpContext::create_context( + std::move(weight), + std::move(bias), + std::move(padding), + std::move(stride), + std::move(dilation), + groups, + std::move(input_size), + op_attr); +} + +ContextConv create( + const Tensor& weight, + const c10::optional& bias, + const IntArrayRef padding, + const IntArrayRef stride, + const IntArrayRef dilation, + const int64_t groups, + const IntArrayRef input_size, + const ideep::attr_t& attr) { + auto k = weight.ndimension(); + int64_t dim = k - 2; + const auto padding_expanded = expand_param_if_needed(padding, "padding", dim); + const auto stride_expanded = expand_param_if_needed(stride, "stride", dim); + const 
auto dilation_expanded = + expand_param_if_needed(dilation, "dilation", dim); + const auto input_size_expanded = + expand_param_if_needed(input_size, "input_size", k); + + c10::impl::ExcludeDispatchKeyGuard edkg(c10::autograd_dispatch_keyset); + auto w = itensor_view_from_dense(weight); + // TODO: what if input is nhwc but w is nchw + bool is_channels_last = + weight.suggest_memory_format() == at::MemoryFormat::ChannelsLast; + ideep::tensor::desc expected_weight_desc = + ideep::convolution_forward::expected_weights_desc( + w.get_dims(), + w.get_data_type(), + {stride_expanded.begin(), stride_expanded.end()}, + {padding_expanded.begin(), padding_expanded.end()}, + {padding_expanded.begin(), padding_expanded.end()}, + {dilation_expanded.begin(), dilation_expanded.end()}, + groups, + ideep::algorithm::convolution_direct, + ideep::prop_kind::forward, + /*x_dtype*/ w.get_data_type(), + {input_size_expanded.begin(), input_size_expanded.end()}, + attr, + is_channels_last); + + ideep::tensor packed_weight; + packed_weight.init(expected_weight_desc); + packed_weight.feed_from(w); + + return ContextConv{ + std::move(packed_weight), + bias.has_value() ? c10::make_optional(*bias) : c10::nullopt, + {padding_expanded.begin(), padding_expanded.end()}, + {stride_expanded.begin(), stride_expanded.end()}, + {dilation_expanded.begin(), dilation_expanded.end()}, + groups, + std::move(attr)}; +} + +void _mkldnn_convolution_out( + const ideep::tensor& x, + ideep::tensor& y, + const ideep::tensor& w, + const c10::optional& b, + IntArrayRef padding, + IntArrayRef stride, + IntArrayRef dilation, + IntArrayRef output_sizes, + int64_t groups, + const ideep::attr_t& attr = ideep::attr_t()) { + if (b.has_value()) { + ideep::convolution_forward::compute_v2( + x, + w, + b.value(), + {output_sizes.cbegin(), output_sizes.cend()}, + y, + {stride.begin(), stride.end()}, + {dilation.begin(), dilation.end()}, + {padding.begin(), padding.end()}, + {padding.begin(), padding.end()}, + groups, + ideep::scale_t(), + ideep::scale_t(), + ideep::scale_t(), + ideep::zero_point_t(), + ideep::zero_point_t(), + attr); + } else { + ideep::convolution_forward::compute_v2( + x, + w, + {output_sizes.cbegin(), output_sizes.cend()}, + y, + {stride.begin(), stride.end()}, + {dilation.begin(), dilation.end()}, + {padding.begin(), padding.end()}, + {padding.begin(), padding.end()}, + groups, + ideep::scale_t(), + ideep::scale_t(), + ideep::scale_t(), + ideep::zero_point_t(), + ideep::zero_point_t(), + attr); + } +} + +void mkldnn_convolution_out( + const Tensor& input, + ideep::tensor& mkldnn_output, + const ideep::tensor& mkldnn_weight, + const c10::optional& bias_opt, + IntArrayRef padding, + IntArrayRef stride, + IntArrayRef dilation, + IntArrayRef output_sizes, + int64_t groups, + const ideep::attr_t& attr = ideep::attr_t()) { + c10::MaybeOwned bias_maybe_owned = + at::borrow_from_optional_tensor(bias_opt); + const Tensor& bias = *bias_maybe_owned; + + c10::impl::ExcludeDispatchKeyGuard edkg(c10::autograd_dispatch_keyset); + const ideep::tensor mkldnn_input = itensor_from_tensor(input); + c10::optional mkldnn_bias{c10::nullopt}; + if (bias.defined()) { + mkldnn_bias = itensor_from_tensor(bias); + } + + _mkldnn_convolution_out( + mkldnn_input, + mkldnn_output, + mkldnn_weight, + mkldnn_bias, + padding, + stride, + dilation, + output_sizes, + groups, + attr); +} + +std::vector get_output_sizes( + ContextConv& context, + const Tensor& input) { + const ideep::tensor& mkldnn_weight = context.weight_packed_; + IntArrayRef padding = 
context.padding_; + IntArrayRef stride = context.stride_; + IntArrayRef dilation = context.dilation_; + + auto kernel_size = mkldnn_weight.get_dims(); + + std::vector input_size = input.sizes().vec(); + return conv_output_size(input_size, kernel_size, padding, stride, dilation); +} + +Tensor run(ContextConv& context, const Tensor& input) { + std::vector output_sizes = get_output_sizes(context, input); + auto output = at::empty( + output_sizes, + input.options().memory_format(input.suggest_memory_format())); + + bool is_channels_last = + input.suggest_memory_format() == at::MemoryFormat::ChannelsLast; + ideep::tensor y; + + c10::impl::ExcludeDispatchKeyGuard edkg(c10::autograd_dispatch_keyset); + ideep::tensor mkldnn_output = itensor_from_tensor(output); + + if (is_channels_last) { + mkldnn_convolution_out( + input, + mkldnn_output, + context.weight_packed_, + context.at_bias_, + context.padding_, + context.stride_, + context.dilation_, + output_sizes, + context.groups_, + context.attr_); + } else { + mkldnn_convolution_out( + input, + y, + context.weight_packed_, + context.at_bias_, + context.padding_, + context.stride_, + context.dilation_, + output_sizes, + context.groups_, + context.attr_); + mkldnn_output.feed_from(y); + } + return output; +} + +void run(ContextConv& context, const Tensor& input, void* output) { + std::vector output_sizes = get_output_sizes(context, input); + + bool is_channels_last = + input.suggest_memory_format() == at::MemoryFormat::ChannelsLast; + ideep::tensor y; + + ideep::tag o_tag = is_channels_last ? ideep::tag::nhwc : ideep::tag::nchw; + ideep::tensor::desc o_desc = { + output_sizes, get_mkldnn_dtype(input.scalar_type()), o_tag}; + ideep::tensor mkldnn_output = {o_desc, output}; + + if (is_channels_last) { + mkldnn_convolution_out( + input, + mkldnn_output, + context.weight_packed_, + context.at_bias_, + context.padding_, + context.stride_, + context.dilation_, + output_sizes, + context.groups_, + context.attr_); + } else { + mkldnn_convolution_out( + input, + y, + context.weight_packed_, + context.at_bias_, + context.padding_, + context.stride_, + context.dilation_, + output_sizes, + context.groups_, + context.attr_); + mkldnn_output.feed_from(y); + } +} + +Tensor conv_run( + const Tensor& input, + const c10::intrusive_ptr& op_context) { + return op_context->run(input); +} + +} // namespace convolution +} // namespace internal +} // namespace mkldnn +} // namespace native +} // namespace at + +#endif // AT_MKLDNN_ENABLED() diff --git a/aten/src/ATen/native/mkldnn/ConvPrepack.h b/aten/src/ATen/native/mkldnn/ConvPrepack.h new file mode 100644 index 0000000000000..03189c5f5e706 --- /dev/null +++ b/aten/src/ATen/native/mkldnn/ConvPrepack.h @@ -0,0 +1,49 @@ +#pragma once + +#include +#include +#include + +#if AT_MKLDNN_ENABLED() + +namespace at { +namespace native { +namespace mkldnn { +namespace internal { +namespace convolution { + +c10::intrusive_ptr createConvPrePackOpContext( + Tensor weight, + c10::optional bias, + std::vector stride, + std::vector padding, + std::vector dilation, + int64_t groups, + std::vector input_size, + std::string attr); + +Tensor conv_run( + const Tensor& input, + const c10::intrusive_ptr& op_context); + +ContextConv create( + const Tensor& weight, + const c10::optional& bias, + const IntArrayRef padding, + const IntArrayRef stride, + const IntArrayRef dilation, + const int64_t groups, + const IntArrayRef input_size, + const ideep::attr_t& attr); + +Tensor run(ContextConv& context, const Tensor& input); + +void run(ContextConv& 
context, const Tensor& input, void* output); + +} // namespace convolution +} // namespace internal +} // namespace mkldnn +} // namespace native +} // namespace at + +#endif // AT_MKLDNN_ENABLED() diff --git a/aten/src/ATen/native/mkldnn/Matmul.cpp b/aten/src/ATen/native/mkldnn/Matmul.cpp index e399e2143dea6..9b07dbfcee5fb 100644 --- a/aten/src/ATen/native/mkldnn/Matmul.cpp +++ b/aten/src/ATen/native/mkldnn/Matmul.cpp @@ -71,7 +71,9 @@ bool mkldnn_bf16_gemm( op_attr = ideep::attr_t::fuse_sum(); } - ideep::tensor::dims a_strides{{1, lda}}, b_strides{{1, ldb}}, c_strides{{1, ldc}}; + // NOTE: View as c-contiguous to avoid extra reordering in mkldnn + // Use identity: C = AB <=> C^T = B^T A^T + ideep::tensor::dims a_strides{{lda, 1}}, b_strides{{ldb, 1}}, c_strides{{ldc, 1}}; if (transa != TransposeType::NoTranspose) { std::swap(a_strides[0], a_strides[1]); } @@ -80,23 +82,23 @@ bool mkldnn_bf16_gemm( } ideep::tensor a({ - /*sizes=*/{m, k}, + /*sizes=*/{k, m}, ideep::tensor::data_type::bf16, /*strides=*/a_strides}, const_cast(a_data)); ideep::tensor b({ - /*sizes=*/{k, n}, + /*sizes=*/{n, k}, ideep::tensor::data_type::bf16, /*strides=*/b_strides}, const_cast(b_data)); ideep::tensor c({ - /*sizes=*/{m, n}, + /*sizes=*/{n, m}, ideep::tensor::data_type::bf16, /*strides=*/c_strides}, c_data); ideep::matmul_forward::compute( - a, b, c, alpha, beta, + b, a, c, alpha, beta, ideep::scale_t(), ideep::scale_t(), ideep::scale_t(), op_attr); if (c.get_data_handle() != c_data){ @@ -104,7 +106,7 @@ bool mkldnn_bf16_gemm( // if given output format is not expected, ideep will re-init an output buffer // under this case, we need copy the re-inited buffer back to given buffer ideep::tensor real_output({ - /*sizes=*/{m, n}, + /*sizes=*/{n, m}, ideep::tensor::data_type::bf16, /*strides=*/c_strides}, c_data); diff --git a/aten/src/ATen/native/mkldnn/OpContext.cpp b/aten/src/ATen/native/mkldnn/OpContext.cpp new file mode 100644 index 0000000000000..2716b4908eb30 --- /dev/null +++ b/aten/src/ATen/native/mkldnn/OpContext.cpp @@ -0,0 +1,47 @@ +#include +#include + +#if AT_MKLDNN_ENABLED() + +namespace at { +namespace native { +namespace mkldnn { + +c10::intrusive_ptr MkldnnConvOpContext::create_context( + at::Tensor&& weight, + c10::optional&& bias, + std::vector&& padding, + std::vector&& stride, + std::vector&& dilation, + int64_t groups, + std::vector&& input_size, + const ideep::attr_t& attr) { + auto op_context = mkldnn::internal::convolution::create( + weight, bias, padding, stride, dilation, groups, input_size, attr); + + auto conv_op_context = c10::make_intrusive( + std::move(weight), + std::move(bias), + std::move(padding), + std::move(stride), + std::move(dilation), + groups, + std::move(input_size), + std::move(op_context)); + + return conv_op_context; +} + +Tensor MkldnnConvOpContext::run(const Tensor& input) { + return mkldnn::internal::convolution::run(op_context_, input); +} + +void MkldnnConvOpContext::run(const Tensor& input, void* output) { + return mkldnn::internal::convolution::run(op_context_, input, output); +} + +} // namespace mkldnn +} // namespace native +} // namespace at + +#endif // AT_MKLDNN_ENABLED() diff --git a/aten/src/ATen/native/mkldnn/OpContext.h b/aten/src/ATen/native/mkldnn/OpContext.h new file mode 100644 index 0000000000000..21e8cc78a5134 --- /dev/null +++ b/aten/src/ATen/native/mkldnn/OpContext.h @@ -0,0 +1,99 @@ +#pragma once + +#include +#include +#include + +#if AT_MKLDNN_ENABLED() + +namespace at { +namespace native { +namespace mkldnn { + +const static std::map 
fusion_attr_map = { + {"none", ideep::attr_t()}, + {"relu", ideep::attr_t::fuse_relu()}, +}; + +using SerializationTypeConvPrePack = std::tuple< + Tensor, + c10::optional, + std::vector, + std::vector, + std::vector, + int64_t, + std::vector, + std::string>; + +class ConvOpContext : public torch::jit::CustomClassHolder { + protected: + Tensor orig_weight_; + c10::optional orig_bias_; + std::vector stride_; + std::vector padding_; + std::vector dilation_; + int64_t groups_; + std::vector input_size_; + std::string attr_; + + public: + SerializationTypeConvPrePack unpack() { + return std::make_tuple( + orig_weight_, + orig_bias_, + stride_, + padding_, + dilation_, + groups_, + input_size_, + attr_); + } + + virtual Tensor run(const Tensor& input) = 0; + virtual void run(const Tensor& input, void* output) = 0; +}; + +class MkldnnConvOpContext final : public ConvOpContext { + private: + ContextConv op_context_; + + public: + MkldnnConvOpContext( + Tensor&& weight, + c10::optional&& bias, + std::vector&& padding, + std::vector&& stride, + std::vector&& dilation, + uint64_t groups, + std::vector&& input_size, + ContextConv&& op_context) + : op_context_(std::move(op_context)) { + orig_weight_ = std::move(weight); + orig_bias_ = std::move(bias); + padding_ = std::move(padding); + stride_ = std::move(stride); + dilation_ = std::move(dilation); + groups_ = groups; + input_size_ = std::move(input_size); + } + + Tensor run(const Tensor& input) override; + + void run(const Tensor& input, void* output) override; + + static c10::intrusive_ptr create_context( + Tensor&& weight, + c10::optional&& bias, + std::vector&& padding, + std::vector&& stride, + std::vector&& dilation, + int64_t groups, + std::vector&& input_size, + const ideep::attr_t& attr); +}; + +} // namespace mkldnn +} // namespace native +} // namespace at + +#endif // AT_MKLDNN_ENABLED() diff --git a/aten/src/ATen/native/mkldnn/Pooling.cpp b/aten/src/ATen/native/mkldnn/Pooling.cpp index 5800bd2247b6e..80cfa2efcc107 100644 --- a/aten/src/ATen/native/mkldnn/Pooling.cpp +++ b/aten/src/ATen/native/mkldnn/Pooling.cpp @@ -2,6 +2,7 @@ #include #include #include +#include #include #include #include @@ -80,6 +81,12 @@ Tensor mkldnn_adaptive_avg_pool2d(Tensor const& input, IntArrayRef output_size) TORCH_CHECK(false, "mkldnn_adaptive_avg_pool2d: ATen not compiled with MKLDNN support"); } +Tensor& mkldnn_adaptive_avg_pool2d_out_stub(const Tensor& input, + IntArrayRef output_size, + Tensor& output) { + TORCH_CHECK(false, "mkldnn_adaptive_avg_pool2d_out_stub: ATen not compiled with MKLDNN support"); +} + Tensor& mkldnn_adaptive_avg_pool2d_out(const Tensor& input, IntArrayRef output_size, Tensor& output) { @@ -498,10 +505,19 @@ Tensor mkldnn_adaptive_avg_pool2d( /*algo*/ ideep::algorithm::pooling_avg); } +Tensor& mkldnn_adaptive_avg_pool2d_out_stub(const Tensor& input, + IntArrayRef output_size, + Tensor& output) { + TORCH_CHECK(false, "mkldnn_adaptive_avg_pool2d_out_stub: in-place mkldnn operations are not supported yet"); +} + Tensor& mkldnn_adaptive_avg_pool2d_out(const Tensor& input, IntArrayRef output_size, Tensor& output) { - TORCH_CHECK(false, "mkldnn_adaptive_avg_pool2d_out: in-place mkldnn operations are not supported yet"); + auto tmp_output = at::native::mkldnn_adaptive_avg_pool2d(input, output_size); + at::native::resize_output(output, tmp_output.sizes()); + output.copy_(tmp_output); + return output; } Tensor mkldnn_max_pool2d_backward( diff --git a/aten/src/ATen/native/mkldnn/RegisterMkldnnOpContextClass.cpp 
b/aten/src/ATen/native/mkldnn/RegisterMkldnnOpContextClass.cpp new file mode 100644 index 0000000000000..44447441f6daa --- /dev/null +++ b/aten/src/ATen/native/mkldnn/RegisterMkldnnOpContextClass.cpp @@ -0,0 +1,60 @@ +#include +#include +#include +#include +#include + +#if AT_MKLDNN_ENABLED() + +namespace at { +namespace native { +namespace mkldnn { + +using namespace internal::convolution; + +TORCH_LIBRARY(mkldnn, m) { + m.class_(TORCH_SELECTIVE_CLASS("ConvOpContext")) + .def_pickle( + [](const c10::intrusive_ptr& op_context) + -> SerializationTypeConvPrePack { // __getstate__ + return op_context->unpack(); + }, + [](SerializationTypeConvPrePack state) + -> c10::intrusive_ptr { // __setstate__ + return createConvPrePackOpContext( + std::move(std::get<0>(state)), + std::move(std::get<1>(state)), + std::move(std::get<2>(state)), + std::move(std::get<3>(state)), + std::move(std::get<4>(state)), + // NOLINTNEXTLINE(performance-move-const-arg,cppcoreguidelines-avoid-magic-numbers) + std::move(std::get<5>(state)), + // NOLINTNEXTLINE(performance-move-const-arg,cppcoreguidelines-avoid-magic-numbers) + std::move(std::get<6>(state)), + // NOLINTNEXTLINE(performance-move-const-arg,cppcoreguidelines-avoid-magic-numbers) + std::move(std::get<7>(state))); + }); +} + +TORCH_LIBRARY(mkldnn_prepacked, m) { + m.def(TORCH_SELECTIVE_SCHEMA( + "mkldnn_prepacked::conv2d_prepack(Tensor W, Tensor? B, int[2] stride, int[2] padding, int[2] dilation, int groups, int[4] input_size, str attr) -> __torch__.torch.classes.mkldnn.ConvOpContext")); + + m.def(TORCH_SELECTIVE_SCHEMA( + "mkldnn_prepacked::conv2d_run(Tensor X, __torch__.torch.classes.mkldnn.ConvOpContext W_prepack) -> Tensor Y")); +} + +TORCH_LIBRARY_IMPL(mkldnn_prepacked, CPU, m) { + m.impl( + TORCH_SELECTIVE_NAME("mkldnn_prepacked::conv2d_prepack"), + TORCH_FN(createConvPrePackOpContext)); + + m.impl( + TORCH_SELECTIVE_NAME("mkldnn_prepacked::conv2d_run"), TORCH_FN(conv_run)); +} + +} // namespace mkldnn +} // namespace native +} // namespace at + +#endif // AT_MKLDNN_ENABLED() diff --git a/aten/src/ATen/native/mps/OperationUtils.mm b/aten/src/ATen/native/mps/OperationUtils.mm index 7e22e2c103a8f..97012dd3b0c9e 100644 --- a/aten/src/ATen/native/mps/OperationUtils.mm +++ b/aten/src/ATen/native/mps/OperationUtils.mm @@ -231,7 +231,12 @@ void printTensorNDArray(const Tensor& t) { MPSGraphTensorData* tdata = [[[MPSGraphTensorData alloc] initWithMTLBuffer:selfBuf shape:selfShape dataType:selfDType] autorelease]; + C10_CLANG_DIAGNOSTIC_PUSH() + #if C10_CLANG_HAS_WARNING("-Wobjc-method-access") + C10_CLANG_DIAGNOSTIC_IGNORE("-Wobjc-method-access") + #endif [tdata printNDArray]; + C10_CLANG_DIAGNOSTIC_POP() } Placeholder::Placeholder(MPSGraphTensor* mpsGraphTensor, const Tensor& src, MPSShape *mpsShape) : _tensor(src) @@ -247,7 +252,7 @@ void printTensorNDArray(const Tensor& t) { if (!_tensor.has_storage()) { // if we cannot gather, we make the the tensor contiguous implicitly, and keep // it in placeholder to be able to retrieve it when we return from constructor - _tensor = src.contiguous(); + _tensor = src.clone(); } srcBuf = getMTLBufferStorage(_tensor); } diff --git a/aten/src/ATen/native/mps/operations/Activation.mm b/aten/src/ATen/native/mps/operations/Activation.mm index b741276b45e01..e929a41be2ce1 100644 --- a/aten/src/ATen/native/mps/operations/Activation.mm +++ b/aten/src/ATen/native/mps/operations/Activation.mm @@ -417,6 +417,10 @@ Tensor relu_mps(const Tensor& self) { using CachedGraph = MPSUnaryCachedGraph; TORCH_CHECK(output.is_mps()); + 
if(output.numel() == 0) { + return; + } + MPSGraphCache* cache_ = MPSGraphCache::getInstance(); MPSStream* stream = getCurrentMPSStream(); @@ -1452,10 +1456,13 @@ Tensor glu_backward_mps (const Tensor& grad_output, if(result.numel() == 0) return; + auto beta_f = beta.to(); + struct CachedGraph : public MPSCachedGraph { CachedGraph(MPSGraph *graph) : MPSCachedGraph(graph) {} MPSGraphTensor *inputTensor_ = nil; + MPSGraphTensor *betaTensor_ = nil; MPSGraphTensor *outputTensor_ = nil; }; @@ -1475,18 +1482,16 @@ Tensor glu_backward_mps (const Tensor& grad_output, @autoreleasepool { MPSGraph* mpsGraph = make_mps_graph(); newCachedGraph = new CachedGraph(mpsGraph); - MPSGraphTensor *inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, self); + MPSGraphTensor* inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, self); - MPSGraphTensor *reluTensor = [mpsGraph reLUWithTensor:inputTensor - name:nil]; + MPSGraphTensor* betaTensor = mpsGraphScalarPlaceHolder(mpsGraph, beta); + MPSGraphTensor* reluTensor = [mpsGraph reLUWithTensor:inputTensor + name:nil]; MPSGraphTensor* unitTensor = [mpsGraph constantWithScalar:1.0 shape:@[@1] dataType:getMPSDataType(self.scalar_type())]; - MPSGraphTensor* betaTensor = [mpsGraph constantWithScalar:beta.to() - shape:@[@1] - dataType:getMPSDataType(self.scalar_type())]; MPSGraphTensor* reciprocalBetaTensor = [mpsGraph reciprocalWithTensor:betaTensor name:nil]; MPSGraphTensor* bxTensor = [mpsGraph multiplicationWithPrimaryTensor:inputTensor @@ -1516,7 +1521,8 @@ Tensor glu_backward_mps (const Tensor& grad_output, name:nil]; newCachedGraph->inputTensor_ = inputTensor; - newCachedGraph->outputTensor_ = outputTensor; + newCachedGraph->betaTensor_ = betaTensor; + newCachedGraph->outputTensor_ = outputTensor; } return newCachedGraph; }); @@ -1527,7 +1533,8 @@ Tensor glu_backward_mps (const Tensor& grad_output, // Create dictionary of inputs and outputs NSDictionary* feeds = @{ - selfPlaceholder.getMPSGraphTensor() : selfPlaceholder.getMPSGraphTensorData() + selfPlaceholder.getMPSGraphTensor() : selfPlaceholder.getMPSGraphTensorData(), + cachedGraph->betaTensor_ : getMPSGraphTensorFromScalar(stream, beta_f, MPSDataTypeFloat32) }; NSDictionary* results = @{ outputPlaceholder.getMPSGraphTensor() : outputPlaceholder.getMPSGraphTensorData() @@ -1550,11 +1557,14 @@ Tensor glu_backward_mps (const Tensor& grad_output, if(grad_input.numel() == 0) return; + auto beta_f = beta.to(); + struct CachedGraph : public MPSCachedGraph { CachedGraph(MPSGraph *graph) : MPSCachedGraph(graph) {} MPSGraphTensor *gradOutputTensor_ = nil; MPSGraphTensor *inputTensor_ = nil; + MPSGraphTensor *betaTensor_ = nil; MPSGraphTensor *outputTensor_ = nil; }; @@ -1574,16 +1584,15 @@ Tensor glu_backward_mps (const Tensor& grad_output, @autoreleasepool { MPSGraph* mpsGraph = make_mps_graph(); newCachedGraph = new CachedGraph(mpsGraph); - MPSGraphTensor *gradOutputTensor = mpsGraphRankedPlaceHolder(mpsGraph, grad_output); + MPSGraphTensor* gradOutputTensor = mpsGraphRankedPlaceHolder(mpsGraph, grad_output); - MPSGraphTensor *inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, self); + MPSGraphTensor* inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, self); + + MPSGraphTensor* betaTensor = mpsGraphScalarPlaceHolder(mpsGraph, beta); MPSGraphTensor* unitTensor = [mpsGraph constantWithScalar:1.0 shape:@[@1] dataType:getMPSDataType(self.scalar_type())]; - MPSGraphTensor* betaTensor = [mpsGraph constantWithScalar:beta.to() - shape:@[@1] - dataType:getMPSDataType(self.scalar_type())]; MPSGraphTensor* bxTensor = [mpsGraph 
multiplicationWithPrimaryTensor:inputTensor secondaryTensor:betaTensor name:nil]; @@ -1611,6 +1620,7 @@ Tensor glu_backward_mps (const Tensor& grad_output, newCachedGraph->gradOutputTensor_ = gradOutputTensor; newCachedGraph->inputTensor_ = inputTensor; + newCachedGraph->betaTensor_ = betaTensor; newCachedGraph->outputTensor_ = outputTensor; } return newCachedGraph; @@ -1624,7 +1634,8 @@ Tensor glu_backward_mps (const Tensor& grad_output, // Create dictionary of inputs and outputs NSDictionary* feeds = @{ gradOutputPlaceholder.getMPSGraphTensor() : gradOutputPlaceholder.getMPSGraphTensorData(), - selfPlaceholder.getMPSGraphTensor() : selfPlaceholder.getMPSGraphTensorData() + selfPlaceholder.getMPSGraphTensor() : selfPlaceholder.getMPSGraphTensorData(), + cachedGraph->betaTensor_ : getMPSGraphTensorFromScalar(stream, beta_f, MPSDataTypeFloat32) }; NSDictionary* results = @{ gradInputPlaceholder.getMPSGraphTensor() : gradInputPlaceholder.getMPSGraphTensorData() diff --git a/aten/src/ATen/native/mps/operations/BinaryOps.mm b/aten/src/ATen/native/mps/operations/BinaryOps.mm index 6e325de38c830..b619307ef8aa1 100644 --- a/aten/src/ATen/native/mps/operations/BinaryOps.mm +++ b/aten/src/ATen/native/mps/operations/BinaryOps.mm @@ -403,5 +403,6 @@ void add_sub_template(const Tensor& self, const Tensor& other, const Scalar& alp runMPSGraph(stream, cachedGraph->graph(), feeds, results); } } + } // namespace native } // namespace at diff --git a/aten/src/ATen/native/mps/operations/BitwiseOps.mm b/aten/src/ATen/native/mps/operations/BitwiseOps.mm new file mode 100644 index 0000000000000..c16818d7d542c --- /dev/null +++ b/aten/src/ATen/native/mps/operations/BitwiseOps.mm @@ -0,0 +1,336 @@ +#include +#include +#include +#include + +namespace { +static const char* BITWISE_OPS_TEMPLATE = R"METAL( + +kernel void bitwise_and_tensor(constant uint& length [[buffer(0)]], + device {0} *out [[buffer(1)]], + device {1} *a [[buffer(2)]], + device {2} *b [[buffer(3)]], + uint offset [[thread_position_in_grid]]) {{ + if (offset >= length) {{ + return; + }} + out[offset] = a[offset] & b [offset]; +}} + +kernel void bitwise_and_scalar(constant uint& length [[buffer(0)]], + device {0} *out [[buffer(1)]], + device {1} *a [[buffer(2)]], + constant {2} &b [[buffer(3)]], + uint offset [[thread_position_in_grid]]) {{ + if (offset >= length) {{ + return; + }} + out[offset] = a[offset] & b; +}} + + +kernel void bitwise_or_tensor(constant uint& length [[buffer(0)]], + device {0} *out [[buffer(1)]], + device {1} *a [[buffer(2)]], + device {2} *b [[buffer(3)]], + uint offset [[thread_position_in_grid]]) {{ + if (offset >= length) {{ + return; + }} + out[offset] = a[offset] | b [offset]; +}} + +kernel void bitwise_or_scalar(constant uint& length [[buffer(0)]], + device {0} *out [[buffer(1)]], + device {1} *a [[buffer(2)]], + constant {2} &b [[buffer(3)]], + uint offset [[thread_position_in_grid]]) {{ + if (offset >= length) {{ + return; + }} + out[offset] = a[offset] | b; +}} + +kernel void bitwise_xor_tensor(constant uint& length [[buffer(0)]], + device {0} *out [[buffer(1)]], + device {1} *a [[buffer(2)]], + device {2} *b [[buffer(3)]], + uint offset [[thread_position_in_grid]]) {{ + if (offset >= length) {{ + return; + }} + out[offset] = a[offset] ^ b [offset]; +}} + +kernel void bitwise_xor_scalar(constant uint& length [[buffer(0)]], + device {0} *out [[buffer(1)]], + device {1} *a [[buffer(2)]], + constant {2} &b [[buffer(3)]], + uint offset [[thread_position_in_grid]]) {{ + if (offset >= length) {{ + return; + }} + 
out[offset] = a[offset] ^ b; +}} + +kernel void bitwise_not(constant uint& length [[buffer(0)]], + device {0} *out [[buffer(1)]], + device {1} *a [[buffer(2)]], + uint offset [[thread_position_in_grid]]) {{ + if (offset >= length) {{ + return; + }} + out[offset] = ~a[offset]; +}} +)METAL"; + + +const std::string& getMetalType(const c10::ScalarType& t) { + // Mapping from c10::ScalarType to integral type that can be used for bitwise ops + // As bitwise ops sign-agnostic map signed/unsigned char and boolean to the same type + static std::unordered_map scalar_to_metal_type = { + {c10::ScalarType::Long, "long"}, + {c10::ScalarType::Int, "int"}, + {c10::ScalarType::Short, "short"}, + {c10::ScalarType::Byte, "char"}, + {c10::ScalarType::Char, "char"}, + {c10::ScalarType::Bool, "char"}, + }; + + auto it = scalar_to_metal_type.find(t); + TORCH_CHECK(it != scalar_to_metal_type.end(), "Unsupported type ", t); + return it->second; +} + +const std::string& getMetalType(const at::Tensor& t) { + return getMetalType(t.scalar_type()); +} + +const std::string& getMetalType(const c10::Scalar& s) { + return getMetalType(s.type()); +} + + +static id compileBitwiseOpsLibrary(id device, + const std::string& t1, + const std::string& t2, + const std::string& t3) { + auto key = t1 + t2 + t3; + static std::unordered_map> libMap; + auto it = libMap.find(key); + if (it != libMap.end()) { + return it->second; + } + NSError *error = nil; + auto rc = [device newLibraryWithSource:[NSString stringWithUTF8String:fmt::format(BITWISE_OPS_TEMPLATE, t1, t2, t3).c_str()] + options:nil + error:&error]; + TORCH_CHECK(rc != nil && error == nil, "Failed to compile library: ", [[error localizedDescription] UTF8String]); + libMap[key] = rc; + return rc; +} + + +static id getCPLState(id device, + const std::string& t1, + const std::string& t2, + const std::string& t3, + const std::string& fname) { + auto key = t1 + t2 + t3 + fname; + static std::unordered_map> cplMap; + auto it = cplMap.find(key); + if (it != cplMap.end()) { + return it->second; + } + NSError *error = nil; + auto library = compileBitwiseOpsLibrary(device, t1, t2, t3); + id func = [library newFunctionWithName:[NSString stringWithUTF8String:fname.c_str()]]; + TORCH_CHECK(func != nil, "Can't get function ", fname); + auto rc = [device newComputePipelineStateWithFunction:func error:&error]; + TORCH_CHECK(rc != nil && error == nil, "Failed to construct pipeline state: ", [[error localizedDescription] UTF8String]); + cplMap[key] = rc; + return rc; +} + +void dispatch1DJob(id commandEncoder, id cplState, uint32_t length) +{ + uint32_t maxThreadsPerGroup = [cplState maxTotalThreadsPerThreadgroup]; + auto size = MTLSizeMake(length, 1, 1); + auto threadGroupSize = MTLSizeMake(std::min(maxThreadsPerGroup, length), 1, 1); + [commandEncoder dispatchThreads:size + threadsPerThreadgroup:threadGroupSize]; +} + +void handle_tensor_tensor_binary_op(const at::Tensor& self, const at::Tensor& other, at::Tensor& output, const std::string& kernel_name) { + using namespace at::mps; + MPSStream* stream = getCurrentMPSStream(); + id cplState = getCPLState(MPSDevice::getInstance()->device(), + getMetalType(output), + getMetalType(self), + getMetalType(other), + kernel_name); + uint32_t length = output.numel(); + dispatch_sync(stream->queue(), ^(){ + id buffer = stream->commandBuffer(); + id commandEncoder = [buffer computeCommandEncoder]; + + id outBuf = __builtin_bit_cast(id, output.storage().data()); + id selfBuf = __builtin_bit_cast(id, self.storage().data()); + id otherBuf = 
__builtin_bit_cast(id, other.storage().data()); + + [commandEncoder pushDebugGroup:[NSString stringWithFormat:@"Dispatch %s kernel", kernel_name.c_str()]]; + [commandEncoder setComputePipelineState:cplState]; + [commandEncoder setBytes:&length length:sizeof(length) atIndex:0]; + [commandEncoder setBuffer:outBuf offset:output.storage_offset()*output.itemsize() atIndex:1]; + [commandEncoder setBuffer:selfBuf offset:self.storage_offset()*self.itemsize() atIndex:2]; + [commandEncoder setBuffer:otherBuf offset:other.storage_offset()*other.itemsize() atIndex:3]; + dispatch1DJob(commandEncoder, cplState, length); + [commandEncoder endEncoding]; + stream->commit(true); + }); +} + +void handle_tensor_scalar_binary_op(const at::Tensor& self, const at::Scalar& other, at::Tensor& output, const std::string& kernel_name) { + using namespace at::mps; + MPSStream* stream = getCurrentMPSStream(); + id cplState = getCPLState(MPSDevice::getInstance()->device(), + getMetalType(output), + getMetalType(self), + getMetalType(other), + kernel_name); + uint64_t sval = other.to(); + uint32_t length = output.numel(); + dispatch_sync(stream->queue(), ^(){ + id buffer = stream->commandBuffer(); + id commandEncoder = [buffer computeCommandEncoder]; + + id outBuf = __builtin_bit_cast(id, output.storage().data()); + id selfBuf = __builtin_bit_cast(id, self.storage().data()); + + [commandEncoder pushDebugGroup:[NSString stringWithFormat:@"Dispatch %s kernel", kernel_name.c_str()]]; + [commandEncoder setComputePipelineState:cplState]; + [commandEncoder setBytes:&length length:sizeof(length) atIndex:0]; + [commandEncoder setBuffer:outBuf offset:output.storage_offset()*output.itemsize() atIndex:1]; + [commandEncoder setBuffer:selfBuf offset:self.storage_offset()*self.itemsize() atIndex:2]; + [commandEncoder setBytes:&sval length:sizeof(sval) atIndex:3]; + dispatch1DJob(commandEncoder, cplState, length); + [commandEncoder endEncoding]; + stream->commit(true); + }); +} + +at::Tensor& _bitwise_op_out_mps (const at::Tensor& self, const at::Tensor& other, at::Tensor& output_, const std::string& op_name) { + using namespace at::mps; + const bool is_self_scalar = self.dim() == 0; + const bool is_other_scalar = other.dim() == 0; + + at::Tensor output = output_; + bool needs_output_copy = false; + + auto output_size = at::infer_size_dimvector(self.sizes(), other.sizes()); + at::native::resize_output(output, output_size); + if (!output.is_contiguous()) { + output = output.contiguous(); + needs_output_copy = true; + } + if (is_other_scalar && is_self_scalar) { + if (op_name == "and") { + output.fill_(c10::Scalar(self.item() & other.item())); + } else if (op_name == "or") { + output.fill_(c10::Scalar(self.item() | other.item())); + } else if (op_name == "xor") { + output.fill_(c10::Scalar(self.item() ^ other.item())); + } else { + TORCH_CHECK(false, "Unknown operation to be performed over scalars ", op_name); + } + } else if (is_other_scalar) { + handle_tensor_scalar_binary_op(self.contiguous(), other.item(), output, fmt::format("bitwise_{}_scalar", op_name)); + } else if (is_self_scalar) { + handle_tensor_scalar_binary_op(other.contiguous(), self.item(), output, fmt::format("bitwise_{}_scalar", op_name)); + } else { + handle_tensor_tensor_binary_op(self.expand(output_size).contiguous(), + other.expand(output_size).contiguous(), + output, + fmt::format("bitwise_{}_tensor", op_name)); + } + if (needs_output_copy) { + output_.copy_(output); + } + return output_; +} + +at::Tensor& bitwise_and_out_mps (const at::Tensor& self, const 
at::Tensor& other, at::Tensor& output) { + return _bitwise_op_out_mps(self, other, output, "and"); +} + +at::Tensor& bitwise_or_out_mps (const at::Tensor& self, const at::Tensor& other, at::Tensor& output) { + return _bitwise_op_out_mps(self, other, output, "or"); +} + +at::Tensor& bitwise_xor_out_mps (const at::Tensor& self, const at::Tensor& other, at::Tensor& output) { + return _bitwise_op_out_mps(self, other, output, "xor"); +} + +at::Tensor& bitwise_not_out_mps (const at::Tensor& self, at::Tensor& output_) { + // Handle boolean tensor using logical not + if (self.scalar_type() == c10::ScalarType::Bool) { + return at::native::logical_not_out_mps(self, output_); + } + + at::Tensor output = output_; + bool needs_output_copy = false; + + at::native::resize_output(output, self.sizes()); + if (!output.is_contiguous()) { + output = output.contiguous(); + needs_output_copy = true; + } + if (self.dim() == 0) { + if (self.scalar_type() == c10::ScalarType::Byte) { + // Unsigned types need a special handling to keep result of operation in 0..255 output + output.fill_(c10::Scalar(static_cast(~self.item()))); + } else { + output.fill_(c10::Scalar(~self.item())); + } + return output_; + } + using namespace at::mps; + MPSStream* stream = getCurrentMPSStream(); + id cplState = getCPLState(MPSDevice::getInstance()->device(), + getMetalType(output), + getMetalType(self), + getMetalType(self), + "bitwise_not"); + uint32_t length = output.numel(); + dispatch_sync(stream->queue(), ^(){ + id buffer = stream->commandBuffer(); + id commandEncoder = [buffer computeCommandEncoder]; + + id outBuf = __builtin_bit_cast(id, output.storage().data()); + id selfBuf = __builtin_bit_cast(id, self.storage().data()); + + [commandEncoder pushDebugGroup:@"Dispatch bitwise_not kernel"]; + [commandEncoder setComputePipelineState:cplState]; + [commandEncoder setBytes:&length length:sizeof(length) atIndex:0]; + [commandEncoder setBuffer:outBuf offset:output.storage_offset()*output.itemsize() atIndex:1]; + [commandEncoder setBuffer:selfBuf offset:self.storage_offset()*self.itemsize() atIndex:2]; + dispatch1DJob(commandEncoder, cplState, length); + [commandEncoder endEncoding]; + stream->commit(true); + }); + if (needs_output_copy) { + output_.copy_(output); + } + return output_; +} + + + +TORCH_LIBRARY_IMPL(aten, MPS, m) { + m.impl("bitwise_and.Tensor_out", bitwise_and_out_mps); + m.impl("bitwise_or.Tensor_out", bitwise_or_out_mps); + m.impl("bitwise_xor.Tensor_out", bitwise_xor_out_mps); + m.impl("bitwise_not.out", bitwise_not_out_mps); +} + +} // anonymous namespace diff --git a/aten/src/ATen/native/mps/operations/ConstantOps.mm b/aten/src/ATen/native/mps/operations/ConstantOps.mm index 0cfd7ccc2ff5b..a5ddd82a229eb 100644 --- a/aten/src/ATen/native/mps/operations/ConstantOps.mm +++ b/aten/src/ATen/native/mps/operations/ConstantOps.mm @@ -35,11 +35,15 @@ MPSGraph *mpsGraph = make_mps_graph(); newCachedGraph = new CachedGraph(mpsGraph); auto isBool = self.scalar_type() == c10::ScalarType::Bool; - auto dataType = (!isBool) ? getMPSScalarType(self.scalar_type()) : MPSDataTypeInt8; + auto isUInt8 = self.scalar_type() == c10::ScalarType::Byte; + auto dataType = !isUInt8 ? !isBool ? 
getMPSScalarType(self.scalar_type()) : MPSDataTypeInt8 : MPSDataTypeUInt32; // constantWithScalar does not work for boolTypes on MacOS-12.[34] // workaround by filing it as int8 tensor and than casting to bool // See https://github.com/pytorch/pytorch/issues/82427 - MPSGraphTensor* inputTensor = [mpsGraph constantWithScalar:value.toDouble() + // constantWithScalar does not work for UInt8 Types on MacOS-12.[34]/Ventura preview + // workaround by filing it as uint32 tensor and than casting to uint8 + // See https://github.com/pytorch/pytorch/issues/83692 + MPSGraphTensor* inputTensor = [mpsGraph constantWithScalar: value.toDouble() shape:getMPSShape(self) dataType:dataType]; MPSGraphTensor* outputTensor = [mpsGraph identityWithTensor:inputTensor @@ -49,6 +53,11 @@ toType:MPSDataTypeBool name:@"constWithBool-workaround"]; } + if (isUInt8) { + outputTensor = [mpsGraph castTensor:outputTensor + toType:MPSDataTypeUInt8 + name:@"constWithUInt8-workaround"]; + } newCachedGraph->outputTensor_ = outputTensor; } diff --git a/aten/src/ATen/native/mps/operations/Convolution.mm b/aten/src/ATen/native/mps/operations/Convolution.mm index 0fe690698c3b8..41d68d4d24c2e 100644 --- a/aten/src/ATen/native/mps/operations/Convolution.mm +++ b/aten/src/ATen/native/mps/operations/Convolution.mm @@ -33,8 +33,9 @@ void fill_conv_desc(MPSGraphConvolution2DOpDescriptor* descriptor_, descriptor_.dataLayout = (memory_format == at::MemoryFormat::Contiguous) ? MPSGraphTensorNamedDataLayoutNCHW : MPSGraphTensorNamedDataLayoutNHWC; - descriptor_.weightsLayout = (memory_format == at::MemoryFormat::Contiguous) ? - MPSGraphTensorNamedDataLayoutOIHW : MPSGraphTensorNamedDataLayoutHWIO; + + // PyTorch always uses OIHW memory layout for weights + descriptor_.weightsLayout = MPSGraphTensorNamedDataLayoutOIHW; descriptor_.groups = groups; } @@ -46,6 +47,8 @@ Tensor _mps_convolution( IntArrayRef stride, IntArrayRef dilation, int64_t groups) { + TORCH_CHECK(input_t.dim() < 5, "Conv3D is not supported on MPS"); + namespace native_mps = at::native::mps; CheckedFrom c = "mps_convolution"; TensorArg input { input_t, "input", 1 }, @@ -61,6 +64,7 @@ Tensor _mps_convolution( bias_defined = bias_opt->defined(); auto memory_format = input_t.suggest_memory_format(); + bool is_channels_last = (memory_format == at::MemoryFormat::ChannelsLast); auto output_t = at::empty( conv_output_size(input->sizes(), weight->sizes(), padding, stride, dilation), @@ -68,7 +72,7 @@ Tensor _mps_convolution( c10::nullopt, kMPS, c10::nullopt, - memory_format); + c10::nullopt); if (output_t.numel() == 0) { return output_t; @@ -122,6 +126,18 @@ Tensor _mps_convolution( + mps::getTensorsStringKey({input_t, weight_t}) + ":" + to_string(bias_defined) + ":" + bias_shape_key; CachedGraph* cachedGraph = static_cast(cache_->LookUp(key)); + MPSShape* inputShape = nil; + + if (is_channels_last) { + const auto inputSizes = input_t.sizes(); + const NSUInteger N = inputSizes[0]; + const NSUInteger C = inputSizes[1]; + const NSUInteger H = inputSizes[2]; + const NSUInteger W = inputSizes[3]; + inputShape = @[@(N), @(H), @(W), @(C)]; + } else { + inputShape = native_mps::getMPSShape(input_t); + } if(!cachedGraph) { native_mps::MPSCachedGraph *tmpCachedGraph = cache_->CreateCachedGraph(key, ^ native_mps::MPSCachedGraph * () { @@ -133,26 +149,34 @@ Tensor _mps_convolution( newCachedGraph = new CachedGraph(mpsGraph); MPSGraphConvolution2DOpDescriptor *descriptor_ = [[MPSGraphConvolution2DOpDescriptor new] autorelease]; - fill_conv_desc(descriptor_, stride[0], stride[1], - dilation[0], 
dilation[1], + fill_conv_desc(descriptor_, stride[1], stride[0], + dilation[1], dilation[0], padding[1], padding[0], memory_format, groups); - MPSGraphTensor* inputTensor = native_mps::mpsGraphRankedPlaceHolder(mpsGraph, input_t); + MPSGraphTensor* inputTensor = native_mps::mpsGraphRankedPlaceHolder(mpsGraph, native_mps::getMPSScalarType(input_t.scalar_type()), inputShape); MPSGraphTensor* weightTensor = native_mps::mpsGraphRankedPlaceHolder(mpsGraph, weight_t); + MPSGraphTensor* biasTensor = nil; if(bias_defined) biasTensor = native_mps::mpsGraphUnrankedPlaceHolder(mpsGraph, native_mps::getMPSDataType((bias_opt.value()).scalar_type())); - MPSGraphTensor* outputTensor = [mpsGraph convolution2DWithSourceTensor:inputTensor - weightsTensor:weightTensor - descriptor:descriptor_ - name:nil]; + MPSGraphTensor* outputTensor = [mpsGraph convolution2DWithSourceTensor: inputTensor + weightsTensor: weightTensor + descriptor: descriptor_ + name: nil]; + if (is_channels_last) { + // NHWC -> NCHW + outputTensor = [mpsGraph transposeTensor: [mpsGraph transposeTensor:outputTensor dimension:-1 withDimension:-2 name:nil] + dimension: -2 + withDimension: -3 + name: nil]; + } if(bias_defined) { - outputTensor = [mpsGraph additionWithPrimaryTensor:outputTensor - secondaryTensor:biasTensor - name:nil]; + outputTensor = [mpsGraph additionWithPrimaryTensor: outputTensor + secondaryTensor: biasTensor + name: nil]; } newCachedGraph->inputTensor_ = inputTensor; @@ -165,7 +189,7 @@ Tensor _mps_convolution( cachedGraph = static_cast(tmpCachedGraph); } - auto inputPlaceholder = native_mps::Placeholder(cachedGraph->inputTensor_, input_t); + auto inputPlaceholder = native_mps::Placeholder(cachedGraph->inputTensor_, input_t, inputShape); auto weightsPlaceholder = native_mps::Placeholder(cachedGraph->weightTensor_, weight_t); auto biasPlaceholder = native_mps::Placeholder(); // Reshape the bias to be broadcastable with output of conv2d @@ -207,7 +231,7 @@ Tensor mps_convolution_backward_input( c10::nullopt, kMPS, c10::nullopt, - memory_format); + c10::nullopt); // Avoid "grad_input" when this is being used as transposed convolution TensorArg grad_input{ grad_input_t, "result", 0 }; @@ -242,9 +266,7 @@ Tensor mps_convolution_backward_input( } MPSShape* mps_input_shape = getMPSShape(input_size); - NSString* ns_shape_key = [[mps_input_shape valueForKey:@"description"] componentsJoinedByString:@","]; - string key = "mps_convolution_backward_input:" + to_string(stride[0]) + ":" + to_string(stride[1]) + ":" + to_string(dilation[0]) + ":" + to_string(dilation[1]) + ":" + to_string(padding[0]) + ":" + to_string(padding[1]) + ":" @@ -263,8 +285,8 @@ Tensor mps_convolution_backward_input( newCachedGraph = new CachedGraph(mpsGraph); MPSGraphConvolution2DOpDescriptor *descriptor_ = [[MPSGraphConvolution2DOpDescriptor new] autorelease]; - fill_conv_desc(descriptor_, stride[0], stride[1], - dilation[0], dilation[1], + fill_conv_desc(descriptor_, stride[1], stride[0], + dilation[1], dilation[0], padding[1], padding[0], memory_format, groups); @@ -320,7 +342,7 @@ Tensor mps_convolution_backward_weights( checkAllSameType(c, {grad_output, input}); checkAllSameGPU(c, {grad_output, input}); - auto grad_weight_t = at::empty(weight_size, grad_output_t.options(), memory_format); + auto grad_weight_t = at::empty(weight_size, grad_output_t.options(), c10::nullopt); TensorArg grad_weight{ grad_weight_t, "result", 0 }; convolution_shape_check(c, input, grad_weight, grad_output, padding, stride, dilation, groups); @@ -353,9 +375,7 @@ Tensor 
mps_convolution_backward_weights( } MPSShape* mps_weight_shape = getMPSShape(weight_size); - NSString* ns_shape_key = [[mps_weight_shape valueForKey:@"description"] componentsJoinedByString:@","]; - string key = "mps_convolution_backward_weights:" + to_string(stride[0]) + ":" + to_string(stride[1]) + ":" + to_string(dilation[0]) + ":" + to_string(dilation[1]) + ":" + to_string(padding[0]) + ":" + to_string(padding[1]) + ":" @@ -374,8 +394,8 @@ Tensor mps_convolution_backward_weights( newCachedGraph = new CachedGraph(mpsGraph); MPSGraphConvolution2DOpDescriptor *descriptor_ = [[MPSGraphConvolution2DOpDescriptor new] autorelease]; - fill_conv_desc(descriptor_, stride[0], stride[1], - dilation[0], dilation[1], + fill_conv_desc(descriptor_, stride[1], stride[0], + dilation[1], dilation[0], padding[1], padding[0], memory_format, groups); diff --git a/aten/src/ATen/native/mps/operations/Distributions.mm b/aten/src/ATen/native/mps/operations/Distributions.mm index a4b73bd75fb03..999b1cc79d5b2 100644 --- a/aten/src/ATen/native/mps/operations/Distributions.mm +++ b/aten/src/ATen/native/mps/operations/Distributions.mm @@ -1,16 +1,8 @@ // Copyright © 2022 Apple Inc. -#include -#include -#include -#include -#include #include #include -#include -#include #include -#include namespace at { namespace native { @@ -24,7 +16,6 @@ } double delta = (to - from); AT_DISPATCH_FLOATING_TYPES(input.scalar_type(), "check_uniform_bounds", [&] { - const auto dtype = input.dtype(); const auto min = static_cast(std::numeric_limits::lowest()); const auto max = static_cast(std::numeric_limits::max()); TORCH_CHECK(from <= to, "uniform_ expects to return a [from, to) range, but found from=", from, " > to=", to); @@ -198,11 +189,6 @@ Tensor normal_mps(const Tensor& mean, const Tensor& std, c10::optional gen, Tensor& output) { - TORCH_CHECK( - std.min().ge(0).item(), - "normal expects all elements of std >= 0.0"); - - Tensor mean_t = empty_mps( output.sizes(), output.scalar_type(), @@ -218,7 +204,6 @@ Tensor normal_mps(const Tensor& mean, const Tensor& std, c10::optional gen, Tensor& output) { TORCH_CHECK(!std.is_complex(), "normal expects standard deviation to be non-complex"); - TORCH_CHECK(std.numel() == 0 || std.min().ge(0).item(), "normal expects all elements of std >= 0.0"); // Check that mean and std have same number of elements TORCH_CHECK(mean.numel() == std.numel(), "normal_mps_out: mean and std must have same number of elements") @@ -528,8 +513,6 @@ static void check_from_to_in_range(int64_t from, int64_t to_inc, ScalarType scal MPSGraphTensor *outputTensor_ = nil; }; - MPSGraphCache* cache_ = MPSGraphCache::getInstance(); - MPSStream* stream = getCurrentMPSStream(); uint64_t seed_ = c10::detail::getNonDeterministicRandom(true); diff --git a/aten/src/ATen/native/mps/operations/Indexing.h b/aten/src/ATen/native/mps/operations/Indexing.h new file mode 100644 index 0000000000000..4227a0cf62c28 --- /dev/null +++ b/aten/src/ATen/native/mps/operations/Indexing.h @@ -0,0 +1,51 @@ +// Copyright © 2022 Apple Inc. 
+ +#include +#include +#include +#include +#include +#include +#include +#include + +using namespace at::mps; + +namespace at { +namespace native { +namespace mps { + +std::string getMetalScalarType(ScalarType scalar_type) { + std::string res = ""; + switch (scalar_type) { + case ScalarType::Float: + res = "float"; break; + case ScalarType::Half: + res = "half"; break; + case ScalarType::Long: + res = "long"; break; + case ScalarType::Int: + res = "int"; break; + case ScalarType::Short: + res = "short"; break; + case ScalarType::Char: + res = "char"; break; + case ScalarType::Byte: + res = "uchar"; break; + case ScalarType::Bool: + res = "bool"; break; + default: + break; + } + return res; +} + +std::string getIndexFunctionName(ScalarType scalar_type, bool index_select, bool accumulate) { + std::string indexFunction = index_select ? "index_select_" : + (accumulate && (scalar_type != kBool)) ? "index_put_accumulate_" : "index_put_"; + + return indexFunction + getMetalScalarType(scalar_type); +} +} +} +} diff --git a/aten/src/ATen/native/mps/operations/Indexing.mm b/aten/src/ATen/native/mps/operations/Indexing.mm index 7c0d7544cf21a..c54f5b12d44af 100644 --- a/aten/src/ATen/native/mps/operations/Indexing.mm +++ b/aten/src/ATen/native/mps/operations/Indexing.mm @@ -1,5 +1,4 @@ // Copyright © 2022 Apple Inc. - #include #include #include @@ -12,6 +11,7 @@ #include #include #include +#include #include #include #include @@ -20,6 +20,7 @@ #include #include #include +#include #ifdef __OBJC__ #include @@ -28,6 +29,137 @@ namespace at { namespace native { +static +bool dispatchIndexSelectKernel(TensorIteratorBase& iter, IntArrayRef index_size, IntArrayRef index_stride) { + using namespace mps; + + if (iter.numel() == 0) + return true; + + const Tensor& inputTensor = iter.tensor(1); + Tensor outputTensor = iter.tensor(0); + id inputBuffer = getMTLBufferStorage(inputTensor); + id outputBuffer = getMTLBufferStorage(outputTensor); + MPSStream* mpsStream = getCurrentMPSStream(); + id device = MPSDevice::getInstance()->device(); + + dispatch_sync(mpsStream->queue(), ^(){ + @autoreleasepool { + NSError* error = nil; + constexpr uint32_t nOffsets = 3; + const int64_t num_indices = index_size.size(); + const uint32_t numThreads = iter.numel(); + const uint32_t nDim = iter.ndim(); + const IntArrayRef& iterShape = iter.shape(); + std::vector iterShapeData(iterShape.size()); + std::vector> strides(nDim); + + for (const auto i: c10::irange(iterShape.size())) { + TORCH_CHECK(i <= UINT32_MAX); + iterShapeData[i] = (uint32_t)(iterShape[i]); + } + + for (const auto i: c10::irange(nDim)) { + for (const auto offset: c10::irange(nOffsets)) { + strides[i][offset] = iter.strides(offset)[i]; + } + } + + MTLSize gridSize = MTLSizeMake(numThreads, 1, 1); + id commandBuffer = mpsStream->commandBuffer(); + id computeEncoder = [commandBuffer computeCommandEncoder]; + id kernelDataOffsetsFunction = MPSDevice::getInstance()->metalIndexingFunction("kernel_index_offsets", nil); + id kernelDataOffsetsPSO = [[device newComputePipelineStateWithFunction: kernelDataOffsetsFunction + error: &error] autorelease]; + id kernelDataOffsets = [[device newBufferWithLength: numThreads * sizeof(simd_uint3) + options: 0] autorelease]; + TORCH_CHECK(kernelDataOffsetsPSO, "Failed to created pipeline state object, error: ", [[error description] UTF8String]); + + [computeEncoder setComputePipelineState:kernelDataOffsetsPSO]; + [computeEncoder setBytes:strides.data() length:sizeof(uint32_t) * nDim * nOffsets atIndex:0]; + [computeEncoder 
setBuffer:kernelDataOffsets offset:0 atIndex:1]; + [computeEncoder setBytes:iterShapeData.data() length:sizeof(uint32_t) * iterShape.size() atIndex:2]; + [computeEncoder setBytes:&nDim length:sizeof(uint32_t) atIndex:3]; + [computeEncoder setBytes:&nOffsets length:sizeof(uint32_t) atIndex:4]; + + NSUInteger kernelOffsetsTGSize = kernelDataOffsetsPSO.maxTotalThreadsPerThreadgroup; + if (kernelOffsetsTGSize > numThreads) + kernelOffsetsTGSize = numThreads; + + MTLSize kernelOffsetsThreadGroupSize = MTLSizeMake(kernelOffsetsTGSize, 1, 1); + [computeEncoder dispatchThreads: gridSize + threadsPerThreadgroup: kernelOffsetsThreadGroupSize]; + + MTLFunctionConstantValues* constantValues = [[MTLFunctionConstantValues new] autorelease]; + [constantValues setConstantValue: &num_indices type:MTLDataTypeUInt atIndex:0]; + + std::string indexFunction = getIndexFunctionName(inputTensor.scalar_type(), true, false); + id indexKernelFunction = MPSDevice::getInstance()->metalIndexingFunction(indexFunction, constantValues); + id argumentEncoder = [[indexKernelFunction newArgumentEncoderWithBufferIndex:0] autorelease]; + NSUInteger argumentBufferLength = argumentEncoder.encodedLength; + id indexAB = [[device newBufferWithLength:argumentBufferLength options:0] autorelease]; + [argumentEncoder setArgumentBuffer:indexAB offset:0]; + + for (uint32_t idx = 0; idx < num_indices; idx++) { + const Tensor& indexTensor = iter.tensor(idx+2); + [argumentEncoder setBuffer: getMTLBufferStorage(indexTensor) + offset: indexTensor.storage_offset() * indexTensor.element_size() + atIndex: idx]; + TORCH_CHECK(indexTensor.scalar_type() == ScalarType::Long, "index(): Expected dtype int64 for Index"); + } + + // FIXME: PSO needs to be cached + id indexSelectPSO = [[device newComputePipelineStateWithFunction: indexKernelFunction + error: &error] autorelease]; + TORCH_CHECK(indexSelectPSO, "Failed to created pipeline state object, error: ", [[error description] UTF8String]); + + for (uint32_t idx = 0; idx < num_indices; idx++) { + const Tensor& indexTensor = iter.tensor(idx+2); + [computeEncoder useResource:getMTLBufferStorage(indexTensor) usage:MTLResourceUsageRead]; + } + + [computeEncoder setComputePipelineState:indexSelectPSO]; + [computeEncoder setBuffer:indexAB offset:0 atIndex:0]; + [computeEncoder setBytes:index_size.data() length:sizeof(index_size[0]) * index_size.size() atIndex:1]; + [computeEncoder setBytes:index_stride.data() length:sizeof(index_stride[0]) * index_stride.size() atIndex:2]; + [computeEncoder setBuffer:kernelDataOffsets offset:0 atIndex:3]; + [computeEncoder setBuffer:inputBuffer offset:inputTensor.storage_offset() * inputTensor.element_size() atIndex:4]; + [computeEncoder setBuffer:outputBuffer offset:outputTensor.storage_offset() * outputTensor.element_size() atIndex:5]; + + NSUInteger tgSize = indexSelectPSO.maxTotalThreadsPerThreadgroup; + if (tgSize > numThreads) + tgSize = numThreads; + + MTLSize threadGroupSize = MTLSizeMake(tgSize, 1, 1); + [computeEncoder dispatchThreads: gridSize + threadsPerThreadgroup: threadGroupSize]; + + [computeEncoder endEncoding]; + mpsStream->commit(true); + } + }); + + return true; +} + +void index_kernel_mps(TensorIteratorBase& iter, IntArrayRef index_size, IntArrayRef index_stride) { + using namespace mps; + + @autoreleasepool { + int64_t num_indices = index_size.size(); + + AT_ASSERT(num_indices == index_stride.size()); + AT_ASSERT(num_indices == iter.ntensors() - 2); + const Tensor& inputTensor = iter.tensor(1); + + 
TORCH_CHECK(c10::isIntegralType(inputTensor.scalar_type(), /*includesBool=*/true) || + inputTensor.scalar_type() == ScalarType::Float || + inputTensor.scalar_type() == ScalarType::Half, + getMPSTypeString(inputTensor.scalar_type()) + std::string(" not supported for index.Tensor_out")); + dispatchIndexSelectKernel(iter, index_size, index_stride); + } +} + Tensor flip_mps(const Tensor& self, IntArrayRef dims) { using namespace mps; @@ -161,11 +293,6 @@ Tensor flip_mps(const Tensor& self, IntArrayRef dims) { MPSGraphTensor* indexTensor = mpsGraphRankedPlaceHolder(mpsGraph, index); MPSGraphTensor* sourceTensor = mpsGraphRankedPlaceHolder(mpsGraph, source); MPSGraphTensor* alphaTensor = mpsGraphScalarPlaceHolder(mpsGraph, alpha_f); - MPSGraphTensor* inputSlice = [mpsGraph gatherWithUpdatesTensor:inputTensor - indicesTensor:indexTensor - axis:dim - batchDimensions:0 - name:nil]; MPSGraphTensor* alphaSourceSlice = [mpsGraph multiplicationWithPrimaryTensor:sourceTensor secondaryTensor:alphaTensor name:nil]; @@ -499,5 +626,7 @@ Tensor embedding_dense_backward_mps( return masked_fill__mps(self, mask, value.item()); } -} -} +REGISTER_DISPATCH(index_stub, &index_kernel_mps); + +} // native +} // at diff --git a/aten/src/ATen/native/mps/operations/Linear.mm b/aten/src/ATen/native/mps/operations/Linear.mm index a6710ea5fc2a5..b3f776d237514 100644 --- a/aten/src/ATen/native/mps/operations/Linear.mm +++ b/aten/src/ATen/native/mps/operations/Linear.mm @@ -46,6 +46,10 @@ Tensor _mps_linear( TORCH_CHECK(output.is_mps()); + if(output.numel() == 0) { + return output; + } + MPSStream *stream = getCurrentMPSStream(); struct CachedGraph : public MPSCachedGraph @@ -65,7 +69,6 @@ Tensor _mps_linear( MPSShape* wt_shape = getMPSShape(weight); string wt_key = string([[[wt_shape valueForKey:@"description"] componentsJoinedByString:@","] UTF8String]); - MPSShape* bias_shape = nil; string bias_key = "nobias"; if(is_bias_defined) { bias_key = "bias"; @@ -358,10 +361,10 @@ Tensor _mps_linear_backward_input( const Tensor& weight, std::array output_mask) { Tensor grad_input, grad_weight, grad_bias; if (output_mask[0]) { - grad_input = at::_mps_linear_backward_input(input.sizes(), grad_output, weight); + grad_input = _mps_linear_backward_input(input.sizes(), grad_output, weight); } if (output_mask[1] || output_mask[2]) { - std::tie(grad_weight, grad_bias) = at::_mps_linear_backward_weights(grad_output, input, weight, output_mask[2]); + std::tie(grad_weight, grad_bias) = _mps_linear_backward_weights(grad_output, input, weight, output_mask[2]); } return std::tuple{grad_input, grad_weight, grad_bias}; } diff --git a/aten/src/ATen/native/mps/operations/LinearAlgebra.mm b/aten/src/ATen/native/mps/operations/LinearAlgebra.mm index 8b69c65c17fae..31c8c88248d6a 100644 --- a/aten/src/ATen/native/mps/operations/LinearAlgebra.mm +++ b/aten/src/ATen/native/mps/operations/LinearAlgebra.mm @@ -125,17 +125,11 @@ void prepare_matrices_for_broadcasting( MPSStream* stream = getCurrentMPSStream(); - bool transpose_mat1 = false; - bool transpose_mat2 = false; - - prepare_matrices_for_broadcasting(NULL, self, other, NULL, NULL, transpose_mat1, transpose_mat2); - mps::MPSGraphCache *cache_ = mps::MPSGraphCache::getInstance(); @autoreleasepool { - string key = "mm_out_mps_impl" + getTensorsStringKey({self, other}) - + ":" + to_string(transpose_mat1) + ":" + to_string(transpose_mat2); + string key = "mm_out_mps_impl" + getTensorsStringKey({self, other}); CachedGraph* cachedGraph = static_cast(cache_->LookUp(key)); if(!cachedGraph) { @@ -147,31 
+141,25 @@ void prepare_matrices_for_broadcasting( MPSGraph *mpsGraph = mps::make_mps_graph(); newCachedGraph = new CachedGraph(mpsGraph); - MPSGraphTensor *selfTensor = mps::mpsGraphRankedPlaceHolder(mpsGraph, self); - MPSGraphTensor *otherTensor = mps::mpsGraphRankedPlaceHolder(mpsGraph, other); + MPSGraphTensor *selfTensor = nil; + MPSGraphTensor *otherTensor = nil; + MPSGraphTensor *outputTensor = nil; - MPSGraphTensor* t1 = nil; - MPSGraphTensor* t2 = nil; + if(self.numel() == 0 || other.numel() == 0) { - if(transpose_mat1) - t1 = [mpsGraph transposeTensor:selfTensor - dimension:-1 - withDimension:-2 - name:nil]; - else - t1 = selfTensor; + outputTensor = [mpsGraph constantWithScalar:0. + shape:getMPSShape(output_sizes) + dataType:getMPSDataType(output.scalar_type())]; - if(transpose_mat2) - t2 = [mpsGraph transposeTensor:otherTensor - dimension:-1 - withDimension:-2 - name:nil]; - else - t2 = otherTensor; + } + else { - MPSGraphTensor* outputTensor = [mpsGraph matrixMultiplicationWithPrimaryTensor:t1 - secondaryTensor:t2 - name:nil]; + selfTensor = mps::mpsGraphRankedPlaceHolder(mpsGraph, self); + otherTensor = mps::mpsGraphRankedPlaceHolder(mpsGraph, other); + outputTensor = [mpsGraph matrixMultiplicationWithPrimaryTensor:selfTensor + secondaryTensor:otherTensor + name:nil]; + } newCachedGraph->selfTensor_ = selfTensor; newCachedGraph->otherTensor_ = otherTensor; @@ -181,14 +169,21 @@ void prepare_matrices_for_broadcasting( }); cachedGraph = static_cast(tmpCachedGraph); } - Placeholder selfPlaceholder = Placeholder(cachedGraph->selfTensor_, self); - Placeholder otherPlaceholder = Placeholder(cachedGraph->otherTensor_, other); + Placeholder selfPlaceholder = Placeholder(); + Placeholder otherPlaceholder = Placeholder(); + if(!(self.numel() == 0 || other.numel() == 0)) { + selfPlaceholder = Placeholder(cachedGraph->selfTensor_, self); + otherPlaceholder = Placeholder(cachedGraph->otherTensor_, other); + } Placeholder outputPlaceholder = Placeholder(cachedGraph->outputTensor_, output); - NSDictionary* feeds = @{ - selfPlaceholder.getMPSGraphTensor() : selfPlaceholder.getMPSGraphTensorData(), - otherPlaceholder.getMPSGraphTensor() : otherPlaceholder.getMPSGraphTensorData() - }; + NSDictionary* feeds = nil; + + if(!(self.numel() == 0 || other.numel() == 0)) + feeds = @{ + selfPlaceholder.getMPSGraphTensor() : selfPlaceholder.getMPSGraphTensorData(), + otherPlaceholder.getMPSGraphTensor() : otherPlaceholder.getMPSGraphTensorData() + }; NSDictionary* results = @{ outputPlaceholder.getMPSGraphTensor() : outputPlaceholder.getMPSGraphTensorData() @@ -246,8 +241,6 @@ void prepare_matrices_for_broadcasting( MPSStream* stream = getCurrentMPSStream(); - MPSGraph* mpsGraph = make_mps_graph(); - bool transpose_mat1_times_mat2 = false; bool transpose_mat1 = false; bool transpose_mat2 = false; diff --git a/aten/src/ATen/native/mps/operations/LossOps.mm b/aten/src/ATen/native/mps/operations/LossOps.mm index 454a9512c23ab..cc112265a3a87 100644 --- a/aten/src/ATen/native/mps/operations/LossOps.mm +++ b/aten/src/ATen/native/mps/operations/LossOps.mm @@ -766,10 +766,6 @@ void smooth_l1_loss_impl( MPSGraphTensor *targetTensor = mpsGraphUnrankedPlaceHolder(mpsGraph, getMPSDataType(target.scalar_type())); // Setup tensors - MPSGraphTensor *mpsGraphZeroTensor = [mpsGraph constantWithScalar: 0.0 - dataType: inputTensor.dataType]; - MPSGraphTensor *mpsGraphOneTensor = [mpsGraph constantWithScalar: 1.0 - dataType: inputTensor.dataType]; MPSGraphTensor *mpsGraphHalfTensor = [mpsGraph constantWithScalar: 0.5 
dataType: inputTensor.dataType]; MPSGraphTensor *betaTensor = [mpsGraph constantWithScalar: beta @@ -1067,8 +1063,6 @@ void smooth_l1_loss_backward_template( }; MPSGraphCache* cache_ = MPSGraphCache::getInstance(); - MPSStream* stream = getCurrentMPSStream(); - @autoreleasepool { string key = op_name + ":" + reductionToString(reduction) + ":" + std::to_string(delta) + ":" + getTensorsStringKey({input, target}); CachedGraph* cachedGraph = static_cast(cache_->LookUp(key)); diff --git a/aten/src/ATen/native/mps/operations/Pad.mm b/aten/src/ATen/native/mps/operations/Pad.mm new file mode 100644 index 0000000000000..25cccfd6f4424 --- /dev/null +++ b/aten/src/ATen/native/mps/operations/Pad.mm @@ -0,0 +1,304 @@ +// Copyright © 2022 Apple Inc. + +#include +#include + +namespace at { +namespace native { +namespace mps { + +// Pad operations (1D/2D/3D forward and backward) +Tensor& pad_out_template(Tensor &output, const Tensor &input_, IntArrayRef padding, + const c10::optional& grad_output_opt, + MPSGraphPaddingMode mode, double constantValue, const string op_name) +{ + const int padding_size = (int) padding.size(); + const int padding_dim = padding_size / 2; // either 1D, 2D, or 3D + + TORCH_CHECK(padding_size == 2 || padding_size == 4 || padding_size == 6, + "invalid padding argument of size ", padding_size); + + const Tensor& grad_output_ = *(at::borrow_from_optional_tensor(grad_output_opt)); + const bool is_backward_pass = grad_output_.defined(); + + int64_t nbatch = 1; + int64_t ndims = input_.ndimension(); + // number of input dims with ConstantPad could be less than 2 + int dim_w = ndims > 1 ? padding_dim : 0; + int dim_h = padding_dim - 1; + int dim_d = padding_dim - 2; + int dim_slices = 0; + + if (!is_backward_pass && ndims > 1) { + bool valid_dims = input_.size(1) != 0 && input_.size(padding_dim) != 0; + TORCH_CHECK((ndims == 1 + padding_dim && valid_dims) || + (ndims == 2 + padding_dim && valid_dims && input_.size(1 + padding_dim) != 0), + "3D or 4D (batch mode) tensor expected for input, but got: ", input_); + } + + if (ndims == 2 + padding_dim) { + nbatch = input_.size(0); + dim_w++; + dim_h++; + dim_d++; + dim_slices++; + } + + int64_t pad_l = padding[0]; + int64_t pad_r = padding[1]; + int64_t pad_t = padding_dim > 1 ? padding[2] : 0; + int64_t pad_b = padding_dim > 1 ? padding[3] : 0; + int64_t pad_front = padding_dim > 2 ? padding[4] : 0; + int64_t pad_back = padding_dim > 2 ? padding[5] : 0; + + int64_t nplane = input_.size(dim_slices); + int64_t input_w = input_.size(dim_w); + int64_t output_w = input_w + pad_l + pad_r; + int64_t input_h = padding_dim > 1 ? input_.size(dim_h) : 0; + int64_t output_h = padding_dim > 1 ? input_h + pad_t + pad_b : 0; + int64_t input_d = padding_dim > 2 ? input_.size(dim_d) : 0; + int64_t output_d = padding_dim > 2 ? input_d + pad_front + pad_back : 0; + + Tensor grad_output, input = input_; + + if (!is_backward_pass) { + TORCH_CHECK(pad_l < input_w && pad_r < input_w, + "Argument #4: Padding size should be less than the corresponding " + "input dimension, but got: padding (", pad_l, ", ", pad_r, + ") at dimension ", dim_w, " of input ", ndims); + + if (padding_dim > 1) { + TORCH_CHECK(pad_t < input_h && pad_b < input_h, + "Argument #6: Padding size should be less than the corresponding " + "input dimension, but got: padding (", pad_t, ", ", pad_b, + ") at dimension ", dim_h, " of input ", ndims); + } + TORCH_CHECK(output_w >= 1 || output_h >= padding_dim - 1, + "input (H: ", input_h, ", W: ", input_w, ") is too small. 
Calculated " + "output H: ", output_h, " W: ", output_w); + + if (ndims == 1 + padding_dim) { + if (padding_dim == 3) + output.resize_({nplane, output_d, output_h, output_w}); + else if (padding_dim == 2) + output.resize_({nplane, output_h, output_w}); + else + output.resize_({nplane, output_w}); + } else { + if (padding_dim == 3) + output.resize_({nbatch, nplane, output_d, output_h, output_w}); + else if (padding_dim == 2) + output.resize_({nbatch, nplane, output_h, output_w}); + else if (ndims > 1) + output.resize_({nbatch, nplane, output_w}); + else + output.resize_({output_w}); + } + if (output.numel() == 0 || input_.numel() == 0) + return output; + input = input_.contiguous(); + } else { + TORCH_CHECK(output_w == grad_output_.size(dim_w), + "gradOutput width unexpected. Expected: ", output_w, ", Got: ", grad_output_.size(dim_w)); + if (padding_dim > 1) { + TORCH_CHECK(output_h == grad_output_.size(dim_h), + "gradOutput height unexpected. Expected: ", output_h, ", Got: ", grad_output_.size(dim_h)); + } + grad_output = grad_output_.contiguous(); + } + + const int64_t input_dim = input.dim(); + MPSShape *leftPadding = nullptr, *rightPadding = nullptr; + if (padding_dim == 3) { + leftPadding = [NSArray arrayWithObjects:(const NSNumber*[]){ @(0), @(0), @(pad_front), @(pad_t), @(pad_l) } count:input_dim]; + rightPadding = [NSArray arrayWithObjects:(const NSNumber*[]){ @(0), @(0), @(pad_back), @(pad_b), @(pad_r) } count:input_dim]; + } else if (padding_dim == 2) { + leftPadding = [NSArray arrayWithObjects:(const NSNumber*[]){ @(0), @(0), @(pad_t), @(pad_l) } count:input_dim]; + rightPadding = [NSArray arrayWithObjects:(const NSNumber*[]){ @(0), @(0), @(pad_b), @(pad_r) } count:input_dim]; + } else if (padding_dim == 1) { + if (input_dim > 1) { + leftPadding = [NSArray arrayWithObjects:(const NSNumber*[]){ @(0), @(0), @(pad_l) } count:input_dim]; + rightPadding = [NSArray arrayWithObjects:(const NSNumber*[]){ @(0), @(0), @(pad_r) } count:input_dim]; + } else { + leftPadding = [NSArray arrayWithObjects:(const NSNumber*[]){ @(pad_l) } count:input_dim]; + rightPadding = [NSArray arrayWithObjects:(const NSNumber*[]){ @(pad_r) } count:input_dim]; + } + } + + struct CachedGraph : public MPSCachedGraph { + CachedGraph(MPSGraph *graph) : MPSCachedGraph(graph) { } + MPSGraphTensor *inputTensor = nil, *outputTensor = nil; + MPSGraphTensor *gradOutputTensor = nil; + }; + MPSGraphCache* cache_ = MPSGraphCache::getInstance(); + + @autoreleasepool { + string key = op_name + getTensorsStringKey({input, grad_output}) + + ":L" + to_string(pad_l) + ":R" + to_string(pad_r) + + ":T" + to_string(pad_t) + ":B" + to_string(pad_b) + + ":F" + to_string(pad_front) + ":K" + to_string(pad_back); + + CachedGraph* cachedGraph = static_cast(cache_->LookUp(key)); + if(!cachedGraph) { + cachedGraph = static_cast(cache_->CreateCachedGraph(key, ^ MPSCachedGraph * () { + CachedGraph *newCachedGraph = nil; + @autoreleasepool { + MPSGraph* mpsGraph = make_mps_graph(); + newCachedGraph = new CachedGraph(mpsGraph); + newCachedGraph->inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, input); + if (!is_backward_pass) { + newCachedGraph->outputTensor = [mpsGraph padTensor:newCachedGraph->inputTensor + withPaddingMode:mode + leftPadding:leftPadding + rightPadding:rightPadding + constantValue:constantValue + name:nil]; + } else { + newCachedGraph->gradOutputTensor = mpsGraphRankedPlaceHolder(mpsGraph, grad_output); + newCachedGraph->outputTensor = [mpsGraph padGradientWithIncomingGradientTensor:newCachedGraph->gradOutputTensor + 
sourceTensor:newCachedGraph->inputTensor + paddingMode:mode + leftPadding:leftPadding + rightPadding:rightPadding + name:nil]; + } + } + return newCachedGraph; + })); + } + Placeholder inputPlaceholder = Placeholder(cachedGraph->inputTensor, input); + Placeholder outputPlaceholder = Placeholder(cachedGraph->outputTensor, output); + + NSMutableDictionary *feeds = [[NSMutableDictionary new] autorelease]; + feeds[inputPlaceholder.getMPSGraphTensor()] = inputPlaceholder.getMPSGraphTensorData(); + if (is_backward_pass) { + Placeholder gradOutputPlaceholder = Placeholder(cachedGraph->gradOutputTensor, grad_output); + feeds[gradOutputPlaceholder.getMPSGraphTensor()] = gradOutputPlaceholder.getMPSGraphTensorData(); + } + NSDictionary* results = @{ + outputPlaceholder.getMPSGraphTensor() : outputPlaceholder.getMPSGraphTensorData() + }; + runMPSGraph(getCurrentMPSStream(), cachedGraph->graph(), feeds, results); + } + return output; +} +} // namespace mps + +// 1D Reflection and Replication Padding +TORCH_IMPL_FUNC(reflection_pad1d_out_mps) +(const Tensor& input, IntArrayRef padding, const Tensor& output) +{ + mps::pad_out_template(const_cast(output), input, padding, c10::nullopt, + MPSGraphPaddingModeReflect, 0.0, "reflection_pad1d_out_mps"); +} + +TORCH_IMPL_FUNC(reflection_pad1d_backward_out_mps) +(const Tensor& grad_output, const Tensor& input, IntArrayRef padding, const Tensor& grad_input) +{ + grad_input.resize_as_(input).zero_(); + mps::pad_out_template(const_cast(grad_input), input, padding, grad_output, + MPSGraphPaddingModeReflect, 0.0, "reflection_pad1d_backward_out_mps"); +} + +TORCH_IMPL_FUNC(replication_pad1d_out_mps) +(const Tensor& input, IntArrayRef padding, const Tensor& output) +{ + mps::pad_out_template(const_cast(output), input, padding, c10::nullopt, + MPSGraphPaddingModeClampToEdge, 0.0, "replication_pad1d_out_mps"); +} + +TORCH_IMPL_FUNC(replication_pad1d_backward_out_mps) +(const Tensor& grad_output, const Tensor& input, IntArrayRef padding, const Tensor& grad_input) +{ + grad_input.resize_as_(input).zero_(); + mps::pad_out_template(const_cast(grad_input), input, padding, grad_output, + MPSGraphPaddingModeClampToEdge, 0.0, "replication_pad1d_backward_out_mps"); +} + +// 2D Reflection and Replication Padding +Tensor& reflection_pad2d_out_mps(const Tensor& input, IntArrayRef padding, Tensor& output) +{ + return mps::pad_out_template(output, input, padding, c10::nullopt, MPSGraphPaddingModeReflect, 0.0, __func__); +} + +Tensor reflection_pad2d_mps(const Tensor& input, IntArrayRef padding) +{ + Tensor output = at::empty({0}, input.options()); + return mps::pad_out_template(output, input, padding, c10::nullopt, MPSGraphPaddingModeReflect, 0.0, __func__); +} + +Tensor& reflection_pad2d_backward_out_mps(const Tensor& grad_output, const Tensor& input, IntArrayRef padding, Tensor& grad_input) +{ + grad_input.resize_as_(input).zero_(); + return mps::pad_out_template(grad_input, input, padding, grad_output, MPSGraphPaddingModeReflect, 0.0, __func__); +} + +Tensor reflection_pad2d_backward_mps(const Tensor& grad_output, const Tensor& input, IntArrayRef padding) +{ + auto grad_input = at::zeros_like(input, LEGACY_CONTIGUOUS_MEMORY_FORMAT); + return mps::pad_out_template(grad_input, input, padding, grad_output, MPSGraphPaddingModeReflect, 0.0, __func__); +} + +TORCH_IMPL_FUNC(replication_pad2d_out_mps) +(const Tensor& input, IntArrayRef padding, const Tensor& output) +{ + mps::pad_out_template(const_cast(output), input, padding, c10::nullopt, + MPSGraphPaddingModeClampToEdge, 0.0, 
"replication_pad2d_out_mps"); +} + +Tensor& replication_pad2d_backward_out_mps(const Tensor& grad_output, const Tensor& input, IntArrayRef padding, Tensor& grad_input) +{ + grad_input.resize_as_(input).zero_(); + return mps::pad_out_template(grad_input, input, padding, grad_output, MPSGraphPaddingModeClampToEdge, 0.0, __func__); +} + +Tensor replication_pad2d_backward_mps(const Tensor& grad_output, const Tensor& input, IntArrayRef padding) +{ + auto grad_input = at::zeros_like(input, LEGACY_CONTIGUOUS_MEMORY_FORMAT); + return mps::pad_out_template(grad_input, input, padding, grad_output, MPSGraphPaddingModeClampToEdge, 0.0, __func__); +} + +// 3D Reflection and Replication Padding +TORCH_IMPL_FUNC(reflection_pad3d_out_mps) +(const Tensor& input, IntArrayRef padding, const Tensor& output) +{ + mps::pad_out_template(const_cast(output), input, padding, c10::nullopt, + MPSGraphPaddingModeReflect, 0.0, "reflection_pad3d_out_mps"); +} + +TORCH_IMPL_FUNC(reflection_pad3d_backward_out_mps) +(const Tensor& grad_output, const Tensor& input, IntArrayRef padding, const Tensor& grad_input) +{ + grad_input.resize_as_(input).zero_(); + mps::pad_out_template(const_cast(grad_input), input, padding, grad_output, + MPSGraphPaddingModeReflect, 0.0, "reflection_pad3d_backward_out_mps"); +} + +TORCH_IMPL_FUNC(replication_pad3d_out_mps) +(const Tensor& input, IntArrayRef padding, const Tensor& output) +{ + mps::pad_out_template(const_cast(output), input, padding, c10::nullopt, + MPSGraphPaddingModeClampToEdge, 0.0, "replication_pad3d_out_mps"); +} + +Tensor& replication_pad3d_backward_out_mps(const Tensor& grad_output, const Tensor& input, IntArrayRef padding, Tensor& grad_input) +{ + grad_input.resize_as_(input).zero_(); + return mps::pad_out_template(grad_input, input, padding, grad_output, MPSGraphPaddingModeClampToEdge, 0.0, __func__); +} + +Tensor replication_pad3d_backward_mps(const Tensor& grad_output, const Tensor& input, IntArrayRef padding) +{ + auto grad_input = at::zeros_like(input, LEGACY_CONTIGUOUS_MEMORY_FORMAT); + return mps::pad_out_template(grad_input, input, padding, grad_output, MPSGraphPaddingModeClampToEdge, 0.0, __func__); +} + +// backward pass is exlicitly handled in autograd by negating the "pad" argument +Tensor constant_pad_nd_mps(const Tensor& self, IntArrayRef pad, const Scalar& value) +{ + Tensor output = at::empty({0}, self.options()); + return mps::pad_out_template(output, self, pad, c10::nullopt, MPSGraphPaddingModeConstant, value.toDouble(), __func__); +} + +} // namespace native +} // namespace at diff --git a/aten/src/ATen/native/mps/operations/PointwiseOps.mm b/aten/src/ATen/native/mps/operations/PointwiseOps.mm index 66427c73e0c75..261749bd269f6 100644 --- a/aten/src/ATen/native/mps/operations/PointwiseOps.mm +++ b/aten/src/ATen/native/mps/operations/PointwiseOps.mm @@ -18,6 +18,11 @@ if (&output != &self) { output.resize_(output.sizes()); } + + if(output.numel() == 0) { + return output; + } + MPSStream* mpsStream = getCurrentMPSStream(); struct CachedGraph : public MPSCachedGraph diff --git a/aten/src/ATen/native/mps/operations/Pooling.mm b/aten/src/ATen/native/mps/operations/Pooling.mm index adf10b16fbfa8..1df24e073239e 100644 --- a/aten/src/ATen/native/mps/operations/Pooling.mm +++ b/aten/src/ATen/native/mps/operations/Pooling.mm @@ -104,7 +104,6 @@ Tensor _mps_max_pool2d( namespace native_mps = at::native::mps; using CachedGraph = native_mps::MPSUnaryCachedGraph; - CheckedFrom c = "mps_max_pool2d"; native_mps::MPSGraphCache* cache_ = 
native_mps::MPSGraphCache::getInstance(); @@ -241,7 +240,6 @@ Tensor mps_max_pool2d_backward( } namespace native_mps = at::native::mps; - CheckedFrom c = "mps_max_pool2d_backward"; // Derive from MPSCachedGraph struct CachedGraph : public native_mps::MPSCachedGraph @@ -383,7 +381,6 @@ Tensor mps_max_pool2d_backward( } /* sizes */ - const int64_t nbatch = input_t.ndimension() == 4 ? input_t.size(-4) : 1; const int64_t nInputPlane = input_t.size(-3); const int64_t inputHeight = input_t.size(-2); const int64_t inputWidth = input_t.size(-1); @@ -399,7 +396,6 @@ Tensor mps_max_pool2d_backward( outputHeight, outputWidth, memory_format); namespace native_mps = at::native::mps; - CheckedFrom c = "max_pool2d_with_indices_out_mps"; // Derive from MPSCachedGraph struct CachedGraph : public native_mps::MPSCachedGraph @@ -541,7 +537,6 @@ Tensor mps_max_pool2d_backward( } namespace native_mps = at::native::mps; - CheckedFrom c = "max_pool2d_with_indices_backward_out_mps"; // Derive from MPSCachedGraph struct CachedGraph : public native_mps::MPSCachedGraph @@ -655,13 +650,7 @@ Tensor mps_max_pool2d_backward( const int padW = safe_downcast(padW_); /* sizes */ - const int64_t nbatch = input_.ndimension() == 4 ? input_.size(-4) : 1; - const int64_t nInputPlane = input_.size(-3); - const int64_t inputHeight = input_.size(-2); - const int64_t inputWidth = input_.size(-1); - int64_t outputWidth = pooling_output_shape(inputWidth, kW, padW, dW, 1, ceil_mode); - int64_t outputHeight = pooling_output_shape(inputHeight, kH, padH, dH, 1, ceil_mode); const auto memory_format = input_.suggest_memory_format(); Tensor input = input_.contiguous(memory_format); @@ -778,8 +767,6 @@ Tensor mps_max_pool2d_backward( const Tensor input = input_.contiguous(memory_format); const Tensor gradOutput = gradOutput_.contiguous(memory_format); - const int64_t nbatch = input.ndimension() == 4 ? input.size(-4) : 1; - const int64_t nInputPlane = input.size(-3); const int64_t inputHeight = input.size(-2); const int64_t inputWidth = input.size(-1); @@ -791,11 +778,8 @@ Tensor mps_max_pool2d_backward( if (count == 0) { return; } - bool use_divisor = divisor_override.has_value(); - const auto divisor_override_value = use_divisor ? 
divisor_override.value() : 0; namespace native_mps = at::native::mps; - CheckedFrom c = "avg_pool2d_backward_out_mps"; // Derive from MPSCachedGraph struct CachedGraph : public native_mps::MPSCachedGraph diff --git a/aten/src/ATen/native/mps/operations/ReduceOps.mm b/aten/src/ATen/native/mps/operations/ReduceOps.mm index 67aeae4ca3cbe..d6e510a06e322 100644 --- a/aten/src/ATen/native/mps/operations/ReduceOps.mm +++ b/aten/src/ATen/native/mps/operations/ReduceOps.mm @@ -407,7 +407,6 @@ Tensor count_nonzero_mps(const Tensor& self, IntArrayRef dims){ "norm_out_mps: reduction dim must be in the range of input shape") } namespace native_mps = at::native::mps; - CheckedFrom c = "norm_out_mps"; using CachedGraph = native_mps::MPSUnaryCachedGraph; diff --git a/aten/src/ATen/native/mps/operations/Repeat.mm b/aten/src/ATen/native/mps/operations/Repeat.mm index b9c465145ffeb..53bcddf405cc6 100644 --- a/aten/src/ATen/native/mps/operations/Repeat.mm +++ b/aten/src/ATen/native/mps/operations/Repeat.mm @@ -36,8 +36,8 @@ Tensor permute_mps(const Tensor& self, IntArrayRef dims) { return self.as_strided(newSizes, newStrides); } -void set_apparent_shapes(NSMutableArray * input_shape, - NSMutableArray * &apparent_input_shape, +void set_apparent_shapes(NSArray * input_shape, + NSArray * &apparent_input_shape, int64_t num_input_dims, IntArrayRef repeats, NSMutableArray * &repeats_shape, @@ -66,13 +66,14 @@ void set_apparent_shapes(NSMutableArray * input_shape, } // num_repeat_dims > num_input_dims else { - apparent_input_shape = [NSMutableArray arrayWithCapacity:num_repeat_dims]; + auto rc = [NSMutableArray arrayWithCapacity:num_repeat_dims]; for(int i = 0; i < num_repeat_dims - num_input_dims; i++) - apparent_input_shape[i] = @1; + rc[i] = @1; for(int i = num_repeat_dims - num_input_dims; i < num_repeat_dims; i++) - apparent_input_shape[i] = input_shape[i + num_input_dims - num_repeat_dims]; + rc[i] = input_shape[i + num_input_dims - num_repeat_dims]; + apparent_input_shape = rc; } } @@ -92,7 +93,7 @@ Tensor repeat_mps(const Tensor& self, IntArrayRef repeats) { MPSGraphCache* cache_ = MPSGraphCache::getInstance(); - NSMutableArray *apparent_input_shape = nil; + NSArray *apparent_input_shape = nil; NSMutableArray *repeats_shape = nil; auto input_shape = getMPSShape(self); diff --git a/aten/src/ATen/native/mps/operations/RnnOps.mm b/aten/src/ATen/native/mps/operations/RnnOps.mm index 0dd1bd6b47a21..f15e842b54b25 100644 --- a/aten/src/ATen/native/mps/operations/RnnOps.mm +++ b/aten/src/ATen/native/mps/operations/RnnOps.mm @@ -52,7 +52,6 @@ MPSGraphCache* cache_ = MPSGraphCache::getInstance(); MPSStream* stream = getCurrentMPSStream(); - int timesteps = (batch_first ? 
input.size(1) : input.size(0)); @autoreleasepool { string key = "lstm_" + getTensorsStringKey({input, hx[0], hx[1]}) + getMPSTypeString(input.scalar_type()) + "_num_layers_" + std::to_string(num_layers); @@ -82,7 +81,6 @@ opDesc.bidirectional = bidirectional; opDesc.produceCell = true; - MPSShape* inputShape = getMPSShape(input); MPSGraphTensor* inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, getMPSDataType(input.scalar_type()), getMPSShape(input)); MPSGraphTensor* stateTensor = mpsGraphRankedPlaceHolder(mpsGraph, getMPSDataType(input.scalar_type()), getMPSShape(hx[0])); MPSGraphTensor* cellStateTensor = mpsGraphRankedPlaceHolder(mpsGraph, getMPSDataType(input.scalar_type()), getMPSShape(hx[1])); @@ -332,7 +330,6 @@ NSMutableArray* gradRecWeightsArray = [[NSMutableArray alloc] initWithCapacity:num_layers]; NSMutableArray* gradWeightsArray = [[NSMutableArray alloc] initWithCapacity:num_layers]; NSMutableArray* gradBiasArray = [[NSMutableArray alloc] initWithCapacity:num_layers]; - NSMutableArray* gradRecBiasArray = [[NSMutableArray alloc] initWithCapacity:num_layers]; NSMutableArray* gradStateArray = [[NSMutableArray alloc] initWithCapacity:num_layers]; NSMutableArray* gradCellStateArray = [[NSMutableArray alloc] initWithCapacity:num_layers]; diff --git a/aten/src/ATen/native/mps/operations/ScatterGather.mm b/aten/src/ATen/native/mps/operations/ScatterGather.mm index a8d73d5fc42a5..c4943d1242d96 100644 --- a/aten/src/ATen/native/mps/operations/ScatterGather.mm +++ b/aten/src/ATen/native/mps/operations/ScatterGather.mm @@ -9,9 +9,7 @@ #include #include -#ifdef __OBJC__ #include -#endif namespace at { namespace native { @@ -120,10 +118,10 @@ toType:getMPSDataType(ScalarType::Int) name:(NSString * _Nonnull)nil]; - MPSGraphTensor* outputTensor = [mpsGraph gatherAlongAxisWithUpdatesTensor:getInput - indicesTensor:castIndexTensor - axis:(NSInteger)dim - name:nil]; + MPSGraphTensor* outputTensor = [mpsGraph gatherAlongAxis: (NSInteger) dim + withUpdatesTensor: getInput + indicesTensor: castIndexTensor + name: nil]; newCachedGraph->inputTensor_ = inputTensor; newCachedGraph->indexTensor_ = indexTensor; @@ -279,7 +277,7 @@ getSrc = srcTensor; // Use in case input needs to be smaller to get scatter - NSMutableArray* scatterInputShape = nil; + NSArray* scatterInputShape = nil; // Slice into the input tensor IF NEEDED if(inputNeedSlice) { @@ -287,7 +285,7 @@ NSMutableArray *ends = [NSMutableArray arrayWithCapacity:num_input_dims]; NSMutableArray *strides = [NSMutableArray arrayWithCapacity:num_input_dims]; - scatterInputShape = [NSMutableArray arrayWithCapacity:num_input_dims]; + auto rc = [NSMutableArray arrayWithCapacity:num_input_dims]; for(int i = 0; i < num_input_dims; i++) { // All strides are 1 @@ -296,13 +294,14 @@ starts[i] = @0; if(i != dim) { ends[i] = index_shape[i]; - scatterInputShape[i] = index_shape[i]; + rc[i] = index_shape[i]; } else { ends[i] = input_shape[i]; - scatterInputShape[i] = input_shape[i]; + rc[i] = input_shape[i]; } } + scatterInputShape = rc; getInput = [mpsGraph sliceTensor:inputTensor starts:starts @@ -336,21 +335,21 @@ scatter_mode = MPSGraphScatterModeMin; if(!inputNeedSlice) { - outputTensor = [mpsGraph scatterAlongAxisWithDataTensor:getInput - updatesTensor:getSrc - indicesTensor:castIndexTensor - axis:(NSInteger)dim - mode:scatter_mode - name:nil]; + outputTensor = [mpsGraph scatterAlongAxis: (NSInteger) dim + withDataTensor: getInput + updatesTensor: getSrc + indicesTensor: castIndexTensor + mode: scatter_mode + name: nil]; } else { // Scatter this into the 
input with set mode - MPSGraphTensor* scatterTensor = [mpsGraph scatterAlongAxisWithDataTensor:getInput - updatesTensor:getSrc - indicesTensor:castIndexTensor - axis:(NSInteger)dim - mode:scatter_mode - name:nil]; + MPSGraphTensor* scatterTensor = [mpsGraph scatterAlongAxis: (NSInteger) dim + withDataTensor: getInput + updatesTensor: getSrc + indicesTensor: castIndexTensor + mode: scatter_mode + name: nil]; // Make an array of scatter indices tensors NSMutableArray* indicesTensors = [NSMutableArray arrayWithCapacity:num_input_dims]; @@ -372,9 +371,9 @@ for(int i = 0; i < num_input_dims; i++) { MPSGraphTensor* axisTensor = [mpsGraph constantWithScalar:i dataType:MPSDataTypeInt32]; - MPSGraphTensor* scatter_currentIndexTensor = [mpsGraph getCoordinateValueWithShapeTensor:scatterInputShapeTensor - axisTensor:axisTensor - name:nil]; + MPSGraphTensor* scatter_currentIndexTensor = [mpsGraph coordinateAlongAxisTensor: axisTensor + withShapeTensor: scatterInputShapeTensor + name: nil]; scatter_currentIndexTensor = [mpsGraph reshapeTensor:scatter_currentIndexTensor withShape:@[@-1, @1] name:nil]; diff --git a/aten/src/ATen/native/mps/operations/Shape.mm b/aten/src/ATen/native/mps/operations/Shape.mm index 977f9f1ce3fae..99dfbcecc24a9 100644 --- a/aten/src/ATen/native/mps/operations/Shape.mm +++ b/aten/src/ATen/native/mps/operations/Shape.mm @@ -16,288 +16,6 @@ namespace at { namespace native { -namespace mps { - -// Pad operations (1D/2D/3D forward and backward) -Tensor& pad_out_template(Tensor &output, const Tensor &input_, IntArrayRef padding, - const c10::optional& grad_output_opt, - MPSGraphPaddingMode mode, double constantValue, const string op_name) -{ - const int padding_size = (int) padding.size(); - const int padding_dim = padding_size / 2; // either 1D, 2D, or 3D - - TORCH_CHECK(padding_size == 2 || padding_size == 4 || padding_size == 6, - "invalid padding argument of size ", padding_size); - - const Tensor& grad_output_ = *(at::borrow_from_optional_tensor(grad_output_opt)); - const bool is_backward_pass = grad_output_.defined(); - - int dim_w = padding_dim, dim_h = padding_dim - 1, dim_d = padding_dim - 2, dim_slices = 0; - int64_t nbatch = 1, ndims = input_.ndimension(); - - if (!is_backward_pass) { - bool valid_dims = input_.size(1) != 0 && input_.size(padding_dim) != 0; - TORCH_CHECK((ndims == 1 + padding_dim && valid_dims) || - (ndims == 2 + padding_dim && valid_dims && input_.size(1 + padding_dim) != 0), - "3D or 4D (batch mode) tensor expected for input, but got: ", input_); - } - - if (ndims == 2 + padding_dim) { - nbatch = input_.size(0); - dim_w++; - dim_h++; - dim_d++; - dim_slices++; - } - - int64_t pad_l = padding[0]; - int64_t pad_r = padding[1]; - int64_t pad_t = padding_dim > 1 ? padding[2] : 0; - int64_t pad_b = padding_dim > 1 ? padding[3] : 0; - int64_t pad_front = padding_dim > 2 ? padding[4] : 0; - int64_t pad_back = padding_dim > 2 ? padding[5] : 0; - - int64_t nplane = input_.size(dim_slices); - int64_t input_w = input_.size(dim_w); - int64_t output_w = input_w + pad_l + pad_r; - int64_t input_h = padding_dim > 1 ? input_.size(dim_h) : 0; - int64_t output_h = padding_dim > 1 ? input_h + pad_t + pad_b : 0; - int64_t input_d = padding_dim > 2 ? input_.size(dim_d) : 0; - int64_t output_d = padding_dim > 2 ? 
input_d + pad_front + pad_back : 0; - - Tensor grad_output, input = input_; - - if (!is_backward_pass) { - TORCH_CHECK(pad_l < input_w && pad_r < input_w, - "Argument #4: Padding size should be less than the corresponding " - "input dimension, but got: padding (", pad_l, ", ", pad_r, - ") at dimension ", dim_w, " of input ", ndims); - - if (padding_dim > 1) { - TORCH_CHECK(pad_t < input_h && pad_b < input_h, - "Argument #6: Padding size should be less than the corresponding " - "input dimension, but got: padding (", pad_t, ", ", pad_b, - ") at dimension ", dim_h, " of input ", ndims); - } - TORCH_CHECK(output_w >= 1 || output_h >= padding_dim - 1, - "input (H: ", input_h, ", W: ", input_w, ") is too small. Calculated " - "output H: ", output_h, " W: ", output_w); - - if (ndims == 1 + padding_dim) { - if (padding_dim == 3) - output.resize_({nplane, output_d, output_h, output_w}); - else if (padding_dim == 2) - output.resize_({nplane, output_h, output_w}); - else - output.resize_({nplane, output_w}); - } else { - if (padding_dim == 3) - output.resize_({nbatch, nplane, output_d, output_h, output_w}); - else if (padding_dim == 2) - output.resize_({nbatch, nplane, output_h, output_w}); - else - output.resize_({nbatch, nplane, output_w}); - } - if (output.numel() == 0 || input_.numel() == 0) - return output; - input = input_.contiguous(); - } else { - TORCH_CHECK(output_w == grad_output_.size(dim_w), - "gradOutput width unexpected. Expected: ", output_w, ", Got: ", grad_output_.size(dim_w)); - if (padding_dim > 1) { - TORCH_CHECK(output_h == grad_output_.size(dim_h), - "gradOutput height unexpected. Expected: ", output_h, ", Got: ", grad_output_.size(dim_h)); - } - grad_output = grad_output_.contiguous(); - } - - const int64_t input_dim = input.dim(); - MPSShape *leftPadding = nullptr, *rightPadding = nullptr; - if (padding_dim == 3) { - leftPadding = [NSArray arrayWithObjects:(const NSNumber*[]){ @(0), @(0), @(pad_front), @(pad_t), @(pad_l) } count:input_dim]; - rightPadding = [NSArray arrayWithObjects:(const NSNumber*[]){ @(0), @(0), @(pad_back), @(pad_b), @(pad_r) } count:input_dim]; - } else if (padding_dim == 2) { - leftPadding = [NSArray arrayWithObjects:(const NSNumber*[]){ @(0), @(0), @(pad_t), @(pad_l) } count:input_dim]; - rightPadding = [NSArray arrayWithObjects:(const NSNumber*[]){ @(0), @(0), @(pad_b), @(pad_r) } count:input_dim]; - } else if (padding_dim == 1) { - leftPadding = [NSArray arrayWithObjects:(const NSNumber*[]){ @(0), @(0), @(pad_l) } count:input_dim]; - rightPadding = [NSArray arrayWithObjects:(const NSNumber*[]){ @(0), @(0), @(pad_r) } count:input_dim]; - } - - struct CachedGraph : public MPSCachedGraph { - CachedGraph(MPSGraph *graph) : MPSCachedGraph(graph) { } - MPSGraphTensor *inputTensor = nil, *outputTensor = nil; - MPSGraphTensor *gradOutputTensor = nil; - }; - MPSGraphCache* cache_ = MPSGraphCache::getInstance(); - - @autoreleasepool { - string key = op_name + getTensorsStringKey({input, grad_output}) + - ":L" + to_string(pad_l) + ":R" + to_string(pad_r) + - ":T" + to_string(pad_t) + ":B" + to_string(pad_b) + - ":F" + to_string(pad_front) + ":K" + to_string(pad_back); - - CachedGraph* cachedGraph = static_cast(cache_->LookUp(key)); - if(!cachedGraph) { - cachedGraph = static_cast(cache_->CreateCachedGraph(key, ^ MPSCachedGraph * () { - CachedGraph *newCachedGraph = nil; - @autoreleasepool { - MPSGraph* mpsGraph = make_mps_graph(); - newCachedGraph = new CachedGraph(mpsGraph); - newCachedGraph->inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, input); - if 
(!is_backward_pass) { - newCachedGraph->outputTensor = [mpsGraph padTensor:newCachedGraph->inputTensor - withPaddingMode:mode - leftPadding:leftPadding - rightPadding:rightPadding - constantValue:constantValue - name:nil]; - } else { - newCachedGraph->gradOutputTensor = mpsGraphRankedPlaceHolder(mpsGraph, grad_output); - newCachedGraph->outputTensor = [mpsGraph padGradientWithIncomingGradientTensor:newCachedGraph->gradOutputTensor - sourceTensor:newCachedGraph->inputTensor - paddingMode:mode - leftPadding:leftPadding - rightPadding:rightPadding - name:nil]; - } - } - return newCachedGraph; - })); - } - Placeholder inputPlaceholder = Placeholder(cachedGraph->inputTensor, input); - Placeholder outputPlaceholder = Placeholder(cachedGraph->outputTensor, output); - - NSMutableDictionary *feeds = [[NSMutableDictionary new] autorelease]; - feeds[inputPlaceholder.getMPSGraphTensor()] = inputPlaceholder.getMPSGraphTensorData(); - if (is_backward_pass) { - Placeholder gradOutputPlaceholder = Placeholder(cachedGraph->gradOutputTensor, grad_output); - feeds[gradOutputPlaceholder.getMPSGraphTensor()] = gradOutputPlaceholder.getMPSGraphTensorData(); - } - NSDictionary* results = @{ - outputPlaceholder.getMPSGraphTensor() : outputPlaceholder.getMPSGraphTensorData() - }; - runMPSGraph(getCurrentMPSStream(), cachedGraph->graph(), feeds, results); - } - return output; -} -} // namespace mps - -// 1D Reflection and Replication Padding -TORCH_IMPL_FUNC(reflection_pad1d_out_mps) -(const Tensor& input, IntArrayRef padding, const Tensor& output) -{ - mps::pad_out_template(const_cast(output), input, padding, c10::nullopt, - MPSGraphPaddingModeReflect, 0.0, "reflection_pad1d_out_mps"); -} - -TORCH_IMPL_FUNC(reflection_pad1d_backward_out_mps) -(const Tensor& grad_output, const Tensor& input, IntArrayRef padding, const Tensor& grad_input) -{ - grad_input.resize_as_(input).zero_(); - mps::pad_out_template(const_cast(grad_input), input, padding, grad_output, - MPSGraphPaddingModeReflect, 0.0, "reflection_pad1d_backward_out_mps"); -} - -TORCH_IMPL_FUNC(replication_pad1d_out_mps) -(const Tensor& input, IntArrayRef padding, const Tensor& output) -{ - mps::pad_out_template(const_cast(output), input, padding, c10::nullopt, - MPSGraphPaddingModeClampToEdge, 0.0, "replication_pad1d_out_mps"); -} - -TORCH_IMPL_FUNC(replication_pad1d_backward_out_mps) -(const Tensor& grad_output, const Tensor& input, IntArrayRef padding, const Tensor& grad_input) -{ - grad_input.resize_as_(input).zero_(); - mps::pad_out_template(const_cast(grad_input), input, padding, grad_output, - MPSGraphPaddingModeClampToEdge, 0.0, "replication_pad1d_backward_out_mps"); -} - -// 2D Reflection and Replication Padding -Tensor& reflection_pad2d_out_mps(const Tensor& input, IntArrayRef padding, Tensor& output) -{ - return mps::pad_out_template(output, input, padding, c10::nullopt, MPSGraphPaddingModeReflect, 0.0, __func__); -} - -Tensor reflection_pad2d_mps(const Tensor& input, IntArrayRef padding) -{ - Tensor output = at::empty({0}, input.options()); - return mps::pad_out_template(output, input, padding, c10::nullopt, MPSGraphPaddingModeReflect, 0.0, __func__); -} - -Tensor& reflection_pad2d_backward_out_mps(const Tensor& grad_output, const Tensor& input, IntArrayRef padding, Tensor& grad_input) -{ - grad_input.resize_as_(input).zero_(); - return mps::pad_out_template(grad_input, input, padding, grad_output, MPSGraphPaddingModeReflect, 0.0, __func__); -} - -Tensor reflection_pad2d_backward_mps(const Tensor& grad_output, const Tensor& input, IntArrayRef 
padding) -{ - auto grad_input = at::zeros_like(input, LEGACY_CONTIGUOUS_MEMORY_FORMAT); - return mps::pad_out_template(grad_input, input, padding, grad_output, MPSGraphPaddingModeReflect, 0.0, __func__); -} - -TORCH_IMPL_FUNC(replication_pad2d_out_mps) -(const Tensor& input, IntArrayRef padding, const Tensor& output) -{ - mps::pad_out_template(const_cast(output), input, padding, c10::nullopt, - MPSGraphPaddingModeClampToEdge, 0.0, "replication_pad2d_out_mps"); -} - -Tensor& replication_pad2d_backward_out_mps(const Tensor& grad_output, const Tensor& input, IntArrayRef padding, Tensor& grad_input) -{ - grad_input.resize_as_(input).zero_(); - return mps::pad_out_template(grad_input, input, padding, grad_output, MPSGraphPaddingModeClampToEdge, 0.0, __func__); -} - -Tensor replication_pad2d_backward_mps(const Tensor& grad_output, const Tensor& input, IntArrayRef padding) -{ - auto grad_input = at::zeros_like(input, LEGACY_CONTIGUOUS_MEMORY_FORMAT); - return mps::pad_out_template(grad_input, input, padding, grad_output, MPSGraphPaddingModeClampToEdge, 0.0, __func__); -} - -// 3D Reflection and Replication Padding -TORCH_IMPL_FUNC(reflection_pad3d_out_mps) -(const Tensor& input, IntArrayRef padding, const Tensor& output) -{ - mps::pad_out_template(const_cast(output), input, padding, c10::nullopt, - MPSGraphPaddingModeReflect, 0.0, "reflection_pad3d_out_mps"); -} - -TORCH_IMPL_FUNC(reflection_pad3d_backward_out_mps) -(const Tensor& grad_output, const Tensor& input, IntArrayRef padding, const Tensor& grad_input) -{ - grad_input.resize_as_(input).zero_(); - mps::pad_out_template(const_cast(grad_input), input, padding, grad_output, - MPSGraphPaddingModeReflect, 0.0, "reflection_pad3d_backward_out_mps"); -} - -TORCH_IMPL_FUNC(replication_pad3d_out_mps) -(const Tensor& input, IntArrayRef padding, const Tensor& output) -{ - mps::pad_out_template(const_cast(output), input, padding, c10::nullopt, - MPSGraphPaddingModeClampToEdge, 0.0, "replication_pad3d_out_mps"); -} - -Tensor& replication_pad3d_backward_out_mps(const Tensor& grad_output, const Tensor& input, IntArrayRef padding, Tensor& grad_input) -{ - grad_input.resize_as_(input).zero_(); - return mps::pad_out_template(grad_input, input, padding, grad_output, MPSGraphPaddingModeClampToEdge, 0.0, __func__); -} - -Tensor replication_pad3d_backward_mps(const Tensor& grad_output, const Tensor& input, IntArrayRef padding) -{ - auto grad_input = at::zeros_like(input, LEGACY_CONTIGUOUS_MEMORY_FORMAT); - return mps::pad_out_template(grad_input, input, padding, grad_output, MPSGraphPaddingModeClampToEdge, 0.0, __func__); -} - -// backward pass is exlicitly handled in autograd by negating the "pad" argument -Tensor constant_pad_nd_mps(const Tensor& self, IntArrayRef pad, const Scalar& value) -{ - Tensor output = at::empty({0}, self.options()); - return mps::pad_out_template(output, self, pad, c10::nullopt, MPSGraphPaddingModeConstant, value.toDouble(), __func__); -} // topk TORCH_IMPL_FUNC(topk_out_mps) @@ -534,7 +252,6 @@ void check_shape_except_dim(const Tensor &first, const Tensor &second, }; const Tensor* notSkippedTensor = NULL; // non-owning reference - int nDims = 0; // Check for type promotion TORCH_CHECK( @@ -570,7 +287,6 @@ void check_shape_except_dim(const Tensor &first, const Tensor &second, continue; } input_tensors.push_back(&t); - nDims = t.dim(); // TODO: Is this OK? 
notSkippedTensor = &t; tensor_idx++; @@ -879,7 +595,6 @@ void upsample_out_mps(const Tensor& input, int64_t output_width = output_size[1]; @autoreleasepool { MPSShape* input_shape = getMPSShape(input); - NSString* ns_shape_key = [[input_shape valueForKey:@"description"] componentsJoinedByString:@","]; string key = string("upsample_2d:") + mps::getMPSShapeString(input_shape) + ":" + getMPSTypeString(input.scalar_type()) + ":h" + to_string(output_height) + ":w" + to_string(output_width) + @@ -992,7 +707,6 @@ void upsample1d_out_mps(const Tensor& input, int64_t out_size = output_size[0]; @autoreleasepool { MPSShape* input_shape = getMPSShape(input); - NSString* ns_shape_key = [[input_shape valueForKey:@"description"] componentsJoinedByString:@","]; string key = string("upsample_1d:") + mps::getMPSShapeString(input_shape) + ":" + getMPSTypeString(input.scalar_type()) + ":size" + to_string(out_size) + diff --git a/aten/src/ATen/native/mps/operations/SoftMax.mm b/aten/src/ATen/native/mps/operations/SoftMax.mm index 4246a37671e99..e96d5ed2481c3 100644 --- a/aten/src/ATen/native/mps/operations/SoftMax.mm +++ b/aten/src/ATen/native/mps/operations/SoftMax.mm @@ -216,8 +216,6 @@ void get_shapes(MPSShape* input_shape_readonly, @autoreleasepool { MPSShape* grad_shape = mps::getMPSShape(grad); - int num_grad_dims = [grad_shape count]; - NSString* ns_shape_key = [[grad_shape valueForKey:@"description"] componentsJoinedByString:@","]; string key = "softmax_backward_mps_out:" + getMPSTypeString(output.scalar_type()) + ":" diff --git a/aten/src/ATen/native/mps/operations/TensorCompare.mm b/aten/src/ATen/native/mps/operations/TensorCompare.mm index a6c267290312b..fb3b93a602f1a 100644 --- a/aten/src/ATen/native/mps/operations/TensorCompare.mm +++ b/aten/src/ATen/native/mps/operations/TensorCompare.mm @@ -245,8 +245,6 @@ void clamp_scalar_out_mps(const Tensor& input_t, @autoreleasepool { - MPSShape* input_shape = getMPSShape(self); - string key = "where_self_out_mps:" + getTensorsStringKey({cond_bool, self, other}); CachedGraph* cachedGraph = static_cast(cache_->LookUp(key)); @@ -304,10 +302,6 @@ Tensor where_mps(const Tensor& condition, const Tensor& self, const Tensor& other) { - auto cond_shape = condition.sizes(); - auto self_shape = self.sizes(); - auto other_shape = other.sizes(); - bool cond_zero_shape = (condition.dim() == 0); bool self_zero_shape = (self.dim() == 0); bool other_zero_shape = (other.dim() == 0); diff --git a/aten/src/ATen/native/mps/operations/TriangularOps.mm b/aten/src/ATen/native/mps/operations/TriangularOps.mm index 6a29d080cb6c9..fb6e1c52ba49e 100644 --- a/aten/src/ATen/native/mps/operations/TriangularOps.mm +++ b/aten/src/ATen/native/mps/operations/TriangularOps.mm @@ -9,9 +9,7 @@ #include #include -#ifdef __OBJC__ #include -#endif namespace at { namespace native { @@ -275,9 +273,9 @@ MPSGraphTensor* inputShapeTensor = [mpsGraph constantWithData:[NSData dataWithBytes:shape_data length:sizeof(int)] shape:@[@1] dataType:MPSDataTypeInt32]; - numDiagElementsRange = [mpsGraph getCoordinateValueWithShapeTensor:inputShapeTensor - axisTensor:zeroTensor - name:nil]; + numDiagElementsRange = [mpsGraph coordinateAlongAxisTensor: zeroTensor + withShapeTensor: inputShapeTensor + name: nil]; diagOffset = [mpsGraph constantWithScalar:diagonal dataType:MPSDataTypeInt32]; rowMultiplier = [mpsGraph constantWithScalar:[num_output_cols intValue] @@ -288,9 +286,9 @@ MPSGraphTensor* outputShapeTensor = [mpsGraph constantWithData:[NSData dataWithBytes:shape_data length:sizeof(int)] shape:@[@1] 
dataType:MPSDataTypeInt32]; - numDiagElementsRange = [mpsGraph getCoordinateValueWithShapeTensor:outputShapeTensor - axisTensor:zeroTensor - name:nil]; + numDiagElementsRange = [mpsGraph coordinateAlongAxisTensor: zeroTensor + withShapeTensor: outputShapeTensor + name: nil]; diagOffset = [mpsGraph constantWithScalar:diagonal dataType:MPSDataTypeInt32]; rowMultiplier = [mpsGraph constantWithScalar:[num_input_cols intValue] diff --git a/aten/src/ATen/native/mps/operations/View.mm b/aten/src/ATen/native/mps/operations/View.mm index 4fa614ae6e2c6..a8a55b21d2468 100644 --- a/aten/src/ATen/native/mps/operations/View.mm +++ b/aten/src/ATen/native/mps/operations/View.mm @@ -86,7 +86,8 @@ static MPSGraphTensor* chainViewOperation(ViewCachedGraph* cachedGraph, const IntArrayRef& size, const IntArrayRef& stride, int64_t offset, - const IntArrayRef& base_shape, bool needsScatter) + const IntArrayRef& base_shape, bool needsScatter, + const bool needsBoolCast) { MPSGraph* mpsGraph = cachedGraph->graph(); MPSGraphTensor *outputTensor = nil; @@ -126,7 +127,17 @@ indicesTensor = [mpsGraph additionWithPrimaryTensor: indicesTensor secondaryTensor: cachedGraph->storageOffsetTensor name: nil]; - MPSGraphTensor *reshapedInputTensor = [mpsGraph reshapeTensor: cachedGraph->inputTensor + MPSGraphTensor *inputTensor = cachedGraph->inputTensor; + + // Workaround for bool scatter/gather deficiency + // See https://github.com/pytorch/pytorch/issues/82663 + if (needsBoolCast) { + inputTensor = [mpsGraph castTensor:inputTensor + toType:MPSDataTypeInt8 + name:@"Cast away from bool"]; + } + + MPSGraphTensor *reshapedInputTensor = [mpsGraph reshapeTensor: inputTensor withShape: @[@-1] name: nil]; MPSGraphTensor *reshapedIndicesTensor = [mpsGraph reshapeTensor: indicesTensor @@ -154,6 +165,14 @@ withShapeTensor: shapeTensor name: nil]; } + + // Workaround for bool scatter/gather deficiency + // See https://github.com/pytorch/pytorch/issues/82663 + if (needsBoolCast) { + outputTensor = [mpsGraph castTensor:outputTensor + toType:MPSDataTypeBool + name:@"Cast back to bool"]; + } } return outputTensor; } @@ -205,6 +224,7 @@ if (inputType == MPSDataTypeUInt8) { inputType = MPSDataTypeInt8; } + auto needsBoolCast = inputType == MPSDataTypeBool; // Self is the input tensor we are creating view of newCachedGraph->inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, inputType, getMPSShape(base_shape)); newCachedGraph->storageOffsetTensor = mpsGraphRankedPlaceHolder(mpsGraph, MPSDataTypeInt32, @[@1]); @@ -214,7 +234,7 @@ if (needsScatter) { newCachedGraph->updatesTensor = mpsGraphUnrankedPlaceHolder(mpsGraph, getMPSDataType(self.scalar_type())); } - newCachedGraph->outputTensor = chainViewOperation(newCachedGraph, size, stride, storage_offset, base_shape, needsScatter); + newCachedGraph->outputTensor = chainViewOperation(newCachedGraph, size, stride, storage_offset, base_shape, needsScatter, needsBoolCast); } return newCachedGraph; })); diff --git a/aten/src/ATen/native/native_functions.yaml b/aten/src/ATen/native/native_functions.yaml index ab6d38e553d30..848a84a7a55e1 100644 --- a/aten/src/ATen/native/native_functions.yaml +++ b/aten/src/ATen/native/native_functions.yaml @@ -131,6 +131,7 @@ variants: function dispatch: CompositeExplicitAutograd: _new_zeros_with_same_feature_meta + autogen: _new_zeros_with_same_feature_meta.out # This function compares the storage numel of self with that of other, where # storage numel is cumputed as: `other.storage().nbytes() / other.itemsize()`. 
@@ -181,12 +182,14 @@ device_check: NoCheck # log_probs is expected to be on CUDA while targets is expected to be on CPU dispatch: CUDA: _cudnn_ctc_loss + autogen: _cudnn_ctc_loss.out - func: _use_cudnn_rnn_flatten_weight() -> bool - func: _cudnn_rnn_flatten_weight(Tensor[] weight_arr, int weight_stride0, int input_size, int mode, int hidden_size, int proj_size, int num_layers, bool batch_first, bool bidirectional) -> Tensor dispatch: CUDA: _cudnn_rnn_flatten_weight + autogen: _cudnn_rnn_flatten_weight.out - func: _cudnn_rnn(Tensor input, Tensor[] weight, int weight_stride0, Tensor? weight_buf, Tensor hx, Tensor? cx, int mode, int hidden_size, int proj_size, int num_layers, bool batch_first, float dropout, bool train, bool bidirectional, int[] batch_sizes, Tensor? dropout_state) -> (Tensor, Tensor, Tensor, Tensor, Tensor) # rnn_tanh may or may not redispatch to _cudnn_rnn based on algorithm and build. Thus it might hit dispatch or kernel device check. @@ -194,14 +197,17 @@ device_check: NoCheck dispatch: CUDA: _cudnn_rnn + autogen: _cudnn_rnn.out - func: _cudnn_rnn_backward(Tensor input, Tensor[] weight, int weight_stride0, Tensor weight_buf, Tensor hx, Tensor? cx, Tensor output, Tensor? grad_output, Tensor? grad_hy, Tensor? grad_cy, int mode, int hidden_size, int proj_size, int num_layers, bool batch_first, float dropout, bool train, bool bidirectional, int[] batch_sizes, Tensor? dropout_state, Tensor reserve, bool[4] output_mask) -> (Tensor, Tensor, Tensor, Tensor[]) dispatch: CUDA: _cudnn_rnn_backward + autogen: _cudnn_rnn_backward.out - func: _cudnn_init_dropout_state(float dropout, bool train, int dropout_seed, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=False) -> Tensor dispatch: CUDA: _cudnn_init_dropout_state + autogen: _cudnn_init_dropout_state.out - func: _debug_has_internal_overlap(Tensor self) -> int variants: function @@ -211,23 +217,28 @@ dispatch: CUDA: fused_dropout_cuda tags: nondeterministic_seeded + autogen: _fused_dropout.out - func: _masked_scale(Tensor self, Tensor mask, float scale) -> Tensor variants: function dispatch: CUDA: masked_scale_cuda + autogen: _masked_scale.out - func: native_dropout(Tensor input, float p, bool? train) -> (Tensor, Tensor) variants: function dispatch: CPU: native_dropout_cpu CUDA: native_dropout_cuda + NestedTensorCPU, NestedTensorCUDA: native_dropout_nested tags: nondeterministic_seeded + autogen: native_dropout.out - func: native_dropout_backward(Tensor grad_output, Tensor mask, float scale) -> Tensor dispatch: - CPU: native_dropout_backward_cpu + CPU, NestedTensorCPU, NestedTensorCUDA: native_dropout_backward CUDA: native_dropout_backward_cuda + autogen: native_dropout_backward.out - func: _sobol_engine_draw(Tensor quasi, int n, Tensor sobolstate, int dimension, int num_generated, ScalarType? dtype) -> (Tensor, Tensor) @@ -242,27 +253,28 @@ - func: _shape_as_tensor(Tensor self) -> Tensor - func: dropout(Tensor input, float p, bool train) -> Tensor - dispatch: - CompositeImplicitAutograd: dropout - NestedTensorCPU, NestedTensorCUDA: dropout_nested tags: nondeterministic_seeded - func: dropout_(Tensor(a!) self, float p, bool train) -> Tensor(a!) - dispatch: - CompositeImplicitAutograd: dropout_ - NestedTensorCPU, NestedTensorCUDA: dropout_nested_ + tags: nondeterministic_seeded - func: feature_dropout(Tensor input, float p, bool train) -> Tensor + tags: nondeterministic_seeded - func: feature_dropout_(Tensor(a!) self, float p, bool train) -> Tensor(a!) 
+ tags: nondeterministic_seeded - func: alpha_dropout(Tensor input, float p, bool train) -> Tensor + tags: nondeterministic_seeded - func: alpha_dropout_(Tensor(a!) self, float p, bool train) -> Tensor(a!) + tags: nondeterministic_seeded - func: feature_alpha_dropout(Tensor input, float p, bool train) -> Tensor + tags: nondeterministic_seeded - func: feature_alpha_dropout_(Tensor(a!) self, float p, bool train) -> Tensor(a!) + tags: nondeterministic_seeded - func: abs(Tensor self) -> Tensor device_check: NoCheck # TensorIterator @@ -393,6 +405,7 @@ dispatch: CompositeExplicitAutograd: _conj_physical SparseCsrCPU, SparseCsrCUDA: conj_physical_sparse_csr + autogen: _conj_physical.out - func: conj_physical(Tensor self) -> Tensor variants: function, method @@ -567,6 +580,7 @@ variants: function dispatch: CompositeExplicitAutograd: affine_grid_generator + autogen: affine_grid_generator.out - func: affine_grid_generator_backward(Tensor grad, int[] size, bool align_corners) -> Tensor variants: function @@ -594,6 +608,7 @@ - func: allclose(Tensor self, Tensor other, float rtol=1e-05, float atol=1e-08, bool equal_nan=False) -> bool variants: function, method + tags: data_dependent_output dispatch: CompositeExplicitAutograd: allclose @@ -626,15 +641,17 @@ dispatch: CompositeExplicitAutograd: arange -# Note [arange.start_step schema] -# We want `arange.start_step` to be grouped up with `arange.start_out`, -# But this doesn't happen automatically because the step argument -# is defaultable for .start_out but not for .start_step. -# We should probably just make "step" a defaultable param on arange.start, -# and kill arange.start_step. -- func: arange.start_step(Scalar start, Scalar end, Scalar step, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor +# This operator should be named `aragne.start_out` if following the naming convention. However that +# name is already taken. Disabled because of CI job failures. +# FIXME: enable this +#- func: arange.start_out_(Scalar start, Scalar end, *, Tensor(a!) out) -> Tensor(a!) +# dispatch: +# CompositeExplicitAutograd: arange_start_out + +- func: arange.start_step(Scalar start, Scalar end, Scalar step=1, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: arange + cpp_no_default_args: ['step'] - func: arange.out(Scalar end, *, Tensor(a!) out) -> Tensor(a!) dispatch: @@ -645,6 +662,7 @@ CPU, Meta: arange_out CUDA: arange_cuda_out MPS: arange_mps_out + cpp_no_default_args: ['step'] # This function is a temporary hack to allow tracing of arange like constructs with dynamic # bounds on arange. Normal arange is not traceable because it does not take any tensor inputs; @@ -888,16 +906,19 @@ - func: bartlett_window(int window_length, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: bartlett_window + autogen: bartlett_window.out - func: bartlett_window.periodic(int window_length, bool periodic, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: bartlett_window + autogen: bartlett_window.periodic_out - func: batch_norm(Tensor input, Tensor? weight, Tensor? bias, Tensor? running_mean, Tensor? running_var, bool training, float momentum, float eps, bool cudnn_enabled) -> Tensor - func: quantized_batch_norm(Tensor input, Tensor? weight, Tensor? 
bias, Tensor mean, Tensor var, float eps, float output_scale, int output_zero_point) -> Tensor dispatch: QuantizedCPU: quantized_batch_norm + autogen: quantized_batch_norm.out - func: _batch_norm_impl_index(Tensor input, Tensor? weight, Tensor? bias, Tensor? running_mean, Tensor? running_var, bool training, float momentum, float eps, bool cudnn_enabled) -> (Tensor, Tensor, Tensor, Tensor, int) @@ -914,6 +935,7 @@ - func: bernoulli.out(Tensor self, *, Generator? generator=None, Tensor(a!) out) -> Tensor(a!) device_check: NoCheck # TensorIterator variants: function + tags: nondeterministic_seeded dispatch: CPU, CUDA: bernoulli_out MPS: bernoulli_out_mps @@ -921,6 +943,7 @@ - func: bernoulli_.Tensor(Tensor(a!) self, Tensor p, *, Generator? generator=None) -> Tensor(a!) device_check: NoCheck # TensorIterator variants: method + tags: nondeterministic_seeded dispatch: CPU, CUDA: bernoulli_ MPS: bernoulli_mps_ @@ -929,6 +952,7 @@ - func: bernoulli_.float(Tensor(a!) self, float p=0.5, *, Generator? generator=None) -> Tensor(a!) device_check: NoCheck # TensorIterator variants: method + tags: nondeterministic_seeded dispatch: CPU, CUDA: bernoulli_ MPS: bernoulli_mps_ @@ -985,6 +1009,7 @@ variants: function dispatch: CompositeExplicitAutograd: binary_cross_entropy_with_logits + autogen: binary_cross_entropy_with_logits.out - func: bincount(Tensor self, Tensor? weights=None, int minlength=0) -> Tensor variants: function, method @@ -992,6 +1017,7 @@ CPU: _bincount_cpu CUDA: _bincount_cuda tags: dynamic_output_shape + autogen: bincount.out - func: bitwise_not(Tensor self) -> Tensor device_check: NoCheck # TensorIterator @@ -1116,10 +1142,12 @@ - func: blackman_window(int window_length, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: blackman_window + autogen: blackman_window.out - func: blackman_window.periodic(int window_length, bool periodic, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: blackman_window + autogen: blackman_window.periodic_out - func: bmm(Tensor self, Tensor mat2) -> Tensor structured_delegate: bmm.out @@ -1140,10 +1168,6 @@ SparseCUDA: bmm_out_sparse_cuda SparseCsrCUDA: bmm_out_sparse_csr_cuda -- func: _NestedTensor_GeneralizedBMM(Tensor self, Tensor mat2) -> Tensor - dispatch: - NestedTensorCPU, NestedTensorCUDA: _NestedTensor_GeneralizedBMM - - func: broadcast_tensors(Tensor[] tensors) -> Tensor[] device_check: NoCheck device_guard: False @@ -1189,6 +1213,7 @@ variants: function dispatch: CompositeExplicitAutograd: block_diag + autogen: block_diag.out - func: ceil(Tensor self) -> Tensor device_check: NoCheck # TensorIterator @@ -1396,6 +1421,7 @@ dispatch: CompositeExplicitAutograd: constant_pad_nd MPS: constant_pad_nd_mps + autogen: constant_pad_nd.out - func: contiguous(Tensor(a) self, *, MemoryFormat memory_format=contiguous_format) -> Tensor(a) variants: method @@ -1404,22 +1430,27 @@ - func: convolution(Tensor input, Tensor weight, Tensor? bias, int[] stride, int[] padding, int[] dilation, bool transposed, int[] output_padding, int groups) -> Tensor dispatch: CompositeExplicitAutograd: convolution + autogen: convolution.out - func: convolution_backward(Tensor grad_output, Tensor input, Tensor weight, int[]? 
bias_sizes, int[] stride, int[] padding, int[] dilation, bool transposed, int[] output_padding, int groups, bool[3] output_mask) -> (Tensor, Tensor, Tensor) dispatch: CompositeExplicitAutograd, CUDA: convolution_backward + autogen: convolution_backward.out - func: convolution_overrideable(Tensor input, Tensor weight, Tensor? bias, int[] stride, int[] padding, int[] dilation, bool transposed, int[] output_padding, int groups) -> Tensor dispatch: CompositeExplicitAutograd: convolution_overrideable + autogen: convolution_overrideable.out - func: convolution_backward_overrideable(Tensor grad_output, Tensor input, Tensor weight, int[] stride, int[] padding, int[] dilation, bool transposed, int[] output_padding, int groups, bool[3] output_mask) -> (Tensor grad_input, Tensor grad_weight, Tensor grad_bias) dispatch: CompositeExplicitAutograd: convolution_backward_overrideable + autogen: convolution_backward_overrideable.out - func: _convolution(Tensor input, Tensor weight, Tensor? bias, int[] stride, int[] padding, int[] dilation, bool transposed, int[] output_padding, int groups, bool benchmark, bool deterministic, bool cudnn_enabled, bool allow_tf32) -> Tensor dispatch: CompositeExplicitAutograd: _convolution + autogen: _convolution.out - func: _convolution.deprecated(Tensor input, Tensor weight, Tensor? bias, int[] stride, int[] padding, int[] dilation, bool transposed, int[] output_padding, int groups, bool benchmark, bool deterministic, bool cudnn_enabled) -> Tensor @@ -1445,6 +1476,7 @@ - func: conv_tbc(Tensor self, Tensor weight, Tensor bias, int pad=0) -> Tensor dispatch: CompositeExplicitAutograd: conv_tbc + autogen: conv_tbc.out - func: conv_tbc_backward(Tensor self, Tensor input, Tensor weight, Tensor bias, int pad) -> (Tensor, Tensor, Tensor) @@ -1472,12 +1504,14 @@ - func: _copy_from(Tensor self, Tensor dst, bool non_blocking=False) -> Tensor dispatch: MPS: _copy_from_mps + autogen: _copy_from.out # We need this to be able to properly copy from a CPU to an XLA tensor with different sizes. # See https://github.com/pytorch/xla/issues/2881 - func: _copy_from_and_resize(Tensor self, Tensor dst) -> Tensor dispatch: MPS: _copy_from_and_resize_mps + autogen: _copy_from_and_resize.out - func: cos(Tensor self) -> Tensor device_check: NoCheck # TensorIterator @@ -1523,11 +1557,13 @@ CPU: count_nonzero_cpu CUDA: count_nonzero_cuda MPS: count_nonzero_mps + autogen: count_nonzero.dim_IntList_out - func: count_nonzero(Tensor self, int? dim=None) -> Tensor variants: function, method dispatch: CompositeExplicitAutograd: count_nonzero + autogen: count_nonzero.out - func: cov(Tensor self, *, int correction=1, Tensor? fweights=None, Tensor? aweights=None) -> Tensor variants: function, method @@ -1538,53 +1574,65 @@ - func: cudnn_affine_grid_generator(Tensor theta, int N, int C, int H, int W) -> Tensor grid dispatch: CUDA: cudnn_affine_grid_generator_forward + autogen: cudnn_affine_grid_generator.out # TODO: Why do I have to call this grad?! - func: cudnn_affine_grid_generator_backward(Tensor grad, int N, int C, int H, int W) -> Tensor grad_theta dispatch: CUDA: cudnn_affine_grid_generator_backward + autogen: cudnn_affine_grid_generator_backward.out - func: cudnn_batch_norm(Tensor input, Tensor weight, Tensor? bias, Tensor? running_mean, Tensor? 
running_var, bool training, float exponential_average_factor, float epsilon) -> (Tensor, Tensor, Tensor, Tensor) dispatch: CUDA: cudnn_batch_norm + autogen: cudnn_batch_norm.out # NB: You can only use this if you used cudnn_batch_norm training=True - func: cudnn_batch_norm_backward(Tensor input, Tensor grad_output, Tensor weight, Tensor? running_mean, Tensor? running_var, Tensor? save_mean, Tensor? save_var, float epsilon, Tensor reserveSpace) -> (Tensor, Tensor, Tensor) dispatch: CUDA: cudnn_batch_norm_backward + autogen: cudnn_batch_norm_backward.out - func: cudnn_convolution(Tensor self, Tensor weight, int[] padding, int[] stride, int[] dilation, int groups, bool benchmark, bool deterministic, bool allow_tf32) -> Tensor dispatch: CUDA: cudnn_convolution + autogen: cudnn_convolution.out - func: cudnn_convolution_transpose(Tensor self, Tensor weight, int[] padding, int[] output_padding, int[] stride, int[] dilation, int groups, bool benchmark, bool deterministic, bool allow_tf32) -> Tensor dispatch: CUDA: cudnn_convolution_transpose + autogen: cudnn_convolution_transpose.out - func: _mps_convolution_transpose(Tensor self, Tensor weight, int[] padding, int[] output_padding, int[] stride, int[] dilation, int groups) -> Tensor dispatch: MPS: _mps_convolution_transpose + autogen: _mps_convolution_transpose.out - func: mps_convolution_transpose_backward(Tensor self, Tensor grad_output, Tensor weight, int[] padding, int[] output_padding, int[] stride, int[] dilation, int groups, bool[2] output_mask) -> (Tensor, Tensor) dispatch: MPS: mps_convolution_transpose_backward + autogen: mps_convolution_transpose_backward.out - func: cudnn_convolution_relu(Tensor self, Tensor weight, Tensor? bias, int[] stride, int[] padding, int[] dilation, int groups) -> Tensor dispatch: CUDA: cudnn_convolution_relu + autogen: cudnn_convolution_relu.out - func: cudnn_convolution_add_relu(Tensor self, Tensor weight, Tensor z, Scalar? alpha, Tensor? bias, int[] stride, int[] padding, int[] dilation, int groups) -> Tensor dispatch: CUDA: cudnn_convolution_add_relu + autogen: cudnn_convolution_add_relu.out # NB: input is special cased in a way I don't quite understand - func: cudnn_grid_sampler(Tensor self, Tensor grid) -> Tensor output dispatch: CUDA: cudnn_grid_sampler_forward + autogen: cudnn_grid_sampler.out - func: cudnn_grid_sampler_backward(Tensor self, Tensor grid, Tensor grad_output) -> (Tensor grad_self, Tensor grad_grid) dispatch: CUDA: cudnn_grid_sampler_backward + autogen: cudnn_grid_sampler_backward.out - func: cummax(Tensor self, int dim) -> (Tensor values, Tensor indices) device_check: NoCheck # TensorIterator @@ -1707,16 +1755,19 @@ dispatch: CPU: ctc_loss_cpu CUDA: ctc_loss_gpu + autogen: _ctc_loss.out - func: _ctc_loss_backward(Tensor grad, Tensor log_probs, Tensor targets, int[] input_lengths, int[] target_lengths, Tensor neg_log_likelihood, Tensor log_alpha, int blank, bool zero_infinity=False) -> Tensor dispatch: CPU: ctc_loss_backward_cpu CUDA: ctc_loss_backward_gpu + autogen: _ctc_loss_backward.out - func: diag_embed(Tensor self, int offset=0, int dim1=-2, int dim2=-1) -> Tensor variants: function, method dispatch: CompositeExplicitAutograd: diag_embed + autogen: diag_embed.out - func: diagflat(Tensor self, int offset=0) -> Tensor variants: function, method @@ -1739,6 +1790,7 @@ device_guard: False dispatch: CompositeExplicitAutograd: diagonal_backward + autogen: diagonal_backward.out - func: fill_diagonal_(Tensor(a!) self, Scalar fill_value, bool wrap=False) -> Tensor(a!) 
variants: method @@ -1918,6 +1970,7 @@ dispatch: CompositeExplicitAutograd: embedding NestedTensorCPU, NestedTensorCUDA: NestedTensor_embedding + autogen: embedding.out - func: embedding_backward(Tensor grad, Tensor indices, int num_weights, int padding_idx, bool scale_grad_by_freq, bool sparse) -> Tensor @@ -1926,6 +1979,7 @@ CPU: embedding_dense_backward_cpu CUDA: embedding_dense_backward_cuda MPS: embedding_dense_backward_mps + autogen: embedding_dense_backward.out - func: embedding_renorm_(Tensor(a!) self, Tensor indices, float max_norm, float norm_type) -> Tensor(a!) dispatch: @@ -1949,6 +2003,7 @@ dispatch: CPU: _embedding_bag_forward_only_cpu CUDA: _embedding_bag_forward_only_cuda + autogen: _embedding_bag_forward_only.out - func: _rowwise_prune(Tensor weight, Tensor mask, ScalarType compressed_indices_dtype) -> (Tensor, Tensor) @@ -1969,6 +2024,7 @@ dispatch: CPU: _embedding_bag_cpu CUDA: _embedding_bag_cuda + autogen: _embedding_bag.out - func: _embedding_bag_backward(Tensor grad, Tensor indices, Tensor offsets, Tensor offset2bag, Tensor bag_size, Tensor maximum_indices, int num_weights, bool scale_grad_by_freq, int mode, bool sparse, Tensor? per_sample_weights, int padding_idx=-1) -> Tensor @@ -1978,17 +2034,20 @@ dispatch: CPU: _embedding_bag_dense_backward_cpu CUDA: _embedding_bag_dense_backward_cuda + autogen: _embedding_bag_dense_backward.out - func: _embedding_bag_per_sample_weights_backward(Tensor grad, Tensor weight, Tensor indices, Tensor offsets, Tensor offset2bag, int mode, int padding_idx=-1) -> Tensor dispatch: CPU: _embedding_bag_per_sample_weights_backward_cpu CUDA: _embedding_bag_per_sample_weights_backward_cuda + autogen: _embedding_bag_per_sample_weights_backward.out - func: empty.names(int[] size, *, Dimname[]? names, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None, MemoryFormat? memory_format=None) -> Tensor device_check: NoCheck device_guard: False dispatch: CompositeExplicitAutograd: empty + autogen: empty.names_out - func: empty.memory_format(int[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None, MemoryFormat? memory_format=None) -> Tensor dispatch: @@ -2018,6 +2077,7 @@ SparseCPU, SparseCUDA, SparseMeta: empty_symint_sparse SparseCsrCPU, SparseCsrCUDA: empty_symint_sparse_compressed QuantizedCPU, QuantizedCUDA: empty_symint_unknown_quantized + autogen: empty.SymInt_out # We do not make new_empty a composite that calls into new_empty_strided, as the strided version # is significantly more difficult to implement by different backends @@ -2025,16 +2085,19 @@ variants: method dispatch: CompositeExplicitAutograd: new_empty + autogen: new_empty.out - func: new_empty.SymInt(Tensor self, SymInt[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor variants: method dispatch: CompositeExplicitAutograd: new_empty_symint + autogen: new_empty.SymInt_out - func: new_empty_strided(Tensor self, int[] size, int[] stride, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor variants: method dispatch: CompositeExplicitAutogradNonFunctional: new_empty_strided + autogen: new_empty_strided.out - func: new_full(Tensor self, int[] size, Scalar fill_value, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? 
pin_memory=None) -> Tensor variants: method @@ -2042,6 +2105,7 @@ # NB: Although this composite mutates on the inside, it is # non-differentiable so NonFunctional doesn't apply CompositeExplicitAutograd: new_full + autogen: new_full.out - func: new_zeros(Tensor self, int[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor variants: method @@ -2049,6 +2113,7 @@ # NB: Although this composite mutates on the inside, it is # non-differentiable so NonFunctional doesn't apply CompositeExplicitAutograd: new_zeros + autogen: new_zeros.out - func: new_ones(Tensor self, int[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor variants: method @@ -2056,12 +2121,14 @@ # NB: Although this composite mutates on the inside, it is # non-differentiable so NonFunctional doesn't apply CompositeExplicitAutograd: new_ones + autogen: new_ones.out # other overrides are to provide a more helpful error message that dtype is required - func: _empty_affine_quantized(int[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None, float scale=1, int zero_point=0, MemoryFormat? memory_format=contiguous_format) -> Tensor dispatch: CPU: empty_affine_quantized_other_backends_stub QuantizedCPU, QuantizedCUDA: empty_affine_quantized + autogen: _empty_affine_quantized.out # it's a factory function receiving a tensor argument, thus overriding explicitly # other overrides are to provide a more helpful error message that dtype is required @@ -2070,12 +2137,14 @@ dispatch: CPU: empty_per_channel_affine_quantized_other_backends_stub QuantizedCPU, QuantizedCUDA: empty_per_channel_affine_quantized + autogen: _empty_per_channel_affine_quantized.out - func: resize_(Tensor(a!) self, int[] size, *, MemoryFormat? memory_format=None) -> Tensor(a!) use_const_ref_for_mutable_tensors: True variants: method device_check: NoCheck device_guard: False + tags: inplace_view dispatch: CPU, Meta: resize_ CUDA: resize_cuda_ @@ -2099,6 +2168,7 @@ variants: function dispatch: QuantizedCPU, QuantizedCUDA: empty_quantized + autogen: empty_quantized.out - func: empty.out(int[] size, *, MemoryFormat? memory_format=None, Tensor(a!) out) -> Tensor(a!) device_check: NoCheck @@ -2112,6 +2182,7 @@ QuantizedCPU, QuantizedCUDA: empty_like_quantized SparseCPU, SparseCUDA, SparseMeta: empty_like_sparse_coo SparseCsrCPU, SparseCsrCUDA: empty_like_sparse_csr + autogen: empty_like.out - func: empty_strided(int[] size, int[] stride, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: @@ -2120,6 +2191,7 @@ MPS: empty_strided_mps Meta: empty_strided_meta QuantizedCPU, QuantizedCUDA: empty_strided_unknown_quantized + autogen: empty_strided.out - func: erf(Tensor self) -> Tensor device_check: NoCheck # TensorIterator @@ -2387,6 +2459,7 @@ device_guard: False dispatch: CompositeExplicitAutograd: full + autogen: full.names_out - func: full(int[] size, Scalar fill_value, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: @@ -2401,10 +2474,12 @@ # NB: Although this composite mutates on the inside, it is # non-differentiable so NonFunctional doesn't apply CompositeExplicitAutograd: full_like + autogen: full_like.out - func: from_file(str filename, bool? shared=None, int? size=0, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? 
pin_memory=None) -> Tensor dispatch: CPU: from_file + autogen: from_file.out - func: gcd.out(Tensor self, Tensor other, *, Tensor(a!) out) -> Tensor(a!) structured: True @@ -2457,6 +2532,7 @@ dispatch: CPU, QuantizedCPU: grid_sampler_2d_cpu CUDA: grid_sampler_2d_cuda + autogen: grid_sampler_2d.out # `grid_sampler_2d_backward` takes in `output_mask` to optimize performance for # the case where `input` doesn't require gradient. Gradient for `grid` is always @@ -2465,11 +2541,13 @@ dispatch: CPU: grid_sampler_2d_backward_cpu CUDA: grid_sampler_2d_backward_cuda + autogen: grid_sampler_2d_backward.out # See NOTE [ grid_sample CPU fallback ] - func: _grid_sampler_2d_cpu_fallback(Tensor input, Tensor grid, int interpolation_mode, int padding_mode, bool align_corners) -> Tensor dispatch: CompositeExplicitAutograd: _grid_sampler_2d_cpu_fallback + autogen: _grid_sampler_2d_cpu_fallback.out - func: _grid_sampler_2d_cpu_fallback_backward(Tensor grad_output, Tensor input, Tensor grid, int interpolation_mode, int padding_mode, bool align_corners) -> (Tensor, Tensor) @@ -2477,6 +2555,7 @@ dispatch: CPU: grid_sampler_3d_cpu CUDA: grid_sampler_3d_cuda + autogen: grid_sampler_3d.out # `grid_sampler_3d_backward` takes in `output_mask` to optimize performance for # the case where `input` doesn't require gradient. Gradient for `grid` is always @@ -2485,42 +2564,52 @@ dispatch: CPU: grid_sampler_3d_backward_cpu CUDA: grid_sampler_3d_backward_cuda + autogen: grid_sampler_3d_backward.out - func: hann_window(int window_length, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: hann_window + autogen: hann_window.out - func: hann_window.periodic(int window_length, bool periodic, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: hann_window + autogen: hann_window.periodic_out - func: hamming_window(int window_length, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: hamming_window + autogen: hamming_window.out - func: hamming_window.periodic(int window_length, bool periodic, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: hamming_window + autogen: hamming_window.periodic_out - func: hamming_window.periodic_alpha(int window_length, bool periodic, float alpha, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: hamming_window + autogen: hamming_window.periodic_alpha_out - func: hamming_window.periodic_alpha_beta(int window_length, bool periodic, float alpha, float beta, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: hamming_window + autogen: hamming_window.periodic_alpha_beta_out - func: kaiser_window(int window_length, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: kaiser_window + autogen: kaiser_window.out - func: kaiser_window.periodic(int window_length, bool periodic, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? 
pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: kaiser_window + autogen: kaiser_window.periodic_out - func: kaiser_window.beta(int window_length, bool periodic, float beta, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: kaiser_window + autogen: kaiser_window.beta_out - func: hinge_embedding_loss(Tensor self, Tensor target, float margin=1.0, int reduction=Mean) -> Tensor @@ -2530,10 +2619,12 @@ dispatch: CPU, CUDA: native_group_norm CompositeExplicitAutograd: math_group_norm + autogen: native_group_norm.out - func: native_group_norm_backward(Tensor grad_out, Tensor input, Tensor mean, Tensor rstd, Tensor? weight, int N, int C, int HxW, int group, bool[3] output_mask) -> (Tensor, Tensor, Tensor) dispatch: CPU, CUDA: native_group_norm_backward + autogen: native_group_norm_backward.out # Real to complex forward FFT - func: _fft_r2c(Tensor self, int[] dim, int normalization, bool onesided) -> Tensor @@ -2608,7 +2699,7 @@ precomputed: - indices -> DimVector sizes, DimVector strides dispatch: - CPU, CUDA: index_out + CPU, CUDA, MPS: index_out - func: index_copy.out(Tensor self, int dim, Tensor index, Tensor source, *, Tensor(a!) out) -> Tensor(a!) structured: True @@ -2661,15 +2752,6 @@ - func: instance_norm(Tensor input, Tensor? weight, Tensor? bias, Tensor? running_mean, Tensor? running_var, bool use_input_stats, float momentum, float eps, bool cudnn_enabled) -> Tensor variants: function -- func: inverse(Tensor self) -> Tensor - variants: function, method - dispatch: - CompositeExplicitAutograd: inverse - -- func: inverse.out(Tensor self, *, Tensor(a!) out) -> Tensor(a!) - dispatch: - CompositeExplicitAutograd: inverse_out - - func: isclose(Tensor self, Tensor other, float rtol=1e-05, float atol=1e-08, bool equal_nan=False) -> Tensor variants: function, method @@ -2711,6 +2793,7 @@ CPU, CUDA, MPS: isnan SparseCPU, SparseCUDA: isnan_sparse SparseCsrCPU, SparseCsrCUDA: isnan_sparse_csr + autogen: isnan.out - func: is_distributed(Tensor self) -> bool variants: function, method @@ -2802,12 +2885,14 @@ CUDA: layer_norm_cuda MPS: layer_norm_mps CompositeExplicitAutograd: math_native_layer_norm + autogen: native_layer_norm.out - func: native_layer_norm_backward(Tensor grad_out, Tensor input, int[] normalized_shape, Tensor mean, Tensor rstd, Tensor? weight, Tensor? bias, bool[3] output_mask) -> (Tensor, Tensor, Tensor) dispatch: CPU: layer_norm_backward_cpu CUDA: layer_norm_backward_cuda MPS: layer_norm_backward_mps + autogen: native_layer_norm_backward.out - func: nan_to_num(Tensor self, float? nan=None, float? posinf=None, float? neginf=None) -> Tensor variants: function, method @@ -2831,52 +2916,39 @@ dispatch: CompositeImplicitAutograd: linear NestedTensorCPU, NestedTensorCUDA: nested_linear + MPS: _mps_linear - func: linear_backward(Tensor self, Tensor grad_output, Tensor weight, bool[3] output_mask) -> (Tensor, Tensor, Tensor) dispatch: NestedTensorCPU, NestedTensorCUDA: nested_linear_backward + MPS: mps_linear_backward + autogen: linear_backward.out - func: linear.out(Tensor input, Tensor weight, Tensor? bias=None, *, Tensor(a!) out) -> Tensor(a!) python_module: nn dispatch: CompositeExplicitAutograd: linear_out -# TODO: Add this function to MPS dispatch key so that we avoid declaring it in -# native_functions.yaml -# https://github.com/pytorch/pytorch/issues/77394 -- func: _mps_linear(Tensor self, Tensor weight, Tensor? 
bias=None) -> Tensor - python_module: nn - dispatch: - MPS: _mps_linear - - func: mkldnn_linear(Tensor self, Tensor weight, Tensor? bias=None) -> Tensor python_module: nn dispatch: MkldnnCPU: mkldnn_linear + autogen: mkldnn_linear.out - func: mkldnn_linear_backward_input(int[] input_size, Tensor grad_output, Tensor weight) -> Tensor dispatch: MkldnnCPU: mkldnn_linear_backward_input + autogen: mkldnn_linear_backward_input.out - func: mkldnn_linear_backward_weights(Tensor grad_output, Tensor input, Tensor weight, bool bias_defined) -> (Tensor, Tensor) dispatch: MkldnnCPU: mkldnn_linear_backward_weights + autogen: mkldnn_linear_backward_weights.out - func: mkldnn_linear_backward(Tensor self, Tensor grad_output, Tensor weight, bool[3] output_mask) -> (Tensor, Tensor, Tensor) dispatch: MkldnnCPU: mkldnn_linear_backward - -- func: _mps_linear_backward_input(int[] input_size, Tensor grad_output, Tensor weight) -> Tensor - dispatch: - MPS: _mps_linear_backward_input - -- func: _mps_linear_backward_weights(Tensor grad_output, Tensor input, Tensor weight, bool bias_defined) -> (Tensor, Tensor) - dispatch: - MPS: _mps_linear_backward_weights - -- func: mps_linear_backward(Tensor self, Tensor grad_output, Tensor weight, bool[3] output_mask) -> (Tensor, Tensor, Tensor) - dispatch: - MPS: mps_linear_backward + autogen: mkldnn_linear_backward.out - func: fbgemm_linear_int8_weight_fp32_activation(Tensor input, Tensor weight, Tensor packed, Tensor col_offsets, Scalar weight_scale, Scalar weight_zero_point, Tensor bias) -> Tensor @@ -3152,8 +3224,19 @@ - func: matmul(Tensor self, Tensor other) -> Tensor variants: function, method + dispatch: + CompositeImplicitAutograd: matmul + NestedTensorCPU, NestedTensorCUDA: matmul_nested + +- func: matmul_backward(Tensor grad, Tensor self, Tensor other, bool[2] mask) -> (Tensor, Tensor) + dispatch: + NestedTensorCPU, NestedTensorCUDA: matmul_backward_nested + autogen: matmul_backward.out - func: matmul.out(Tensor self, Tensor other, *, Tensor(a!) out) -> Tensor(a!) + dispatch: + CompositeImplicitAutograd: matmul_out + NestedTensorCPU, NestedTensorCUDA: matmul_out_nested - func: matrix_rank.tol(Tensor self, float tol, bool symmetric=False) -> Tensor @@ -3177,11 +3260,13 @@ - func: _aminmax(Tensor self) -> (Tensor, Tensor) dispatch: CPU, CUDA: _aminmax_all + autogen: _aminmax.out # DEPRECATED: Use torch.aminmax instead - func: _aminmax.dim(Tensor self, int dim, bool keepdim=False) -> (Tensor, Tensor) dispatch: CPU, CUDA: _aminmax + autogen: _aminmax.dim_out - func: aminmax(Tensor self, *, int? 
dim=None, bool keepdim=False) -> (Tensor min, Tensor max) device_check: NoCheck # TensorIterator @@ -3253,35 +3338,43 @@ - func: _mps_max_pool2d(Tensor self, int[2] kernel_size, int[2] stride=[], int[2] padding=0, int[2] dilation=1, bool ceil_mode=False) -> Tensor dispatch: MPS: _mps_max_pool2d + autogen: _mps_max_pool2d.out - func: mps_max_pool2d_backward(Tensor grad_output, Tensor self, int[2] kernel_size, int[2] stride=[], int[2] padding=0, int[2] dilation=1, bool ceil_mode=False) -> Tensor dispatch: MPS: mps_max_pool2d_backward + autogen: mps_max_pool2d_backward.out - func: mkldnn_max_pool2d(Tensor self, int[2] kernel_size, int[2] stride=[], int[2] padding=0, int[2] dilation=1, bool ceil_mode=False) -> Tensor dispatch: MkldnnCPU: mkldnn_max_pool2d + autogen: mkldnn_max_pool2d.out - func: mkldnn_max_pool2d_backward(Tensor grad_output, Tensor output, Tensor input, int[2] kernel_size, int[2] stride=[], int[2] padding=0, int[2] dilation=1, bool ceil_mode=False) -> Tensor dispatch: MkldnnCPU: mkldnn_max_pool2d_backward + autogen: mkldnn_max_pool2d_backward.out - func: mkldnn_max_pool3d(Tensor self, int[3] kernel_size, int[3] stride=[], int[3] padding=0, int[3] dilation=1, bool ceil_mode=False) -> Tensor dispatch: MkldnnCPU: mkldnn_max_pool3d + autogen: mkldnn_max_pool3d.out - func: mkldnn_max_pool3d_backward(Tensor grad_output, Tensor output, Tensor input, int[3] kernel_size, int[3] stride=[], int[3] padding=0, int[3] dilation=1, bool ceil_mode=False) -> Tensor dispatch: MkldnnCPU: mkldnn_max_pool3d_backward + autogen: mkldnn_max_pool3d_backward.out - func: quantized_max_pool1d(Tensor self, int[1] kernel_size, int[1] stride=[], int[1] padding=0, int[1] dilation=1, bool ceil_mode=False) -> Tensor dispatch: QuantizedCPU: quantized_max_pool1d + autogen: quantized_max_pool1d.out - func: quantized_max_pool2d(Tensor self, int[2] kernel_size, int[2] stride=[], int[2] padding=0, int[2] dilation=1, bool ceil_mode=False) -> Tensor dispatch: QuantizedCPU: quantized_max_pool2d QuantizedCUDA: quantized_max_pool2d_cudnn + autogen: quantized_max_pool2d.out - func: max_pool3d(Tensor self, int[3] kernel_size, int[3] stride=[], int[3] padding=0, int[3] dilation=1, bool ceil_mode=False) -> Tensor @@ -3293,6 +3386,13 @@ dispatch: CompositeExplicitAutograd: mean +# For normal naming convention this should be `mean.out`. However since we already have `mean.out` we have to rename this. +# FIXME: fix CI jobs and re-enable this +#- func: mean.dtype_out(Tensor self, *, ScalarType? dtype=None, Tensor(a!) out) -> Tensor(a!) +# device_check: NoCheck # TensorIterator +# dispatch: +# CompositeExplicitAutograd: mean_dtype_out + - func: mean.dim(Tensor self, int[1]? dim, bool keepdim=False, *, ScalarType? dtype=None) -> Tensor structured_delegate: mean.out device_check: NoCheck # TensorIterator @@ -3315,11 +3415,11 @@ - func: mean.names_out(Tensor self, Dimname[1] dim, bool keepdim=False, *, ScalarType? dtype=None, Tensor(a!) out) -> Tensor(a!) device_check: NoCheck # TensorIterator -- func: nanmean(Tensor self, int[1] dim=[], bool keepdim=False, *, ScalarType? dtype=None) -> Tensor +- func: nanmean(Tensor self, int[1]? dim=None, bool keepdim=False, *, ScalarType? dtype=None) -> Tensor device_check: NoCheck # Composite variants: function, method -- func: nanmean.out(Tensor self, int[1] dim=[], bool keepdim=False, *, ScalarType? dtype=None, Tensor(a!) out) -> Tensor(a!) +- func: nanmean.out(Tensor self, int[1]? dim=None, bool keepdim=False, *, ScalarType? dtype=None, Tensor(a!) out) -> Tensor(a!) 
device_check: NoCheck # Composite - func: median(Tensor self) -> Tensor @@ -3327,6 +3427,7 @@ dispatch: CPU: median_cpu CUDA: median_cuda + autogen: median.out - func: median.dim(Tensor self, int dim, bool keepdim=False) -> (Tensor values, Tensor indices) variants: function, method @@ -3348,6 +3449,7 @@ dispatch: CPU: nanmedian_cpu CUDA: nanmedian_cuda + autogen: nanmedian.out - func: nanmedian.dim(Tensor self, int dim, bool keepdim=False) -> (Tensor values, Tensor indices) variants: function, method @@ -3403,42 +3505,60 @@ - func: _mps_convolution(Tensor self, Tensor weight, Tensor? bias, int[] padding, int[] stride, int[] dilation, int groups) -> Tensor dispatch: MPS: _mps_convolution + autogen: _mps_convolution.out - func: mps_convolution_backward(Tensor self, Tensor grad_output, Tensor weight, int[] padding, int[] stride, int[] dilation, int groups, bool[3] output_mask) -> (Tensor, Tensor, Tensor) dispatch: MPS: mps_convolution_backward + autogen: mps_convolution_backward.out - func: mkldnn_convolution(Tensor self, Tensor weight, Tensor? bias, int[] padding, int[] stride, int[] dilation, int groups) -> Tensor dispatch: CompositeExplicitAutograd: mkldnn_convolution + autogen: mkldnn_convolution.out - func: miopen_batch_norm(Tensor input, Tensor weight, Tensor? bias, Tensor? running_mean, Tensor? running_var, bool training, float exponential_average_factor, float epsilon) -> (Tensor, Tensor, Tensor) dispatch: CUDA: miopen_batch_norm + autogen: miopen_batch_norm.out - func: miopen_batch_norm_backward(Tensor input, Tensor grad_output, Tensor weight, Tensor? running_mean, Tensor? running_var, Tensor? save_mean, Tensor? save_var, float epsilon) -> (Tensor, Tensor, Tensor) dispatch: CUDA: miopen_batch_norm_backward + autogen: miopen_batch_norm_backward.out - func: miopen_convolution(Tensor self, Tensor weight, Tensor? bias, int[] padding, int[] stride, int[] dilation, int groups, bool benchmark, bool deterministic) -> Tensor dispatch: CUDA: miopen_convolution + autogen: miopen_convolution.out - func: miopen_convolution_transpose(Tensor self, Tensor weight, Tensor? bias, int[] padding, int[] output_padding, int[] stride, int[] dilation, int groups, bool benchmark, bool deterministic) -> Tensor dispatch: CUDA: miopen_convolution_transpose + autogen: miopen_convolution_transpose.out - func: miopen_depthwise_convolution(Tensor self, Tensor weight, Tensor? bias, int[] padding, int[] stride, int[] dilation, int groups, bool benchmark, bool deterministic) -> Tensor dispatch: CUDA: miopen_depthwise_convolution + autogen: miopen_depthwise_convolution.out + +- func: miopen_convolution_relu(Tensor self, Tensor weight, Tensor? bias, int[] stride, int[] padding, int[] dilation, int groups) -> Tensor + dispatch: + CUDA: miopen_convolution_relu + +- func: miopen_convolution_add_relu(Tensor self, Tensor weight, Tensor z, Scalar? alpha, Tensor? bias, int[] stride, int[] padding, int[] dilation, int groups) -> Tensor + dispatch: + CUDA: miopen_convolution_add_relu - func: miopen_rnn(Tensor input, Tensor[] weight, int weight_stride0, Tensor hx, Tensor? cx, int mode, int hidden_size, int num_layers, bool batch_first, float dropout, bool train, bool bidirectional, int[] batch_sizes, Tensor? dropout_state) -> (Tensor, Tensor, Tensor, Tensor, Tensor) dispatch: CUDA: miopen_rnn + autogen: miopen_rnn.out - func: miopen_rnn_backward(Tensor input, Tensor[] weight, int weight_stride0, Tensor weight_buf, Tensor hx, Tensor? cx, Tensor output, Tensor? grad_output, Tensor? grad_hy, Tensor? 
grad_cy, int mode, int hidden_size, int num_layers, bool batch_first, float dropout, bool train, bool bidirectional, int[] batch_sizes, Tensor? dropout_state, Tensor reserve, bool[4] output_mask) -> (Tensor, Tensor, Tensor, Tensor[]) dispatch: CUDA: miopen_rnn_backward + autogen: miopen_rnn_backward.out - func: mm(Tensor self, Tensor mat2) -> Tensor structured_delegate: mm.out @@ -3463,11 +3583,13 @@ dispatch: SparseCPU: sparse_sparse_matmul_cpu SparseCUDA: sparse_sparse_matmul_cuda + autogen: _sparse_sparse_matmul.out - func: _sparse_mask_helper(Tensor t, Tensor mask_indices) -> Tensor dispatch: SparseCPU: sparse_mask_helper_cpu SparseCUDA: sparse_mask_helper_cuda + autogen: _sparse_mask_helper.out - func: mode(Tensor self, int dim=-1, bool keepdim=False) -> (Tensor values, Tensor indices) variants: function, method @@ -3587,6 +3709,7 @@ variants: function, method dispatch: CompositeExplicitAutograd: narrow_copy_symint + autogen: narrow_copy.SymInt_out - func: narrow_copy.out(Tensor self, int dim, int start, int length, *, Tensor(a!) out) -> Tensor(a!) dispatch: @@ -3617,6 +3740,7 @@ - func: batch_norm_stats(Tensor input, float eps) -> (Tensor, Tensor) dispatch: CUDA: batch_norm_stats_cuda + autogen: batch_norm_stats.out - func: batch_norm_elemt(Tensor input, Tensor? weight, Tensor? bias, Tensor mean, Tensor invstd, float eps) -> Tensor dispatch: @@ -3630,10 +3754,12 @@ - func: batch_norm_gather_stats(Tensor input, Tensor mean, Tensor invstd, Tensor? running_mean, Tensor? running_var, float momentum, float eps, int count) -> (Tensor, Tensor) dispatch: CUDA: batch_norm_gather_stats_cuda + autogen: batch_norm_gather_stats.out - func: batch_norm_gather_stats_with_counts(Tensor input, Tensor mean, Tensor invstd, Tensor? running_mean, Tensor? running_var, float momentum, float eps, Tensor counts) -> (Tensor, Tensor) dispatch: CUDA: batch_norm_gather_stats_with_counts_cuda + autogen: batch_norm_gather_stats_with_counts.out - func: native_batch_norm_backward(Tensor grad_out, Tensor input, Tensor? weight, Tensor? running_mean, Tensor? running_var, Tensor? save_mean, Tensor? save_invstd, bool train, float eps, bool[3] output_mask) -> (Tensor, Tensor, Tensor) dispatch: @@ -3641,19 +3767,23 @@ CUDA: batch_norm_backward_cuda MPS: batch_norm_backward_mps MkldnnCPU: mkldnn_batch_norm_backward + autogen: native_batch_norm_backward.out - func: batch_norm_backward_reduce(Tensor grad_out, Tensor input, Tensor mean, Tensor invstd, Tensor? weight, bool input_g, bool weight_g, bool bias_g) -> (Tensor, Tensor, Tensor, Tensor) dispatch: CUDA: batch_norm_backward_reduce_cuda + autogen: batch_norm_backward_reduce.out - func: batch_norm_backward_elemt(Tensor grad_out, Tensor input, Tensor mean, Tensor invstd, Tensor? weight, Tensor mean_dy, Tensor mean_dy_xmu, Tensor count) -> Tensor dispatch: CUDA: batch_norm_backward_elemt_cuda + autogen: batch_norm_backward_elemt.out - func: batch_norm_update_stats(Tensor input, Tensor? running_mean, Tensor? running_var, float momentum) -> (Tensor, Tensor) dispatch: CPU: batch_norm_update_stats_cpu CUDA: batch_norm_update_stats_cuda + autogen: batch_norm_update_stats.out - func: is_vulkan_available() -> bool @@ -3663,12 +3793,14 @@ variants: function dispatch: CompositeExplicitAutograd: _nnpack_spatial_convolution + autogen: _nnpack_spatial_convolution.out - func: ones.names(int[] size, *, Dimname[]? names, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? 
pin_memory=None) -> Tensor device_check: NoCheck device_guard: False dispatch: CompositeExplicitAutograd: ones + autogen: ones.names_out - func: ones(int[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: @@ -3683,6 +3815,7 @@ # NB: Although this composite mutates on the inside, it is # non-differentiable so NonFunctional doesn't apply CompositeExplicitAutograd: ones_like + autogen: ones_like.out - func: pairwise_distance(Tensor x1, Tensor x2, float p=2, float eps=1e-06, bool keepdim=False) -> Tensor @@ -3691,24 +3824,29 @@ - func: _euclidean_dist(Tensor x1, Tensor x2) -> Tensor dispatch: CompositeExplicitAutograd: _euclidean_dist + autogen: _euclidean_dist.out - func: _cdist_forward(Tensor x1, Tensor x2, float p, int? compute_mode) -> Tensor dispatch: CPU, CUDA: _cdist_forward + autogen: _cdist_forward.out - func: _cdist_backward(Tensor grad, Tensor x1, Tensor x2, float p, Tensor cdist) -> Tensor dispatch: CPU, CUDA: _cdist_backward + autogen: _cdist_backward.out - func: pdist(Tensor self, float p=2) -> Tensor - func: _pdist_forward(Tensor self, float p=2) -> Tensor dispatch: CPU, CUDA: _pdist_forward + autogen: _pdist_forward.out - func: _pdist_backward(Tensor grad, Tensor self, float p, Tensor pdist) -> Tensor dispatch: CPU, CUDA: _pdist_backward + autogen: _pdist_backward.out - func: cosine_similarity(Tensor x1, Tensor x2, int dim=1, float eps=1e-08) -> Tensor variants: function @@ -3762,16 +3900,19 @@ dispatch: CPU: pixel_shuffle_cpu CompositeExplicitAutogradNonFunctional: math_pixel_shuffle + autogen: pixel_shuffle.out - func: pixel_unshuffle(Tensor self, int downscale_factor) -> Tensor dispatch: CPU: pixel_unshuffle_cpu CompositeExplicitAutogradNonFunctional: math_pixel_unshuffle + autogen: pixel_unshuffle.out - func: channel_shuffle(Tensor self, int groups) -> Tensor dispatch: CPU: channel_shuffle QuantizedCPU: channel_shuffle_quantized_cpu + autogen: channel_shuffle.out - func: native_channel_shuffle(Tensor self, int groups) -> Tensor dispatch: @@ -3795,6 +3936,7 @@ dispatch: CUDA: _pin_memory_cuda MPS: _pin_memory_mps + autogen: _pin_memory.out - func: pinverse(Tensor self, float rcond=1e-15) -> Tensor variants: function, method @@ -3836,18 +3978,23 @@ - func: scalar_tensor(Scalar s, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: scalar_tensor + autogen: scalar_tensor.out - func: rand.names(int[] size, *, Dimname[]? names, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor device_check: NoCheck device_guard: False dispatch: CompositeExplicitAutograd: rand + autogen: rand.names_out + tags: nondeterministic_seeded - func: rand.generator_with_names(int[] size, *, Generator? generator, Dimname[]? names, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor device_check: NoCheck device_guard: False + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: rand + autogen: rand.generator_with_names_out - func: rand(int[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor tags: nondeterministic_seeded @@ -3855,14 +4002,17 @@ CompositeExplicitAutograd: rand - func: rand.generator(int[] size, *, Generator? generator, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? 
pin_memory=None) -> Tensor + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: rand - func: rand.out(int[] size, *, Tensor(a!) out) -> Tensor(a!) + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: rand_out - func: rand.generator_out(int[] size, *, Generator? generator, Tensor(a!) out) -> Tensor(a!) + tags: nondeterministic_seeded - func: rand_like(Tensor self, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None, MemoryFormat? memory_format=None) -> Tensor tags: nondeterministic_seeded @@ -3870,6 +4020,7 @@ # NB: Although this composite mutates on the inside, it is # non-differentiable so NonFunctional doesn't apply CompositeExplicitAutograd: rand_like + autogen: rand_like.out - func: randint(int high, int[] size, *, ScalarType? dtype=long, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor tags: nondeterministic_seeded @@ -3877,6 +4028,7 @@ CompositeExplicitAutograd: randint - func: randint.generator(int high, int[] size, *, Generator? generator, ScalarType? dtype=long, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: randint @@ -3886,22 +4038,27 @@ CompositeExplicitAutograd: randint - func: randint.low_generator(int low, int high, int[] size, *, Generator? generator, ScalarType? dtype=long, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: randint - func: randint.out(int high, int[] size, *, Tensor(a!) out) -> Tensor(a!) + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: randint_out - func: randint.generator_out(int high, int[] size, *, Generator? generator, Tensor(a!) out) -> Tensor(a!) + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: randint_out - func: randint.low_out(int low, int high, int[] size, *, Tensor(a!) out) -> Tensor(a!) + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: randint_out - func: randint.low_generator_out(int low, int high, int[] size, *, Generator? generator, Tensor(a!) out) -> Tensor(a!) + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: randint_out @@ -3911,6 +4068,7 @@ # NB: Although this composite mutates on the inside, it is # non-differentiable so NonFunctional doesn't apply CompositeExplicitAutograd: randint_like + autogen: randint_like.out - func: randint_like.low_dtype(Tensor self, int low, int high, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None, MemoryFormat? memory_format=None) -> Tensor tags: nondeterministic_seeded @@ -3918,6 +4076,7 @@ # NB: Although this composite mutates on the inside, it is # non-differentiable so NonFunctional doesn't apply CompositeExplicitAutograd: randint_like + autogen: randint_like.low_dtype_out - func: randn(int[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor tags: nondeterministic_seeded @@ -3925,24 +4084,31 @@ CompositeExplicitAutograd: randn - func: randn.generator(int[] size, *, Generator? generator, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: randn - func: randn.names(int[] size, *, Dimname[]? names, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? 
pin_memory=None) -> Tensor + tags: nondeterministic_seeded device_check: NoCheck device_guard: False dispatch: CompositeExplicitAutograd: randn + autogen: randn.names_out - func: randn.generator_with_names(int[] size, *, Generator? generator, Dimname[]? names, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor + tags: nondeterministic_seeded device_check: NoCheck device_guard: False dispatch: CompositeExplicitAutograd: randn + autogen: randn.generator_with_names_out - func: randn.out(int[] size, *, Tensor(a!) out) -> Tensor(a!) + tags: nondeterministic_seeded - func: randn.generator_out(int[] size, *, Generator? generator, Tensor(a!) out) -> Tensor(a!) + tags: nondeterministic_seeded - func: randn_like(Tensor self, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None, MemoryFormat? memory_format=None) -> Tensor tags: nondeterministic_seeded @@ -3950,6 +4116,7 @@ # NB: Although this composite mutates on the inside, it is # non-differentiable so NonFunctional doesn't apply CompositeExplicitAutograd: randn_like + autogen: randn_like.out - func: randperm(int n, *, ScalarType? dtype=long, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor tags: nondeterministic_seeded @@ -3957,14 +4124,17 @@ CompositeExplicitAutograd: randperm - func: randperm.generator(int n, *, Generator? generator, ScalarType? dtype=long, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: randperm - func: randperm.out(int n, *, Tensor(a!) out) -> Tensor(a!) + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: randperm_out - func: randperm.generator_out(int n, *, Generator? generator, Tensor(a!) out) -> Tensor(a!) + tags: nondeterministic_seeded dispatch: CPU: randperm_out_cpu CUDA: randperm_out_cuda @@ -3977,10 +4147,15 @@ dispatch: CompositeExplicitAutograd: range +- func: range.out_(Scalar start, Scalar end, *, Tensor(a!) out) -> Tensor(a!) + dispatch: + CompositeExplicitAutograd: range_out_no_step + - func: range.out(Scalar start, Scalar end, Scalar step=1, *, Tensor(a!) out) -> Tensor(a!) dispatch: CPU, Meta: range_out CUDA: range_cuda_out + cpp_no_default_args: ['step'] - func: ravel(Tensor(a) self) -> Tensor(a) variants: function, method @@ -4043,6 +4218,7 @@ dispatch: CompositeExplicitAutograd: repeat MPS: repeat_mps + autogen: repeat.out - func: repeat_interleave.Tensor(Tensor repeats, *, int? output_size=None) -> Tensor variants: function @@ -4050,6 +4226,7 @@ CPU: repeat_interleave_cpu CUDA: repeat_interleave_cuda tags: dynamic_output_shape + autogen: repeat_interleave.Tensor_out - func: repeat_interleave.self_Tensor(Tensor self, Tensor repeats, int? dim=None, *, int? output_size=None) -> Tensor variants: function, method @@ -4065,10 +4242,12 @@ - func: _reshape_nested(Tensor self, int[] shape) -> Tensor dispatch: NestedTensorCPU, NestedTensorCUDA: _reshape_nested + autogen: _reshape_nested.out - func: _reshape_nested_backward(Tensor self, Tensor grad) -> Tensor dispatch: NestedTensorCPU, NestedTensorCUDA: _reshape_nested_backward + autogen: _reshape_nested_backward.out # NOTE [ _reshape_alias ] is meant to be used in the implementation of reshape. # They are not user-facing, hence the leading underscore. 
Please don't use it @@ -4086,6 +4265,7 @@ device_guard: False dispatch: MkldnnCPU: mkldnn_reshape + autogen: _mkldnn_reshape.out - func: reshape_as(Tensor(a) self, Tensor other) -> Tensor(a) variants: method @@ -4142,6 +4322,7 @@ tags: nondeterministic_seeded - func: rrelu_(Tensor(a!) self, Scalar lower=0.125, Scalar upper=0.3333333333333333, bool training=False, Generator? generator=None) -> Tensor(a!) + tags: nondeterministic_seeded device_check: NoCheck # TensorIterator - func: relu(Tensor self) -> Tensor @@ -4179,6 +4360,7 @@ CUDA: prelu_cuda MPS: prelu_mps QuantizedCPU: prelu_quantized_cpu + autogen: prelu.out - func: prelu_backward(Tensor grad_output, Tensor self, Tensor weight) -> (Tensor, Tensor) variants: function, method @@ -4187,6 +4369,7 @@ CPU: prelu_backward_cpu CUDA: prelu_backward_cuda MPS: prelu_backward_mps + autogen: prelu_backward.out - func: gelu.out(Tensor self, *, str approximate='none', Tensor(a!) out) -> Tensor(a!) structured: True @@ -4296,6 +4479,7 @@ device_guard: False dispatch: CompositeExplicitAutogradNonFunctional: select_backward + autogen: select_backward.out - func: selu(Tensor self) -> Tensor device_check: NoCheck # TensorIterator @@ -4483,6 +4667,7 @@ variants: function, method dispatch: CompositeExplicitAutograd: detach + NestedTensorCPU, NestedTensorCUDA: detach # Like `detach()`, but modifies this `Variable` in-place. This method may # only be called on non-view `Variable`s. You can use `is_view()` to check @@ -4510,6 +4695,8 @@ device_guard: False dispatch: CompositeExplicitAutograd: slice +# NOTE: The implementation of split_with_sizes bypasses the dispatcher to call this; undo +# that if adding specific implementations here! - func: slice_backward(Tensor grad_output, int[] input_sizes, int dim, int start, int end, int step) -> Tensor variants: function @@ -4517,6 +4704,7 @@ device_guard: False dispatch: CompositeExplicitAutograd: slice_backward + autogen: slice_backward.out - func: slice_scatter(Tensor self, Tensor src, int dim=0, int? start=None, int? end=None, int step=1) -> Tensor variants: function, method @@ -4524,6 +4712,7 @@ device_guard: False dispatch: CompositeExplicitAutograd: slice_scatter + autogen: slice_scatter.out - func: select_scatter(Tensor self, Tensor src, int dim, int index) -> Tensor variants: function, method @@ -4531,6 +4720,7 @@ device_guard: False dispatch: CompositeExplicitAutograd: select_scatter + autogen: select_scatter.out - func: diagonal_scatter(Tensor self, Tensor src, int offset=0, int dim1=0, int dim2=1) -> Tensor variants: function, method @@ -4538,6 +4728,7 @@ device_guard: False dispatch: CompositeExplicitAutograd: diagonal_scatter + autogen: diagonal_scatter.out - func: as_strided_scatter(Tensor self, Tensor src, int[] size, int[] stride, int? storage_offset=None) -> Tensor variants: function, method @@ -4545,6 +4736,7 @@ device_guard: False dispatch: CompositeExplicitAutograd: as_strided_scatter + autogen: as_strided_scatter.out - func: smm(Tensor self, Tensor mat2) -> Tensor variants: function, method @@ -4552,9 +4744,6 @@ # softmax allows positional dtype, unlike most operators, because kwonly is BC-breaking when loading jit models. - func: softmax.int(Tensor self, int dim, ScalarType? dtype=None) -> Tensor variants: function, method - dispatch: - CompositeImplicitAutograd: softmax - NestedTensorCPU, NestedTensorCUDA: softmax - func: softmax.int_out(Tensor self, int dim, ScalarType? dtype=None, *, Tensor(a!) out) -> Tensor(a!) 
variants: function @@ -4579,6 +4768,8 @@ - func: _softmax_backward_data(Tensor grad_output, Tensor output, int dim, ScalarType input_dtype) -> Tensor structured_delegate: _softmax_backward_data.out + dispatch: + NestedTensorCPU, NestedTensorCUDA: nested_softmax_backward - func: _softmax_backward_data.out(Tensor grad_output, Tensor output, int dim, ScalarType input_dtype, *, Tensor(a!) grad_input) -> Tensor(a!) structured: True @@ -4593,6 +4784,7 @@ device_guard: False dispatch: CompositeExplicitAutograd: unsafe_split + autogen: unsafe_split.Tensor_out - func: split.Tensor(Tensor(a -> *) self, int split_size, int dim=0) -> Tensor(a)[] variants: function, method @@ -4611,6 +4803,7 @@ device_guard: False dispatch: CompositeExplicitAutograd: unsafe_split_with_sizes + autogen: unsafe_split_with_sizes.out - func: split_with_sizes(Tensor(a -> *) self, int[] split_sizes, int dim=0) -> Tensor(a)[] variants: function, method @@ -4748,12 +4941,7 @@ dispatch: CompositeExplicitAutograd: sum SparseCsrCPU, SparseCsrCUDA: sum_csr - -- func: sum.SymInt(Tensor self, SymInt[1] dim, bool keepdim=False, *, ScalarType? dtype=None) -> Tensor - device_check: NoCheck # TensorIterator - variants: function, method - dispatch: - CompositeExplicitAutograd: sum_symint + autogen: sum.out - func: sum.dim_IntList(Tensor self, int[1]? dim, bool keepdim=False, *, ScalarType? dtype=None) -> Tensor structured_delegate: sum.IntList_out @@ -4776,12 +4964,17 @@ - func: sum.DimnameList_out(Tensor self, Dimname[1] dim, bool keepdim=False, *, ScalarType? dtype=None, Tensor(a!) out) -> Tensor(a!) device_check: NoCheck # TensorIterator -- func: nansum(Tensor self, int[1] dim=[], bool keepdim=False, *, ScalarType? dtype=None) -> Tensor +# TODO: this function will be replaced once nested expand semantics have been settled on +- func: _nested_sum_backward(Tensor grad, Tensor self, int[1]? dim, bool keepdim=False) -> Tensor + dispatch: + NestedTensorCPU: _nested_sum_backward_cpu + +- func: nansum(Tensor self, int[1]? dim=None, bool keepdim=False, *, ScalarType? dtype=None) -> Tensor variants: function, method dispatch: CPU, CUDA: nansum -- func: nansum.out(Tensor self, int[1] dim=[], bool keepdim=False, *, ScalarType? dtype=None, Tensor(a!) out) -> Tensor(a!) +- func: nansum.out(Tensor self, int[1]? dim=None, bool keepdim=False, *, ScalarType? dtype=None, Tensor(a!) out) -> Tensor(a!) dispatch: CPU, CUDA: nansum_out @@ -4830,7 +5023,7 @@ device_check: NoCheck # TensorIterator variants: function, method -- func: std.dim(Tensor self, int[1] dim, bool unbiased=True, bool keepdim=False) -> Tensor +- func: std.dim(Tensor self, int[1]? dim, bool unbiased=True, bool keepdim=False) -> Tensor device_check: NoCheck # TensorIterator variants: function, method @@ -4846,7 +5039,7 @@ device_check: NoCheck # TensorIterator variants: function -- func: std_mean.dim(Tensor self, int[1] dim, bool unbiased=True, bool keepdim=False) -> (Tensor, Tensor) +- func: std_mean.dim(Tensor self, int[1]? dim, bool unbiased=True, bool keepdim=False) -> (Tensor, Tensor) device_check: NoCheck # TensorIterator variants: function @@ -4855,6 +5048,7 @@ variants: function dispatch: CPU, CUDA: std_mean + autogen: std_mean.correction_out - func: std_mean.names_dim(Tensor self, Dimname[1] dim, bool unbiased=True, bool keepdim=False) -> (Tensor, Tensor) device_check: NoCheck # TensorIterator @@ -4864,7 +5058,7 @@ device_check: NoCheck # TensorIterator variants: function -- func: std.out(Tensor self, int[1] dim, bool unbiased=True, bool keepdim=False, *, Tensor(a!) 
out) -> Tensor(a!) +- func: std.out(Tensor self, int[1]? dim, bool unbiased=True, bool keepdim=False, *, Tensor(a!) out) -> Tensor(a!) device_check: NoCheck # TensorIterator - func: std.correction_out(Tensor self, int[1]? dim, *, int? correction, bool keepdim=False, Tensor(a!) out) -> Tensor(a!) @@ -4894,6 +5088,7 @@ dispatch: CPU, CUDA: prod MPS: prod_mps + autogen: prod.out - func: prod.dim_int(Tensor self, int dim, bool keepdim=False, *, ScalarType? dtype=None) -> Tensor structured_delegate: prod.int_out @@ -5073,6 +5268,7 @@ dispatch: CPU, QuantizedCPU, CUDA, QuantizedCUDA: flip MPS: flip_mps + autogen: flip.out - func: fliplr(Tensor self) -> Tensor variants: function, method @@ -5085,6 +5281,7 @@ dispatch: CPU: roll_cpu CUDA: roll_cuda + autogen: roll.out # default int[] value [0,1] should not add space after comma, since codegen parser uses ', ' to split args @@ -5092,6 +5289,7 @@ variants: function, method dispatch: CompositeExplicitAutograd: rot90 + autogen: rot90.out - func: trapezoid.x(Tensor y, Tensor x, *, int dim=-1) -> Tensor @@ -5106,10 +5304,12 @@ dispatch: CPU, NestedTensorCPU: transform_bias_rescale_qkv_cpu CUDA, NestedTensorCUDA: transform_bias_rescale_qkv_cuda + autogen: _transform_bias_rescale_qkv.out -- func: _nested_tensor_from_mask(Tensor t, Tensor mask) -> Tensor +- func: _nested_tensor_from_mask(Tensor t, Tensor mask, bool mask_check=True) -> Tensor dispatch: CPU, CUDA: NestedTensor_nested_tensor_from_mask + autogen: _nested_tensor_from_mask.out - func: _nested_tensor_from_mask_left_aligned(Tensor t, Tensor mask) -> bool dispatch: @@ -5120,22 +5320,26 @@ dispatch: CPU: nested_from_padded_generic CUDA: nested_from_padded_cuda + autogen: _nested_from_padded.out - func: _nested_tensor_size(Tensor self) -> Tensor variants: method dispatch: NestedTensorCPU, NestedTensorCUDA: NestedTensor_get_nested_size_tensor + autogen: _nested_tensor_size.out # _nested_from_padded is not usable from Python, so # _nested_from_padded_and_nested_example is available for testing. - func: _nested_from_padded_and_nested_example(Tensor padded, Tensor nt_example) -> Tensor dispatch: NestedTensorCPU, NestedTensorCUDA: NestedTensor_from_padded_and_nested_example + autogen: _nested_from_padded_and_nested_example.out - func: _trilinear(Tensor i1, Tensor i2, Tensor i3, int[] expand1, int[] expand2, int[] expand3, int[] sumdim, int unroll_dim=1) -> Tensor dispatch: - # calls unsqueeze + # calls unsqueeze CompositeExplicitAutogradNonFunctional: _trilinear + autogen: _trilinear.out - func: triplet_margin_loss(Tensor anchor, Tensor positive, Tensor negative, float margin=1.0, float p=2, float eps=1e-06, bool swap=False, int reduction=Mean) -> Tensor @@ -5185,6 +5389,7 @@ dispatch: CPU: _unique_cpu CUDA: _unique_cuda + autogen: _unique.out - func: unique_dim(Tensor self, int dim, bool sorted=True, bool return_inverse=False, bool return_counts=False) -> (Tensor, Tensor, Tensor) variants: function @@ -5192,6 +5397,7 @@ CPU: unique_dim_cpu CUDA: unique_dim_cuda tags: dynamic_output_shape + autogen: unique_dim.out - func: unique_consecutive(Tensor self, bool return_inverse=False, bool return_counts=False, int? 
dim=None) -> (Tensor, Tensor, Tensor) variants: function @@ -5199,6 +5405,7 @@ CPU: unique_consecutive_cpu CUDA: unique_consecutive_cuda tags: dynamic_output_shape + autogen: unique_consecutive.out - func: unique_dim_consecutive(Tensor self, int dim, bool return_inverse=False, bool return_counts=False) -> (Tensor, Tensor, Tensor) variants: function @@ -5206,6 +5413,7 @@ CPU: unique_dim_consecutive_cpu CUDA: unique_dim_consecutive_cuda tags: dynamic_output_shape + autogen: unique_dim_consecutive.out # _unique and _unique_dim are fragile and modifying them easily cause internal break # the below operator is a temporary hack for adding return_counts support @@ -5217,10 +5425,12 @@ CPU: _unique2_cpu CUDA: _unique2_cuda tags: dynamic_output_shape + autogen: _unique2.out - func: _unsafe_view(Tensor self, int[] size) -> Tensor dispatch: CompositeExplicitAutograd: _unsafe_view + autogen: _unsafe_view.out - func: unsqueeze(Tensor(a) self, int dim) -> Tensor(a) variants: function, method @@ -5245,7 +5455,7 @@ device_check: NoCheck # TensorIterator variants: function, method -- func: var.dim(Tensor self, int[1] dim, bool unbiased=True, bool keepdim=False) -> Tensor +- func: var.dim(Tensor self, int[1]? dim, bool unbiased=True, bool keepdim=False) -> Tensor device_check: NoCheck # TensorIterator variants: function, method @@ -5256,7 +5466,7 @@ CPU, CUDA: var MPS: var_mps -- func: var.out(Tensor self, int[1] dim, bool unbiased=True, bool keepdim=False, *, Tensor(a!) out) -> Tensor(a!) +- func: var.out(Tensor self, int[1]? dim, bool unbiased=True, bool keepdim=False, *, Tensor(a!) out) -> Tensor(a!) device_check: NoCheck # TensorIterator - func: var.correction_out(Tensor self, int[1]? dim, *, int? correction, bool keepdim=False, Tensor(a!) out) -> Tensor(a!) @@ -5283,7 +5493,7 @@ device_check: NoCheck # TensorIterator variants: function -- func: var_mean.dim(Tensor self, int[1] dim, bool unbiased=True, bool keepdim=False) -> (Tensor, Tensor) +- func: var_mean.dim(Tensor self, int[1]? dim, bool unbiased=True, bool keepdim=False) -> (Tensor, Tensor) device_check: NoCheck # TensorIterator variants: function @@ -5292,6 +5502,7 @@ variants: function dispatch: CPU, CUDA: var_mean + autogen: var_mean.correction_out - func: var_mean.names_dim(Tensor self, Dimname[1] dim, bool unbiased=True, bool keepdim=False) -> (Tensor, Tensor) device_check: NoCheck # TensorIterator @@ -5345,12 +5556,14 @@ dispatch: CPU: weight_norm_cpu CUDA: weight_norm_cuda + autogen: _weight_norm_interface.out - func: _weight_norm_interface_backward(Tensor grad_w, Tensor saved_v, Tensor saved_g, Tensor saved_norms, int dim) -> (Tensor, Tensor) variants: function dispatch: CPU: weight_norm_backward_cpu CUDA: weight_norm_backward_cuda + autogen: _weight_norm_interface_backward.out - func: _weight_norm_differentiable_backward(Tensor grad_w, Tensor saved_v, Tensor saved_g, Tensor saved_norms, int dim) -> (Tensor, Tensor) variants: function @@ -5360,11 +5573,13 @@ device_guard: False dispatch: CompositeExplicitAutograd: zeros + autogen: zeros.names_out - func: _efficientzerotensor(int[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CPU: _efficientzerotensor CUDA: _efficientzerotensor_cuda + autogen: _efficientzerotensor.out - func: zeros(int[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: @@ -5373,6 +5588,7 @@ - func: zeros.SymInt(SymInt[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? 
device=None, bool? pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: zeros_symint + autogen: zeros.SymInt_out - func: zeros.out(int[] size, *, Tensor(a!) out) -> Tensor(a!) dispatch: @@ -5384,12 +5600,14 @@ # NB: Although this composite mutates on the inside, it is # non-differentiable so NonFunctional doesn't apply CompositeExplicitAutograd: zeros_like + autogen: zeros_like.out - func: _standard_gamma_grad(Tensor self, Tensor output) -> Tensor variants: function dispatch: CPU: _standard_gamma_grad_cpu CUDA: _standard_gamma_grad_cuda + autogen: _standard_gamma_grad.out - func: _standard_gamma(Tensor self, Generator? generator=None) -> Tensor variants: function @@ -5397,17 +5615,21 @@ CPU: _s_gamma_cpu CUDA: _s_gamma_cuda tags: nondeterministic_seeded + autogen: _standard_gamma.out - func: _dirichlet_grad(Tensor x, Tensor alpha, Tensor total) -> Tensor dispatch: CPU: _dirichlet_grad_cpu CUDA: _dirichlet_grad_cuda + autogen: _dirichlet_grad.out - func: _sample_dirichlet(Tensor self, Generator? generator=None) -> Tensor + tags: nondeterministic_seeded variants: function dispatch: CPU: _s_dirichlet_cpu CUDA: _s_dirichlet_cuda + autogen: _sample_dirichlet.out - func: poisson(Tensor self, Generator? generator=None) -> Tensor device_check: NoCheck # TensorIterator @@ -5415,6 +5637,7 @@ CPU: _s_poisson_cpu CUDA: _s_poisson_cuda tags: nondeterministic_seeded + autogen: poisson.out - func: binomial(Tensor count, Tensor prob, Generator? generator=None) -> Tensor device_check: NoCheck # TensorIterator @@ -5422,6 +5645,7 @@ CPU: _s_binomial_cpu CUDA: _s_binomial_cuda tags: nondeterministic_seeded + autogen: binomial.out # When more variants get ported to native, this dispatch will get more # complicated @@ -5429,10 +5653,12 @@ - func: native_norm(Tensor self, Scalar p=2) -> Tensor dispatch: SparseCPU, SparseCUDA: norm_sparse + autogen: native_norm.out - func: native_norm.ScalarOpt_dim_dtype(Tensor self, Scalar? p, int[1] dim, bool keepdim, ScalarType? dtype) -> Tensor dispatch: SparseCPU, SparseCUDA: norm_sparse + autogen: native_norm.ScalarOpt_dim_dtype_out # TODO: reduce signatures down to one when optional args is available - func: _sparse_sum(Tensor self) -> Tensor @@ -5442,6 +5668,7 @@ - func: _sparse_sum.dim(Tensor self, int[1] dim) -> Tensor dispatch: CompositeExplicitAutograd: _sparse_sum + autogen: _sparse_sum.dim_out - func: _sparse_sum.dim_dtype(Tensor self, int[1] dim, *, ScalarType dtype) -> Tensor @@ -5449,16 +5676,19 @@ dispatch: SparseCPU: _sparse_sum_backward_cpu SparseCUDA: _sparse_sum_backward_cuda + autogen: _sparse_sum_backward.out - func: _sparse_csr_sum.dim_dtype(Tensor self, int[1] dim, bool keepdim=False, *, ScalarType? dtype=None) -> Tensor dispatch: SparseCsrCPU: _sparse_csr_sum_cpu SparseCsrCUDA: _sparse_csr_sum_cuda + autogen: _sparse_csr_sum.dim_dtype_out - func: _sparse_csr_prod.dim_dtype(Tensor self, int[1] dim, bool keepdim=False, *, ScalarType? dtype=None) -> Tensor dispatch: SparseCsrCPU: _sparse_csr_prod_cpu SparseCsrCUDA: _sparse_csr_prod_cuda + autogen: _sparse_csr_prod.dim_dtype_out - func: _sparse_softmax.int(Tensor self, int dim, ScalarType? 
dtype=None) -> Tensor python_module: sparse @@ -5473,11 +5703,13 @@ dispatch: SparseCPU: softmax_sparse_cpu SparseCUDA: softmax_sparse_cuda + autogen: _sparse_softmax.out - func: _sparse_softmax_backward_data(Tensor grad_output, Tensor output, int dim, Tensor self) -> Tensor dispatch: SparseCPU: softmax_backward_sparse_cpu SparseCUDA: softmax_backward_sparse_cuda + autogen: _sparse_softmax_backward_data.out - func: _sparse_log_softmax.int(Tensor self, int dim, ScalarType? dtype=None) -> Tensor python_module: sparse @@ -5492,28 +5724,33 @@ dispatch: SparseCPU: log_softmax_sparse_cpu SparseCUDA: log_softmax_sparse_cuda + autogen: _sparse_log_softmax.out - func: _sparse_log_softmax_backward_data(Tensor grad_output, Tensor output, int dim, Tensor self) -> Tensor dispatch: SparseCPU: log_softmax_backward_sparse_cpu SparseCUDA: log_softmax_backward_sparse_cuda + autogen: _sparse_log_softmax_backward_data.out - func: _spdiags(Tensor diagonals, Tensor offsets, int[] shape, Layout? layout=None) -> Tensor python_module: sparse dispatch: CPU: spdiags + autogen: _spdiags.out - func: norm.ScalarOpt_dtype(Tensor self, Scalar? p, *, ScalarType dtype) -> Tensor device_check: NoCheck # TensorIterator variants: function, method dispatch: CompositeExplicitAutograd: norm + autogen: norm.ScalarOpt_dtype_out - func: norm.Scalar(Tensor self, Scalar p=2) -> Tensor device_check: NoCheck # TensorIterator variants: function, method dispatch: CompositeExplicitAutograd: norm + autogen: norm.Scalar_out - func: norm.ScalarOpt_dim_dtype(Tensor self, Scalar? p, int[1] dim, bool keepdim, *, ScalarType dtype) -> Tensor structured_delegate: norm.dtype_out @@ -5603,6 +5840,7 @@ MkldnnCPU: mkldnn_clone QuantizedCPU, QuantizedCUDA: quantized_clone NestedTensorCPU, NestedTensorCUDA: clone_nested + autogen: clone.out - func: positive(Tensor(a) self) -> Tensor(a) variants: function, method @@ -5693,6 +5931,7 @@ variants: function dispatch: CPU, CUDA: rsub + autogen: rsub.Tensor_out - func: heaviside.out(Tensor self, Tensor values, *, Tensor(a!) out) -> Tensor(a!) structured: True @@ -5717,6 +5956,7 @@ variants: function dispatch: CompositeExplicitAutograd: rsub + autogen: rsub.Scalar_out # Functionally the same as addmm, but we give it a different derivative formula # that doesn't propagate gradients to non-present entries on sparse. @@ -5724,6 +5964,7 @@ python_module: sparse dispatch: CompositeExplicitAutograd: _sparse_addmm + autogen: _sparse_addmm.out - func: sparse_sampled_addmm.out(Tensor self, Tensor mat1, Tensor mat2, *, Scalar beta=1, Scalar alpha=1, Tensor(a!) out) -> Tensor(a!) python_module: sparse @@ -5907,6 +6148,7 @@ - func: sparse_coo_tensor.size(int[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=False) -> Tensor dispatch: CompositeExplicitAutograd: sparse_coo_tensor + autogen: sparse_coo_tensor.size_out - func: sparse_coo_tensor.indices(Tensor indices, Tensor values, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor @@ -5925,10 +6167,12 @@ - func: _sparse_coo_tensor_with_dims(int sparse_dim, int dense_dim, int[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=False) -> Tensor dispatch: SparseCPU, SparseCUDA, SparseMeta, Meta: new_with_dims_sparse + autogen: _sparse_coo_tensor_with_dims.out - func: _sparse_coo_tensor_with_dims_and_tensors(int sparse_dim, int dense_dim, int[] size, Tensor indices, Tensor values, *, ScalarType? dtype=None, Layout? layout=None, Device? 
device=None, bool? pin_memory=False) -> Tensor dispatch: SparseCPU, SparseCUDA, SparseMeta, Meta: new_with_dims_and_tensor_sparse + autogen: _sparse_coo_tensor_with_dims_and_tensors.out - func: sparse_resize_(Tensor(a!) self, int[] size, int sparse_dim, int dense_dim) -> Tensor(a!) use_const_ref_for_mutable_tensors: True @@ -5950,6 +6194,7 @@ SparseCPU: sparse_mask_cpu SparseCUDA: sparse_mask_cuda SparseCsrCPU, SparseCsrCUDA: sparse_mask_sparse_csr + autogen: sparse_mask.out - func: _to_cpu(Tensor[] tensors) -> Tensor[] variants: function @@ -5964,6 +6209,7 @@ SparseCPU, SparseCUDA: sparse_to_dense SparseCsrCPU, SparseCsrCUDA: sparse_compressed_to_dense MkldnnCPU: mkldnn_to_dense + autogen: _to_dense.out - func: to_dense_backward(Tensor grad, Tensor input) -> Tensor @@ -6019,6 +6265,7 @@ dispatch: SparseCPU: _coalesce_sparse_cpu SparseCUDA: _coalesce_sparse_cuda + autogen: _coalesce.out - func: is_coalesced(Tensor self) -> bool variants: method @@ -6126,12 +6373,14 @@ dispatch: CPU, CUDA: dense_to_sparse SparseCsrCPU, SparseCsrCUDA: sparse_compressed_to_sparse + autogen: to_sparse.sparse_dim_out - func: to_sparse(Tensor self) -> Tensor variants: method dispatch: CPU, CUDA: dense_to_sparse SparseCsrCPU, SparseCsrCUDA: sparse_compressed_to_sparse + autogen: to_sparse.out - func: to_sparse_csr(Tensor self) -> Tensor variants: method @@ -6139,6 +6388,7 @@ CPU, CUDA: dense_to_sparse_csr SparseCPU, SparseCUDA: coo_to_sparse_csr SparseCsrCPU, SparseCsrCUDA: sparse_compressed_to_sparse_csr + autogen: to_sparse_csr.out - func: to_sparse_csc(Tensor self) -> Tensor variants: method @@ -6146,6 +6396,7 @@ CPU, CUDA: dense_to_sparse_csc SparseCPU, SparseCUDA: coo_to_sparse_csc SparseCsrCPU, SparseCsrCUDA: sparse_compressed_to_sparse_csc + autogen: to_sparse_csc.out - func: to_sparse_bsr(Tensor self, int[2] blocksize) -> Tensor variants: method @@ -6153,6 +6404,7 @@ CPU, CUDA: dense_to_sparse_bsr SparseCPU, SparseCUDA: coo_to_sparse_bsr SparseCsrCPU, SparseCsrCUDA: sparse_compressed_to_sparse_bsr + autogen: to_sparse_bsr.out - func: to_sparse_bsc(Tensor self, int[2] blocksize) -> Tensor variants: method @@ -6160,23 +6412,27 @@ CPU, CUDA: dense_to_sparse_bsc SparseCPU, SparseCUDA: coo_to_sparse_bsc SparseCsrCPU, SparseCsrCUDA: sparse_compressed_to_sparse_bsc + autogen: to_sparse_bsc.out - func: to_mkldnn(Tensor self, ScalarType? 
dtype=None) -> Tensor variants: method dispatch: CPU: dense_to_mkldnn + autogen: to_mkldnn.out - func: mkldnn_reorder_conv2d_weight(Tensor self, int[2] padding=0, int[2] stride=1, int[2] dilation=1, int groups=1) -> Tensor variants: function python_module: nn dispatch: MkldnnCPU: mkldnn_reorder_conv2d_weight + autogen: mkldnn_reorder_conv2d_weight.out - func: mkldnn_reorder_conv3d_weight(Tensor self, int[3] padding=0, int[3] stride=1, int[3] dilation=1, int groups=1) -> Tensor variants: function python_module: nn dispatch: MkldnnCPU: mkldnn_reorder_conv3d_weight + autogen: mkldnn_reorder_conv3d_weight.out - func: to_mkldnn_backward(Tensor grad, Tensor input) -> Tensor @@ -6184,37 +6440,44 @@ variants: function dispatch: CPU, CUDA: quantize_per_tensor_dynamic + autogen: quantize_per_tensor_dynamic.out - func: quantize_per_tensor(Tensor self, float scale, int zero_point, ScalarType dtype) -> Tensor variants: function dispatch: CPU, CUDA: quantize_per_tensor + autogen: quantize_per_tensor.out - func: quantize_per_tensor.tensor_qparams(Tensor self, Tensor scale, Tensor zero_point, ScalarType dtype) -> Tensor variants: function dispatch: CPU, CUDA: quantize_per_tensor_tensor_qparams + autogen: quantize_per_tensor.tensor_qparams_out - func: quantize_per_tensor.tensors(Tensor[] tensors, Tensor scales, Tensor zero_points, ScalarType dtype) -> Tensor[] variants: function dispatch: CPU: quantize_per_tensor_list_cpu + autogen: quantize_per_tensor.tensors_out - func: quantize_per_channel(Tensor self, Tensor scales, Tensor zero_points, int axis, ScalarType dtype) -> Tensor variants: function dispatch: CPU, CUDA: quantize_per_channel + autogen: quantize_per_channel.out - func: dequantize.self(Tensor self) -> Tensor variants: function, method dispatch: CPU, CUDA: dequantize_cpu_or_cuda QuantizedCPU, QuantizedCUDA: dequantize_quantized + autogen: dequantize.self_out - func: dequantize.tensors(Tensor[] tensors) -> Tensor[] variants: function dispatch: QuantizedCPU: dequantize_tensors_quantized_cpu + autogen: dequantize.tensors_out - func: q_scale(Tensor self) -> float variants: function, method @@ -6230,11 +6493,13 @@ variants: function, method dispatch: QuantizedCPU, QuantizedCUDA: q_per_channel_scales + autogen: q_per_channel_scales.out - func: q_per_channel_zero_points(Tensor self) -> Tensor variants: function, method dispatch: QuantizedCPU, QuantizedCUDA: q_per_channel_zero_points + autogen: q_per_channel_zero_points.out - func: q_per_channel_axis(Tensor self) -> int variants: function, method @@ -6247,16 +6512,19 @@ dispatch: QuantizedCPU: int_repr_quantized_cpu QuantizedCUDA: int_repr_quantized_cuda + autogen: int_repr.out - func: _make_per_tensor_quantized_tensor(Tensor self, float scale, int zero_point) -> Tensor dispatch: CPU: make_per_tensor_quantized_tensor_cpu CUDA: make_per_tensor_quantized_tensor_cuda + autogen: _make_per_tensor_quantized_tensor.out - func: _make_per_channel_quantized_tensor(Tensor self, Tensor scale, Tensor zero_point, int axis) -> Tensor dispatch: CPU: make_per_channel_quantized_tensor_cpu CUDA: make_per_channel_quantized_tensor_cuda + autogen: _make_per_channel_quantized_tensor.out - func: qscheme(Tensor self) -> QScheme variants: method @@ -6275,11 +6543,13 @@ variants: function dispatch: CPU, CUDA: fake_quantize_per_tensor_affine_cachemask + autogen: fake_quantize_per_tensor_affine_cachemask.out - func: _fake_quantize_per_tensor_affine_cachemask_tensor_qparams(Tensor self, Tensor scale, Tensor zero_point, Tensor fake_quant_enabled, int quant_min, int quant_max) -> 
(Tensor output, Tensor mask) variants: function dispatch: CPU, CUDA: _fake_quantize_per_tensor_affine_cachemask_tensor_qparams + autogen: _fake_quantize_per_tensor_affine_cachemask_tensor_qparams.out - func: fake_quantize_per_tensor_affine_cachemask_backward(Tensor grad, Tensor mask) -> Tensor variants: function @@ -6288,6 +6558,7 @@ variants: function dispatch: CPU, CUDA: _fake_quantize_learnable_per_tensor_affine + autogen: _fake_quantize_learnable_per_tensor_affine.out - func: _fake_quantize_learnable_per_tensor_affine_backward(Tensor grad, Tensor self, Tensor scale, Tensor zero_point, int quant_min, int quant_max, float grad_factor=1.0) -> (Tensor, Tensor, Tensor) variants: function @@ -6300,6 +6571,7 @@ variants: function dispatch: CPU, CUDA: fake_quantize_per_channel_affine_cachemask + autogen: fake_quantize_per_channel_affine_cachemask.out - func: fake_quantize_per_channel_affine_cachemask_backward(Tensor grad, Tensor mask) -> Tensor variants: function @@ -6308,6 +6580,7 @@ variants: function dispatch: CPU, CUDA: _fake_quantize_learnable_per_channel_affine + autogen: _fake_quantize_learnable_per_channel_affine.out - func: _fake_quantize_learnable_per_channel_affine_backward(Tensor grad, Tensor self, Tensor scale, Tensor zero_point, int axis, int quant_min, int quant_max, float grad_factor=1.0) -> (Tensor, Tensor, Tensor) variants: function @@ -6343,6 +6616,7 @@ device_guard: False dispatch: CompositeExplicitAutograd: _to_copy + autogen: _to_copy.out # to(Device) must not exist because all constructors of Device also works for # TensorOptions. Otherwise, an ambiguity error is thrown. @@ -6381,6 +6655,7 @@ variants: function - func: item(Tensor self) -> Scalar + tags: data_dependent_output variants: method - func: result_type.Tensor(Tensor tensor, Tensor other) -> ScalarType @@ -6402,6 +6677,7 @@ # NB: Does NOT check precondition that numel == 1 - func: _local_scalar_dense(Tensor self) -> Scalar + tags: data_dependent_output dispatch: CPU: _local_scalar_dense_cpu CUDA: _local_scalar_dense_cuda @@ -6413,16 +6689,19 @@ - func: _lstm_mps(Tensor input, Tensor[] hx, Tensor[] params, bool has_biases, int num_layers, float dropout, bool train, bool bidirectional, bool batch_first) -> (Tensor, Tensor, Tensor, Tensor, Tensor) dispatch: MPS: _lstm_mps + autogen: _lstm_mps.out - func: lstm_mps_backward(Tensor grad_y, Tensor? grad_hy, Tensor? grad_cy, Tensor z_state, Tensor cell_state_fwd, Tensor input, Tensor[] hx, Tensor[] params, bool has_biases, int num_layers, float dropout, bool train, bool bidirectional, bool batch_first) -> (Tensor, Tensor[], Tensor[]) dispatch: MPS: lstm_mps_backward + autogen: lstm_mps_backward.out # Fused RNN kernels - func: _thnn_fused_lstm_cell(Tensor input_gates, Tensor hidden_gates, Tensor cx, Tensor? input_bias=None, Tensor? hidden_bias=None) -> (Tensor, Tensor, Tensor) dispatch: CUDA: _thnn_fused_lstm_cell_cuda + autogen: _thnn_fused_lstm_cell.out # NB: The composite version of this function below is a simple wrapper that duplicates some of the outputs # It is necessary to avoid triggering TensorImpl use count checks in debug mode @@ -6430,6 +6709,7 @@ - func: _thnn_fused_lstm_cell_backward_impl(Tensor? grad_hy, Tensor? grad_cy, Tensor cx, Tensor cy, Tensor workspace, bool has_bias) -> (Tensor, Tensor, Tensor) dispatch: CUDA: _thnn_fused_lstm_cell_backward_impl_cuda + autogen: _thnn_fused_lstm_cell_backward_impl.out - func: _thnn_fused_lstm_cell_backward(Tensor? grad_hy, Tensor? 
grad_cy, Tensor cx, Tensor cy, Tensor workspace, bool has_bias) -> (Tensor, Tensor, Tensor, Tensor, Tensor) @@ -6438,10 +6718,12 @@ - func: _thnn_fused_gru_cell(Tensor input_gates, Tensor hidden_gates, Tensor hx, Tensor? input_bias=None, Tensor? hidden_bias=None) -> (Tensor, Tensor) dispatch: CUDA: _thnn_fused_gru_cell_cuda + autogen: _thnn_fused_gru_cell.out - func: _thnn_fused_gru_cell_backward(Tensor grad_hy, Tensor workspace, bool has_bias) -> (Tensor, Tensor, Tensor, Tensor, Tensor) dispatch: CUDA: _thnn_fused_gru_cell_backward_cuda + autogen: _thnn_fused_gru_cell_backward.out - func: _thnn_differentiable_gru_cell_backward(Tensor grad_hy, Tensor input_gates, Tensor hidden_gates, Tensor hx, Tensor? input_bias, Tensor? hidden_bias) -> (Tensor, Tensor, Tensor, Tensor, Tensor) @@ -6500,6 +6782,7 @@ - func: _pack_padded_sequence(Tensor input, Tensor lengths, bool batch_first) -> (Tensor, Tensor) dispatch: CompositeExplicitAutograd: _pack_padded_sequence + autogen: _pack_padded_sequence.out - func: _pack_padded_sequence_backward(Tensor grad, int[] input_size, Tensor batch_sizes, bool batch_first) -> Tensor @@ -6556,6 +6839,7 @@ - func: lift(Tensor self) -> Tensor dispatch: CompositeExplicitAutograd: lift + autogen: lift.out # lift_fresh is called with an argument that is guaranteed to be # fresh (i.e., newly allocated). This is ONLY called from a @@ -6571,6 +6855,7 @@ tags: view_copy dispatch: CompositeExplicitAutograd: lift_fresh_copy + autogen: lift_fresh_copy.out - func: is_set_to(Tensor self, Tensor tensor) -> bool variants: method @@ -6623,15 +6908,17 @@ dispatch: CompositeExplicitAutograd: masked_scatter -- func: _masked_softmax(Tensor self, Tensor mask, int? dim=None) -> Tensor +- func: _masked_softmax(Tensor self, Tensor mask, int? dim=None, int? mask_type=None) -> Tensor dispatch: CUDA: masked_softmax_cuda CPU: masked_softmax_cpu + autogen: _masked_softmax.out - func: _masked_softmax_backward(Tensor grad_output, Tensor output, Tensor mask, int? dim=None) -> Tensor dispatch: CUDA: masked_softmax_backward_cuda CPU: masked_softmax_backward_cpu + autogen: _masked_softmax_backward.out - func: view.SymInt(Tensor(a) self, SymInt[] size) -> Tensor(a) variants: method @@ -6887,6 +7174,7 @@ variants: function dispatch: CompositeExplicitAutograd: bitwise_and + autogen: bitwise_and.Scalar_Tensor_out - func: bitwise_and.Tensor(Tensor self, Tensor other) -> Tensor device_check: NoCheck # TensorIterator @@ -6941,6 +7229,7 @@ variants: function dispatch: CompositeExplicitAutograd: bitwise_or + autogen: bitwise_or.Scalar_Tensor_out - func: bitwise_or.Tensor(Tensor self, Tensor other) -> Tensor device_check: NoCheck # TensorIterator @@ -6995,6 +7284,7 @@ variants: function dispatch: CompositeExplicitAutograd: bitwise_xor + autogen: bitwise_xor.Scalar_Tensor_out - func: bitwise_xor.Tensor(Tensor self, Tensor other) -> Tensor device_check: NoCheck # TensorIterator @@ -7092,6 +7382,7 @@ variants: function dispatch: CompositeExplicitAutograd: bitwise_left_shift + autogen: bitwise_left_shift.Scalar_Tensor_out - func: __rshift__.Scalar(Tensor self, Scalar other) -> Tensor device_check: NoCheck # TensorIterator @@ -7159,6 +7450,7 @@ variants: function dispatch: CompositeExplicitAutograd: bitwise_right_shift + autogen: bitwise_right_shift.Scalar_Tensor_out - func: tril_(Tensor(a!) self, int diagonal=0) -> Tensor(a!) structured_delegate: tril.out @@ -7203,6 +7495,7 @@ - func: random_.from(Tensor(a!) self, int from, int? to, *, Generator? generator=None) -> Tensor(a!) 
device_check: NoCheck # TensorIterator variants: method + tags: nondeterministic_seeded dispatch: CPU, CUDA: random_ Meta: random_meta_ @@ -7211,6 +7504,7 @@ - func: random_.to(Tensor(a!) self, int to, *, Generator? generator=None) -> Tensor(a!) device_check: NoCheck # TensorIterator + tags: nondeterministic_seeded variants: method dispatch: CPU, CUDA: random_ @@ -7220,6 +7514,7 @@ - func: random_(Tensor(a!) self, *, Generator? generator=None) -> Tensor(a!) device_check: NoCheck # TensorIterator + tags: nondeterministic_seeded variants: method dispatch: CPU, CUDA: random_ @@ -7228,6 +7523,7 @@ - func: uniform_(Tensor(a!) self, float from=0, float to=1, *, Generator? generator=None) -> Tensor(a!) device_check: NoCheck # TensorIterator + tags: nondeterministic_seeded variants: method dispatch: CPU, CUDA: uniform_ @@ -7238,12 +7534,14 @@ - func: cauchy_(Tensor(a!) self, float median=0, float sigma=1, *, Generator? generator=None) -> Tensor(a!) device_check: NoCheck # TensorIterator variants: method + tags: nondeterministic_seeded dispatch: CPU, CUDA: cauchy_ autogen: cauchy, cauchy.out - func: log_normal_(Tensor(a!) self, float mean=1, float std=2, *, Generator? generator=None) -> Tensor(a!) device_check: NoCheck # TensorIterator + tags: nondeterministic_seeded variants: method dispatch: CPU, CUDA: log_normal_ @@ -7251,6 +7549,7 @@ - func: exponential_(Tensor(a!) self, float lambd=1, *, Generator? generator=None) -> Tensor(a!) device_check: NoCheck # TensorIterator + tags: nondeterministic_seeded variants: method dispatch: CPU, CUDA: exponential_ @@ -7259,11 +7558,12 @@ - func: geometric_(Tensor(a!) self, float p, *, Generator? generator=None) -> Tensor(a!) device_check: NoCheck # TensorIterator + tags: nondeterministic_seeded variants: method dispatch: CPU, CUDA: geometric_ -# wrappers for TH functions + # wrappers for TH functions autogen: geometric, geometric.out - func: diag.out(Tensor self, int diagonal=0, *, Tensor(a!) out) -> Tensor(a!) @@ -7313,17 +7613,20 @@ dispatch: CPU: tril_indices_cpu CUDA: tril_indices_cuda + autogen: tril_indices.out - func: triu_indices(int row, int col, int offset=0, *, ScalarType? dtype=long, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CPU: triu_indices_cpu CUDA: triu_indices_cuda + autogen: triu_indices.out - func: trace(Tensor self) -> Tensor variants: method, function dispatch: CPU: trace_cpu CUDA: trace_cuda + autogen: trace.out - func: trace_backward(Tensor grad, int[] sizes) -> Tensor variants: function @@ -7851,6 +8154,7 @@ dispatch: CPU: _symeig_helper_cpu CUDA: _symeig_helper_cuda + autogen: _symeig_helper.out - func: eig.e(Tensor self, bool eigenvectors=False, *, Tensor(a!) e, Tensor(b!) v) -> (Tensor(a!) eigenvalues, Tensor(b!) eigenvectors) dispatch: @@ -7913,6 +8217,7 @@ dispatch: CPU: _cholesky_solve_helper_cpu CUDA: _cholesky_solve_helper_cuda + autogen: _cholesky_solve_helper.out - func: cholesky_inverse(Tensor self, bool upper=False) -> Tensor variants: method, function @@ -7973,6 +8278,7 @@ # TODO: remove dispatch section when porting TH CUDA to ATen - func: multinomial.out(Tensor self, int num_samples, bool replacement=False, *, Generator? generator=None, Tensor(a!) out) -> Tensor(a!) + tags: nondeterministic_seeded dispatch: CPU, CUDA: multinomial_out @@ -8115,6 +8421,7 @@ variants: method, function dispatch: CompositeExplicitAutograd: dist + autogen: dist.out - func: atan2.out(Tensor self, Tensor other, *, Tensor(a!) out) -> Tensor(a!) 
device_check: NoCheck # TensorIterator @@ -8200,14 +8507,17 @@ - func: _histogramdd_bin_edges(Tensor self, int[] bins, *, float[]? range=None, Tensor? weight=None, bool density=False) -> Tensor[] dispatch: CPU: histogramdd_bin_edges_cpu + autogen: _histogramdd_bin_edges.out - func: _histogramdd_from_bin_cts(Tensor self, int[] bins, *, float[]? range=None, Tensor? weight=None, bool density=False) -> Tensor dispatch: CPU: histogramdd_cpu + autogen: _histogramdd_from_bin_cts.out - func: _histogramdd_from_bin_tensors(Tensor self, Tensor[] bins, *, Tensor? weight=None, bool density=False) -> Tensor dispatch: CPU: histogramdd_cpu + autogen: _histogramdd_from_bin_tensors.out - func: histogramdd(Tensor self, int[] bins, float[]? range=None, Tensor? weight=None, bool density=False) -> (Tensor hist, Tensor[] bin_edges) @@ -8342,6 +8652,7 @@ variants: function dispatch: CPU, CUDA: remainder + autogen: remainder.Scalar_Tensor_out - func: min(Tensor self) -> Tensor device_check: NoCheck # TensorIterator @@ -8351,6 +8662,13 @@ MPS: min_mps QuantizedCPU: min_quantized_cpu +# Not to be confused with binary op `min.out`. Commented because of failed CI +# FIXME: enable this +#- func: min.unary_out(Tensor self, *, Tensor(a!) out) -> Tensor(a!) +# device_check: NoCheck # TensorIterator +# dispatch: +# CompositeExplicitAutograd: min_unary_out + - func: fmin(Tensor self, Tensor other) -> Tensor structured_delegate: fmin.out device_check: NoCheck # TensorIterator @@ -8371,6 +8689,13 @@ MPS: max_mps QuantizedCPU: max_quantized_cpu +# Not to be confused with binary op `max.out`. Commented because of failed CI +# FIXME: enable this +#- func: max.unary_out(Tensor self, *, Tensor(a!) out) -> Tensor(a!) +# device_check: NoCheck # TensorIterator +# dispatch: +# CompositeExplicitAutograd: max_unary_out + - func: fmax(Tensor self, Tensor other) -> Tensor structured_delegate: fmax.out device_check: NoCheck # TensorIterator @@ -8493,6 +8818,7 @@ variants: method, function dispatch: CPU, CUDA: argsort_stable + autogen: argsort.stable_out - func: argsort.dimname(Tensor self, Dimname dim, bool descending=False) -> Tensor variants: method, function @@ -8564,8 +8890,10 @@ variants: function dispatch: CPU, CUDA: unfold_backward + autogen: unfold_backward.out - func: equal(Tensor self, Tensor other) -> bool + tags: data_dependent_output variants: method, function dispatch: CPU: cpu_equal @@ -8644,6 +8972,7 @@ - func: normal_(Tensor(a!) self, float mean=0, float std=1, *, Generator? generator=None) -> Tensor(a!) device_check: NoCheck # TensorIterator + tags: nondeterministic_seeded variants: method dispatch: CPU, CUDA: normal_ @@ -8657,10 +8986,12 @@ # but we can't due to overload ambiguity with normal.Tensor_float. - func: normal_functional(Tensor self, float mean=0, float std=1, *, Generator? generator=None) -> Tensor device_check: NoCheck # TensorIterator + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: normal_functional - func: normal.Tensor_float_out(Tensor mean, float std=1, *, Generator? generator=None, Tensor(a!) out) -> Tensor(a!) + tags: nondeterministic_seeded dispatch: CPU, CUDA: normal_out MPS: normal_mps_out @@ -8678,6 +9009,7 @@ CPU, CUDA: normal_out Meta: normal_out_meta MPS: normal_mps_out + tags: nondeterministic_seeded - func: normal.float_Tensor(float mean, Tensor std, *, Generator? 
generator=None) -> Tensor dispatch: @@ -8691,6 +9023,7 @@ CPU, CUDA: normal_out Meta: normal_out_meta MPS: normal_mps_out + tags: nondeterministic_seeded - func: normal.Tensor_Tensor(Tensor mean, Tensor std, *, Generator? generator=None) -> Tensor dispatch: @@ -8702,10 +9035,12 @@ - func: normal.float_float(float mean, float std, int[] size, *, Generator? generator=None, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: normal + tags: nondeterministic_seeded - func: normal.float_float_out(float mean, float std, int[] size, *, Generator? generator=None, Tensor(a!) out) -> Tensor(a!) dispatch: CompositeExplicitAutograd: normal_out + tags: nondeterministic_seeded - func: alias(Tensor(a) self) -> Tensor(a) variants: method, function @@ -8724,18 +9059,18 @@ CUDA: _amp_update_scale_cuda_ autogen: _amp_update_scale, _amp_update_scale.out -#- func: _cat(Tensor[] tensors, int dim=0) -> Tensor - #dispatch: + #- func: _cat(Tensor[] tensors, int dim=0) -> Tensor + #dispatch: #CPU: _cat_cpu #CUDA: cat_cuda #MPS: cat_mps #QuantizedCPU: cat_quantized_cpu -#- func: _cat.out(Tensor[] tensors, int dim=0, *, Tensor(a!) out) -> Tensor(a!) - #dispatch: + #- func: _cat.out(Tensor[] tensors, int dim=0, *, Tensor(a!) out) -> Tensor(a!) + #dispatch: #CPU: _cat_out_cpu - #CUDA: cat_out_cuda - #QuantizedCPU: cat_out_quantized_cpu + #CUDA: cat_out_cuda + #QuantizedCPU: cat_out_quantized_cpu - func: _foreach_add.Scalar(Tensor[] self, Scalar scalar) -> Tensor[] device_check: NoCheck # foreach kernels fall back to slow path when tensor are on different devices @@ -9412,6 +9747,14 @@ CPU: foreach_tensor_maximum_slow CUDA: foreach_tensor_maximum_cuda +- func: _foreach_maximum_.List(Tensor(a!)[] self, Tensor[] other) -> () + device_check: NoCheck # foreach kernels fall back to slow path when tensor are on different devices + variants: function + dispatch: + CPU: foreach_tensor_maximum_slow_ + CUDA: foreach_tensor_maximum_cuda_ + autogen: _foreach_maximum.List_out + - func: _foreach_minimum.List(Tensor[] self, Tensor[] other) -> Tensor[] device_check: NoCheck # foreach kernels fall back to slow path when tensor are on different devices variants: function @@ -9419,12 +9762,21 @@ CPU: foreach_tensor_minimum_slow CUDA: foreach_tensor_minimum_cuda +- func: _foreach_minimum_.List(Tensor(a!)[] self, Tensor[] other) -> () + device_check: NoCheck # foreach kernels fall back to slow path when tensor are on different devices + variants: function + dispatch: + CPU: foreach_tensor_minimum_slow_ + CUDA: foreach_tensor_minimum_cuda_ + autogen: _foreach_minimum.List_out + - func: _foreach_norm.Scalar(Tensor[] self, Scalar ord=2) -> Tensor[] device_check: NoCheck # foreach kernels fall back to slow path when tensor are on different devices variants: function dispatch: CPU: foreach_tensor_norm_slow CUDA: foreach_tensor_norm_cuda + autogen: _foreach_norm.Scalar_out - func: bucketize.Tensor(Tensor self, Tensor boundaries, *, bool out_int32=False, bool right=False) -> Tensor dispatch: @@ -9440,6 +9792,7 @@ dispatch: CPU: bucketize_cpu CUDA: bucketize_cuda + autogen: bucketize.Scalar_out - func: searchsorted.Tensor(Tensor sorted_sequence, Tensor self, *, bool out_int32=False, bool right=False, str? side=None, Tensor? 
sorter=None) -> Tensor dispatch: @@ -9455,6 +9808,7 @@ - func: _torch_cuda_cu_linker_symbol_op(Tensor self) -> Tensor dispatch: CUDA: _torch_cuda_cu_linker_symbol_op_cuda + autogen: _torch_cuda_cu_linker_symbol_op.out - func: searchsorted.Tensor_out(Tensor sorted_sequence, Tensor self, *, bool out_int32=False, bool right=False, str? side=None, Tensor? sorter=None, Tensor(a!) out) -> Tensor(a!) dispatch: @@ -9465,6 +9819,7 @@ dispatch: CPU: searchsorted_cpu CUDA: searchsorted_cuda + autogen: searchsorted.Scalar_out - func: _convert_indices_from_coo_to_csr(Tensor self, int size, *, bool out_int32=False) -> Tensor structured_delegate: _convert_indices_from_coo_to_csr.out @@ -9767,11 +10122,13 @@ python_module: nn dispatch: CPU, CUDA: glu_jvp + autogen: glu_jvp.out - func: glu_backward_jvp(Tensor grad_x, Tensor grad_glu, Tensor x, Tensor dgrad_glu, Tensor dx, int dim) -> Tensor python_module: nn dispatch: CPU, CUDA: glu_backward_jvp + autogen: glu_backward_jvp.out - func: hardsigmoid.out(Tensor self, *, Tensor(a!) out) -> Tensor(a!) structured: True @@ -9860,6 +10217,7 @@ python_module: nn dispatch: CPU, CUDA: hardswish_backward + autogen: hardswish_backward.out - func: leaky_relu.out(Tensor self, Scalar negative_slope=0.01, *, Tensor(a!) out) -> Tensor(a!) structured: True @@ -9933,6 +10291,7 @@ - func: rrelu_with_noise.out(Tensor self, Tensor noise, Scalar lower=0.125, Scalar upper=0.3333333333333333, bool training=False, Generator? generator=None, *, Tensor(a!) out) -> Tensor(a!) python_module: nn + tags: nondeterministic_seeded dispatch: CPU: rrelu_with_noise_out_cpu CUDA: rrelu_with_noise_out_cuda @@ -9948,9 +10307,11 @@ python_module: nn dispatch: CompositeExplicitAutograd: rrelu_with_noise_backward + autogen: rrelu_with_noise_backward.out - func: rrelu_with_noise_(Tensor(a!) self, Tensor noise, Scalar lower=0.125, Scalar upper=0.3333333333333333, bool training=False, Generator? generator=None) -> Tensor(a!) python_module: nn + tags: nondeterministic_seeded dispatch: CPU: rrelu_with_noise_cpu_ CUDA: rrelu_with_noise_cuda_ @@ -10011,7 +10372,7 @@ CPU: adaptive_avg_pool2d_out_cpu CUDA: adaptive_avg_pool2d_out_cuda MPS: adaptive_avg_pool2d_out_mps - MkldnnCPU: mkldnn_adaptive_avg_pool2d_out + MkldnnCPU: mkldnn_adaptive_avg_pool2d_out_stub - func: adaptive_avg_pool2d(Tensor self, int[2] output_size) -> Tensor python_module: nn @@ -10020,9 +10381,14 @@ dispatch: MkldnnCPU: mkldnn_adaptive_avg_pool2d +- func: mkldnn_adaptive_avg_pool2d.out(Tensor self, int[2] output_size, *, Tensor(a!) out) -> Tensor(a!) + dispatch: + MkldnnCPU: mkldnn_adaptive_avg_pool2d_out + - func: mkldnn_adaptive_avg_pool2d_backward(Tensor grad_output, Tensor self) -> Tensor dispatch: MkldnnCPU: mkldnn_adaptive_avg_pool2d_backward + autogen: mkldnn_adaptive_avg_pool2d_backward.out - func: _adaptive_avg_pool2d(Tensor self, int[2] output_size) -> Tensor dispatch: @@ -10031,6 +10397,7 @@ MPS: adaptive_avg_pool2d_mps QuantizedCPU: adaptive_avg_pool2d_quantized_cpu QuantizedCUDA: adaptive_avg_pool2d_quantized_cuda + autogen: _adaptive_avg_pool2d.out - func: _adaptive_avg_pool2d_backward(Tensor grad_output, Tensor self) -> Tensor python_module: nn @@ -10038,6 +10405,7 @@ CPU: adaptive_avg_pool2d_backward_cpu CUDA: adaptive_avg_pool2d_backward_cuda MPS: adaptive_avg_pool2d_backward_mps + autogen: _adaptive_avg_pool2d_backward.out - func: adaptive_avg_pool3d.out(Tensor self, int[3] output_size, *, Tensor(a!) out) -> Tensor(a!) 
python_module: nn @@ -10054,6 +10422,7 @@ CPU: adaptive_avg_pool3d_cpu CUDA: adaptive_avg_pool3d_cuda QuantizedCPU: adaptive_avg_pool3d_quantized_cpu + autogen: _adaptive_avg_pool3d.out - func: adaptive_avg_pool3d_backward.grad_input(Tensor grad_output, Tensor self, *, Tensor(a!) grad_input) -> Tensor(a!) python_module: nn @@ -10066,6 +10435,7 @@ dispatch: CPU: adaptive_avg_pool3d_backward_cpu CUDA: adaptive_avg_pool3d_backward_cuda + autogen: _adaptive_avg_pool3d_backward.out # Return: (Tensor output, Tensor indices) - func: adaptive_max_pool2d.out(Tensor self, int[2] output_size, *, Tensor(a!) out, Tensor(b!) indices) -> (Tensor(a!), Tensor(b!)) @@ -10477,101 +10847,121 @@ python_module: nn dispatch: CompositeExplicitAutograd: upsample_linear1d + autogen: upsample_linear1d.vec_out - func: upsample_linear1d_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, bool align_corners, float[]? scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: upsample_linear1d_backward + autogen: upsample_linear1d_backward.vec_out - func: upsample_bilinear2d.vec(Tensor input, int[]? output_size, bool align_corners, float[]? scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: upsample_bilinear2d + autogen: upsample_bilinear2d.vec_out - func: upsample_bilinear2d_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, bool align_corners, float[]? scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: upsample_bilinear2d_backward + autogen: upsample_bilinear2d_backward.vec_out - func: _upsample_bilinear2d_aa.vec(Tensor input, int[]? output_size, bool align_corners, float[]? scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: _upsample_bilinear2d_aa + autogen: _upsample_bilinear2d_aa.vec_out - func: _upsample_bilinear2d_aa_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, bool align_corners, float[]? scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: _upsample_bilinear2d_aa_backward + autogen: _upsample_bilinear2d_aa_backward.vec_out - func: upsample_trilinear3d.vec(Tensor input, int[]? output_size, bool align_corners, float[]? scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: upsample_trilinear3d + autogen: upsample_trilinear3d.vec_out - func: upsample_trilinear3d_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, bool align_corners, float[]? scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: upsample_trilinear3d_backward + autogen: upsample_trilinear3d_backward.vec_out - func: upsample_bicubic2d.vec(Tensor input, int[]? output_size, bool align_corners, float[]? scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: upsample_bicubic2d + autogen: upsample_bicubic2d.vec_out - func: upsample_bicubic2d_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, bool align_corners, float[]? scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: upsample_bicubic2d_backward + autogen: upsample_bicubic2d_backward.vec_out - func: _upsample_bicubic2d_aa.vec(Tensor input, int[]? output_size, bool align_corners, float[]? scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: _upsample_bicubic2d_aa + autogen: _upsample_bicubic2d_aa.vec_out - func: _upsample_bicubic2d_aa_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, bool align_corners, float[]? 
scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: _upsample_bicubic2d_aa_backward + autogen: _upsample_bicubic2d_aa_backward.vec_out - func: upsample_nearest1d.vec(Tensor input, int[]? output_size, float[]? scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: upsample_nearest1d + autogen: upsample_nearest1d.vec_out - func: _upsample_nearest_exact1d.vec(Tensor input, int[]? output_size, float[]? scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: _upsample_nearest_exact1d + autogen: _upsample_nearest_exact1d.vec_out - func: upsample_nearest1d_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, float[]? scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: upsample_nearest1d_backward + autogen: upsample_nearest1d_backward.vec_out - func: _upsample_nearest_exact1d_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, float[]? scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: _upsample_nearest_exact1d_backward + autogen: _upsample_nearest_exact1d_backward.vec_out - func: upsample_nearest2d.vec(Tensor input, int[]? output_size, float[]? scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: upsample_nearest2d + autogen: upsample_nearest2d.vec_out - func: _upsample_nearest_exact2d.vec(Tensor input, int[]? output_size, float[]? scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: _upsample_nearest_exact2d + autogen: _upsample_nearest_exact2d.vec_out - func: upsample_nearest2d_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, float[]? scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: upsample_nearest2d_backward + autogen: upsample_nearest2d_backward.vec_out - func: _upsample_nearest_exact2d_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, float[]? scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: _upsample_nearest_exact2d_backward + autogen: _upsample_nearest_exact2d_backward.vec_out - func: upsample_nearest3d.vec(Tensor input, int[]? output_size, float[]? scale_factors) -> Tensor python_module: nn @@ -10579,6 +10969,7 @@ CPU: upsample_nearest3d_cpu CUDA: upsample_nearest3d_cuda QuantizedCPU: upsample_nearest3d_quantized_cpu + autogen: upsample_nearest3d.vec_out - func: _upsample_nearest_exact3d.vec(Tensor input, int[]? output_size, float[]? scale_factors) -> Tensor python_module: nn @@ -10586,18 +10977,21 @@ CPU: _upsample_nearest_exact3d_cpu CUDA: _upsample_nearest_exact3d_cuda QuantizedCPU: _upsample_nearest_exact3d_quantized_cpu + autogen: _upsample_nearest_exact3d.vec_out - func: upsample_nearest3d_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, float[]? scale_factors) -> Tensor python_module: nn dispatch: CPU: upsample_nearest3d_backward_cpu CUDA: upsample_nearest3d_backward_cuda + autogen: upsample_nearest3d_backward.vec_out - func: _upsample_nearest_exact3d_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, float[]? scale_factors) -> Tensor python_module: nn dispatch: CPU: _upsample_nearest_exact3d_backward_cpu CUDA: _upsample_nearest_exact3d_backward_cuda + autogen: _upsample_nearest_exact3d_backward.vec_out # NOTE: all of the non-"vec" upsample overloads are only kept for backward compatibility. - func: upsample_linear1d.out(Tensor self, int[1] output_size, bool align_corners, float? scales=None, *, Tensor(a!) 
out) -> Tensor(a!) @@ -10986,6 +11380,7 @@ dispatch: CPU: slow_conv2d_backward_cpu CUDA: slow_conv2d_backward_cuda + autogen: _slow_conv2d_backward.output_mask_out - func: _conv_depthwise2d.out(Tensor self, Tensor weight, int[2] kernel_size, Tensor? bias, int[2] stride, int[2] padding, int[2] dilation, *, Tensor(a!) out) -> Tensor(a!) use_const_ref_for_mutable_tensors: True @@ -11002,6 +11397,7 @@ python_module: nn dispatch: CUDA: conv_depthwise3d_cuda + autogen: conv_depthwise3d.out - func: slow_conv3d.out(Tensor self, Tensor weight, int[3] kernel_size, Tensor? bias=None, int[3] stride=1, int[3] padding=0, *, Tensor(a!) out) -> Tensor(a!) python_module: nn @@ -11024,12 +11420,14 @@ dispatch: CPU: slow_conv_dilated2d_cpu CUDA: slow_conv_dilated2d_cuda + autogen: slow_conv_dilated2d.out - func: slow_conv_dilated3d(Tensor self, Tensor weight, int[3] kernel_size, Tensor? bias=None, int[3] stride=1, int[3] padding=0, int[3] dilation=1) -> Tensor python_module: nn dispatch: CPU: slow_conv_dilated3d_cpu CUDA: slow_conv_dilated3d_cuda + autogen: slow_conv_dilated3d.out - func: col2im.out(Tensor self, int[2] output_size, int[2] kernel_size, int[2] dilation, int[2] padding, int[2] stride, *, Tensor(a!) out) -> Tensor(a!) python_module: nn @@ -11097,6 +11495,7 @@ SparseCPU, SparseCUDA: isinf_sparse SparseMeta: isinf_sparse_meta SparseCsrCPU, SparseCsrCUDA: isinf_sparse_csr + autogen: isinf.out - func: record_stream(Tensor(a!) self, Stream s) -> () variants: method @@ -11748,8 +12147,6 @@ - func: linalg_cross.out(Tensor self, Tensor other, *, int dim=-1, Tensor(a!) out) -> Tensor(a!) python_module: linalg structured: True - precomputed: - - dim -> int dim dispatch: CPU, CUDA: linalg_cross_out @@ -11886,6 +12283,7 @@ variants: function dispatch: CPU, CUDA: linalg_matrix_exp + autogen: linalg_matrix_exp.out - func: _linalg_slogdet(Tensor A) -> (Tensor sign, Tensor logabsdet, Tensor LU, Tensor pivots) structured_delegate: _linalg_slogdet.sign @@ -11960,34 +12358,26 @@ dispatch: CPU, CUDA: linalg_householder_product_out -- func: _linalg_inv_out_helper_(Tensor(a!) self, Tensor(b!) infos_lu, Tensor(c!) infos_getri) -> Tensor(a!) - variants: function - dispatch: - CPU: _linalg_inv_out_helper_cpu - CUDA: _linalg_inv_out_helper_cuda - autogen: _linalg_inv_out_helper, _linalg_inv_out_helper.out - -- func: linalg_inv_ex(Tensor self, *, bool check_errors=False) -> (Tensor inverse, Tensor info) +- func: linalg_inv_ex(Tensor A, *, bool check_errors=False) -> (Tensor inverse, Tensor info) python_module: linalg - variants: function - dispatch: - # calls transpose_ - CompositeExplicitAutogradNonFunctional: linalg_inv_ex + structured_delegate: linalg_inv_ex.inverse -- func: linalg_inv_ex.inverse(Tensor self, *, bool check_errors=False, Tensor(a!) inverse, Tensor(b!) info) -> (Tensor(a!) inverse, Tensor(b!) info) +- func: linalg_inv_ex.inverse(Tensor A, *, bool check_errors=False, Tensor(a!) inverse, Tensor(b!) info) -> (Tensor(a!) inverse, Tensor(b!) info) python_module: linalg - variants: function + structured: True dispatch: - # calls transpose_ - CompositeExplicitAutogradNonFunctional: linalg_inv_ex_out + CPU, CUDA: linalg_inv_ex_out -- func: linalg_inv(Tensor self) -> Tensor +- func: linalg_inv(Tensor A) -> Tensor python_module: linalg - variants: function -- func: linalg_inv.out(Tensor self, *, Tensor(a!) out) -> Tensor(a!) +- func: linalg_inv.out(Tensor A, *, Tensor(a!) out) -> Tensor(a!) 
python_module: linalg - variants: function + +- func: inverse(Tensor self) -> Tensor + variants: function, method + +- func: inverse.out(Tensor self, *, Tensor(a!) out) -> Tensor(a!) - func: inner(Tensor self, Tensor other) -> Tensor variants: function, method @@ -12229,18 +12619,21 @@ python_module: nn dispatch: CPU: _test_optional_intlist + autogen: _test_optional_intlist.out # Note: this function is only for testing. - func: _test_optional_filled_intlist(Tensor values, int[2]? addends) -> Tensor python_module: nn dispatch: CPU: _test_optional_intlist + autogen: _test_optional_filled_intlist.out # Note: this function is only for testing. - func: _test_optional_floatlist(Tensor values, float[]? addends) -> Tensor python_module: nn dispatch: CPU: _test_optional_floatlist + autogen: _test_optional_floatlist.out # Note: this function is only for testing. - func: _test_string_default(Tensor dummy, str a="\"'\\", str b='"\'\\') -> Tensor @@ -12260,16 +12653,45 @@ python_module: nn dispatch: CompositeExplicitAutograd: _test_warn_in_autograd + autogen: _test_warn_in_autograd.out + +# Note: this function is only for testing. +- func: _test_autograd_multiple_dispatch.fullcoverage(Tensor self) -> Tensor + dispatch: + # the NestedTensor keys are necessary because NestedTensor has been removed + # from the CompositeExplicitAutograd keyset see Note [NestedTensor Not Included in Backend Keys] + CompositeExplicitAutograd, NestedTensorCPU, NestedTensorCUDA: _test_autograd_multiple_dispatch_fullcoverage + autogen: _test_autograd_multiple_dispatch.fullcoverage_out + +# Note: this function is only for testing. +- func: _test_autograd_multiple_dispatch.ntonly(Tensor self, bool b) -> Tensor + dispatch: + CompositeImplicitAutograd, NestedTensorCPU, NestedTensorCUDA: _test_autograd_multiple_dispatch_ntonly + +# Note: this function is only for testing. +- func: _test_autograd_multiple_dispatch_view(Tensor(a) self) -> Tensor(a) + dispatch: + CompositeExplicitAutograd: _test_autograd_multiple_dispatch_view + +# Note: this function is only for testing. +- func: _test_autograd_multiple_dispatch_view_copy(Tensor self) -> Tensor + variants: function + dispatch: + CompositeExplicitAutogradNonFunctional: _test_autograd_multiple_dispatch_view_copy + tags: view_copy + autogen: _test_autograd_multiple_dispatch_view_copy.out - func: segment_reduce(Tensor data, str reduce, *, Tensor? lengths=None, Tensor? indices=None, Tensor? offsets=None, int axis=0, bool unsafe=False, Scalar? initial=None) -> Tensor variants: function dispatch: CPU, CUDA: segment_reduce_kernel + autogen: segment_reduce.out - func: _segment_reduce_backward(Tensor grad, Tensor output, Tensor data, str reduce, *, Tensor? lengths=None, Tensor? offsets=None, int axis=0, Scalar? 
initial=None) -> Tensor variants: function dispatch: CPU, CUDA: _segment_reduce_backward_kernel + autogen: _segment_reduce_backward.out - func: pad_sequence(Tensor[] sequences, bool batch_first=False, float padding_value=0.0) -> Tensor python_module: nn @@ -12287,6 +12709,7 @@ variants: function dispatch: CompositeExplicitAutograd: nested_tensor + autogen: nested_tensor.out - func: _fw_primal_copy(Tensor self, int level) -> Tensor variants: function @@ -12467,12 +12890,14 @@ dispatch: CompositeExplicitAutograd: ccol_indices_copy tags: view_copy + autogen: ccol_indices_copy.out - func: row_indices_copy(Tensor self) -> Tensor variants: function dispatch: CompositeExplicitAutograd: row_indices_copy tags: view_copy + autogen: row_indices_copy.out - func: unbind_copy.int(Tensor self, int dim=0) -> Tensor[] variants: function @@ -12545,6 +12970,7 @@ dispatch: CompositeExplicitAutograd: view_copy_SymInt tags: view_copy + autogen: view_copy.SymInt_out - func: as_strided_copy.out(Tensor self, int[] size, int[] stride, int? storage_offset=None, *, Tensor(a!) out) -> Tensor(a!) @@ -12719,26 +13145,47 @@ dispatch: NestedTensorCPU: NestedTensor_to_padded_tensor_generic NestedTensorCUDA: NestedTensor_to_padded_tensor_cuda + autogen: to_padded_tensor.out + +- func: _nested_tensor_softmax_with_shape(Tensor self, Tensor query) -> Tensor + dispatch: + NestedTensorCPU: NestedTensor_softmax_dropout + NestedTensorCUDA: NestedTensor_softmax_dropout_cuda - func: _nested_tensor_layer_norm(Tensor self, Tensor? weight, Tensor? bias, float eps) -> Tensor variants: method dispatch: NestedTensorCPU, NestedTensorCUDA: NestedTensor_layer_norm + autogen: _nested_tensor_layer_norm.out # Apparently, putting "forward" in the name will cause Python bindings to be skipped, so "fwd" it is. -- func: _transformer_encoder_layer_fwd(Tensor src, int embed_dim, int num_heads, Tensor qkv_weight, Tensor qkv_bias, Tensor proj_weight, Tensor proj_bias, bool use_gelu, bool norm_first, float eps, Tensor norm_weight_1, Tensor norm_bias_1, Tensor norm_weight_2, Tensor norm_bias_2, Tensor ffn_weight_1, Tensor ffn_bias_1, Tensor ffn_weight_2, Tensor ffn_bias_2, Tensor? mask=None) -> Tensor +- func: _transformer_encoder_layer_fwd(Tensor src, int embed_dim, int num_heads, Tensor qkv_weight, Tensor qkv_bias, Tensor proj_weight, Tensor proj_bias, bool use_gelu, bool norm_first, float eps, Tensor norm_weight_1, Tensor norm_bias_1, Tensor norm_weight_2, Tensor norm_bias_2, Tensor ffn_weight_1, Tensor ffn_bias_1, Tensor ffn_weight_2, Tensor ffn_bias_2, Tensor? mask=None, int? mask_type=None) -> Tensor variants: function dispatch: CPU, CUDA, NestedTensorCPU, NestedTensorCUDA: transformer_encoder_layer_forward + autogen: _transformer_encoder_layer_fwd.out -- func: _native_multi_head_attention(Tensor query, Tensor key, Tensor value, int embed_dim, int num_head, Tensor qkv_weight, Tensor qkv_bias, Tensor proj_weight, Tensor proj_bias, Tensor? mask=None, bool need_weights=True, bool average_attn_weights=True) -> (Tensor, Tensor) +- func: _native_multi_head_attention(Tensor query, Tensor key, Tensor value, int embed_dim, int num_head, Tensor qkv_weight, Tensor qkv_bias, Tensor proj_weight, Tensor proj_bias, Tensor? mask=None, bool need_weights=True, bool average_attn_weights=True, int? mask_type=None) -> (Tensor, Tensor) variants: function dispatch: CPU, CUDA, NestedTensorCPU, NestedTensorCUDA: native_multi_head_attention + autogen: _native_multi_head_attention.out - func: _scaled_dot_product_attention(Tensor query, Tensor key, Tensor value, Tensor? 
attn_mask=None, float dropout_p=0.0, bool need_attn_weights=True, bool is_causal=False) -> (Tensor, Tensor) variants: function +- func: _triton_scaled_dot_attention(Tensor q, Tensor k, Tensor v, float dropout_p=0.0) -> Tensor + variants: function + dispatch: + CUDA: triton_scaled_dot_attention + autogen: _triton_scaled_dot_attention.out + +- func: _triton_multi_head_attention(Tensor query, Tensor key, Tensor value, int embed_dim, int num_head, Tensor qkv_weight, Tensor qkv_bias, Tensor proj_weight, Tensor proj_bias, Tensor? mask=None) -> Tensor + variants: function + dispatch: + CUDA: triton_multi_head_attention + autogen: _triton_multi_head_attention.out + - func: special_airy_ai(Tensor x) -> Tensor python_module: special structured_delegate: special_airy_ai.out @@ -12756,11 +13203,13 @@ variants: function dispatch: CPU, CUDA, NestedTensorCPU, NestedTensorCUDA: transformer_decoder_only_layer_forward + autogen: _transformer_decoder_only_layer_fwd.out - func: _native_decoder_only_multi_head_attention(Tensor query, Tensor key, Tensor value, int embed_dim, int num_head, Tensor qkv_weight, Tensor qkv_bias, Tensor proj_weight, Tensor proj_bias, Tensor? mask=None, Tensor? incr_key=None, Tensor? incr_value=None, bool need_weights=True, bool average_attn_weights=True) -> (Tensor, Tensor, Tensor, Tensor) variants: function dispatch: CPU, CUDA, NestedTensorCPU, NestedTensorCUDA: native_decoder_only_multi_head_attention + autogen: _native_decoder_only_multi_head_attention.out - func: special_bessel_j0(Tensor self) -> Tensor python_module: special @@ -13354,3 +13803,4 @@ - func: _foobar(Tensor self, bool arg1=True, bool arg2=True, *, bool arg3=True) -> Tensor dispatch: CPU: foobar + autogen: _foobar.out diff --git a/aten/src/ATen/native/nested/NestedTensorBackward.cpp b/aten/src/ATen/native/nested/NestedTensorBackward.cpp index 600db24c03aa2..ec96fdfaf4c02 100644 --- a/aten/src/ATen/native/nested/NestedTensorBackward.cpp +++ b/aten/src/ATen/native/nested/NestedTensorBackward.cpp @@ -13,6 +13,25 @@ namespace at { namespace native { +// See Note [nested tensor matmul] in NestedTensorMath.cpp +std::tuple matmul_backward_nested( + const Tensor& grad, + const Tensor& self, + const Tensor& other, + std::array grad_input_mask) { + if (!grad.defined()) { + return std::make_tuple(Tensor(), Tensor()); + } + Tensor grad_self, grad_other; + if (grad_input_mask[0]) { + grad_self = at::matmul(grad, other.transpose(-1, -2)); + } + if (grad_input_mask[1]) { + grad_other = at::matmul(self.transpose(-1, -2), grad); + } + return std::make_tuple(grad_self, grad_other); +} + std::tuple nested_linear_backward( const Tensor& input, const Tensor& grad_output, @@ -64,5 +83,92 @@ Tensor _reshape_nested_backward(const Tensor& self, const Tensor& grad) { return grad.reshape(sizes); } +Tensor nested_softmax_backward( + const Tensor& grad, + const Tensor& output, + int64_t dim, + ScalarType input_dtype) { + TORCH_INTERNAL_ASSERT(grad.is_nested(), "Should be nested grad") + TORCH_INTERNAL_ASSERT(output.is_nested(), "Should be nested output") + + auto output_ptr = get_nested_tensor_impl(output); + auto grad_ptr = get_nested_tensor_impl(grad); + int64_t ntensors = output_ptr->size(0); + if (ntensors == 0) { + return grad.clone(); + } + int64_t positive_dim = at::maybe_wrap_dim(dim, output_ptr->dim()); + + // Get the info about the output + const Tensor &output_buffer = output_ptr->get_buffer(), + &output_sizemat = output_ptr->get_nested_size_tensor(); + + // Get the info about the grad + const Tensor &grad_sizemat = 
grad_ptr->get_nested_size_tensor(); + + TORCH_INTERNAL_ASSERT(output_sizemat.equal(grad_sizemat)); + Tensor grad_output = + wrap_buffer(at::empty_like(output_buffer), output_sizemat.clone()); + + // Unbind nt into individual tensor slices for calculating the derivative + std::vector grad_output_unbind{grad_output.unbind()}, + grad_unbind{grad.unbind()}, output_unbind{output.unbind()}; + + for(const auto i: c10::irange(ntensors)) { + at::_softmax_backward_data_out( + grad_output_unbind[i], + grad_unbind[i], + output_unbind[i], + positive_dim - 1, + input_dtype); + } + return grad_output; + +} + +// Rudimentary sum backward assuming the conditions in #82387 +Tensor _nested_sum_backward_cpu( + const Tensor& grad, + const Tensor& nested_self, + OptionalIntArrayRef opt_dims, + bool keepdim) { + auto nt_self = get_nested_tensor_impl(nested_self); + auto nt_grad = get_nested_tensor_impl(grad); + const Tensor& grad_buffer = nt_grad->get_buffer(); + const Tensor& self_buffer = nt_self->get_buffer(); + auto grad_sizes = nt_grad->get_nested_size_tensor(); + auto self_sizes = nt_self->get_nested_size_tensor(); + int64_t ntensors = nt_self->size(0); + const Tensor& self_grad_buffer = self_buffer.new_empty(self_buffer.sizes()); + + auto num_segments = at::prod(grad_sizes, -1); + auto segment_lengths = self_sizes.select(1, -1); + + // This logic assumes for now that + // (1) all the gradient nested tensors are contiguous + // (2) the gradient nested tensors are stored contiguously in the buffer + AT_DISPATCH_ALL_TYPES_AND2( + ScalarType::Half, ScalarType::BFloat16, self_grad_buffer.scalar_type(), "nested_sum_dim_cpu", [&]() { + auto* self_grad_data = self_grad_buffer.data_ptr(); + const auto* output_grad_data = grad_buffer.data_ptr(); + int64_t out_idx = 0, in_idx = 0; + for (const auto i : c10::irange(ntensors)) { + int64_t segments = num_segments[i].item(); + int64_t segment_length = segment_lengths[i].item(); + for (auto j = 0; j < segments; j++) { + scalar_t output_grad = output_grad_data[out_idx]; + for (auto k = 0; k < segment_length; k++) { + self_grad_data[in_idx] = output_grad; + in_idx += 1; + } + out_idx += 1; + } + } + }); + + return wrap_buffer(self_grad_buffer, self_sizes); + +} + } // namespace native } // namespace at diff --git a/aten/src/ATen/native/nested/NestedTensorMath.cpp b/aten/src/ATen/native/nested/NestedTensorMath.cpp index 6c05986e2e61f..d819bceadbb36 100644 --- a/aten/src/ATen/native/nested/NestedTensorMath.cpp +++ b/aten/src/ATen/native/nested/NestedTensorMath.cpp @@ -96,10 +96,6 @@ Tensor pad_tensor_to_shape( } } // namespace -inline const at::Tensor& get_buffer(const at::Tensor& tensor) { - return get_nested_tensor_impl(tensor)->get_buffer(); -} - std::vector NestedTensor_unbind( const at::Tensor& self, int64_t dim) { @@ -119,14 +115,15 @@ std::vector NestedTensor_unbind( std::vector sizes = NestedTensor_get_sizes(self_ptr), strides = NestedTensor_get_strides(self_ptr); const std::vector& offsets = self_ptr->get_offsets(); - for (int64_t i = 0; i < ntensors; i++) { + for (const int64_t i: c10::irange(ntensors)){ result_tensors[i] = buffer.as_strided(sizes[i], strides[i], offsets[i]); } return result_tensors; } Tensor& NestedTensor_relu_(Tensor& self) { - at::relu_(const_cast(get_nested_tensor_impl(self)->get_buffer())); + auto buffer = get_nested_tensor_impl(self)->get_buffer(); + at::relu_(buffer); return self; } @@ -135,7 +132,8 @@ Tensor NestedTensor_relu(const Tensor& self) { } Tensor& NestedTensor_gelu_(Tensor& self, c10::string_view approximate) { - 
at::gelu_(const_cast(get_nested_tensor_impl(self)->get_buffer()), approximate); + auto buffer = get_nested_tensor_impl(self)->get_buffer(); + at::gelu_(buffer, approximate); return self; } @@ -147,7 +145,7 @@ Tensor NestedTensor_gelu(const Tensor& self, c10::string_view approximate) { }); } -Tensor NestedTensor_nested_tensor_from_mask(const Tensor& t, const Tensor& mask) { +Tensor NestedTensor_nested_tensor_from_mask(const Tensor& t, const Tensor& mask, bool mask_check) { TORCH_CHECK(mask.scalar_type() == at::ScalarType::Bool, "Expected mask to be of ScalarType Bool, but got ", mask.scalar_type(), " instead."); TORCH_CHECK(mask.dim() == 2, "Padding mask should be 2D"); TORCH_CHECK(t.dim() == 3, "Input should be a 3D tensor, N * L * D"); @@ -165,7 +163,8 @@ Tensor NestedTensor_nested_tensor_from_mask(const Tensor& t, const Tensor& mask) sizes = sizes.cumsum(1).select(1, L - 1); nums = nums.to(sizes.options()); - TORCH_CHECK(sizes.equal(nums), "Mask must be left-aligned without gaps"); + if (mask_check) + TORCH_CHECK(sizes.equal(nums), "Mask must be left-aligned without gaps"); sizes = sizes.reshape({N, 1}); // N, ([d1=D, d2=D, ... dN=D]) @@ -706,22 +705,25 @@ at::Tensor NestedTensor_get_nested_size_tensor(const at::Tensor& self){ return get_nested_size_tensor(self); } -Tensor dropout_nested(const Tensor& input, double p, bool train) { +std::tuple native_dropout_nested(const Tensor& input, double p, c10::optional train) { auto input_ptr = get_nested_tensor_impl(input); const Tensor& input_buffer = input_ptr->get_buffer(), & sizemat = input_ptr->get_nested_size_tensor(), & stridemat = input_ptr->get_nested_stride_tensor(); const std::vector& offsets = input_ptr->get_offsets(); - Tensor output_buffer = at::dropout(input_buffer, p, train); + Tensor output_buffer, mask_buffer; + if (input_buffer.numel() == 0) { + output_buffer = input_buffer.clone(); + mask_buffer = input_buffer.clone(); + } + else { + std::tie(output_buffer, mask_buffer) = at::native_dropout(input_buffer, p, train); + } // regular tensor dropout reuses input size and stride // i.e. if input is not contiguous, then output is also discontiguous - return wrap_buffer(output_buffer, sizemat.clone(), stridemat.clone(), offsets); -} - -Tensor& dropout_nested_(Tensor& input, double p, bool train) { - Tensor input_buffer = get_buffer(input); - at::dropout_(input_buffer, p, train); - return input; + Tensor output = wrap_buffer(output_buffer, sizemat.clone(), stridemat.clone(), std::vector(offsets)), + mask = wrap_buffer(mask_buffer, sizemat.clone(), stridemat.clone(), std::vector(offsets)); + return std::make_tuple(output, mask); } Tensor softmax_nested( @@ -731,7 +733,7 @@ Tensor softmax_nested( auto input_ptr = get_nested_tensor_impl(input); int64_t ntensors = input_ptr->size(0); if (ntensors == 0) { - return input; + return input.clone(); } int64_t positive_dim = at::maybe_wrap_dim(dim, input_ptr->dim()); TORCH_CHECK( @@ -819,10 +821,18 @@ Tensor bmm_nested(const Tensor& self, const Tensor& mat2) { return output; } -// utilities support _NestedTensor_GeneralizedBMM +// utilities support `matmul_nested` namespace { +// Args: +// self_sizes: the sizes of `self` in `matmul_nested` +// mat2_sizes: the sizes of `mat2` in `matmul_nested` +// buffer_op: the options for new buffer +// sizemat_op: the options for new size matrix +// Returns: +// the batch size of each input underlying tensor, i.e. 
the product of batch-dimension sizes +// the empty output nested tensor inline std::tuple, Tensor> -_NestedTensor_GeneralizedBMM_BatchSizes_OutputMemory( +matmul_nested_helper( const std::vector& self_sizes, const std::vector& mat2_sizes, const c10::TensorOptions& buffer_op, @@ -869,14 +879,16 @@ _NestedTensor_GeneralizedBMM_BatchSizes_OutputMemory( } } -// This is a generalized batched matmul dedicated to nested tensors, +// Note [nested tensor matmul] +// This is really a generalized batched matmul dedicated to nested tensors, // where `self` and `mat2` have same number (>= 3) of dimensions. // The last 2 dimensions will be considered as matrix dimensions, // so they should be matrix-multiplicable. // The leading dimensions are considered as batch dimensions, // and since nested tensor does not support broadcasting for now, // for each batch dimension `self` and `mat2` must have same size. -Tensor _NestedTensor_GeneralizedBMM(const Tensor& self, const Tensor& mat2) { +// TODO: Should make full matmul semantics support some day +Tensor matmul_nested(const Tensor& self, const Tensor& mat2) { if (self.is_nested() && !mat2.is_nested()) { AT_ERROR("Expected both to be nested, but got a nested self and non-nested other"); } @@ -913,7 +925,7 @@ Tensor _NestedTensor_GeneralizedBMM(const Tensor& self, const Tensor& mat2) { // create a contiguous output std::vector batch_sizes; Tensor output; - std::tie(batch_sizes, output) = _NestedTensor_GeneralizedBMM_BatchSizes_OutputMemory( + std::tie(batch_sizes, output) = matmul_nested_helper( self_sizes, mat2_sizes, self_buffer.options(), self_ptr->get_nested_size_tensor().options()); // call tensor matmul // TODO: `padding nested tensor -> bmm -> remove padding` may be more efficient @@ -945,6 +957,28 @@ Tensor _NestedTensor_GeneralizedBMM(const Tensor& self, const Tensor& mat2) { return output; } +Tensor& matmul_out_nested(const Tensor& tensor1, const Tensor& tensor2, Tensor& result) { + // TODO: this is a very quick and dirty implementation + // should improve it to avoid the intermediate memory usage + Tensor function_result = at::matmul(tensor1, tensor2); + auto function_result_ptr = get_nested_tensor_impl(function_result); + // TODO: this is to reproduce function_result_ptr->opt_sizes_ + // if an accessor is provided in the future, can replace this + std::vector sizes; + for (int64_t i = 0; i < function_result_ptr->dim(); i++) { + c10::optional opt_size = function_result_ptr->opt_size(i); + if (opt_size.has_value()) { + sizes.push_back(*opt_size); + } + else { + sizes.push_back(-1); + } + } + result.reshape(sizes); + result.copy_(function_result); + return result; +} + Tensor transpose_nested(const Tensor& self, int64_t dim0, int64_t dim1) { auto self_ptr = get_nested_tensor_impl(self); // check input dimensions @@ -970,7 +1004,8 @@ Tensor transpose_nested(const Tensor& self, int64_t dim0, int64_t dim1) { // create transposed `sizemat` and `stridemat` Tensor sizemat_transposed = at::index_select(sizemat, 1, column_indices), stridemat_transposed = at::index_select(stridemat, 1, column_indices); - return wrap_buffer(self_ptr->get_buffer(), sizemat_transposed, stridemat_transposed, self_ptr->get_offsets()); + return create_nested_view_tensor( + self, sizemat_transposed, stridemat_transposed, std::vector(self_ptr->get_offsets())); } // utilities supporting `_reshape_nested` @@ -1005,24 +1040,18 @@ inline std::tuple NestedTensor_reshape_size_stride( // some negative sizes remain to be infered if (ndims_underlying < ndims_underlying_reshaped) { // replace 
negative sizes for old dimensions with old sizes - int64_t numel = 1, numel_reshaped = 1; for (int64_t idim = 0; idim < ndims_underlying; idim++) { int64_t& size_reshaped = size_reshaped_vector[idim]; TORCH_CHECK(size_reshaped >= -1, "invalid shape dimension ", size_reshaped); if (size_reshaped == -1) { size_reshaped = size[idim]; } - numel *= size[idim]; - numel_reshaped *= size_reshaped; } // infer negative size for new dimension int64_t infer_index = -1; for (int64_t idim = ndims_underlying; idim < ndims_underlying_reshaped; idim++) { const int64_t& size_reshaped = size_reshaped_vector[idim]; - if (size_reshaped >= 0) { - numel_reshaped *= size_reshaped; - } - else if (size_reshaped == -1) { + if (size_reshaped == -1) { if (infer_index > -1) { throw std::runtime_error("only one dimension can be inferred"); } @@ -1030,7 +1059,7 @@ inline std::tuple NestedTensor_reshape_size_stride( infer_index = idim; } } - else { + else if (size_reshaped < 0) { AT_ERROR("invalid shape dimension ", size_reshaped); } } @@ -1098,7 +1127,7 @@ inline void NestedTensor_reshape_copy( buffer.as_strided(sizes[i], strides[i], offsets[i]).reshape(sizes_reshaped[i])); } } -} +} // namespace // Special rules for reshape(nested tensor): // 1. Only 1 regular dimension can be collapsed with @@ -1142,7 +1171,7 @@ Tensor _reshape_nested(const Tensor& self, IntArrayRef proposed_shape) { std::tie(reshape_as_view, sizemat_reshaped, stridemat_reshaped) = NestedTensor_reshape_size_stride( sizes, strides, proposed_shape, sizemat.options()); if (reshape_as_view) { - return wrap_buffer(buffer, sizemat_reshaped, stridemat_reshaped, offsets); + return wrap_buffer(buffer, sizemat_reshaped, stridemat_reshaped, std::vector(offsets)); } Tensor buffer_reshaped = buffer.new_empty(buffer.sizes()); Tensor output = wrap_buffer(buffer_reshaped, sizemat_reshaped); diff --git a/aten/src/ATen/native/nested/NestedTensorMath.h b/aten/src/ATen/native/nested/NestedTensorMath.h index b315a3b253df3..844000605bb04 100644 --- a/aten/src/ATen/native/nested/NestedTensorMath.h +++ b/aten/src/ATen/native/nested/NestedTensorMath.h @@ -3,6 +3,9 @@ #include #include #include +#include +#include +#include #include @@ -21,11 +24,52 @@ inline at::Tensor wrap_buffer(at::Tensor buffer, at::Tensor nested_size_tensor) inline at::Tensor wrap_buffer( at::Tensor buffer, at::Tensor nested_size_tensor, - at::Tensor nested_stride_tensor, const std::vector& offsets) { + at::Tensor nested_stride_tensor, std::vector&& offsets) { TORCH_INTERNAL_ASSERT_DEBUG_ONLY(buffer.is_contiguous(), "Given buffer must be contiguous."); return at::detail::make_tensor( std::move(buffer), std::move(nested_size_tensor), - std::move(nested_stride_tensor), offsets); + std::move(nested_stride_tensor), std::move(offsets)); +} + +inline at::Tensor get_buffer(const at::Tensor& tensor) { + return get_nested_tensor_impl(tensor)->get_buffer(); +} + + /** + * Create a new nested tensor that is a view of a base nested tensor + * + * create_view_tensor calls a specialized constructor that copys the + * the keys from base onto the new view tensor being created. + * The storage is shared between the base and the returned view tensor + * + * All callers of this helper must: + * - Only return a view of the input + * - Must be explicit and define a derivative + * + * @param base Base tensor to construct view from. + * @param nested_size_tensor View tensors' sizes. + * @param nested_stride_tensor View tensors' strides. + * @param offsets View tensors' offsets. 
+ * @return A newly constructed view tensor + */ +inline at::Tensor create_nested_view_tensor( + const at::Tensor& base, + at::Tensor nested_size_tensor, + at::Tensor nested_stride_tensor, + std::vector&& offsets) { + TORCH_INTERNAL_ASSERT( + base.is_nested(), + "This function can only be used to create nested tensor views"); + TORCH_INTERNAL_ASSERT( + c10::impl::tls_local_dispatch_key_set().excluded_.has( + c10::DispatchKey::AutogradFunctionality), + "Creating a non differentiable nested tensor view in a CompositeImplicit function is not allowed."); + return at::detail::make_tensor( + c10::TensorImpl::VIEW, + base, + nested_size_tensor, + nested_stride_tensor, + std::move(offsets)); } // The sizes of the underlying tensors @@ -42,7 +86,8 @@ inline std::vector NestedTensor_get_sizes(const NestedTensorImpl* s return sizes; } const int64_t* sizemat_ptr = sizemat.data_ptr(); - for (int64_t i = 0; i < ntensors; i++) { + + for(const auto i: c10::irange(ntensors)){ sizes[i] = IntArrayRef(sizemat_ptr, sizemat_ptr + orig_dim); sizemat_ptr += orig_dim; } @@ -68,7 +113,7 @@ inline std::vector NestedTensor_get_strides(const NestedTensorImpl* return strides; } const int64_t* stridemat_ptr = stridemat.data_ptr(); - for (int64_t i = 0; i < ntensors; i++) { + for(const auto i: c10::irange(ntensors)) { strides[i] = IntArrayRef(stridemat_ptr, stridemat_ptr + orig_dim); stridemat_ptr += orig_dim; } diff --git a/aten/src/ATen/native/nested/NestedTensorTransformerFunctions.cpp b/aten/src/ATen/native/nested/NestedTensorTransformerFunctions.cpp index d33decc224333..c559b75a78a69 100644 --- a/aten/src/ATen/native/nested/NestedTensorTransformerFunctions.cpp +++ b/aten/src/ATen/native/nested/NestedTensorTransformerFunctions.cpp @@ -138,44 +138,56 @@ Tensor NestedTensor_add_NestedTensor_in_place( return self; } -void NestedTensor_softmax_dropout(const Tensor& query, Tensor& attn_scores) { +Tensor NestedTensor_softmax_dropout(const Tensor& self, const Tensor& query) { const auto* query_nt = get_nested_tensor_impl_or_null(query); TORCH_INTERNAL_ASSERT(query_nt != nullptr); TORCH_INTERNAL_ASSERT(nested_tensor_impl_is_contiguous(query_nt)); const Tensor& sizes = query_nt->get_nested_size_tensor(); const auto num_tensors = sizes.sizes()[0]; - const auto max_seq_len = attn_scores.sizes()[2]; + + auto output = at::empty_like(self,{}, at::MemoryFormat::Contiguous); + TORCH_INTERNAL_ASSERT(output.is_contiguous()); + + const auto max_seq_len = self.sizes()[2]; for (int64_t i = 0; i < num_tensors; i++) { auto seq_len = sizes.index({i, 0}).item(); - auto subseq = attn_scores.index( + auto subseq = self.index( {i, indexing::Slice(), indexing::Slice(0, seq_len), indexing::Slice(0, seq_len)}); auto subscores = at::softmax(subseq, subseq.dim() - 1); - attn_scores.index_put_( + output.index_put_( {i, indexing::Slice(), indexing::Slice(0, seq_len), indexing::Slice(0, seq_len)}, subscores); - attn_scores.index_put_( + output.index_put_( {i, indexing::Slice(), indexing::Slice(0, seq_len), indexing::Slice(seq_len, max_seq_len)}, 0); - attn_scores.index_put_( + output.index_put_( {i, indexing::Slice(), indexing::Slice(seq_len, max_seq_len), indexing::Slice(0, max_seq_len)}, 0); } + return output; } +Tensor NestedTensor_softmax_dropout_cuda(const Tensor& self, const Tensor& query) { + c10::optional attn_mask; + + attn_mask = NestedTensor_to_mask(query, 2, self.size(2)); + attn_mask = attn_mask->to(query.device(), /*non-blocking=*/true); + return _masked_softmax(self, *attn_mask, self.dim() - 1, /*mask type */ 1 ); // NestedTensor_to_mask 
produces a BxT mask +} Tensor NestedTensor_batch_offsets_from_size_tensor( const Tensor& sizes, @@ -196,6 +208,7 @@ Tensor NestedTensor_batch_offsets_from_size_tensor( return offsets; } + Tensor NestedTensor_to_mask(const Tensor& nt, c10::optional mask_dim, c10::optional mask_dim_length) { auto* nt_impl = get_nested_tensor_impl(nt); TORCH_CHECK( diff --git a/aten/src/ATen/native/nested/NestedTensorTransformerFunctions.h b/aten/src/ATen/native/nested/NestedTensorTransformerFunctions.h index 96ecfe91c3ddd..77eb0145d6847 100644 --- a/aten/src/ATen/native/nested/NestedTensorTransformerFunctions.h +++ b/aten/src/ATen/native/nested/NestedTensorTransformerFunctions.h @@ -50,8 +50,6 @@ Tensor NestedTensor_from_padded_tensor_cpu( const Tensor& padded, const NestedTensorImpl& nt); -void NestedTensor_softmax_dropout(const Tensor& query, Tensor& attn_scores); - Tensor NestedTensor_to_mask(const Tensor& nt, c10::optional mask_dim, c10::optional mask_dim_length); template diff --git a/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cpp b/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cpp index d89e5c5763d7f..fade1d026b2bc 100644 --- a/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cpp +++ b/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cpp @@ -204,5 +204,6 @@ Tensor NestedTensor_to_padded_tensor_cuda( } return NestedTensor_to_padded_tensor_generic(t, padding, output_size); } + } // namespace native } // namespace at diff --git a/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu b/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu index e8eb164bf4e76..dd5e9b80ca6bc 100644 --- a/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu +++ b/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu @@ -146,7 +146,7 @@ void remove_padding_kernelLauncher( dim3 grid; grid.x = batch_size; grid.y = GRID_DIM_Y; - at::cuda::CUDAStream stream = at::cuda::getDefaultCUDAStream(); + at::cuda::CUDAStream stream = at::cuda::getCurrentCUDAStream(); if (output_dim == 2) { remove_padding_2<<>>( input, @@ -180,7 +180,7 @@ void remove_padding_transform0213_kernelLauncher( dim3 grid; grid.x = batch_size; grid.y = GRID_DIM_Y; - at::cuda::CUDAStream stream = at::cuda::getDefaultCUDAStream(); + at::cuda::CUDAStream stream = at::cuda::getCurrentCUDAStream(); TORCH_CHECK( output_dim == 2, "remove padding transform0213 only support output dim == 2"); @@ -374,7 +374,7 @@ void add_padding_kernelLauncher( const std::vector& output_sizes, const int batch_size, const int output_batch_size) { - at::cuda::CUDAStream stream = at::cuda::getDefaultCUDAStream(); + at::cuda::CUDAStream stream = at::cuda::getCurrentCUDAStream(); dim3 grid; grid.x = output_batch_size; grid.y = GRID_DIM_Y; diff --git a/aten/src/ATen/native/quantized/README.md b/aten/src/ATen/native/quantized/README.md index 62c4a8a1f9e13..f042881a8cebe 100644 --- a/aten/src/ATen/native/quantized/README.md +++ b/aten/src/ATen/native/quantized/README.md @@ -171,7 +171,8 @@ def quantized_xand(qa, qb): return ops.quantized.xand(qa, qb) ``` -**Note:** If writing new pytorch functions that use quantized kernels, it is strongly encouraged to place them in the `torch/nn/quantized/functional.py`. +**Note:** If writing new pytorch functions that use quantized kernels, +it is strongly encouraged to place them in the `torch/ao/nn/quantized/functional.py`. 
### C++ diff --git a/aten/src/ATen/native/quantized/cpu/QuantUtils.h b/aten/src/ATen/native/quantized/cpu/QuantUtils.h index 8ebcea45883c6..f53efab900be1 100644 --- a/aten/src/ATen/native/quantized/cpu/QuantUtils.h +++ b/aten/src/ATen/native/quantized/cpu/QuantUtils.h @@ -205,4 +205,24 @@ inline void HandleWeightsSaturation(int64_t N, float* weight) { } } +// Util function for quantizing bias. +inline at::Tensor QuantizeBias( + bool is_per_channel, + const at::Tensor& bias, + const at::Tensor& weight_contig, + double input_scale) { + at::Tensor qbias; + if (is_per_channel) { + auto bias_quant_scales = + weight_contig.q_per_channel_scales() * input_scale; + auto bias_zp = at::zeros(bias_quant_scales.sizes(), c10::kInt); + qbias = at::native::quantize_per_channel( + bias, bias_quant_scales, bias_zp, 0, c10::kQInt32); + } else { + qbias = at::native::quantize_per_tensor( + bias, weight_contig.q_scale() * input_scale, 0, c10::kQInt32); + } + return qbias; +} + } // namespace quant_utils diff --git a/aten/src/ATen/native/quantized/cpu/conv_serialization.h b/aten/src/ATen/native/quantized/cpu/conv_serialization.h index b44520f2eb0b7..9e4edb8f9a881 100644 --- a/aten/src/ATen/native/quantized/cpu/conv_serialization.h +++ b/aten/src/ATen/native/quantized/cpu/conv_serialization.h @@ -307,6 +307,9 @@ c10::intrusive_ptr> deserialize_conv( } for (const auto i : c10::irange(kSpatialDim)) { (void)i; // Suppress unused variable + TORCH_INTERNAL_ASSERT(idx < static_cast(config_vals.size()), + "Unexpected index = ", idx, " for config_vals of size ", + config_vals.size()); output_padding.emplace_back(config_vals.at(idx)); idx++; } diff --git a/aten/src/ATen/native/quantized/cpu/kernels/QuantizedOpKernels.cpp b/aten/src/ATen/native/quantized/cpu/kernels/QuantizedOpKernels.cpp index b1bdaadaf5b33..b7d8a89f43493 100644 --- a/aten/src/ATen/native/quantized/cpu/kernels/QuantizedOpKernels.cpp +++ b/aten/src/ATen/native/quantized/cpu/kernels/QuantizedOpKernels.cpp @@ -671,17 +671,14 @@ static void qprelu_out_kernel(Tensor& out, int64_t input_ndim = qx.dim(); TORCH_CHECK(input_ndim > 0, "qprelu: zero-dim input tensor is not allowed."); - // Helper to convert 1d tensors or scalar tensor to an nd tensor that broadcasts with input + // Weight should be a 1d or scalar tensor + // Reshape it to an nd tensor that broadcasts with input // All elements go into the channel dimension - DimVector sizes(input_ndim, 1), strides(input_ndim, 0); - auto as_nd = [&](const Tensor& t) { - TORCH_INTERNAL_ASSERT(t.defined() && (t.dim() == 1 || t.dim() == 0)); - sizes[1] = t.dim() == 1 ? t.sizes()[0] : 1; - strides[1] = t.dim() == 1 ? 
t.strides()[0] : 0; - return t.as_strided(sizes, strides); - }; - - auto qw_nd = as_nd(qw); + DimVector sizes(input_ndim, 1); + if (input_ndim > 1) { + sizes[1] = qw.numel(); + } + auto qw_nd = qw.reshape(sizes); auto iter = TensorIteratorConfig() .add_output(out) @@ -2750,18 +2747,26 @@ void quantized_normalize_kernel( dq = (dq - layer_mean_div_scale_xVec) * gamma_p_vec + beta_vec; - qVec::quantize(dqXVec, y_scale, y_zp, y_inv_scale) - .store(Y_ptr + vecStartIdx); } + qVec::quantize(dqXVec, y_scale, y_zp, y_inv_scale) + .store(Y_ptr + vecStartIdx); } - for (int64_t remIdx = chEndIdx - kNonVecRemInChannel; - remIdx < chEndIdx; - remIdx++) { - auto qXVal = X_ptr[remIdx]; - float dqXVal = at::native::dequantize_val(x_fake_scale, x_zp, qXVal); - float dqY = - (dqXVal - layer_mean_div_scale_x) * gamma_p + beta; - Y_ptr[remIdx] = at::native::quantize_val(y_scale, y_zp, dqY); + + // Remainder + if (kNonVecRemInChannel > 0) { + int64_t remIdx = chEndIdx - kNonVecRemInChannel; + auto qXVec = qVec::loadu(X_ptr + remIdx, kNonVecRemInChannel); + auto dqXVec = qXVec.dequantize(x_fake_scale_vec, x_zp_vec, + x_fake_scale_zp_neg_premul_vec); + int validDqvecLen = (kNonVecRemInChannel - 1) / fVec::size() + 1; + for (int i = 0; i < validDqvecLen; ++i) { + auto &dq = dqXVec[i]; + dq = + (dq - layer_mean_div_scale_xVec) * + gamma_p_vec + beta_vec; + } + qVec::quantize(dqXVec, y_scale, y_zp, y_inv_scale) + .store(Y_ptr + remIdx, kNonVecRemInChannel); } } // chIdx @@ -3703,8 +3708,8 @@ void quantize_tensor_per_channel_impl( // channels_last contig. // If axis = 0 and channels_last contig, implementation for channels // first (NCHW) works. - for (const auto b : c10::irange(batches)) { - for (const auto e : c10::irange(elements_per_channel)) { + for (C10_UNUSED const auto b : c10::irange(batches)) { + for (C10_UNUSED const auto e : c10::irange(elements_per_channel)) { uint32_t c = 0; while (c + 8 < channels) { const int32x4_t voffset0123 = vld1q_s32(&zero_points_int32t[c]); @@ -3738,7 +3743,7 @@ void quantize_tensor_per_channel_impl( } } } else { - for (const auto b : c10::irange(batches)) { + for (C10_UNUSED const auto b : c10::irange(batches)) { for (const auto c : c10::irange(channels)) { uint32_t e = 0; const int32x4_t voffset = vdupq_n_s32(zero_points_int32t[c]); diff --git a/aten/src/ATen/native/quantized/cpu/qconv.cpp b/aten/src/ATen/native/quantized/cpu/qconv.cpp index f31d271365e24..873d983a48209 100644 --- a/aten/src/ATen/native/quantized/cpu/qconv.cpp +++ b/aten/src/ATen/native/quantized/cpu/qconv.cpp @@ -725,17 +725,7 @@ at::Tensor PackedConvWeightsQnnp::apply_impl_xnnp( // Original bias was float, so we requantize it here. - at::Tensor qbias; - if (per_channel()) { - auto bias_quant_scales = - weight_contig.q_per_channel_scales() * act_input_scale; - auto bias_zp = at::zeros(bias_quant_scales.sizes(), c10::kInt); - qbias = at::native::quantize_per_channel( - bias, bias_quant_scales, bias_zp, 0, c10::kQInt32); - } else { - qbias = at::native::quantize_per_tensor( - bias, weight_contig.q_scale() * act_input_scale, 0, c10::kQInt32); - } + at::Tensor qbias = quant_utils::QuantizeBias(per_channel(), bias, weight_contig, act_input_scale); status = at::native::xnnp_utils::xnnp_create_convolution2d_nhwc( padding()[0], @@ -937,21 +927,8 @@ at::Tensor PackedConvWeightsQnnp::apply_impl( for (const auto i : c10::irange(wt_numel)) { qnnp_w_data[i] = static_cast(w_data[i] + 128); } - at::Tensor qbias; // Original bias was float, so we requantize it here. 
- if (convolution_op->per_channel) { - at::Tensor bias_quant_scales = - weight_contig.q_per_channel_scales() * act_input_scale; - at::Tensor bias_zp = at::zeros(bias_quant_scales.sizes(), c10::kInt); - qbias = at::native::quantize_per_channel( - bias_fp32, bias_quant_scales, bias_zp, 0, c10::kQInt32); - } else { - qbias = at::native::quantize_per_tensor( - bias_fp32, - weight_contig.q_scale() * act_input_scale, - 0, - c10::kQInt32); - } + at::Tensor qbias = quant_utils::QuantizeBias(convolution_op->per_channel, bias_fp32, weight_contig, act_input_scale); // Update the input scale to not pack again. input_scale = act_input_scale; diff --git a/aten/src/ATen/native/quantized/cpu/qembeddingbag_prepack.cpp b/aten/src/ATen/native/quantized/cpu/qembeddingbag_prepack.cpp index b32fcf03a8cc3..748e89fc182d7 100644 --- a/aten/src/ATen/native/quantized/cpu/qembeddingbag_prepack.cpp +++ b/aten/src/ATen/native/quantized/cpu/qembeddingbag_prepack.cpp @@ -139,7 +139,8 @@ namespace native { // // Python example examining a packed 8bit zero_point and scale: // -// >> x = torch.from_numpy(np.array([[[10, 20], [30, 40]],[[50, 60], [70, 80]]], dtype=np.float32)) +// >> x = torch.from_numpy(np.array([[[10, 20], [30, 40]],[[50, 60], [70, 80]]], +// dtype=np.float32)) // >> x_packed = torch.ops.quantized.embedding_bag_byte_prepack(x) // // # Pull out and examine packed scales, zero_points and values @@ -228,8 +229,9 @@ Tensor& qembeddingbag_byte_prepack_out(Tensor& output, const Tensor& weight) { auto* output_data = output.data_ptr(); #ifdef USE_FBGEMM - if (weight.scalar_type() == at::ScalarType::Half) { - const auto weight_data = static_cast(weight.data_ptr()); + if (weight_contig->scalar_type() == at::ScalarType::Half) { + const auto weight_data = + static_cast(weight_contig->data_ptr()); at::parallel_for( 0, embedding_rows, 1, [&](int64_t start_idx, int64_t end_idx) { fbgemm::FloatOrHalfToFused8BitRowwiseQuantizedSBFloat< @@ -240,7 +242,7 @@ Tensor& qembeddingbag_byte_prepack_out(Tensor& output, const Tensor& weight) { output_data + start_idx * output_columns); }); } else { - const auto weight_data = weight.data_ptr(); + const auto weight_data = weight_contig->data_ptr(); at::parallel_for( 0, embedding_rows, 1, [&](int64_t start_idx, int64_t end_idx) { fbgemm::FloatOrHalfToFused8BitRowwiseQuantizedSBFloat( @@ -346,8 +348,9 @@ Tensor _qembeddingbag_nbit_prepack_helper( #ifdef USE_FBGEMM if (!optimized_qparams) { - if (weight.scalar_type() == at::ScalarType::Half) { - const auto weight_data = static_cast(weight.data_ptr()); + if (weight_contig.scalar_type() == at::ScalarType::Half) { + const auto weight_data = + static_cast(weight_contig.data_ptr()); at::parallel_for( 0, embedding_rows, 1, [&](int64_t start_idx, int64_t end_idx) { fbgemm::FloatOrHalfToFusedNBitRowwiseQuantizedSBHalf< @@ -359,7 +362,7 @@ Tensor _qembeddingbag_nbit_prepack_helper( output_data + start_idx * output_shape[1]); }); } else { - const auto weight_data = weight.data_ptr(); + const auto weight_data = weight_contig.data_ptr(); at::parallel_for( 0, embedding_rows, 1, [&](int64_t start_idx, int64_t end_idx) { fbgemm::FloatOrHalfToFusedNBitRowwiseQuantizedSBHalf( diff --git a/aten/src/ATen/native/quantized/cpu/qlinear.cpp b/aten/src/ATen/native/quantized/cpu/qlinear.cpp index 99e0155857ceb..0e51b98676078 100644 --- a/aten/src/ATen/native/quantized/cpu/qlinear.cpp +++ b/aten/src/ATen/native/quantized/cpu/qlinear.cpp @@ -6,6 +6,7 @@ #include #include #include +#include #include #include #include @@ -328,8 +329,7 @@ at::Tensor 
PackedLinearWeightsQnnp::apply_impl_xnnp( orig_weight, xnnp_weight); // Original bias was float, so we requantize it here. - at::Tensor qbias = at::native::quantize_per_tensor( - bias_, orig_weight.q_scale() * input_scale, 0, c10::kQInt32); + at::Tensor qbias = quant_utils::QuantizeBias(false, bias_, orig_weight, input_scale); // output limits auto output_min = kReluFused @@ -476,18 +476,7 @@ at::Tensor PackedLinearWeightsQnnp::apply_impl( } // Original bias was float, so we requantize it here. const bool is_per_channel = orig_weight.qscheme() == at::kPerChannelAffine; - at::Tensor qbias; - // Original bias was float, so we requantize it here. - if (is_per_channel) { - at::Tensor bias_quant_scales = - weight_contig.q_per_channel_scales() * input_scale; - at::Tensor bias_zp = at::zeros(bias_quant_scales.sizes(), c10::kInt); - qbias = at::native::quantize_per_channel( - bias_fp32, bias_quant_scales, bias_zp, 0, c10::kQInt32); - } else { - qbias = at::native::quantize_per_tensor( - bias_fp32, weight_contig.q_scale() * input_scale, 0, c10::kQInt32); - } + at::Tensor qbias = quant_utils::QuantizeBias(is_per_channel, bias_fp32, weight_contig, input_scale); // Update the input scale to not pack again. this->input_scale = input_scale; diff --git a/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp b/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp index df529a6612f98..d6fd9be57e30e 100644 --- a/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp +++ b/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp @@ -415,14 +415,21 @@ at::Tensor& PackedLinearWeightFp16::apply_dynamic_impl( // Resize output Tensor output.resize_(output_sizes); - // Call the fp16 gemm interface - fbgemm::cblas_gemm_compute( - fbgemm::matrix_op_t::NoTranspose, - M, - input_ptr, - packed_weight_fp16, - 0.0f, - output.data_ptr()); + int num_tasks = at::get_num_threads(); + at::parallel_for(0, num_tasks, 1, [&](int64_t begin, int64_t end) { + for (const auto task_id : c10::irange(begin, end)) { + // Call the fp16 gemm interface + fbgemm::cblas_gemm_compute( + /*transa=*/fbgemm::matrix_op_t::NoTranspose, + /*m=*/static_cast(M), + /*A=*/input_ptr, + /*Bp=*/packed_weight_fp16, + /*beta=*/0.0f, + /*C=*/output.data_ptr(), + /*thread_id=*/static_cast(task_id), + /*num_threads=*/num_tasks); + } + }); // Add bias term if (bias_.has_value()) { diff --git a/aten/src/ATen/native/quantized/cpu/qnnpack/test/fully-connected-sparse-operator-tester.h b/aten/src/ATen/native/quantized/cpu/qnnpack/test/fully-connected-sparse-operator-tester.h index 6235d55f8bc7c..575c0a17bceb1 100644 --- a/aten/src/ATen/native/quantized/cpu/qnnpack/test/fully-connected-sparse-operator-tester.h +++ b/aten/src/ATen/native/quantized/cpu/qnnpack/test/fully-connected-sparse-operator-tester.h @@ -577,7 +577,7 @@ class FullyConnectedSparseOperatorTester { for (size_t i = 0; i < batchSize(); i++) { for (size_t c = 0; c < outputChannels(); c++) { - ASSERT_EQ( + ASSERT_FLOAT_EQ( output_dynamic[i * outputChannels() + c], accumulators_float[i * outputChannels() + c]) << "at " << i << ", " << c diff --git a/aten/src/ATen/native/quantized/cpu/qnnpack/test/gemm-block-sparse-microkernel-tester.h b/aten/src/ATen/native/quantized/cpu/qnnpack/test/gemm-block-sparse-microkernel-tester.h index 71a370f85d4f2..25e7bb670653d 100644 --- a/aten/src/ATen/native/quantized/cpu/qnnpack/test/gemm-block-sparse-microkernel-tester.h +++ b/aten/src/ATen/native/quantized/cpu/qnnpack/test/gemm-block-sparse-microkernel-tester.h @@ -475,7 +475,7 @@ class GemmBlockSparseMicrokernelTester { 
for (size_t mIndex = 0; mIndex < m(); mIndex++) { for (size_t nIndex = 0; nIndex < n(); nIndex++) { - ASSERT_EQ( + ASSERT_FLOAT_EQ( c[mIndex * cStride() + nIndex], acc[mIndex * n() + nIndex]) << "at " << mIndex << ", " << nIndex diff --git a/aten/src/ATen/native/sparse/SparseCsrTensor.cpp b/aten/src/ATen/native/sparse/SparseCsrTensor.cpp index 8d3a17a24ff82..062cc3d126293 100644 --- a/aten/src/ATen/native/sparse/SparseCsrTensor.cpp +++ b/aten/src/ATen/native/sparse/SparseCsrTensor.cpp @@ -757,9 +757,8 @@ Tensor empty_like_sparse_csr( } Tensor select_sparse_csr(const Tensor& self, int64_t dim, int64_t index) { - TORCH_CHECK( - self.layout() == kSparseCsr || self.layout() == kSparseBsr, - "select(): currently only supports the SparseCsr and SparseBsr layout."); + AT_DISPATCH_ALL_SPARSE_COMPRESSED_LAYOUTS( + self.layout(), "select()", []() { return; }); TORCH_CHECK_INDEX( self.dim() != 0, "select() cannot be applied to a 0-dim tensor."); dim = maybe_wrap_dim(dim, self.dim()); @@ -784,41 +783,55 @@ Tensor select_sparse_csr(const Tensor& self, int64_t dim, int64_t index) { new_sizes.erase(new_sizes.begin() + dim); auto options = self.options(); - // Selecting batch dimension - if (dim < self.dim() - 2) { - if (self.layout() == kSparseBsr) { - return at::native::_sparse_bsr_tensor_unsafe( - self.crow_indices().select(dim, index), - self.col_indices().select(dim, index), - self.values().select(dim, index), - new_sizes, - optTypeMetaToScalarType(options.dtype_opt()), - options.layout_opt(), - options.device_opt(), - options.pinned_memory_opt()); - } - return at::native::_sparse_csr_tensor_unsafe( - self.crow_indices().select(dim, index), - self.col_indices().select(dim, index), + Tensor plain_indices; + Tensor compressed_indices; + std::tie(compressed_indices, plain_indices) = + AT_DISPATCH_ROW_SPARSE_COMPRESSED_LAYOUTS( + self.layout(), + "select", + [&]() { + return std::make_pair(self.crow_indices(), self.col_indices()); + }, + [&]() { + return std::make_pair(self.ccol_indices(), self.row_indices()); + }); + auto n_batch = compressed_indices.dim() - 1; + + if (dim < n_batch) { + // Selecting batch dimension + return at::native::_sparse_compressed_tensor_unsafe( + compressed_indices.select(dim, index), + plain_indices.select(dim, index), self.values().select(dim, index), new_sizes, optTypeMetaToScalarType(options.dtype_opt()), options.layout_opt(), options.device_opt(), options.pinned_memory_opt()); - } else { + } else if (dim < n_batch + 2) { + // Selecting sparse dimension TORCH_CHECK( - self.is_sparse_csr(), - "select(): selecting non-batch dimensions is currently only supported for CSR tensors."); + self.layout() == kSparseCsr || self.layout() == kSparseCsc, + "select(): selecting non-batch dimensions is currently only supported for non-blocked sparse compressed layouts tensors."); TORCH_CHECK( - self.dim() == 2, - "select(): selecting rows or columns is not implemented for batched sparse CSR tensors.") - // Converting to COO and calling select is slighly slower than operating on - // the CSR indices directly for constructing a COO vector, however current - // version is more readable and easier to understand. + n_batch == 0, + "select(): selecting rows or columns is not implemented for batched sparse compressed tensors.") + // Converting to COO and calling select is slightly slower than operating + // on the CSR indices directly for constructing a COO vector, however + // current version is more readable and easier to understand. 
return self.to_sparse().select(dim, index); + } else { + // Selecting dense dimension + return AT_DISPATCH_PLAIN_SPARSE_COMPRESSED_LAYOUTS( + self.layout(), + "select", + // Non blocked layout (2 sparse dims become 1 nnz dim in values, so dim + // is found one position to the left) + [&]() { return self.values().select(dim - 1, index); }, + // Block layout (2 sparse dims become 1 nnz dim + 2 block-shape dims in + // values, so dim is found 1 position to the right) + [&]() { return self.values().select(dim + 1, index); }); } } - } // namespace native } // namespace at diff --git a/aten/src/ATen/native/sparse/SparseTensorMath.cpp b/aten/src/ATen/native/sparse/SparseTensorMath.cpp index ad98fcee2d5bb..a25083c6fae88 100644 --- a/aten/src/ATen/native/sparse/SparseTensorMath.cpp +++ b/aten/src/ATen/native/sparse/SparseTensorMath.cpp @@ -610,34 +610,127 @@ SparseTensor& add_out_sparse_cpu(const SparseTensor& t, const SparseTensor& src, // add(Tensor, SparseTensor, Scalar) // formerly known as spcadd // -------------------------------------------------------------------- - template -void add_dense_sparse_worker_cpu(Tensor& r, const Scalar& value, const SparseTensor& sparse, const Tensor& indices, const Tensor& values) { +void add_dense_sparse_worker_non_hybrid_cpu(Tensor& r, const Scalar& value, const SparseTensor& sparse, const Tensor& indices, const Tensor& values) { auto indices_accessor = indices.accessor(); auto values_accessor = values.accessor(); scalar_t* r_ptr = r.data_ptr(); - auto r_strides = r.strides(); scalar_t cast_value = value.to(); - const auto sparse_dim = sparse.sparse_dim(); - + const int64_t sparse_dim = sparse.sparse_dim(); + std::vector result_stride(sparse_dim); + for (const auto d: c10::irange(sparse_dim)) { + result_stride[d] = r.stride(d); + } at::parallel_for(0, sparse._nnz(), 0, [&](int64_t start, int64_t end) { - for (auto k: c10::irange(start, end)) { + for (const auto k: c10::irange(start, end)) { int64_t index = r.storage_offset(); for (auto d: c10::irange(sparse_dim)) { - index += r_strides[d] * indices_accessor[d][k]; + index += result_stride[d] * indices_accessor[d][k]; } r_ptr[index] += cast_value * values_accessor[k]; } }); } +template +inline void add_dense_sparse_worker_hybrid_cpu(Tensor& r, const Scalar& value, const SparseTensor& sparse, const Tensor& indices, const Tensor& values) { + + // Get the dense dimension element numbers of hybrid sparse tensor + int64_t values_dense_size = values.stride(0); + TORCH_CHECK(values.is_contiguous()); + scalar_t* v_ptr = values.data_ptr(); + + scalar_t* r_ptr = r.data_ptr(); + TORCH_CHECK(r_ptr != nullptr); + + auto indices_accessor = indices.accessor(); + scalar_t cast_value = value.to(); + auto sparse_dim = sparse.sparse_dim(); + std::vector result_stride(sparse_dim); + for (auto d : c10::irange(sparse_dim)) { + result_stride[d] = r.stride(d); + } + + at::parallel_for(0, sparse._nnz(), 0, [&](int64_t start, int64_t end) { + for (auto k: c10::irange(start, end)) { + auto r_index = r_ptr; + for (auto d: c10::irange(sparse_dim)) { + r_index += result_stride[d] * indices_accessor[d][k]; + } + auto v_index = v_ptr + k * values_dense_size; + at::native::cpublas::axpy(values_dense_size, cast_value, v_index, 1, r_index, 1); + } + }); +} + +template +inline void add_dense_sparse_worker_non_coalesced_cpu(Tensor& r, const Scalar& value, + const SparseTensor& sparse, const Tensor& indices, const Tensor& values) { + + // Get the dense dimension element numbers of hybrid sparse tensor + auto values_dense_size = values.stride(0); 
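// Editor's note (illustration, not part of the patch): values for a hybrid
// sparse tensor has shape [nnz, d1, ..., dk] (k may be 0). Because it is checked
// to be contiguous immediately below, values.stride(0) equals d1 * ... * dk,
// i.e. the number of elements in one dense slice, which is why stride(0) is
// reused both as values_dense_size and as the length passed to cpublas::axpy
// when each slice is accumulated into the result.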
+ TORCH_CHECK(values.is_contiguous()); + scalar_t* v_ptr = values.data_ptr(); + TORCH_CHECK(v_ptr != nullptr); + + scalar_t* r_ptr = r.data_ptr(); + TORCH_CHECK(r_ptr != nullptr); + + scalar_t cast_value = value.to(); + auto sparse_dim = sparse.sparse_dim(); + + auto indices_accessor = indices.accessor(); + int64_t result_length = r.size(0); + std::vector result_stride(sparse_dim); + for (auto d : c10::irange(sparse_dim)) { + result_stride[d] = r.stride(d); + } + + auto sparse_nnz = sparse._nnz(); + int max_threads = at::get_num_threads(); + max_threads = (result_length < max_threads) ? result_length : max_threads; + int64_t avg_chunk_down = result_length / max_threads; + std::vector chuck_size(max_threads); + for (const auto i : c10::irange(max_threads)) { + chuck_size[i] = avg_chunk_down; + } + //make chunk balance among threads as 211 + for (auto i = 0 ; i < result_length % max_threads ; i++) { + chuck_size[i] += 1; + } + std::vector chuck_sum_size(max_threads + 1); + chuck_sum_size[0] = 0; + for (const auto i : c10::irange(1, max_threads)) { + chuck_sum_size[i] = chuck_sum_size[i - 1] + chuck_size[i - 1]; + } + chuck_sum_size[max_threads] = result_length; + at::parallel_for(0, max_threads, 0, [&](int64_t start, int64_t end) { + for (auto k: c10::irange(start, end)) { + int64_t chunk_begin = chuck_sum_size[k]; + int64_t chunk_end = chuck_sum_size[k + 1]; + for (const auto n: c10::irange(sparse_nnz)) { + int64_t chunk_offset = indices_accessor[0][n]; + if (chunk_offset >= chunk_begin && chunk_offset < chunk_end) { + int64_t r_offset = result_stride[0] * chunk_offset; + for (const auto d : c10::irange(1, sparse_dim)) { + r_offset += result_stride[d] * indices_accessor[d][n]; + } + scalar_t* v_index = v_ptr + n * values_dense_size; + auto r_index = r_ptr + r_offset; + at::native::cpublas::axpy(values_dense_size, cast_value, v_index, 1, r_index, 1); + } + } + } + }); +} + Tensor& add_out_dense_sparse_cpu(Tensor& r, const Tensor& dense, const SparseTensor& sparse_, const Scalar& value) { - AT_ASSERT(!r.is_sparse()); - AT_ASSERT(!dense.is_sparse()); - AT_ASSERT(sparse_.is_sparse()); + TORCH_CHECK(!r.is_sparse()); + TORCH_CHECK(!dense.is_sparse()); + TORCH_CHECK(sparse_.is_sparse()); - AT_ASSERT(!dense.is_cuda()); // dispatch argument + TORCH_CHECK(!dense.is_cuda()); // dispatch argument TORCH_CHECK(!r.is_cuda(), "add: expected 'out' to be CPU tensor, but got CUDA tensor"); TORCH_CHECK(!sparse_.is_cuda(), "add: expected 'other' to be a CPU tensor, but got a CUDA tensor"); @@ -648,19 +741,15 @@ Tensor& add_out_dense_sparse_cpu(Tensor& r, const Tensor& dense, const SparseTen TORCH_CHECK(canCast(commonDtype, r.scalar_type()), "Can't convert result type ", commonDtype, " to output ", r.scalar_type(), " in add operation"); r.resize_as_(dense); - SparseTensor sparse = sparse_.coalesce(); - - Tensor indices = sparse._indices(); - Tensor values = sparse._values(); - int64_t nDim = dense.dim(); - int64_t nDimI = sparse.sparse_dim(); - if (sparse._nnz() == 0) { + auto sparse_nnz = sparse_._nnz(); + if (sparse_nnz == 0) { if (!is_same_tensor(r, dense)) r.copy_(dense); return r; } - Tensor valuesBuffer = values.to(commonDtype); + int64_t dense_dim = dense.dim(); + int64_t sparse_dim = sparse_.sparse_dim(); Tensor resultBuffer = r; if (r.scalar_type() != commonDtype) { resultBuffer = dense.to(commonDtype); @@ -668,23 +757,56 @@ Tensor& add_out_dense_sparse_cpu(Tensor& r, const Tensor& dense, const SparseTen resultBuffer.copy_(dense); } - // accessors rely on nnz test - if (nDim > nDimI) { - auto 
indices_accessor = indices.accessor(); - for (const auto k : c10::irange(sparse._nnz())) { - Tensor dstBuffer = resultBuffer; - for (const auto d : c10::irange(sparse.sparse_dim())) { - dstBuffer = dstBuffer.select(0, indices_accessor[d][k]); - } - Tensor srcBuffer = valuesBuffer.select(0, k); - dstBuffer.add_(srcBuffer, value); + Tensor values = sparse_._values(); + bool sparse_is_coalesced = (sparse_.is_coalesced() || sparse_nnz == 1); + bool result_is_contiguous = ((r.storage().data() != nullptr) && resultBuffer.is_contiguous()); + bool value_is_contiguous = values.is_contiguous(); + bool is_contiguous = (result_is_contiguous && value_is_contiguous); + + SparseTensor sparse = sparse_; + Tensor indices = sparse_._indices(); + Tensor valuesBuffer = values.to(commonDtype); + if (is_contiguous && sparse_is_coalesced) { + //TODO: we can optimize it for non-hybrid by not using buffers + if (sparse_dim == dense_dim) { + AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND4( + at::ScalarType::ComplexHalf, at::ScalarType::Bool, at::ScalarType::BFloat16, at::ScalarType::Half, + commonDtype, "add_dense_sparse_non_hybrid", [&] { + add_dense_sparse_worker_non_hybrid_cpu(resultBuffer, value, sparse_, indices, valuesBuffer); + }); + } else { + AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND4( + at::ScalarType::ComplexHalf, at::ScalarType::Bool, at::ScalarType::BFloat16, at::ScalarType::Half, + commonDtype, "add_dense_sparse_hybrid", [&] { + add_dense_sparse_worker_hybrid_cpu(resultBuffer, value, sparse_, indices, valuesBuffer); + }); } - } else { + } else if (is_contiguous && (sparse_dim > 0)) { + // Handle sparse is not coalesced AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND4( at::ScalarType::ComplexHalf, at::ScalarType::Bool, at::ScalarType::BFloat16, at::ScalarType::Half, - commonDtype, "add_dense_sparse", [&] { - add_dense_sparse_worker_cpu(resultBuffer, value, sparse, indices, valuesBuffer); + commonDtype, "add_dense_sparse_worker_non_coalesced", [&] { + add_dense_sparse_worker_non_coalesced_cpu(resultBuffer, value, sparse_, indices, valuesBuffer); }); + } else { + // Slow path for non-contiguous values and output + // TODO: coalesce() performance may can be further improved + sparse = sparse_.coalesce(); + indices = sparse._indices(); + values = sparse._values(); + valuesBuffer = values.to(commonDtype); + auto indices_accessor = indices.accessor(); + auto sparse_nnz = sparse._nnz(); + at::parallel_for(0, sparse_nnz, 100, [&](int64_t start, int64_t end) { + for (auto k: c10::irange(start, end)) { + Tensor dstBuffer = resultBuffer; + for (auto d: c10::irange(sparse_dim)) { + dstBuffer = dstBuffer.select(0, indices_accessor[d][k]); + } + Tensor srcBuffer = valuesBuffer.select(0, k); + dstBuffer.add_(srcBuffer, value); + } + }); } if (r.scalar_type() != commonDtype) { r.copy_(resultBuffer); @@ -776,7 +898,7 @@ Tensor& intersection_binary_op_sparse_dense_out( const auto sparse_dim = static_cast(res_shape.size()); const auto indices = at::empty({sparse_dim, 0}, s_._indices().options()); const auto values = at::empty({0}, s_._values().options().dtype(res.scalar_type())); - get_sparse_impl(res)->raw_resize_(sparse_dim, /*dense_dim=*/0, /*shape=*/res_shape); + get_sparse_impl(res)->raw_resize_(sparse_dim, /*dense_dim=*/0, /*size=*/res_shape); get_sparse_impl(res)->set_indices_and_values_unsafe(indices, values); get_sparse_impl(res)->set_nnz_and_narrow(0); return res._coalesced_(true); @@ -798,7 +920,18 @@ Tensor& intersection_binary_op_sparse_dense_out( const auto apply_op = [&](const Tensor& d_filtered) -> Tensor& { const auto 
res_indices = s_indices.clone(); - const auto res_values = op(d_filtered, s_values); + // to(res.scalar_type) is only performed when both d and s are 0-dim. + // This insures right type promotions with the following rules: + // op(0-dim, 0-dim).dtype == + // op(0-dim, ge-1-dim).dtype == .dtype, + // where ge-1-dim is a tensor with dim >= 1. + // We do not cast if op is performed in-place. + // The cast is required if s is 0-dim non-coalesced tensor and d is 0-dim. + // This is because s.values is at least 1D, so + // op(s.values, d).dtype == s.values.dtype, but we want + // op(s.values, d).dtype == . + const auto values = op(d_filtered, s_values); + const auto res_values = is_same_tensor(s_, res) ? values : values.to(res.scalar_type()); get_sparse_impl(res)->raw_resize_(sparse_dim, dense_dim, res_shape); get_sparse_impl(res)->set_indices_and_values_unsafe(res_indices, res_values); get_sparse_impl(res)->set_nnz_and_narrow(s._nnz()); @@ -827,14 +960,14 @@ Tensor& intersection_binary_op_sparse_dense_out( intersec_indices.reserve(d_dim); if (d_start_dim_intersec) { - intersec_indices.push_back(Ellipsis); + intersec_indices.emplace_back(Ellipsis); } for (const auto i : c10::irange(sparse_dim_intersec)) { const auto s_idx = s_start_dim_intersec + i; - intersec_indices.push_back(s_indices[s_idx]); + intersec_indices.emplace_back(s_indices[s_idx]); } for (auto i = d_start_dim_intersec + sparse_dim_intersec; i < d_dim; ++i) { - intersec_indices.push_back(Slice()); + intersec_indices.emplace_back(Slice()); } // we need to expand d in the dimensions it is being indexed into // to avoid out of bound indices @@ -851,10 +984,10 @@ Tensor& intersection_binary_op_sparse_dense_out( // Otherwise nnz gets larger, and both indices and values need an update. const auto d_batch_shape = d.sizes().slice(0, d_start_dim_intersec); - const auto d_batch_len = d_batch_shape.size(); - int64_t batch_count; - int64_t max_batch_dim; - std::tie(batch_count, max_batch_dim) = [&]() -> std::tuple { + const auto d_batch_len = static_cast(d_batch_shape.size()); + int64_t batch_count = 1; + int64_t max_batch_dim = 0; + std::tie(batch_count, max_batch_dim) = [d_batch_shape]() -> std::tuple { int64_t batch_count = 1; int64_t max_batch_dim = 0; for (const auto& b : d_batch_shape) { @@ -873,31 +1006,31 @@ Tensor& intersection_binary_op_sparse_dense_out( const auto res_values = op(d_filtered, s_values).reshape(res_values_shape); const auto res_indices = [&]() -> Tensor { const auto index_buffer = at::arange(max_batch_dim, s_indices.options()); - auto res_indices = at::empty({res_sparse_dim, res_nnz}, s_indices.options()); + auto indices = at::empty({res_sparse_dim, res_nnz}, s_indices.options()); // fill in indices corresponding to the "batch" dimensions of d. int64_t n_repeat_interleave = res_nnz; - int n_repeat = 1; + int64_t n_repeat = 1; for (const auto dim : c10::irange(d_batch_len)) { const auto dim_size = d_batch_shape[dim]; n_repeat_interleave /= dim_size; // fill in indices corresponding to the "batch" dimension dim. 
- // Equivalent to res_indices[dim].copy_(repeat_interleave(dim_index, n_repeat_interleave).repeat(n_repeat)) + // Equivalent to indices[dim].copy_(repeat_interleave(dim_index, n_repeat_interleave).repeat(n_repeat)) const std::initializer_list dim_index_expanded_shape = {n_repeat, dim_size, n_repeat_interleave}; const auto dim_index = index_buffer.slice(-1, 0, dim_size); const auto dim_index_expanded = dim_index.unsqueeze(0).unsqueeze_(-1).expand(dim_index_expanded_shape); - // NOTE: res_indices is contiguous, so view is safe - res_indices[dim].view(dim_index_expanded_shape).copy_(dim_index_expanded); + // NOTE: indices is contiguous, so view is safe + indices[dim].view(dim_index_expanded_shape).copy_(dim_index_expanded); n_repeat *= dim_size; } // fill in indices corresponding to s_indices. - // Equivalent to res_indices_sparse.copy(s_indices.repeat({1, n_repeat}) + // Equivalent to indices_sparse.copy(s_indices.repeat({1, n_repeat}) n_repeat = res_nnz / s_nnz; - auto res_indices_sparse = res_indices.narrow(0, d_batch_len, res_sparse_dim - d_batch_len); + auto indices_sparse = indices.narrow(0, d_batch_len, res_sparse_dim - d_batch_len); const std::initializer_list s_indices_expanded_shape = {-1, n_repeat, s_nnz}; const auto s_indices_expanded = s_indices.unsqueeze(1).expand(s_indices_expanded_shape); - res_indices_sparse.view(s_indices_expanded_shape).copy_(s_indices_expanded); + indices_sparse.view(s_indices_expanded_shape).copy_(s_indices_expanded); - return res_indices; + return indices; }(); get_sparse_impl(res)->raw_resize_(res_sparse_dim, res_dense_dim, res_shape); @@ -914,6 +1047,46 @@ Tensor& _mul_dense_sparse_out(const Tensor& d, const Tensor& s, Tensor& res) { }); } +Tensor& _mul_sparse_sparse_zero_dim_out(const Tensor& zero_dim, const Tensor& other, Tensor& r) { + const auto is_wrapped_scalar = [](const Tensor& s) -> bool { + return !s.dim() && s.is_coalesced(); + }; + + const auto extract_vals_from_wrapped_scalar = [](const Tensor& s) -> Tensor { + auto vals = s._values().squeeze(0); + // if squeeze does not kill the dim, it means that + // vals is empty with shape [0]. In such a case we + // return a 0-dim empty tensor to avoid broadcasting + // issues in intersection_binary_op_sparse_dense_out + // when the sparse argument is actually 0-dim. + if (vals.dim()) { + return at::empty({}, vals.options()); + } + return vals; + }; + + // The code dispatches to mul(dense, sparse), and the goal + // is to delay calling into coalesce when converting one of + // the sparse arguments to dense if possible. + // This is possible when there is a 0-dim coalesced argument. + + // if is_wrapped_scalar(zero_dim) + if (zero_dim.is_coalesced()) { + const auto scalar_val = extract_vals_from_wrapped_scalar(zero_dim); + return _mul_dense_sparse_out(scalar_val, other, r); + } + // Here zero_dim is not a wrapped scalar, so we test other. + if (is_wrapped_scalar(other)) { + const auto scalar_val = extract_vals_from_wrapped_scalar(other); + return _mul_dense_sparse_out(scalar_val, zero_dim, r); + } + // Neither of inputs is a wrapped scalar, but zero_dim + // is at least 0-dim, so we coalesce it to convert to + // a scalar. 
+ const auto scalar_val = extract_vals_from_wrapped_scalar(zero_dim.coalesce()); + return _mul_dense_sparse_out(scalar_val, other, r); +} + SparseTensor& mul_out_sparse_cpu(const Tensor& t_, const Tensor& src_, Tensor& r) { AT_ASSERT(!t_.is_cuda()); // dispatch argument TORCH_CHECK(!r.is_cuda(), "mul: expected 'out' to be CPU tensor, but got CUDA tensor"); @@ -928,6 +1101,14 @@ SparseTensor& mul_out_sparse_cpu(const Tensor& t_, const Tensor& src_, Tensor& r return _mul_dense_sparse_out(t_, src_, r); } + // case mul(sparse, sparse) with a 0-dim input. + if (!src_.dim()) { + return _mul_sparse_sparse_zero_dim_out(src_, t_, r); + } + if (!t_.dim()) { + return _mul_sparse_sparse_zero_dim_out(t_, src_, r); + } + TORCH_CHECK(t_.sizes().equals(src_.sizes()), "mul: expected 'self' and 'other' to have same sizes when both are sparse" ", but ", t_.sizes(), " != ", src_.sizes()); diff --git a/aten/src/ATen/native/sparse/SparseTensorMath.h b/aten/src/ATen/native/sparse/SparseTensorMath.h index 645e0e65e0605..1a263b2e7d5e7 100644 --- a/aten/src/ATen/native/sparse/SparseTensorMath.h +++ b/aten/src/ATen/native/sparse/SparseTensorMath.h @@ -7,5 +7,6 @@ namespace at { namespace native { TORCH_API sparse::SparseTensor& mul_out_sparse_scalar(sparse::SparseTensor& r, const sparse::SparseTensor& t, const Scalar& value); TORCH_API sparse::SparseTensor& mul_out_sparse_zerodim(sparse::SparseTensor& r, const sparse::SparseTensor& t, const Tensor& value); TORCH_API sparse::SparseTensor& _mul_dense_sparse_out(const Tensor& d, const Tensor& s, Tensor& res); +TORCH_API sparse::SparseTensor& _mul_sparse_sparse_zero_dim_out(const Tensor& zero_dim, const Tensor& other, Tensor& res); }} diff --git a/aten/src/ATen/native/sparse/cuda/SoftMax.cu b/aten/src/ATen/native/sparse/cuda/SoftMax.cu index 05cb9e06d90f3..0591646f89b5a 100644 --- a/aten/src/ATen/native/sparse/cuda/SoftMax.cu +++ b/aten/src/ATen/native/sparse/cuda/SoftMax.cu @@ -258,7 +258,7 @@ Tensor get_offsets( cudaMemcpyHostToDevice, stream)); - auto indices_accessor = indices.packed_accessor(); + auto indices_accessor = indices.packed_accessor64(); Tensor offsets = at::empty({nnz}, indices.options()); @@ -345,7 +345,7 @@ std::tuple compute_pool_max( if (requireMxRows) { auto values_accessor = - values.packed_accessor(); // {nnz, nvalues} + values.packed_accessor64(); // {nnz, nvalues} mx_buffer = at::full({new_sz * nvalues}, Scalar(-std::numeric_limits::infinity()), values.options()); @@ -420,10 +420,10 @@ void cuda_sparse_coo_softmax( /* Prepare accessors */ auto values_2 = values.view({nnz, nvalues}); - auto values_accessor = values_2.packed_accessor(); + auto values_accessor = values_2.packed_accessor64(); auto out_values_2 = out_values.view({nnz, nvalues}); - auto out_values_accessor = out_values_2.packed_accessor(); + auto out_values_accessor = out_values_2.packed_accessor64(); Tensor sorted_indices; Tensor pool_offsets; @@ -539,13 +539,13 @@ void cuda_sparse_coo_softmax_backward( auto nvalues = get_nvalues(sizes, sparse_dim); auto values_2 = values.view({nnz, nvalues}); - auto values_accessor = values_2.packed_accessor(); + auto values_accessor = values_2.packed_accessor64(); auto out_values_2 = out_values.view({out_nnz, nvalues}); - auto out_values_accessor = out_values_2.packed_accessor(); + auto out_values_accessor = out_values_2.packed_accessor64(); auto grad_values_2 = grad_values.view({grad_nnz, nvalues}); - auto grad_values_accessor = grad_values_2.packed_accessor(); + auto grad_values_accessor = grad_values_2.packed_accessor64(); Tensor 
lower_bound_values = at::empty({out_offsets.size(0)}, indices.options()); diff --git a/aten/src/ATen/native/sparse/cuda/SparseBlasImpl.cpp b/aten/src/ATen/native/sparse/cuda/SparseBlasImpl.cpp index 4309e756e8bea..bae31b308cbfe 100644 --- a/aten/src/ATen/native/sparse/cuda/SparseBlasImpl.cpp +++ b/aten/src/ATen/native/sparse/cuda/SparseBlasImpl.cpp @@ -562,7 +562,7 @@ void spmm( const Scalar& beta, const Scalar& alpha, const Tensor& result) { -#if !AT_USE_CUSPARSE_GENERIC_API() +#if !(AT_USE_CUSPARSE_GENERIC_API() || AT_USE_HIPSPARSE_GENERIC_52_API()) addmm_out_legacy(mat1, mat2, beta, alpha, result); #else c10::MaybeOwned result_ = prepare_dense_matrix_for_cusparse(result); @@ -663,7 +663,7 @@ void spmm( if (!result.is_same(*result_)) { result.copy_(*result_); } -#endif // !AT_USE_CUSPARSE_GENERIC_API() +#endif // !(AT_USE_CUSPARSE_GENERIC_API() || AT_USE_HIPSPARSE_GENERIC_API()) } void spgemm( @@ -672,12 +672,18 @@ void spgemm( const Scalar& beta, const Scalar& alpha, const at::sparse_csr::SparseCsrTensor& C) { -#if defined(CUDA_VERSION) && CUDA_VERSION < 11000 +#if (!defined(USE_ROCM)) && (defined(CUDA_VERSION) && CUDA_VERSION < 11000) TORCH_CHECK( false, "Calling addmm with sparse GPU tensors requires compiling ", "PyTorch with CUDA 11+. ", "Please use PyTorch built with newer CUDA version."); +#elif defined(USE_ROCM) && ROCM_VERSION < 50200 + TORCH_CHECK( + false, + "Calling addmm with sparse GPU tensors requires compiling ", + "PyTorch with ROCm 5.2+. ", + "Please use PyTorch built with newer ROCm version."); #else // older versions of cusparse on Windows segfault for complex128 dtype #if defined(_WIN32) && defined(CUSPARSE_VERSION) && CUSPARSE_VERSION < 11400 @@ -862,7 +868,7 @@ void addmv_out_sparse_csr( if (mat.layout() == kSparseBsr) { return block_sparse_mv(mat, vec, beta, alpha, result); } -#if !AT_USE_CUSPARSE_GENERIC_API() +#if !(AT_USE_CUSPARSE_GENERIC_API() || AT_USE_HIPSPARSE_GENERIC_API()) TORCH_CHECK( false, "Calling addmv on a sparse GPU tensor requires compiling ", @@ -936,7 +942,7 @@ void addmv_out_sparse_csr( if (!result.is_same(*result_)) { result.copy_(*result_); } -#endif +#endif // !(AT_USE_CUSPARSE_GENERIC_API() || AT_USE_HIPSPARSE_GENERIC_API()) } /* diff --git a/aten/src/ATen/native/sparse/cuda/SparseCUDABlas.cpp b/aten/src/ATen/native/sparse/cuda/SparseCUDABlas.cpp index 633a503ac8332..bd89e6fc1701a 100644 --- a/aten/src/ATen/native/sparse/cuda/SparseCUDABlas.cpp +++ b/aten/src/ATen/native/sparse/cuda/SparseCUDABlas.cpp @@ -19,7 +19,13 @@ #define IS_SPMM_AVAILABLE() 0 #endif -#if IS_SPMM_AVAILABLE() +#if defined(USE_ROCM) && ROCM_VERSION >= 50200 +#define IS_SPMM_HIP_AVAILABLE() 1 +#else +#define IS_SPMM_HIP_AVAILABLE() 0 +#endif + +#if IS_SPMM_AVAILABLE() || IS_SPMM_HIP_AVAILABLE() #include #endif @@ -86,7 +92,7 @@ cusparseOperation_t convertTransToCusparseOperation(char trans) { } } -#if IS_SPMM_AVAILABLE() +#if IS_SPMM_AVAILABLE() || IS_SPMM_HIP_AVAILABLE() namespace { template diff --git a/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu b/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu index bea7788e9d579..0cd9882b0c1be 100644 --- a/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu +++ b/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu @@ -471,6 +471,14 @@ SparseTensor& mul_out_sparse_cuda(const Tensor& t_, const Tensor& src_, SparseTe return _mul_dense_sparse_out(t_, src_, r_); } + // case mul(sparse, sparse) with a 0-dim input. 
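// Editor's note (not part of the patch): the two early returns added below mirror
// the CPU-side change earlier in this diff. A 0-dim sparse operand (a "wrapped
// scalar") is routed through _mul_sparse_sparse_zero_dim_out, which extracts its
// values and falls back to the existing mul(dense, sparse) kernel, instead of
// falling through to the size-equality check that would otherwise reject it.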
+ if (!src_.dim()) { + return _mul_sparse_sparse_zero_dim_out(src_, t_, r_); + } + if (!t_.dim()) { + return _mul_sparse_sparse_zero_dim_out(t_, src_, r_); + } + TORCH_CHECK(t_.is_cuda(), "mul: expected 'self' to be CUDA, but got CPU"); TORCH_CHECK(src_.is_cuda(), "mul: expected 'other' to be CUDA, but got CPU"); TORCH_CHECK(cuda::check_device({r_, t_, src_})); @@ -708,7 +716,7 @@ Tensor bmm_sparse_cuda(const SparseTensor& self, const Tensor& mat2) { return bmm_out_sparse_cuda(self, mat2, result); } -#if !(defined(USE_ROCM) || (defined(_MSC_VER) && CUSPARSE_VERSION < 11000)) +#if defined(USE_ROCM) || !(defined(_MSC_VER) && CUSPARSE_VERSION < 11000) __global__ void search_end_matrix_indices_cuda_kernel( int64_t* mat_el_end_indices, int64_t num_matrices, @@ -789,11 +797,9 @@ cudaDataType getTensorCudaDataType(Tensor self) { #endif Tensor& bmm_out_sparse_cuda(const SparseTensor& self, const Tensor& mat2, Tensor& result) { -#if defined(USE_ROCM) - TORCH_CHECK(false, "bmm sparse-dense is not supported on HIP"); -#elif defined(_MSC_VER) && (CUSPARSE_VERSION < 11000) +#if defined(_MSC_VER) && (CUSPARSE_VERSION < 11000) TORCH_CHECK(false, "bmm sparse-dense CUDA is not supported on Windows with cuda before 11.0"); -#elif defined(CUDART_VERSION) && (CUDART_VERSION >= 10010) // linux cuda >= 10.1 or windows cuda >= 11.0 +#elif defined(USE_ROCM) || (defined(CUDART_VERSION) && (CUDART_VERSION >= 10010)) // linux cuda >= 10.1 or windows cuda >= 11.0 TORCH_CHECK(!mat2.is_sparse(), "bmm_sparse: Tensor 'mat2' must be dense"); TORCH_CHECK(self.dense_dim() == 0, "bmm_sparse: Tensor 'self' must have 0 dense dims, but has ", self.dense_dim()); diff --git a/aten/src/ATen/native/sparse/cuda/SparseMatMul.cu b/aten/src/ATen/native/sparse/cuda/SparseMatMul.cu index dbc194ddb20b6..8cc5fc3157c38 100644 --- a/aten/src/ATen/native/sparse/cuda/SparseMatMul.cu +++ b/aten/src/ATen/native/sparse/cuda/SparseMatMul.cu @@ -734,13 +734,13 @@ void sparse_sparse_matmul_cuda_kernel( output_values.set_(csr_output.csr_values_); output_indices.resize_({2, nnz}); - auto output_indices_accessor = output_indices.packed_accessor(); + auto output_indices_accessor = output_indices.packed_accessor64(); auto csr_output_pointers_accessor = - csr_output.csr_pointers_.packed_accessor(); + csr_output.csr_pointers_.packed_accessor64(); auto csr_output_ind_accessor = - csr_output.csr_indices_.packed_accessor(); + csr_output.csr_indices_.packed_accessor64(); auto major_dim = result.size(0); cudaStream_t stream = at::cuda::getCurrentCUDAStream(); diff --git a/aten/src/ATen/native/tags.yaml b/aten/src/ATen/native/tags.yaml index 39ff5de6f7c48..8fc44c68c2674 100644 --- a/aten/src/ATen/native/tags.yaml +++ b/aten/src/ATen/native/tags.yaml @@ -12,6 +12,12 @@ desc: | This tag indicates if an operator's output's shape depends on input Tensor data. +- tag: data_dependent_output + desc: | + Operator has a non-Tensor output whose value is dependent on the data + of Tensor inputs. Among other things, this implies that this operator + cannot be run with meta tensor (since data is not available), nor + can it be symbolically traced. 
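(Editor's note: the `data_dependent_output` tag described above covers ops whose non-Tensor result is read from tensor data; `aten::_local_scalar_dense`, the kernel behind `Tensor::item()`, is a canonical example of such an op. A minimal C++ illustration, not part of the patch:)

```cpp
#include <ATen/ATen.h>

// item() returns a Scalar whose value is read from the tensor's storage, so the
// op cannot run on a meta tensor (which has no data) or be traced symbolically.
void item_example() {
  at::Tensor x = at::scalar_tensor(3.5);
  double v = x.item<double>();  // v == 3.5; requires real tensor data
  (void)v;                      // silence unused-variable warnings
}
```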
- tag: generated desc: | This tag indicates that the operator doesn't have an explicit entry in diff --git a/aten/src/ATen/native/transformers/attention.cpp b/aten/src/ATen/native/transformers/attention.cpp index 67fa95c72aa2e..6a6a6daafd866 100644 --- a/aten/src/ATen/native/transformers/attention.cpp +++ b/aten/src/ATen/native/transformers/attention.cpp @@ -1,5 +1,4 @@ #include - #include #include #include @@ -118,14 +117,10 @@ Tensor bmm_nt(const Tensor& a, const Tensor& b) { Tensor masked_softmax( Tensor& attn_scores, c10::optional attn_mask, - const Tensor& query) { + const Tensor& query, + c10::optional mask_type = NULL) { if (query.is_nested() && !attn_mask) { - if (attn_scores.is_cpu()) { - NestedTensor_softmax_dropout(query, attn_scores); - return attn_scores; - } - attn_mask = NestedTensor_to_mask(query, 2, attn_scores.size(2)); - attn_mask = attn_mask->to(query.device(), /*non-blocking=*/true); + return at::_nested_tensor_softmax_with_shape(attn_scores, query); } if (attn_mask && attn_mask->dtype() != at::kBool) { TORCH_WARN( @@ -143,7 +138,7 @@ Tensor masked_softmax( attn_mask = at::expand_inplace(attn_scores, *attn_mask)->contiguous(); } if (attn_mask) { - return _masked_softmax(attn_scores, *attn_mask); + return _masked_softmax(attn_scores, *attn_mask, attn_scores.dim() - 1, mask_type); } else { return _softmax_out(attn_scores, attn_scores, attn_scores.dim() - 1, false); } @@ -329,7 +324,8 @@ std::tuple native_multi_head_attention( const Tensor& proj_bias, const c10::optional& mask, bool need_weights, - bool average_attn_weights) { + bool average_attn_weights, + const c10::optional mask_type) { // query shape: [B, T, D] // qkv_weight shape: [3 * D, D] @@ -445,7 +441,7 @@ std::tuple native_multi_head_attention( // shape: [B, num_head, T, T] // TODO: long-term, have a kernel that works with // NestedTensor directly if there is no mask passed - qkt = masked_softmax(qkt, mask, query); + qkt = masked_softmax(qkt, mask, query, mask_type); #ifdef DEBUG_PRINT_EACH_STEP std::cerr << "qkt after softmax: " << qkt << std::endl; #endif @@ -727,5 +723,103 @@ std::tuple _scaled_dot_product_attention( return (need_attn_weights ? 
std::make_tuple(output, attn) : std::make_tuple(output, Tensor())); } +Tensor triton_multi_head_attention( + const Tensor& query, + const Tensor& key, + const Tensor& value, + const int64_t embed_dim, + const int64_t num_head, + const Tensor& qkv_weight, + const Tensor& qkv_bias, + const Tensor& proj_weight, + const Tensor& proj_bias, + const c10::optional& mask) { + // query shape: [B, T, D] + // qkv_weight shape: [3 * D, D] + TORCH_CHECK(!mask, "Only casual mask is supported for Triton."); + + const auto D = embed_dim; + TORCH_CHECK( + query.dim() == 3, + "expected 3-D `query`, got ", + query.dim(), + "-D tensor"); + TORCH_CHECK( + query.sizes()[2] == embed_dim, + "passed-in embed_dim ", + embed_dim, + " didn't match last dim of query ", + query.sizes()[2]); + TORCH_CHECK( + key.dim() == 3, + "expected 3-D `key`, got ", + key.dim(), + "-D tensor"); + TORCH_CHECK( + value.dim() == 3, + "expected 3-D `value`, got ", + value.dim(), + "-D tensor"); + TORCH_CHECK( + query.sizes() == key.sizes() && key.sizes() == value.sizes(), + "expected `query`/`key`/`value` shapes to match"); + TORCH_CHECK( + qkv_weight.dim() == 2, + "expected 2-D `qkv_weight`, got ", + qkv_weight.dim(), + "-D tensor"); + TORCH_CHECK( + D * 3 == qkv_weight.sizes()[0], + "expected `qkv_weight` first dim to be 3x embed_dim"); + TORCH_CHECK( + D == qkv_weight.sizes()[1], + "expected `qkv_weight` second dim to be embed_Dim"); + +#ifndef NDEBUG + const auto B = query.is_nested() + ? get_nested_tensor_impl(query)->get_nested_size_tensor().size(0) + : query.sizes()[0]; + auto T = query.is_nested() ? 0 : query.sizes()[1]; + const auto dim_per_head = D / num_head; +#endif + + // shape: [B, T, 3 x D] + auto qkv = qkv_projection(query, key, value, embed_dim, qkv_weight); + + // shape: 3 x [B, num_head, T, dim_per_head] + auto q_k_v = _transform_bias_rescale_qkv(qkv, qkv_bias, num_head); + qkv = Tensor(); // Not used any more, allow free + auto& q = std::get<0>(q_k_v); + const auto& k = std::get<1>(q_k_v); + const auto& v = std::get<2>(q_k_v); +#ifndef NDEBUG + debug_assert_shape(__LINE__, q, {B, num_head, T, dim_per_head}); + debug_assert_shape(__LINE__, k, {B, num_head, T, dim_per_head}); + debug_assert_shape(__LINE__, v, {B, num_head, T, dim_per_head}); +#endif +#ifdef DEBUG_PRINT_EACH_STEP + std::cerr << "q: " << q << std::endl; + std::cerr << "k: " << k << std::endl; + std::cerr << "v: " << v << std::endl; +#endif + + auto attn_ctx = at::_triton_scaled_dot_attention(q, k, v); + +#ifndef NDEBUG + debug_assert_shape(__LINE__, attn_ctx, {B, num_head, T, dim_per_head}); +#endif +#ifdef DEBUG_PRINT_EACH_STEP + std::cerr << "attn_ctx: " << attn_ctx << std::endl; +#endif + + // shape: [B, T, D] + // Fuse transform_0213 inside + auto proj = transform0213_gemm_nt_bias( + attn_ctx, proj_weight, proj_bias, query); +#ifndef NDEBUG + debug_assert_shape(__LINE__, proj, {B, T, D}); +#endif + return proj; +} } // namespace native } // namespace at diff --git a/aten/src/ATen/native/transformers/cuda/attention.cu b/aten/src/ATen/native/transformers/cuda/attention.cu index f347cd6c8c30a..dd31a755bf1dd 100644 --- a/aten/src/ATen/native/transformers/cuda/attention.cu +++ b/aten/src/ATen/native/transformers/cuda/attention.cu @@ -345,7 +345,7 @@ __host__ std::tuple transform_bias_rescale_qkv_cuda( accscalar_t, \ assume_aligned> \ <<>>( \ - nt_qkv->get_buffer() \ + nt_qkv_buffer \ .packed_accessor64(), \ qkv_bias.packed_accessor64(), \ offsets_ptr, \ @@ -376,6 +376,7 @@ __host__ std::tuple transform_bias_rescale_qkv_cuda( } if (qkv.is_nested()) { auto* 
nt_qkv = get_nested_tensor_impl(qkv); + const at::Tensor& nt_qkv_buffer = nt_qkv->get_buffer(); auto sizes = collapse_dims_1_and_2(nt_qkv->get_nested_size_tensor()); auto offsets = NestedTensor_batch_offsets_from_size_tensor(sizes, sizes.numel()); @@ -387,7 +388,7 @@ __host__ std::tuple transform_bias_rescale_qkv_cuda( const auto input_dim = sizes.sizes()[1]; TORCH_INTERNAL_ASSERT_DEBUG_ONLY(input_dim == 1); if (aligned && - ((reinterpret_cast(nt_qkv->get_buffer().data_ptr()) % + ((reinterpret_cast(qkv.data_ptr()) % TRANSFORM_BIAS_RESCALE_VEC) == 0)) { CALL_ADD_PADDING_KERNEL(true); } else { @@ -406,5 +407,10 @@ __host__ std::tuple transform_bias_rescale_qkv_cuda( at::native::split(q_k_v.view({3 * B, num_head, T, dim_per_head}), B, 0); return std::make_tuple(q_k_v_s[0], q_k_v_s[1], q_k_v_s[2]); } + +Tensor triton_scaled_dot_attention(const Tensor& q, const Tensor& k, const Tensor& v, double dropout_p){ + TORCH_CHECK(false, "This operator should be overridden in python before use"); + return at::Tensor(); +} } // namespace native } // namespace at diff --git a/aten/src/ATen/native/transformers/transformer.cpp b/aten/src/ATen/native/transformers/transformer.cpp index bba3adc9b2c4b..2a641a40dfb5f 100644 --- a/aten/src/ATen/native/transformers/transformer.cpp +++ b/aten/src/ATen/native/transformers/transformer.cpp @@ -92,7 +92,8 @@ Tensor transformer_encoder_layer_forward( const Tensor& ffn_bias_1, const Tensor& ffn_weight_2, const Tensor& ffn_bias_2, - const c10::optional& mask) { + const c10::optional& mask, + const c10::optional mask_type) { { const Tensor& check_for_empty = src.is_nested() ? get_nested_tensor_impl(src)->get_buffer() : src; if (check_for_empty.numel() == 0) { @@ -117,7 +118,9 @@ Tensor transformer_encoder_layer_forward( proj_weight, proj_bias, mask, - false /* need_weights */)); + false /* need_weights */, + true /* average_attn_weights */, + mask_type)); add_in_place(x, src, use_nested_tensor); if (!norm_first) { x = norm(x, embed_dim, layer_norm_eps, layer_norm_weight_1, layer_norm_bias_1, use_nested_tensor); diff --git a/aten/src/ATen/native/ts_native_functions.yaml b/aten/src/ATen/native/ts_native_functions.yaml index 2ef238c0bff00..a6d26b3ad75b6 100644 --- a/aten/src/ATen/native/ts_native_functions.yaml +++ b/aten/src/ATen/native/ts_native_functions.yaml @@ -199,7 +199,6 @@ supported: - pixel_unshuffle - select_backward - _trilinear - - linalg_inv_ex - linalg_pinv.atol_rtol_tensor - logsumexp.out autograd: diff --git a/aten/src/ATen/native/vulkan/api/Allocator.h b/aten/src/ATen/native/vulkan/api/Allocator.h index 470eb07543c24..ca7541784cf06 100644 --- a/aten/src/ATen/native/vulkan/api/Allocator.h +++ b/aten/src/ATen/native/vulkan/api/Allocator.h @@ -47,7 +47,7 @@ #pragma clang diagnostic ignored "-Wunused-variable" #endif /* __clang__ */ -#include +#include #ifdef __clang__ #pragma clang diagnostic pop diff --git a/aten/src/ATen/native/vulkan/api/Command.cpp b/aten/src/ATen/native/vulkan/api/Command.cpp index b2c63ee4399f5..c42eda1c5ef26 100644 --- a/aten/src/ATen/native/vulkan/api/Command.cpp +++ b/aten/src/ATen/native/vulkan/api/Command.cpp @@ -215,6 +215,82 @@ void CommandBuffer::copy_texture_to_texture( state_ = CommandBuffer::State::RECORDING; } +void CommandBuffer::copy_texture_to_buffer( + const api::VulkanImage& source, + const api::VulkanBuffer& destination, + const api::utils::uvec3& copy_range, + const api::utils::uvec3& src_offset, + const api::utils::uvec3& dst_offset) { + TORCH_CHECK( + state_ == CommandBuffer::State::BARRIERS_INSERTED, + "Vulkan 
CommandBuffer: called copy_texture_to_buffer() on a command buffer whose state " + "is not BARRIERS_INSERTED."); + + const VkImageSubresourceLayers src_subresource_layers{ + VK_IMAGE_ASPECT_COLOR_BIT, // aspectMask + 0u, // mipLevel + 0u, // baseArrayLayer + 1u, // layerCount + }; + + const VkBufferImageCopy copy_details{ + dst_offset.data[0u], // bufferOffset + dst_offset.data[1u], // bufferRowLength + dst_offset.data[2u], // bufferImageHeight + src_subresource_layers, // imageSubresource + create_offset3d(src_offset), // imageOffset + create_extent3d(copy_range), // imageExtent + }; + + vkCmdCopyImageToBuffer( + handle_, + source.handle(), + source.layout(), + destination.handle(), + 1u, + ©_details); + + state_ = CommandBuffer::State::RECORDING; +} + +void CommandBuffer::copy_buffer_to_texture( + const api::VulkanBuffer& source, + const api::VulkanImage& destination, + const api::utils::uvec3& copy_range, + const api::utils::uvec3& src_offset, + const api::utils::uvec3& dst_offset) { + TORCH_CHECK( + state_ == CommandBuffer::State::BARRIERS_INSERTED, + "Vulkan CommandBuffer: called copy_buffer_to_texture() on a command buffer whose state " + "is not BARRIERS_INSERTED."); + + const VkImageSubresourceLayers dst_subresource_layers{ + VK_IMAGE_ASPECT_COLOR_BIT, // aspectMask + 0u, // mipLevel + 0u, // baseArrayLayer + 1u, // layerCount + }; + + const VkBufferImageCopy copy_details{ + src_offset.data[0u], // bufferOffset + src_offset.data[1u], // bufferRowLength + src_offset.data[2u], // bufferImageHeight + dst_subresource_layers, // imageSubresource + create_offset3d(dst_offset), // imageOffset + create_extent3d(copy_range), // imageExtent + }; + + vkCmdCopyBufferToImage( + handle_, + source.handle(), + destination.handle(), + destination.layout(), + 1u, + ©_details); + + state_ = CommandBuffer::State::RECORDING; +} + void CommandBuffer::write_timestamp( const VkQueryPool querypool, const uint32_t idx) const { diff --git a/aten/src/ATen/native/vulkan/api/Command.h b/aten/src/ATen/native/vulkan/api/Command.h index 6bb9f49e95656..f52e238463fd8 100644 --- a/aten/src/ATen/native/vulkan/api/Command.h +++ b/aten/src/ATen/native/vulkan/api/Command.h @@ -34,7 +34,7 @@ class CommandBuffer final { INVALID, // Used to indicate the command buffer is moved from NEW, // Set during constructor RECORDING, // Set during call to begin(), dispatch(), and - // copy_texture_to_texture() + // copy_*_to_*() PIPELINE_BOUND, // Set during call to bind_pipeline() DESCRIPTORS_BOUND, // Set during call to bind_descriptors() BARRIERS_INSERTED, // Set during call to insert_barrier() @@ -88,6 +88,20 @@ class CommandBuffer final { const api::utils::uvec3&, const api::utils::uvec3&); + void copy_texture_to_buffer( + const api::VulkanImage&, + const api::VulkanBuffer&, + const api::utils::uvec3&, + const api::utils::uvec3&, + const api::utils::uvec3&); + + void copy_buffer_to_texture( + const api::VulkanBuffer&, + const api::VulkanImage&, + const api::utils::uvec3&, + const api::utils::uvec3&, + const api::utils::uvec3&); + void write_timestamp(const VkQueryPool, const uint32_t) const; void reset_querypool(const VkQueryPool, const uint32_t, const uint32_t) const; diff --git a/aten/src/ATen/native/vulkan/api/Common.h b/aten/src/ATen/native/vulkan/api/Common.h index 1fa268e63409b..d658181b4802d 100644 --- a/aten/src/ATen/native/vulkan/api/Common.h +++ b/aten/src/ATen/native/vulkan/api/Common.h @@ -20,23 +20,36 @@ } #endif /* USE_VULKAN_SHADERC_RUNTIME */ -#define VK_CHECK(function) \ - do { \ - const VkResult result = 
(function); \ - TORCH_CHECK( \ - VK_SUCCESS == result, \ - C10_STRINGIZE(__FILE__), \ - " [", \ - C10_STRINGIZE(__LINE__), \ - "] " \ - "VkResult:", \ - result); \ +/* + * Check that the return code of a Vulkan API call is VK_SUCCESS, throwing an + * error with the returned code if not. If STRIP_ERROR_MESSAGES is defined then + * only the return code will be preserved. + */ +#ifdef STRIP_ERROR_MESSAGES +#define VK_CHECK(function) \ + do { \ + const VkResult result = (function); \ + if (VK_SUCCESS != result) { \ + throw c10::Error( \ + {__func__, __FILE__, static_cast(__LINE__)}, \ + c10::str(result)); \ + } \ } while (false) - -#define VK_CHECK_RELAXED(function) \ - do { \ - const VkResult result = (function); \ - TORCH_CHECK(VK_SUCCESS <= result, "VkResult:", result); \ +#else +#define VK_CHECK(function) \ + do { \ + const VkResult result = (function); \ + if (VK_SUCCESS != result) { \ + throw c10::Error( \ + {__func__, __FILE__, static_cast(__LINE__)}, \ + c10::str( \ + C10_STRINGIZE(__FILE__), \ + "[", \ + C10_STRINGIZE(__LINE__), \ + "] Expected VK_SUCCESS, got VkResult of ", \ + result)); \ + } \ } while (false) +#endif /* STRIP_ERROR_MESSAGES */ #endif /* USE_VULKAN_API */ diff --git a/aten/src/ATen/native/vulkan/api/Context.cpp b/aten/src/ATen/native/vulkan/api/Context.cpp index 4d7b3aa0d9877..a26dc95000328 100644 --- a/aten/src/ATen/native/vulkan/api/Context.cpp +++ b/aten/src/ATen/native/vulkan/api/Context.cpp @@ -72,48 +72,6 @@ void Context::submit_compute_epilogue( command_buffer.dispatch(global_workgroup_size); } -void Context::submit_texture_copy( - const PipelineBarrier& pipeline_barrier, - const api::VulkanImage& source, - const api::VulkanImage& destination, - const api::utils::uvec3& copy_range, - const api::utils::uvec3& src_offset, - const api::utils::uvec3& dst_offset, - const VkFence fence_handle) { - // Serialize recording to the shared command buffer. Do not initialize with a - // mutex just yet, since in some cases it will be externally managed. - std::unique_lock cmd_lock; - // Refer to comments in submit_compute_job for explanation. - if (fence_handle == VK_NULL_HANDLE) { - cmd_lock = std::unique_lock(cmd_mutex_); - } - - set_cmd(); - -#ifdef USE_VULKAN_GPU_DIAGNOSTICS - uint32_t log_idx = querypool_.shader_profile_begin( - cmd_, - "copy_texture_to_texture", - create_extent3d({0, 0, 0}), - create_extent3d({0, 0, 0})); -#endif /* USE_VULKAN_GPU_DIAGNOSTICS */ - - cmd_.insert_barrier(pipeline_barrier); - - cmd_.copy_texture_to_texture( - source, destination, copy_range, src_offset, dst_offset); - -#ifdef USE_VULKAN_GPU_DIAGNOSTICS - querypool_.shader_profile_end(cmd_, log_idx); -#endif /* USE_VULKAN_GPU_DIAGNOSTICS */ - - submit_count_++; - if (fence_handle != VK_NULL_HANDLE || - submit_count_ >= config_.cmdSubmitFrequency) { - submit_cmd_to_gpu(fence_handle); - } -} - void Context::submit_cmd_to_gpu(const VkFence fence_handle) { if (cmd_) { cmd_.end(); @@ -171,18 +129,26 @@ Context* context() { }; return new Context(runtime()->default_adapter_i(), config); + } catch (const c10::Error& e) { + TORCH_WARN( + "Pytorch Vulkan Context: Failed to initialize global vulkan context: ", + e.what()); } catch (const std::exception& e) { - TORCH_CHECK( - false, "Vulkan: Failed to initialize context! Error: ", e.what()); + TORCH_WARN( + "Pytorch Vulkan Context: Failed to initialize global vulkan context: ", + e.what()); } catch (...) { - TORCH_CHECK( - false, "Vulkan: Failed to initialize context! 
Error: Unknown"); + TORCH_WARN( + "Pytorch Vulkan Context: Failed to initialize global vulkan context!"); } return nullptr; }()); - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(context, "Invalid Vulkan context!"); + TORCH_CHECK( + context, + "Pytorch Vulkan Context: The global context could not be retrieved " + "because it failed to initialize."); return context.get(); } diff --git a/aten/src/ATen/native/vulkan/api/Context.h b/aten/src/ATen/native/vulkan/api/Context.h index fbf4aae11376f..e9464b9a16a7e 100644 --- a/aten/src/ATen/native/vulkan/api/Context.h +++ b/aten/src/ATen/native/vulkan/api/Context.h @@ -163,6 +163,16 @@ class Context final { const utils::uvec3&); public: + template + void submit_copy( + const PipelineBarrier&, + const S&, + const D&, + const api::utils::uvec3&, + const api::utils::uvec3&, + const api::utils::uvec3&, + const VkFence fence_handle); + template void submit_compute_job( const ShaderSource&, @@ -172,15 +182,6 @@ class Context final { const VkFence fence_handle, Arguments&&...); - void submit_texture_copy( - const PipelineBarrier& pipeline_barrier, - const api::VulkanImage&, - const api::VulkanImage&, - const api::utils::uvec3&, - const api::utils::uvec3&, - const api::utils::uvec3&, - const VkFence fence_handle); - private: void submit_cmd_to_gpu(const VkFence fence_handle = VK_NULL_HANDLE); @@ -215,28 +216,33 @@ class UniformParamsBuffer final { } }; -class StagingBuffer final { +class StorageBuffer final { private: Context* context_p_; + c10::ScalarType dtype_; + size_t numel_; VulkanBuffer vulkan_buffer_; public: - StagingBuffer( + StorageBuffer( Context* context_p, - const VkDeviceSize size, + const c10::ScalarType dtype, + const size_t numel, const bool gpuonly = false) : context_p_(context_p), + dtype_(dtype), + numel_(numel), vulkan_buffer_(context_p_->adapter_ptr()->vma().create_storage_buffer( - size, + c10::elementSize(dtype_) * numel_, gpuonly)) {} - StagingBuffer(const StagingBuffer&) = delete; - StagingBuffer& operator=(const StagingBuffer&) = delete; + StorageBuffer(const StorageBuffer&) = delete; + StorageBuffer& operator=(const StorageBuffer&) = delete; - StagingBuffer(StagingBuffer&&) = delete; - StagingBuffer& operator=(StagingBuffer&&) = delete; + StorageBuffer(StorageBuffer&&) = delete; + StorageBuffer& operator=(StorageBuffer&&) = delete; - ~StagingBuffer() { + ~StorageBuffer() { context_p_->register_buffer_cleanup(vulkan_buffer_); } @@ -266,6 +272,91 @@ inline void bind( } // namespace detail +template +inline void record_copy( + CommandBuffer& cmd, + const S& source, + const D& destination, + const api::utils::uvec3& copy_range, + const api::utils::uvec3& src_offset, + const api::utils::uvec3& dst_offset) = delete; + +template <> +inline void record_copy( + CommandBuffer& cmd, + const VulkanImage& source, + const VulkanImage& destination, + const api::utils::uvec3& copy_range, + const api::utils::uvec3& src_offset, + const api::utils::uvec3& dst_offset) { + cmd.copy_texture_to_texture( + source, destination, copy_range, src_offset, dst_offset); +} + +template <> +inline void record_copy( + CommandBuffer& cmd, + const VulkanImage& source, + const VulkanBuffer& destination, + const api::utils::uvec3& copy_range, + const api::utils::uvec3& src_offset, + const api::utils::uvec3& dst_offset) { + cmd.copy_texture_to_buffer( + source, destination, copy_range, src_offset, dst_offset); +} + +template <> +inline void record_copy( + CommandBuffer& cmd, + const VulkanBuffer& source, + const VulkanImage& destination, + const api::utils::uvec3& copy_range, + 
const api::utils::uvec3& src_offset, + const api::utils::uvec3& dst_offset) { + cmd.copy_buffer_to_texture( + source, destination, copy_range, src_offset, dst_offset); +} + +template +inline void Context::submit_copy( + const PipelineBarrier& pipeline_barrier, + const S& source, + const D& destination, + const api::utils::uvec3& copy_range, + const api::utils::uvec3& src_offset, + const api::utils::uvec3& dst_offset, + const VkFence fence_handle) { + // Serialize recording to the shared command buffer. Do not initialize with a + // mutex just yet, since in some cases it will be externally managed. + std::unique_lock cmd_lock; + // Refer to comments in submit_compute_job for explanation. + if (fence_handle == VK_NULL_HANDLE) { + cmd_lock = std::unique_lock(cmd_mutex_); + } + + set_cmd(); + +#ifdef USE_VULKAN_GPU_DIAGNOSTICS + std::string label = "cmd_copy"; + uint32_t log_idx = querypool_.shader_profile_begin( + cmd_, label, create_extent3d({0, 0, 0}), create_extent3d({0, 0, 0})); +#endif /* USE_VULKAN_GPU_DIAGNOSTICS */ + + cmd_.insert_barrier(pipeline_barrier); + + record_copy(cmd_, source, destination, copy_range, src_offset, dst_offset); + +#ifdef USE_VULKAN_GPU_DIAGNOSTICS + querypool_.shader_profile_end(cmd_, log_idx); +#endif /* USE_VULKAN_GPU_DIAGNOSTICS */ + + submit_count_++; + if (fence_handle != VK_NULL_HANDLE || + submit_count_ >= config_.cmdSubmitFrequency) { + submit_cmd_to_gpu(fence_handle); + } +} + template inline void Context::submit_compute_job( const ShaderSource& shader_descriptor, diff --git a/aten/src/ATen/native/vulkan/api/Resource.cpp b/aten/src/ATen/native/vulkan/api/Resource.cpp index 82b98579e051f..fde96a87f959e 100644 --- a/aten/src/ATen/native/vulkan/api/Resource.cpp +++ b/aten/src/ATen/native/vulkan/api/Resource.cpp @@ -10,6 +10,20 @@ namespace api { // Utility Functions // +/* + * This function is used to determine what image format to use for a given + * dtype. + * + * TODO: enable proper format selection between kFloat and kHalf. + * + * Context: due to limitations of the shader compilation system, at the moment + * it is not possible to support both 32 bit and 16 bit float formats since + * shaders will have to specify the format qualifier of texture inputs. Right + * now, shaders are compiled with either rgba16f or rgba32f qualifiers depending + * on whether USE_VULKAN_FP16_INFERENCE is set. Therefore, textures must be + * always created with the corresponding VkFormat. Consequently, kHalf tensors + * are currently unsupported in favor of enforcing inputs to be of kFloat dtype. + */ VkFormat vk_format(const caffe2::TypeMeta dtype) { switch (c10::typeMetaToScalarType(dtype)) { case kFloat: @@ -18,15 +32,34 @@ VkFormat vk_format(const caffe2::TypeMeta dtype) { #else return VK_FORMAT_R32G32B32A32_SFLOAT; #endif /* USE_VULKAN_FP16_INFERENCE */ - case c10::kQUInt8: return VK_FORMAT_R8G8B8A8_UINT; default: - TORCH_CHECK(false, "Vulkan tensor format not supported!"); + TORCH_CHECK( + false, "Vulkan vk_format(): no corresponding format for dtype"); + } +} + +/* + * This function is used to map a texture format to a corresponding + * c10::ScalarType. It is primarily used to set the data type of a + * StorageBuffer object that will receive copied data from a texture. 
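For readers following the new copy path above, here is a minimal sketch (an illustration, not code from this patch) of how `Context::submit_copy<>()` and the new `c10_scalartype()` mapping could be combined to read a texture back into a host-visible staging buffer. The header path, the namespace, the copy extents, the element count, the default-constructed `PipelineBarrier`, and the `buffer()` accessor on `StorageBuffer` are assumptions made for the example:

```
// Sketch only: assumes the ATen Vulkan api headers are available and that
// StorageBuffer still exposes its wrapped VulkanBuffer via buffer(), as the
// former StagingBuffer did.
#include <ATen/native/vulkan/api/api.h>

namespace example {
using namespace at::native::vulkan;

void copy_texture_to_staging(api::VulkanImage& image, const size_t numel) {
  api::Context* const context = api::context();

  // Derive the staging buffer dtype from the texture format via the new
  // c10_scalartype() mapping.
  api::StorageBuffer staging(
      context, api::c10_scalartype(image.format()), numel, /*gpuonly=*/false);

  api::PipelineBarrier pipeline_barrier{}; // assumed default-constructible
  const api::utils::uvec3 copy_range{4u, 4u, 1u}; // illustrative extents
  const api::utils::uvec3 offset{0u, 0u, 0u};

  // Dispatches to the copy_texture_to_buffer specialization of record_copy.
  context->submit_copy<api::VulkanImage, api::VulkanBuffer>(
      pipeline_barrier,
      image,
      staging.buffer(), // assumed accessor
      copy_range,
      offset,
      offset,
      VK_NULL_HANDLE);
}
} // namespace example
```
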
+ */ +c10::ScalarType c10_scalartype(const VkFormat image_format) { + switch (image_format) { + case VK_FORMAT_R32G32B32A32_SFLOAT: + return c10::kFloat; + case VK_FORMAT_R16G16B16A16_SFLOAT: + return c10::kHalf; + case VK_FORMAT_R8G8B8A8_UINT: + return c10::kQUInt8; + + default: + TORCH_CHECK(false, "vulkan c10_scalartype(): Unknown VkFormat."); } - return VK_FORMAT_UNDEFINED; } + // // MemoryBarrier // @@ -137,7 +170,8 @@ MemoryMap::MemoryMap(const VulkanBuffer& buffer, const uint8_t access) : access_(access), allocator_(buffer.vma_allocator()), allocation_(buffer.allocation()), - data_(nullptr) { + data_(nullptr), + data_len_{buffer.mem_size()} { VK_CHECK(vmaMapMemory(allocator_, allocation_, &data_)); } @@ -145,7 +179,8 @@ MemoryMap::MemoryMap(MemoryMap&& other) noexcept : access_(other.access_), allocator_(other.allocator_), allocation_(other.allocation_), - data_(other.data_) { + data_(other.data_), + data_len_{other.data_len_} { other.allocation_ = VK_NULL_HANDLE; other.data_ = nullptr; } @@ -158,8 +193,8 @@ MemoryMap::~MemoryMap() { if (access_ & MemoryAccessType::WRITE) { // Call will be ignored by implementation if the memory type this allocation // belongs to is not HOST_VISIBLE or is HOST_COHERENT, which is the behavior - // we want. - VK_CHECK(vmaFlushAllocation(allocator_, allocation_, 0u, VK_WHOLE_SIZE)); + // we want. Don't check the result here as the destructor cannot throw. + vmaFlushAllocation(allocator_, allocation_, 0u, VK_WHOLE_SIZE); } vmaUnmapMemory(allocator_, allocation_); @@ -480,6 +515,7 @@ VkSampler SamplerCache::retrieve(const SamplerCache::Key& key) { } void SamplerCache::purge() { + std::lock_guard lock(cache_mutex_); cache_.clear(); } diff --git a/aten/src/ATen/native/vulkan/api/Resource.h b/aten/src/ATen/native/vulkan/api/Resource.h index 1efd907b3246d..75df3aa88560c 100644 --- a/aten/src/ATen/native/vulkan/api/Resource.h +++ b/aten/src/ATen/native/vulkan/api/Resource.h @@ -3,6 +3,9 @@ #ifdef USE_VULKAN_API #include +#include + +#include #include #include @@ -16,6 +19,8 @@ typedef uint8_t MemoryAccessFlags; VkFormat vk_format(const caffe2::TypeMeta dtype); +c10::ScalarType c10_scalartype(const VkFormat image_format); + constexpr VmaAllocationCreateFlags DEFAULT_ALLOCATION_STRATEGY = VMA_ALLOCATION_CREATE_STRATEGY_MIN_MEMORY_BIT; @@ -104,6 +109,10 @@ class VulkanBuffer final { return buffer_properties_.mem_range; } + inline VkDeviceSize mem_size() const { + return buffer_properties_.size; + } + operator bool() const { return (allocation_ != VK_NULL_HANDLE); } @@ -128,6 +137,7 @@ class MemoryMap final { VmaAllocator allocator_; VmaAllocation allocation_; void* data_; + VkDeviceSize data_len_; public: template @@ -135,6 +145,10 @@ class MemoryMap final { return reinterpret_cast(data_); } + inline size_t nbytes() { + return utils::safe_downcast(data_len_); + } + void invalidate(); }; @@ -267,6 +281,10 @@ class VulkanImage final { return allocation_; } + inline VkFormat format() const { + return image_properties_.image_format; + } + inline VkExtent3D extents() const { return image_properties_.image_extents; } diff --git a/aten/src/ATen/native/vulkan/api/Runtime.cpp b/aten/src/ATen/native/vulkan/api/Runtime.cpp index a1c460fa4dc97..95cd716dcee40 100644 --- a/aten/src/ATen/native/vulkan/api/Runtime.cpp +++ b/aten/src/ATen/native/vulkan/api/Runtime.cpp @@ -253,6 +253,11 @@ std::unique_ptr init_global_vulkan_runtime() { try { return std::make_unique(Runtime(default_config)); + } catch (const c10::Error& e) { + TORCH_WARN( + "Pytorch Vulkan Runtime: Failed to 
initialize the global vulkan runtime! " + "The global vulkan runtime is invalid. Error: ", + e.what()); } catch (const std::exception& e) { TORCH_WARN( "Pytorch Vulkan Runtime: Failed to initialize the global vulkan runtime! " @@ -286,6 +291,10 @@ Runtime::Runtime(const RuntimeConfiguration config) case AdapterSelector::First: default_adapter_i_ = create_adapter(select_first); } + } catch (const c10::Error& e) { + TORCH_WARN( + "Pytorch Vulkan Runtime: Could not initialize default device! Error: ", + e.what()); } catch (const std::exception& e) { TORCH_WARN( "Pytorch Vulkan Runtime: Could not initialize default device! Error: ", @@ -372,10 +381,12 @@ Runtime* runtime() { // Runtime.h as it would have internal linkage. static const std::unique_ptr p_runtime = init_global_vulkan_runtime(); + TORCH_CHECK( p_runtime, "Pytorch Vulkan Runtime: The global runtime could not be retrieved " "because it failed to initialize."); + return p_runtime.get(); } diff --git a/aten/src/ATen/native/vulkan/api/vk_mem_alloc.h b/aten/src/ATen/native/vulkan/api/vk_mem_alloc.h deleted file mode 100644 index 7b04e54d944bd..0000000000000 --- a/aten/src/ATen/native/vulkan/api/vk_mem_alloc.h +++ /dev/null @@ -1,19558 +0,0 @@ -// -// Copyright (c) 2017-2022 Advanced Micro Devices, Inc. All rights reserved. -// -// Permission is hereby granted, free of charge, to any person obtaining a copy -// of this software and associated documentation files (the "Software"), to deal -// in the Software without restriction, including without limitation the rights -// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -// copies of the Software, and to permit persons to whom the Software is -// furnished to do so, subject to the following conditions: -// -// The above copyright notice and this permission notice shall be included in -// all copies or substantial portions of the Software. -// -// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN -// THE SOFTWARE. -// - -#ifndef AMD_VULKAN_MEMORY_ALLOCATOR_H -#define AMD_VULKAN_MEMORY_ALLOCATOR_H - -/** \mainpage Vulkan Memory Allocator - -Version 3.0.1 (2022-05-26) - -Copyright (c) 2017-2022 Advanced Micro Devices, Inc. All rights reserved. 
\n -License: MIT - -API documentation divided into groups: [Modules](modules.html) - -\section main_table_of_contents Table of contents - -- User guide - - \subpage quick_start - - [Project setup](@ref quick_start_project_setup) - - [Initialization](@ref quick_start_initialization) - - [Resource allocation](@ref quick_start_resource_allocation) - - \subpage choosing_memory_type - - [Usage](@ref choosing_memory_type_usage) - - [Required and preferred flags](@ref choosing_memory_type_required_preferred_flags) - - [Explicit memory types](@ref choosing_memory_type_explicit_memory_types) - - [Custom memory pools](@ref choosing_memory_type_custom_memory_pools) - - [Dedicated allocations](@ref choosing_memory_type_dedicated_allocations) - - \subpage memory_mapping - - [Mapping functions](@ref memory_mapping_mapping_functions) - - [Persistently mapped memory](@ref memory_mapping_persistently_mapped_memory) - - [Cache flush and invalidate](@ref memory_mapping_cache_control) - - \subpage staying_within_budget - - [Querying for budget](@ref staying_within_budget_querying_for_budget) - - [Controlling memory usage](@ref staying_within_budget_controlling_memory_usage) - - \subpage resource_aliasing - - \subpage custom_memory_pools - - [Choosing memory type index](@ref custom_memory_pools_MemTypeIndex) - - [Linear allocation algorithm](@ref linear_algorithm) - - [Free-at-once](@ref linear_algorithm_free_at_once) - - [Stack](@ref linear_algorithm_stack) - - [Double stack](@ref linear_algorithm_double_stack) - - [Ring buffer](@ref linear_algorithm_ring_buffer) - - \subpage defragmentation - - \subpage statistics - - [Numeric statistics](@ref statistics_numeric_statistics) - - [JSON dump](@ref statistics_json_dump) - - \subpage allocation_annotation - - [Allocation user data](@ref allocation_user_data) - - [Allocation names](@ref allocation_names) - - \subpage virtual_allocator - - \subpage debugging_memory_usage - - [Memory initialization](@ref debugging_memory_usage_initialization) - - [Margins](@ref debugging_memory_usage_margins) - - [Corruption detection](@ref debugging_memory_usage_corruption_detection) - - \subpage opengl_interop -- \subpage usage_patterns - - [GPU-only resource](@ref usage_patterns_gpu_only) - - [Staging copy for upload](@ref usage_patterns_staging_copy_upload) - - [Readback](@ref usage_patterns_readback) - - [Advanced data uploading](@ref usage_patterns_advanced_data_uploading) - - [Other use cases](@ref usage_patterns_other_use_cases) -- \subpage configuration - - [Pointers to Vulkan functions](@ref config_Vulkan_functions) - - [Custom host memory allocator](@ref custom_memory_allocator) - - [Device memory allocation callbacks](@ref allocation_callbacks) - - [Device heap memory limit](@ref heap_memory_limit) -- Extension support - - \subpage vk_khr_dedicated_allocation - - \subpage enabling_buffer_device_address - - \subpage vk_ext_memory_priority - - \subpage vk_amd_device_coherent_memory -- \subpage general_considerations - - [Thread safety](@ref general_considerations_thread_safety) - - [Versioning and compatibility](@ref general_considerations_versioning_and_compatibility) - - [Validation layer warnings](@ref general_considerations_validation_layer_warnings) - - [Allocation algorithm](@ref general_considerations_allocation_algorithm) - - [Features not supported](@ref general_considerations_features_not_supported) - -\section main_see_also See also - -- [**Product page on GPUOpen**](https://gpuopen.com/gaming-product/vulkan-memory-allocator/) -- [**Source repository on 
GitHub**](https://github.com/GPUOpen-LibrariesAndSDKs/VulkanMemoryAllocator) - -\defgroup group_init Library initialization - -\brief API elements related to the initialization and management of the entire library, especially #VmaAllocator object. - -\defgroup group_alloc Memory allocation - -\brief API elements related to the allocation, deallocation, and management of Vulkan memory, buffers, images. -Most basic ones being: vmaCreateBuffer(), vmaCreateImage(). - -\defgroup group_virtual Virtual allocator - -\brief API elements related to the mechanism of \ref virtual_allocator - using the core allocation algorithm -for user-defined purpose without allocating any real GPU memory. - -\defgroup group_stats Statistics - -\brief API elements that query current status of the allocator, from memory usage, budget, to full dump of the internal state in JSON format. -See documentation chapter: \ref statistics. -*/ - - -#ifdef __cplusplus -extern "C" { -#endif - -#ifndef VULKAN_H_ - #include -#endif - -// Define this macro to declare maximum supported Vulkan version in format AAABBBCCC, -// where AAA = major, BBB = minor, CCC = patch. -// If you want to use version > 1.0, it still needs to be enabled via VmaAllocatorCreateInfo::vulkanApiVersion. -#if !defined(VMA_VULKAN_VERSION) - #if defined(VK_VERSION_1_3) - #define VMA_VULKAN_VERSION 1003000 - #elif defined(VK_VERSION_1_2) - #define VMA_VULKAN_VERSION 1002000 - #elif defined(VK_VERSION_1_1) - #define VMA_VULKAN_VERSION 1001000 - #else - #define VMA_VULKAN_VERSION 1000000 - #endif -#endif - -#if defined(__ANDROID__) && defined(VK_NO_PROTOTYPES) && VMA_STATIC_VULKAN_FUNCTIONS - extern PFN_vkGetInstanceProcAddr vkGetInstanceProcAddr; - extern PFN_vkGetDeviceProcAddr vkGetDeviceProcAddr; - extern PFN_vkGetPhysicalDeviceProperties vkGetPhysicalDeviceProperties; - extern PFN_vkGetPhysicalDeviceMemoryProperties vkGetPhysicalDeviceMemoryProperties; - extern PFN_vkAllocateMemory vkAllocateMemory; - extern PFN_vkFreeMemory vkFreeMemory; - extern PFN_vkMapMemory vkMapMemory; - extern PFN_vkUnmapMemory vkUnmapMemory; - extern PFN_vkFlushMappedMemoryRanges vkFlushMappedMemoryRanges; - extern PFN_vkInvalidateMappedMemoryRanges vkInvalidateMappedMemoryRanges; - extern PFN_vkBindBufferMemory vkBindBufferMemory; - extern PFN_vkBindImageMemory vkBindImageMemory; - extern PFN_vkGetBufferMemoryRequirements vkGetBufferMemoryRequirements; - extern PFN_vkGetImageMemoryRequirements vkGetImageMemoryRequirements; - extern PFN_vkCreateBuffer vkCreateBuffer; - extern PFN_vkDestroyBuffer vkDestroyBuffer; - extern PFN_vkCreateImage vkCreateImage; - extern PFN_vkDestroyImage vkDestroyImage; - extern PFN_vkCmdCopyBuffer vkCmdCopyBuffer; - #if VMA_VULKAN_VERSION >= 1001000 - extern PFN_vkGetBufferMemoryRequirements2 vkGetBufferMemoryRequirements2; - extern PFN_vkGetImageMemoryRequirements2 vkGetImageMemoryRequirements2; - extern PFN_vkBindBufferMemory2 vkBindBufferMemory2; - extern PFN_vkBindImageMemory2 vkBindImageMemory2; - extern PFN_vkGetPhysicalDeviceMemoryProperties2 vkGetPhysicalDeviceMemoryProperties2; - #endif // #if VMA_VULKAN_VERSION >= 1001000 -#endif // #if defined(__ANDROID__) && VMA_STATIC_VULKAN_FUNCTIONS && VK_NO_PROTOTYPES - -#if !defined(VMA_DEDICATED_ALLOCATION) - #if VK_KHR_get_memory_requirements2 && VK_KHR_dedicated_allocation - #define VMA_DEDICATED_ALLOCATION 1 - #else - #define VMA_DEDICATED_ALLOCATION 0 - #endif -#endif - -#if !defined(VMA_BIND_MEMORY2) - #if VK_KHR_bind_memory2 - #define VMA_BIND_MEMORY2 1 - #else - #define VMA_BIND_MEMORY2 0 - #endif 
-#endif - -#if !defined(VMA_MEMORY_BUDGET) - #if VK_EXT_memory_budget && (VK_KHR_get_physical_device_properties2 || VMA_VULKAN_VERSION >= 1001000) - #define VMA_MEMORY_BUDGET 1 - #else - #define VMA_MEMORY_BUDGET 0 - #endif -#endif - -// Defined to 1 when VK_KHR_buffer_device_address device extension or equivalent core Vulkan 1.2 feature is defined in its headers. -#if !defined(VMA_BUFFER_DEVICE_ADDRESS) - #if VK_KHR_buffer_device_address || VMA_VULKAN_VERSION >= 1002000 - #define VMA_BUFFER_DEVICE_ADDRESS 1 - #else - #define VMA_BUFFER_DEVICE_ADDRESS 0 - #endif -#endif - -// Defined to 1 when VK_EXT_memory_priority device extension is defined in Vulkan headers. -#if !defined(VMA_MEMORY_PRIORITY) - #if VK_EXT_memory_priority - #define VMA_MEMORY_PRIORITY 1 - #else - #define VMA_MEMORY_PRIORITY 0 - #endif -#endif - -// Defined to 1 when VK_KHR_external_memory device extension is defined in Vulkan headers. -#if !defined(VMA_EXTERNAL_MEMORY) - #if VK_KHR_external_memory - #define VMA_EXTERNAL_MEMORY 1 - #else - #define VMA_EXTERNAL_MEMORY 0 - #endif -#endif - -// Define these macros to decorate all public functions with additional code, -// before and after returned type, appropriately. This may be useful for -// exporting the functions when compiling VMA as a separate library. Example: -// #define VMA_CALL_PRE __declspec(dllexport) -// #define VMA_CALL_POST __cdecl -#ifndef VMA_CALL_PRE - #define VMA_CALL_PRE -#endif -#ifndef VMA_CALL_POST - #define VMA_CALL_POST -#endif - -// Define this macro to decorate pointers with an attribute specifying the -// length of the array they point to if they are not null. -// -// The length may be one of -// - The name of another parameter in the argument list where the pointer is declared -// - The name of another member in the struct where the pointer is declared -// - The name of a member of a struct type, meaning the value of that member in -// the context of the call. For example -// VMA_LEN_IF_NOT_NULL("VkPhysicalDeviceMemoryProperties::memoryHeapCount"), -// this means the number of memory heaps available in the device associated -// with the VmaAllocator being dealt with. -#ifndef VMA_LEN_IF_NOT_NULL - #define VMA_LEN_IF_NOT_NULL(len) -#endif - -// The VMA_NULLABLE macro is defined to be _Nullable when compiling with Clang. -// see: https://clang.llvm.org/docs/AttributeReference.html#nullable -#ifndef VMA_NULLABLE - #ifdef __clang__ - #define VMA_NULLABLE _Nullable - #else - #define VMA_NULLABLE - #endif -#endif - -// The VMA_NOT_NULL macro is defined to be _Nonnull when compiling with Clang. 
-// see: https://clang.llvm.org/docs/AttributeReference.html#nonnull -#ifndef VMA_NOT_NULL - #ifdef __clang__ - #define VMA_NOT_NULL _Nonnull - #else - #define VMA_NOT_NULL - #endif -#endif - -// If non-dispatchable handles are represented as pointers then we can give -// then nullability annotations -#ifndef VMA_NOT_NULL_NON_DISPATCHABLE - #if defined(__LP64__) || defined(_WIN64) || (defined(__x86_64__) && !defined(__ILP32__) ) || defined(_M_X64) || defined(__ia64) || defined (_M_IA64) || defined(__aarch64__) || defined(__powerpc64__) - #define VMA_NOT_NULL_NON_DISPATCHABLE VMA_NOT_NULL - #else - #define VMA_NOT_NULL_NON_DISPATCHABLE - #endif -#endif - -#ifndef VMA_NULLABLE_NON_DISPATCHABLE - #if defined(__LP64__) || defined(_WIN64) || (defined(__x86_64__) && !defined(__ILP32__) ) || defined(_M_X64) || defined(__ia64) || defined (_M_IA64) || defined(__aarch64__) || defined(__powerpc64__) - #define VMA_NULLABLE_NON_DISPATCHABLE VMA_NULLABLE - #else - #define VMA_NULLABLE_NON_DISPATCHABLE - #endif -#endif - -#ifndef VMA_STATS_STRING_ENABLED - #define VMA_STATS_STRING_ENABLED 1 -#endif - -//////////////////////////////////////////////////////////////////////////////// -//////////////////////////////////////////////////////////////////////////////// -// -// INTERFACE -// -//////////////////////////////////////////////////////////////////////////////// -//////////////////////////////////////////////////////////////////////////////// - -// Sections for managing code placement in file, only for development purposes e.g. for convenient folding inside an IDE. -#ifndef _VMA_ENUM_DECLARATIONS - -/** -\addtogroup group_init -@{ -*/ - -/// Flags for created #VmaAllocator. -typedef enum VmaAllocatorCreateFlagBits -{ - /** \brief Allocator and all objects created from it will not be synchronized internally, so you must guarantee they are used from only one thread at a time or synchronized externally by you. - - Using this flag may increase performance because internal mutexes are not used. - */ - VMA_ALLOCATOR_CREATE_EXTERNALLY_SYNCHRONIZED_BIT = 0x00000001, - /** \brief Enables usage of VK_KHR_dedicated_allocation extension. - - The flag works only if VmaAllocatorCreateInfo::vulkanApiVersion `== VK_API_VERSION_1_0`. - When it is `VK_API_VERSION_1_1`, the flag is ignored because the extension has been promoted to Vulkan 1.1. - - Using this extension will automatically allocate dedicated blocks of memory for - some buffers and images instead of suballocating place for them out of bigger - memory blocks (as if you explicitly used #VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT - flag) when it is recommended by the driver. It may improve performance on some - GPUs. - - You may set this flag only if you found out that following device extensions are - supported, you enabled them while creating Vulkan device passed as - VmaAllocatorCreateInfo::device, and you want them to be used internally by this - library: - - - VK_KHR_get_memory_requirements2 (device extension) - - VK_KHR_dedicated_allocation (device extension) - - When this flag is set, you can experience following warnings reported by Vulkan - validation layer. You can ignore them. - - > vkBindBufferMemory(): Binding memory to buffer 0x2d but vkGetBufferMemoryRequirements() has not been called on that buffer. - */ - VMA_ALLOCATOR_CREATE_KHR_DEDICATED_ALLOCATION_BIT = 0x00000002, - /** - Enables usage of VK_KHR_bind_memory2 extension. - - The flag works only if VmaAllocatorCreateInfo::vulkanApiVersion `== VK_API_VERSION_1_0`. 
- When it is `VK_API_VERSION_1_1`, the flag is ignored because the extension has been promoted to Vulkan 1.1. - - You may set this flag only if you found out that this device extension is supported, - you enabled it while creating Vulkan device passed as VmaAllocatorCreateInfo::device, - and you want it to be used internally by this library. - - The extension provides functions `vkBindBufferMemory2KHR` and `vkBindImageMemory2KHR`, - which allow to pass a chain of `pNext` structures while binding. - This flag is required if you use `pNext` parameter in vmaBindBufferMemory2() or vmaBindImageMemory2(). - */ - VMA_ALLOCATOR_CREATE_KHR_BIND_MEMORY2_BIT = 0x00000004, - /** - Enables usage of VK_EXT_memory_budget extension. - - You may set this flag only if you found out that this device extension is supported, - you enabled it while creating Vulkan device passed as VmaAllocatorCreateInfo::device, - and you want it to be used internally by this library, along with another instance extension - VK_KHR_get_physical_device_properties2, which is required by it (or Vulkan 1.1, where this extension is promoted). - - The extension provides query for current memory usage and budget, which will probably - be more accurate than an estimation used by the library otherwise. - */ - VMA_ALLOCATOR_CREATE_EXT_MEMORY_BUDGET_BIT = 0x00000008, - /** - Enables usage of VK_AMD_device_coherent_memory extension. - - You may set this flag only if you: - - - found out that this device extension is supported and enabled it while creating Vulkan device passed as VmaAllocatorCreateInfo::device, - - checked that `VkPhysicalDeviceCoherentMemoryFeaturesAMD::deviceCoherentMemory` is true and set it while creating the Vulkan device, - - want it to be used internally by this library. - - The extension and accompanying device feature provide access to memory types with - `VK_MEMORY_PROPERTY_DEVICE_COHERENT_BIT_AMD` and `VK_MEMORY_PROPERTY_DEVICE_UNCACHED_BIT_AMD` flags. - They are useful mostly for writing breadcrumb markers - a common method for debugging GPU crash/hang/TDR. - - When the extension is not enabled, such memory types are still enumerated, but their usage is illegal. - To protect from this error, if you don't create the allocator with this flag, it will refuse to allocate any memory or create a custom pool in such memory type, - returning `VK_ERROR_FEATURE_NOT_PRESENT`. - */ - VMA_ALLOCATOR_CREATE_AMD_DEVICE_COHERENT_MEMORY_BIT = 0x00000010, - /** - Enables usage of "buffer device address" feature, which allows you to use function - `vkGetBufferDeviceAddress*` to get raw GPU pointer to a buffer and pass it for usage inside a shader. - - You may set this flag only if you: - - 1. (For Vulkan version < 1.2) Found as available and enabled device extension - VK_KHR_buffer_device_address. - This extension is promoted to core Vulkan 1.2. - 2. Found as available and enabled device feature `VkPhysicalDeviceBufferDeviceAddressFeatures::bufferDeviceAddress`. - - When this flag is set, you can create buffers with `VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT` using VMA. - The library automatically adds `VK_MEMORY_ALLOCATE_DEVICE_ADDRESS_BIT` to - allocated memory blocks wherever it might be needed. - - For more information, see documentation chapter \ref enabling_buffer_device_address. - */ - VMA_ALLOCATOR_CREATE_BUFFER_DEVICE_ADDRESS_BIT = 0x00000020, - /** - Enables usage of VK_EXT_memory_priority extension in the library. 
- - You may set this flag only if you found available and enabled this device extension, - along with `VkPhysicalDeviceMemoryPriorityFeaturesEXT::memoryPriority == VK_TRUE`, - while creating Vulkan device passed as VmaAllocatorCreateInfo::device. - - When this flag is used, VmaAllocationCreateInfo::priority and VmaPoolCreateInfo::priority - are used to set priorities of allocated Vulkan memory. Without it, these variables are ignored. - - A priority must be a floating-point value between 0 and 1, indicating the priority of the allocation relative to other memory allocations. - Larger values are higher priority. The granularity of the priorities is implementation-dependent. - It is automatically passed to every call to `vkAllocateMemory` done by the library using structure `VkMemoryPriorityAllocateInfoEXT`. - The value to be used for default priority is 0.5. - For more details, see the documentation of the VK_EXT_memory_priority extension. - */ - VMA_ALLOCATOR_CREATE_EXT_MEMORY_PRIORITY_BIT = 0x00000040, - - VMA_ALLOCATOR_CREATE_FLAG_BITS_MAX_ENUM = 0x7FFFFFFF -} VmaAllocatorCreateFlagBits; -/// See #VmaAllocatorCreateFlagBits. -typedef VkFlags VmaAllocatorCreateFlags; - -/** @} */ - -/** -\addtogroup group_alloc -@{ -*/ - -/// \brief Intended usage of the allocated memory. -typedef enum VmaMemoryUsage -{ - /** No intended memory usage specified. - Use other members of VmaAllocationCreateInfo to specify your requirements. - */ - VMA_MEMORY_USAGE_UNKNOWN = 0, - /** - \deprecated Obsolete, preserved for backward compatibility. - Prefers `VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT`. - */ - VMA_MEMORY_USAGE_GPU_ONLY = 1, - /** - \deprecated Obsolete, preserved for backward compatibility. - Guarantees `VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT` and `VK_MEMORY_PROPERTY_HOST_COHERENT_BIT`. - */ - VMA_MEMORY_USAGE_CPU_ONLY = 2, - /** - \deprecated Obsolete, preserved for backward compatibility. - Guarantees `VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT`, prefers `VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT`. - */ - VMA_MEMORY_USAGE_CPU_TO_GPU = 3, - /** - \deprecated Obsolete, preserved for backward compatibility. - Guarantees `VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT`, prefers `VK_MEMORY_PROPERTY_HOST_CACHED_BIT`. - */ - VMA_MEMORY_USAGE_GPU_TO_CPU = 4, - /** - \deprecated Obsolete, preserved for backward compatibility. - Prefers not `VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT`. - */ - VMA_MEMORY_USAGE_CPU_COPY = 5, - /** - Lazily allocated GPU memory having `VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT`. - Exists mostly on mobile platforms. Using it on desktop PC or other GPUs with no such memory type present will fail the allocation. - - Usage: Memory for transient attachment images (color attachments, depth attachments etc.), created with `VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT`. - - Allocations with this usage are always created as dedicated - it implies #VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT. - */ - VMA_MEMORY_USAGE_GPU_LAZILY_ALLOCATED = 6, - /** - Selects best memory type automatically. - This flag is recommended for most common use cases. - - When using this flag, if you want to map the allocation (using vmaMapMemory() or #VMA_ALLOCATION_CREATE_MAPPED_BIT), - you must pass one of the flags: #VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT or #VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT - in VmaAllocationCreateInfo::flags. - - It can be used only with functions that let the library know `VkBufferCreateInfo` or `VkImageCreateInfo`, e.g. 
- vmaCreateBuffer(), vmaCreateImage(), vmaFindMemoryTypeIndexForBufferInfo(), vmaFindMemoryTypeIndexForImageInfo() - and not with generic memory allocation functions. - */ - VMA_MEMORY_USAGE_AUTO = 7, - /** - Selects best memory type automatically with preference for GPU (device) memory. - - When using this flag, if you want to map the allocation (using vmaMapMemory() or #VMA_ALLOCATION_CREATE_MAPPED_BIT), - you must pass one of the flags: #VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT or #VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT - in VmaAllocationCreateInfo::flags. - - It can be used only with functions that let the library know `VkBufferCreateInfo` or `VkImageCreateInfo`, e.g. - vmaCreateBuffer(), vmaCreateImage(), vmaFindMemoryTypeIndexForBufferInfo(), vmaFindMemoryTypeIndexForImageInfo() - and not with generic memory allocation functions. - */ - VMA_MEMORY_USAGE_AUTO_PREFER_DEVICE = 8, - /** - Selects best memory type automatically with preference for CPU (host) memory. - - When using this flag, if you want to map the allocation (using vmaMapMemory() or #VMA_ALLOCATION_CREATE_MAPPED_BIT), - you must pass one of the flags: #VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT or #VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT - in VmaAllocationCreateInfo::flags. - - It can be used only with functions that let the library know `VkBufferCreateInfo` or `VkImageCreateInfo`, e.g. - vmaCreateBuffer(), vmaCreateImage(), vmaFindMemoryTypeIndexForBufferInfo(), vmaFindMemoryTypeIndexForImageInfo() - and not with generic memory allocation functions. - */ - VMA_MEMORY_USAGE_AUTO_PREFER_HOST = 9, - - VMA_MEMORY_USAGE_MAX_ENUM = 0x7FFFFFFF -} VmaMemoryUsage; - -/// Flags to be passed as VmaAllocationCreateInfo::flags. -typedef enum VmaAllocationCreateFlagBits -{ - /** \brief Set this flag if the allocation should have its own memory block. - - Use it for special, big resources, like fullscreen images used as attachments. - */ - VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT = 0x00000001, - - /** \brief Set this flag to only try to allocate from existing `VkDeviceMemory` blocks and never create new such block. - - If new allocation cannot be placed in any of the existing blocks, allocation - fails with `VK_ERROR_OUT_OF_DEVICE_MEMORY` error. - - You should not use #VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT and - #VMA_ALLOCATION_CREATE_NEVER_ALLOCATE_BIT at the same time. It makes no sense. - */ - VMA_ALLOCATION_CREATE_NEVER_ALLOCATE_BIT = 0x00000002, - /** \brief Set this flag to use a memory that will be persistently mapped and retrieve pointer to it. - - Pointer to mapped memory will be returned through VmaAllocationInfo::pMappedData. - - It is valid to use this flag for allocation made from memory type that is not - `HOST_VISIBLE`. This flag is then ignored and memory is not mapped. This is - useful if you need an allocation that is efficient to use on GPU - (`DEVICE_LOCAL`) and still want to map it directly if possible on platforms that - support it (e.g. Intel GPU). - */ - VMA_ALLOCATION_CREATE_MAPPED_BIT = 0x00000004, - /** \deprecated Preserved for backward compatibility. Consider using vmaSetAllocationName() instead. - - Set this flag to treat VmaAllocationCreateInfo::pUserData as pointer to a - null-terminated string. Instead of copying pointer value, a local copy of the - string is made and stored in allocation's `pName`. The string is automatically - freed together with the allocation. It is also used in vmaBuildStatsString(). 
- */ - VMA_ALLOCATION_CREATE_USER_DATA_COPY_STRING_BIT = 0x00000020, - /** Allocation will be created from upper stack in a double stack pool. - - This flag is only allowed for custom pools created with #VMA_POOL_CREATE_LINEAR_ALGORITHM_BIT flag. - */ - VMA_ALLOCATION_CREATE_UPPER_ADDRESS_BIT = 0x00000040, - /** Create both buffer/image and allocation, but don't bind them together. - It is useful when you want to bind yourself to do some more advanced binding, e.g. using some extensions. - The flag is meaningful only with functions that bind by default: vmaCreateBuffer(), vmaCreateImage(). - Otherwise it is ignored. - - If you want to make sure the new buffer/image is not tied to the new memory allocation - through `VkMemoryDedicatedAllocateInfoKHR` structure in case the allocation ends up in its own memory block, - use also flag #VMA_ALLOCATION_CREATE_CAN_ALIAS_BIT. - */ - VMA_ALLOCATION_CREATE_DONT_BIND_BIT = 0x00000080, - /** Create allocation only if additional device memory required for it, if any, won't exceed - memory budget. Otherwise return `VK_ERROR_OUT_OF_DEVICE_MEMORY`. - */ - VMA_ALLOCATION_CREATE_WITHIN_BUDGET_BIT = 0x00000100, - /** \brief Set this flag if the allocated memory will have aliasing resources. - - Usage of this flag prevents supplying `VkMemoryDedicatedAllocateInfoKHR` when #VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT is specified. - Otherwise created dedicated memory will not be suitable for aliasing resources, resulting in Vulkan Validation Layer errors. - */ - VMA_ALLOCATION_CREATE_CAN_ALIAS_BIT = 0x00000200, - /** - Requests possibility to map the allocation (using vmaMapMemory() or #VMA_ALLOCATION_CREATE_MAPPED_BIT). - - - If you use #VMA_MEMORY_USAGE_AUTO or other `VMA_MEMORY_USAGE_AUTO*` value, - you must use this flag to be able to map the allocation. Otherwise, mapping is incorrect. - - If you use other value of #VmaMemoryUsage, this flag is ignored and mapping is always possible in memory types that are `HOST_VISIBLE`. - This includes allocations created in \ref custom_memory_pools. - - Declares that mapped memory will only be written sequentially, e.g. using `memcpy()` or a loop writing number-by-number, - never read or accessed randomly, so a memory type can be selected that is uncached and write-combined. - - \warning Violating this declaration may work correctly, but will likely be very slow. - Watch out for implicit reads introduced by doing e.g. `pMappedData[i] += x;` - Better prepare your data in a local variable and `memcpy()` it to the mapped pointer all at once. - */ - VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT = 0x00000400, - /** - Requests possibility to map the allocation (using vmaMapMemory() or #VMA_ALLOCATION_CREATE_MAPPED_BIT). - - - If you use #VMA_MEMORY_USAGE_AUTO or other `VMA_MEMORY_USAGE_AUTO*` value, - you must use this flag to be able to map the allocation. Otherwise, mapping is incorrect. - - If you use other value of #VmaMemoryUsage, this flag is ignored and mapping is always possible in memory types that are `HOST_VISIBLE`. - This includes allocations created in \ref custom_memory_pools. - - Declares that mapped memory can be read, written, and accessed in random order, - so a `HOST_CACHED` memory type is required. 
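As a concrete illustration of the two host-access flags described above (not taken from this patch), a typical VMA 3.x readback buffer combines `VMA_MEMORY_USAGE_AUTO` with `VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT` and `VMA_ALLOCATION_CREATE_MAPPED_BIT`; the buffer size and the pre-existing `allocator` handle are assumptions:

```
// Sketch only: `allocator` is an already-created VmaAllocator, and
// <vulkan/vulkan.h> plus "vk_mem_alloc.h" are assumed to be included.
VkBuffer create_readback_buffer(VmaAllocator allocator, VmaAllocation* out_alloc) {
  VkBufferCreateInfo buf_info = {VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO};
  buf_info.size = 65536; // illustrative size
  buf_info.usage = VK_BUFFER_USAGE_TRANSFER_DST_BIT;

  VmaAllocationCreateInfo alloc_info = {};
  alloc_info.usage = VMA_MEMORY_USAGE_AUTO;
  alloc_info.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT |
      VMA_ALLOCATION_CREATE_MAPPED_BIT;

  VkBuffer buffer = VK_NULL_HANDLE;
  VmaAllocationInfo allocation_info = {};
  const VkResult result = vmaCreateBuffer(
      allocator, &buf_info, &alloc_info, &buffer, out_alloc, &allocation_info);
  if (VK_SUCCESS != result) {
    return VK_NULL_HANDLE;
  }
  // HOST_ACCESS_RANDOM selects a HOST_VISIBLE (and HOST_CACHED) memory type,
  // and MAPPED_BIT keeps it persistently mapped, so
  // allocation_info.pMappedData can be read back directly.
  return buffer;
}
```
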
- */ - VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT = 0x00000800, - /** - Together with #VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT or #VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT, - it says that despite request for host access, a not-`HOST_VISIBLE` memory type can be selected - if it may improve performance. - - By using this flag, you declare that you will check if the allocation ended up in a `HOST_VISIBLE` memory type - (e.g. using vmaGetAllocationMemoryProperties()) and if not, you will create some "staging" buffer and - issue an explicit transfer to write/read your data. - To prepare for this possibility, don't forget to add appropriate flags like - `VK_BUFFER_USAGE_TRANSFER_DST_BIT`, `VK_BUFFER_USAGE_TRANSFER_SRC_BIT` to the parameters of created buffer or image. - */ - VMA_ALLOCATION_CREATE_HOST_ACCESS_ALLOW_TRANSFER_INSTEAD_BIT = 0x00001000, - /** Allocation strategy that chooses smallest possible free range for the allocation - to minimize memory usage and fragmentation, possibly at the expense of allocation time. - */ - VMA_ALLOCATION_CREATE_STRATEGY_MIN_MEMORY_BIT = 0x00010000, - /** Allocation strategy that chooses first suitable free range for the allocation - - not necessarily in terms of the smallest offset but the one that is easiest and fastest to find - to minimize allocation time, possibly at the expense of allocation quality. - */ - VMA_ALLOCATION_CREATE_STRATEGY_MIN_TIME_BIT = 0x00020000, - /** Allocation strategy that chooses always the lowest offset in available space. - This is not the most efficient strategy but achieves highly packed data. - Used internally by defragmentation, not recomended in typical usage. - */ - VMA_ALLOCATION_CREATE_STRATEGY_MIN_OFFSET_BIT = 0x00040000, - /** Alias to #VMA_ALLOCATION_CREATE_STRATEGY_MIN_MEMORY_BIT. - */ - VMA_ALLOCATION_CREATE_STRATEGY_BEST_FIT_BIT = VMA_ALLOCATION_CREATE_STRATEGY_MIN_MEMORY_BIT, - /** Alias to #VMA_ALLOCATION_CREATE_STRATEGY_MIN_TIME_BIT. - */ - VMA_ALLOCATION_CREATE_STRATEGY_FIRST_FIT_BIT = VMA_ALLOCATION_CREATE_STRATEGY_MIN_TIME_BIT, - /** A bit mask to extract only `STRATEGY` bits from entire set of flags. - */ - VMA_ALLOCATION_CREATE_STRATEGY_MASK = - VMA_ALLOCATION_CREATE_STRATEGY_MIN_MEMORY_BIT | - VMA_ALLOCATION_CREATE_STRATEGY_MIN_TIME_BIT | - VMA_ALLOCATION_CREATE_STRATEGY_MIN_OFFSET_BIT, - - VMA_ALLOCATION_CREATE_FLAG_BITS_MAX_ENUM = 0x7FFFFFFF -} VmaAllocationCreateFlagBits; -/// See #VmaAllocationCreateFlagBits. -typedef VkFlags VmaAllocationCreateFlags; - -/// Flags to be passed as VmaPoolCreateInfo::flags. -typedef enum VmaPoolCreateFlagBits -{ - /** \brief Use this flag if you always allocate only buffers and linear images or only optimal images out of this pool and so Buffer-Image Granularity can be ignored. - - This is an optional optimization flag. - - If you always allocate using vmaCreateBuffer(), vmaCreateImage(), - vmaAllocateMemoryForBuffer(), then you don't need to use it because allocator - knows exact type of your allocations so it can handle Buffer-Image Granularity - in the optimal way. - - If you also allocate using vmaAllocateMemoryForImage() or vmaAllocateMemory(), - exact type of such allocations is not known, so allocator must be conservative - in handling Buffer-Image Granularity, which can lead to suboptimal allocation - (wasted memory). 
In that case, if you can make sure you always allocate only - buffers and linear images or only optimal images out of this pool, use this flag - to make allocator disregard Buffer-Image Granularity and so make allocations - faster and more optimal. - */ - VMA_POOL_CREATE_IGNORE_BUFFER_IMAGE_GRANULARITY_BIT = 0x00000002, - - /** \brief Enables alternative, linear allocation algorithm in this pool. - - Specify this flag to enable linear allocation algorithm, which always creates - new allocations after last one and doesn't reuse space from allocations freed in - between. It trades memory consumption for simplified algorithm and data - structure, which has better performance and uses less memory for metadata. - - By using this flag, you can achieve behavior of free-at-once, stack, - ring buffer, and double stack. - For details, see documentation chapter \ref linear_algorithm. - */ - VMA_POOL_CREATE_LINEAR_ALGORITHM_BIT = 0x00000004, - - /** Bit mask to extract only `ALGORITHM` bits from entire set of flags. - */ - VMA_POOL_CREATE_ALGORITHM_MASK = - VMA_POOL_CREATE_LINEAR_ALGORITHM_BIT, - - VMA_POOL_CREATE_FLAG_BITS_MAX_ENUM = 0x7FFFFFFF -} VmaPoolCreateFlagBits; -/// Flags to be passed as VmaPoolCreateInfo::flags. See #VmaPoolCreateFlagBits. -typedef VkFlags VmaPoolCreateFlags; - -/// Flags to be passed as VmaDefragmentationInfo::flags. -typedef enum VmaDefragmentationFlagBits -{ - /* \brief Use simple but fast algorithm for defragmentation. - May not achieve best results but will require least time to compute and least allocations to copy. - */ - VMA_DEFRAGMENTATION_FLAG_ALGORITHM_FAST_BIT = 0x1, - /* \brief Default defragmentation algorithm, applied also when no `ALGORITHM` flag is specified. - Offers a balance between defragmentation quality and the amount of allocations and bytes that need to be moved. - */ - VMA_DEFRAGMENTATION_FLAG_ALGORITHM_BALANCED_BIT = 0x2, - /* \brief Perform full defragmentation of memory. - Can result in notably more time to compute and allocations to copy, but will achieve best memory packing. - */ - VMA_DEFRAGMENTATION_FLAG_ALGORITHM_FULL_BIT = 0x4, - /** \brief Use the most roboust algorithm at the cost of time to compute and number of copies to make. - Only available when bufferImageGranularity is greater than 1, since it aims to reduce - alignment issues between different types of resources. - Otherwise falls back to same behavior as #VMA_DEFRAGMENTATION_FLAG_ALGORITHM_FULL_BIT. - */ - VMA_DEFRAGMENTATION_FLAG_ALGORITHM_EXTENSIVE_BIT = 0x8, - - /// A bit mask to extract only `ALGORITHM` bits from entire set of flags. - VMA_DEFRAGMENTATION_FLAG_ALGORITHM_MASK = - VMA_DEFRAGMENTATION_FLAG_ALGORITHM_FAST_BIT | - VMA_DEFRAGMENTATION_FLAG_ALGORITHM_BALANCED_BIT | - VMA_DEFRAGMENTATION_FLAG_ALGORITHM_FULL_BIT | - VMA_DEFRAGMENTATION_FLAG_ALGORITHM_EXTENSIVE_BIT, - - VMA_DEFRAGMENTATION_FLAG_BITS_MAX_ENUM = 0x7FFFFFFF -} VmaDefragmentationFlagBits; -/// See #VmaDefragmentationFlagBits. -typedef VkFlags VmaDefragmentationFlags; - -/// Operation performed on single defragmentation move. See structure #VmaDefragmentationMove. -typedef enum VmaDefragmentationMoveOperation -{ - /// Buffer/image has been recreated at `dstTmpAllocation`, data has been copied, old buffer/image has been destroyed. `srcAllocation` should be changed to point to the new place. This is the default value set by vmaBeginDefragmentationPass(). - VMA_DEFRAGMENTATION_MOVE_OPERATION_COPY = 0, - /// Set this value if you cannot move the allocation. 
New place reserved at `dstTmpAllocation` will be freed. `srcAllocation` will remain unchanged. - VMA_DEFRAGMENTATION_MOVE_OPERATION_IGNORE = 1, - /// Set this value if you decide to abandon the allocation and you destroyed the buffer/image. New place reserved at `dstTmpAllocation` will be freed, along with `srcAllocation`, which will be destroyed. - VMA_DEFRAGMENTATION_MOVE_OPERATION_DESTROY = 2, -} VmaDefragmentationMoveOperation; - -/** @} */ - -/** -\addtogroup group_virtual -@{ -*/ - -/// Flags to be passed as VmaVirtualBlockCreateInfo::flags. -typedef enum VmaVirtualBlockCreateFlagBits -{ - /** \brief Enables alternative, linear allocation algorithm in this virtual block. - - Specify this flag to enable linear allocation algorithm, which always creates - new allocations after last one and doesn't reuse space from allocations freed in - between. It trades memory consumption for simplified algorithm and data - structure, which has better performance and uses less memory for metadata. - - By using this flag, you can achieve behavior of free-at-once, stack, - ring buffer, and double stack. - For details, see documentation chapter \ref linear_algorithm. - */ - VMA_VIRTUAL_BLOCK_CREATE_LINEAR_ALGORITHM_BIT = 0x00000001, - - /** \brief Bit mask to extract only `ALGORITHM` bits from entire set of flags. - */ - VMA_VIRTUAL_BLOCK_CREATE_ALGORITHM_MASK = - VMA_VIRTUAL_BLOCK_CREATE_LINEAR_ALGORITHM_BIT, - - VMA_VIRTUAL_BLOCK_CREATE_FLAG_BITS_MAX_ENUM = 0x7FFFFFFF -} VmaVirtualBlockCreateFlagBits; -/// Flags to be passed as VmaVirtualBlockCreateInfo::flags. See #VmaVirtualBlockCreateFlagBits. -typedef VkFlags VmaVirtualBlockCreateFlags; - -/// Flags to be passed as VmaVirtualAllocationCreateInfo::flags. -typedef enum VmaVirtualAllocationCreateFlagBits -{ - /** \brief Allocation will be created from upper stack in a double stack pool. - - This flag is only allowed for virtual blocks created with #VMA_VIRTUAL_BLOCK_CREATE_LINEAR_ALGORITHM_BIT flag. - */ - VMA_VIRTUAL_ALLOCATION_CREATE_UPPER_ADDRESS_BIT = VMA_ALLOCATION_CREATE_UPPER_ADDRESS_BIT, - /** \brief Allocation strategy that tries to minimize memory usage. - */ - VMA_VIRTUAL_ALLOCATION_CREATE_STRATEGY_MIN_MEMORY_BIT = VMA_ALLOCATION_CREATE_STRATEGY_MIN_MEMORY_BIT, - /** \brief Allocation strategy that tries to minimize allocation time. - */ - VMA_VIRTUAL_ALLOCATION_CREATE_STRATEGY_MIN_TIME_BIT = VMA_ALLOCATION_CREATE_STRATEGY_MIN_TIME_BIT, - /** Allocation strategy that chooses always the lowest offset in available space. - This is not the most efficient strategy but achieves highly packed data. - */ - VMA_VIRTUAL_ALLOCATION_CREATE_STRATEGY_MIN_OFFSET_BIT = VMA_ALLOCATION_CREATE_STRATEGY_MIN_OFFSET_BIT, - /** \brief A bit mask to extract only `STRATEGY` bits from entire set of flags. - - These strategy flags are binary compatible with equivalent flags in #VmaAllocationCreateFlagBits. - */ - VMA_VIRTUAL_ALLOCATION_CREATE_STRATEGY_MASK = VMA_ALLOCATION_CREATE_STRATEGY_MASK, - - VMA_VIRTUAL_ALLOCATION_CREATE_FLAG_BITS_MAX_ENUM = 0x7FFFFFFF -} VmaVirtualAllocationCreateFlagBits; -/// Flags to be passed as VmaVirtualAllocationCreateInfo::flags. See #VmaVirtualAllocationCreateFlagBits. -typedef VkFlags VmaVirtualAllocationCreateFlags; - -/** @} */ - -#endif // _VMA_ENUM_DECLARATIONS - -#ifndef _VMA_DATA_TYPES_DECLARATIONS - -/** -\addtogroup group_init -@{ */ - -/** \struct VmaAllocator -\brief Represents main object of this library initialized. - -Fill structure #VmaAllocatorCreateInfo and call function vmaCreateAllocator() to create it. 
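A minimal allocator-creation sketch matching the description above (illustrative, not from this patch); the `instance`, `physical_device`, and `device` handles, the target Vulkan version, and the availability of `vk_mem_alloc.h` on the include path are assumptions:

```
// Sketch only: the Vulkan handles are created elsewhere.
#include "vk_mem_alloc.h"

VmaAllocator create_allocator(
    VkInstance instance, VkPhysicalDevice physical_device, VkDevice device) {
  VmaAllocatorCreateInfo create_info = {};
  create_info.vulkanApiVersion = VK_API_VERSION_1_1; // illustrative target
  create_info.instance = instance; // no longer optional as of VMA 3.0.0
  create_info.physicalDevice = physical_device;
  create_info.device = device;

  VmaAllocator allocator = VK_NULL_HANDLE;
  if (VK_SUCCESS != vmaCreateAllocator(&create_info, &allocator)) {
    return VK_NULL_HANDLE;
  }
  return allocator;
}
// At teardown, the matching call is vmaDestroyAllocator(allocator).
```
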
-Call function vmaDestroyAllocator() to destroy it. - -It is recommended to create just one object of this type per `VkDevice` object, -right after Vulkan is initialized and keep it alive until before Vulkan device is destroyed. -*/ -VK_DEFINE_HANDLE(VmaAllocator) - -/** @} */ - -/** -\addtogroup group_alloc -@{ -*/ - -/** \struct VmaPool -\brief Represents custom memory pool - -Fill structure VmaPoolCreateInfo and call function vmaCreatePool() to create it. -Call function vmaDestroyPool() to destroy it. - -For more information see [Custom memory pools](@ref choosing_memory_type_custom_memory_pools). -*/ -VK_DEFINE_HANDLE(VmaPool) - -/** \struct VmaAllocation -\brief Represents single memory allocation. - -It may be either dedicated block of `VkDeviceMemory` or a specific region of a bigger block of this type -plus unique offset. - -There are multiple ways to create such object. -You need to fill structure VmaAllocationCreateInfo. -For more information see [Choosing memory type](@ref choosing_memory_type). - -Although the library provides convenience functions that create Vulkan buffer or image, -allocate memory for it and bind them together, -binding of the allocation to a buffer or an image is out of scope of the allocation itself. -Allocation object can exist without buffer/image bound, -binding can be done manually by the user, and destruction of it can be done -independently of destruction of the allocation. - -The object also remembers its size and some other information. -To retrieve this information, use function vmaGetAllocationInfo() and inspect -returned structure VmaAllocationInfo. -*/ -VK_DEFINE_HANDLE(VmaAllocation) - -/** \struct VmaDefragmentationContext -\brief An opaque object that represents started defragmentation process. - -Fill structure #VmaDefragmentationInfo and call function vmaBeginDefragmentation() to create it. -Call function vmaEndDefragmentation() to destroy it. -*/ -VK_DEFINE_HANDLE(VmaDefragmentationContext) - -/** @} */ - -/** -\addtogroup group_virtual -@{ -*/ - -/** \struct VmaVirtualAllocation -\brief Represents single memory allocation done inside VmaVirtualBlock. - -Use it as a unique identifier to virtual allocation within the single block. - -Use value `VK_NULL_HANDLE` to represent a null/invalid allocation. -*/ -VK_DEFINE_NON_DISPATCHABLE_HANDLE(VmaVirtualAllocation); - -/** @} */ - -/** -\addtogroup group_virtual -@{ -*/ - -/** \struct VmaVirtualBlock -\brief Handle to a virtual block object that allows to use core allocation algorithm without allocating any real GPU memory. - -Fill in #VmaVirtualBlockCreateInfo structure and use vmaCreateVirtualBlock() to create it. Use vmaDestroyVirtualBlock() to destroy it. -For more information, see documentation chapter \ref virtual_allocator. - -This object is not thread-safe - should not be used from multiple threads simultaneously, must be synchronized externally. -*/ -VK_DEFINE_HANDLE(VmaVirtualBlock) - -/** @} */ - -/** -\addtogroup group_init -@{ -*/ - -/// Callback function called after successful vkAllocateMemory. -typedef void (VKAPI_PTR* PFN_vmaAllocateDeviceMemoryFunction)( - VmaAllocator VMA_NOT_NULL allocator, - uint32_t memoryType, - VkDeviceMemory VMA_NOT_NULL_NON_DISPATCHABLE memory, - VkDeviceSize size, - void* VMA_NULLABLE pUserData); - -/// Callback function called before vkFreeMemory. 
-typedef void (VKAPI_PTR* PFN_vmaFreeDeviceMemoryFunction)( - VmaAllocator VMA_NOT_NULL allocator, - uint32_t memoryType, - VkDeviceMemory VMA_NOT_NULL_NON_DISPATCHABLE memory, - VkDeviceSize size, - void* VMA_NULLABLE pUserData); - -/** \brief Set of callbacks that the library will call for `vkAllocateMemory` and `vkFreeMemory`. - -Provided for informative purpose, e.g. to gather statistics about number of -allocations or total amount of memory allocated in Vulkan. - -Used in VmaAllocatorCreateInfo::pDeviceMemoryCallbacks. -*/ -typedef struct VmaDeviceMemoryCallbacks -{ - /// Optional, can be null. - PFN_vmaAllocateDeviceMemoryFunction VMA_NULLABLE pfnAllocate; - /// Optional, can be null. - PFN_vmaFreeDeviceMemoryFunction VMA_NULLABLE pfnFree; - /// Optional, can be null. - void* VMA_NULLABLE pUserData; -} VmaDeviceMemoryCallbacks; - -/** \brief Pointers to some Vulkan functions - a subset used by the library. - -Used in VmaAllocatorCreateInfo::pVulkanFunctions. -*/ -typedef struct VmaVulkanFunctions -{ - /// Required when using VMA_DYNAMIC_VULKAN_FUNCTIONS. - PFN_vkGetInstanceProcAddr VMA_NULLABLE vkGetInstanceProcAddr; - /// Required when using VMA_DYNAMIC_VULKAN_FUNCTIONS. - PFN_vkGetDeviceProcAddr VMA_NULLABLE vkGetDeviceProcAddr; - PFN_vkGetPhysicalDeviceProperties VMA_NULLABLE vkGetPhysicalDeviceProperties; - PFN_vkGetPhysicalDeviceMemoryProperties VMA_NULLABLE vkGetPhysicalDeviceMemoryProperties; - PFN_vkAllocateMemory VMA_NULLABLE vkAllocateMemory; - PFN_vkFreeMemory VMA_NULLABLE vkFreeMemory; - PFN_vkMapMemory VMA_NULLABLE vkMapMemory; - PFN_vkUnmapMemory VMA_NULLABLE vkUnmapMemory; - PFN_vkFlushMappedMemoryRanges VMA_NULLABLE vkFlushMappedMemoryRanges; - PFN_vkInvalidateMappedMemoryRanges VMA_NULLABLE vkInvalidateMappedMemoryRanges; - PFN_vkBindBufferMemory VMA_NULLABLE vkBindBufferMemory; - PFN_vkBindImageMemory VMA_NULLABLE vkBindImageMemory; - PFN_vkGetBufferMemoryRequirements VMA_NULLABLE vkGetBufferMemoryRequirements; - PFN_vkGetImageMemoryRequirements VMA_NULLABLE vkGetImageMemoryRequirements; - PFN_vkCreateBuffer VMA_NULLABLE vkCreateBuffer; - PFN_vkDestroyBuffer VMA_NULLABLE vkDestroyBuffer; - PFN_vkCreateImage VMA_NULLABLE vkCreateImage; - PFN_vkDestroyImage VMA_NULLABLE vkDestroyImage; - PFN_vkCmdCopyBuffer VMA_NULLABLE vkCmdCopyBuffer; -#if VMA_DEDICATED_ALLOCATION || VMA_VULKAN_VERSION >= 1001000 - /// Fetch "vkGetBufferMemoryRequirements2" on Vulkan >= 1.1, fetch "vkGetBufferMemoryRequirements2KHR" when using VK_KHR_dedicated_allocation extension. - PFN_vkGetBufferMemoryRequirements2KHR VMA_NULLABLE vkGetBufferMemoryRequirements2KHR; - /// Fetch "vkGetImageMemoryRequirements2" on Vulkan >= 1.1, fetch "vkGetImageMemoryRequirements2KHR" when using VK_KHR_dedicated_allocation extension. - PFN_vkGetImageMemoryRequirements2KHR VMA_NULLABLE vkGetImageMemoryRequirements2KHR; -#endif -#if VMA_BIND_MEMORY2 || VMA_VULKAN_VERSION >= 1001000 - /// Fetch "vkBindBufferMemory2" on Vulkan >= 1.1, fetch "vkBindBufferMemory2KHR" when using VK_KHR_bind_memory2 extension. - PFN_vkBindBufferMemory2KHR VMA_NULLABLE vkBindBufferMemory2KHR; - /// Fetch "vkBindImageMemory2" on Vulkan >= 1.1, fetch "vkBindImageMemory2KHR" when using VK_KHR_bind_memory2 extension. 
- PFN_vkBindImageMemory2KHR VMA_NULLABLE vkBindImageMemory2KHR; -#endif -#if VMA_MEMORY_BUDGET || VMA_VULKAN_VERSION >= 1001000 - PFN_vkGetPhysicalDeviceMemoryProperties2KHR VMA_NULLABLE vkGetPhysicalDeviceMemoryProperties2KHR; -#endif -#if VMA_VULKAN_VERSION >= 1003000 - /// Fetch from "vkGetDeviceBufferMemoryRequirements" on Vulkan >= 1.3, but you can also fetch it from "vkGetDeviceBufferMemoryRequirementsKHR" if you enabled extension VK_KHR_maintenance4. - PFN_vkGetDeviceBufferMemoryRequirements VMA_NULLABLE vkGetDeviceBufferMemoryRequirements; - /// Fetch from "vkGetDeviceImageMemoryRequirements" on Vulkan >= 1.3, but you can also fetch it from "vkGetDeviceImageMemoryRequirementsKHR" if you enabled extension VK_KHR_maintenance4. - PFN_vkGetDeviceImageMemoryRequirements VMA_NULLABLE vkGetDeviceImageMemoryRequirements; -#endif -} VmaVulkanFunctions; - -/// Description of a Allocator to be created. -typedef struct VmaAllocatorCreateInfo -{ - /// Flags for created allocator. Use #VmaAllocatorCreateFlagBits enum. - VmaAllocatorCreateFlags flags; - /// Vulkan physical device. - /** It must be valid throughout whole lifetime of created allocator. */ - VkPhysicalDevice VMA_NOT_NULL physicalDevice; - /// Vulkan device. - /** It must be valid throughout whole lifetime of created allocator. */ - VkDevice VMA_NOT_NULL device; - /// Preferred size of a single `VkDeviceMemory` block to be allocated from large heaps > 1 GiB. Optional. - /** Set to 0 to use default, which is currently 256 MiB. */ - VkDeviceSize preferredLargeHeapBlockSize; - /// Custom CPU memory allocation callbacks. Optional. - /** Optional, can be null. When specified, will also be used for all CPU-side memory allocations. */ - const VkAllocationCallbacks* VMA_NULLABLE pAllocationCallbacks; - /// Informative callbacks for `vkAllocateMemory`, `vkFreeMemory`. Optional. - /** Optional, can be null. */ - const VmaDeviceMemoryCallbacks* VMA_NULLABLE pDeviceMemoryCallbacks; - /** \brief Either null or a pointer to an array of limits on maximum number of bytes that can be allocated out of particular Vulkan memory heap. - - If not NULL, it must be a pointer to an array of - `VkPhysicalDeviceMemoryProperties::memoryHeapCount` elements, defining limit on - maximum number of bytes that can be allocated out of particular Vulkan memory - heap. - - Any of the elements may be equal to `VK_WHOLE_SIZE`, which means no limit on that - heap. This is also the default in case of `pHeapSizeLimit` = NULL. - - If there is a limit defined for a heap: - - - If user tries to allocate more memory from that heap using this allocator, - the allocation fails with `VK_ERROR_OUT_OF_DEVICE_MEMORY`. - - If the limit is smaller than heap size reported in `VkMemoryHeap::size`, the - value of this limit will be reported instead when using vmaGetMemoryProperties(). - - Warning! Using this feature may not be equivalent to installing a GPU with - smaller amount of memory, because graphics driver doesn't necessary fail new - allocations with `VK_ERROR_OUT_OF_DEVICE_MEMORY` result when memory capacity is - exceeded. It may return success and just silently migrate some device memory - blocks to system RAM. This driver behavior can also be controlled using - VK_AMD_memory_overallocation_behavior extension. - */ - const VkDeviceSize* VMA_NULLABLE VMA_LEN_IF_NOT_NULL("VkPhysicalDeviceMemoryProperties::memoryHeapCount") pHeapSizeLimit; - - /** \brief Pointers to Vulkan functions. Can be null. - - For details see [Pointers to Vulkan functions](@ref config_Vulkan_functions). 
- */ - const VmaVulkanFunctions* VMA_NULLABLE pVulkanFunctions; - /** \brief Handle to Vulkan instance object. - - Starting from version 3.0.0 this member is no longer optional, it must be set! - */ - VkInstance VMA_NOT_NULL instance; - /** \brief Optional. The highest version of Vulkan that the application is designed to use. - - It must be a value in the format as created by macro `VK_MAKE_VERSION` or a constant like: `VK_API_VERSION_1_1`, `VK_API_VERSION_1_0`. - The patch version number specified is ignored. Only the major and minor versions are considered. - It must be less or equal (preferably equal) to value as passed to `vkCreateInstance` as `VkApplicationInfo::apiVersion`. - Only versions 1.0, 1.1, 1.2, 1.3 are supported by the current implementation. - Leaving it initialized to zero is equivalent to `VK_API_VERSION_1_0`. - */ - uint32_t vulkanApiVersion; -#if VMA_EXTERNAL_MEMORY - /** \brief Either null or a pointer to an array of external memory handle types for each Vulkan memory type. - - If not NULL, it must be a pointer to an array of `VkPhysicalDeviceMemoryProperties::memoryTypeCount` - elements, defining external memory handle types of particular Vulkan memory type, - to be passed using `VkExportMemoryAllocateInfoKHR`. - - Any of the elements may be equal to 0, which means not to use `VkExportMemoryAllocateInfoKHR` on this memory type. - This is also the default in case of `pTypeExternalMemoryHandleTypes` = NULL. - */ - const VkExternalMemoryHandleTypeFlagsKHR* VMA_NULLABLE VMA_LEN_IF_NOT_NULL("VkPhysicalDeviceMemoryProperties::memoryTypeCount") pTypeExternalMemoryHandleTypes; -#endif // #if VMA_EXTERNAL_MEMORY -} VmaAllocatorCreateInfo; - -/// Information about existing #VmaAllocator object. -typedef struct VmaAllocatorInfo -{ - /** \brief Handle to Vulkan instance object. - - This is the same value as has been passed through VmaAllocatorCreateInfo::instance. - */ - VkInstance VMA_NOT_NULL instance; - /** \brief Handle to Vulkan physical device object. - - This is the same value as has been passed through VmaAllocatorCreateInfo::physicalDevice. - */ - VkPhysicalDevice VMA_NOT_NULL physicalDevice; - /** \brief Handle to Vulkan device object. - - This is the same value as has been passed through VmaAllocatorCreateInfo::device. - */ - VkDevice VMA_NOT_NULL device; -} VmaAllocatorInfo; - -/** @} */ - -/** -\addtogroup group_stats -@{ -*/ - -/** \brief Calculated statistics of memory usage e.g. in a specific memory type, heap, custom pool, or total. - -These are fast to calculate. -See functions: vmaGetHeapBudgets(), vmaGetPoolStatistics(). -*/ -typedef struct VmaStatistics -{ - /** \brief Number of `VkDeviceMemory` objects - Vulkan memory blocks allocated. - */ - uint32_t blockCount; - /** \brief Number of #VmaAllocation objects allocated. - - Dedicated allocations have their own blocks, so each one adds 1 to `allocationCount` as well as `blockCount`. - */ - uint32_t allocationCount; - /** \brief Number of bytes allocated in `VkDeviceMemory` blocks. - - \note To avoid confusion, please be aware that what Vulkan calls an "allocation" - a whole `VkDeviceMemory` object - (e.g. as in `VkPhysicalDeviceLimits::maxMemoryAllocationCount`) is called a "block" in VMA, while VMA calls - "allocation" a #VmaAllocation object that represents a memory region sub-allocated from such block, usually for a single buffer or image. - */ - VkDeviceSize blockBytes; - /** \brief Total number of bytes occupied by all #VmaAllocation objects. - - Always less or equal than `blockBytes`. 
- Difference `(blockBytes - allocationBytes)` is the amount of memory allocated from Vulkan - but unused by any #VmaAllocation. - */ - VkDeviceSize allocationBytes; -} VmaStatistics; - -/** \brief More detailed statistics than #VmaStatistics. - -These are slower to calculate. Use for debugging purposes. -See functions: vmaCalculateStatistics(), vmaCalculatePoolStatistics(). - -Previous version of the statistics API provided averages, but they have been removed -because they can be easily calculated as: - -\code -VkDeviceSize allocationSizeAvg = detailedStats.statistics.allocationBytes / detailedStats.statistics.allocationCount; -VkDeviceSize unusedBytes = detailedStats.statistics.blockBytes - detailedStats.statistics.allocationBytes; -VkDeviceSize unusedRangeSizeAvg = unusedBytes / detailedStats.unusedRangeCount; -\endcode -*/ -typedef struct VmaDetailedStatistics -{ - /// Basic statistics. - VmaStatistics statistics; - /// Number of free ranges of memory between allocations. - uint32_t unusedRangeCount; - /// Smallest allocation size. `VK_WHOLE_SIZE` if there are 0 allocations. - VkDeviceSize allocationSizeMin; - /// Largest allocation size. 0 if there are 0 allocations. - VkDeviceSize allocationSizeMax; - /// Smallest empty range size. `VK_WHOLE_SIZE` if there are 0 empty ranges. - VkDeviceSize unusedRangeSizeMin; - /// Largest empty range size. 0 if there are 0 empty ranges. - VkDeviceSize unusedRangeSizeMax; -} VmaDetailedStatistics; - -/** \brief General statistics from current state of the Allocator - -total memory usage across all memory heaps and types. - -These are slower to calculate. Use for debugging purposes. -See function vmaCalculateStatistics(). -*/ -typedef struct VmaTotalStatistics -{ - VmaDetailedStatistics memoryType[VK_MAX_MEMORY_TYPES]; - VmaDetailedStatistics memoryHeap[VK_MAX_MEMORY_HEAPS]; - VmaDetailedStatistics total; -} VmaTotalStatistics; - -/** \brief Statistics of current memory usage and available budget for a specific memory heap. - -These are fast to calculate. -See function vmaGetHeapBudgets(). -*/ -typedef struct VmaBudget -{ - /** \brief Statistics fetched from the library. - */ - VmaStatistics statistics; - /** \brief Estimated current memory usage of the program, in bytes. - - Fetched from system using VK_EXT_memory_budget extension if enabled. - - It might be different than `statistics.blockBytes` (usually higher) due to additional implicit objects - also occupying the memory, like swapchain, pipelines, descriptor heaps, command buffers, or - `VkDeviceMemory` blocks allocated outside of this library, if any. - */ - VkDeviceSize usage; - /** \brief Estimated amount of memory available to the program, in bytes. - - Fetched from system using VK_EXT_memory_budget extension if enabled. - - It might be different (most probably smaller) than `VkMemoryHeap::size[heapIndex]` due to factors - external to the program, decided by the operating system. - Difference `budget - usage` is the amount of additional memory that can probably - be allocated without problems. Exceeding the budget may result in various problems. - */ - VkDeviceSize budget; -} VmaBudget; - -/** @} */ - -/** -\addtogroup group_alloc -@{ -*/ - -/** \brief Parameters of new #VmaAllocation. - -To be used with functions like vmaCreateBuffer(), vmaCreateImage(), and many others. -*/ -typedef struct VmaAllocationCreateInfo -{ - /// Use #VmaAllocationCreateFlagBits enum. - VmaAllocationCreateFlags flags; - /** \brief Intended usage of memory. 
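As a usage sketch for the VmaBudget structure above, the snippet below queries per-heap usage and budget through vmaGetHeapBudgets(), which is declared further down in this header. It assumes a valid `allocator` and the standard `<vector>` and `<cstdio>` headers.

```
// Sketch: compare current usage against the budget of every memory heap.
const VkPhysicalDeviceMemoryProperties* memProps = nullptr;
vmaGetMemoryProperties(allocator, &memProps);

std::vector<VmaBudget> budgets(memProps->memoryHeapCount);
vmaGetHeapBudgets(allocator, budgets.data());

for (uint32_t heap = 0; heap < memProps->memoryHeapCount; ++heap)
{
    std::printf("heap %u: %llu used of %llu budgeted bytes, %u blocks, %u allocations\n",
        heap,
        static_cast<unsigned long long>(budgets[heap].usage),
        static_cast<unsigned long long>(budgets[heap].budget),
        budgets[heap].statistics.blockCount,
        budgets[heap].statistics.allocationCount);
}
```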
- - You can leave #VMA_MEMORY_USAGE_UNKNOWN if you specify memory requirements in other way. \n - If `pool` is not null, this member is ignored. - */ - VmaMemoryUsage usage; - /** \brief Flags that must be set in a Memory Type chosen for an allocation. - - Leave 0 if you specify memory requirements in other way. \n - If `pool` is not null, this member is ignored.*/ - VkMemoryPropertyFlags requiredFlags; - /** \brief Flags that preferably should be set in a memory type chosen for an allocation. - - Set to 0 if no additional flags are preferred. \n - If `pool` is not null, this member is ignored. */ - VkMemoryPropertyFlags preferredFlags; - /** \brief Bitmask containing one bit set for every memory type acceptable for this allocation. - - Value 0 is equivalent to `UINT32_MAX` - it means any memory type is accepted if - it meets other requirements specified by this structure, with no further - restrictions on memory type index. \n - If `pool` is not null, this member is ignored. - */ - uint32_t memoryTypeBits; - /** \brief Pool that this allocation should be created in. - - Leave `VK_NULL_HANDLE` to allocate from default pool. If not null, members: - `usage`, `requiredFlags`, `preferredFlags`, `memoryTypeBits` are ignored. - */ - VmaPool VMA_NULLABLE pool; - /** \brief Custom general-purpose pointer that will be stored in #VmaAllocation, can be read as VmaAllocationInfo::pUserData and changed using vmaSetAllocationUserData(). - - If #VMA_ALLOCATION_CREATE_USER_DATA_COPY_STRING_BIT is used, it must be either - null or pointer to a null-terminated string. The string will be then copied to - internal buffer, so it doesn't need to be valid after allocation call. - */ - void* VMA_NULLABLE pUserData; - /** \brief A floating-point value between 0 and 1, indicating the priority of the allocation relative to other memory allocations. - - It is used only when #VMA_ALLOCATOR_CREATE_EXT_MEMORY_PRIORITY_BIT flag was used during creation of the #VmaAllocator object - and this allocation ends up as dedicated or is explicitly forced as dedicated using #VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT. - Otherwise, it has the priority of a memory block where it is placed and this variable is ignored. - */ - float priority; -} VmaAllocationCreateInfo; - -/// Describes parameter of created #VmaPool. -typedef struct VmaPoolCreateInfo -{ - /** \brief Vulkan memory type index to allocate this pool from. - */ - uint32_t memoryTypeIndex; - /** \brief Use combination of #VmaPoolCreateFlagBits. - */ - VmaPoolCreateFlags flags; - /** \brief Size of a single `VkDeviceMemory` block to be allocated as part of this pool, in bytes. Optional. - - Specify nonzero to set explicit, constant size of memory blocks used by this - pool. - - Leave 0 to use default and let the library manage block sizes automatically. - Sizes of particular blocks may vary. - In this case, the pool will also support dedicated allocations. - */ - VkDeviceSize blockSize; - /** \brief Minimum number of blocks to be always allocated in this pool, even if they stay empty. - - Set to 0 to have no preallocated blocks and allow the pool be completely empty. - */ - size_t minBlockCount; - /** \brief Maximum number of blocks that can be allocated in this pool. Optional. - - Set to 0 to use default, which is `SIZE_MAX`, which means no limit. - - Set to same value as VmaPoolCreateInfo::minBlockCount to have fixed amount of memory allocated - throughout whole lifetime of this pool. 
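Two common ways of filling VmaAllocationCreateInfo are sketched below: a GPU-only resource and a persistently mapped upload buffer. The sketch assumes a valid `allocator`, already filled VkBufferCreateInfo structures `vbCreateInfo` and `stagingBufCreateInfo` (hypothetical names), and the VMA_MEMORY_USAGE_AUTO* and host-access flags declared earlier in this header; vmaCreateBuffer() itself is declared further below.

```
// Sketch 1: device-local vertex buffer.
VmaAllocationCreateInfo gpuOnly = {};
gpuOnly.usage = VMA_MEMORY_USAGE_AUTO_PREFER_DEVICE;
gpuOnly.priority = 1.0f; // honored only with VMA_ALLOCATOR_CREATE_EXT_MEMORY_PRIORITY_BIT

VkBuffer vertexBuffer = VK_NULL_HANDLE;
VmaAllocation vertexAlloc = VK_NULL_HANDLE;
vmaCreateBuffer(allocator, &vbCreateInfo, &gpuOnly, &vertexBuffer, &vertexAlloc, nullptr);

// Sketch 2: persistently mapped staging buffer for sequential CPU writes.
VmaAllocationCreateInfo staging = {};
staging.usage = VMA_MEMORY_USAGE_AUTO;
staging.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT |
                VMA_ALLOCATION_CREATE_MAPPED_BIT;

VkBuffer stagingBuffer = VK_NULL_HANDLE;
VmaAllocation stagingAlloc = VK_NULL_HANDLE;
VmaAllocationInfo stagingAllocInfo = {};
vmaCreateBuffer(allocator, &stagingBufCreateInfo, &staging,
                &stagingBuffer, &stagingAlloc, &stagingAllocInfo);
// stagingAllocInfo.pMappedData stays valid for the allocation's whole lifetime.
```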
- */ - size_t maxBlockCount; - /** \brief A floating-point value between 0 and 1, indicating the priority of the allocations in this pool relative to other memory allocations. - - It is used only when #VMA_ALLOCATOR_CREATE_EXT_MEMORY_PRIORITY_BIT flag was used during creation of the #VmaAllocator object. - Otherwise, this variable is ignored. - */ - float priority; - /** \brief Additional minimum alignment to be used for all allocations created from this pool. Can be 0. - - Leave 0 (default) not to impose any additional alignment. If not 0, it must be a power of two. - It can be useful in cases where alignment returned by Vulkan by functions like `vkGetBufferMemoryRequirements` is not enough, - e.g. when doing interop with OpenGL. - */ - VkDeviceSize minAllocationAlignment; - /** \brief Additional `pNext` chain to be attached to `VkMemoryAllocateInfo` used for every allocation made by this pool. Optional. - - Optional, can be null. If not null, it must point to a `pNext` chain of structures that can be attached to `VkMemoryAllocateInfo`. - It can be useful for special needs such as adding `VkExportMemoryAllocateInfoKHR`. - Structures pointed by this member must remain alive and unchanged for the whole lifetime of the custom pool. - - Please note that some structures, e.g. `VkMemoryPriorityAllocateInfoEXT`, `VkMemoryDedicatedAllocateInfoKHR`, - can be attached automatically by this library when using other, more convenient of its features. - */ - void* VMA_NULLABLE pMemoryAllocateNext; -} VmaPoolCreateInfo; - -/** @} */ - -/** -\addtogroup group_alloc -@{ -*/ - -/// Parameters of #VmaAllocation objects, that can be retrieved using function vmaGetAllocationInfo(). -typedef struct VmaAllocationInfo -{ - /** \brief Memory type index that this allocation was allocated from. - - It never changes. - */ - uint32_t memoryType; - /** \brief Handle to Vulkan memory object. - - Same memory object can be shared by multiple allocations. - - It can change after the allocation is moved during \ref defragmentation. - */ - VkDeviceMemory VMA_NULLABLE_NON_DISPATCHABLE deviceMemory; - /** \brief Offset in `VkDeviceMemory` object to the beginning of this allocation, in bytes. `(deviceMemory, offset)` pair is unique to this allocation. - - You usually don't need to use this offset. If you create a buffer or an image together with the allocation using e.g. function - vmaCreateBuffer(), vmaCreateImage(), functions that operate on these resources refer to the beginning of the buffer or image, - not entire device memory block. Functions like vmaMapMemory(), vmaBindBufferMemory() also refer to the beginning of the allocation - and apply this offset automatically. - - It can change after the allocation is moved during \ref defragmentation. - */ - VkDeviceSize offset; - /** \brief Size of this allocation, in bytes. - - It never changes. - - \note Allocation size returned in this variable may be greater than the size - requested for the resource e.g. as `VkBufferCreateInfo::size`. Whole size of the - allocation is accessible for operations on memory e.g. using a pointer after - mapping with vmaMapMemory(), but operations on the resource e.g. using - `vkCmdCopyBuffer` must be limited to the size of the resource. - */ - VkDeviceSize size; - /** \brief Pointer to the beginning of this allocation as mapped data. - - If the allocation hasn't been mapped using vmaMapMemory() and hasn't been - created with #VMA_ALLOCATION_CREATE_MAPPED_BIT flag, this value is null. 
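The VmaAllocationInfo fields described above can be read back at any time with vmaGetAllocationInfo(). A short sketch, assuming a valid `allocator`, an existing allocation `alloc` created earlier, and `<cstring>` for `std::memset`:

```
// Sketch: inspect where an allocation ended up.
VmaAllocationInfo info = {};
vmaGetAllocationInfo(allocator, alloc, &info);

// info.deviceMemory + info.offset identify the sub-allocated region; the offset is
// applied automatically by vmaMapMemory()/vmaBindBufferMemory(), so it is rarely needed directly.
// info.size may be larger than the size requested for the resource.
if (info.pMappedData != nullptr)
{
    // Mapped persistently (VMA_ALLOCATION_CREATE_MAPPED_BIT) or currently mapped.
    std::memset(info.pMappedData, 0, static_cast<size_t>(info.size));
}
```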
- - It can change after call to vmaMapMemory(), vmaUnmapMemory(). - It can also change after the allocation is moved during \ref defragmentation. - */ - void* VMA_NULLABLE pMappedData; - /** \brief Custom general-purpose pointer that was passed as VmaAllocationCreateInfo::pUserData or set using vmaSetAllocationUserData(). - - It can change after call to vmaSetAllocationUserData() for this allocation. - */ - void* VMA_NULLABLE pUserData; - /** \brief Custom allocation name that was set with vmaSetAllocationName(). - - It can change after call to vmaSetAllocationName() for this allocation. - - Another way to set custom name is to pass it in VmaAllocationCreateInfo::pUserData with - additional flag #VMA_ALLOCATION_CREATE_USER_DATA_COPY_STRING_BIT set [DEPRECATED]. - */ - const char* VMA_NULLABLE pName; -} VmaAllocationInfo; - -/** \brief Parameters for defragmentation. - -To be used with function vmaBeginDefragmentation(). -*/ -typedef struct VmaDefragmentationInfo -{ - /// \brief Use combination of #VmaDefragmentationFlagBits. - VmaDefragmentationFlags flags; - /** \brief Custom pool to be defragmented. - - If null then default pools will undergo defragmentation process. - */ - VmaPool VMA_NULLABLE pool; - /** \brief Maximum numbers of bytes that can be copied during single pass, while moving allocations to different places. - - `0` means no limit. - */ - VkDeviceSize maxBytesPerPass; - /** \brief Maximum number of allocations that can be moved during single pass to a different place. - - `0` means no limit. - */ - uint32_t maxAllocationsPerPass; -} VmaDefragmentationInfo; - -/// Single move of an allocation to be done for defragmentation. -typedef struct VmaDefragmentationMove -{ - /// Operation to be performed on the allocation by vmaEndDefragmentationPass(). Default value is #VMA_DEFRAGMENTATION_MOVE_OPERATION_COPY. You can modify it. - VmaDefragmentationMoveOperation operation; - /// Allocation that should be moved. - VmaAllocation VMA_NOT_NULL srcAllocation; - /** \brief Temporary allocation pointing to destination memory that will replace `srcAllocation`. - - \warning Do not store this allocation in your data structures! It exists only temporarily, for the duration of the defragmentation pass, - to be used for binding new buffer/image to the destination memory using e.g. vmaBindBufferMemory(). - vmaEndDefragmentationPass() will destroy it and make `srcAllocation` point to this memory. - */ - VmaAllocation VMA_NOT_NULL dstTmpAllocation; -} VmaDefragmentationMove; - -/** \brief Parameters for incremental defragmentation steps. - -To be used with function vmaBeginDefragmentationPass(). -*/ -typedef struct VmaDefragmentationPassMoveInfo -{ - /// Number of elements in the `pMoves` array. - uint32_t moveCount; - /** \brief Array of moves to be performed by the user in the current defragmentation pass. - - Pointer to an array of `moveCount` elements, owned by VMA, created in vmaBeginDefragmentationPass(), destroyed in vmaEndDefragmentationPass(). - - For each element, you should: - - 1. Create a new buffer/image in the place pointed by VmaDefragmentationMove::dstMemory + VmaDefragmentationMove::dstOffset. - 2. Copy data from the VmaDefragmentationMove::srcAllocation e.g. using `vkCmdCopyBuffer`, `vkCmdCopyImage`. - 3. Make sure these commands finished executing on the GPU. - 4. Destroy the old buffer/image. - - Only then you can finish defragmentation pass by calling vmaEndDefragmentationPass(). - After this call, the allocation will point to the new place in memory. 
- - Alternatively, if you cannot move specific allocation, you can set VmaDefragmentationMove::operation to #VMA_DEFRAGMENTATION_MOVE_OPERATION_IGNORE. - - Alternatively, if you decide you want to completely remove the allocation: - - 1. Destroy its buffer/image. - 2. Set VmaDefragmentationMove::operation to #VMA_DEFRAGMENTATION_MOVE_OPERATION_DESTROY. - - Then, after vmaEndDefragmentationPass() the allocation will be freed. - */ - VmaDefragmentationMove* VMA_NULLABLE VMA_LEN_IF_NOT_NULL(moveCount) pMoves; -} VmaDefragmentationPassMoveInfo; - -/// Statistics returned for defragmentation process in function vmaEndDefragmentation(). -typedef struct VmaDefragmentationStats -{ - /// Total number of bytes that have been copied while moving allocations to different places. - VkDeviceSize bytesMoved; - /// Total number of bytes that have been released to the system by freeing empty `VkDeviceMemory` objects. - VkDeviceSize bytesFreed; - /// Number of allocations that have been moved to different places. - uint32_t allocationsMoved; - /// Number of empty `VkDeviceMemory` objects that have been released to the system. - uint32_t deviceMemoryBlocksFreed; -} VmaDefragmentationStats; - -/** @} */ - -/** -\addtogroup group_virtual -@{ -*/ - -/// Parameters of created #VmaVirtualBlock object to be passed to vmaCreateVirtualBlock(). -typedef struct VmaVirtualBlockCreateInfo -{ - /** \brief Total size of the virtual block. - - Sizes can be expressed in bytes or any units you want as long as you are consistent in using them. - For example, if you allocate from some array of structures, 1 can mean single instance of entire structure. - */ - VkDeviceSize size; - - /** \brief Use combination of #VmaVirtualBlockCreateFlagBits. - */ - VmaVirtualBlockCreateFlags flags; - - /** \brief Custom CPU memory allocation callbacks. Optional. - - Optional, can be null. When specified, they will be used for all CPU-side memory allocations. - */ - const VkAllocationCallbacks* VMA_NULLABLE pAllocationCallbacks; -} VmaVirtualBlockCreateInfo; - -/// Parameters of created virtual allocation to be passed to vmaVirtualAllocate(). -typedef struct VmaVirtualAllocationCreateInfo -{ - /** \brief Size of the allocation. - - Cannot be zero. - */ - VkDeviceSize size; - /** \brief Required alignment of the allocation. Optional. - - Must be power of two. Special value 0 has the same meaning as 1 - means no special alignment is required, so allocation can start at any offset. - */ - VkDeviceSize alignment; - /** \brief Use combination of #VmaVirtualAllocationCreateFlagBits. - */ - VmaVirtualAllocationCreateFlags flags; - /** \brief Custom pointer to be associated with the allocation. Optional. - - It can be any value and can be used for user-defined purposes. It can be fetched or changed later. - */ - void* VMA_NULLABLE pUserData; -} VmaVirtualAllocationCreateInfo; - -/// Parameters of an existing virtual allocation, returned by vmaGetVirtualAllocationInfo(). -typedef struct VmaVirtualAllocationInfo -{ - /** \brief Offset of the allocation. - - Offset at which the allocation was made. - */ - VkDeviceSize offset; - /** \brief Size of the allocation. - - Same value as passed in VmaVirtualAllocationCreateInfo::size. - */ - VkDeviceSize size; - /** \brief Custom pointer associated with the allocation. - - Same value as passed in VmaVirtualAllocationCreateInfo::pUserData or to vmaSetVirtualAllocationUserData(). 
- */ - void* VMA_NULLABLE pUserData; -} VmaVirtualAllocationInfo; - -/** @} */ - -#endif // _VMA_DATA_TYPES_DECLARATIONS - -#ifndef _VMA_FUNCTION_HEADERS - -/** -\addtogroup group_init -@{ -*/ - -/// Creates #VmaAllocator object. -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCreateAllocator( - const VmaAllocatorCreateInfo* VMA_NOT_NULL pCreateInfo, - VmaAllocator VMA_NULLABLE* VMA_NOT_NULL pAllocator); - -/// Destroys allocator object. -VMA_CALL_PRE void VMA_CALL_POST vmaDestroyAllocator( - VmaAllocator VMA_NULLABLE allocator); - -/** \brief Returns information about existing #VmaAllocator object - handle to Vulkan device etc. - -It might be useful if you want to keep just the #VmaAllocator handle and fetch other required handles to -`VkPhysicalDevice`, `VkDevice` etc. every time using this function. -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaGetAllocatorInfo( - VmaAllocator VMA_NOT_NULL allocator, - VmaAllocatorInfo* VMA_NOT_NULL pAllocatorInfo); - -/** -PhysicalDeviceProperties are fetched from physicalDevice by the allocator. -You can access it here, without fetching it again on your own. -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaGetPhysicalDeviceProperties( - VmaAllocator VMA_NOT_NULL allocator, - const VkPhysicalDeviceProperties* VMA_NULLABLE* VMA_NOT_NULL ppPhysicalDeviceProperties); - -/** -PhysicalDeviceMemoryProperties are fetched from physicalDevice by the allocator. -You can access it here, without fetching it again on your own. -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaGetMemoryProperties( - VmaAllocator VMA_NOT_NULL allocator, - const VkPhysicalDeviceMemoryProperties* VMA_NULLABLE* VMA_NOT_NULL ppPhysicalDeviceMemoryProperties); - -/** -\brief Given Memory Type Index, returns Property Flags of this memory type. - -This is just a convenience function. Same information can be obtained using -vmaGetMemoryProperties(). -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaGetMemoryTypeProperties( - VmaAllocator VMA_NOT_NULL allocator, - uint32_t memoryTypeIndex, - VkMemoryPropertyFlags* VMA_NOT_NULL pFlags); - -/** \brief Sets index of the current frame. -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaSetCurrentFrameIndex( - VmaAllocator VMA_NOT_NULL allocator, - uint32_t frameIndex); - -/** @} */ - -/** -\addtogroup group_stats -@{ -*/ - -/** \brief Retrieves statistics from current state of the Allocator. - -This function is called "calculate" not "get" because it has to traverse all -internal data structures, so it may be quite slow. Use it for debugging purposes. -For faster but more brief statistics suitable to be called every frame or every allocation, -use vmaGetHeapBudgets(). - -Note that when using allocator from multiple threads, returned information may immediately -become outdated. -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaCalculateStatistics( - VmaAllocator VMA_NOT_NULL allocator, - VmaTotalStatistics* VMA_NOT_NULL pStats); - -/** \brief Retrieves information about current memory usage and budget for all memory heaps. - -\param allocator -\param[out] pBudgets Must point to array with number of elements at least equal to number of memory heaps in physical device used. - -This function is called "get" not "calculate" because it is very fast, suitable to be called -every frame or every allocation. For more detailed statistics use vmaCalculateStatistics(). - -Note that when using allocator from multiple threads, returned information may immediately -become outdated. 
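A sketch of the debugging-oriented entry points above: bump the frame index once per frame (it feeds the budgeting logic) and, when needed, dump full statistics with the slow vmaCalculateStatistics() call. A valid `allocator` and `<cstdio>` are assumed.

```
// Once per frame, before allocating for that frame:
static uint32_t frameIndex = 0;
vmaSetCurrentFrameIndex(allocator, ++frameIndex);

// Occasionally (this walks all internal structures, so keep it out of hot paths):
VmaTotalStatistics stats = {};
vmaCalculateStatistics(allocator, &stats);
std::printf("total: %u blocks, %u allocations, %llu bytes allocated, %llu bytes unused\n",
    stats.total.statistics.blockCount,
    stats.total.statistics.allocationCount,
    static_cast<unsigned long long>(stats.total.statistics.allocationBytes),
    static_cast<unsigned long long>(stats.total.statistics.blockBytes -
                                    stats.total.statistics.allocationBytes));
```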
-*/ -VMA_CALL_PRE void VMA_CALL_POST vmaGetHeapBudgets( - VmaAllocator VMA_NOT_NULL allocator, - VmaBudget* VMA_NOT_NULL VMA_LEN_IF_NOT_NULL("VkPhysicalDeviceMemoryProperties::memoryHeapCount") pBudgets); - -/** @} */ - -/** -\addtogroup group_alloc -@{ -*/ - -/** -\brief Helps to find memoryTypeIndex, given memoryTypeBits and VmaAllocationCreateInfo. - -This algorithm tries to find a memory type that: - -- Is allowed by memoryTypeBits. -- Contains all the flags from pAllocationCreateInfo->requiredFlags. -- Matches intended usage. -- Has as many flags from pAllocationCreateInfo->preferredFlags as possible. - -\return Returns VK_ERROR_FEATURE_NOT_PRESENT if not found. Receiving such result -from this function or any other allocating function probably means that your -device doesn't support any memory type with requested features for the specific -type of resource you want to use it for. Please check parameters of your -resource, like image layout (OPTIMAL versus LINEAR) or mip level count. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaFindMemoryTypeIndex( - VmaAllocator VMA_NOT_NULL allocator, - uint32_t memoryTypeBits, - const VmaAllocationCreateInfo* VMA_NOT_NULL pAllocationCreateInfo, - uint32_t* VMA_NOT_NULL pMemoryTypeIndex); - -/** -\brief Helps to find memoryTypeIndex, given VkBufferCreateInfo and VmaAllocationCreateInfo. - -It can be useful e.g. to determine value to be used as VmaPoolCreateInfo::memoryTypeIndex. -It internally creates a temporary, dummy buffer that never has memory bound. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaFindMemoryTypeIndexForBufferInfo( - VmaAllocator VMA_NOT_NULL allocator, - const VkBufferCreateInfo* VMA_NOT_NULL pBufferCreateInfo, - const VmaAllocationCreateInfo* VMA_NOT_NULL pAllocationCreateInfo, - uint32_t* VMA_NOT_NULL pMemoryTypeIndex); - -/** -\brief Helps to find memoryTypeIndex, given VkImageCreateInfo and VmaAllocationCreateInfo. - -It can be useful e.g. to determine value to be used as VmaPoolCreateInfo::memoryTypeIndex. -It internally creates a temporary, dummy image that never has memory bound. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaFindMemoryTypeIndexForImageInfo( - VmaAllocator VMA_NOT_NULL allocator, - const VkImageCreateInfo* VMA_NOT_NULL pImageCreateInfo, - const VmaAllocationCreateInfo* VMA_NOT_NULL pAllocationCreateInfo, - uint32_t* VMA_NOT_NULL pMemoryTypeIndex); - -/** \brief Allocates Vulkan device memory and creates #VmaPool object. - -\param allocator Allocator object. -\param pCreateInfo Parameters of pool to create. -\param[out] pPool Handle to created pool. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCreatePool( - VmaAllocator VMA_NOT_NULL allocator, - const VmaPoolCreateInfo* VMA_NOT_NULL pCreateInfo, - VmaPool VMA_NULLABLE* VMA_NOT_NULL pPool); - -/** \brief Destroys #VmaPool object and frees Vulkan device memory. -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaDestroyPool( - VmaAllocator VMA_NOT_NULL allocator, - VmaPool VMA_NULLABLE pool); - -/** @} */ - -/** -\addtogroup group_stats -@{ -*/ - -/** \brief Retrieves statistics of existing #VmaPool object. - -\param allocator Allocator object. -\param pool Pool object. -\param[out] pPoolStats Statistics of specified pool. -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaGetPoolStatistics( - VmaAllocator VMA_NOT_NULL allocator, - VmaPool VMA_NOT_NULL pool, - VmaStatistics* VMA_NOT_NULL pPoolStats); - -/** \brief Retrieves detailed statistics of existing #VmaPool object. - -\param allocator Allocator object. -\param pool Pool object. -\param[out] pPoolStats Statistics of specified pool. 
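The find-index helpers above are typically used to seed VmaPoolCreateInfo::memoryTypeIndex. Below is a hedged sketch that picks a memory type for a representative buffer description, creates a custom pool, queries it, and destroys it; a valid `allocator` and a filled `exampleBufCreateInfo` (hypothetical name) are assumed.

```
// Sketch: create a custom pool for buffers matching exampleBufCreateInfo.
VmaAllocationCreateInfo sampleAllocCreateInfo = {};
sampleAllocCreateInfo.usage = VMA_MEMORY_USAGE_AUTO;

uint32_t memTypeIndex = 0;
VkResult res = vmaFindMemoryTypeIndexForBufferInfo(
    allocator, &exampleBufCreateInfo, &sampleAllocCreateInfo, &memTypeIndex);
// Check res != VK_ERROR_FEATURE_NOT_PRESENT before continuing.

VmaPoolCreateInfo poolCreateInfo = {};
poolCreateInfo.memoryTypeIndex = memTypeIndex;
poolCreateInfo.minBlockCount = 1;  // keep one block resident even when empty
poolCreateInfo.maxBlockCount = 0;  // 0 = no limit

VmaPool pool = VK_NULL_HANDLE;
vmaCreatePool(allocator, &poolCreateInfo, &pool);

// Allocations are routed to the pool via VmaAllocationCreateInfo::pool.
VmaStatistics poolStats = {};
vmaGetPoolStatistics(allocator, pool, &poolStats);

vmaDestroyPool(allocator, pool); // free all of the pool's allocations first
```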
-*/ -VMA_CALL_PRE void VMA_CALL_POST vmaCalculatePoolStatistics( - VmaAllocator VMA_NOT_NULL allocator, - VmaPool VMA_NOT_NULL pool, - VmaDetailedStatistics* VMA_NOT_NULL pPoolStats); - -/** @} */ - -/** -\addtogroup group_alloc -@{ -*/ - -/** \brief Checks magic number in margins around all allocations in given memory pool in search for corruptions. - -Corruption detection is enabled only when `VMA_DEBUG_DETECT_CORRUPTION` macro is defined to nonzero, -`VMA_DEBUG_MARGIN` is defined to nonzero and the pool is created in memory type that is -`HOST_VISIBLE` and `HOST_COHERENT`. For more information, see [Corruption detection](@ref debugging_memory_usage_corruption_detection). - -Possible return values: - -- `VK_ERROR_FEATURE_NOT_PRESENT` - corruption detection is not enabled for specified pool. -- `VK_SUCCESS` - corruption detection has been performed and succeeded. -- `VK_ERROR_UNKNOWN` - corruption detection has been performed and found memory corruptions around one of the allocations. - `VMA_ASSERT` is also fired in that case. -- Other value: Error returned by Vulkan, e.g. memory mapping failure. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCheckPoolCorruption( - VmaAllocator VMA_NOT_NULL allocator, - VmaPool VMA_NOT_NULL pool); - -/** \brief Retrieves name of a custom pool. - -After the call `ppName` is either null or points to an internally-owned null-terminated string -containing name of the pool that was previously set. The pointer becomes invalid when the pool is -destroyed or its name is changed using vmaSetPoolName(). -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaGetPoolName( - VmaAllocator VMA_NOT_NULL allocator, - VmaPool VMA_NOT_NULL pool, - const char* VMA_NULLABLE* VMA_NOT_NULL ppName); - -/** \brief Sets name of a custom pool. - -`pName` can be either null or pointer to a null-terminated string with new name for the pool. -Function makes internal copy of the string, so it can be changed or freed immediately after this call. -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaSetPoolName( - VmaAllocator VMA_NOT_NULL allocator, - VmaPool VMA_NOT_NULL pool, - const char* VMA_NULLABLE pName); - -/** \brief General purpose memory allocation. - -\param allocator -\param pVkMemoryRequirements -\param pCreateInfo -\param[out] pAllocation Handle to allocated memory. -\param[out] pAllocationInfo Optional. Information about allocated memory. It can be later fetched using function vmaGetAllocationInfo(). - -You should free the memory using vmaFreeMemory() or vmaFreeMemoryPages(). - -It is recommended to use vmaAllocateMemoryForBuffer(), vmaAllocateMemoryForImage(), -vmaCreateBuffer(), vmaCreateImage() instead whenever possible. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaAllocateMemory( - VmaAllocator VMA_NOT_NULL allocator, - const VkMemoryRequirements* VMA_NOT_NULL pVkMemoryRequirements, - const VmaAllocationCreateInfo* VMA_NOT_NULL pCreateInfo, - VmaAllocation VMA_NULLABLE* VMA_NOT_NULL pAllocation, - VmaAllocationInfo* VMA_NULLABLE pAllocationInfo); - -/** \brief General purpose memory allocation for multiple allocation objects at once. - -\param allocator Allocator object. -\param pVkMemoryRequirements Memory requirements for each allocation. -\param pCreateInfo Creation parameters for each allocation. -\param allocationCount Number of allocations to make. -\param[out] pAllocations Pointer to array that will be filled with handles to created allocations. -\param[out] pAllocationInfo Optional. Pointer to array that will be filled with parameters of created allocations. 
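Pool naming and corruption checking are small conveniences, but a short sketch may still help. It assumes a valid `allocator`, an existing custom `pool`, and a build where `VMA_DEBUG_DETECT_CORRUPTION` and `VMA_DEBUG_MARGIN` are defined to nonzero.

```
// Sketch: label a pool for debugging tools and JSON dumps, then scan it for corruption.
vmaSetPoolName(allocator, pool, "TexturePool");

const char* name = nullptr;
vmaGetPoolName(allocator, pool, &name); // valid until renamed or the pool is destroyed

VkResult res = vmaCheckPoolCorruption(allocator, pool);
if (res == VK_ERROR_UNKNOWN)
{
    // A margin around some allocation in this pool was overwritten.
}
else if (res == VK_ERROR_FEATURE_NOT_PRESENT)
{
    // Corruption detection is not enabled for this pool's memory type.
}
```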
- -You should free the memory using vmaFreeMemory() or vmaFreeMemoryPages(). - -Word "pages" is just a suggestion to use this function to allocate pieces of memory needed for sparse binding. -It is just a general purpose allocation function able to make multiple allocations at once. -It may be internally optimized to be more efficient than calling vmaAllocateMemory() `allocationCount` times. - -All allocations are made using same parameters. All of them are created out of the same memory pool and type. -If any allocation fails, all allocations already made within this function call are also freed, so that when -returned result is not `VK_SUCCESS`, `pAllocation` array is always entirely filled with `VK_NULL_HANDLE`. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaAllocateMemoryPages( - VmaAllocator VMA_NOT_NULL allocator, - const VkMemoryRequirements* VMA_NOT_NULL VMA_LEN_IF_NOT_NULL(allocationCount) pVkMemoryRequirements, - const VmaAllocationCreateInfo* VMA_NOT_NULL VMA_LEN_IF_NOT_NULL(allocationCount) pCreateInfo, - size_t allocationCount, - VmaAllocation VMA_NULLABLE* VMA_NOT_NULL VMA_LEN_IF_NOT_NULL(allocationCount) pAllocations, - VmaAllocationInfo* VMA_NULLABLE VMA_LEN_IF_NOT_NULL(allocationCount) pAllocationInfo); - -/** \brief Allocates memory suitable for given `VkBuffer`. - -\param allocator -\param buffer -\param pCreateInfo -\param[out] pAllocation Handle to allocated memory. -\param[out] pAllocationInfo Optional. Information about allocated memory. It can be later fetched using function vmaGetAllocationInfo(). - -It only creates #VmaAllocation. To bind the memory to the buffer, use vmaBindBufferMemory(). - -This is a special-purpose function. In most cases you should use vmaCreateBuffer(). - -You must free the allocation using vmaFreeMemory() when no longer needed. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaAllocateMemoryForBuffer( - VmaAllocator VMA_NOT_NULL allocator, - VkBuffer VMA_NOT_NULL_NON_DISPATCHABLE buffer, - const VmaAllocationCreateInfo* VMA_NOT_NULL pCreateInfo, - VmaAllocation VMA_NULLABLE* VMA_NOT_NULL pAllocation, - VmaAllocationInfo* VMA_NULLABLE pAllocationInfo); - -/** \brief Allocates memory suitable for given `VkImage`. - -\param allocator -\param image -\param pCreateInfo -\param[out] pAllocation Handle to allocated memory. -\param[out] pAllocationInfo Optional. Information about allocated memory. It can be later fetched using function vmaGetAllocationInfo(). - -It only creates #VmaAllocation. To bind the memory to the buffer, use vmaBindImageMemory(). - -This is a special-purpose function. In most cases you should use vmaCreateImage(). - -You must free the allocation using vmaFreeMemory() when no longer needed. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaAllocateMemoryForImage( - VmaAllocator VMA_NOT_NULL allocator, - VkImage VMA_NOT_NULL_NON_DISPATCHABLE image, - const VmaAllocationCreateInfo* VMA_NOT_NULL pCreateInfo, - VmaAllocation VMA_NULLABLE* VMA_NOT_NULL pAllocation, - VmaAllocationInfo* VMA_NULLABLE pAllocationInfo); - -/** \brief Frees memory previously allocated using vmaAllocateMemory(), vmaAllocateMemoryForBuffer(), or vmaAllocateMemoryForImage(). - -Passing `VK_NULL_HANDLE` as `allocation` is valid. Such function call is just skipped. -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaFreeMemory( - VmaAllocator VMA_NOT_NULL allocator, - const VmaAllocation VMA_NULLABLE allocation); - -/** \brief Frees memory and destroys multiple allocations. - -Word "pages" is just a suggestion to use this function to free pieces of memory used for sparse binding. 
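The separate allocate-then-bind path described above (as opposed to the one-call vmaCreateBuffer()) looks roughly like the sketch below. It assumes a valid `allocator`, a `VkBuffer buffer` created with plain `vkCreateBuffer()`, the `device` used to create the allocator, and vmaBindBufferMemory(), which is declared further down in this header.

```
// Sketch: allocate memory for an existing VkBuffer and bind it.
VmaAllocationCreateInfo allocCreateInfo = {};
allocCreateInfo.usage = VMA_MEMORY_USAGE_AUTO;

VmaAllocation allocation = VK_NULL_HANDLE;
VmaAllocationInfo allocInfo = {};
VkResult res = vmaAllocateMemoryForBuffer(allocator, buffer, &allocCreateInfo,
                                          &allocation, &allocInfo);
if (res == VK_SUCCESS)
{
    // Prefer this over vkBindBufferMemory(): VMA serializes binds/maps on the shared block.
    res = vmaBindBufferMemory(allocator, allocation, buffer);
}

// Later, when the buffer is no longer needed:
vkDestroyBuffer(device, buffer, nullptr);
vmaFreeMemory(allocator, allocation);
```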
-It is just a general purpose function to free memory and destroy allocations made using e.g. vmaAllocateMemory(), -vmaAllocateMemoryPages() and other functions. -It may be internally optimized to be more efficient than calling vmaFreeMemory() `allocationCount` times. - -Allocations in `pAllocations` array can come from any memory pools and types. -Passing `VK_NULL_HANDLE` as elements of `pAllocations` array is valid. Such entries are just skipped. -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaFreeMemoryPages( - VmaAllocator VMA_NOT_NULL allocator, - size_t allocationCount, - const VmaAllocation VMA_NULLABLE* VMA_NOT_NULL VMA_LEN_IF_NOT_NULL(allocationCount) pAllocations); - -/** \brief Returns current information about specified allocation. - -Current paramteres of given allocation are returned in `pAllocationInfo`. - -Although this function doesn't lock any mutex, so it should be quite efficient, -you should avoid calling it too often. -You can retrieve same VmaAllocationInfo structure while creating your resource, from function -vmaCreateBuffer(), vmaCreateImage(). You can remember it if you are sure parameters don't change -(e.g. due to defragmentation). -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaGetAllocationInfo( - VmaAllocator VMA_NOT_NULL allocator, - VmaAllocation VMA_NOT_NULL allocation, - VmaAllocationInfo* VMA_NOT_NULL pAllocationInfo); - -/** \brief Sets pUserData in given allocation to new value. - -The value of pointer `pUserData` is copied to allocation's `pUserData`. -It is opaque, so you can use it however you want - e.g. -as a pointer, ordinal number or some handle to you own data. -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaSetAllocationUserData( - VmaAllocator VMA_NOT_NULL allocator, - VmaAllocation VMA_NOT_NULL allocation, - void* VMA_NULLABLE pUserData); - -/** \brief Sets pName in given allocation to new value. - -`pName` must be either null, or pointer to a null-terminated string. The function -makes local copy of the string and sets it as allocation's `pName`. String -passed as pName doesn't need to be valid for whole lifetime of the allocation - -you can free it after this call. String previously pointed by allocation's -`pName` is freed from memory. -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaSetAllocationName( - VmaAllocator VMA_NOT_NULL allocator, - VmaAllocation VMA_NOT_NULL allocation, - const char* VMA_NULLABLE pName); - -/** -\brief Given an allocation, returns Property Flags of its memory type. - -This is just a convenience function. Same information can be obtained using -vmaGetAllocationInfo() + vmaGetMemoryProperties(). -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaGetAllocationMemoryProperties( - VmaAllocator VMA_NOT_NULL allocator, - VmaAllocation VMA_NOT_NULL allocation, - VkMemoryPropertyFlags* VMA_NOT_NULL pFlags); - -/** \brief Maps memory represented by given allocation and returns pointer to it. - -Maps memory represented by given allocation to make it accessible to CPU code. -When succeeded, `*ppData` contains pointer to first byte of this memory. - -\warning -If the allocation is part of a bigger `VkDeviceMemory` block, returned pointer is -correctly offsetted to the beginning of region assigned to this particular allocation. -Unlike the result of `vkMapMemory`, it points to the allocation, not to the beginning of the whole block. -You should not add VmaAllocationInfo::offset to it! 
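A short sketch of attaching per-allocation metadata with the two setters above; a valid `allocator` and an existing `allocation` are assumed, and `MyResourceRecord` is a hypothetical user-defined type.

```
// Sketch: associate app-side bookkeeping and a debug name with an allocation.
MyResourceRecord* record = new MyResourceRecord{};
vmaSetAllocationUserData(allocator, allocation, record);
vmaSetAllocationName(allocator, allocation, "SceneVertexBuffer"); // string is copied internally

VmaAllocationInfo info = {};
vmaGetAllocationInfo(allocator, allocation, &info);
auto* fetched = static_cast<MyResourceRecord*>(info.pUserData);
// info.pName now points to the internally owned copy of "SceneVertexBuffer".
```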
- -Mapping is internally reference-counted and synchronized, so despite raw Vulkan -function `vkMapMemory()` cannot be used to map same block of `VkDeviceMemory` -multiple times simultaneously, it is safe to call this function on allocations -assigned to the same memory block. Actual Vulkan memory will be mapped on first -mapping and unmapped on last unmapping. - -If the function succeeded, you must call vmaUnmapMemory() to unmap the -allocation when mapping is no longer needed or before freeing the allocation, at -the latest. - -It also safe to call this function multiple times on the same allocation. You -must call vmaUnmapMemory() same number of times as you called vmaMapMemory(). - -It is also safe to call this function on allocation created with -#VMA_ALLOCATION_CREATE_MAPPED_BIT flag. Its memory stays mapped all the time. -You must still call vmaUnmapMemory() same number of times as you called -vmaMapMemory(). You must not call vmaUnmapMemory() additional time to free the -"0-th" mapping made automatically due to #VMA_ALLOCATION_CREATE_MAPPED_BIT flag. - -This function fails when used on allocation made in memory type that is not -`HOST_VISIBLE`. - -This function doesn't automatically flush or invalidate caches. -If the allocation is made from a memory types that is not `HOST_COHERENT`, -you also need to use vmaInvalidateAllocation() / vmaFlushAllocation(), as required by Vulkan specification. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaMapMemory( - VmaAllocator VMA_NOT_NULL allocator, - VmaAllocation VMA_NOT_NULL allocation, - void* VMA_NULLABLE* VMA_NOT_NULL ppData); - -/** \brief Unmaps memory represented by given allocation, mapped previously using vmaMapMemory(). - -For details, see description of vmaMapMemory(). - -This function doesn't automatically flush or invalidate caches. -If the allocation is made from a memory types that is not `HOST_COHERENT`, -you also need to use vmaInvalidateAllocation() / vmaFlushAllocation(), as required by Vulkan specification. -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaUnmapMemory( - VmaAllocator VMA_NOT_NULL allocator, - VmaAllocation VMA_NOT_NULL allocation); - -/** \brief Flushes memory of given allocation. - -Calls `vkFlushMappedMemoryRanges()` for memory associated with given range of given allocation. -It needs to be called after writing to a mapped memory for memory types that are not `HOST_COHERENT`. -Unmap operation doesn't do that automatically. - -- `offset` must be relative to the beginning of allocation. -- `size` can be `VK_WHOLE_SIZE`. It means all memory from `offset` the the end of given allocation. -- `offset` and `size` don't have to be aligned. - They are internally rounded down/up to multiply of `nonCoherentAtomSize`. -- If `size` is 0, this call is ignored. -- If memory type that the `allocation` belongs to is not `HOST_VISIBLE` or it is `HOST_COHERENT`, - this call is ignored. - -Warning! `offset` and `size` are relative to the contents of given `allocation`. -If you mean whole allocation, you can pass 0 and `VK_WHOLE_SIZE`, respectively. -Do not pass allocation's offset as `offset`!!! - -This function returns the `VkResult` from `vkFlushMappedMemoryRanges` if it is -called, otherwise `VK_SUCCESS`. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaFlushAllocation( - VmaAllocator VMA_NOT_NULL allocator, - VmaAllocation VMA_NOT_NULL allocation, - VkDeviceSize offset, - VkDeviceSize size); - -/** \brief Invalidates memory of given allocation. 
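Putting vmaMapMemory(), vmaFlushAllocation(), and vmaUnmapMemory() together, a typical CPU-write sequence looks like the sketch below. It assumes a valid `allocator`, a host-visible `allocation`, source data `src`/`srcSize`, and `<cstring>`; the flush is skipped when the memory type is `HOST_COHERENT`.

```
// Sketch: upload data through a temporary mapping.
void* mapped = nullptr;
if (vmaMapMemory(allocator, allocation, &mapped) == VK_SUCCESS)
{
    std::memcpy(mapped, src, srcSize);

    VkMemoryPropertyFlags memFlags = 0;
    vmaGetAllocationMemoryProperties(allocator, allocation, &memFlags);
    if ((memFlags & VK_MEMORY_PROPERTY_HOST_COHERENT_BIT) == 0)
    {
        // Offset/size are relative to the allocation; VK_WHOLE_SIZE flushes to its end.
        vmaFlushAllocation(allocator, allocation, 0, VK_WHOLE_SIZE);
    }
    vmaUnmapMemory(allocator, allocation);
}
```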
-
-Calls `vkInvalidateMappedMemoryRanges()` for memory associated with given range of given allocation.
-It needs to be called before reading from a mapped memory for memory types that are not `HOST_COHERENT`.
-Map operation doesn't do that automatically.
-
-- `offset` must be relative to the beginning of allocation.
-- `size` can be `VK_WHOLE_SIZE`. It means all memory from `offset` to the end of given allocation.
-- `offset` and `size` don't have to be aligned.
-  They are internally rounded down/up to a multiple of `nonCoherentAtomSize`.
-- If `size` is 0, this call is ignored.
-- If memory type that the `allocation` belongs to is not `HOST_VISIBLE` or it is `HOST_COHERENT`,
-  this call is ignored.
-
-Warning! `offset` and `size` are relative to the contents of given `allocation`.
-If you mean whole allocation, you can pass 0 and `VK_WHOLE_SIZE`, respectively.
-Do not pass allocation's offset as `offset`!!!
-
-This function returns the `VkResult` from `vkInvalidateMappedMemoryRanges` if
-it is called, otherwise `VK_SUCCESS`.
-*/
-VMA_CALL_PRE VkResult VMA_CALL_POST vmaInvalidateAllocation(
-    VmaAllocator VMA_NOT_NULL allocator,
-    VmaAllocation VMA_NOT_NULL allocation,
-    VkDeviceSize offset,
-    VkDeviceSize size);
-
-/** \brief Flushes memory of given set of allocations.
-
-Calls `vkFlushMappedMemoryRanges()` for memory associated with given ranges of given allocations.
-For more information, see documentation of vmaFlushAllocation().
-
-\param allocator
-\param allocationCount
-\param allocations
-\param offsets If not null, it must point to an array of offsets of regions to flush, relative to the beginning of respective allocations. Null means all offsets are zero.
-\param sizes If not null, it must point to an array of sizes of regions to flush in respective allocations. Null means `VK_WHOLE_SIZE` for all allocations.
-
-This function returns the `VkResult` from `vkFlushMappedMemoryRanges` if it is
-called, otherwise `VK_SUCCESS`.
-*/
-VMA_CALL_PRE VkResult VMA_CALL_POST vmaFlushAllocations(
-    VmaAllocator VMA_NOT_NULL allocator,
-    uint32_t allocationCount,
-    const VmaAllocation VMA_NOT_NULL* VMA_NULLABLE VMA_LEN_IF_NOT_NULL(allocationCount) allocations,
-    const VkDeviceSize* VMA_NULLABLE VMA_LEN_IF_NOT_NULL(allocationCount) offsets,
-    const VkDeviceSize* VMA_NULLABLE VMA_LEN_IF_NOT_NULL(allocationCount) sizes);
-
-/** \brief Invalidates memory of given set of allocations.
-
-Calls `vkInvalidateMappedMemoryRanges()` for memory associated with given ranges of given allocations.
-For more information, see documentation of vmaInvalidateAllocation().
-
-\param allocator
-\param allocationCount
-\param allocations
-\param offsets If not null, it must point to an array of offsets of regions to invalidate, relative to the beginning of respective allocations. Null means all offsets are zero.
-\param sizes If not null, it must point to an array of sizes of regions to invalidate in respective allocations. Null means `VK_WHOLE_SIZE` for all allocations.
-
-This function returns the `VkResult` from `vkInvalidateMappedMemoryRanges` if it is
-called, otherwise `VK_SUCCESS`.
-*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaInvalidateAllocations( - VmaAllocator VMA_NOT_NULL allocator, - uint32_t allocationCount, - const VmaAllocation VMA_NOT_NULL* VMA_NULLABLE VMA_LEN_IF_NOT_NULL(allocationCount) allocations, - const VkDeviceSize* VMA_NULLABLE VMA_LEN_IF_NOT_NULL(allocationCount) offsets, - const VkDeviceSize* VMA_NULLABLE VMA_LEN_IF_NOT_NULL(allocationCount) sizes); - -/** \brief Checks magic number in margins around all allocations in given memory types (in both default and custom pools) in search for corruptions. - -\param allocator -\param memoryTypeBits Bit mask, where each bit set means that a memory type with that index should be checked. - -Corruption detection is enabled only when `VMA_DEBUG_DETECT_CORRUPTION` macro is defined to nonzero, -`VMA_DEBUG_MARGIN` is defined to nonzero and only for memory types that are -`HOST_VISIBLE` and `HOST_COHERENT`. For more information, see [Corruption detection](@ref debugging_memory_usage_corruption_detection). - -Possible return values: - -- `VK_ERROR_FEATURE_NOT_PRESENT` - corruption detection is not enabled for any of specified memory types. -- `VK_SUCCESS` - corruption detection has been performed and succeeded. -- `VK_ERROR_UNKNOWN` - corruption detection has been performed and found memory corruptions around one of the allocations. - `VMA_ASSERT` is also fired in that case. -- Other value: Error returned by Vulkan, e.g. memory mapping failure. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCheckCorruption( - VmaAllocator VMA_NOT_NULL allocator, - uint32_t memoryTypeBits); - -/** \brief Begins defragmentation process. - -\param allocator Allocator object. -\param pInfo Structure filled with parameters of defragmentation. -\param[out] pContext Context object that must be passed to vmaEndDefragmentation() to finish defragmentation. -\returns -- `VK_SUCCESS` if defragmentation can begin. -- `VK_ERROR_FEATURE_NOT_PRESENT` if defragmentation is not supported. - -For more information about defragmentation, see documentation chapter: -[Defragmentation](@ref defragmentation). -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaBeginDefragmentation( - VmaAllocator VMA_NOT_NULL allocator, - const VmaDefragmentationInfo* VMA_NOT_NULL pInfo, - VmaDefragmentationContext VMA_NULLABLE* VMA_NOT_NULL pContext); - -/** \brief Ends defragmentation process. - -\param allocator Allocator object. -\param context Context object that has been created by vmaBeginDefragmentation(). -\param[out] pStats Optional stats for the defragmentation. Can be null. - -Use this function to finish defragmentation started by vmaBeginDefragmentation(). -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaEndDefragmentation( - VmaAllocator VMA_NOT_NULL allocator, - VmaDefragmentationContext VMA_NOT_NULL context, - VmaDefragmentationStats* VMA_NULLABLE pStats); - -/** \brief Starts single defragmentation pass. - -\param allocator Allocator object. -\param context Context object that has been created by vmaBeginDefragmentation(). -\param[out] pPassInfo Computed informations for current pass. -\returns -- `VK_SUCCESS` if no more moves are possible. Then you can omit call to vmaEndDefragmentationPass() and simply end whole defragmentation. -- `VK_INCOMPLETE` if there are pending moves returned in `pPassInfo`. You need to perform them, call vmaEndDefragmentationPass(), - and then preferably try another pass with vmaBeginDefragmentationPass(). 
-*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaBeginDefragmentationPass( - VmaAllocator VMA_NOT_NULL allocator, - VmaDefragmentationContext VMA_NOT_NULL context, - VmaDefragmentationPassMoveInfo* VMA_NOT_NULL pPassInfo); - -/** \brief Ends single defragmentation pass. - -\param allocator Allocator object. -\param context Context object that has been created by vmaBeginDefragmentation(). -\param pPassInfo Computed informations for current pass filled by vmaBeginDefragmentationPass() and possibly modified by you. - -Returns `VK_SUCCESS` if no more moves are possible or `VK_INCOMPLETE` if more defragmentations are possible. - -Ends incremental defragmentation pass and commits all defragmentation moves from `pPassInfo`. -After this call: - -- Allocations at `pPassInfo[i].srcAllocation` that had `pPassInfo[i].operation ==` #VMA_DEFRAGMENTATION_MOVE_OPERATION_COPY - (which is the default) will be pointing to the new destination place. -- Allocation at `pPassInfo[i].srcAllocation` that had `pPassInfo[i].operation ==` #VMA_DEFRAGMENTATION_MOVE_OPERATION_DESTROY - will be freed. - -If no more moves are possible you can end whole defragmentation. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaEndDefragmentationPass( - VmaAllocator VMA_NOT_NULL allocator, - VmaDefragmentationContext VMA_NOT_NULL context, - VmaDefragmentationPassMoveInfo* VMA_NOT_NULL pPassInfo); - -/** \brief Binds buffer to allocation. - -Binds specified buffer to region of memory represented by specified allocation. -Gets `VkDeviceMemory` handle and offset from the allocation. -If you want to create a buffer, allocate memory for it and bind them together separately, -you should use this function for binding instead of standard `vkBindBufferMemory()`, -because it ensures proper synchronization so that when a `VkDeviceMemory` object is used by multiple -allocations, calls to `vkBind*Memory()` or `vkMapMemory()` won't happen from multiple threads simultaneously -(which is illegal in Vulkan). - -It is recommended to use function vmaCreateBuffer() instead of this one. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaBindBufferMemory( - VmaAllocator VMA_NOT_NULL allocator, - VmaAllocation VMA_NOT_NULL allocation, - VkBuffer VMA_NOT_NULL_NON_DISPATCHABLE buffer); - -/** \brief Binds buffer to allocation with additional parameters. - -\param allocator -\param allocation -\param allocationLocalOffset Additional offset to be added while binding, relative to the beginning of the `allocation`. Normally it should be 0. -\param buffer -\param pNext A chain of structures to be attached to `VkBindBufferMemoryInfoKHR` structure used internally. Normally it should be null. - -This function is similar to vmaBindBufferMemory(), but it provides additional parameters. - -If `pNext` is not null, #VmaAllocator object must have been created with #VMA_ALLOCATOR_CREATE_KHR_BIND_MEMORY2_BIT flag -or with VmaAllocatorCreateInfo::vulkanApiVersion `>= VK_API_VERSION_1_1`. Otherwise the call fails. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaBindBufferMemory2( - VmaAllocator VMA_NOT_NULL allocator, - VmaAllocation VMA_NOT_NULL allocation, - VkDeviceSize allocationLocalOffset, - VkBuffer VMA_NOT_NULL_NON_DISPATCHABLE buffer, - const void* VMA_NULLABLE pNext); - -/** \brief Binds image to allocation. - -Binds specified image to region of memory represented by specified allocation. -Gets `VkDeviceMemory` handle and offset from the allocation. 
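The defragmentation entry points above combine into the pass loop sketched below. It is only a sketch: the resource re-creation and GPU copy in the middle are application-specific and indicated by comments, error handling is abbreviated, and a valid `allocator` is assumed.

```
// Sketch: defragment the default pools.
VmaDefragmentationInfo defragInfo = {};
VmaDefragmentationContext defragCtx = VK_NULL_HANDLE;
VkResult res = vmaBeginDefragmentation(allocator, &defragInfo, &defragCtx);
if (res == VK_SUCCESS)
{
    for (;;)
    {
        VmaDefragmentationPassMoveInfo pass = {};
        res = vmaBeginDefragmentationPass(allocator, defragCtx, &pass);
        if (res != VK_INCOMPLETE) // VK_SUCCESS: nothing left to move; otherwise an error
            break;

        for (uint32_t i = 0; i < pass.moveCount; ++i)
        {
            // Create a replacement buffer/image, bind it to pass.pMoves[i].dstTmpAllocation
            // with vmaBindBufferMemory()/vmaBindImageMemory(), record a GPU copy from the
            // old resource, or set pass.pMoves[i].operation to
            // VMA_DEFRAGMENTATION_MOVE_OPERATION_IGNORE to skip this move.
        }
        // Submit the copies and wait for them to finish before ending the pass.

        res = vmaEndDefragmentationPass(allocator, defragCtx, &pass);
        if (res != VK_INCOMPLETE)
            break;
    }

    VmaDefragmentationStats stats = {};
    vmaEndDefragmentation(allocator, defragCtx, &stats);
}
```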
-If you want to create an image, allocate memory for it and bind them together separately, -you should use this function for binding instead of standard `vkBindImageMemory()`, -because it ensures proper synchronization so that when a `VkDeviceMemory` object is used by multiple -allocations, calls to `vkBind*Memory()` or `vkMapMemory()` won't happen from multiple threads simultaneously -(which is illegal in Vulkan). - -It is recommended to use function vmaCreateImage() instead of this one. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaBindImageMemory( - VmaAllocator VMA_NOT_NULL allocator, - VmaAllocation VMA_NOT_NULL allocation, - VkImage VMA_NOT_NULL_NON_DISPATCHABLE image); - -/** \brief Binds image to allocation with additional parameters. - -\param allocator -\param allocation -\param allocationLocalOffset Additional offset to be added while binding, relative to the beginning of the `allocation`. Normally it should be 0. -\param image -\param pNext A chain of structures to be attached to `VkBindImageMemoryInfoKHR` structure used internally. Normally it should be null. - -This function is similar to vmaBindImageMemory(), but it provides additional parameters. - -If `pNext` is not null, #VmaAllocator object must have been created with #VMA_ALLOCATOR_CREATE_KHR_BIND_MEMORY2_BIT flag -or with VmaAllocatorCreateInfo::vulkanApiVersion `>= VK_API_VERSION_1_1`. Otherwise the call fails. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaBindImageMemory2( - VmaAllocator VMA_NOT_NULL allocator, - VmaAllocation VMA_NOT_NULL allocation, - VkDeviceSize allocationLocalOffset, - VkImage VMA_NOT_NULL_NON_DISPATCHABLE image, - const void* VMA_NULLABLE pNext); - -/** \brief Creates a new `VkBuffer`, allocates and binds memory for it. - -\param allocator -\param pBufferCreateInfo -\param pAllocationCreateInfo -\param[out] pBuffer Buffer that was created. -\param[out] pAllocation Allocation that was created. -\param[out] pAllocationInfo Optional. Information about allocated memory. It can be later fetched using function vmaGetAllocationInfo(). - -This function automatically: - --# Creates buffer. --# Allocates appropriate memory for it. --# Binds the buffer with the memory. - -If any of these operations fail, buffer and allocation are not created, -returned value is negative error code, `*pBuffer` and `*pAllocation` are null. - -If the function succeeded, you must destroy both buffer and allocation when you -no longer need them using either convenience function vmaDestroyBuffer() or -separately, using `vkDestroyBuffer()` and vmaFreeMemory(). - -If #VMA_ALLOCATOR_CREATE_KHR_DEDICATED_ALLOCATION_BIT flag was used, -VK_KHR_dedicated_allocation extension is used internally to query driver whether -it requires or prefers the new buffer to have dedicated allocation. If yes, -and if dedicated allocation is possible -(#VMA_ALLOCATION_CREATE_NEVER_ALLOCATE_BIT is not used), it creates dedicated -allocation for this buffer, just like when using -#VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT. - -\note This function creates a new `VkBuffer`. Sub-allocation of parts of one large buffer, -although recommended as a good practice, is out of scope of this library and could be implemented -by the user as a higher-level logic on top of VMA. 
-*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCreateBuffer( - VmaAllocator VMA_NOT_NULL allocator, - const VkBufferCreateInfo* VMA_NOT_NULL pBufferCreateInfo, - const VmaAllocationCreateInfo* VMA_NOT_NULL pAllocationCreateInfo, - VkBuffer VMA_NULLABLE_NON_DISPATCHABLE* VMA_NOT_NULL pBuffer, - VmaAllocation VMA_NULLABLE* VMA_NOT_NULL pAllocation, - VmaAllocationInfo* VMA_NULLABLE pAllocationInfo); - -/** \brief Creates a buffer with additional minimum alignment. - -Similar to vmaCreateBuffer() but provides additional parameter `minAlignment` which allows to specify custom, -minimum alignment to be used when placing the buffer inside a larger memory block, which may be needed e.g. -for interop with OpenGL. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCreateBufferWithAlignment( - VmaAllocator VMA_NOT_NULL allocator, - const VkBufferCreateInfo* VMA_NOT_NULL pBufferCreateInfo, - const VmaAllocationCreateInfo* VMA_NOT_NULL pAllocationCreateInfo, - VkDeviceSize minAlignment, - VkBuffer VMA_NULLABLE_NON_DISPATCHABLE* VMA_NOT_NULL pBuffer, - VmaAllocation VMA_NULLABLE* VMA_NOT_NULL pAllocation, - VmaAllocationInfo* VMA_NULLABLE pAllocationInfo); - -/** \brief Creates a new `VkBuffer`, binds already created memory for it. - -\param allocator -\param allocation Allocation that provides memory to be used for binding new buffer to it. -\param pBufferCreateInfo -\param[out] pBuffer Buffer that was created. - -This function automatically: - --# Creates buffer. --# Binds the buffer with the supplied memory. - -If any of these operations fail, buffer is not created, -returned value is negative error code and `*pBuffer` is null. - -If the function succeeded, you must destroy the buffer when you -no longer need it using `vkDestroyBuffer()`. If you want to also destroy the corresponding -allocation you can use convenience function vmaDestroyBuffer(). -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCreateAliasingBuffer( - VmaAllocator VMA_NOT_NULL allocator, - VmaAllocation VMA_NOT_NULL allocation, - const VkBufferCreateInfo* VMA_NOT_NULL pBufferCreateInfo, - VkBuffer VMA_NULLABLE_NON_DISPATCHABLE* VMA_NOT_NULL pBuffer); - -/** \brief Destroys Vulkan buffer and frees allocated memory. - -This is just a convenience function equivalent to: - -\code -vkDestroyBuffer(device, buffer, allocationCallbacks); -vmaFreeMemory(allocator, allocation); -\endcode - -It it safe to pass null as buffer and/or allocation. -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaDestroyBuffer( - VmaAllocator VMA_NOT_NULL allocator, - VkBuffer VMA_NULLABLE_NON_DISPATCHABLE buffer, - VmaAllocation VMA_NULLABLE allocation); - -/// Function similar to vmaCreateBuffer(). -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCreateImage( - VmaAllocator VMA_NOT_NULL allocator, - const VkImageCreateInfo* VMA_NOT_NULL pImageCreateInfo, - const VmaAllocationCreateInfo* VMA_NOT_NULL pAllocationCreateInfo, - VkImage VMA_NULLABLE_NON_DISPATCHABLE* VMA_NOT_NULL pImage, - VmaAllocation VMA_NULLABLE* VMA_NOT_NULL pAllocation, - VmaAllocationInfo* VMA_NULLABLE pAllocationInfo); - -/// Function similar to vmaCreateAliasingBuffer(). -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCreateAliasingImage( - VmaAllocator VMA_NOT_NULL allocator, - VmaAllocation VMA_NOT_NULL allocation, - const VkImageCreateInfo* VMA_NOT_NULL pImageCreateInfo, - VkImage VMA_NULLABLE_NON_DISPATCHABLE* VMA_NOT_NULL pImage); - -/** \brief Destroys Vulkan image and frees allocated memory. 
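The extra `minAlignment` parameter of vmaCreateBufferWithAlignment() is easiest to see in a sketch; here a buffer intended for external-API interop is placed on a 4 KiB boundary. A valid `allocator` and a filled `bufCreateInfo` (hypothetical name) are assumed.

```
// Sketch: force a stricter placement alignment than vkGetBufferMemoryRequirements() reports.
VmaAllocationCreateInfo allocCreateInfo = {};
allocCreateInfo.usage = VMA_MEMORY_USAGE_AUTO;

VkBuffer interopBuffer = VK_NULL_HANDLE;
VmaAllocation interopAlloc = VK_NULL_HANDLE;
VkResult res = vmaCreateBufferWithAlignment(
    allocator, &bufCreateInfo, &allocCreateInfo,
    /*minAlignment=*/4096,
    &interopBuffer, &interopAlloc, nullptr);

// Destroy both at once when done; passing null handles here would also be safe.
vmaDestroyBuffer(allocator, interopBuffer, interopAlloc);
```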
- -This is just a convenience function equivalent to: - -\code -vkDestroyImage(device, image, allocationCallbacks); -vmaFreeMemory(allocator, allocation); -\endcode - -It it safe to pass null as image and/or allocation. -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaDestroyImage( - VmaAllocator VMA_NOT_NULL allocator, - VkImage VMA_NULLABLE_NON_DISPATCHABLE image, - VmaAllocation VMA_NULLABLE allocation); - -/** @} */ - -/** -\addtogroup group_virtual -@{ -*/ - -/** \brief Creates new #VmaVirtualBlock object. - -\param pCreateInfo Parameters for creation. -\param[out] pVirtualBlock Returned virtual block object or `VMA_NULL` if creation failed. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCreateVirtualBlock( - const VmaVirtualBlockCreateInfo* VMA_NOT_NULL pCreateInfo, - VmaVirtualBlock VMA_NULLABLE* VMA_NOT_NULL pVirtualBlock); - -/** \brief Destroys #VmaVirtualBlock object. - -Please note that you should consciously handle virtual allocations that could remain unfreed in the block. -You should either free them individually using vmaVirtualFree() or call vmaClearVirtualBlock() -if you are sure this is what you want. If you do neither, an assert is called. - -If you keep pointers to some additional metadata associated with your virtual allocations in their `pUserData`, -don't forget to free them. -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaDestroyVirtualBlock( - VmaVirtualBlock VMA_NULLABLE virtualBlock); - -/** \brief Returns true of the #VmaVirtualBlock is empty - contains 0 virtual allocations and has all its space available for new allocations. -*/ -VMA_CALL_PRE VkBool32 VMA_CALL_POST vmaIsVirtualBlockEmpty( - VmaVirtualBlock VMA_NOT_NULL virtualBlock); - -/** \brief Returns information about a specific virtual allocation within a virtual block, like its size and `pUserData` pointer. -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaGetVirtualAllocationInfo( - VmaVirtualBlock VMA_NOT_NULL virtualBlock, - VmaVirtualAllocation VMA_NOT_NULL_NON_DISPATCHABLE allocation, VmaVirtualAllocationInfo* VMA_NOT_NULL pVirtualAllocInfo); - -/** \brief Allocates new virtual allocation inside given #VmaVirtualBlock. - -If the allocation fails due to not enough free space available, `VK_ERROR_OUT_OF_DEVICE_MEMORY` is returned -(despite the function doesn't ever allocate actual GPU memory). -`pAllocation` is then set to `VK_NULL_HANDLE` and `pOffset`, if not null, it set to `UINT64_MAX`. - -\param virtualBlock Virtual block -\param pCreateInfo Parameters for the allocation -\param[out] pAllocation Returned handle of the new allocation -\param[out] pOffset Returned offset of the new allocation. Optional, can be null. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaVirtualAllocate( - VmaVirtualBlock VMA_NOT_NULL virtualBlock, - const VmaVirtualAllocationCreateInfo* VMA_NOT_NULL pCreateInfo, - VmaVirtualAllocation VMA_NULLABLE_NON_DISPATCHABLE* VMA_NOT_NULL pAllocation, - VkDeviceSize* VMA_NULLABLE pOffset); - -/** \brief Frees virtual allocation inside given #VmaVirtualBlock. - -It is correct to call this function with `allocation == VK_NULL_HANDLE` - it does nothing. -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaVirtualFree( - VmaVirtualBlock VMA_NOT_NULL virtualBlock, - VmaVirtualAllocation VMA_NULLABLE_NON_DISPATCHABLE allocation); - -/** \brief Frees all virtual allocations inside given #VmaVirtualBlock. - -You must either call this function or free each virtual allocation individually with vmaVirtualFree() -before destroying a virtual block. Otherwise, an assert is called. 
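The virtual allocator is independent of Vulkan memory: it only computes offsets inside a region the caller manages. A minimal sketch, assuming nothing beyond this header, in which a 1 MiB region is sub-allocated and then freed:

```
// Sketch: CPU-side sub-allocation bookkeeping with a virtual block.
VmaVirtualBlockCreateInfo blockCreateInfo = {};
blockCreateInfo.size = 1048576; // 1 MiB, in whatever units the caller uses consistently

VmaVirtualBlock block = VK_NULL_HANDLE;
VkResult res = vmaCreateVirtualBlock(&blockCreateInfo, &block);

VmaVirtualAllocationCreateInfo allocCreateInfo = {};
allocCreateInfo.size = 4096;
allocCreateInfo.alignment = 256;

VmaVirtualAllocation alloc = VK_NULL_HANDLE;
VkDeviceSize offset = 0;
res = vmaVirtualAllocate(block, &allocCreateInfo, &alloc, &offset);
// On success, [offset, offset + 4096) is reserved inside the 1 MiB region.

vmaVirtualFree(block, alloc);  // or vmaClearVirtualBlock(block) to drop everything at once
vmaDestroyVirtualBlock(block); // asserts if allocations are still outstanding
```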
-/** \brief Frees all virtual allocations inside given #VmaVirtualBlock.
-
-You must either call this function or free each virtual allocation individually with vmaVirtualFree()
-before destroying a virtual block. Otherwise, an assert is called.
-
-If you keep a pointer to some additional metadata associated with your virtual allocation in its `pUserData`,
-don't forget to free it as well.
-*/
-VMA_CALL_PRE void VMA_CALL_POST vmaClearVirtualBlock(
-    VmaVirtualBlock VMA_NOT_NULL virtualBlock);
-
-/** \brief Changes custom pointer associated with given virtual allocation.
-*/
-VMA_CALL_PRE void VMA_CALL_POST vmaSetVirtualAllocationUserData(
-    VmaVirtualBlock VMA_NOT_NULL virtualBlock,
-    VmaVirtualAllocation VMA_NOT_NULL_NON_DISPATCHABLE allocation,
-    void* VMA_NULLABLE pUserData);
-
-/** \brief Calculates and returns statistics about virtual allocations and memory usage in given #VmaVirtualBlock.
-
-This function is fast to call. For more detailed statistics, see vmaCalculateVirtualBlockStatistics().
-*/
-VMA_CALL_PRE void VMA_CALL_POST vmaGetVirtualBlockStatistics(
-    VmaVirtualBlock VMA_NOT_NULL virtualBlock,
-    VmaStatistics* VMA_NOT_NULL pStats);
-
-/** \brief Calculates and returns detailed statistics about virtual allocations and memory usage in given #VmaVirtualBlock.
-
-This function is slow to call. Use it for debugging purposes.
-For less detailed statistics, see vmaGetVirtualBlockStatistics().
-*/
-VMA_CALL_PRE void VMA_CALL_POST vmaCalculateVirtualBlockStatistics(
-    VmaVirtualBlock VMA_NOT_NULL virtualBlock,
-    VmaDetailedStatistics* VMA_NOT_NULL pStats);
-
-/** @} */
-
-#if VMA_STATS_STRING_ENABLED
-/**
-\addtogroup group_stats
-@{
-*/
-
-/** \brief Builds and returns a null-terminated string in JSON format with information about given #VmaVirtualBlock.
-\param virtualBlock Virtual block.
-\param[out] ppStatsString Returned string.
-\param detailedMap Pass `VK_FALSE` to only obtain statistics as returned by vmaCalculateVirtualBlockStatistics(). Pass `VK_TRUE` to also obtain full list of allocations and free spaces.
-
-The returned string must be freed using vmaFreeVirtualBlockStatsString().
-*/
-VMA_CALL_PRE void VMA_CALL_POST vmaBuildVirtualBlockStatsString(
-    VmaVirtualBlock VMA_NOT_NULL virtualBlock,
-    char* VMA_NULLABLE* VMA_NOT_NULL ppStatsString,
-    VkBool32 detailedMap);
-
-/// Frees a string returned by vmaBuildVirtualBlockStatsString().
-VMA_CALL_PRE void VMA_CALL_POST vmaFreeVirtualBlockStatsString(
-    VmaVirtualBlock VMA_NOT_NULL virtualBlock,
-    char* VMA_NULLABLE pStatsString);
-
-/** \brief Builds and returns statistics as a null-terminated string in JSON format.
-\param allocator
-\param[out] ppStatsString Must be freed using vmaFreeStatsString() function.
-\param detailedMap
-*/
-VMA_CALL_PRE void VMA_CALL_POST vmaBuildStatsString(
-    VmaAllocator VMA_NOT_NULL allocator,
-    char* VMA_NULLABLE* VMA_NOT_NULL ppStatsString,
-    VkBool32 detailedMap);
-
-VMA_CALL_PRE void VMA_CALL_POST vmaFreeStatsString(
-    VmaAllocator VMA_NOT_NULL allocator,
-    char* VMA_NULLABLE pStatsString);
-
-/** @} */
-
-#endif // VMA_STATS_STRING_ENABLED
-
-#endif // _VMA_FUNCTION_HEADERS
-
-#ifdef __cplusplus
-}
-#endif
-
-#endif // AMD_VULKAN_MEMORY_ALLOCATOR_H
-
-////////////////////////////////////////////////////////////////////////////////
-////////////////////////////////////////////////////////////////////////////////
-//
-// IMPLEMENTATION
-//
-////////////////////////////////////////////////////////////////////////////////
-////////////////////////////////////////////////////////////////////////////////
-
-// For Visual Studio IntelliSense.
-#if defined(__cplusplus) && defined(__INTELLISENSE__) -#define VMA_IMPLEMENTATION -#endif - -#ifdef VMA_IMPLEMENTATION -#undef VMA_IMPLEMENTATION - -#include -#include -#include -#include -#include - -#ifdef _MSC_VER - #include // For functions like __popcnt, _BitScanForward etc. -#endif -#if __cplusplus >= 202002L || _MSVC_LANG >= 202002L // C++20 - #include // For std::popcount -#endif - -/******************************************************************************* -CONFIGURATION SECTION - -Define some of these macros before each #include of this header or change them -here if you need other then default behavior depending on your environment. -*/ -#ifndef _VMA_CONFIGURATION - -/* -Define this macro to 1 to make the library fetch pointers to Vulkan functions -internally, like: - - vulkanFunctions.vkAllocateMemory = &vkAllocateMemory; -*/ -#if !defined(VMA_STATIC_VULKAN_FUNCTIONS) && !defined(VK_NO_PROTOTYPES) - #define VMA_STATIC_VULKAN_FUNCTIONS 1 -#endif - -/* -Define this macro to 1 to make the library fetch pointers to Vulkan functions -internally, like: - - vulkanFunctions.vkAllocateMemory = (PFN_vkAllocateMemory)vkGetDeviceProcAddr(device, "vkAllocateMemory"); - -To use this feature in new versions of VMA you now have to pass -VmaVulkanFunctions::vkGetInstanceProcAddr and vkGetDeviceProcAddr as -VmaAllocatorCreateInfo::pVulkanFunctions. Other members can be null. -*/ -#if !defined(VMA_DYNAMIC_VULKAN_FUNCTIONS) - #define VMA_DYNAMIC_VULKAN_FUNCTIONS 1 -#endif - -#ifndef VMA_USE_STL_SHARED_MUTEX - // Compiler conforms to C++17. - #if __cplusplus >= 201703L - #define VMA_USE_STL_SHARED_MUTEX 1 - // Visual studio defines __cplusplus properly only when passed additional parameter: /Zc:__cplusplus - // Otherwise it is always 199711L, despite shared_mutex works since Visual Studio 2015 Update 2. - #elif defined(_MSC_FULL_VER) && _MSC_FULL_VER >= 190023918 && __cplusplus == 199711L && _MSVC_LANG >= 201703L - #define VMA_USE_STL_SHARED_MUTEX 1 - #else - #define VMA_USE_STL_SHARED_MUTEX 0 - #endif -#endif - -/* -Define this macro to include custom header files without having to edit this file directly, e.g.: - - // Inside of "my_vma_configuration_user_includes.h": - - #include "my_custom_assert.h" // for MY_CUSTOM_ASSERT - #include "my_custom_min.h" // for my_custom_min - #include - #include - - // Inside a different file, which includes "vk_mem_alloc.h": - - #define VMA_CONFIGURATION_USER_INCLUDES_H "my_vma_configuration_user_includes.h" - #define VMA_ASSERT(expr) MY_CUSTOM_ASSERT(expr) - #define VMA_MIN(v1, v2) (my_custom_min(v1, v2)) - #include "vk_mem_alloc.h" - ... - -The following headers are used in this CONFIGURATION section only, so feel free to -remove them if not needed. -*/ -#if !defined(VMA_CONFIGURATION_USER_INCLUDES_H) - #include // for assert - #include // for min, max - #include -#else - #include VMA_CONFIGURATION_USER_INCLUDES_H -#endif - -#ifndef VMA_NULL - // Value used as null pointer. Define it to e.g.: nullptr, NULL, 0, (void*)0. 
- #define VMA_NULL nullptr -#endif - -#if defined(__ANDROID_API__) && (__ANDROID_API__ < 16) -#include -static void* vma_aligned_alloc(size_t alignment, size_t size) -{ - // alignment must be >= sizeof(void*) - if(alignment < sizeof(void*)) - { - alignment = sizeof(void*); - } - - return memalign(alignment, size); -} -#elif defined(__APPLE__) || defined(__ANDROID__) || (defined(__linux__) && defined(__GLIBCXX__) && !defined(_GLIBCXX_HAVE_ALIGNED_ALLOC)) -#include - -#if defined(__APPLE__) -#include -#endif - -static void* vma_aligned_alloc(size_t alignment, size_t size) -{ - // Unfortunately, aligned_alloc causes VMA to crash due to it returning null pointers. (At least under 11.4) - // Therefore, for now disable this specific exception until a proper solution is found. - //#if defined(__APPLE__) && (defined(MAC_OS_X_VERSION_10_16) || defined(__IPHONE_14_0)) - //#if MAC_OS_X_VERSION_MAX_ALLOWED >= MAC_OS_X_VERSION_10_16 || __IPHONE_OS_VERSION_MAX_ALLOWED >= __IPHONE_14_0 - // // For C++14, usr/include/malloc/_malloc.h declares aligned_alloc()) only - // // with the MacOSX11.0 SDK in Xcode 12 (which is what adds - // // MAC_OS_X_VERSION_10_16), even though the function is marked - // // availabe for 10.15. That is why the preprocessor checks for 10.16 but - // // the __builtin_available checks for 10.15. - // // People who use C++17 could call aligned_alloc with the 10.15 SDK already. - // if (__builtin_available(macOS 10.15, iOS 13, *)) - // return aligned_alloc(alignment, size); - //#endif - //#endif - - // alignment must be >= sizeof(void*) - if(alignment < sizeof(void*)) - { - alignment = sizeof(void*); - } - - void *pointer; - if(posix_memalign(&pointer, alignment, size) == 0) - return pointer; - return VMA_NULL; -} -#elif defined(_WIN32) -static void* vma_aligned_alloc(size_t alignment, size_t size) -{ - return _aligned_malloc(size, alignment); -} -#else -static void* vma_aligned_alloc(size_t alignment, size_t size) -{ - return aligned_alloc(alignment, size); -} -#endif - -#if defined(_WIN32) -static void vma_aligned_free(void* ptr) -{ - _aligned_free(ptr); -} -#else -static void vma_aligned_free(void* VMA_NULLABLE ptr) -{ - free(ptr); -} -#endif - -// If your compiler is not compatible with C++11 and definition of -// aligned_alloc() function is missing, uncommeting following line may help: - -//#include - -// Normal assert to check for programmer's errors, especially in Debug configuration. -#ifndef VMA_ASSERT - #ifdef NDEBUG - #define VMA_ASSERT(expr) - #else - #define VMA_ASSERT(expr) assert(expr) - #endif -#endif - -// Assert that will be called very often, like inside data structures e.g. operator[]. -// Making it non-empty can make program slow. 
-#ifndef VMA_HEAVY_ASSERT - #ifdef NDEBUG - #define VMA_HEAVY_ASSERT(expr) - #else - #define VMA_HEAVY_ASSERT(expr) //VMA_ASSERT(expr) - #endif -#endif - -#ifndef VMA_ALIGN_OF - #define VMA_ALIGN_OF(type) (__alignof(type)) -#endif - -#ifndef VMA_SYSTEM_ALIGNED_MALLOC - #define VMA_SYSTEM_ALIGNED_MALLOC(size, alignment) vma_aligned_alloc((alignment), (size)) -#endif - -#ifndef VMA_SYSTEM_ALIGNED_FREE - // VMA_SYSTEM_FREE is the old name, but might have been defined by the user - #if defined(VMA_SYSTEM_FREE) - #define VMA_SYSTEM_ALIGNED_FREE(ptr) VMA_SYSTEM_FREE(ptr) - #else - #define VMA_SYSTEM_ALIGNED_FREE(ptr) vma_aligned_free(ptr) - #endif -#endif - -#ifndef VMA_COUNT_BITS_SET - // Returns number of bits set to 1 in (v) - #define VMA_COUNT_BITS_SET(v) VmaCountBitsSet(v) -#endif - -#ifndef VMA_BITSCAN_LSB - // Scans integer for index of first nonzero value from the Least Significant Bit (LSB). If mask is 0 then returns UINT8_MAX - #define VMA_BITSCAN_LSB(mask) VmaBitScanLSB(mask) -#endif - -#ifndef VMA_BITSCAN_MSB - // Scans integer for index of first nonzero value from the Most Significant Bit (MSB). If mask is 0 then returns UINT8_MAX - #define VMA_BITSCAN_MSB(mask) VmaBitScanMSB(mask) -#endif - -#ifndef VMA_MIN - #define VMA_MIN(v1, v2) ((std::min)((v1), (v2))) -#endif - -#ifndef VMA_MAX - #define VMA_MAX(v1, v2) ((std::max)((v1), (v2))) -#endif - -#ifndef VMA_SWAP - #define VMA_SWAP(v1, v2) std::swap((v1), (v2)) -#endif - -#ifndef VMA_SORT - #define VMA_SORT(beg, end, cmp) std::sort(beg, end, cmp) -#endif - -#ifndef VMA_DEBUG_LOG - #define VMA_DEBUG_LOG(format, ...) - /* - #define VMA_DEBUG_LOG(format, ...) do { \ - printf(format, __VA_ARGS__); \ - printf("\n"); \ - } while(false) - */ -#endif - -// Define this macro to 1 to enable functions: vmaBuildStatsString, vmaFreeStatsString. -#if VMA_STATS_STRING_ENABLED - static inline void VmaUint32ToStr(char* VMA_NOT_NULL outStr, size_t strLen, uint32_t num) - { - snprintf(outStr, strLen, "%u", static_cast(num)); - } - static inline void VmaUint64ToStr(char* VMA_NOT_NULL outStr, size_t strLen, uint64_t num) - { - snprintf(outStr, strLen, "%llu", static_cast(num)); - } - static inline void VmaPtrToStr(char* VMA_NOT_NULL outStr, size_t strLen, const void* ptr) - { - snprintf(outStr, strLen, "%p", ptr); - } -#endif - -#ifndef VMA_MUTEX - class VmaMutex - { - public: - void Lock() { m_Mutex.lock(); } - void Unlock() { m_Mutex.unlock(); } - bool TryLock() { return m_Mutex.try_lock(); } - private: - std::mutex m_Mutex; - }; - #define VMA_MUTEX VmaMutex -#endif - -// Read-write mutex, where "read" is shared access, "write" is exclusive access. -#ifndef VMA_RW_MUTEX - #if VMA_USE_STL_SHARED_MUTEX - // Use std::shared_mutex from C++17. - #include - class VmaRWMutex - { - public: - void LockRead() { m_Mutex.lock_shared(); } - void UnlockRead() { m_Mutex.unlock_shared(); } - bool TryLockRead() { return m_Mutex.try_lock_shared(); } - void LockWrite() { m_Mutex.lock(); } - void UnlockWrite() { m_Mutex.unlock(); } - bool TryLockWrite() { return m_Mutex.try_lock(); } - private: - std::shared_mutex m_Mutex; - }; - #define VMA_RW_MUTEX VmaRWMutex - #elif defined(_WIN32) && defined(WINVER) && WINVER >= 0x0600 - // Use SRWLOCK from WinAPI. - // Minimum supported client = Windows Vista, server = Windows Server 2008. 
- class VmaRWMutex - { - public: - VmaRWMutex() { InitializeSRWLock(&m_Lock); } - void LockRead() { AcquireSRWLockShared(&m_Lock); } - void UnlockRead() { ReleaseSRWLockShared(&m_Lock); } - bool TryLockRead() { return TryAcquireSRWLockShared(&m_Lock) != FALSE; } - void LockWrite() { AcquireSRWLockExclusive(&m_Lock); } - void UnlockWrite() { ReleaseSRWLockExclusive(&m_Lock); } - bool TryLockWrite() { return TryAcquireSRWLockExclusive(&m_Lock) != FALSE; } - private: - SRWLOCK m_Lock; - }; - #define VMA_RW_MUTEX VmaRWMutex - #else - // Less efficient fallback: Use normal mutex. - class VmaRWMutex - { - public: - void LockRead() { m_Mutex.Lock(); } - void UnlockRead() { m_Mutex.Unlock(); } - bool TryLockRead() { return m_Mutex.TryLock(); } - void LockWrite() { m_Mutex.Lock(); } - void UnlockWrite() { m_Mutex.Unlock(); } - bool TryLockWrite() { return m_Mutex.TryLock(); } - private: - VMA_MUTEX m_Mutex; - }; - #define VMA_RW_MUTEX VmaRWMutex - #endif // #if VMA_USE_STL_SHARED_MUTEX -#endif // #ifndef VMA_RW_MUTEX - -/* -If providing your own implementation, you need to implement a subset of std::atomic. -*/ -#ifndef VMA_ATOMIC_UINT32 - #include - #define VMA_ATOMIC_UINT32 std::atomic -#endif - -#ifndef VMA_ATOMIC_UINT64 - #include - #define VMA_ATOMIC_UINT64 std::atomic -#endif - -#ifndef VMA_DEBUG_ALWAYS_DEDICATED_MEMORY - /** - Every allocation will have its own memory block. - Define to 1 for debugging purposes only. - */ - #define VMA_DEBUG_ALWAYS_DEDICATED_MEMORY (0) -#endif - -#ifndef VMA_MIN_ALIGNMENT - /** - Minimum alignment of all allocations, in bytes. - Set to more than 1 for debugging purposes. Must be power of two. - */ - #ifdef VMA_DEBUG_ALIGNMENT // Old name - #define VMA_MIN_ALIGNMENT VMA_DEBUG_ALIGNMENT - #else - #define VMA_MIN_ALIGNMENT (1) - #endif -#endif - -#ifndef VMA_DEBUG_MARGIN - /** - Minimum margin after every allocation, in bytes. - Set nonzero for debugging purposes only. - */ - #define VMA_DEBUG_MARGIN (0) -#endif - -#ifndef VMA_DEBUG_INITIALIZE_ALLOCATIONS - /** - Define this macro to 1 to automatically fill new allocations and destroyed - allocations with some bit pattern. - */ - #define VMA_DEBUG_INITIALIZE_ALLOCATIONS (0) -#endif - -#ifndef VMA_DEBUG_DETECT_CORRUPTION - /** - Define this macro to 1 together with non-zero value of VMA_DEBUG_MARGIN to - enable writing magic value to the margin after every allocation and - validating it, so that memory corruptions (out-of-bounds writes) are detected. - */ - #define VMA_DEBUG_DETECT_CORRUPTION (0) -#endif - -#ifndef VMA_DEBUG_GLOBAL_MUTEX - /** - Set this to 1 for debugging purposes only, to enable single mutex protecting all - entry calls to the library. Can be useful for debugging multithreading issues. - */ - #define VMA_DEBUG_GLOBAL_MUTEX (0) -#endif - -#ifndef VMA_DEBUG_MIN_BUFFER_IMAGE_GRANULARITY - /** - Minimum value for VkPhysicalDeviceLimits::bufferImageGranularity. - Set to more than 1 for debugging purposes only. Must be power of two. - */ - #define VMA_DEBUG_MIN_BUFFER_IMAGE_GRANULARITY (1) -#endif - -#ifndef VMA_DEBUG_DONT_EXCEED_MAX_MEMORY_ALLOCATION_COUNT - /* - Set this to 1 to make VMA never exceed VkPhysicalDeviceLimits::maxMemoryAllocationCount - and return error instead of leaving up to Vulkan implementation what to do in such cases. - */ - #define VMA_DEBUG_DONT_EXCEED_MAX_MEMORY_ALLOCATION_COUNT (0) -#endif - -#ifndef VMA_SMALL_HEAP_MAX_SIZE - /// Maximum size of a memory heap in Vulkan to consider it "small". 
- #define VMA_SMALL_HEAP_MAX_SIZE (1024ull * 1024 * 1024) -#endif - -#ifndef VMA_DEFAULT_LARGE_HEAP_BLOCK_SIZE - /// Default size of a block allocated as single VkDeviceMemory from a "large" heap. - #define VMA_DEFAULT_LARGE_HEAP_BLOCK_SIZE (256ull * 1024 * 1024) -#endif - -/* -Mapping hysteresis is a logic that launches when vmaMapMemory/vmaUnmapMemory is called -or a persistently mapped allocation is created and destroyed several times in a row. -It keeps additional +1 mapping of a device memory block to prevent calling actual -vkMapMemory/vkUnmapMemory too many times, which may improve performance and help -tools like RenderDOc. -*/ -#ifndef VMA_MAPPING_HYSTERESIS_ENABLED - #define VMA_MAPPING_HYSTERESIS_ENABLED 1 -#endif - -#ifndef VMA_CLASS_NO_COPY - #define VMA_CLASS_NO_COPY(className) \ - private: \ - className(const className&) = delete; \ - className& operator=(const className&) = delete; -#endif - -#define VMA_VALIDATE(cond) do { if(!(cond)) { \ - VMA_ASSERT(0 && "Validation failed: " #cond); \ - return false; \ - } } while(false) - -/******************************************************************************* -END OF CONFIGURATION -*/ -#endif // _VMA_CONFIGURATION - - -static const uint8_t VMA_ALLOCATION_FILL_PATTERN_CREATED = 0xDC; -static const uint8_t VMA_ALLOCATION_FILL_PATTERN_DESTROYED = 0xEF; -// Decimal 2139416166, float NaN, little-endian binary 66 E6 84 7F. -static const uint32_t VMA_CORRUPTION_DETECTION_MAGIC_VALUE = 0x7F84E666; - -// Copy of some Vulkan definitions so we don't need to check their existence just to handle few constants. -static const uint32_t VK_MEMORY_PROPERTY_DEVICE_COHERENT_BIT_AMD_COPY = 0x00000040; -static const uint32_t VK_MEMORY_PROPERTY_DEVICE_UNCACHED_BIT_AMD_COPY = 0x00000080; -static const uint32_t VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT_COPY = 0x00020000; -static const uint32_t VK_IMAGE_CREATE_DISJOINT_BIT_COPY = 0x00000200; -static const int32_t VK_IMAGE_TILING_DRM_FORMAT_MODIFIER_EXT_COPY = 1000158000; -static const uint32_t VMA_ALLOCATION_INTERNAL_STRATEGY_MIN_OFFSET = 0x10000000u; -static const uint32_t VMA_ALLOCATION_TRY_COUNT = 32; -static const uint32_t VMA_VENDOR_ID_AMD = 4098; - -// This one is tricky. Vulkan specification defines this code as available since -// Vulkan 1.0, but doesn't actually define it in Vulkan SDK earlier than 1.2.131. -// See pull request #207. -#define VK_ERROR_UNKNOWN_COPY ((VkResult)-13) - - -#if VMA_STATS_STRING_ENABLED -// Correspond to values of enum VmaSuballocationType. -static const char* VMA_SUBALLOCATION_TYPE_NAMES[] = -{ - "FREE", - "UNKNOWN", - "BUFFER", - "IMAGE_UNKNOWN", - "IMAGE_LINEAR", - "IMAGE_OPTIMAL", -}; -#endif - -static VkAllocationCallbacks VmaEmptyAllocationCallbacks = - { VMA_NULL, VMA_NULL, VMA_NULL, VMA_NULL, VMA_NULL, VMA_NULL }; - - -#ifndef _VMA_ENUM_DECLARATIONS - -enum VmaSuballocationType -{ - VMA_SUBALLOCATION_TYPE_FREE = 0, - VMA_SUBALLOCATION_TYPE_UNKNOWN = 1, - VMA_SUBALLOCATION_TYPE_BUFFER = 2, - VMA_SUBALLOCATION_TYPE_IMAGE_UNKNOWN = 3, - VMA_SUBALLOCATION_TYPE_IMAGE_LINEAR = 4, - VMA_SUBALLOCATION_TYPE_IMAGE_OPTIMAL = 5, - VMA_SUBALLOCATION_TYPE_MAX_ENUM = 0x7FFFFFFF -}; - -enum VMA_CACHE_OPERATION -{ - VMA_CACHE_FLUSH, - VMA_CACHE_INVALIDATE -}; - -enum class VmaAllocationRequestType -{ - Normal, - TLSF, - // Used by "Linear" algorithm. - UpperAddress, - EndOf1st, - EndOf2nd, -}; - -#endif // _VMA_ENUM_DECLARATIONS - -#ifndef _VMA_FORWARD_DECLARATIONS -// Opaque handle used by allocation algorithms to identify single allocation in any conforming way. 
-VK_DEFINE_NON_DISPATCHABLE_HANDLE(VmaAllocHandle); - -struct VmaMutexLock; -struct VmaMutexLockRead; -struct VmaMutexLockWrite; - -template -struct AtomicTransactionalIncrement; - -template -struct VmaStlAllocator; - -template -class VmaVector; - -template -class VmaSmallVector; - -template -class VmaPoolAllocator; - -template -struct VmaListItem; - -template -class VmaRawList; - -template -class VmaList; - -template -class VmaIntrusiveLinkedList; - -// Unused in this version -#if 0 -template -struct VmaPair; -template -struct VmaPairFirstLess; - -template -class VmaMap; -#endif - -#if VMA_STATS_STRING_ENABLED -class VmaStringBuilder; -class VmaJsonWriter; -#endif - -class VmaDeviceMemoryBlock; - -struct VmaDedicatedAllocationListItemTraits; -class VmaDedicatedAllocationList; - -struct VmaSuballocation; -struct VmaSuballocationOffsetLess; -struct VmaSuballocationOffsetGreater; -struct VmaSuballocationItemSizeLess; - -typedef VmaList> VmaSuballocationList; - -struct VmaAllocationRequest; - -class VmaBlockMetadata; -class VmaBlockMetadata_Linear; -class VmaBlockMetadata_TLSF; - -class VmaBlockVector; - -struct VmaPoolListItemTraits; - -struct VmaCurrentBudgetData; - -class VmaAllocationObjectAllocator; - -#endif // _VMA_FORWARD_DECLARATIONS - - -#ifndef _VMA_FUNCTIONS - -/* -Returns number of bits set to 1 in (v). - -On specific platforms and compilers you can use instrinsics like: - -Visual Studio: - return __popcnt(v); -GCC, Clang: - return static_cast(__builtin_popcount(v)); - -Define macro VMA_COUNT_BITS_SET to provide your optimized implementation. -But you need to check in runtime whether user's CPU supports these, as some old processors don't. -*/ -static inline uint32_t VmaCountBitsSet(uint32_t v) -{ -#if __cplusplus >= 202002L || _MSVC_LANG >= 202002L // C++20 - return std::popcount(v); -#else - uint32_t c = v - ((v >> 1) & 0x55555555); - c = ((c >> 2) & 0x33333333) + (c & 0x33333333); - c = ((c >> 4) + c) & 0x0F0F0F0F; - c = ((c >> 8) + c) & 0x00FF00FF; - c = ((c >> 16) + c) & 0x0000FFFF; - return c; -#endif -} - -static inline uint8_t VmaBitScanLSB(uint64_t mask) -{ -#if defined(_MSC_VER) && defined(_WIN64) - unsigned long pos; - if (_BitScanForward64(&pos, mask)) - return static_cast(pos); - return UINT8_MAX; -#elif defined __GNUC__ || defined __clang__ - return static_cast(__builtin_ffsll(mask)) - 1U; -#else - uint8_t pos = 0; - uint64_t bit = 1; - do - { - if (mask & bit) - return pos; - bit <<= 1; - } while (pos++ < 63); - return UINT8_MAX; -#endif -} - -static inline uint8_t VmaBitScanLSB(uint32_t mask) -{ -#ifdef _MSC_VER - unsigned long pos; - if (_BitScanForward(&pos, mask)) - return static_cast(pos); - return UINT8_MAX; -#elif defined __GNUC__ || defined __clang__ - return static_cast(__builtin_ffs(mask)) - 1U; -#else - uint8_t pos = 0; - uint32_t bit = 1; - do - { - if (mask & bit) - return pos; - bit <<= 1; - } while (pos++ < 31); - return UINT8_MAX; -#endif -} - -static inline uint8_t VmaBitScanMSB(uint64_t mask) -{ -#if defined(_MSC_VER) && defined(_WIN64) - unsigned long pos; - if (_BitScanReverse64(&pos, mask)) - return static_cast(pos); -#elif defined __GNUC__ || defined __clang__ - if (mask) - return 63 - static_cast(__builtin_clzll(mask)); -#else - uint8_t pos = 63; - uint64_t bit = 1ULL << 63; - do - { - if (mask & bit) - return pos; - bit >>= 1; - } while (pos-- > 0); -#endif - return UINT8_MAX; -} - -static inline uint8_t VmaBitScanMSB(uint32_t mask) -{ -#ifdef _MSC_VER - unsigned long pos; - if (_BitScanReverse(&pos, mask)) - return static_cast(pos); -#elif 
defined __GNUC__ || defined __clang__ - if (mask) - return 31 - static_cast(__builtin_clz(mask)); -#else - uint8_t pos = 31; - uint32_t bit = 1UL << 31; - do - { - if (mask & bit) - return pos; - bit >>= 1; - } while (pos-- > 0); -#endif - return UINT8_MAX; -} - -/* -Returns true if given number is a power of two. -T must be unsigned integer number or signed integer but always nonnegative. -For 0 returns true. -*/ -template -inline bool VmaIsPow2(T x) -{ - return (x & (x - 1)) == 0; -} - -// Aligns given value up to nearest multiply of align value. For example: VmaAlignUp(11, 8) = 16. -// Use types like uint32_t, uint64_t as T. -template -static inline T VmaAlignUp(T val, T alignment) -{ - VMA_HEAVY_ASSERT(VmaIsPow2(alignment)); - return (val + alignment - 1) & ~(alignment - 1); -} - -// Aligns given value down to nearest multiply of align value. For example: VmaAlignUp(11, 8) = 8. -// Use types like uint32_t, uint64_t as T. -template -static inline T VmaAlignDown(T val, T alignment) -{ - VMA_HEAVY_ASSERT(VmaIsPow2(alignment)); - return val & ~(alignment - 1); -} - -// Division with mathematical rounding to nearest number. -template -static inline T VmaRoundDiv(T x, T y) -{ - return (x + (y / (T)2)) / y; -} - -// Divide by 'y' and round up to nearest integer. -template -static inline T VmaDivideRoundingUp(T x, T y) -{ - return (x + y - (T)1) / y; -} - -// Returns smallest power of 2 greater or equal to v. -static inline uint32_t VmaNextPow2(uint32_t v) -{ - v--; - v |= v >> 1; - v |= v >> 2; - v |= v >> 4; - v |= v >> 8; - v |= v >> 16; - v++; - return v; -} - -static inline uint64_t VmaNextPow2(uint64_t v) -{ - v--; - v |= v >> 1; - v |= v >> 2; - v |= v >> 4; - v |= v >> 8; - v |= v >> 16; - v |= v >> 32; - v++; - return v; -} - -// Returns largest power of 2 less or equal to v. -static inline uint32_t VmaPrevPow2(uint32_t v) -{ - v |= v >> 1; - v |= v >> 2; - v |= v >> 4; - v |= v >> 8; - v |= v >> 16; - v = v ^ (v >> 1); - return v; -} - -static inline uint64_t VmaPrevPow2(uint64_t v) -{ - v |= v >> 1; - v |= v >> 2; - v |= v >> 4; - v |= v >> 8; - v |= v >> 16; - v |= v >> 32; - v = v ^ (v >> 1); - return v; -} - -static inline bool VmaStrIsEmpty(const char* pStr) -{ - return pStr == VMA_NULL || *pStr == '\0'; -} - -/* -Returns true if two memory blocks occupy overlapping pages. -ResourceA must be in less memory offset than ResourceB. - -Algorithm is based on "Vulkan 1.0.39 - A Specification (with all registered Vulkan extensions)" -chapter 11.6 "Resource Memory Association", paragraph "Buffer-Image Granularity". -*/ -static inline bool VmaBlocksOnSamePage( - VkDeviceSize resourceAOffset, - VkDeviceSize resourceASize, - VkDeviceSize resourceBOffset, - VkDeviceSize pageSize) -{ - VMA_ASSERT(resourceAOffset + resourceASize <= resourceBOffset && resourceASize > 0 && pageSize > 0); - VkDeviceSize resourceAEnd = resourceAOffset + resourceASize - 1; - VkDeviceSize resourceAEndPage = resourceAEnd & ~(pageSize - 1); - VkDeviceSize resourceBStart = resourceBOffset; - VkDeviceSize resourceBStartPage = resourceBStart & ~(pageSize - 1); - return resourceAEndPage == resourceBStartPage; -} - -/* -Returns true if given suballocation types could conflict and must respect -VkPhysicalDeviceLimits::bufferImageGranularity. They conflict if one is buffer -or linear image and another one is optimal image. If type is unknown, behave -conservatively. 
-*/ -static inline bool VmaIsBufferImageGranularityConflict( - VmaSuballocationType suballocType1, - VmaSuballocationType suballocType2) -{ - if (suballocType1 > suballocType2) - { - VMA_SWAP(suballocType1, suballocType2); - } - - switch (suballocType1) - { - case VMA_SUBALLOCATION_TYPE_FREE: - return false; - case VMA_SUBALLOCATION_TYPE_UNKNOWN: - return true; - case VMA_SUBALLOCATION_TYPE_BUFFER: - return - suballocType2 == VMA_SUBALLOCATION_TYPE_IMAGE_UNKNOWN || - suballocType2 == VMA_SUBALLOCATION_TYPE_IMAGE_OPTIMAL; - case VMA_SUBALLOCATION_TYPE_IMAGE_UNKNOWN: - return - suballocType2 == VMA_SUBALLOCATION_TYPE_IMAGE_UNKNOWN || - suballocType2 == VMA_SUBALLOCATION_TYPE_IMAGE_LINEAR || - suballocType2 == VMA_SUBALLOCATION_TYPE_IMAGE_OPTIMAL; - case VMA_SUBALLOCATION_TYPE_IMAGE_LINEAR: - return - suballocType2 == VMA_SUBALLOCATION_TYPE_IMAGE_OPTIMAL; - case VMA_SUBALLOCATION_TYPE_IMAGE_OPTIMAL: - return false; - default: - VMA_ASSERT(0); - return true; - } -} - -static void VmaWriteMagicValue(void* pData, VkDeviceSize offset) -{ -#if VMA_DEBUG_MARGIN > 0 && VMA_DEBUG_DETECT_CORRUPTION - uint32_t* pDst = (uint32_t*)((char*)pData + offset); - const size_t numberCount = VMA_DEBUG_MARGIN / sizeof(uint32_t); - for (size_t i = 0; i < numberCount; ++i, ++pDst) - { - *pDst = VMA_CORRUPTION_DETECTION_MAGIC_VALUE; - } -#else - // no-op -#endif -} - -static bool VmaValidateMagicValue(const void* pData, VkDeviceSize offset) -{ -#if VMA_DEBUG_MARGIN > 0 && VMA_DEBUG_DETECT_CORRUPTION - const uint32_t* pSrc = (const uint32_t*)((const char*)pData + offset); - const size_t numberCount = VMA_DEBUG_MARGIN / sizeof(uint32_t); - for (size_t i = 0; i < numberCount; ++i, ++pSrc) - { - if (*pSrc != VMA_CORRUPTION_DETECTION_MAGIC_VALUE) - { - return false; - } - } -#endif - return true; -} - -/* -Fills structure with parameters of an example buffer to be used for transfers -during GPU memory defragmentation. -*/ -static void VmaFillGpuDefragmentationBufferCreateInfo(VkBufferCreateInfo& outBufCreateInfo) -{ - memset(&outBufCreateInfo, 0, sizeof(outBufCreateInfo)); - outBufCreateInfo.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO; - outBufCreateInfo.usage = VK_BUFFER_USAGE_TRANSFER_SRC_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT; - outBufCreateInfo.size = (VkDeviceSize)VMA_DEFAULT_LARGE_HEAP_BLOCK_SIZE; // Example size. -} - - -/* -Performs binary search and returns iterator to first element that is greater or -equal to (key), according to comparison (cmp). - -Cmp should return true if first argument is less than second argument. - -Returned value is the found element, if present in the collection or place where -new element with value (key) should be inserted. -*/ -template -static IterT VmaBinaryFindFirstNotLess(IterT beg, IterT end, const KeyT& key, const CmpLess& cmp) -{ - size_t down = 0, up = (end - beg); - while (down < up) - { - const size_t mid = down + (up - down) / 2; // Overflow-safe midpoint calculation - if (cmp(*(beg + mid), key)) - { - down = mid + 1; - } - else - { - up = mid; - } - } - return beg + down; -} - -template -IterT VmaBinaryFindSorted(const IterT& beg, const IterT& end, const KeyT& value, const CmpLess& cmp) -{ - IterT it = VmaBinaryFindFirstNotLess( - beg, end, value, cmp); - if (it == end || - (!cmp(*it, value) && !cmp(value, *it))) - { - return it; - } - return end; -} - -/* -Returns true if all pointers in the array are not-null and unique. -Warning! O(n^2) complexity. Use only inside VMA_HEAVY_ASSERT. -T must be pointer type, e.g. VmaAllocation, VmaPool. 
-*/ -template -static bool VmaValidatePointerArray(uint32_t count, const T* arr) -{ - for (uint32_t i = 0; i < count; ++i) - { - const T iPtr = arr[i]; - if (iPtr == VMA_NULL) - { - return false; - } - for (uint32_t j = i + 1; j < count; ++j) - { - if (iPtr == arr[j]) - { - return false; - } - } - } - return true; -} - -template -static inline void VmaPnextChainPushFront(MainT* mainStruct, NewT* newStruct) -{ - newStruct->pNext = mainStruct->pNext; - mainStruct->pNext = newStruct; -} - -// This is the main algorithm that guides the selection of a memory type best for an allocation - -// converts usage to required/preferred/not preferred flags. -static bool FindMemoryPreferences( - bool isIntegratedGPU, - const VmaAllocationCreateInfo& allocCreateInfo, - VkFlags bufImgUsage, // VkBufferCreateInfo::usage or VkImageCreateInfo::usage. UINT32_MAX if unknown. - VkMemoryPropertyFlags& outRequiredFlags, - VkMemoryPropertyFlags& outPreferredFlags, - VkMemoryPropertyFlags& outNotPreferredFlags) -{ - outRequiredFlags = allocCreateInfo.requiredFlags; - outPreferredFlags = allocCreateInfo.preferredFlags; - outNotPreferredFlags = 0; - - switch(allocCreateInfo.usage) - { - case VMA_MEMORY_USAGE_UNKNOWN: - break; - case VMA_MEMORY_USAGE_GPU_ONLY: - if(!isIntegratedGPU || (outPreferredFlags & VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT) == 0) - { - outPreferredFlags |= VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT; - } - break; - case VMA_MEMORY_USAGE_CPU_ONLY: - outRequiredFlags |= VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT; - break; - case VMA_MEMORY_USAGE_CPU_TO_GPU: - outRequiredFlags |= VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT; - if(!isIntegratedGPU || (outPreferredFlags & VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT) == 0) - { - outPreferredFlags |= VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT; - } - break; - case VMA_MEMORY_USAGE_GPU_TO_CPU: - outRequiredFlags |= VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT; - outPreferredFlags |= VK_MEMORY_PROPERTY_HOST_CACHED_BIT; - break; - case VMA_MEMORY_USAGE_CPU_COPY: - outNotPreferredFlags |= VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT; - break; - case VMA_MEMORY_USAGE_GPU_LAZILY_ALLOCATED: - outRequiredFlags |= VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT; - break; - case VMA_MEMORY_USAGE_AUTO: - case VMA_MEMORY_USAGE_AUTO_PREFER_DEVICE: - case VMA_MEMORY_USAGE_AUTO_PREFER_HOST: - { - if(bufImgUsage == UINT32_MAX) - { - VMA_ASSERT(0 && "VMA_MEMORY_USAGE_AUTO* values can only be used with functions like vmaCreateBuffer, vmaCreateImage so that the details of the created resource are known."); - return false; - } - // This relies on values of VK_IMAGE_USAGE_TRANSFER* being the same VK_BUFFER_IMAGE_TRANSFER*. - const bool deviceAccess = (bufImgUsage & ~(VK_BUFFER_USAGE_TRANSFER_DST_BIT | VK_BUFFER_USAGE_TRANSFER_SRC_BIT)) != 0; - const bool hostAccessSequentialWrite = (allocCreateInfo.flags & VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT) != 0; - const bool hostAccessRandom = (allocCreateInfo.flags & VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT) != 0; - const bool hostAccessAllowTransferInstead = (allocCreateInfo.flags & VMA_ALLOCATION_CREATE_HOST_ACCESS_ALLOW_TRANSFER_INSTEAD_BIT) != 0; - const bool preferDevice = allocCreateInfo.usage == VMA_MEMORY_USAGE_AUTO_PREFER_DEVICE; - const bool preferHost = allocCreateInfo.usage == VMA_MEMORY_USAGE_AUTO_PREFER_HOST; - - // CPU random access - e.g. a buffer written to or transferred from GPU to read back on CPU. 
- if(hostAccessRandom) - { - if(!isIntegratedGPU && deviceAccess && hostAccessAllowTransferInstead && !preferHost) - { - // Nice if it will end up in HOST_VISIBLE, but more importantly prefer DEVICE_LOCAL. - // Omitting HOST_VISIBLE here is intentional. - // In case there is DEVICE_LOCAL | HOST_VISIBLE | HOST_CACHED, it will pick that one. - // Otherwise, this will give same weight to DEVICE_LOCAL as HOST_VISIBLE | HOST_CACHED and select the former if occurs first on the list. - outPreferredFlags |= VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT | VK_MEMORY_PROPERTY_HOST_CACHED_BIT; - } - else - { - // Always CPU memory, cached. - outRequiredFlags |= VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_CACHED_BIT; - } - } - // CPU sequential write - may be CPU or host-visible GPU memory, uncached and write-combined. - else if(hostAccessSequentialWrite) - { - // Want uncached and write-combined. - outNotPreferredFlags |= VK_MEMORY_PROPERTY_HOST_CACHED_BIT; - - if(!isIntegratedGPU && deviceAccess && hostAccessAllowTransferInstead && !preferHost) - { - outPreferredFlags |= VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT | VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT; - } - else - { - outRequiredFlags |= VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT; - // Direct GPU access, CPU sequential write (e.g. a dynamic uniform buffer updated every frame) - if(deviceAccess) - { - // Could go to CPU memory or GPU BAR/unified. Up to the user to decide. If no preference, choose GPU memory. - if(preferHost) - outNotPreferredFlags |= VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT; - else - outPreferredFlags |= VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT; - } - // GPU no direct access, CPU sequential write (e.g. an upload buffer to be transferred to the GPU) - else - { - // Could go to CPU memory or GPU BAR/unified. Up to the user to decide. If no preference, choose CPU memory. - if(preferDevice) - outPreferredFlags |= VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT; - else - outNotPreferredFlags |= VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT; - } - } - } - // No CPU access - else - { - // GPU access, no CPU access (e.g. a color attachment image) - prefer GPU memory - if(deviceAccess) - { - // ...unless there is a clear preference from the user not to do so. - if(preferHost) - outNotPreferredFlags |= VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT; - else - outPreferredFlags |= VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT; - } - // No direct GPU access, no CPU access, just transfers. - // It may be staging copy intended for e.g. preserving image for next frame (then better GPU memory) or - // a "swap file" copy to free some GPU memory (then better CPU memory). - // Up to the user to decide. If no preferece, assume the former and choose GPU memory. - if(preferHost) - outNotPreferredFlags |= VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT; - else - outPreferredFlags |= VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT; - } - break; - } - default: - VMA_ASSERT(0); - } - - // Avoid DEVICE_COHERENT unless explicitly requested. 
- if(((allocCreateInfo.requiredFlags | allocCreateInfo.preferredFlags) & - (VK_MEMORY_PROPERTY_DEVICE_COHERENT_BIT_AMD_COPY | VK_MEMORY_PROPERTY_DEVICE_UNCACHED_BIT_AMD_COPY)) == 0) - { - outNotPreferredFlags |= VK_MEMORY_PROPERTY_DEVICE_UNCACHED_BIT_AMD_COPY; - } - - return true; -} - -//////////////////////////////////////////////////////////////////////////////// -// Memory allocation - -static void* VmaMalloc(const VkAllocationCallbacks* pAllocationCallbacks, size_t size, size_t alignment) -{ - void* result = VMA_NULL; - if ((pAllocationCallbacks != VMA_NULL) && - (pAllocationCallbacks->pfnAllocation != VMA_NULL)) - { - result = (*pAllocationCallbacks->pfnAllocation)( - pAllocationCallbacks->pUserData, - size, - alignment, - VK_SYSTEM_ALLOCATION_SCOPE_OBJECT); - } - else - { - result = VMA_SYSTEM_ALIGNED_MALLOC(size, alignment); - } - VMA_ASSERT(result != VMA_NULL && "CPU memory allocation failed."); - return result; -} - -static void VmaFree(const VkAllocationCallbacks* pAllocationCallbacks, void* ptr) -{ - if ((pAllocationCallbacks != VMA_NULL) && - (pAllocationCallbacks->pfnFree != VMA_NULL)) - { - (*pAllocationCallbacks->pfnFree)(pAllocationCallbacks->pUserData, ptr); - } - else - { - VMA_SYSTEM_ALIGNED_FREE(ptr); - } -} - -template -static T* VmaAllocate(const VkAllocationCallbacks* pAllocationCallbacks) -{ - return (T*)VmaMalloc(pAllocationCallbacks, sizeof(T), VMA_ALIGN_OF(T)); -} - -template -static T* VmaAllocateArray(const VkAllocationCallbacks* pAllocationCallbacks, size_t count) -{ - return (T*)VmaMalloc(pAllocationCallbacks, sizeof(T) * count, VMA_ALIGN_OF(T)); -} - -#define vma_new(allocator, type) new(VmaAllocate(allocator))(type) - -#define vma_new_array(allocator, type, count) new(VmaAllocateArray((allocator), (count)))(type) - -template -static void vma_delete(const VkAllocationCallbacks* pAllocationCallbacks, T* ptr) -{ - ptr->~T(); - VmaFree(pAllocationCallbacks, ptr); -} - -template -static void vma_delete_array(const VkAllocationCallbacks* pAllocationCallbacks, T* ptr, size_t count) -{ - if (ptr != VMA_NULL) - { - for (size_t i = count; i--; ) - { - ptr[i].~T(); - } - VmaFree(pAllocationCallbacks, ptr); - } -} - -static char* VmaCreateStringCopy(const VkAllocationCallbacks* allocs, const char* srcStr) -{ - if (srcStr != VMA_NULL) - { - const size_t len = strlen(srcStr); - char* const result = vma_new_array(allocs, char, len + 1); - memcpy(result, srcStr, len + 1); - return result; - } - return VMA_NULL; -} - -#if VMA_STATS_STRING_ENABLED -static char* VmaCreateStringCopy(const VkAllocationCallbacks* allocs, const char* srcStr, size_t strLen) -{ - if (srcStr != VMA_NULL) - { - char* const result = vma_new_array(allocs, char, strLen + 1); - memcpy(result, srcStr, strLen); - result[strLen] = '\0'; - return result; - } - return VMA_NULL; -} -#endif // VMA_STATS_STRING_ENABLED - -static void VmaFreeString(const VkAllocationCallbacks* allocs, char* str) -{ - if (str != VMA_NULL) - { - const size_t len = strlen(str); - vma_delete_array(allocs, str, len + 1); - } -} - -template -size_t VmaVectorInsertSorted(VectorT& vector, const typename VectorT::value_type& value) -{ - const size_t indexToInsert = VmaBinaryFindFirstNotLess( - vector.data(), - vector.data() + vector.size(), - value, - CmpLess()) - vector.data(); - VmaVectorInsert(vector, indexToInsert, value); - return indexToInsert; -} - -template -bool VmaVectorRemoveSorted(VectorT& vector, const typename VectorT::value_type& value) -{ - CmpLess comparator; - typename VectorT::iterator it = VmaBinaryFindFirstNotLess( - 
vector.begin(), - vector.end(), - value, - comparator); - if ((it != vector.end()) && !comparator(*it, value) && !comparator(value, *it)) - { - size_t indexToRemove = it - vector.begin(); - VmaVectorRemove(vector, indexToRemove); - return true; - } - return false; -} -#endif // _VMA_FUNCTIONS - -#ifndef _VMA_STATISTICS_FUNCTIONS - -static void VmaClearStatistics(VmaStatistics& outStats) -{ - outStats.blockCount = 0; - outStats.allocationCount = 0; - outStats.blockBytes = 0; - outStats.allocationBytes = 0; -} - -static void VmaAddStatistics(VmaStatistics& inoutStats, const VmaStatistics& src) -{ - inoutStats.blockCount += src.blockCount; - inoutStats.allocationCount += src.allocationCount; - inoutStats.blockBytes += src.blockBytes; - inoutStats.allocationBytes += src.allocationBytes; -} - -static void VmaClearDetailedStatistics(VmaDetailedStatistics& outStats) -{ - VmaClearStatistics(outStats.statistics); - outStats.unusedRangeCount = 0; - outStats.allocationSizeMin = VK_WHOLE_SIZE; - outStats.allocationSizeMax = 0; - outStats.unusedRangeSizeMin = VK_WHOLE_SIZE; - outStats.unusedRangeSizeMax = 0; -} - -static void VmaAddDetailedStatisticsAllocation(VmaDetailedStatistics& inoutStats, VkDeviceSize size) -{ - inoutStats.statistics.allocationCount++; - inoutStats.statistics.allocationBytes += size; - inoutStats.allocationSizeMin = VMA_MIN(inoutStats.allocationSizeMin, size); - inoutStats.allocationSizeMax = VMA_MAX(inoutStats.allocationSizeMax, size); -} - -static void VmaAddDetailedStatisticsUnusedRange(VmaDetailedStatistics& inoutStats, VkDeviceSize size) -{ - inoutStats.unusedRangeCount++; - inoutStats.unusedRangeSizeMin = VMA_MIN(inoutStats.unusedRangeSizeMin, size); - inoutStats.unusedRangeSizeMax = VMA_MAX(inoutStats.unusedRangeSizeMax, size); -} - -static void VmaAddDetailedStatistics(VmaDetailedStatistics& inoutStats, const VmaDetailedStatistics& src) -{ - VmaAddStatistics(inoutStats.statistics, src.statistics); - inoutStats.unusedRangeCount += src.unusedRangeCount; - inoutStats.allocationSizeMin = VMA_MIN(inoutStats.allocationSizeMin, src.allocationSizeMin); - inoutStats.allocationSizeMax = VMA_MAX(inoutStats.allocationSizeMax, src.allocationSizeMax); - inoutStats.unusedRangeSizeMin = VMA_MIN(inoutStats.unusedRangeSizeMin, src.unusedRangeSizeMin); - inoutStats.unusedRangeSizeMax = VMA_MAX(inoutStats.unusedRangeSizeMax, src.unusedRangeSizeMax); -} - -#endif // _VMA_STATISTICS_FUNCTIONS - -#ifndef _VMA_MUTEX_LOCK -// Helper RAII class to lock a mutex in constructor and unlock it in destructor (at the end of scope). -struct VmaMutexLock -{ - VMA_CLASS_NO_COPY(VmaMutexLock) -public: - VmaMutexLock(VMA_MUTEX& mutex, bool useMutex = true) : - m_pMutex(useMutex ? &mutex : VMA_NULL) - { - if (m_pMutex) { m_pMutex->Lock(); } - } - ~VmaMutexLock() { if (m_pMutex) { m_pMutex->Unlock(); } } - -private: - VMA_MUTEX* m_pMutex; -}; - -// Helper RAII class to lock a RW mutex in constructor and unlock it in destructor (at the end of scope), for reading. -struct VmaMutexLockRead -{ - VMA_CLASS_NO_COPY(VmaMutexLockRead) -public: - VmaMutexLockRead(VMA_RW_MUTEX& mutex, bool useMutex) : - m_pMutex(useMutex ? &mutex : VMA_NULL) - { - if (m_pMutex) { m_pMutex->LockRead(); } - } - ~VmaMutexLockRead() { if (m_pMutex) { m_pMutex->UnlockRead(); } } - -private: - VMA_RW_MUTEX* m_pMutex; -}; - -// Helper RAII class to lock a RW mutex in constructor and unlock it in destructor (at the end of scope), for writing. 
-struct VmaMutexLockWrite -{ - VMA_CLASS_NO_COPY(VmaMutexLockWrite) -public: - VmaMutexLockWrite(VMA_RW_MUTEX& mutex, bool useMutex) - : m_pMutex(useMutex ? &mutex : VMA_NULL) - { - if (m_pMutex) { m_pMutex->LockWrite(); } - } - ~VmaMutexLockWrite() { if (m_pMutex) { m_pMutex->UnlockWrite(); } } - -private: - VMA_RW_MUTEX* m_pMutex; -}; - -#if VMA_DEBUG_GLOBAL_MUTEX - static VMA_MUTEX gDebugGlobalMutex; - #define VMA_DEBUG_GLOBAL_MUTEX_LOCK VmaMutexLock debugGlobalMutexLock(gDebugGlobalMutex, true); -#else - #define VMA_DEBUG_GLOBAL_MUTEX_LOCK -#endif -#endif // _VMA_MUTEX_LOCK - -#ifndef _VMA_ATOMIC_TRANSACTIONAL_INCREMENT -// An object that increments given atomic but decrements it back in the destructor unless Commit() is called. -template -struct AtomicTransactionalIncrement -{ -public: - typedef std::atomic AtomicT; - - ~AtomicTransactionalIncrement() - { - if(m_Atomic) - --(*m_Atomic); - } - - void Commit() { m_Atomic = nullptr; } - T Increment(AtomicT* atomic) - { - m_Atomic = atomic; - return m_Atomic->fetch_add(1); - } - -private: - AtomicT* m_Atomic = nullptr; -}; -#endif // _VMA_ATOMIC_TRANSACTIONAL_INCREMENT - -#ifndef _VMA_STL_ALLOCATOR -// STL-compatible allocator. -template -struct VmaStlAllocator -{ - const VkAllocationCallbacks* const m_pCallbacks; - typedef T value_type; - - VmaStlAllocator(const VkAllocationCallbacks* pCallbacks) : m_pCallbacks(pCallbacks) {} - template - VmaStlAllocator(const VmaStlAllocator& src) : m_pCallbacks(src.m_pCallbacks) {} - VmaStlAllocator(const VmaStlAllocator&) = default; - VmaStlAllocator& operator=(const VmaStlAllocator&) = delete; - - T* allocate(size_t n) { return VmaAllocateArray(m_pCallbacks, n); } - void deallocate(T* p, size_t n) { VmaFree(m_pCallbacks, p); } - - template - bool operator==(const VmaStlAllocator& rhs) const - { - return m_pCallbacks == rhs.m_pCallbacks; - } - template - bool operator!=(const VmaStlAllocator& rhs) const - { - return m_pCallbacks != rhs.m_pCallbacks; - } -}; -#endif // _VMA_STL_ALLOCATOR - -#ifndef _VMA_VECTOR -/* Class with interface compatible with subset of std::vector. -T must be POD because constructors and destructors are not called and memcpy is -used for these objects. */ -template -class VmaVector -{ -public: - typedef T value_type; - typedef T* iterator; - typedef const T* const_iterator; - - VmaVector(const AllocatorT& allocator); - VmaVector(size_t count, const AllocatorT& allocator); - // This version of the constructor is here for compatibility with pre-C++14 std::vector. - // value is unused. 
- VmaVector(size_t count, const T& value, const AllocatorT& allocator) : VmaVector(count, allocator) {} - VmaVector(const VmaVector& src); - VmaVector& operator=(const VmaVector& rhs); - ~VmaVector() { VmaFree(m_Allocator.m_pCallbacks, m_pArray); } - - bool empty() const { return m_Count == 0; } - size_t size() const { return m_Count; } - T* data() { return m_pArray; } - T& front() { VMA_HEAVY_ASSERT(m_Count > 0); return m_pArray[0]; } - T& back() { VMA_HEAVY_ASSERT(m_Count > 0); return m_pArray[m_Count - 1]; } - const T* data() const { return m_pArray; } - const T& front() const { VMA_HEAVY_ASSERT(m_Count > 0); return m_pArray[0]; } - const T& back() const { VMA_HEAVY_ASSERT(m_Count > 0); return m_pArray[m_Count - 1]; } - - iterator begin() { return m_pArray; } - iterator end() { return m_pArray + m_Count; } - const_iterator cbegin() const { return m_pArray; } - const_iterator cend() const { return m_pArray + m_Count; } - const_iterator begin() const { return cbegin(); } - const_iterator end() const { return cend(); } - - void pop_front() { VMA_HEAVY_ASSERT(m_Count > 0); remove(0); } - void pop_back() { VMA_HEAVY_ASSERT(m_Count > 0); resize(size() - 1); } - void push_front(const T& src) { insert(0, src); } - - void push_back(const T& src); - void reserve(size_t newCapacity, bool freeMemory = false); - void resize(size_t newCount); - void clear() { resize(0); } - void shrink_to_fit(); - void insert(size_t index, const T& src); - void remove(size_t index); - - T& operator[](size_t index) { VMA_HEAVY_ASSERT(index < m_Count); return m_pArray[index]; } - const T& operator[](size_t index) const { VMA_HEAVY_ASSERT(index < m_Count); return m_pArray[index]; } - -private: - AllocatorT m_Allocator; - T* m_pArray; - size_t m_Count; - size_t m_Capacity; -}; - -#ifndef _VMA_VECTOR_FUNCTIONS -template -VmaVector::VmaVector(const AllocatorT& allocator) - : m_Allocator(allocator), - m_pArray(VMA_NULL), - m_Count(0), - m_Capacity(0) {} - -template -VmaVector::VmaVector(size_t count, const AllocatorT& allocator) - : m_Allocator(allocator), - m_pArray(count ? (T*)VmaAllocateArray(allocator.m_pCallbacks, count) : VMA_NULL), - m_Count(count), - m_Capacity(count) {} - -template -VmaVector::VmaVector(const VmaVector& src) - : m_Allocator(src.m_Allocator), - m_pArray(src.m_Count ? (T*)VmaAllocateArray(src.m_Allocator.m_pCallbacks, src.m_Count) : VMA_NULL), - m_Count(src.m_Count), - m_Capacity(src.m_Count) -{ - if (m_Count != 0) - { - memcpy(m_pArray, src.m_pArray, m_Count * sizeof(T)); - } -} - -template -VmaVector& VmaVector::operator=(const VmaVector& rhs) -{ - if (&rhs != this) - { - resize(rhs.m_Count); - if (m_Count != 0) - { - memcpy(m_pArray, rhs.m_pArray, m_Count * sizeof(T)); - } - } - return *this; -} - -template -void VmaVector::push_back(const T& src) -{ - const size_t newIndex = size(); - resize(newIndex + 1); - m_pArray[newIndex] = src; -} - -template -void VmaVector::reserve(size_t newCapacity, bool freeMemory) -{ - newCapacity = VMA_MAX(newCapacity, m_Count); - - if ((newCapacity < m_Capacity) && !freeMemory) - { - newCapacity = m_Capacity; - } - - if (newCapacity != m_Capacity) - { - T* const newArray = newCapacity ? 
VmaAllocateArray(m_Allocator, newCapacity) : VMA_NULL; - if (m_Count != 0) - { - memcpy(newArray, m_pArray, m_Count * sizeof(T)); - } - VmaFree(m_Allocator.m_pCallbacks, m_pArray); - m_Capacity = newCapacity; - m_pArray = newArray; - } -} - -template -void VmaVector::resize(size_t newCount) -{ - size_t newCapacity = m_Capacity; - if (newCount > m_Capacity) - { - newCapacity = VMA_MAX(newCount, VMA_MAX(m_Capacity * 3 / 2, (size_t)8)); - } - - if (newCapacity != m_Capacity) - { - T* const newArray = newCapacity ? VmaAllocateArray(m_Allocator.m_pCallbacks, newCapacity) : VMA_NULL; - const size_t elementsToCopy = VMA_MIN(m_Count, newCount); - if (elementsToCopy != 0) - { - memcpy(newArray, m_pArray, elementsToCopy * sizeof(T)); - } - VmaFree(m_Allocator.m_pCallbacks, m_pArray); - m_Capacity = newCapacity; - m_pArray = newArray; - } - - m_Count = newCount; -} - -template -void VmaVector::shrink_to_fit() -{ - if (m_Capacity > m_Count) - { - T* newArray = VMA_NULL; - if (m_Count > 0) - { - newArray = VmaAllocateArray(m_Allocator.m_pCallbacks, m_Count); - memcpy(newArray, m_pArray, m_Count * sizeof(T)); - } - VmaFree(m_Allocator.m_pCallbacks, m_pArray); - m_Capacity = m_Count; - m_pArray = newArray; - } -} - -template -void VmaVector::insert(size_t index, const T& src) -{ - VMA_HEAVY_ASSERT(index <= m_Count); - const size_t oldCount = size(); - resize(oldCount + 1); - if (index < oldCount) - { - memmove(m_pArray + (index + 1), m_pArray + index, (oldCount - index) * sizeof(T)); - } - m_pArray[index] = src; -} - -template -void VmaVector::remove(size_t index) -{ - VMA_HEAVY_ASSERT(index < m_Count); - const size_t oldCount = size(); - if (index < oldCount - 1) - { - memmove(m_pArray + index, m_pArray + (index + 1), (oldCount - index - 1) * sizeof(T)); - } - resize(oldCount - 1); -} -#endif // _VMA_VECTOR_FUNCTIONS - -template -static void VmaVectorInsert(VmaVector& vec, size_t index, const T& item) -{ - vec.insert(index, item); -} - -template -static void VmaVectorRemove(VmaVector& vec, size_t index) -{ - vec.remove(index); -} -#endif // _VMA_VECTOR - -#ifndef _VMA_SMALL_VECTOR -/* -This is a vector (a variable-sized array), optimized for the case when the array is small. - -It contains some number of elements in-place, which allows it to avoid heap allocation -when the actual number of elements is below that threshold. This allows normal "small" -cases to be fast without losing generality for large inputs. -*/ -template -class VmaSmallVector -{ -public: - typedef T value_type; - typedef T* iterator; - - VmaSmallVector(const AllocatorT& allocator); - VmaSmallVector(size_t count, const AllocatorT& allocator); - template - VmaSmallVector(const VmaSmallVector&) = delete; - template - VmaSmallVector& operator=(const VmaSmallVector&) = delete; - ~VmaSmallVector() = default; - - bool empty() const { return m_Count == 0; } - size_t size() const { return m_Count; } - T* data() { return m_Count > N ? m_DynamicArray.data() : m_StaticArray; } - T& front() { VMA_HEAVY_ASSERT(m_Count > 0); return data()[0]; } - T& back() { VMA_HEAVY_ASSERT(m_Count > 0); return data()[m_Count - 1]; } - const T* data() const { return m_Count > N ? 
m_DynamicArray.data() : m_StaticArray; } - const T& front() const { VMA_HEAVY_ASSERT(m_Count > 0); return data()[0]; } - const T& back() const { VMA_HEAVY_ASSERT(m_Count > 0); return data()[m_Count - 1]; } - - iterator begin() { return data(); } - iterator end() { return data() + m_Count; } - - void pop_front() { VMA_HEAVY_ASSERT(m_Count > 0); remove(0); } - void pop_back() { VMA_HEAVY_ASSERT(m_Count > 0); resize(size() - 1); } - void push_front(const T& src) { insert(0, src); } - - void push_back(const T& src); - void resize(size_t newCount, bool freeMemory = false); - void clear(bool freeMemory = false); - void insert(size_t index, const T& src); - void remove(size_t index); - - T& operator[](size_t index) { VMA_HEAVY_ASSERT(index < m_Count); return data()[index]; } - const T& operator[](size_t index) const { VMA_HEAVY_ASSERT(index < m_Count); return data()[index]; } - -private: - size_t m_Count; - T m_StaticArray[N]; // Used when m_Size <= N - VmaVector m_DynamicArray; // Used when m_Size > N -}; - -#ifndef _VMA_SMALL_VECTOR_FUNCTIONS -template -VmaSmallVector::VmaSmallVector(const AllocatorT& allocator) - : m_Count(0), - m_DynamicArray(allocator) {} - -template -VmaSmallVector::VmaSmallVector(size_t count, const AllocatorT& allocator) - : m_Count(count), - m_DynamicArray(count > N ? count : 0, allocator) {} - -template -void VmaSmallVector::push_back(const T& src) -{ - const size_t newIndex = size(); - resize(newIndex + 1); - data()[newIndex] = src; -} - -template -void VmaSmallVector::resize(size_t newCount, bool freeMemory) -{ - if (newCount > N && m_Count > N) - { - // Any direction, staying in m_DynamicArray - m_DynamicArray.resize(newCount); - if (freeMemory) - { - m_DynamicArray.shrink_to_fit(); - } - } - else if (newCount > N && m_Count <= N) - { - // Growing, moving from m_StaticArray to m_DynamicArray - m_DynamicArray.resize(newCount); - if (m_Count > 0) - { - memcpy(m_DynamicArray.data(), m_StaticArray, m_Count * sizeof(T)); - } - } - else if (newCount <= N && m_Count > N) - { - // Shrinking, moving from m_DynamicArray to m_StaticArray - if (newCount > 0) - { - memcpy(m_StaticArray, m_DynamicArray.data(), newCount * sizeof(T)); - } - m_DynamicArray.resize(0); - if (freeMemory) - { - m_DynamicArray.shrink_to_fit(); - } - } - else - { - // Any direction, staying in m_StaticArray - nothing to do here - } - m_Count = newCount; -} - -template -void VmaSmallVector::clear(bool freeMemory) -{ - m_DynamicArray.clear(); - if (freeMemory) - { - m_DynamicArray.shrink_to_fit(); - } - m_Count = 0; -} - -template -void VmaSmallVector::insert(size_t index, const T& src) -{ - VMA_HEAVY_ASSERT(index <= m_Count); - const size_t oldCount = size(); - resize(oldCount + 1); - T* const dataPtr = data(); - if (index < oldCount) - { - // I know, this could be more optimal for case where memmove can be memcpy directly from m_StaticArray to m_DynamicArray. - memmove(dataPtr + (index + 1), dataPtr + index, (oldCount - index) * sizeof(T)); - } - dataPtr[index] = src; -} - -template -void VmaSmallVector::remove(size_t index) -{ - VMA_HEAVY_ASSERT(index < m_Count); - const size_t oldCount = size(); - if (index < oldCount - 1) - { - // I know, this could be more optimal for case where memmove can be memcpy directly from m_DynamicArray to m_StaticArray. 
- T* const dataPtr = data(); - memmove(dataPtr + index, dataPtr + (index + 1), (oldCount - index - 1) * sizeof(T)); - } - resize(oldCount - 1); -} -#endif // _VMA_SMALL_VECTOR_FUNCTIONS -#endif // _VMA_SMALL_VECTOR - -#ifndef _VMA_POOL_ALLOCATOR -/* -Allocator for objects of type T using a list of arrays (pools) to speed up -allocation. Number of elements that can be allocated is not bounded because -allocator can create multiple blocks. -*/ -template -class VmaPoolAllocator -{ - VMA_CLASS_NO_COPY(VmaPoolAllocator) -public: - VmaPoolAllocator(const VkAllocationCallbacks* pAllocationCallbacks, uint32_t firstBlockCapacity); - ~VmaPoolAllocator(); - template T* Alloc(Types&&... args); - void Free(T* ptr); - -private: - union Item - { - uint32_t NextFreeIndex; - alignas(T) char Value[sizeof(T)]; - }; - struct ItemBlock - { - Item* pItems; - uint32_t Capacity; - uint32_t FirstFreeIndex; - }; - - const VkAllocationCallbacks* m_pAllocationCallbacks; - const uint32_t m_FirstBlockCapacity; - VmaVector> m_ItemBlocks; - - ItemBlock& CreateNewBlock(); -}; - -#ifndef _VMA_POOL_ALLOCATOR_FUNCTIONS -template -VmaPoolAllocator::VmaPoolAllocator(const VkAllocationCallbacks* pAllocationCallbacks, uint32_t firstBlockCapacity) - : m_pAllocationCallbacks(pAllocationCallbacks), - m_FirstBlockCapacity(firstBlockCapacity), - m_ItemBlocks(VmaStlAllocator(pAllocationCallbacks)) -{ - VMA_ASSERT(m_FirstBlockCapacity > 1); -} - -template -VmaPoolAllocator::~VmaPoolAllocator() -{ - for (size_t i = m_ItemBlocks.size(); i--;) - vma_delete_array(m_pAllocationCallbacks, m_ItemBlocks[i].pItems, m_ItemBlocks[i].Capacity); - m_ItemBlocks.clear(); -} - -template -template T* VmaPoolAllocator::Alloc(Types&&... args) -{ - for (size_t i = m_ItemBlocks.size(); i--; ) - { - ItemBlock& block = m_ItemBlocks[i]; - // This block has some free items: Use first one. - if (block.FirstFreeIndex != UINT32_MAX) - { - Item* const pItem = &block.pItems[block.FirstFreeIndex]; - block.FirstFreeIndex = pItem->NextFreeIndex; - T* result = (T*)&pItem->Value; - new(result)T(std::forward(args)...); // Explicit constructor call. - return result; - } - } - - // No block has free item: Create new one and use it. - ItemBlock& newBlock = CreateNewBlock(); - Item* const pItem = &newBlock.pItems[0]; - newBlock.FirstFreeIndex = pItem->NextFreeIndex; - T* result = (T*)&pItem->Value; - new(result) T(std::forward(args)...); // Explicit constructor call. - return result; -} - -template -void VmaPoolAllocator::Free(T* ptr) -{ - // Search all memory blocks to find ptr. - for (size_t i = m_ItemBlocks.size(); i--; ) - { - ItemBlock& block = m_ItemBlocks[i]; - - // Casting to union. - Item* pItemPtr; - memcpy(&pItemPtr, &ptr, sizeof(pItemPtr)); - - // Check if pItemPtr is in address range of this block. - if ((pItemPtr >= block.pItems) && (pItemPtr < block.pItems + block.Capacity)) - { - ptr->~T(); // Explicit destructor call. - const uint32_t index = static_cast(pItemPtr - block.pItems); - pItemPtr->NextFreeIndex = block.FirstFreeIndex; - block.FirstFreeIndex = index; - return; - } - } - VMA_ASSERT(0 && "Pointer doesn't belong to this memory pool."); -} - -template -typename VmaPoolAllocator::ItemBlock& VmaPoolAllocator::CreateNewBlock() -{ - const uint32_t newBlockCapacity = m_ItemBlocks.empty() ? 
- m_FirstBlockCapacity : m_ItemBlocks.back().Capacity * 3 / 2; - - const ItemBlock newBlock = - { - vma_new_array(m_pAllocationCallbacks, Item, newBlockCapacity), - newBlockCapacity, - 0 - }; - - m_ItemBlocks.push_back(newBlock); - - // Setup singly-linked list of all free items in this block. - for (uint32_t i = 0; i < newBlockCapacity - 1; ++i) - newBlock.pItems[i].NextFreeIndex = i + 1; - newBlock.pItems[newBlockCapacity - 1].NextFreeIndex = UINT32_MAX; - return m_ItemBlocks.back(); -} -#endif // _VMA_POOL_ALLOCATOR_FUNCTIONS -#endif // _VMA_POOL_ALLOCATOR - -#ifndef _VMA_RAW_LIST -template -struct VmaListItem -{ - VmaListItem* pPrev; - VmaListItem* pNext; - T Value; -}; - -// Doubly linked list. -template -class VmaRawList -{ - VMA_CLASS_NO_COPY(VmaRawList) -public: - typedef VmaListItem ItemType; - - VmaRawList(const VkAllocationCallbacks* pAllocationCallbacks); - // Intentionally not calling Clear, because that would be unnecessary - // computations to return all items to m_ItemAllocator as free. - ~VmaRawList() = default; - - size_t GetCount() const { return m_Count; } - bool IsEmpty() const { return m_Count == 0; } - - ItemType* Front() { return m_pFront; } - ItemType* Back() { return m_pBack; } - const ItemType* Front() const { return m_pFront; } - const ItemType* Back() const { return m_pBack; } - - ItemType* PushFront(); - ItemType* PushBack(); - ItemType* PushFront(const T& value); - ItemType* PushBack(const T& value); - void PopFront(); - void PopBack(); - - // Item can be null - it means PushBack. - ItemType* InsertBefore(ItemType* pItem); - // Item can be null - it means PushFront. - ItemType* InsertAfter(ItemType* pItem); - ItemType* InsertBefore(ItemType* pItem, const T& value); - ItemType* InsertAfter(ItemType* pItem, const T& value); - - void Clear(); - void Remove(ItemType* pItem); - -private: - const VkAllocationCallbacks* const m_pAllocationCallbacks; - VmaPoolAllocator m_ItemAllocator; - ItemType* m_pFront; - ItemType* m_pBack; - size_t m_Count; -}; - -#ifndef _VMA_RAW_LIST_FUNCTIONS -template -VmaRawList::VmaRawList(const VkAllocationCallbacks* pAllocationCallbacks) - : m_pAllocationCallbacks(pAllocationCallbacks), - m_ItemAllocator(pAllocationCallbacks, 128), - m_pFront(VMA_NULL), - m_pBack(VMA_NULL), - m_Count(0) {} - -template -VmaListItem* VmaRawList::PushFront() -{ - ItemType* const pNewItem = m_ItemAllocator.Alloc(); - pNewItem->pPrev = VMA_NULL; - if (IsEmpty()) - { - pNewItem->pNext = VMA_NULL; - m_pFront = pNewItem; - m_pBack = pNewItem; - m_Count = 1; - } - else - { - pNewItem->pNext = m_pFront; - m_pFront->pPrev = pNewItem; - m_pFront = pNewItem; - ++m_Count; - } - return pNewItem; -} - -template -VmaListItem* VmaRawList::PushBack() -{ - ItemType* const pNewItem = m_ItemAllocator.Alloc(); - pNewItem->pNext = VMA_NULL; - if(IsEmpty()) - { - pNewItem->pPrev = VMA_NULL; - m_pFront = pNewItem; - m_pBack = pNewItem; - m_Count = 1; - } - else - { - pNewItem->pPrev = m_pBack; - m_pBack->pNext = pNewItem; - m_pBack = pNewItem; - ++m_Count; - } - return pNewItem; -} - -template -VmaListItem* VmaRawList::PushFront(const T& value) -{ - ItemType* const pNewItem = PushFront(); - pNewItem->Value = value; - return pNewItem; -} - -template -VmaListItem* VmaRawList::PushBack(const T& value) -{ - ItemType* const pNewItem = PushBack(); - pNewItem->Value = value; - return pNewItem; -} - -template -void VmaRawList::PopFront() -{ - VMA_HEAVY_ASSERT(m_Count > 0); - ItemType* const pFrontItem = m_pFront; - ItemType* const pNextItem = pFrontItem->pNext; - if (pNextItem != 
VMA_NULL) - { - pNextItem->pPrev = VMA_NULL; - } - m_pFront = pNextItem; - m_ItemAllocator.Free(pFrontItem); - --m_Count; -} - -template -void VmaRawList::PopBack() -{ - VMA_HEAVY_ASSERT(m_Count > 0); - ItemType* const pBackItem = m_pBack; - ItemType* const pPrevItem = pBackItem->pPrev; - if(pPrevItem != VMA_NULL) - { - pPrevItem->pNext = VMA_NULL; - } - m_pBack = pPrevItem; - m_ItemAllocator.Free(pBackItem); - --m_Count; -} - -template -void VmaRawList::Clear() -{ - if (IsEmpty() == false) - { - ItemType* pItem = m_pBack; - while (pItem != VMA_NULL) - { - ItemType* const pPrevItem = pItem->pPrev; - m_ItemAllocator.Free(pItem); - pItem = pPrevItem; - } - m_pFront = VMA_NULL; - m_pBack = VMA_NULL; - m_Count = 0; - } -} - -template -void VmaRawList::Remove(ItemType* pItem) -{ - VMA_HEAVY_ASSERT(pItem != VMA_NULL); - VMA_HEAVY_ASSERT(m_Count > 0); - - if(pItem->pPrev != VMA_NULL) - { - pItem->pPrev->pNext = pItem->pNext; - } - else - { - VMA_HEAVY_ASSERT(m_pFront == pItem); - m_pFront = pItem->pNext; - } - - if(pItem->pNext != VMA_NULL) - { - pItem->pNext->pPrev = pItem->pPrev; - } - else - { - VMA_HEAVY_ASSERT(m_pBack == pItem); - m_pBack = pItem->pPrev; - } - - m_ItemAllocator.Free(pItem); - --m_Count; -} - -template -VmaListItem* VmaRawList::InsertBefore(ItemType* pItem) -{ - if(pItem != VMA_NULL) - { - ItemType* const prevItem = pItem->pPrev; - ItemType* const newItem = m_ItemAllocator.Alloc(); - newItem->pPrev = prevItem; - newItem->pNext = pItem; - pItem->pPrev = newItem; - if(prevItem != VMA_NULL) - { - prevItem->pNext = newItem; - } - else - { - VMA_HEAVY_ASSERT(m_pFront == pItem); - m_pFront = newItem; - } - ++m_Count; - return newItem; - } - else - return PushBack(); -} - -template -VmaListItem* VmaRawList::InsertAfter(ItemType* pItem) -{ - if(pItem != VMA_NULL) - { - ItemType* const nextItem = pItem->pNext; - ItemType* const newItem = m_ItemAllocator.Alloc(); - newItem->pNext = nextItem; - newItem->pPrev = pItem; - pItem->pNext = newItem; - if(nextItem != VMA_NULL) - { - nextItem->pPrev = newItem; - } - else - { - VMA_HEAVY_ASSERT(m_pBack == pItem); - m_pBack = newItem; - } - ++m_Count; - return newItem; - } - else - return PushFront(); -} - -template -VmaListItem* VmaRawList::InsertBefore(ItemType* pItem, const T& value) -{ - ItemType* const newItem = InsertBefore(pItem); - newItem->Value = value; - return newItem; -} - -template -VmaListItem* VmaRawList::InsertAfter(ItemType* pItem, const T& value) -{ - ItemType* const newItem = InsertAfter(pItem); - newItem->Value = value; - return newItem; -} -#endif // _VMA_RAW_LIST_FUNCTIONS -#endif // _VMA_RAW_LIST - -#ifndef _VMA_LIST -template -class VmaList -{ - VMA_CLASS_NO_COPY(VmaList) -public: - class reverse_iterator; - class const_iterator; - class const_reverse_iterator; - - class iterator - { - friend class const_iterator; - friend class VmaList; - public: - iterator() : m_pList(VMA_NULL), m_pItem(VMA_NULL) {} - iterator(const reverse_iterator& src) : m_pList(src.m_pList), m_pItem(src.m_pItem) {} - - T& operator*() const { VMA_HEAVY_ASSERT(m_pItem != VMA_NULL); return m_pItem->Value; } - T* operator->() const { VMA_HEAVY_ASSERT(m_pItem != VMA_NULL); return &m_pItem->Value; } - - bool operator==(const iterator& rhs) const { VMA_HEAVY_ASSERT(m_pList == rhs.m_pList); return m_pItem == rhs.m_pItem; } - bool operator!=(const iterator& rhs) const { VMA_HEAVY_ASSERT(m_pList == rhs.m_pList); return m_pItem != rhs.m_pItem; } - - iterator operator++(int) { iterator result = *this; ++*this; return result; } - iterator operator--(int) { 
iterator result = *this; --*this; return result; } - - iterator& operator++() { VMA_HEAVY_ASSERT(m_pItem != VMA_NULL); m_pItem = m_pItem->pNext; return *this; } - iterator& operator--(); - - private: - VmaRawList* m_pList; - VmaListItem* m_pItem; - - iterator(VmaRawList* pList, VmaListItem* pItem) : m_pList(pList), m_pItem(pItem) {} - }; - class reverse_iterator - { - friend class const_reverse_iterator; - friend class VmaList; - public: - reverse_iterator() : m_pList(VMA_NULL), m_pItem(VMA_NULL) {} - reverse_iterator(const iterator& src) : m_pList(src.m_pList), m_pItem(src.m_pItem) {} - - T& operator*() const { VMA_HEAVY_ASSERT(m_pItem != VMA_NULL); return m_pItem->Value; } - T* operator->() const { VMA_HEAVY_ASSERT(m_pItem != VMA_NULL); return &m_pItem->Value; } - - bool operator==(const reverse_iterator& rhs) const { VMA_HEAVY_ASSERT(m_pList == rhs.m_pList); return m_pItem == rhs.m_pItem; } - bool operator!=(const reverse_iterator& rhs) const { VMA_HEAVY_ASSERT(m_pList == rhs.m_pList); return m_pItem != rhs.m_pItem; } - - reverse_iterator operator++(int) { reverse_iterator result = *this; ++* this; return result; } - reverse_iterator operator--(int) { reverse_iterator result = *this; --* this; return result; } - - reverse_iterator& operator++() { VMA_HEAVY_ASSERT(m_pItem != VMA_NULL); m_pItem = m_pItem->pPrev; return *this; } - reverse_iterator& operator--(); - - private: - VmaRawList* m_pList; - VmaListItem* m_pItem; - - reverse_iterator(VmaRawList* pList, VmaListItem* pItem) : m_pList(pList), m_pItem(pItem) {} - }; - class const_iterator - { - friend class VmaList; - public: - const_iterator() : m_pList(VMA_NULL), m_pItem(VMA_NULL) {} - const_iterator(const iterator& src) : m_pList(src.m_pList), m_pItem(src.m_pItem) {} - const_iterator(const reverse_iterator& src) : m_pList(src.m_pList), m_pItem(src.m_pItem) {} - - iterator drop_const() { return { const_cast*>(m_pList), const_cast*>(m_pItem) }; } - - const T& operator*() const { VMA_HEAVY_ASSERT(m_pItem != VMA_NULL); return m_pItem->Value; } - const T* operator->() const { VMA_HEAVY_ASSERT(m_pItem != VMA_NULL); return &m_pItem->Value; } - - bool operator==(const const_iterator& rhs) const { VMA_HEAVY_ASSERT(m_pList == rhs.m_pList); return m_pItem == rhs.m_pItem; } - bool operator!=(const const_iterator& rhs) const { VMA_HEAVY_ASSERT(m_pList == rhs.m_pList); return m_pItem != rhs.m_pItem; } - - const_iterator operator++(int) { const_iterator result = *this; ++* this; return result; } - const_iterator operator--(int) { const_iterator result = *this; --* this; return result; } - - const_iterator& operator++() { VMA_HEAVY_ASSERT(m_pItem != VMA_NULL); m_pItem = m_pItem->pNext; return *this; } - const_iterator& operator--(); - - private: - const VmaRawList* m_pList; - const VmaListItem* m_pItem; - - const_iterator(const VmaRawList* pList, const VmaListItem* pItem) : m_pList(pList), m_pItem(pItem) {} - }; - class const_reverse_iterator - { - friend class VmaList; - public: - const_reverse_iterator() : m_pList(VMA_NULL), m_pItem(VMA_NULL) {} - const_reverse_iterator(const reverse_iterator& src) : m_pList(src.m_pList), m_pItem(src.m_pItem) {} - const_reverse_iterator(const iterator& src) : m_pList(src.m_pList), m_pItem(src.m_pItem) {} - - reverse_iterator drop_const() { return { const_cast*>(m_pList), const_cast*>(m_pItem) }; } - - const T& operator*() const { VMA_HEAVY_ASSERT(m_pItem != VMA_NULL); return m_pItem->Value; } - const T* operator->() const { VMA_HEAVY_ASSERT(m_pItem != VMA_NULL); return &m_pItem->Value; } - - bool operator==(const 
const_reverse_iterator& rhs) const { VMA_HEAVY_ASSERT(m_pList == rhs.m_pList); return m_pItem == rhs.m_pItem; } - bool operator!=(const const_reverse_iterator& rhs) const { VMA_HEAVY_ASSERT(m_pList == rhs.m_pList); return m_pItem != rhs.m_pItem; } - - const_reverse_iterator operator++(int) { const_reverse_iterator result = *this; ++* this; return result; } - const_reverse_iterator operator--(int) { const_reverse_iterator result = *this; --* this; return result; } - - const_reverse_iterator& operator++() { VMA_HEAVY_ASSERT(m_pItem != VMA_NULL); m_pItem = m_pItem->pPrev; return *this; } - const_reverse_iterator& operator--(); - - private: - const VmaRawList* m_pList; - const VmaListItem* m_pItem; - - const_reverse_iterator(const VmaRawList* pList, const VmaListItem* pItem) : m_pList(pList), m_pItem(pItem) {} - }; - - VmaList(const AllocatorT& allocator) : m_RawList(allocator.m_pCallbacks) {} - - bool empty() const { return m_RawList.IsEmpty(); } - size_t size() const { return m_RawList.GetCount(); } - - iterator begin() { return iterator(&m_RawList, m_RawList.Front()); } - iterator end() { return iterator(&m_RawList, VMA_NULL); } - - const_iterator cbegin() const { return const_iterator(&m_RawList, m_RawList.Front()); } - const_iterator cend() const { return const_iterator(&m_RawList, VMA_NULL); } - - const_iterator begin() const { return cbegin(); } - const_iterator end() const { return cend(); } - - reverse_iterator rbegin() { return reverse_iterator(&m_RawList, m_RawList.Back()); } - reverse_iterator rend() { return reverse_iterator(&m_RawList, VMA_NULL); } - - const_reverse_iterator crbegin() const { return const_reverse_iterator(&m_RawList, m_RawList.Back()); } - const_reverse_iterator crend() const { return const_reverse_iterator(&m_RawList, VMA_NULL); } - - const_reverse_iterator rbegin() const { return crbegin(); } - const_reverse_iterator rend() const { return crend(); } - - void push_back(const T& value) { m_RawList.PushBack(value); } - iterator insert(iterator it, const T& value) { return iterator(&m_RawList, m_RawList.InsertBefore(it.m_pItem, value)); } - - void clear() { m_RawList.Clear(); } - void erase(iterator it) { m_RawList.Remove(it.m_pItem); } - -private: - VmaRawList m_RawList; -}; - -#ifndef _VMA_LIST_FUNCTIONS -template -typename VmaList::iterator& VmaList::iterator::operator--() -{ - if (m_pItem != VMA_NULL) - { - m_pItem = m_pItem->pPrev; - } - else - { - VMA_HEAVY_ASSERT(!m_pList->IsEmpty()); - m_pItem = m_pList->Back(); - } - return *this; -} - -template -typename VmaList::reverse_iterator& VmaList::reverse_iterator::operator--() -{ - if (m_pItem != VMA_NULL) - { - m_pItem = m_pItem->pNext; - } - else - { - VMA_HEAVY_ASSERT(!m_pList->IsEmpty()); - m_pItem = m_pList->Front(); - } - return *this; -} - -template -typename VmaList::const_iterator& VmaList::const_iterator::operator--() -{ - if (m_pItem != VMA_NULL) - { - m_pItem = m_pItem->pPrev; - } - else - { - VMA_HEAVY_ASSERT(!m_pList->IsEmpty()); - m_pItem = m_pList->Back(); - } - return *this; -} - -template -typename VmaList::const_reverse_iterator& VmaList::const_reverse_iterator::operator--() -{ - if (m_pItem != VMA_NULL) - { - m_pItem = m_pItem->pNext; - } - else - { - VMA_HEAVY_ASSERT(!m_pList->IsEmpty()); - m_pItem = m_pList->Back(); - } - return *this; -} -#endif // _VMA_LIST_FUNCTIONS -#endif // _VMA_LIST - -#ifndef _VMA_INTRUSIVE_LINKED_LIST -/* -Expected interface of ItemTypeTraits: -struct MyItemTypeTraits -{ - typedef MyItem ItemType; - static ItemType* GetPrev(const ItemType* item) { return 
item->myPrevPtr; } - static ItemType* GetNext(const ItemType* item) { return item->myNextPtr; } - static ItemType*& AccessPrev(ItemType* item) { return item->myPrevPtr; } - static ItemType*& AccessNext(ItemType* item) { return item->myNextPtr; } -}; -*/ -template -class VmaIntrusiveLinkedList -{ -public: - typedef typename ItemTypeTraits::ItemType ItemType; - static ItemType* GetPrev(const ItemType* item) { return ItemTypeTraits::GetPrev(item); } - static ItemType* GetNext(const ItemType* item) { return ItemTypeTraits::GetNext(item); } - - // Movable, not copyable. - VmaIntrusiveLinkedList() = default; - VmaIntrusiveLinkedList(VmaIntrusiveLinkedList && src); - VmaIntrusiveLinkedList(const VmaIntrusiveLinkedList&) = delete; - VmaIntrusiveLinkedList& operator=(VmaIntrusiveLinkedList&& src); - VmaIntrusiveLinkedList& operator=(const VmaIntrusiveLinkedList&) = delete; - ~VmaIntrusiveLinkedList() { VMA_HEAVY_ASSERT(IsEmpty()); } - - size_t GetCount() const { return m_Count; } - bool IsEmpty() const { return m_Count == 0; } - ItemType* Front() { return m_Front; } - ItemType* Back() { return m_Back; } - const ItemType* Front() const { return m_Front; } - const ItemType* Back() const { return m_Back; } - - void PushBack(ItemType* item); - void PushFront(ItemType* item); - ItemType* PopBack(); - ItemType* PopFront(); - - // MyItem can be null - it means PushBack. - void InsertBefore(ItemType* existingItem, ItemType* newItem); - // MyItem can be null - it means PushFront. - void InsertAfter(ItemType* existingItem, ItemType* newItem); - void Remove(ItemType* item); - void RemoveAll(); - -private: - ItemType* m_Front = VMA_NULL; - ItemType* m_Back = VMA_NULL; - size_t m_Count = 0; -}; - -#ifndef _VMA_INTRUSIVE_LINKED_LIST_FUNCTIONS -template -VmaIntrusiveLinkedList::VmaIntrusiveLinkedList(VmaIntrusiveLinkedList&& src) - : m_Front(src.m_Front), m_Back(src.m_Back), m_Count(src.m_Count) -{ - src.m_Front = src.m_Back = VMA_NULL; - src.m_Count = 0; -} - -template -VmaIntrusiveLinkedList& VmaIntrusiveLinkedList::operator=(VmaIntrusiveLinkedList&& src) -{ - if (&src != this) - { - VMA_HEAVY_ASSERT(IsEmpty()); - m_Front = src.m_Front; - m_Back = src.m_Back; - m_Count = src.m_Count; - src.m_Front = src.m_Back = VMA_NULL; - src.m_Count = 0; - } - return *this; -} - -template -void VmaIntrusiveLinkedList::PushBack(ItemType* item) -{ - VMA_HEAVY_ASSERT(ItemTypeTraits::GetPrev(item) == VMA_NULL && ItemTypeTraits::GetNext(item) == VMA_NULL); - if (IsEmpty()) - { - m_Front = item; - m_Back = item; - m_Count = 1; - } - else - { - ItemTypeTraits::AccessPrev(item) = m_Back; - ItemTypeTraits::AccessNext(m_Back) = item; - m_Back = item; - ++m_Count; - } -} - -template -void VmaIntrusiveLinkedList::PushFront(ItemType* item) -{ - VMA_HEAVY_ASSERT(ItemTypeTraits::GetPrev(item) == VMA_NULL && ItemTypeTraits::GetNext(item) == VMA_NULL); - if (IsEmpty()) - { - m_Front = item; - m_Back = item; - m_Count = 1; - } - else - { - ItemTypeTraits::AccessNext(item) = m_Front; - ItemTypeTraits::AccessPrev(m_Front) = item; - m_Front = item; - ++m_Count; - } -} - -template -typename VmaIntrusiveLinkedList::ItemType* VmaIntrusiveLinkedList::PopBack() -{ - VMA_HEAVY_ASSERT(m_Count > 0); - ItemType* const backItem = m_Back; - ItemType* const prevItem = ItemTypeTraits::GetPrev(backItem); - if (prevItem != VMA_NULL) - { - ItemTypeTraits::AccessNext(prevItem) = VMA_NULL; - } - m_Back = prevItem; - --m_Count; - ItemTypeTraits::AccessPrev(backItem) = VMA_NULL; - ItemTypeTraits::AccessNext(backItem) = VMA_NULL; - return backItem; -} - -template 
-typename VmaIntrusiveLinkedList::ItemType* VmaIntrusiveLinkedList::PopFront() -{ - VMA_HEAVY_ASSERT(m_Count > 0); - ItemType* const frontItem = m_Front; - ItemType* const nextItem = ItemTypeTraits::GetNext(frontItem); - if (nextItem != VMA_NULL) - { - ItemTypeTraits::AccessPrev(nextItem) = VMA_NULL; - } - m_Front = nextItem; - --m_Count; - ItemTypeTraits::AccessPrev(frontItem) = VMA_NULL; - ItemTypeTraits::AccessNext(frontItem) = VMA_NULL; - return frontItem; -} - -template -void VmaIntrusiveLinkedList::InsertBefore(ItemType* existingItem, ItemType* newItem) -{ - VMA_HEAVY_ASSERT(newItem != VMA_NULL && ItemTypeTraits::GetPrev(newItem) == VMA_NULL && ItemTypeTraits::GetNext(newItem) == VMA_NULL); - if (existingItem != VMA_NULL) - { - ItemType* const prevItem = ItemTypeTraits::GetPrev(existingItem); - ItemTypeTraits::AccessPrev(newItem) = prevItem; - ItemTypeTraits::AccessNext(newItem) = existingItem; - ItemTypeTraits::AccessPrev(existingItem) = newItem; - if (prevItem != VMA_NULL) - { - ItemTypeTraits::AccessNext(prevItem) = newItem; - } - else - { - VMA_HEAVY_ASSERT(m_Front == existingItem); - m_Front = newItem; - } - ++m_Count; - } - else - PushBack(newItem); -} - -template -void VmaIntrusiveLinkedList::InsertAfter(ItemType* existingItem, ItemType* newItem) -{ - VMA_HEAVY_ASSERT(newItem != VMA_NULL && ItemTypeTraits::GetPrev(newItem) == VMA_NULL && ItemTypeTraits::GetNext(newItem) == VMA_NULL); - if (existingItem != VMA_NULL) - { - ItemType* const nextItem = ItemTypeTraits::GetNext(existingItem); - ItemTypeTraits::AccessNext(newItem) = nextItem; - ItemTypeTraits::AccessPrev(newItem) = existingItem; - ItemTypeTraits::AccessNext(existingItem) = newItem; - if (nextItem != VMA_NULL) - { - ItemTypeTraits::AccessPrev(nextItem) = newItem; - } - else - { - VMA_HEAVY_ASSERT(m_Back == existingItem); - m_Back = newItem; - } - ++m_Count; - } - else - return PushFront(newItem); -} - -template -void VmaIntrusiveLinkedList::Remove(ItemType* item) -{ - VMA_HEAVY_ASSERT(item != VMA_NULL && m_Count > 0); - if (ItemTypeTraits::GetPrev(item) != VMA_NULL) - { - ItemTypeTraits::AccessNext(ItemTypeTraits::AccessPrev(item)) = ItemTypeTraits::GetNext(item); - } - else - { - VMA_HEAVY_ASSERT(m_Front == item); - m_Front = ItemTypeTraits::GetNext(item); - } - - if (ItemTypeTraits::GetNext(item) != VMA_NULL) - { - ItemTypeTraits::AccessPrev(ItemTypeTraits::AccessNext(item)) = ItemTypeTraits::GetPrev(item); - } - else - { - VMA_HEAVY_ASSERT(m_Back == item); - m_Back = ItemTypeTraits::GetPrev(item); - } - ItemTypeTraits::AccessPrev(item) = VMA_NULL; - ItemTypeTraits::AccessNext(item) = VMA_NULL; - --m_Count; -} - -template -void VmaIntrusiveLinkedList::RemoveAll() -{ - if (!IsEmpty()) - { - ItemType* item = m_Back; - while (item != VMA_NULL) - { - ItemType* const prevItem = ItemTypeTraits::AccessPrev(item); - ItemTypeTraits::AccessPrev(item) = VMA_NULL; - ItemTypeTraits::AccessNext(item) = VMA_NULL; - item = prevItem; - } - m_Front = VMA_NULL; - m_Back = VMA_NULL; - m_Count = 0; - } -} -#endif // _VMA_INTRUSIVE_LINKED_LIST_FUNCTIONS -#endif // _VMA_INTRUSIVE_LINKED_LIST - -// Unused in this version. 
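The "Expected interface of ItemTypeTraits" comment above spells out what VmaIntrusiveLinkedList needs from its traits parameter. A hedged sketch of a node type satisfying that contract follows; MyItem, its field names, and the usage comments are hypothetical and only mirror the documented interface.

```
// Hypothetical node type with embedded prev/next pointers (names are illustrative).
struct MyItem
{
    MyItem* myPrevPtr = nullptr;
    MyItem* myNextPtr = nullptr;
    int payload = 0;
};

// Traits type matching the interface documented above.
struct MyItemTypeTraits
{
    typedef MyItem ItemType;
    static ItemType* GetPrev(const ItemType* item) { return item->myPrevPtr; }
    static ItemType* GetNext(const ItemType* item) { return item->myNextPtr; }
    static ItemType*& AccessPrev(ItemType* item) { return item->myPrevPtr; }
    static ItemType*& AccessNext(ItemType* item) { return item->myNextPtr; }
};

// Usage sketch: the list only links nodes, it never owns or allocates them.
// VmaIntrusiveLinkedList<MyItemTypeTraits> list;
// MyItem a, b;
// list.PushBack(&a);
// list.PushBack(&b);
// list.Remove(&a);
// list.RemoveAll();   // the destructor asserts the list is empty
```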
-#if 0 - -#ifndef _VMA_PAIR -template -struct VmaPair -{ - T1 first; - T2 second; - - VmaPair() : first(), second() {} - VmaPair(const T1& firstSrc, const T2& secondSrc) : first(firstSrc), second(secondSrc) {} -}; - -template -struct VmaPairFirstLess -{ - bool operator()(const VmaPair& lhs, const VmaPair& rhs) const - { - return lhs.first < rhs.first; - } - bool operator()(const VmaPair& lhs, const FirstT& rhsFirst) const - { - return lhs.first < rhsFirst; - } -}; -#endif // _VMA_PAIR - -#ifndef _VMA_MAP -/* Class compatible with subset of interface of std::unordered_map. -KeyT, ValueT must be POD because they will be stored in VmaVector. -*/ -template -class VmaMap -{ -public: - typedef VmaPair PairType; - typedef PairType* iterator; - - VmaMap(const VmaStlAllocator& allocator) : m_Vector(allocator) {} - - iterator begin() { return m_Vector.begin(); } - iterator end() { return m_Vector.end(); } - size_t size() { return m_Vector.size(); } - - void insert(const PairType& pair); - iterator find(const KeyT& key); - void erase(iterator it); - -private: - VmaVector< PairType, VmaStlAllocator> m_Vector; -}; - -#ifndef _VMA_MAP_FUNCTIONS -template -void VmaMap::insert(const PairType& pair) -{ - const size_t indexToInsert = VmaBinaryFindFirstNotLess( - m_Vector.data(), - m_Vector.data() + m_Vector.size(), - pair, - VmaPairFirstLess()) - m_Vector.data(); - VmaVectorInsert(m_Vector, indexToInsert, pair); -} - -template -VmaPair* VmaMap::find(const KeyT& key) -{ - PairType* it = VmaBinaryFindFirstNotLess( - m_Vector.data(), - m_Vector.data() + m_Vector.size(), - key, - VmaPairFirstLess()); - if ((it != m_Vector.end()) && (it->first == key)) - { - return it; - } - else - { - return m_Vector.end(); - } -} - -template -void VmaMap::erase(iterator it) -{ - VmaVectorRemove(m_Vector, it - m_Vector.begin()); -} -#endif // _VMA_MAP_FUNCTIONS -#endif // _VMA_MAP - -#endif // #if 0 - -#if !defined(_VMA_STRING_BUILDER) && VMA_STATS_STRING_ENABLED -class VmaStringBuilder -{ -public: - VmaStringBuilder(const VkAllocationCallbacks* allocationCallbacks) : m_Data(VmaStlAllocator(allocationCallbacks)) {} - ~VmaStringBuilder() = default; - - size_t GetLength() const { return m_Data.size(); } - const char* GetData() const { return m_Data.data(); } - void AddNewLine() { Add('\n'); } - void Add(char ch) { m_Data.push_back(ch); } - - void Add(const char* pStr); - void AddNumber(uint32_t num); - void AddNumber(uint64_t num); - void AddPointer(const void* ptr); - -private: - VmaVector> m_Data; -}; - -#ifndef _VMA_STRING_BUILDER_FUNCTIONS -void VmaStringBuilder::Add(const char* pStr) -{ - const size_t strLen = strlen(pStr); - if (strLen > 0) - { - const size_t oldCount = m_Data.size(); - m_Data.resize(oldCount + strLen); - memcpy(m_Data.data() + oldCount, pStr, strLen); - } -} - -void VmaStringBuilder::AddNumber(uint32_t num) -{ - char buf[11]; - buf[10] = '\0'; - char* p = &buf[10]; - do - { - *--p = '0' + (num % 10); - num /= 10; - } while (num); - Add(p); -} - -void VmaStringBuilder::AddNumber(uint64_t num) -{ - char buf[21]; - buf[20] = '\0'; - char* p = &buf[20]; - do - { - *--p = '0' + (num % 10); - num /= 10; - } while (num); - Add(p); -} - -void VmaStringBuilder::AddPointer(const void* ptr) -{ - char buf[21]; - VmaPtrToStr(buf, sizeof(buf), ptr); - Add(buf); -} -#endif //_VMA_STRING_BUILDER_FUNCTIONS -#endif // _VMA_STRING_BUILDER - -#if !defined(_VMA_JSON_WRITER) && VMA_STATS_STRING_ENABLED -/* -Allows to conveniently build a correct JSON document to be written to the -VmaStringBuilder passed to the constructor. 
-*/ -class VmaJsonWriter -{ - VMA_CLASS_NO_COPY(VmaJsonWriter) -public: - // sb - string builder to write the document to. Must remain alive for the whole lifetime of this object. - VmaJsonWriter(const VkAllocationCallbacks* pAllocationCallbacks, VmaStringBuilder& sb); - ~VmaJsonWriter(); - - // Begins object by writing "{". - // Inside an object, you must call pairs of WriteString and a value, e.g.: - // j.BeginObject(true); j.WriteString("A"); j.WriteNumber(1); j.WriteString("B"); j.WriteNumber(2); j.EndObject(); - // Will write: { "A": 1, "B": 2 } - void BeginObject(bool singleLine = false); - // Ends object by writing "}". - void EndObject(); - - // Begins array by writing "[". - // Inside an array, you can write a sequence of any values. - void BeginArray(bool singleLine = false); - // Ends array by writing "[". - void EndArray(); - - // Writes a string value inside "". - // pStr can contain any ANSI characters, including '"', new line etc. - they will be properly escaped. - void WriteString(const char* pStr); - - // Begins writing a string value. - // Call BeginString, ContinueString, ContinueString, ..., EndString instead of - // WriteString to conveniently build the string content incrementally, made of - // parts including numbers. - void BeginString(const char* pStr = VMA_NULL); - // Posts next part of an open string. - void ContinueString(const char* pStr); - // Posts next part of an open string. The number is converted to decimal characters. - void ContinueString(uint32_t n); - void ContinueString(uint64_t n); - void ContinueString_Size(size_t n); - // Posts next part of an open string. Pointer value is converted to characters - // using "%p" formatting - shown as hexadecimal number, e.g.: 000000081276Ad00 - void ContinueString_Pointer(const void* ptr); - // Ends writing a string value by writing '"'. - void EndString(const char* pStr = VMA_NULL); - - // Writes a number value. - void WriteNumber(uint32_t n); - void WriteNumber(uint64_t n); - void WriteSize(size_t n); - // Writes a boolean value - false or true. - void WriteBool(bool b); - // Writes a null value. 
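The BeginObject comment above already gives the intended call pattern for this writer. A slightly fuller usage sketch follows, assuming a VMA_STATS_STRING_ENABLED build; WriteExampleJson and the field names it writes are hypothetical, while the VmaStringBuilder/VmaJsonWriter calls match the declarations in this header.

```
// Hedged usage sketch for the writer declared above (VMA_STATS_STRING_ENABLED builds only).
void WriteExampleJson(const VkAllocationCallbacks* pCallbacks)
{
    VmaStringBuilder sb(pCallbacks);
    VmaJsonWriter json(pCallbacks, sb);

    json.BeginObject();                 // {
    json.WriteString("Name");           //   "Name":
    json.WriteString("ExamplePool");    //     "ExamplePool",
    json.WriteString("BlockSizes");     //   "BlockSizes":
    json.BeginArray(true);              //     [
    json.WriteNumber(uint32_t(64));     //       64,
    json.WriteNumber(uint32_t(128));    //       128
    json.EndArray();                    //     ],
    json.WriteString("Empty");          //   "Empty":
    json.WriteBool(true);               //     true
    json.EndObject();                   // }

    // sb.GetData() / sb.GetLength() now hold the finished document text.
}
```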
- void WriteNull(); - -private: - enum COLLECTION_TYPE - { - COLLECTION_TYPE_OBJECT, - COLLECTION_TYPE_ARRAY, - }; - struct StackItem - { - COLLECTION_TYPE type; - uint32_t valueCount; - bool singleLineMode; - }; - - static const char* const INDENT; - - VmaStringBuilder& m_SB; - VmaVector< StackItem, VmaStlAllocator > m_Stack; - bool m_InsideString; - - // Write size_t for less than 64bits - void WriteSize(size_t n, std::integral_constant) { m_SB.AddNumber(static_cast(n)); } - // Write size_t for 64bits - void WriteSize(size_t n, std::integral_constant) { m_SB.AddNumber(static_cast(n)); } - - void BeginValue(bool isString); - void WriteIndent(bool oneLess = false); -}; -const char* const VmaJsonWriter::INDENT = " "; - -#ifndef _VMA_JSON_WRITER_FUNCTIONS -VmaJsonWriter::VmaJsonWriter(const VkAllocationCallbacks* pAllocationCallbacks, VmaStringBuilder& sb) - : m_SB(sb), - m_Stack(VmaStlAllocator(pAllocationCallbacks)), - m_InsideString(false) {} - -VmaJsonWriter::~VmaJsonWriter() -{ - VMA_ASSERT(!m_InsideString); - VMA_ASSERT(m_Stack.empty()); -} - -void VmaJsonWriter::BeginObject(bool singleLine) -{ - VMA_ASSERT(!m_InsideString); - - BeginValue(false); - m_SB.Add('{'); - - StackItem item; - item.type = COLLECTION_TYPE_OBJECT; - item.valueCount = 0; - item.singleLineMode = singleLine; - m_Stack.push_back(item); -} - -void VmaJsonWriter::EndObject() -{ - VMA_ASSERT(!m_InsideString); - - WriteIndent(true); - m_SB.Add('}'); - - VMA_ASSERT(!m_Stack.empty() && m_Stack.back().type == COLLECTION_TYPE_OBJECT); - m_Stack.pop_back(); -} - -void VmaJsonWriter::BeginArray(bool singleLine) -{ - VMA_ASSERT(!m_InsideString); - - BeginValue(false); - m_SB.Add('['); - - StackItem item; - item.type = COLLECTION_TYPE_ARRAY; - item.valueCount = 0; - item.singleLineMode = singleLine; - m_Stack.push_back(item); -} - -void VmaJsonWriter::EndArray() -{ - VMA_ASSERT(!m_InsideString); - - WriteIndent(true); - m_SB.Add(']'); - - VMA_ASSERT(!m_Stack.empty() && m_Stack.back().type == COLLECTION_TYPE_ARRAY); - m_Stack.pop_back(); -} - -void VmaJsonWriter::WriteString(const char* pStr) -{ - BeginString(pStr); - EndString(); -} - -void VmaJsonWriter::BeginString(const char* pStr) -{ - VMA_ASSERT(!m_InsideString); - - BeginValue(true); - m_SB.Add('"'); - m_InsideString = true; - if (pStr != VMA_NULL && pStr[0] != '\0') - { - ContinueString(pStr); - } -} - -void VmaJsonWriter::ContinueString(const char* pStr) -{ - VMA_ASSERT(m_InsideString); - - const size_t strLen = strlen(pStr); - for (size_t i = 0; i < strLen; ++i) - { - char ch = pStr[i]; - if (ch == '\\') - { - m_SB.Add("\\\\"); - } - else if (ch == '"') - { - m_SB.Add("\\\""); - } - else if (ch >= 32) - { - m_SB.Add(ch); - } - else switch (ch) - { - case '\b': - m_SB.Add("\\b"); - break; - case '\f': - m_SB.Add("\\f"); - break; - case '\n': - m_SB.Add("\\n"); - break; - case '\r': - m_SB.Add("\\r"); - break; - case '\t': - m_SB.Add("\\t"); - break; - default: - VMA_ASSERT(0 && "Character not currently supported."); - break; - } - } -} - -void VmaJsonWriter::ContinueString(uint32_t n) -{ - VMA_ASSERT(m_InsideString); - m_SB.AddNumber(n); -} - -void VmaJsonWriter::ContinueString(uint64_t n) -{ - VMA_ASSERT(m_InsideString); - m_SB.AddNumber(n); -} - -void VmaJsonWriter::ContinueString_Size(size_t n) -{ - VMA_ASSERT(m_InsideString); - // Fix for AppleClang incorrect type casting - // TODO: Change to if constexpr when C++17 used as minimal standard - WriteSize(n, std::is_same{}); -} - -void VmaJsonWriter::ContinueString_Pointer(const void* ptr) -{ - 
VMA_ASSERT(m_InsideString); - m_SB.AddPointer(ptr); -} - -void VmaJsonWriter::EndString(const char* pStr) -{ - VMA_ASSERT(m_InsideString); - if (pStr != VMA_NULL && pStr[0] != '\0') - { - ContinueString(pStr); - } - m_SB.Add('"'); - m_InsideString = false; -} - -void VmaJsonWriter::WriteNumber(uint32_t n) -{ - VMA_ASSERT(!m_InsideString); - BeginValue(false); - m_SB.AddNumber(n); -} - -void VmaJsonWriter::WriteNumber(uint64_t n) -{ - VMA_ASSERT(!m_InsideString); - BeginValue(false); - m_SB.AddNumber(n); -} - -void VmaJsonWriter::WriteSize(size_t n) -{ - VMA_ASSERT(!m_InsideString); - BeginValue(false); - // Fix for AppleClang incorrect type casting - // TODO: Change to if constexpr when C++17 used as minimal standard - WriteSize(n, std::is_same{}); -} - -void VmaJsonWriter::WriteBool(bool b) -{ - VMA_ASSERT(!m_InsideString); - BeginValue(false); - m_SB.Add(b ? "true" : "false"); -} - -void VmaJsonWriter::WriteNull() -{ - VMA_ASSERT(!m_InsideString); - BeginValue(false); - m_SB.Add("null"); -} - -void VmaJsonWriter::BeginValue(bool isString) -{ - if (!m_Stack.empty()) - { - StackItem& currItem = m_Stack.back(); - if (currItem.type == COLLECTION_TYPE_OBJECT && - currItem.valueCount % 2 == 0) - { - VMA_ASSERT(isString); - } - - if (currItem.type == COLLECTION_TYPE_OBJECT && - currItem.valueCount % 2 != 0) - { - m_SB.Add(": "); - } - else if (currItem.valueCount > 0) - { - m_SB.Add(", "); - WriteIndent(); - } - else - { - WriteIndent(); - } - ++currItem.valueCount; - } -} - -void VmaJsonWriter::WriteIndent(bool oneLess) -{ - if (!m_Stack.empty() && !m_Stack.back().singleLineMode) - { - m_SB.AddNewLine(); - - size_t count = m_Stack.size(); - if (count > 0 && oneLess) - { - --count; - } - for (size_t i = 0; i < count; ++i) - { - m_SB.Add(INDENT); - } - } -} -#endif // _VMA_JSON_WRITER_FUNCTIONS - -static void VmaPrintDetailedStatistics(VmaJsonWriter& json, const VmaDetailedStatistics& stat) -{ - json.BeginObject(); - - json.WriteString("BlockCount"); - json.WriteNumber(stat.statistics.blockCount); - json.WriteString("BlockBytes"); - json.WriteNumber(stat.statistics.blockBytes); - json.WriteString("AllocationCount"); - json.WriteNumber(stat.statistics.allocationCount); - json.WriteString("AllocationBytes"); - json.WriteNumber(stat.statistics.allocationBytes); - json.WriteString("UnusedRangeCount"); - json.WriteNumber(stat.unusedRangeCount); - - if (stat.statistics.allocationCount > 1) - { - json.WriteString("AllocationSizeMin"); - json.WriteNumber(stat.allocationSizeMin); - json.WriteString("AllocationSizeMax"); - json.WriteNumber(stat.allocationSizeMax); - } - if (stat.unusedRangeCount > 1) - { - json.WriteString("UnusedRangeSizeMin"); - json.WriteNumber(stat.unusedRangeSizeMin); - json.WriteString("UnusedRangeSizeMax"); - json.WriteNumber(stat.unusedRangeSizeMax); - } - json.EndObject(); -} -#endif // _VMA_JSON_WRITER - -#ifndef _VMA_MAPPING_HYSTERESIS - -class VmaMappingHysteresis -{ - VMA_CLASS_NO_COPY(VmaMappingHysteresis) -public: - VmaMappingHysteresis() = default; - - uint32_t GetExtraMapping() const { return m_ExtraMapping; } - - // Call when Map was called. - // Returns true if switched to extra +1 mapping reference count. 
- bool PostMap() - { -#if VMA_MAPPING_HYSTERESIS_ENABLED - if(m_ExtraMapping == 0) - { - ++m_MajorCounter; - if(m_MajorCounter >= COUNTER_MIN_EXTRA_MAPPING) - { - m_ExtraMapping = 1; - m_MajorCounter = 0; - m_MinorCounter = 0; - return true; - } - } - else // m_ExtraMapping == 1 - PostMinorCounter(); -#endif // #if VMA_MAPPING_HYSTERESIS_ENABLED - return false; - } - - // Call when Unmap was called. - void PostUnmap() - { -#if VMA_MAPPING_HYSTERESIS_ENABLED - if(m_ExtraMapping == 0) - ++m_MajorCounter; - else // m_ExtraMapping == 1 - PostMinorCounter(); -#endif // #if VMA_MAPPING_HYSTERESIS_ENABLED - } - - // Call when allocation was made from the memory block. - void PostAlloc() - { -#if VMA_MAPPING_HYSTERESIS_ENABLED - if(m_ExtraMapping == 1) - ++m_MajorCounter; - else // m_ExtraMapping == 0 - PostMinorCounter(); -#endif // #if VMA_MAPPING_HYSTERESIS_ENABLED - } - - // Call when allocation was freed from the memory block. - // Returns true if switched to extra -1 mapping reference count. - bool PostFree() - { -#if VMA_MAPPING_HYSTERESIS_ENABLED - if(m_ExtraMapping == 1) - { - ++m_MajorCounter; - if(m_MajorCounter >= COUNTER_MIN_EXTRA_MAPPING && - m_MajorCounter > m_MinorCounter + 1) - { - m_ExtraMapping = 0; - m_MajorCounter = 0; - m_MinorCounter = 0; - return true; - } - } - else // m_ExtraMapping == 0 - PostMinorCounter(); -#endif // #if VMA_MAPPING_HYSTERESIS_ENABLED - return false; - } - -private: - static const int32_t COUNTER_MIN_EXTRA_MAPPING = 7; - - uint32_t m_MinorCounter = 0; - uint32_t m_MajorCounter = 0; - uint32_t m_ExtraMapping = 0; // 0 or 1. - - void PostMinorCounter() - { - if(m_MinorCounter < m_MajorCounter) - { - ++m_MinorCounter; - } - else if(m_MajorCounter > 0) - { - --m_MajorCounter; - --m_MinorCounter; - } - } -}; - -#endif // _VMA_MAPPING_HYSTERESIS - -#ifndef _VMA_DEVICE_MEMORY_BLOCK -/* -Represents a single block of device memory (`VkDeviceMemory`) with all the -data about its regions (aka suballocations, #VmaAllocation), assigned and free. - -Thread-safety: -- Access to m_pMetadata must be externally synchronized. -- Map, Unmap, Bind* are synchronized internally. -*/ -class VmaDeviceMemoryBlock -{ - VMA_CLASS_NO_COPY(VmaDeviceMemoryBlock) -public: - VmaBlockMetadata* m_pMetadata; - - VmaDeviceMemoryBlock(VmaAllocator hAllocator); - ~VmaDeviceMemoryBlock(); - - // Always call after construction. - void Init( - VmaAllocator hAllocator, - VmaPool hParentPool, - uint32_t newMemoryTypeIndex, - VkDeviceMemory newMemory, - VkDeviceSize newSize, - uint32_t id, - uint32_t algorithm, - VkDeviceSize bufferImageGranularity); - // Always call before destruction. - void Destroy(VmaAllocator allocator); - - VmaPool GetParentPool() const { return m_hParentPool; } - VkDeviceMemory GetDeviceMemory() const { return m_hMemory; } - uint32_t GetMemoryTypeIndex() const { return m_MemoryTypeIndex; } - uint32_t GetId() const { return m_Id; } - void* GetMappedData() const { return m_pMappedData; } - uint32_t GetMapRefCount() const { return m_MapCount; } - - // Call when allocation/free was made from m_pMetadata. - // Used for m_MappingHysteresis. - void PostAlloc() { m_MappingHysteresis.PostAlloc(); } - void PostFree(VmaAllocator hAllocator); - - // Validates all data structures inside this object. If not valid, returns false. - bool Validate() const; - VkResult CheckCorruption(VmaAllocator hAllocator); - - // ppData can be null. 
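The PostMap/PostUnmap/PostAlloc/PostFree comments above describe when the owning block is expected to call into the hysteresis object and what the boolean results mean: true from PostMap() means "keep one extra mapping reference", true from PostFree() means "drop it again". The sketch below is a simplified illustration of that call pattern, not VMA's actual block implementation; MappedBlockSketch and its members are hypothetical.

```
// Hedged sketch: how an owning block could drive the hysteresis counters shown above.
#include <cstdint>

struct MappedBlockSketch
{
    VmaMappingHysteresis hysteresis;
    uint32_t mapRefCount = 0;

    void OnUserMap()
    {
        if (hysteresis.PostMap())
            ++mapRefCount;      // extra +1 kept alive across future unmaps
        ++mapRefCount;          // the user's own reference
        // The real vkMapMemory call happens only when mapRefCount goes 0 -> nonzero.
    }
    void OnUserUnmap()
    {
        hysteresis.PostUnmap();
        --mapRefCount;
        // The real vkUnmapMemory call happens only when mapRefCount reaches 0.
    }
    void OnAllocationMade() { hysteresis.PostAlloc(); }
    void OnAllocationFreed()
    {
        if (hysteresis.PostFree() && mapRefCount > 0)
            --mapRefCount;      // drop the extra reference added earlier
    }
};
```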
- VkResult Map(VmaAllocator hAllocator, uint32_t count, void** ppData); - void Unmap(VmaAllocator hAllocator, uint32_t count); - - VkResult WriteMagicValueAfterAllocation(VmaAllocator hAllocator, VkDeviceSize allocOffset, VkDeviceSize allocSize); - VkResult ValidateMagicValueAfterAllocation(VmaAllocator hAllocator, VkDeviceSize allocOffset, VkDeviceSize allocSize); - - VkResult BindBufferMemory( - const VmaAllocator hAllocator, - const VmaAllocation hAllocation, - VkDeviceSize allocationLocalOffset, - VkBuffer hBuffer, - const void* pNext); - VkResult BindImageMemory( - const VmaAllocator hAllocator, - const VmaAllocation hAllocation, - VkDeviceSize allocationLocalOffset, - VkImage hImage, - const void* pNext); - -private: - VmaPool m_hParentPool; // VK_NULL_HANDLE if not belongs to custom pool. - uint32_t m_MemoryTypeIndex; - uint32_t m_Id; - VkDeviceMemory m_hMemory; - - /* - Protects access to m_hMemory so it is not used by multiple threads simultaneously, e.g. vkMapMemory, vkBindBufferMemory. - Also protects m_MapCount, m_pMappedData. - Allocations, deallocations, any change in m_pMetadata is protected by parent's VmaBlockVector::m_Mutex. - */ - VMA_MUTEX m_MapAndBindMutex; - VmaMappingHysteresis m_MappingHysteresis; - uint32_t m_MapCount; - void* m_pMappedData; -}; -#endif // _VMA_DEVICE_MEMORY_BLOCK - -#ifndef _VMA_ALLOCATION_T -struct VmaAllocation_T -{ - friend struct VmaDedicatedAllocationListItemTraits; - - enum FLAGS - { - FLAG_PERSISTENT_MAP = 0x01, - FLAG_MAPPING_ALLOWED = 0x02, - }; - -public: - enum ALLOCATION_TYPE - { - ALLOCATION_TYPE_NONE, - ALLOCATION_TYPE_BLOCK, - ALLOCATION_TYPE_DEDICATED, - }; - - // This struct is allocated using VmaPoolAllocator. - VmaAllocation_T(bool mappingAllowed); - ~VmaAllocation_T(); - - void InitBlockAllocation( - VmaDeviceMemoryBlock* block, - VmaAllocHandle allocHandle, - VkDeviceSize alignment, - VkDeviceSize size, - uint32_t memoryTypeIndex, - VmaSuballocationType suballocationType, - bool mapped); - // pMappedData not null means allocation is created with MAPPED flag. 
- void InitDedicatedAllocation( - VmaPool hParentPool, - uint32_t memoryTypeIndex, - VkDeviceMemory hMemory, - VmaSuballocationType suballocationType, - void* pMappedData, - VkDeviceSize size); - - ALLOCATION_TYPE GetType() const { return (ALLOCATION_TYPE)m_Type; } - VkDeviceSize GetAlignment() const { return m_Alignment; } - VkDeviceSize GetSize() const { return m_Size; } - void* GetUserData() const { return m_pUserData; } - const char* GetName() const { return m_pName; } - VmaSuballocationType GetSuballocationType() const { return (VmaSuballocationType)m_SuballocationType; } - - VmaDeviceMemoryBlock* GetBlock() const { VMA_ASSERT(m_Type == ALLOCATION_TYPE_BLOCK); return m_BlockAllocation.m_Block; } - uint32_t GetMemoryTypeIndex() const { return m_MemoryTypeIndex; } - bool IsPersistentMap() const { return (m_Flags & FLAG_PERSISTENT_MAP) != 0; } - bool IsMappingAllowed() const { return (m_Flags & FLAG_MAPPING_ALLOWED) != 0; } - - void SetUserData(VmaAllocator hAllocator, void* pUserData) { m_pUserData = pUserData; } - void SetName(VmaAllocator hAllocator, const char* pName); - void FreeName(VmaAllocator hAllocator); - uint8_t SwapBlockAllocation(VmaAllocator hAllocator, VmaAllocation allocation); - VmaAllocHandle GetAllocHandle() const; - VkDeviceSize GetOffset() const; - VmaPool GetParentPool() const; - VkDeviceMemory GetMemory() const; - void* GetMappedData() const; - - void BlockAllocMap(); - void BlockAllocUnmap(); - VkResult DedicatedAllocMap(VmaAllocator hAllocator, void** ppData); - void DedicatedAllocUnmap(VmaAllocator hAllocator); - -#if VMA_STATS_STRING_ENABLED - uint32_t GetBufferImageUsage() const { return m_BufferImageUsage; } - - void InitBufferImageUsage(uint32_t bufferImageUsage); - void PrintParameters(class VmaJsonWriter& json) const; -#endif - -private: - // Allocation out of VmaDeviceMemoryBlock. - struct BlockAllocation - { - VmaDeviceMemoryBlock* m_Block; - VmaAllocHandle m_AllocHandle; - }; - // Allocation for an object that has its own private VkDeviceMemory. - struct DedicatedAllocation - { - VmaPool m_hParentPool; // VK_NULL_HANDLE if not belongs to custom pool. - VkDeviceMemory m_hMemory; - void* m_pMappedData; // Not null means memory is mapped. - VmaAllocation_T* m_Prev; - VmaAllocation_T* m_Next; - }; - union - { - // Allocation out of VmaDeviceMemoryBlock. - BlockAllocation m_BlockAllocation; - // Allocation for an object that has its own private VkDeviceMemory. - DedicatedAllocation m_DedicatedAllocation; - }; - - VkDeviceSize m_Alignment; - VkDeviceSize m_Size; - void* m_pUserData; - char* m_pName; - uint32_t m_MemoryTypeIndex; - uint8_t m_Type; // ALLOCATION_TYPE - uint8_t m_SuballocationType; // VmaSuballocationType - // Reference counter for vmaMapMemory()/vmaUnmapMemory(). - uint8_t m_MapCount; - uint8_t m_Flags; // enum FLAGS -#if VMA_STATS_STRING_ENABLED - uint32_t m_BufferImageUsage; // 0 if unknown. 
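GetBlock() above asserts the type tag before touching the union, which is the usual discriminated-union discipline behind VmaAllocation_T's BlockAllocation/DedicatedAllocation members. A minimal standalone sketch of that pattern follows; TaggedAllocationSketch and its members are hypothetical and far smaller than the real struct.

```
// Hedged illustration of the tagged-union pattern used by VmaAllocation_T above:
// a small type tag selects the active union member, and accessors assert the tag
// before touching the corresponding member (all names below are hypothetical).
#include <cassert>
#include <cstdint>

struct TaggedAllocationSketch
{
    enum Type : uint8_t { TYPE_BLOCK, TYPE_DEDICATED };

    struct Block     { void* block;  uint64_t handle; };
    struct Dedicated { void* memory; void* mapped;    };

    union
    {
        Block     m_Block;
        Dedicated m_Dedicated;
    };
    uint8_t m_Type;

    const Block& AsBlock() const
    {
        assert(m_Type == TYPE_BLOCK);       // mirrors VMA_ASSERT in GetBlock()
        return m_Block;
    }
    const Dedicated& AsDedicated() const
    {
        assert(m_Type == TYPE_DEDICATED);
        return m_Dedicated;
    }
};
```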
-#endif -}; -#endif // _VMA_ALLOCATION_T - -#ifndef _VMA_DEDICATED_ALLOCATION_LIST_ITEM_TRAITS -struct VmaDedicatedAllocationListItemTraits -{ - typedef VmaAllocation_T ItemType; - - static ItemType* GetPrev(const ItemType* item) - { - VMA_HEAVY_ASSERT(item->GetType() == VmaAllocation_T::ALLOCATION_TYPE_DEDICATED); - return item->m_DedicatedAllocation.m_Prev; - } - static ItemType* GetNext(const ItemType* item) - { - VMA_HEAVY_ASSERT(item->GetType() == VmaAllocation_T::ALLOCATION_TYPE_DEDICATED); - return item->m_DedicatedAllocation.m_Next; - } - static ItemType*& AccessPrev(ItemType* item) - { - VMA_HEAVY_ASSERT(item->GetType() == VmaAllocation_T::ALLOCATION_TYPE_DEDICATED); - return item->m_DedicatedAllocation.m_Prev; - } - static ItemType*& AccessNext(ItemType* item) - { - VMA_HEAVY_ASSERT(item->GetType() == VmaAllocation_T::ALLOCATION_TYPE_DEDICATED); - return item->m_DedicatedAllocation.m_Next; - } -}; -#endif // _VMA_DEDICATED_ALLOCATION_LIST_ITEM_TRAITS - -#ifndef _VMA_DEDICATED_ALLOCATION_LIST -/* -Stores linked list of VmaAllocation_T objects. -Thread-safe, synchronized internally. -*/ -class VmaDedicatedAllocationList -{ -public: - VmaDedicatedAllocationList() {} - ~VmaDedicatedAllocationList(); - - void Init(bool useMutex) { m_UseMutex = useMutex; } - bool Validate(); - - void AddDetailedStatistics(VmaDetailedStatistics& inoutStats); - void AddStatistics(VmaStatistics& inoutStats); -#if VMA_STATS_STRING_ENABLED - // Writes JSON array with the list of allocations. - void BuildStatsString(VmaJsonWriter& json); -#endif - - bool IsEmpty(); - void Register(VmaAllocation alloc); - void Unregister(VmaAllocation alloc); - -private: - typedef VmaIntrusiveLinkedList DedicatedAllocationLinkedList; - - bool m_UseMutex = true; - VMA_RW_MUTEX m_Mutex; - DedicatedAllocationLinkedList m_AllocationList; -}; - -#ifndef _VMA_DEDICATED_ALLOCATION_LIST_FUNCTIONS - -VmaDedicatedAllocationList::~VmaDedicatedAllocationList() -{ - VMA_HEAVY_ASSERT(Validate()); - - if (!m_AllocationList.IsEmpty()) - { - VMA_ASSERT(false && "Unfreed dedicated allocations found!"); - } -} - -bool VmaDedicatedAllocationList::Validate() -{ - const size_t declaredCount = m_AllocationList.GetCount(); - size_t actualCount = 0; - VmaMutexLockRead lock(m_Mutex, m_UseMutex); - for (VmaAllocation alloc = m_AllocationList.Front(); - alloc != VMA_NULL; alloc = m_AllocationList.GetNext(alloc)) - { - ++actualCount; - } - VMA_VALIDATE(actualCount == declaredCount); - - return true; -} - -void VmaDedicatedAllocationList::AddDetailedStatistics(VmaDetailedStatistics& inoutStats) -{ - for(auto* item = m_AllocationList.Front(); item != nullptr; item = DedicatedAllocationLinkedList::GetNext(item)) - { - const VkDeviceSize size = item->GetSize(); - inoutStats.statistics.blockCount++; - inoutStats.statistics.blockBytes += size; - VmaAddDetailedStatisticsAllocation(inoutStats, item->GetSize()); - } -} - -void VmaDedicatedAllocationList::AddStatistics(VmaStatistics& inoutStats) -{ - VmaMutexLockRead lock(m_Mutex, m_UseMutex); - - const uint32_t allocCount = (uint32_t)m_AllocationList.GetCount(); - inoutStats.blockCount += allocCount; - inoutStats.allocationCount += allocCount; - - for(auto* item = m_AllocationList.Front(); item != nullptr; item = DedicatedAllocationLinkedList::GetNext(item)) - { - const VkDeviceSize size = item->GetSize(); - inoutStats.blockBytes += size; - inoutStats.allocationBytes += size; - } -} - -#if VMA_STATS_STRING_ENABLED -void VmaDedicatedAllocationList::BuildStatsString(VmaJsonWriter& json) -{ - VmaMutexLockRead 
lock(m_Mutex, m_UseMutex); - json.BeginArray(); - for (VmaAllocation alloc = m_AllocationList.Front(); - alloc != VMA_NULL; alloc = m_AllocationList.GetNext(alloc)) - { - json.BeginObject(true); - alloc->PrintParameters(json); - json.EndObject(); - } - json.EndArray(); -} -#endif // VMA_STATS_STRING_ENABLED - -bool VmaDedicatedAllocationList::IsEmpty() -{ - VmaMutexLockRead lock(m_Mutex, m_UseMutex); - return m_AllocationList.IsEmpty(); -} - -void VmaDedicatedAllocationList::Register(VmaAllocation alloc) -{ - VmaMutexLockWrite lock(m_Mutex, m_UseMutex); - m_AllocationList.PushBack(alloc); -} - -void VmaDedicatedAllocationList::Unregister(VmaAllocation alloc) -{ - VmaMutexLockWrite lock(m_Mutex, m_UseMutex); - m_AllocationList.Remove(alloc); -} -#endif // _VMA_DEDICATED_ALLOCATION_LIST_FUNCTIONS -#endif // _VMA_DEDICATED_ALLOCATION_LIST - -#ifndef _VMA_SUBALLOCATION -/* -Represents a region of VmaDeviceMemoryBlock that is either assigned and returned as -allocated memory block or free. -*/ -struct VmaSuballocation -{ - VkDeviceSize offset; - VkDeviceSize size; - void* userData; - VmaSuballocationType type; -}; - -// Comparator for offsets. -struct VmaSuballocationOffsetLess -{ - bool operator()(const VmaSuballocation& lhs, const VmaSuballocation& rhs) const - { - return lhs.offset < rhs.offset; - } -}; - -struct VmaSuballocationOffsetGreater -{ - bool operator()(const VmaSuballocation& lhs, const VmaSuballocation& rhs) const - { - return lhs.offset > rhs.offset; - } -}; - -struct VmaSuballocationItemSizeLess -{ - bool operator()(const VmaSuballocationList::iterator lhs, - const VmaSuballocationList::iterator rhs) const - { - return lhs->size < rhs->size; - } - - bool operator()(const VmaSuballocationList::iterator lhs, - VkDeviceSize rhsSize) const - { - return lhs->size < rhsSize; - } -}; -#endif // _VMA_SUBALLOCATION - -#ifndef _VMA_ALLOCATION_REQUEST -/* -Parameters of planned allocation inside a VmaDeviceMemoryBlock. -item points to a FREE suballocation. -*/ -struct VmaAllocationRequest -{ - VmaAllocHandle allocHandle; - VkDeviceSize size; - VmaSuballocationList::iterator item; - void* customData; - uint64_t algorithmData; - VmaAllocationRequestType type; -}; -#endif // _VMA_ALLOCATION_REQUEST - -#ifndef _VMA_BLOCK_METADATA -/* -Data structure used for bookkeeping of allocations and unused ranges of memory -in a single VkDeviceMemory block. -*/ -class VmaBlockMetadata -{ -public: - // pAllocationCallbacks, if not null, must be owned externally - alive and unchanged for the whole lifetime of this object. - VmaBlockMetadata(const VkAllocationCallbacks* pAllocationCallbacks, - VkDeviceSize bufferImageGranularity, bool isVirtual); - virtual ~VmaBlockMetadata() = default; - - virtual void Init(VkDeviceSize size) { m_Size = size; } - bool IsVirtual() const { return m_IsVirtual; } - VkDeviceSize GetSize() const { return m_Size; } - - // Validates all data structures inside this object. If not valid, returns false. - virtual bool Validate() const = 0; - virtual size_t GetAllocationCount() const = 0; - virtual size_t GetFreeRegionsCount() const = 0; - virtual VkDeviceSize GetSumFreeSize() const = 0; - // Returns true if this block is empty - contains only single free suballocation. 
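VmaSuballocationItemSizeLess above has an extra overload taking a bare VkDeviceSize so that a size-sorted array of free regions can be binary-searched for the first region that is large enough (best-fit). A minimal sketch of that lookup follows, using std::vector and std::lower_bound for brevity where VMA itself uses VmaVector and VmaBinaryFindFirstNotLess; FreeRegionSketch and FindBestFit are hypothetical names.

```
// Hedged sketch of the best-fit lookup enabled by keeping free regions sorted by size.
#include <algorithm>
#include <vector>

struct FreeRegionSketch { VkDeviceSize offset; VkDeviceSize size; };

FreeRegionSketch* FindBestFit(std::vector<FreeRegionSketch>& freeBySize, VkDeviceSize wanted)
{
    // Heterogeneous comparator (region vs. plain size), like the overload above.
    auto it = std::lower_bound(
        freeBySize.begin(), freeBySize.end(), wanted,
        [](const FreeRegionSketch& region, VkDeviceSize size) { return region.size < size; });
    return it != freeBySize.end() ? &*it : nullptr;  // smallest region with size >= wanted
}
```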
- virtual bool IsEmpty() const = 0; - virtual void GetAllocationInfo(VmaAllocHandle allocHandle, VmaVirtualAllocationInfo& outInfo) = 0; - virtual VkDeviceSize GetAllocationOffset(VmaAllocHandle allocHandle) const = 0; - virtual void* GetAllocationUserData(VmaAllocHandle allocHandle) const = 0; - - virtual VmaAllocHandle GetAllocationListBegin() const = 0; - virtual VmaAllocHandle GetNextAllocation(VmaAllocHandle prevAlloc) const = 0; - virtual VkDeviceSize GetNextFreeRegionSize(VmaAllocHandle alloc) const = 0; - - // Shouldn't modify blockCount. - virtual void AddDetailedStatistics(VmaDetailedStatistics& inoutStats) const = 0; - virtual void AddStatistics(VmaStatistics& inoutStats) const = 0; - -#if VMA_STATS_STRING_ENABLED - virtual void PrintDetailedMap(class VmaJsonWriter& json) const = 0; -#endif - - // Tries to find a place for suballocation with given parameters inside this block. - // If succeeded, fills pAllocationRequest and returns true. - // If failed, returns false. - virtual bool CreateAllocationRequest( - VkDeviceSize allocSize, - VkDeviceSize allocAlignment, - bool upperAddress, - VmaSuballocationType allocType, - // Always one of VMA_ALLOCATION_CREATE_STRATEGY_* or VMA_ALLOCATION_INTERNAL_STRATEGY_* flags. - uint32_t strategy, - VmaAllocationRequest* pAllocationRequest) = 0; - - virtual VkResult CheckCorruption(const void* pBlockData) = 0; - - // Makes actual allocation based on request. Request must already be checked and valid. - virtual void Alloc( - const VmaAllocationRequest& request, - VmaSuballocationType type, - void* userData) = 0; - - // Frees suballocation assigned to given memory region. - virtual void Free(VmaAllocHandle allocHandle) = 0; - - // Frees all allocations. - // Careful! Don't call it if there are VmaAllocation objects owned by userData of cleared allocations! - virtual void Clear() = 0; - - virtual void SetAllocationUserData(VmaAllocHandle allocHandle, void* userData) = 0; - virtual void DebugLogAllAllocations() const = 0; - -protected: - const VkAllocationCallbacks* GetAllocationCallbacks() const { return m_pAllocationCallbacks; } - VkDeviceSize GetBufferImageGranularity() const { return m_BufferImageGranularity; } - VkDeviceSize GetDebugMargin() const { return IsVirtual() ? 0 : VMA_DEBUG_MARGIN; } - - void DebugLogAllocation(VkDeviceSize offset, VkDeviceSize size, void* userData) const; -#if VMA_STATS_STRING_ENABLED - // mapRefCount == UINT32_MAX means unspecified. 
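The comments on CreateAllocationRequest and Alloc above define a two-phase protocol: probe first, commit only if the probe succeeded, and eventually Free the returned handle. A hedged sketch of that sequence against any concrete metadata subclass follows; TryAllocSketch is a hypothetical helper, and the strategy flag is just one of the VMA_ALLOCATION_CREATE_STRATEGY_* values the comment mentions.

```
// Hedged sketch of the two-phase protocol documented above.
bool TryAllocSketch(VmaBlockMetadata& metadata,
    VkDeviceSize size, VkDeviceSize alignment, void* userData, VmaAllocHandle& outHandle)
{
    VmaAllocationRequest request = {};
    if (!metadata.CreateAllocationRequest(
            size, alignment,
            false,                                        // upperAddress
            VMA_SUBALLOCATION_TYPE_UNKNOWN,
            VMA_ALLOCATION_CREATE_STRATEGY_MIN_MEMORY_BIT,
            &request))
    {
        return false;                                     // does not fit in this block
    }
    metadata.Alloc(request, VMA_SUBALLOCATION_TYPE_UNKNOWN, userData);
    outHandle = request.allocHandle;
    return true;
}

// ...later, to release the region:
// metadata.Free(outHandle);
```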
- void PrintDetailedMap_Begin(class VmaJsonWriter& json, - VkDeviceSize unusedBytes, - size_t allocationCount, - size_t unusedRangeCount) const; - void PrintDetailedMap_Allocation(class VmaJsonWriter& json, - VkDeviceSize offset, VkDeviceSize size, void* userData) const; - void PrintDetailedMap_UnusedRange(class VmaJsonWriter& json, - VkDeviceSize offset, - VkDeviceSize size) const; - void PrintDetailedMap_End(class VmaJsonWriter& json) const; -#endif - -private: - VkDeviceSize m_Size; - const VkAllocationCallbacks* m_pAllocationCallbacks; - const VkDeviceSize m_BufferImageGranularity; - const bool m_IsVirtual; -}; - -#ifndef _VMA_BLOCK_METADATA_FUNCTIONS -VmaBlockMetadata::VmaBlockMetadata(const VkAllocationCallbacks* pAllocationCallbacks, - VkDeviceSize bufferImageGranularity, bool isVirtual) - : m_Size(0), - m_pAllocationCallbacks(pAllocationCallbacks), - m_BufferImageGranularity(bufferImageGranularity), - m_IsVirtual(isVirtual) {} - -void VmaBlockMetadata::DebugLogAllocation(VkDeviceSize offset, VkDeviceSize size, void* userData) const -{ - if (IsVirtual()) - { - VMA_DEBUG_LOG("UNFREED VIRTUAL ALLOCATION; Offset: %llu; Size: %llu; UserData: %p", offset, size, userData); - } - else - { - VMA_ASSERT(userData != VMA_NULL); - VmaAllocation allocation = reinterpret_cast(userData); - - userData = allocation->GetUserData(); - const char* name = allocation->GetName(); - -#if VMA_STATS_STRING_ENABLED - VMA_DEBUG_LOG("UNFREED ALLOCATION; Offset: %llu; Size: %llu; UserData: %p; Name: %s; Type: %s; Usage: %u", - offset, size, userData, name ? name : "vma_empty", - VMA_SUBALLOCATION_TYPE_NAMES[allocation->GetSuballocationType()], - allocation->GetBufferImageUsage()); -#else - VMA_DEBUG_LOG("UNFREED ALLOCATION; Offset: %llu; Size: %llu; UserData: %p; Name: %s; Type: %u", - offset, size, userData, name ? 
name : "vma_empty", - (uint32_t)allocation->GetSuballocationType()); -#endif // VMA_STATS_STRING_ENABLED - } - -} - -#if VMA_STATS_STRING_ENABLED -void VmaBlockMetadata::PrintDetailedMap_Begin(class VmaJsonWriter& json, - VkDeviceSize unusedBytes, size_t allocationCount, size_t unusedRangeCount) const -{ - json.WriteString("TotalBytes"); - json.WriteNumber(GetSize()); - - json.WriteString("UnusedBytes"); - json.WriteSize(unusedBytes); - - json.WriteString("Allocations"); - json.WriteSize(allocationCount); - - json.WriteString("UnusedRanges"); - json.WriteSize(unusedRangeCount); - - json.WriteString("Suballocations"); - json.BeginArray(); -} - -void VmaBlockMetadata::PrintDetailedMap_Allocation(class VmaJsonWriter& json, - VkDeviceSize offset, VkDeviceSize size, void* userData) const -{ - json.BeginObject(true); - - json.WriteString("Offset"); - json.WriteNumber(offset); - - if (IsVirtual()) - { - json.WriteString("Size"); - json.WriteNumber(size); - if (userData) - { - json.WriteString("CustomData"); - json.BeginString(); - json.ContinueString_Pointer(userData); - json.EndString(); - } - } - else - { - ((VmaAllocation)userData)->PrintParameters(json); - } - - json.EndObject(); -} - -void VmaBlockMetadata::PrintDetailedMap_UnusedRange(class VmaJsonWriter& json, - VkDeviceSize offset, VkDeviceSize size) const -{ - json.BeginObject(true); - - json.WriteString("Offset"); - json.WriteNumber(offset); - - json.WriteString("Type"); - json.WriteString(VMA_SUBALLOCATION_TYPE_NAMES[VMA_SUBALLOCATION_TYPE_FREE]); - - json.WriteString("Size"); - json.WriteNumber(size); - - json.EndObject(); -} - -void VmaBlockMetadata::PrintDetailedMap_End(class VmaJsonWriter& json) const -{ - json.EndArray(); -} -#endif // VMA_STATS_STRING_ENABLED -#endif // _VMA_BLOCK_METADATA_FUNCTIONS -#endif // _VMA_BLOCK_METADATA - -#ifndef _VMA_BLOCK_BUFFER_IMAGE_GRANULARITY -// Before deleting object of this class remember to call 'Destroy()' -class VmaBlockBufferImageGranularity final -{ -public: - struct ValidationContext - { - const VkAllocationCallbacks* allocCallbacks; - uint16_t* pageAllocs; - }; - - VmaBlockBufferImageGranularity(VkDeviceSize bufferImageGranularity); - ~VmaBlockBufferImageGranularity(); - - bool IsEnabled() const { return m_BufferImageGranularity > MAX_LOW_BUFFER_IMAGE_GRANULARITY; } - - void Init(const VkAllocationCallbacks* pAllocationCallbacks, VkDeviceSize size); - // Before destroying object you must call free it's memory - void Destroy(const VkAllocationCallbacks* pAllocationCallbacks); - - void RoundupAllocRequest(VmaSuballocationType allocType, - VkDeviceSize& inOutAllocSize, - VkDeviceSize& inOutAllocAlignment) const; - - bool CheckConflictAndAlignUp(VkDeviceSize& inOutAllocOffset, - VkDeviceSize allocSize, - VkDeviceSize blockOffset, - VkDeviceSize blockSize, - VmaSuballocationType allocType) const; - - void AllocPages(uint8_t allocType, VkDeviceSize offset, VkDeviceSize size); - void FreePages(VkDeviceSize offset, VkDeviceSize size); - void Clear(); - - ValidationContext StartValidation(const VkAllocationCallbacks* pAllocationCallbacks, - bool isVirutal) const; - bool Validate(ValidationContext& ctx, VkDeviceSize offset, VkDeviceSize size) const; - bool FinishValidation(ValidationContext& ctx) const; - -private: - static const uint16_t MAX_LOW_BUFFER_IMAGE_GRANULARITY = 256; - - struct RegionInfo - { - uint8_t allocType; - uint16_t allocCount; - }; - - VkDeviceSize m_BufferImageGranularity; - uint32_t m_RegionCount; - RegionInfo* m_RegionInfo; - - uint32_t GetStartPage(VkDeviceSize offset) 
const { return OffsetToPageIndex(offset & ~(m_BufferImageGranularity - 1)); } - uint32_t GetEndPage(VkDeviceSize offset, VkDeviceSize size) const { return OffsetToPageIndex((offset + size - 1) & ~(m_BufferImageGranularity - 1)); } - - uint32_t OffsetToPageIndex(VkDeviceSize offset) const; - void AllocPage(RegionInfo& page, uint8_t allocType); -}; - -#ifndef _VMA_BLOCK_BUFFER_IMAGE_GRANULARITY_FUNCTIONS -VmaBlockBufferImageGranularity::VmaBlockBufferImageGranularity(VkDeviceSize bufferImageGranularity) - : m_BufferImageGranularity(bufferImageGranularity), - m_RegionCount(0), - m_RegionInfo(VMA_NULL) {} - -VmaBlockBufferImageGranularity::~VmaBlockBufferImageGranularity() -{ - VMA_ASSERT(m_RegionInfo == VMA_NULL && "Free not called before destroying object!"); -} - -void VmaBlockBufferImageGranularity::Init(const VkAllocationCallbacks* pAllocationCallbacks, VkDeviceSize size) -{ - if (IsEnabled()) - { - m_RegionCount = static_cast(VmaDivideRoundingUp(size, m_BufferImageGranularity)); - m_RegionInfo = vma_new_array(pAllocationCallbacks, RegionInfo, m_RegionCount); - memset(m_RegionInfo, 0, m_RegionCount * sizeof(RegionInfo)); - } -} - -void VmaBlockBufferImageGranularity::Destroy(const VkAllocationCallbacks* pAllocationCallbacks) -{ - if (m_RegionInfo) - { - vma_delete_array(pAllocationCallbacks, m_RegionInfo, m_RegionCount); - m_RegionInfo = VMA_NULL; - } -} - -void VmaBlockBufferImageGranularity::RoundupAllocRequest(VmaSuballocationType allocType, - VkDeviceSize& inOutAllocSize, - VkDeviceSize& inOutAllocAlignment) const -{ - if (m_BufferImageGranularity > 1 && - m_BufferImageGranularity <= MAX_LOW_BUFFER_IMAGE_GRANULARITY) - { - if (allocType == VMA_SUBALLOCATION_TYPE_UNKNOWN || - allocType == VMA_SUBALLOCATION_TYPE_IMAGE_UNKNOWN || - allocType == VMA_SUBALLOCATION_TYPE_IMAGE_OPTIMAL) - { - inOutAllocAlignment = VMA_MAX(inOutAllocAlignment, m_BufferImageGranularity); - inOutAllocSize = VmaAlignUp(inOutAllocSize, m_BufferImageGranularity); - } - } -} - -bool VmaBlockBufferImageGranularity::CheckConflictAndAlignUp(VkDeviceSize& inOutAllocOffset, - VkDeviceSize allocSize, - VkDeviceSize blockOffset, - VkDeviceSize blockSize, - VmaSuballocationType allocType) const -{ - if (IsEnabled()) - { - uint32_t startPage = GetStartPage(inOutAllocOffset); - if (m_RegionInfo[startPage].allocCount > 0 && - VmaIsBufferImageGranularityConflict(static_cast(m_RegionInfo[startPage].allocType), allocType)) - { - inOutAllocOffset = VmaAlignUp(inOutAllocOffset, m_BufferImageGranularity); - if (blockSize < allocSize + inOutAllocOffset - blockOffset) - return true; - ++startPage; - } - uint32_t endPage = GetEndPage(inOutAllocOffset, allocSize); - if (endPage != startPage && - m_RegionInfo[endPage].allocCount > 0 && - VmaIsBufferImageGranularityConflict(static_cast(m_RegionInfo[endPage].allocType), allocType)) - { - return true; - } - } - return false; -} - -void VmaBlockBufferImageGranularity::AllocPages(uint8_t allocType, VkDeviceSize offset, VkDeviceSize size) -{ - if (IsEnabled()) - { - uint32_t startPage = GetStartPage(offset); - AllocPage(m_RegionInfo[startPage], allocType); - - uint32_t endPage = GetEndPage(offset, size); - if (startPage != endPage) - AllocPage(m_RegionInfo[endPage], allocType); - } -} - -void VmaBlockBufferImageGranularity::FreePages(VkDeviceSize offset, VkDeviceSize size) -{ - if (IsEnabled()) - { - uint32_t startPage = GetStartPage(offset); - --m_RegionInfo[startPage].allocCount; - if (m_RegionInfo[startPage].allocCount == 0) - m_RegionInfo[startPage].allocType = VMA_SUBALLOCATION_TYPE_FREE; 
- uint32_t endPage = GetEndPage(offset, size); - if (startPage != endPage) - { - --m_RegionInfo[endPage].allocCount; - if (m_RegionInfo[endPage].allocCount == 0) - m_RegionInfo[endPage].allocType = VMA_SUBALLOCATION_TYPE_FREE; - } - } -} - -void VmaBlockBufferImageGranularity::Clear() -{ - if (m_RegionInfo) - memset(m_RegionInfo, 0, m_RegionCount * sizeof(RegionInfo)); -} - -VmaBlockBufferImageGranularity::ValidationContext VmaBlockBufferImageGranularity::StartValidation( - const VkAllocationCallbacks* pAllocationCallbacks, bool isVirutal) const -{ - ValidationContext ctx{ pAllocationCallbacks, VMA_NULL }; - if (!isVirutal && IsEnabled()) - { - ctx.pageAllocs = vma_new_array(pAllocationCallbacks, uint16_t, m_RegionCount); - memset(ctx.pageAllocs, 0, m_RegionCount * sizeof(uint16_t)); - } - return ctx; -} - -bool VmaBlockBufferImageGranularity::Validate(ValidationContext& ctx, - VkDeviceSize offset, VkDeviceSize size) const -{ - if (IsEnabled()) - { - uint32_t start = GetStartPage(offset); - ++ctx.pageAllocs[start]; - VMA_VALIDATE(m_RegionInfo[start].allocCount > 0); - - uint32_t end = GetEndPage(offset, size); - if (start != end) - { - ++ctx.pageAllocs[end]; - VMA_VALIDATE(m_RegionInfo[end].allocCount > 0); - } - } - return true; -} - -bool VmaBlockBufferImageGranularity::FinishValidation(ValidationContext& ctx) const -{ - // Check proper page structure - if (IsEnabled()) - { - VMA_ASSERT(ctx.pageAllocs != VMA_NULL && "Validation context not initialized!"); - - for (uint32_t page = 0; page < m_RegionCount; ++page) - { - VMA_VALIDATE(ctx.pageAllocs[page] == m_RegionInfo[page].allocCount); - } - vma_delete_array(ctx.allocCallbacks, ctx.pageAllocs, m_RegionCount); - ctx.pageAllocs = VMA_NULL; - } - return true; -} - -uint32_t VmaBlockBufferImageGranularity::OffsetToPageIndex(VkDeviceSize offset) const -{ - return static_cast(offset >> VMA_BITSCAN_MSB(m_BufferImageGranularity)); -} - -void VmaBlockBufferImageGranularity::AllocPage(RegionInfo& page, uint8_t allocType) -{ - // When current alloc type is free then it can be overriden by new type - if (page.allocCount == 0 || (page.allocCount > 0 && page.allocType == VMA_SUBALLOCATION_TYPE_FREE)) - page.allocType = allocType; - - ++page.allocCount; -} -#endif // _VMA_BLOCK_BUFFER_IMAGE_GRANULARITY_FUNCTIONS -#endif // _VMA_BLOCK_BUFFER_IMAGE_GRANULARITY - -#if 0 -#ifndef _VMA_BLOCK_METADATA_GENERIC -class VmaBlockMetadata_Generic : public VmaBlockMetadata -{ - friend class VmaDefragmentationAlgorithm_Generic; - friend class VmaDefragmentationAlgorithm_Fast; - VMA_CLASS_NO_COPY(VmaBlockMetadata_Generic) -public: - VmaBlockMetadata_Generic(const VkAllocationCallbacks* pAllocationCallbacks, - VkDeviceSize bufferImageGranularity, bool isVirtual); - virtual ~VmaBlockMetadata_Generic() = default; - - size_t GetAllocationCount() const override { return m_Suballocations.size() - m_FreeCount; } - VkDeviceSize GetSumFreeSize() const override { return m_SumFreeSize; } - bool IsEmpty() const override { return (m_Suballocations.size() == 1) && (m_FreeCount == 1); } - void Free(VmaAllocHandle allocHandle) override { FreeSuballocation(FindAtOffset((VkDeviceSize)allocHandle - 1)); } - VkDeviceSize GetAllocationOffset(VmaAllocHandle allocHandle) const override { return (VkDeviceSize)allocHandle - 1; }; - - void Init(VkDeviceSize size) override; - bool Validate() const override; - - void AddDetailedStatistics(VmaDetailedStatistics& inoutStats) const override; - void AddStatistics(VmaStatistics& inoutStats) const override; - -#if VMA_STATS_STRING_ENABLED - void 
PrintDetailedMap(class VmaJsonWriter& json, uint32_t mapRefCount) const override; -#endif - - bool CreateAllocationRequest( - VkDeviceSize allocSize, - VkDeviceSize allocAlignment, - bool upperAddress, - VmaSuballocationType allocType, - uint32_t strategy, - VmaAllocationRequest* pAllocationRequest) override; - - VkResult CheckCorruption(const void* pBlockData) override; - - void Alloc( - const VmaAllocationRequest& request, - VmaSuballocationType type, - void* userData) override; - - void GetAllocationInfo(VmaAllocHandle allocHandle, VmaVirtualAllocationInfo& outInfo) override; - void* GetAllocationUserData(VmaAllocHandle allocHandle) const override; - VmaAllocHandle GetAllocationListBegin() const override; - VmaAllocHandle GetNextAllocation(VmaAllocHandle prevAlloc) const override; - void Clear() override; - void SetAllocationUserData(VmaAllocHandle allocHandle, void* userData) override; - void DebugLogAllAllocations() const override; - -private: - uint32_t m_FreeCount; - VkDeviceSize m_SumFreeSize; - VmaSuballocationList m_Suballocations; - // Suballocations that are free. Sorted by size, ascending. - VmaVector> m_FreeSuballocationsBySize; - - VkDeviceSize AlignAllocationSize(VkDeviceSize size) const { return IsVirtual() ? size : VmaAlignUp(size, (VkDeviceSize)16); } - - VmaSuballocationList::iterator FindAtOffset(VkDeviceSize offset) const; - bool ValidateFreeSuballocationList() const; - - // Checks if requested suballocation with given parameters can be placed in given pFreeSuballocItem. - // If yes, fills pOffset and returns true. If no, returns false. - bool CheckAllocation( - VkDeviceSize allocSize, - VkDeviceSize allocAlignment, - VmaSuballocationType allocType, - VmaSuballocationList::const_iterator suballocItem, - VmaAllocHandle* pAllocHandle) const; - - // Given free suballocation, it merges it with following one, which must also be free. - void MergeFreeWithNext(VmaSuballocationList::iterator item); - // Releases given suballocation, making it free. - // Merges it with adjacent free suballocations if applicable. - // Returns iterator to new free suballocation at this place. - VmaSuballocationList::iterator FreeSuballocation(VmaSuballocationList::iterator suballocItem); - // Given free suballocation, it inserts it into sorted list of - // m_FreeSuballocationsBySize if it is suitable. - void RegisterFreeSuballocation(VmaSuballocationList::iterator item); - // Given free suballocation, it removes it from sorted list of - // m_FreeSuballocationsBySize if it is suitable. 
- void UnregisterFreeSuballocation(VmaSuballocationList::iterator item); -}; - -#ifndef _VMA_BLOCK_METADATA_GENERIC_FUNCTIONS -VmaBlockMetadata_Generic::VmaBlockMetadata_Generic(const VkAllocationCallbacks* pAllocationCallbacks, - VkDeviceSize bufferImageGranularity, bool isVirtual) - : VmaBlockMetadata(pAllocationCallbacks, bufferImageGranularity, isVirtual), - m_FreeCount(0), - m_SumFreeSize(0), - m_Suballocations(VmaStlAllocator(pAllocationCallbacks)), - m_FreeSuballocationsBySize(VmaStlAllocator(pAllocationCallbacks)) {} - -void VmaBlockMetadata_Generic::Init(VkDeviceSize size) -{ - VmaBlockMetadata::Init(size); - - m_FreeCount = 1; - m_SumFreeSize = size; - - VmaSuballocation suballoc = {}; - suballoc.offset = 0; - suballoc.size = size; - suballoc.type = VMA_SUBALLOCATION_TYPE_FREE; - - m_Suballocations.push_back(suballoc); - m_FreeSuballocationsBySize.push_back(m_Suballocations.begin()); -} - -bool VmaBlockMetadata_Generic::Validate() const -{ - VMA_VALIDATE(!m_Suballocations.empty()); - - // Expected offset of new suballocation as calculated from previous ones. - VkDeviceSize calculatedOffset = 0; - // Expected number of free suballocations as calculated from traversing their list. - uint32_t calculatedFreeCount = 0; - // Expected sum size of free suballocations as calculated from traversing their list. - VkDeviceSize calculatedSumFreeSize = 0; - // Expected number of free suballocations that should be registered in - // m_FreeSuballocationsBySize calculated from traversing their list. - size_t freeSuballocationsToRegister = 0; - // True if previous visited suballocation was free. - bool prevFree = false; - - const VkDeviceSize debugMargin = GetDebugMargin(); - - for (const auto& subAlloc : m_Suballocations) - { - // Actual offset of this suballocation doesn't match expected one. - VMA_VALIDATE(subAlloc.offset == calculatedOffset); - - const bool currFree = (subAlloc.type == VMA_SUBALLOCATION_TYPE_FREE); - // Two adjacent free suballocations are invalid. They should be merged. - VMA_VALIDATE(!prevFree || !currFree); - - VmaAllocation alloc = (VmaAllocation)subAlloc.userData; - if (!IsVirtual()) - { - VMA_VALIDATE(currFree == (alloc == VK_NULL_HANDLE)); - } - - if (currFree) - { - calculatedSumFreeSize += subAlloc.size; - ++calculatedFreeCount; - ++freeSuballocationsToRegister; - - // Margin required between allocations - every free space must be at least that large. - VMA_VALIDATE(subAlloc.size >= debugMargin); - } - else - { - if (!IsVirtual()) - { - VMA_VALIDATE((VkDeviceSize)alloc->GetAllocHandle() == subAlloc.offset + 1); - VMA_VALIDATE(alloc->GetSize() == subAlloc.size); - } - - // Margin required between allocations - previous allocation must be free. - VMA_VALIDATE(debugMargin == 0 || prevFree); - } - - calculatedOffset += subAlloc.size; - prevFree = currFree; - } - - // Number of free suballocations registered in m_FreeSuballocationsBySize doesn't - // match expected one. - VMA_VALIDATE(m_FreeSuballocationsBySize.size() == freeSuballocationsToRegister); - - VkDeviceSize lastSize = 0; - for (size_t i = 0; i < m_FreeSuballocationsBySize.size(); ++i) - { - VmaSuballocationList::iterator suballocItem = m_FreeSuballocationsBySize[i]; - - // Only free suballocations can be registered in m_FreeSuballocationsBySize. - VMA_VALIDATE(suballocItem->type == VMA_SUBALLOCATION_TYPE_FREE); - // They must be sorted by size ascending. - VMA_VALIDATE(suballocItem->size >= lastSize); - - lastSize = suballocItem->size; - } - - // Check if totals match calculated values. 
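// The walk above recomputed offsets, the number of free suballocations and the
// sum of their sizes from scratch; the checks below verify that these agree
// with the cached members (m_FreeCount, m_SumFreeSize) and that the
// suballocations exactly cover the block (calculatedOffset == GetSize()).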
- VMA_VALIDATE(ValidateFreeSuballocationList()); - VMA_VALIDATE(calculatedOffset == GetSize()); - VMA_VALIDATE(calculatedSumFreeSize == m_SumFreeSize); - VMA_VALIDATE(calculatedFreeCount == m_FreeCount); - - return true; -} - -void VmaBlockMetadata_Generic::AddDetailedStatistics(VmaDetailedStatistics& inoutStats) const -{ - const uint32_t rangeCount = (uint32_t)m_Suballocations.size(); - inoutStats.statistics.blockCount++; - inoutStats.statistics.blockBytes += GetSize(); - - for (const auto& suballoc : m_Suballocations) - { - if (suballoc.type != VMA_SUBALLOCATION_TYPE_FREE) - VmaAddDetailedStatisticsAllocation(inoutStats, suballoc.size); - else - VmaAddDetailedStatisticsUnusedRange(inoutStats, suballoc.size); - } -} - -void VmaBlockMetadata_Generic::AddStatistics(VmaStatistics& inoutStats) const -{ - inoutStats.blockCount++; - inoutStats.allocationCount += (uint32_t)m_Suballocations.size() - m_FreeCount; - inoutStats.blockBytes += GetSize(); - inoutStats.allocationBytes += GetSize() - m_SumFreeSize; -} - -#if VMA_STATS_STRING_ENABLED -void VmaBlockMetadata_Generic::PrintDetailedMap(class VmaJsonWriter& json, uint32_t mapRefCount) const -{ - PrintDetailedMap_Begin(json, - m_SumFreeSize, // unusedBytes - m_Suballocations.size() - (size_t)m_FreeCount, // allocationCount - m_FreeCount, // unusedRangeCount - mapRefCount); - - for (const auto& suballoc : m_Suballocations) - { - if (suballoc.type == VMA_SUBALLOCATION_TYPE_FREE) - { - PrintDetailedMap_UnusedRange(json, suballoc.offset, suballoc.size); - } - else - { - PrintDetailedMap_Allocation(json, suballoc.offset, suballoc.size, suballoc.userData); - } - } - - PrintDetailedMap_End(json); -} -#endif // VMA_STATS_STRING_ENABLED - -bool VmaBlockMetadata_Generic::CreateAllocationRequest( - VkDeviceSize allocSize, - VkDeviceSize allocAlignment, - bool upperAddress, - VmaSuballocationType allocType, - uint32_t strategy, - VmaAllocationRequest* pAllocationRequest) -{ - VMA_ASSERT(allocSize > 0); - VMA_ASSERT(!upperAddress); - VMA_ASSERT(allocType != VMA_SUBALLOCATION_TYPE_FREE); - VMA_ASSERT(pAllocationRequest != VMA_NULL); - VMA_HEAVY_ASSERT(Validate()); - - allocSize = AlignAllocationSize(allocSize); - - pAllocationRequest->type = VmaAllocationRequestType::Normal; - pAllocationRequest->size = allocSize; - - const VkDeviceSize debugMargin = GetDebugMargin(); - - // There is not enough total free space in this block to fulfill the request: Early return. - if (m_SumFreeSize < allocSize + debugMargin) - { - return false; - } - - // New algorithm, efficiently searching freeSuballocationsBySize. - const size_t freeSuballocCount = m_FreeSuballocationsBySize.size(); - if (freeSuballocCount > 0) - { - if (strategy == 0 || - strategy == VMA_ALLOCATION_CREATE_STRATEGY_MIN_MEMORY_BIT) - { - // Find first free suballocation with size not less than allocSize + debugMargin. 
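// m_FreeSuballocationsBySize is sorted by size ascending, so this is
// effectively a lower_bound on (allocSize + debugMargin): the first candidate
// that is large enough by size. The loop below then walks forward through the
// remaining, equally large or larger, candidates until one also passes the
// alignment and buffer-image-granularity checks in CheckAllocation().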
- VmaSuballocationList::iterator* const it = VmaBinaryFindFirstNotLess( - m_FreeSuballocationsBySize.data(), - m_FreeSuballocationsBySize.data() + freeSuballocCount, - allocSize + debugMargin, - VmaSuballocationItemSizeLess()); - size_t index = it - m_FreeSuballocationsBySize.data(); - for (; index < freeSuballocCount; ++index) - { - if (CheckAllocation( - allocSize, - allocAlignment, - allocType, - m_FreeSuballocationsBySize[index], - &pAllocationRequest->allocHandle)) - { - pAllocationRequest->item = m_FreeSuballocationsBySize[index]; - return true; - } - } - } - else if (strategy == VMA_ALLOCATION_INTERNAL_STRATEGY_MIN_OFFSET) - { - for (VmaSuballocationList::iterator it = m_Suballocations.begin(); - it != m_Suballocations.end(); - ++it) - { - if (it->type == VMA_SUBALLOCATION_TYPE_FREE && CheckAllocation( - allocSize, - allocAlignment, - allocType, - it, - &pAllocationRequest->allocHandle)) - { - pAllocationRequest->item = it; - return true; - } - } - } - else - { - VMA_ASSERT(strategy & (VMA_ALLOCATION_CREATE_STRATEGY_MIN_TIME_BIT | VMA_ALLOCATION_CREATE_STRATEGY_MIN_OFFSET_BIT )); - // Search staring from biggest suballocations. - for (size_t index = freeSuballocCount; index--; ) - { - if (CheckAllocation( - allocSize, - allocAlignment, - allocType, - m_FreeSuballocationsBySize[index], - &pAllocationRequest->allocHandle)) - { - pAllocationRequest->item = m_FreeSuballocationsBySize[index]; - return true; - } - } - } - } - - return false; -} - -VkResult VmaBlockMetadata_Generic::CheckCorruption(const void* pBlockData) -{ - for (auto& suballoc : m_Suballocations) - { - if (suballoc.type != VMA_SUBALLOCATION_TYPE_FREE) - { - if (!VmaValidateMagicValue(pBlockData, suballoc.offset + suballoc.size)) - { - VMA_ASSERT(0 && "MEMORY CORRUPTION DETECTED AFTER VALIDATED ALLOCATION!"); - return VK_ERROR_UNKNOWN_COPY; - } - } - } - - return VK_SUCCESS; -} - -void VmaBlockMetadata_Generic::Alloc( - const VmaAllocationRequest& request, - VmaSuballocationType type, - void* userData) -{ - VMA_ASSERT(request.type == VmaAllocationRequestType::Normal); - VMA_ASSERT(request.item != m_Suballocations.end()); - VmaSuballocation& suballoc = *request.item; - // Given suballocation is a free block. - VMA_ASSERT(suballoc.type == VMA_SUBALLOCATION_TYPE_FREE); - - // Given offset is inside this suballocation. - VMA_ASSERT((VkDeviceSize)request.allocHandle - 1 >= suballoc.offset); - const VkDeviceSize paddingBegin = (VkDeviceSize)request.allocHandle - suballoc.offset - 1; - VMA_ASSERT(suballoc.size >= paddingBegin + request.size); - const VkDeviceSize paddingEnd = suballoc.size - paddingBegin - request.size; - - // Unregister this free suballocation from m_FreeSuballocationsBySize and update - // it to become used. - UnregisterFreeSuballocation(request.item); - - suballoc.offset = (VkDeviceSize)request.allocHandle - 1; - suballoc.size = request.size; - suballoc.type = type; - suballoc.userData = userData; - - // If there are any free bytes remaining at the end, insert new free suballocation after current one. 
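// The chosen free suballocation is split into up to three pieces:
// [paddingBegin][requested allocation][paddingEnd]. paddingBegin comes from
// the debug-margin/alignment adjustments made in CheckAllocation(); whatever
// is left over on either side is re-inserted below as new free suballocations
// so that no space is lost.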
- if (paddingEnd) - { - VmaSuballocation paddingSuballoc = {}; - paddingSuballoc.offset = suballoc.offset + suballoc.size; - paddingSuballoc.size = paddingEnd; - paddingSuballoc.type = VMA_SUBALLOCATION_TYPE_FREE; - VmaSuballocationList::iterator next = request.item; - ++next; - const VmaSuballocationList::iterator paddingEndItem = - m_Suballocations.insert(next, paddingSuballoc); - RegisterFreeSuballocation(paddingEndItem); - } - - // If there are any free bytes remaining at the beginning, insert new free suballocation before current one. - if (paddingBegin) - { - VmaSuballocation paddingSuballoc = {}; - paddingSuballoc.offset = suballoc.offset - paddingBegin; - paddingSuballoc.size = paddingBegin; - paddingSuballoc.type = VMA_SUBALLOCATION_TYPE_FREE; - const VmaSuballocationList::iterator paddingBeginItem = - m_Suballocations.insert(request.item, paddingSuballoc); - RegisterFreeSuballocation(paddingBeginItem); - } - - // Update totals. - m_FreeCount = m_FreeCount - 1; - if (paddingBegin > 0) - { - ++m_FreeCount; - } - if (paddingEnd > 0) - { - ++m_FreeCount; - } - m_SumFreeSize -= request.size; -} - -void VmaBlockMetadata_Generic::GetAllocationInfo(VmaAllocHandle allocHandle, VmaVirtualAllocationInfo& outInfo) -{ - outInfo.offset = (VkDeviceSize)allocHandle - 1; - const VmaSuballocation& suballoc = *FindAtOffset(outInfo.offset); - outInfo.size = suballoc.size; - outInfo.pUserData = suballoc.userData; -} - -void* VmaBlockMetadata_Generic::GetAllocationUserData(VmaAllocHandle allocHandle) const -{ - return FindAtOffset((VkDeviceSize)allocHandle - 1)->userData; -} - -VmaAllocHandle VmaBlockMetadata_Generic::GetAllocationListBegin() const -{ - if (IsEmpty()) - return VK_NULL_HANDLE; - - for (const auto& suballoc : m_Suballocations) - { - if (suballoc.type != VMA_SUBALLOCATION_TYPE_FREE) - return (VmaAllocHandle)(suballoc.offset + 1); - } - VMA_ASSERT(false && "Should contain at least 1 allocation!"); - return VK_NULL_HANDLE; -} - -VmaAllocHandle VmaBlockMetadata_Generic::GetNextAllocation(VmaAllocHandle prevAlloc) const -{ - VmaSuballocationList::const_iterator prev = FindAtOffset((VkDeviceSize)prevAlloc - 1); - - for (VmaSuballocationList::const_iterator it = ++prev; it != m_Suballocations.end(); ++it) - { - if (it->type != VMA_SUBALLOCATION_TYPE_FREE) - return (VmaAllocHandle)(it->offset + 1); - } - return VK_NULL_HANDLE; -} - -void VmaBlockMetadata_Generic::Clear() -{ - const VkDeviceSize size = GetSize(); - - VMA_ASSERT(IsVirtual()); - m_FreeCount = 1; - m_SumFreeSize = size; - m_Suballocations.clear(); - m_FreeSuballocationsBySize.clear(); - - VmaSuballocation suballoc = {}; - suballoc.offset = 0; - suballoc.size = size; - suballoc.type = VMA_SUBALLOCATION_TYPE_FREE; - m_Suballocations.push_back(suballoc); - - m_FreeSuballocationsBySize.push_back(m_Suballocations.begin()); -} - -void VmaBlockMetadata_Generic::SetAllocationUserData(VmaAllocHandle allocHandle, void* userData) -{ - VmaSuballocation& suballoc = *FindAtOffset((VkDeviceSize)allocHandle - 1); - suballoc.userData = userData; -} - -void VmaBlockMetadata_Generic::DebugLogAllAllocations() const -{ - for (const auto& suballoc : m_Suballocations) - { - if (suballoc.type != VMA_SUBALLOCATION_TYPE_FREE) - DebugLogAllocation(suballoc.offset, suballoc.size, suballoc.userData); - } -} - -VmaSuballocationList::iterator VmaBlockMetadata_Generic::FindAtOffset(VkDeviceSize offset) const -{ - VMA_HEAVY_ASSERT(!m_Suballocations.empty()); - const VkDeviceSize last = m_Suballocations.rbegin()->offset; - if (last == offset) - return 
m_Suballocations.rbegin().drop_const(); - const VkDeviceSize first = m_Suballocations.begin()->offset; - if (first == offset) - return m_Suballocations.begin().drop_const(); - - const size_t suballocCount = m_Suballocations.size(); - const VkDeviceSize step = (last - first + m_Suballocations.begin()->size) / suballocCount; - auto findSuballocation = [&](auto begin, auto end) -> VmaSuballocationList::iterator - { - for (auto suballocItem = begin; - suballocItem != end; - ++suballocItem) - { - if (suballocItem->offset == offset) - return suballocItem.drop_const(); - } - VMA_ASSERT(false && "Not found!"); - return m_Suballocations.end().drop_const(); - }; - // If requested offset is closer to the end of range, search from the end - if (offset - first > suballocCount * step / 2) - { - return findSuballocation(m_Suballocations.rbegin(), m_Suballocations.rend()); - } - return findSuballocation(m_Suballocations.begin(), m_Suballocations.end()); -} - -bool VmaBlockMetadata_Generic::ValidateFreeSuballocationList() const -{ - VkDeviceSize lastSize = 0; - for (size_t i = 0, count = m_FreeSuballocationsBySize.size(); i < count; ++i) - { - const VmaSuballocationList::iterator it = m_FreeSuballocationsBySize[i]; - - VMA_VALIDATE(it->type == VMA_SUBALLOCATION_TYPE_FREE); - VMA_VALIDATE(it->size >= lastSize); - lastSize = it->size; - } - return true; -} - -bool VmaBlockMetadata_Generic::CheckAllocation( - VkDeviceSize allocSize, - VkDeviceSize allocAlignment, - VmaSuballocationType allocType, - VmaSuballocationList::const_iterator suballocItem, - VmaAllocHandle* pAllocHandle) const -{ - VMA_ASSERT(allocSize > 0); - VMA_ASSERT(allocType != VMA_SUBALLOCATION_TYPE_FREE); - VMA_ASSERT(suballocItem != m_Suballocations.cend()); - VMA_ASSERT(pAllocHandle != VMA_NULL); - - const VkDeviceSize debugMargin = GetDebugMargin(); - const VkDeviceSize bufferImageGranularity = GetBufferImageGranularity(); - - const VmaSuballocation& suballoc = *suballocItem; - VMA_ASSERT(suballoc.type == VMA_SUBALLOCATION_TYPE_FREE); - - // Size of this suballocation is too small for this request: Early return. - if (suballoc.size < allocSize) - { - return false; - } - - // Start from offset equal to beginning of this suballocation. - VkDeviceSize offset = suballoc.offset + (suballocItem == m_Suballocations.cbegin() ? 0 : GetDebugMargin()); - - // Apply debugMargin from the end of previous alloc. - if (debugMargin > 0) - { - offset += debugMargin; - } - - // Apply alignment. - offset = VmaAlignUp(offset, allocAlignment); - - // Check previous suballocations for BufferImageGranularity conflicts. - // Make bigger alignment if necessary. - if (bufferImageGranularity > 1 && bufferImageGranularity != allocAlignment) - { - bool bufferImageGranularityConflict = false; - VmaSuballocationList::const_iterator prevSuballocItem = suballocItem; - while (prevSuballocItem != m_Suballocations.cbegin()) - { - --prevSuballocItem; - const VmaSuballocation& prevSuballoc = *prevSuballocItem; - if (VmaBlocksOnSamePage(prevSuballoc.offset, prevSuballoc.size, offset, bufferImageGranularity)) - { - if (VmaIsBufferImageGranularityConflict(prevSuballoc.type, allocType)) - { - bufferImageGranularityConflict = true; - break; - } - } - else - // Already on previous page. - break; - } - if (bufferImageGranularityConflict) - { - offset = VmaAlignUp(offset, bufferImageGranularity); - } - } - - // Calculate padding at the beginning based on current offset. 
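// At this point 'offset' has been advanced past the debug margin, rounded up
// for alignment and possibly bumped further to avoid a buffer-image
// granularity conflict, so its distance from suballoc.offset is the unused
// gap at the front of this free suballocation.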
- const VkDeviceSize paddingBegin = offset - suballoc.offset; - - // Fail if requested size plus margin after is bigger than size of this suballocation. - if (paddingBegin + allocSize + debugMargin > suballoc.size) - { - return false; - } - - // Check next suballocations for BufferImageGranularity conflicts. - // If conflict exists, allocation cannot be made here. - if (allocSize % bufferImageGranularity || offset % bufferImageGranularity) - { - VmaSuballocationList::const_iterator nextSuballocItem = suballocItem; - ++nextSuballocItem; - while (nextSuballocItem != m_Suballocations.cend()) - { - const VmaSuballocation& nextSuballoc = *nextSuballocItem; - if (VmaBlocksOnSamePage(offset, allocSize, nextSuballoc.offset, bufferImageGranularity)) - { - if (VmaIsBufferImageGranularityConflict(allocType, nextSuballoc.type)) - { - return false; - } - } - else - { - // Already on next page. - break; - } - ++nextSuballocItem; - } - } - - *pAllocHandle = (VmaAllocHandle)(offset + 1); - // All tests passed: Success. pAllocHandle is already filled. - return true; -} - -void VmaBlockMetadata_Generic::MergeFreeWithNext(VmaSuballocationList::iterator item) -{ - VMA_ASSERT(item != m_Suballocations.end()); - VMA_ASSERT(item->type == VMA_SUBALLOCATION_TYPE_FREE); - - VmaSuballocationList::iterator nextItem = item; - ++nextItem; - VMA_ASSERT(nextItem != m_Suballocations.end()); - VMA_ASSERT(nextItem->type == VMA_SUBALLOCATION_TYPE_FREE); - - item->size += nextItem->size; - --m_FreeCount; - m_Suballocations.erase(nextItem); -} - -VmaSuballocationList::iterator VmaBlockMetadata_Generic::FreeSuballocation(VmaSuballocationList::iterator suballocItem) -{ - // Change this suballocation to be marked as free. - VmaSuballocation& suballoc = *suballocItem; - suballoc.type = VMA_SUBALLOCATION_TYPE_FREE; - suballoc.userData = VMA_NULL; - - // Update totals. - ++m_FreeCount; - m_SumFreeSize += suballoc.size; - - // Merge with previous and/or next suballocation if it's also free. - bool mergeWithNext = false; - bool mergeWithPrev = false; - - VmaSuballocationList::iterator nextItem = suballocItem; - ++nextItem; - if ((nextItem != m_Suballocations.end()) && (nextItem->type == VMA_SUBALLOCATION_TYPE_FREE)) - { - mergeWithNext = true; - } - - VmaSuballocationList::iterator prevItem = suballocItem; - if (suballocItem != m_Suballocations.begin()) - { - --prevItem; - if (prevItem->type == VMA_SUBALLOCATION_TYPE_FREE) - { - mergeWithPrev = true; - } - } - - if (mergeWithNext) - { - UnregisterFreeSuballocation(nextItem); - MergeFreeWithNext(suballocItem); - } - - if (mergeWithPrev) - { - UnregisterFreeSuballocation(prevItem); - MergeFreeWithNext(prevItem); - RegisterFreeSuballocation(prevItem); - return prevItem; - } - else - { - RegisterFreeSuballocation(suballocItem); - return suballocItem; - } -} - -void VmaBlockMetadata_Generic::RegisterFreeSuballocation(VmaSuballocationList::iterator item) -{ - VMA_ASSERT(item->type == VMA_SUBALLOCATION_TYPE_FREE); - VMA_ASSERT(item->size > 0); - - // You may want to enable this validation at the beginning or at the end of - // this function, depending on what do you want to check. 
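// Either way, the insertion below goes through VmaVectorInsertSorted(), which
// keeps m_FreeSuballocationsBySize ordered by size ascending; that ordering is
// what ValidateFreeSuballocationList() checks and what the binary searches in
// CreateAllocationRequest() and UnregisterFreeSuballocation() rely on.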
- VMA_HEAVY_ASSERT(ValidateFreeSuballocationList()); - - if (m_FreeSuballocationsBySize.empty()) - { - m_FreeSuballocationsBySize.push_back(item); - } - else - { - VmaVectorInsertSorted(m_FreeSuballocationsBySize, item); - } - - //VMA_HEAVY_ASSERT(ValidateFreeSuballocationList()); -} - -void VmaBlockMetadata_Generic::UnregisterFreeSuballocation(VmaSuballocationList::iterator item) -{ - VMA_ASSERT(item->type == VMA_SUBALLOCATION_TYPE_FREE); - VMA_ASSERT(item->size > 0); - - // You may want to enable this validation at the beginning or at the end of - // this function, depending on what do you want to check. - VMA_HEAVY_ASSERT(ValidateFreeSuballocationList()); - - VmaSuballocationList::iterator* const it = VmaBinaryFindFirstNotLess( - m_FreeSuballocationsBySize.data(), - m_FreeSuballocationsBySize.data() + m_FreeSuballocationsBySize.size(), - item, - VmaSuballocationItemSizeLess()); - for (size_t index = it - m_FreeSuballocationsBySize.data(); - index < m_FreeSuballocationsBySize.size(); - ++index) - { - if (m_FreeSuballocationsBySize[index] == item) - { - VmaVectorRemove(m_FreeSuballocationsBySize, index); - return; - } - VMA_ASSERT((m_FreeSuballocationsBySize[index]->size == item->size) && "Not found."); - } - VMA_ASSERT(0 && "Not found."); - - //VMA_HEAVY_ASSERT(ValidateFreeSuballocationList()); -} -#endif // _VMA_BLOCK_METADATA_GENERIC_FUNCTIONS -#endif // _VMA_BLOCK_METADATA_GENERIC -#endif // #if 0 - -#ifndef _VMA_BLOCK_METADATA_LINEAR -/* -Allocations and their references in internal data structure look like this: - -if(m_2ndVectorMode == SECOND_VECTOR_EMPTY): - - 0 +-------+ - | | - | | - | | - +-------+ - | Alloc | 1st[m_1stNullItemsBeginCount] - +-------+ - | Alloc | 1st[m_1stNullItemsBeginCount + 1] - +-------+ - | ... | - +-------+ - | Alloc | 1st[1st.size() - 1] - +-------+ - | | - | | - | | -GetSize() +-------+ - -if(m_2ndVectorMode == SECOND_VECTOR_RING_BUFFER): - - 0 +-------+ - | Alloc | 2nd[0] - +-------+ - | Alloc | 2nd[1] - +-------+ - | ... | - +-------+ - | Alloc | 2nd[2nd.size() - 1] - +-------+ - | | - | | - | | - +-------+ - | Alloc | 1st[m_1stNullItemsBeginCount] - +-------+ - | Alloc | 1st[m_1stNullItemsBeginCount + 1] - +-------+ - | ... | - +-------+ - | Alloc | 1st[1st.size() - 1] - +-------+ - | | -GetSize() +-------+ - -if(m_2ndVectorMode == SECOND_VECTOR_DOUBLE_STACK): - - 0 +-------+ - | | - | | - | | - +-------+ - | Alloc | 1st[m_1stNullItemsBeginCount] - +-------+ - | Alloc | 1st[m_1stNullItemsBeginCount + 1] - +-------+ - | ... | - +-------+ - | Alloc | 1st[1st.size() - 1] - +-------+ - | | - | | - | | - +-------+ - | Alloc | 2nd[2nd.size() - 1] - +-------+ - | ... 
| - +-------+ - | Alloc | 2nd[1] - +-------+ - | Alloc | 2nd[0] -GetSize() +-------+ - -*/ -class VmaBlockMetadata_Linear : public VmaBlockMetadata -{ - VMA_CLASS_NO_COPY(VmaBlockMetadata_Linear) -public: - VmaBlockMetadata_Linear(const VkAllocationCallbacks* pAllocationCallbacks, - VkDeviceSize bufferImageGranularity, bool isVirtual); - virtual ~VmaBlockMetadata_Linear() = default; - - VkDeviceSize GetSumFreeSize() const override { return m_SumFreeSize; } - bool IsEmpty() const override { return GetAllocationCount() == 0; } - VkDeviceSize GetAllocationOffset(VmaAllocHandle allocHandle) const override { return (VkDeviceSize)allocHandle - 1; }; - - void Init(VkDeviceSize size) override; - bool Validate() const override; - size_t GetAllocationCount() const override; - size_t GetFreeRegionsCount() const override; - - void AddDetailedStatistics(VmaDetailedStatistics& inoutStats) const override; - void AddStatistics(VmaStatistics& inoutStats) const override; - -#if VMA_STATS_STRING_ENABLED - void PrintDetailedMap(class VmaJsonWriter& json) const override; -#endif - - bool CreateAllocationRequest( - VkDeviceSize allocSize, - VkDeviceSize allocAlignment, - bool upperAddress, - VmaSuballocationType allocType, - uint32_t strategy, - VmaAllocationRequest* pAllocationRequest) override; - - VkResult CheckCorruption(const void* pBlockData) override; - - void Alloc( - const VmaAllocationRequest& request, - VmaSuballocationType type, - void* userData) override; - - void Free(VmaAllocHandle allocHandle) override; - void GetAllocationInfo(VmaAllocHandle allocHandle, VmaVirtualAllocationInfo& outInfo) override; - void* GetAllocationUserData(VmaAllocHandle allocHandle) const override; - VmaAllocHandle GetAllocationListBegin() const override; - VmaAllocHandle GetNextAllocation(VmaAllocHandle prevAlloc) const override; - VkDeviceSize GetNextFreeRegionSize(VmaAllocHandle alloc) const override; - void Clear() override; - void SetAllocationUserData(VmaAllocHandle allocHandle, void* userData) override; - void DebugLogAllAllocations() const override; - -private: - /* - There are two suballocation vectors, used in ping-pong way. - The one with index m_1stVectorIndex is called 1st. - The one with index (m_1stVectorIndex ^ 1) is called 2nd. - 2nd can be non-empty only when 1st is not empty. - When 2nd is not empty, m_2ndVectorMode indicates its mode of operation. - */ - typedef VmaVector> SuballocationVectorType; - - enum SECOND_VECTOR_MODE - { - SECOND_VECTOR_EMPTY, - /* - Suballocations in 2nd vector are created later than the ones in 1st, but they - all have smaller offset. - */ - SECOND_VECTOR_RING_BUFFER, - /* - Suballocations in 2nd vector are upper side of double stack. - They all have offsets higher than those in 1st vector. - Top of this stack means smaller offsets, but higher indices in this vector. - */ - SECOND_VECTOR_DOUBLE_STACK, - }; - - VkDeviceSize m_SumFreeSize; - SuballocationVectorType m_Suballocations0, m_Suballocations1; - uint32_t m_1stVectorIndex; - SECOND_VECTOR_MODE m_2ndVectorMode; - // Number of items in 1st vector with hAllocation = null at the beginning. - size_t m_1stNullItemsBeginCount; - // Number of other items in 1st vector with hAllocation = null somewhere in the middle. - size_t m_1stNullItemsMiddleCount; - // Number of items in 2nd vector with hAllocation = null. - size_t m_2ndNullItemsCount; - - SuballocationVectorType& AccessSuballocations1st() { return m_1stVectorIndex ? 
m_Suballocations1 : m_Suballocations0; } - SuballocationVectorType& AccessSuballocations2nd() { return m_1stVectorIndex ? m_Suballocations0 : m_Suballocations1; } - const SuballocationVectorType& AccessSuballocations1st() const { return m_1stVectorIndex ? m_Suballocations1 : m_Suballocations0; } - const SuballocationVectorType& AccessSuballocations2nd() const { return m_1stVectorIndex ? m_Suballocations0 : m_Suballocations1; } - - VmaSuballocation& FindSuballocation(VkDeviceSize offset) const; - bool ShouldCompact1st() const; - void CleanupAfterFree(); - - bool CreateAllocationRequest_LowerAddress( - VkDeviceSize allocSize, - VkDeviceSize allocAlignment, - VmaSuballocationType allocType, - uint32_t strategy, - VmaAllocationRequest* pAllocationRequest); - bool CreateAllocationRequest_UpperAddress( - VkDeviceSize allocSize, - VkDeviceSize allocAlignment, - VmaSuballocationType allocType, - uint32_t strategy, - VmaAllocationRequest* pAllocationRequest); -}; - -#ifndef _VMA_BLOCK_METADATA_LINEAR_FUNCTIONS -VmaBlockMetadata_Linear::VmaBlockMetadata_Linear(const VkAllocationCallbacks* pAllocationCallbacks, - VkDeviceSize bufferImageGranularity, bool isVirtual) - : VmaBlockMetadata(pAllocationCallbacks, bufferImageGranularity, isVirtual), - m_SumFreeSize(0), - m_Suballocations0(VmaStlAllocator(pAllocationCallbacks)), - m_Suballocations1(VmaStlAllocator(pAllocationCallbacks)), - m_1stVectorIndex(0), - m_2ndVectorMode(SECOND_VECTOR_EMPTY), - m_1stNullItemsBeginCount(0), - m_1stNullItemsMiddleCount(0), - m_2ndNullItemsCount(0) {} - -void VmaBlockMetadata_Linear::Init(VkDeviceSize size) -{ - VmaBlockMetadata::Init(size); - m_SumFreeSize = size; -} - -bool VmaBlockMetadata_Linear::Validate() const -{ - const SuballocationVectorType& suballocations1st = AccessSuballocations1st(); - const SuballocationVectorType& suballocations2nd = AccessSuballocations2nd(); - - VMA_VALIDATE(suballocations2nd.empty() == (m_2ndVectorMode == SECOND_VECTOR_EMPTY)); - VMA_VALIDATE(!suballocations1st.empty() || - suballocations2nd.empty() || - m_2ndVectorMode != SECOND_VECTOR_RING_BUFFER); - - if (!suballocations1st.empty()) - { - // Null item at the beginning should be accounted into m_1stNullItemsBeginCount. - VMA_VALIDATE(suballocations1st[m_1stNullItemsBeginCount].type != VMA_SUBALLOCATION_TYPE_FREE); - // Null item at the end should be just pop_back(). - VMA_VALIDATE(suballocations1st.back().type != VMA_SUBALLOCATION_TYPE_FREE); - } - if (!suballocations2nd.empty()) - { - // Null item at the end should be just pop_back(). 
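// In other words, trailing free items are expected to be removed eagerly
// (presumably by CleanupAfterFree()) rather than left in the vectors, so the
// back() of a non-empty vector must always be a live allocation.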
- VMA_VALIDATE(suballocations2nd.back().type != VMA_SUBALLOCATION_TYPE_FREE); - } - - VMA_VALIDATE(m_1stNullItemsBeginCount + m_1stNullItemsMiddleCount <= suballocations1st.size()); - VMA_VALIDATE(m_2ndNullItemsCount <= suballocations2nd.size()); - - VkDeviceSize sumUsedSize = 0; - const size_t suballoc1stCount = suballocations1st.size(); - const VkDeviceSize debugMargin = GetDebugMargin(); - VkDeviceSize offset = 0; - - if (m_2ndVectorMode == SECOND_VECTOR_RING_BUFFER) - { - const size_t suballoc2ndCount = suballocations2nd.size(); - size_t nullItem2ndCount = 0; - for (size_t i = 0; i < suballoc2ndCount; ++i) - { - const VmaSuballocation& suballoc = suballocations2nd[i]; - const bool currFree = (suballoc.type == VMA_SUBALLOCATION_TYPE_FREE); - - VmaAllocation const alloc = (VmaAllocation)suballoc.userData; - if (!IsVirtual()) - { - VMA_VALIDATE(currFree == (alloc == VK_NULL_HANDLE)); - } - VMA_VALIDATE(suballoc.offset >= offset); - - if (!currFree) - { - if (!IsVirtual()) - { - VMA_VALIDATE((VkDeviceSize)alloc->GetAllocHandle() == suballoc.offset + 1); - VMA_VALIDATE(alloc->GetSize() == suballoc.size); - } - sumUsedSize += suballoc.size; - } - else - { - ++nullItem2ndCount; - } - - offset = suballoc.offset + suballoc.size + debugMargin; - } - - VMA_VALIDATE(nullItem2ndCount == m_2ndNullItemsCount); - } - - for (size_t i = 0; i < m_1stNullItemsBeginCount; ++i) - { - const VmaSuballocation& suballoc = suballocations1st[i]; - VMA_VALIDATE(suballoc.type == VMA_SUBALLOCATION_TYPE_FREE && - suballoc.userData == VMA_NULL); - } - - size_t nullItem1stCount = m_1stNullItemsBeginCount; - - for (size_t i = m_1stNullItemsBeginCount; i < suballoc1stCount; ++i) - { - const VmaSuballocation& suballoc = suballocations1st[i]; - const bool currFree = (suballoc.type == VMA_SUBALLOCATION_TYPE_FREE); - - VmaAllocation const alloc = (VmaAllocation)suballoc.userData; - if (!IsVirtual()) - { - VMA_VALIDATE(currFree == (alloc == VK_NULL_HANDLE)); - } - VMA_VALIDATE(suballoc.offset >= offset); - VMA_VALIDATE(i >= m_1stNullItemsBeginCount || currFree); - - if (!currFree) - { - if (!IsVirtual()) - { - VMA_VALIDATE((VkDeviceSize)alloc->GetAllocHandle() == suballoc.offset + 1); - VMA_VALIDATE(alloc->GetSize() == suballoc.size); - } - sumUsedSize += suballoc.size; - } - else - { - ++nullItem1stCount; - } - - offset = suballoc.offset + suballoc.size + debugMargin; - } - VMA_VALIDATE(nullItem1stCount == m_1stNullItemsBeginCount + m_1stNullItemsMiddleCount); - - if (m_2ndVectorMode == SECOND_VECTOR_DOUBLE_STACK) - { - const size_t suballoc2ndCount = suballocations2nd.size(); - size_t nullItem2ndCount = 0; - for (size_t i = suballoc2ndCount; i--; ) - { - const VmaSuballocation& suballoc = suballocations2nd[i]; - const bool currFree = (suballoc.type == VMA_SUBALLOCATION_TYPE_FREE); - - VmaAllocation const alloc = (VmaAllocation)suballoc.userData; - if (!IsVirtual()) - { - VMA_VALIDATE(currFree == (alloc == VK_NULL_HANDLE)); - } - VMA_VALIDATE(suballoc.offset >= offset); - - if (!currFree) - { - if (!IsVirtual()) - { - VMA_VALIDATE((VkDeviceSize)alloc->GetAllocHandle() == suballoc.offset + 1); - VMA_VALIDATE(alloc->GetSize() == suballoc.size); - } - sumUsedSize += suballoc.size; - } - else - { - ++nullItem2ndCount; - } - - offset = suballoc.offset + suballoc.size + debugMargin; - } - - VMA_VALIDATE(nullItem2ndCount == m_2ndNullItemsCount); - } - - VMA_VALIDATE(offset <= GetSize()); - VMA_VALIDATE(m_SumFreeSize == GetSize() - sumUsedSize); - - return true; -} - -size_t VmaBlockMetadata_Linear::GetAllocationCount() const -{ - 
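// Live allocations are whatever remains after discounting the null (freed)
// items still physically present in the vectors:
//   1st.size() - leading nulls - middle nulls + 2nd.size() - nulls in 2nd.
// E.g. (hypothetical numbers) 1st.size() == 10 with 2 leading and 1 middle
// null, plus 2nd.size() == 4 with 1 null, gives 7 + 3 = 10 allocations.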
return AccessSuballocations1st().size() - m_1stNullItemsBeginCount - m_1stNullItemsMiddleCount + - AccessSuballocations2nd().size() - m_2ndNullItemsCount; -} - -size_t VmaBlockMetadata_Linear::GetFreeRegionsCount() const -{ - // Function only used for defragmentation, which is disabled for this algorithm - VMA_ASSERT(0); - return SIZE_MAX; -} - -void VmaBlockMetadata_Linear::AddDetailedStatistics(VmaDetailedStatistics& inoutStats) const -{ - const VkDeviceSize size = GetSize(); - const SuballocationVectorType& suballocations1st = AccessSuballocations1st(); - const SuballocationVectorType& suballocations2nd = AccessSuballocations2nd(); - const size_t suballoc1stCount = suballocations1st.size(); - const size_t suballoc2ndCount = suballocations2nd.size(); - - inoutStats.statistics.blockCount++; - inoutStats.statistics.blockBytes += size; - - VkDeviceSize lastOffset = 0; - - if (m_2ndVectorMode == SECOND_VECTOR_RING_BUFFER) - { - const VkDeviceSize freeSpace2ndTo1stEnd = suballocations1st[m_1stNullItemsBeginCount].offset; - size_t nextAlloc2ndIndex = 0; - while (lastOffset < freeSpace2ndTo1stEnd) - { - // Find next non-null allocation or move nextAllocIndex to the end. - while (nextAlloc2ndIndex < suballoc2ndCount && - suballocations2nd[nextAlloc2ndIndex].userData == VMA_NULL) - { - ++nextAlloc2ndIndex; - } - - // Found non-null allocation. - if (nextAlloc2ndIndex < suballoc2ndCount) - { - const VmaSuballocation& suballoc = suballocations2nd[nextAlloc2ndIndex]; - - // 1. Process free space before this allocation. - if (lastOffset < suballoc.offset) - { - // There is free space from lastOffset to suballoc.offset. - const VkDeviceSize unusedRangeSize = suballoc.offset - lastOffset; - VmaAddDetailedStatisticsUnusedRange(inoutStats, unusedRangeSize); - } - - // 2. Process this allocation. - // There is allocation with suballoc.offset, suballoc.size. - VmaAddDetailedStatisticsAllocation(inoutStats, suballoc.size); - - // 3. Prepare for next iteration. - lastOffset = suballoc.offset + suballoc.size; - ++nextAlloc2ndIndex; - } - // We are at the end. - else - { - // There is free space from lastOffset to freeSpace2ndTo1stEnd. - if (lastOffset < freeSpace2ndTo1stEnd) - { - const VkDeviceSize unusedRangeSize = freeSpace2ndTo1stEnd - lastOffset; - VmaAddDetailedStatisticsUnusedRange(inoutStats, unusedRangeSize); - } - - // End of loop. - lastOffset = freeSpace2ndTo1stEnd; - } - } - } - - size_t nextAlloc1stIndex = m_1stNullItemsBeginCount; - const VkDeviceSize freeSpace1stTo2ndEnd = - m_2ndVectorMode == SECOND_VECTOR_DOUBLE_STACK ? suballocations2nd.back().offset : size; - while (lastOffset < freeSpace1stTo2ndEnd) - { - // Find next non-null allocation or move nextAllocIndex to the end. - while (nextAlloc1stIndex < suballoc1stCount && - suballocations1st[nextAlloc1stIndex].userData == VMA_NULL) - { - ++nextAlloc1stIndex; - } - - // Found non-null allocation. - if (nextAlloc1stIndex < suballoc1stCount) - { - const VmaSuballocation& suballoc = suballocations1st[nextAlloc1stIndex]; - - // 1. Process free space before this allocation. - if (lastOffset < suballoc.offset) - { - // There is free space from lastOffset to suballoc.offset. - const VkDeviceSize unusedRangeSize = suballoc.offset - lastOffset; - VmaAddDetailedStatisticsUnusedRange(inoutStats, unusedRangeSize); - } - - // 2. Process this allocation. - // There is allocation with suballoc.offset, suballoc.size. - VmaAddDetailedStatisticsAllocation(inoutStats, suballoc.size); - - // 3. Prepare for next iteration. 
- lastOffset = suballoc.offset + suballoc.size; - ++nextAlloc1stIndex; - } - // We are at the end. - else - { - // There is free space from lastOffset to freeSpace1stTo2ndEnd. - if (lastOffset < freeSpace1stTo2ndEnd) - { - const VkDeviceSize unusedRangeSize = freeSpace1stTo2ndEnd - lastOffset; - VmaAddDetailedStatisticsUnusedRange(inoutStats, unusedRangeSize); - } - - // End of loop. - lastOffset = freeSpace1stTo2ndEnd; - } - } - - if (m_2ndVectorMode == SECOND_VECTOR_DOUBLE_STACK) - { - size_t nextAlloc2ndIndex = suballocations2nd.size() - 1; - while (lastOffset < size) - { - // Find next non-null allocation or move nextAllocIndex to the end. - while (nextAlloc2ndIndex != SIZE_MAX && - suballocations2nd[nextAlloc2ndIndex].userData == VMA_NULL) - { - --nextAlloc2ndIndex; - } - - // Found non-null allocation. - if (nextAlloc2ndIndex != SIZE_MAX) - { - const VmaSuballocation& suballoc = suballocations2nd[nextAlloc2ndIndex]; - - // 1. Process free space before this allocation. - if (lastOffset < suballoc.offset) - { - // There is free space from lastOffset to suballoc.offset. - const VkDeviceSize unusedRangeSize = suballoc.offset - lastOffset; - VmaAddDetailedStatisticsUnusedRange(inoutStats, unusedRangeSize); - } - - // 2. Process this allocation. - // There is allocation with suballoc.offset, suballoc.size. - VmaAddDetailedStatisticsAllocation(inoutStats, suballoc.size); - - // 3. Prepare for next iteration. - lastOffset = suballoc.offset + suballoc.size; - --nextAlloc2ndIndex; - } - // We are at the end. - else - { - // There is free space from lastOffset to size. - if (lastOffset < size) - { - const VkDeviceSize unusedRangeSize = size - lastOffset; - VmaAddDetailedStatisticsUnusedRange(inoutStats, unusedRangeSize); - } - - // End of loop. - lastOffset = size; - } - } - } -} - -void VmaBlockMetadata_Linear::AddStatistics(VmaStatistics& inoutStats) const -{ - const SuballocationVectorType& suballocations1st = AccessSuballocations1st(); - const SuballocationVectorType& suballocations2nd = AccessSuballocations2nd(); - const VkDeviceSize size = GetSize(); - const size_t suballoc1stCount = suballocations1st.size(); - const size_t suballoc2ndCount = suballocations2nd.size(); - - inoutStats.blockCount++; - inoutStats.blockBytes += size; - inoutStats.allocationBytes += size - m_SumFreeSize; - - VkDeviceSize lastOffset = 0; - - if (m_2ndVectorMode == SECOND_VECTOR_RING_BUFFER) - { - const VkDeviceSize freeSpace2ndTo1stEnd = suballocations1st[m_1stNullItemsBeginCount].offset; - size_t nextAlloc2ndIndex = m_1stNullItemsBeginCount; - while (lastOffset < freeSpace2ndTo1stEnd) - { - // Find next non-null allocation or move nextAlloc2ndIndex to the end. - while (nextAlloc2ndIndex < suballoc2ndCount && - suballocations2nd[nextAlloc2ndIndex].userData == VMA_NULL) - { - ++nextAlloc2ndIndex; - } - - // Found non-null allocation. - if (nextAlloc2ndIndex < suballoc2ndCount) - { - const VmaSuballocation& suballoc = suballocations2nd[nextAlloc2ndIndex]; - - // 1. Process free space before this allocation. - if (lastOffset < suballoc.offset) - { - // There is free space from lastOffset to suballoc.offset. - const VkDeviceSize unusedRangeSize = suballoc.offset - lastOffset; - } - - // 2. Process this allocation. - // There is allocation with suballoc.offset, suballoc.size. - ++inoutStats.allocationCount; - - // 3. Prepare for next iteration. - lastOffset = suballoc.offset + suballoc.size; - ++nextAlloc2ndIndex; - } - // We are at the end. 
- else - { - if (lastOffset < freeSpace2ndTo1stEnd) - { - // There is free space from lastOffset to freeSpace2ndTo1stEnd. - const VkDeviceSize unusedRangeSize = freeSpace2ndTo1stEnd - lastOffset; - } - - // End of loop. - lastOffset = freeSpace2ndTo1stEnd; - } - } - } - - size_t nextAlloc1stIndex = m_1stNullItemsBeginCount; - const VkDeviceSize freeSpace1stTo2ndEnd = - m_2ndVectorMode == SECOND_VECTOR_DOUBLE_STACK ? suballocations2nd.back().offset : size; - while (lastOffset < freeSpace1stTo2ndEnd) - { - // Find next non-null allocation or move nextAllocIndex to the end. - while (nextAlloc1stIndex < suballoc1stCount && - suballocations1st[nextAlloc1stIndex].userData == VMA_NULL) - { - ++nextAlloc1stIndex; - } - - // Found non-null allocation. - if (nextAlloc1stIndex < suballoc1stCount) - { - const VmaSuballocation& suballoc = suballocations1st[nextAlloc1stIndex]; - - // 1. Process free space before this allocation. - if (lastOffset < suballoc.offset) - { - // There is free space from lastOffset to suballoc.offset. - const VkDeviceSize unusedRangeSize = suballoc.offset - lastOffset; - } - - // 2. Process this allocation. - // There is allocation with suballoc.offset, suballoc.size. - ++inoutStats.allocationCount; - - // 3. Prepare for next iteration. - lastOffset = suballoc.offset + suballoc.size; - ++nextAlloc1stIndex; - } - // We are at the end. - else - { - if (lastOffset < freeSpace1stTo2ndEnd) - { - // There is free space from lastOffset to freeSpace1stTo2ndEnd. - const VkDeviceSize unusedRangeSize = freeSpace1stTo2ndEnd - lastOffset; - } - - // End of loop. - lastOffset = freeSpace1stTo2ndEnd; - } - } - - if (m_2ndVectorMode == SECOND_VECTOR_DOUBLE_STACK) - { - size_t nextAlloc2ndIndex = suballocations2nd.size() - 1; - while (lastOffset < size) - { - // Find next non-null allocation or move nextAlloc2ndIndex to the end. - while (nextAlloc2ndIndex != SIZE_MAX && - suballocations2nd[nextAlloc2ndIndex].userData == VMA_NULL) - { - --nextAlloc2ndIndex; - } - - // Found non-null allocation. - if (nextAlloc2ndIndex != SIZE_MAX) - { - const VmaSuballocation& suballoc = suballocations2nd[nextAlloc2ndIndex]; - - // 1. Process free space before this allocation. - if (lastOffset < suballoc.offset) - { - // There is free space from lastOffset to suballoc.offset. - const VkDeviceSize unusedRangeSize = suballoc.offset - lastOffset; - } - - // 2. Process this allocation. - // There is allocation with suballoc.offset, suballoc.size. - ++inoutStats.allocationCount; - - // 3. Prepare for next iteration. - lastOffset = suballoc.offset + suballoc.size; - --nextAlloc2ndIndex; - } - // We are at the end. - else - { - if (lastOffset < size) - { - // There is free space from lastOffset to size. - const VkDeviceSize unusedRangeSize = size - lastOffset; - } - - // End of loop. 
- lastOffset = size; - } - } - } -} - -#if VMA_STATS_STRING_ENABLED -void VmaBlockMetadata_Linear::PrintDetailedMap(class VmaJsonWriter& json) const -{ - const VkDeviceSize size = GetSize(); - const SuballocationVectorType& suballocations1st = AccessSuballocations1st(); - const SuballocationVectorType& suballocations2nd = AccessSuballocations2nd(); - const size_t suballoc1stCount = suballocations1st.size(); - const size_t suballoc2ndCount = suballocations2nd.size(); - - // FIRST PASS - - size_t unusedRangeCount = 0; - VkDeviceSize usedBytes = 0; - - VkDeviceSize lastOffset = 0; - - size_t alloc2ndCount = 0; - if (m_2ndVectorMode == SECOND_VECTOR_RING_BUFFER) - { - const VkDeviceSize freeSpace2ndTo1stEnd = suballocations1st[m_1stNullItemsBeginCount].offset; - size_t nextAlloc2ndIndex = 0; - while (lastOffset < freeSpace2ndTo1stEnd) - { - // Find next non-null allocation or move nextAlloc2ndIndex to the end. - while (nextAlloc2ndIndex < suballoc2ndCount && - suballocations2nd[nextAlloc2ndIndex].userData == VMA_NULL) - { - ++nextAlloc2ndIndex; - } - - // Found non-null allocation. - if (nextAlloc2ndIndex < suballoc2ndCount) - { - const VmaSuballocation& suballoc = suballocations2nd[nextAlloc2ndIndex]; - - // 1. Process free space before this allocation. - if (lastOffset < suballoc.offset) - { - // There is free space from lastOffset to suballoc.offset. - ++unusedRangeCount; - } - - // 2. Process this allocation. - // There is allocation with suballoc.offset, suballoc.size. - ++alloc2ndCount; - usedBytes += suballoc.size; - - // 3. Prepare for next iteration. - lastOffset = suballoc.offset + suballoc.size; - ++nextAlloc2ndIndex; - } - // We are at the end. - else - { - if (lastOffset < freeSpace2ndTo1stEnd) - { - // There is free space from lastOffset to freeSpace2ndTo1stEnd. - ++unusedRangeCount; - } - - // End of loop. - lastOffset = freeSpace2ndTo1stEnd; - } - } - } - - size_t nextAlloc1stIndex = m_1stNullItemsBeginCount; - size_t alloc1stCount = 0; - const VkDeviceSize freeSpace1stTo2ndEnd = - m_2ndVectorMode == SECOND_VECTOR_DOUBLE_STACK ? suballocations2nd.back().offset : size; - while (lastOffset < freeSpace1stTo2ndEnd) - { - // Find next non-null allocation or move nextAllocIndex to the end. - while (nextAlloc1stIndex < suballoc1stCount && - suballocations1st[nextAlloc1stIndex].userData == VMA_NULL) - { - ++nextAlloc1stIndex; - } - - // Found non-null allocation. - if (nextAlloc1stIndex < suballoc1stCount) - { - const VmaSuballocation& suballoc = suballocations1st[nextAlloc1stIndex]; - - // 1. Process free space before this allocation. - if (lastOffset < suballoc.offset) - { - // There is free space from lastOffset to suballoc.offset. - ++unusedRangeCount; - } - - // 2. Process this allocation. - // There is allocation with suballoc.offset, suballoc.size. - ++alloc1stCount; - usedBytes += suballoc.size; - - // 3. Prepare for next iteration. - lastOffset = suballoc.offset + suballoc.size; - ++nextAlloc1stIndex; - } - // We are at the end. - else - { - if (lastOffset < size) - { - // There is free space from lastOffset to freeSpace1stTo2ndEnd. - ++unusedRangeCount; - } - - // End of loop. - lastOffset = freeSpace1stTo2ndEnd; - } - } - - if (m_2ndVectorMode == SECOND_VECTOR_DOUBLE_STACK) - { - size_t nextAlloc2ndIndex = suballocations2nd.size() - 1; - while (lastOffset < size) - { - // Find next non-null allocation or move nextAlloc2ndIndex to the end. 
- while (nextAlloc2ndIndex != SIZE_MAX && - suballocations2nd[nextAlloc2ndIndex].userData == VMA_NULL) - { - --nextAlloc2ndIndex; - } - - // Found non-null allocation. - if (nextAlloc2ndIndex != SIZE_MAX) - { - const VmaSuballocation& suballoc = suballocations2nd[nextAlloc2ndIndex]; - - // 1. Process free space before this allocation. - if (lastOffset < suballoc.offset) - { - // There is free space from lastOffset to suballoc.offset. - ++unusedRangeCount; - } - - // 2. Process this allocation. - // There is allocation with suballoc.offset, suballoc.size. - ++alloc2ndCount; - usedBytes += suballoc.size; - - // 3. Prepare for next iteration. - lastOffset = suballoc.offset + suballoc.size; - --nextAlloc2ndIndex; - } - // We are at the end. - else - { - if (lastOffset < size) - { - // There is free space from lastOffset to size. - ++unusedRangeCount; - } - - // End of loop. - lastOffset = size; - } - } - } - - const VkDeviceSize unusedBytes = size - usedBytes; - PrintDetailedMap_Begin(json, unusedBytes, alloc1stCount + alloc2ndCount, unusedRangeCount); - - // SECOND PASS - lastOffset = 0; - - if (m_2ndVectorMode == SECOND_VECTOR_RING_BUFFER) - { - const VkDeviceSize freeSpace2ndTo1stEnd = suballocations1st[m_1stNullItemsBeginCount].offset; - size_t nextAlloc2ndIndex = 0; - while (lastOffset < freeSpace2ndTo1stEnd) - { - // Find next non-null allocation or move nextAlloc2ndIndex to the end. - while (nextAlloc2ndIndex < suballoc2ndCount && - suballocations2nd[nextAlloc2ndIndex].userData == VMA_NULL) - { - ++nextAlloc2ndIndex; - } - - // Found non-null allocation. - if (nextAlloc2ndIndex < suballoc2ndCount) - { - const VmaSuballocation& suballoc = suballocations2nd[nextAlloc2ndIndex]; - - // 1. Process free space before this allocation. - if (lastOffset < suballoc.offset) - { - // There is free space from lastOffset to suballoc.offset. - const VkDeviceSize unusedRangeSize = suballoc.offset - lastOffset; - PrintDetailedMap_UnusedRange(json, lastOffset, unusedRangeSize); - } - - // 2. Process this allocation. - // There is allocation with suballoc.offset, suballoc.size. - PrintDetailedMap_Allocation(json, suballoc.offset, suballoc.size, suballoc.userData); - - // 3. Prepare for next iteration. - lastOffset = suballoc.offset + suballoc.size; - ++nextAlloc2ndIndex; - } - // We are at the end. - else - { - if (lastOffset < freeSpace2ndTo1stEnd) - { - // There is free space from lastOffset to freeSpace2ndTo1stEnd. - const VkDeviceSize unusedRangeSize = freeSpace2ndTo1stEnd - lastOffset; - PrintDetailedMap_UnusedRange(json, lastOffset, unusedRangeSize); - } - - // End of loop. - lastOffset = freeSpace2ndTo1stEnd; - } - } - } - - nextAlloc1stIndex = m_1stNullItemsBeginCount; - while (lastOffset < freeSpace1stTo2ndEnd) - { - // Find next non-null allocation or move nextAllocIndex to the end. - while (nextAlloc1stIndex < suballoc1stCount && - suballocations1st[nextAlloc1stIndex].userData == VMA_NULL) - { - ++nextAlloc1stIndex; - } - - // Found non-null allocation. - if (nextAlloc1stIndex < suballoc1stCount) - { - const VmaSuballocation& suballoc = suballocations1st[nextAlloc1stIndex]; - - // 1. Process free space before this allocation. - if (lastOffset < suballoc.offset) - { - // There is free space from lastOffset to suballoc.offset. - const VkDeviceSize unusedRangeSize = suballoc.offset - lastOffset; - PrintDetailedMap_UnusedRange(json, lastOffset, unusedRangeSize); - } - - // 2. Process this allocation. - // There is allocation with suballoc.offset, suballoc.size. 
- PrintDetailedMap_Allocation(json, suballoc.offset, suballoc.size, suballoc.userData); - - // 3. Prepare for next iteration. - lastOffset = suballoc.offset + suballoc.size; - ++nextAlloc1stIndex; - } - // We are at the end. - else - { - if (lastOffset < freeSpace1stTo2ndEnd) - { - // There is free space from lastOffset to freeSpace1stTo2ndEnd. - const VkDeviceSize unusedRangeSize = freeSpace1stTo2ndEnd - lastOffset; - PrintDetailedMap_UnusedRange(json, lastOffset, unusedRangeSize); - } - - // End of loop. - lastOffset = freeSpace1stTo2ndEnd; - } - } - - if (m_2ndVectorMode == SECOND_VECTOR_DOUBLE_STACK) - { - size_t nextAlloc2ndIndex = suballocations2nd.size() - 1; - while (lastOffset < size) - { - // Find next non-null allocation or move nextAlloc2ndIndex to the end. - while (nextAlloc2ndIndex != SIZE_MAX && - suballocations2nd[nextAlloc2ndIndex].userData == VMA_NULL) - { - --nextAlloc2ndIndex; - } - - // Found non-null allocation. - if (nextAlloc2ndIndex != SIZE_MAX) - { - const VmaSuballocation& suballoc = suballocations2nd[nextAlloc2ndIndex]; - - // 1. Process free space before this allocation. - if (lastOffset < suballoc.offset) - { - // There is free space from lastOffset to suballoc.offset. - const VkDeviceSize unusedRangeSize = suballoc.offset - lastOffset; - PrintDetailedMap_UnusedRange(json, lastOffset, unusedRangeSize); - } - - // 2. Process this allocation. - // There is allocation with suballoc.offset, suballoc.size. - PrintDetailedMap_Allocation(json, suballoc.offset, suballoc.size, suballoc.userData); - - // 3. Prepare for next iteration. - lastOffset = suballoc.offset + suballoc.size; - --nextAlloc2ndIndex; - } - // We are at the end. - else - { - if (lastOffset < size) - { - // There is free space from lastOffset to size. - const VkDeviceSize unusedRangeSize = size - lastOffset; - PrintDetailedMap_UnusedRange(json, lastOffset, unusedRangeSize); - } - - // End of loop. - lastOffset = size; - } - } - } - - PrintDetailedMap_End(json); -} -#endif // VMA_STATS_STRING_ENABLED - -bool VmaBlockMetadata_Linear::CreateAllocationRequest( - VkDeviceSize allocSize, - VkDeviceSize allocAlignment, - bool upperAddress, - VmaSuballocationType allocType, - uint32_t strategy, - VmaAllocationRequest* pAllocationRequest) -{ - VMA_ASSERT(allocSize > 0); - VMA_ASSERT(allocType != VMA_SUBALLOCATION_TYPE_FREE); - VMA_ASSERT(pAllocationRequest != VMA_NULL); - VMA_HEAVY_ASSERT(Validate()); - pAllocationRequest->size = allocSize; - return upperAddress ? 
- CreateAllocationRequest_UpperAddress( - allocSize, allocAlignment, allocType, strategy, pAllocationRequest) : - CreateAllocationRequest_LowerAddress( - allocSize, allocAlignment, allocType, strategy, pAllocationRequest); -} - -VkResult VmaBlockMetadata_Linear::CheckCorruption(const void* pBlockData) -{ - VMA_ASSERT(!IsVirtual()); - SuballocationVectorType& suballocations1st = AccessSuballocations1st(); - for (size_t i = m_1stNullItemsBeginCount, count = suballocations1st.size(); i < count; ++i) - { - const VmaSuballocation& suballoc = suballocations1st[i]; - if (suballoc.type != VMA_SUBALLOCATION_TYPE_FREE) - { - if (!VmaValidateMagicValue(pBlockData, suballoc.offset + suballoc.size)) - { - VMA_ASSERT(0 && "MEMORY CORRUPTION DETECTED AFTER VALIDATED ALLOCATION!"); - return VK_ERROR_UNKNOWN_COPY; - } - } - } - - SuballocationVectorType& suballocations2nd = AccessSuballocations2nd(); - for (size_t i = 0, count = suballocations2nd.size(); i < count; ++i) - { - const VmaSuballocation& suballoc = suballocations2nd[i]; - if (suballoc.type != VMA_SUBALLOCATION_TYPE_FREE) - { - if (!VmaValidateMagicValue(pBlockData, suballoc.offset + suballoc.size)) - { - VMA_ASSERT(0 && "MEMORY CORRUPTION DETECTED AFTER VALIDATED ALLOCATION!"); - return VK_ERROR_UNKNOWN_COPY; - } - } - } - - return VK_SUCCESS; -} - -void VmaBlockMetadata_Linear::Alloc( - const VmaAllocationRequest& request, - VmaSuballocationType type, - void* userData) -{ - const VkDeviceSize offset = (VkDeviceSize)request.allocHandle - 1; - const VmaSuballocation newSuballoc = { offset, request.size, userData, type }; - - switch (request.type) - { - case VmaAllocationRequestType::UpperAddress: - { - VMA_ASSERT(m_2ndVectorMode != SECOND_VECTOR_RING_BUFFER && - "CRITICAL ERROR: Trying to use linear allocator as double stack while it was already used as ring buffer."); - SuballocationVectorType& suballocations2nd = AccessSuballocations2nd(); - suballocations2nd.push_back(newSuballoc); - m_2ndVectorMode = SECOND_VECTOR_DOUBLE_STACK; - } - break; - case VmaAllocationRequestType::EndOf1st: - { - SuballocationVectorType& suballocations1st = AccessSuballocations1st(); - - VMA_ASSERT(suballocations1st.empty() || - offset >= suballocations1st.back().offset + suballocations1st.back().size); - // Check if it fits before the end of the block. - VMA_ASSERT(offset + request.size <= GetSize()); - - suballocations1st.push_back(newSuballoc); - } - break; - case VmaAllocationRequestType::EndOf2nd: - { - SuballocationVectorType& suballocations1st = AccessSuballocations1st(); - // New allocation at the end of 2-part ring buffer, so before first allocation from 1st vector. - VMA_ASSERT(!suballocations1st.empty() && - offset + request.size <= suballocations1st[m_1stNullItemsBeginCount].offset); - SuballocationVectorType& suballocations2nd = AccessSuballocations2nd(); - - switch (m_2ndVectorMode) - { - case SECOND_VECTOR_EMPTY: - // First allocation from second part ring buffer. - VMA_ASSERT(suballocations2nd.empty()); - m_2ndVectorMode = SECOND_VECTOR_RING_BUFFER; - break; - case SECOND_VECTOR_RING_BUFFER: - // 2-part ring buffer is already started. 
- VMA_ASSERT(!suballocations2nd.empty()); - break; - case SECOND_VECTOR_DOUBLE_STACK: - VMA_ASSERT(0 && "CRITICAL ERROR: Trying to use linear allocator as ring buffer while it was already used as double stack."); - break; - default: - VMA_ASSERT(0); - } - - suballocations2nd.push_back(newSuballoc); - } - break; - default: - VMA_ASSERT(0 && "CRITICAL INTERNAL ERROR."); - } - - m_SumFreeSize -= newSuballoc.size; -} - -void VmaBlockMetadata_Linear::Free(VmaAllocHandle allocHandle) -{ - SuballocationVectorType& suballocations1st = AccessSuballocations1st(); - SuballocationVectorType& suballocations2nd = AccessSuballocations2nd(); - VkDeviceSize offset = (VkDeviceSize)allocHandle - 1; - - if (!suballocations1st.empty()) - { - // First allocation: Mark it as next empty at the beginning. - VmaSuballocation& firstSuballoc = suballocations1st[m_1stNullItemsBeginCount]; - if (firstSuballoc.offset == offset) - { - firstSuballoc.type = VMA_SUBALLOCATION_TYPE_FREE; - firstSuballoc.userData = VMA_NULL; - m_SumFreeSize += firstSuballoc.size; - ++m_1stNullItemsBeginCount; - CleanupAfterFree(); - return; - } - } - - // Last allocation in 2-part ring buffer or top of upper stack (same logic). - if (m_2ndVectorMode == SECOND_VECTOR_RING_BUFFER || - m_2ndVectorMode == SECOND_VECTOR_DOUBLE_STACK) - { - VmaSuballocation& lastSuballoc = suballocations2nd.back(); - if (lastSuballoc.offset == offset) - { - m_SumFreeSize += lastSuballoc.size; - suballocations2nd.pop_back(); - CleanupAfterFree(); - return; - } - } - // Last allocation in 1st vector. - else if (m_2ndVectorMode == SECOND_VECTOR_EMPTY) - { - VmaSuballocation& lastSuballoc = suballocations1st.back(); - if (lastSuballoc.offset == offset) - { - m_SumFreeSize += lastSuballoc.size; - suballocations1st.pop_back(); - CleanupAfterFree(); - return; - } - } - - VmaSuballocation refSuballoc; - refSuballoc.offset = offset; - // Rest of members stays uninitialized intentionally for better performance. - - // Item from the middle of 1st vector. - { - const SuballocationVectorType::iterator it = VmaBinaryFindSorted( - suballocations1st.begin() + m_1stNullItemsBeginCount, - suballocations1st.end(), - refSuballoc, - VmaSuballocationOffsetLess()); - if (it != suballocations1st.end()) - { - it->type = VMA_SUBALLOCATION_TYPE_FREE; - it->userData = VMA_NULL; - ++m_1stNullItemsMiddleCount; - m_SumFreeSize += it->size; - CleanupAfterFree(); - return; - } - } - - if (m_2ndVectorMode != SECOND_VECTOR_EMPTY) - { - // Item from the middle of 2nd vector. - const SuballocationVectorType::iterator it = m_2ndVectorMode == SECOND_VECTOR_RING_BUFFER ? 
- VmaBinaryFindSorted(suballocations2nd.begin(), suballocations2nd.end(), refSuballoc, VmaSuballocationOffsetLess()) : - VmaBinaryFindSorted(suballocations2nd.begin(), suballocations2nd.end(), refSuballoc, VmaSuballocationOffsetGreater()); - if (it != suballocations2nd.end()) - { - it->type = VMA_SUBALLOCATION_TYPE_FREE; - it->userData = VMA_NULL; - ++m_2ndNullItemsCount; - m_SumFreeSize += it->size; - CleanupAfterFree(); - return; - } - } - - VMA_ASSERT(0 && "Allocation to free not found in linear allocator!"); -} - -void VmaBlockMetadata_Linear::GetAllocationInfo(VmaAllocHandle allocHandle, VmaVirtualAllocationInfo& outInfo) -{ - outInfo.offset = (VkDeviceSize)allocHandle - 1; - VmaSuballocation& suballoc = FindSuballocation(outInfo.offset); - outInfo.size = suballoc.size; - outInfo.pUserData = suballoc.userData; -} - -void* VmaBlockMetadata_Linear::GetAllocationUserData(VmaAllocHandle allocHandle) const -{ - return FindSuballocation((VkDeviceSize)allocHandle - 1).userData; -} - -VmaAllocHandle VmaBlockMetadata_Linear::GetAllocationListBegin() const -{ - // Function only used for defragmentation, which is disabled for this algorithm - VMA_ASSERT(0); - return VK_NULL_HANDLE; -} - -VmaAllocHandle VmaBlockMetadata_Linear::GetNextAllocation(VmaAllocHandle prevAlloc) const -{ - // Function only used for defragmentation, which is disabled for this algorithm - VMA_ASSERT(0); - return VK_NULL_HANDLE; -} - -VkDeviceSize VmaBlockMetadata_Linear::GetNextFreeRegionSize(VmaAllocHandle alloc) const -{ - // Function only used for defragmentation, which is disabled for this algorithm - VMA_ASSERT(0); - return 0; -} - -void VmaBlockMetadata_Linear::Clear() -{ - m_SumFreeSize = GetSize(); - m_Suballocations0.clear(); - m_Suballocations1.clear(); - // Leaving m_1stVectorIndex unchanged - it doesn't matter. - m_2ndVectorMode = SECOND_VECTOR_EMPTY; - m_1stNullItemsBeginCount = 0; - m_1stNullItemsMiddleCount = 0; - m_2ndNullItemsCount = 0; -} - -void VmaBlockMetadata_Linear::SetAllocationUserData(VmaAllocHandle allocHandle, void* userData) -{ - VmaSuballocation& suballoc = FindSuballocation((VkDeviceSize)allocHandle - 1); - suballoc.userData = userData; -} - -void VmaBlockMetadata_Linear::DebugLogAllAllocations() const -{ - const SuballocationVectorType& suballocations1st = AccessSuballocations1st(); - for (auto it = suballocations1st.begin() + m_1stNullItemsBeginCount; it != suballocations1st.end(); ++it) - if (it->type != VMA_SUBALLOCATION_TYPE_FREE) - DebugLogAllocation(it->offset, it->size, it->userData); - - const SuballocationVectorType& suballocations2nd = AccessSuballocations2nd(); - for (auto it = suballocations2nd.begin(); it != suballocations2nd.end(); ++it) - if (it->type != VMA_SUBALLOCATION_TYPE_FREE) - DebugLogAllocation(it->offset, it->size, it->userData); -} - -VmaSuballocation& VmaBlockMetadata_Linear::FindSuballocation(VkDeviceSize offset) const -{ - const SuballocationVectorType& suballocations1st = AccessSuballocations1st(); - const SuballocationVectorType& suballocations2nd = AccessSuballocations2nd(); - - VmaSuballocation refSuballoc; - refSuballoc.offset = offset; - // Rest of members stays uninitialized intentionally for better performance. - - // Item from the 1st vector. 
- { - SuballocationVectorType::const_iterator it = VmaBinaryFindSorted( - suballocations1st.begin() + m_1stNullItemsBeginCount, - suballocations1st.end(), - refSuballoc, - VmaSuballocationOffsetLess()); - if (it != suballocations1st.end()) - { - return const_cast(*it); - } - } - - if (m_2ndVectorMode != SECOND_VECTOR_EMPTY) - { - // Rest of members stays uninitialized intentionally for better performance. - SuballocationVectorType::const_iterator it = m_2ndVectorMode == SECOND_VECTOR_RING_BUFFER ? - VmaBinaryFindSorted(suballocations2nd.begin(), suballocations2nd.end(), refSuballoc, VmaSuballocationOffsetLess()) : - VmaBinaryFindSorted(suballocations2nd.begin(), suballocations2nd.end(), refSuballoc, VmaSuballocationOffsetGreater()); - if (it != suballocations2nd.end()) - { - return const_cast(*it); - } - } - - VMA_ASSERT(0 && "Allocation not found in linear allocator!"); - return const_cast(suballocations1st.back()); // Should never occur. -} - -bool VmaBlockMetadata_Linear::ShouldCompact1st() const -{ - const size_t nullItemCount = m_1stNullItemsBeginCount + m_1stNullItemsMiddleCount; - const size_t suballocCount = AccessSuballocations1st().size(); - return suballocCount > 32 && nullItemCount * 2 >= (suballocCount - nullItemCount) * 3; -} - -void VmaBlockMetadata_Linear::CleanupAfterFree() -{ - SuballocationVectorType& suballocations1st = AccessSuballocations1st(); - SuballocationVectorType& suballocations2nd = AccessSuballocations2nd(); - - if (IsEmpty()) - { - suballocations1st.clear(); - suballocations2nd.clear(); - m_1stNullItemsBeginCount = 0; - m_1stNullItemsMiddleCount = 0; - m_2ndNullItemsCount = 0; - m_2ndVectorMode = SECOND_VECTOR_EMPTY; - } - else - { - const size_t suballoc1stCount = suballocations1st.size(); - const size_t nullItem1stCount = m_1stNullItemsBeginCount + m_1stNullItemsMiddleCount; - VMA_ASSERT(nullItem1stCount <= suballoc1stCount); - - // Find more null items at the beginning of 1st vector. - while (m_1stNullItemsBeginCount < suballoc1stCount && - suballocations1st[m_1stNullItemsBeginCount].type == VMA_SUBALLOCATION_TYPE_FREE) - { - ++m_1stNullItemsBeginCount; - --m_1stNullItemsMiddleCount; - } - - // Find more null items at the end of 1st vector. - while (m_1stNullItemsMiddleCount > 0 && - suballocations1st.back().type == VMA_SUBALLOCATION_TYPE_FREE) - { - --m_1stNullItemsMiddleCount; - suballocations1st.pop_back(); - } - - // Find more null items at the end of 2nd vector. - while (m_2ndNullItemsCount > 0 && - suballocations2nd.back().type == VMA_SUBALLOCATION_TYPE_FREE) - { - --m_2ndNullItemsCount; - suballocations2nd.pop_back(); - } - - // Find more null items at the beginning of 2nd vector. - while (m_2ndNullItemsCount > 0 && - suballocations2nd[0].type == VMA_SUBALLOCATION_TYPE_FREE) - { - --m_2ndNullItemsCount; - VmaVectorRemove(suballocations2nd, 0); - } - - if (ShouldCompact1st()) - { - const size_t nonNullItemCount = suballoc1stCount - nullItem1stCount; - size_t srcIndex = m_1stNullItemsBeginCount; - for (size_t dstIndex = 0; dstIndex < nonNullItemCount; ++dstIndex) - { - while (suballocations1st[srcIndex].type == VMA_SUBALLOCATION_TYPE_FREE) - { - ++srcIndex; - } - if (dstIndex != srcIndex) - { - suballocations1st[dstIndex] = suballocations1st[srcIndex]; - } - ++srcIndex; - } - suballocations1st.resize(nonNullItemCount); - m_1stNullItemsBeginCount = 0; - m_1stNullItemsMiddleCount = 0; - } - - // 2nd vector became empty. - if (suballocations2nd.empty()) - { - m_2ndVectorMode = SECOND_VECTOR_EMPTY; - } - - // 1st vector became empty. 
- if (suballocations1st.size() - m_1stNullItemsBeginCount == 0) - { - suballocations1st.clear(); - m_1stNullItemsBeginCount = 0; - - if (!suballocations2nd.empty() && m_2ndVectorMode == SECOND_VECTOR_RING_BUFFER) - { - // Swap 1st with 2nd. Now 2nd is empty. - m_2ndVectorMode = SECOND_VECTOR_EMPTY; - m_1stNullItemsMiddleCount = m_2ndNullItemsCount; - while (m_1stNullItemsBeginCount < suballocations2nd.size() && - suballocations2nd[m_1stNullItemsBeginCount].type == VMA_SUBALLOCATION_TYPE_FREE) - { - ++m_1stNullItemsBeginCount; - --m_1stNullItemsMiddleCount; - } - m_2ndNullItemsCount = 0; - m_1stVectorIndex ^= 1; - } - } - } - - VMA_HEAVY_ASSERT(Validate()); -} - -bool VmaBlockMetadata_Linear::CreateAllocationRequest_LowerAddress( - VkDeviceSize allocSize, - VkDeviceSize allocAlignment, - VmaSuballocationType allocType, - uint32_t strategy, - VmaAllocationRequest* pAllocationRequest) -{ - const VkDeviceSize blockSize = GetSize(); - const VkDeviceSize debugMargin = GetDebugMargin(); - const VkDeviceSize bufferImageGranularity = GetBufferImageGranularity(); - SuballocationVectorType& suballocations1st = AccessSuballocations1st(); - SuballocationVectorType& suballocations2nd = AccessSuballocations2nd(); - - if (m_2ndVectorMode == SECOND_VECTOR_EMPTY || m_2ndVectorMode == SECOND_VECTOR_DOUBLE_STACK) - { - // Try to allocate at the end of 1st vector. - - VkDeviceSize resultBaseOffset = 0; - if (!suballocations1st.empty()) - { - const VmaSuballocation& lastSuballoc = suballocations1st.back(); - resultBaseOffset = lastSuballoc.offset + lastSuballoc.size + debugMargin; - } - - // Start from offset equal to beginning of free space. - VkDeviceSize resultOffset = resultBaseOffset; - - // Apply alignment. - resultOffset = VmaAlignUp(resultOffset, allocAlignment); - - // Check previous suballocations for BufferImageGranularity conflicts. - // Make bigger alignment if necessary. - if (bufferImageGranularity > 1 && bufferImageGranularity != allocAlignment && !suballocations1st.empty()) - { - bool bufferImageGranularityConflict = false; - for (size_t prevSuballocIndex = suballocations1st.size(); prevSuballocIndex--; ) - { - const VmaSuballocation& prevSuballoc = suballocations1st[prevSuballocIndex]; - if (VmaBlocksOnSamePage(prevSuballoc.offset, prevSuballoc.size, resultOffset, bufferImageGranularity)) - { - if (VmaIsBufferImageGranularityConflict(prevSuballoc.type, allocType)) - { - bufferImageGranularityConflict = true; - break; - } - } - else - // Already on previous page. - break; - } - if (bufferImageGranularityConflict) - { - resultOffset = VmaAlignUp(resultOffset, bufferImageGranularity); - } - } - - const VkDeviceSize freeSpaceEnd = m_2ndVectorMode == SECOND_VECTOR_DOUBLE_STACK ? - suballocations2nd.back().offset : blockSize; - - // There is enough free space at the end after alignment. - if (resultOffset + allocSize + debugMargin <= freeSpaceEnd) - { - // Check next suballocations for BufferImageGranularity conflicts. - // If conflict exists, allocation cannot be made here. 
- if ((allocSize % bufferImageGranularity || resultOffset % bufferImageGranularity) && m_2ndVectorMode == SECOND_VECTOR_DOUBLE_STACK) - { - for (size_t nextSuballocIndex = suballocations2nd.size(); nextSuballocIndex--; ) - { - const VmaSuballocation& nextSuballoc = suballocations2nd[nextSuballocIndex]; - if (VmaBlocksOnSamePage(resultOffset, allocSize, nextSuballoc.offset, bufferImageGranularity)) - { - if (VmaIsBufferImageGranularityConflict(allocType, nextSuballoc.type)) - { - return false; - } - } - else - { - // Already on previous page. - break; - } - } - } - - // All tests passed: Success. - pAllocationRequest->allocHandle = (VmaAllocHandle)(resultOffset + 1); - // pAllocationRequest->item, customData unused. - pAllocationRequest->type = VmaAllocationRequestType::EndOf1st; - return true; - } - } - - // Wrap-around to end of 2nd vector. Try to allocate there, watching for the - // beginning of 1st vector as the end of free space. - if (m_2ndVectorMode == SECOND_VECTOR_EMPTY || m_2ndVectorMode == SECOND_VECTOR_RING_BUFFER) - { - VMA_ASSERT(!suballocations1st.empty()); - - VkDeviceSize resultBaseOffset = 0; - if (!suballocations2nd.empty()) - { - const VmaSuballocation& lastSuballoc = suballocations2nd.back(); - resultBaseOffset = lastSuballoc.offset + lastSuballoc.size + debugMargin; - } - - // Start from offset equal to beginning of free space. - VkDeviceSize resultOffset = resultBaseOffset; - - // Apply alignment. - resultOffset = VmaAlignUp(resultOffset, allocAlignment); - - // Check previous suballocations for BufferImageGranularity conflicts. - // Make bigger alignment if necessary. - if (bufferImageGranularity > 1 && bufferImageGranularity != allocAlignment && !suballocations2nd.empty()) - { - bool bufferImageGranularityConflict = false; - for (size_t prevSuballocIndex = suballocations2nd.size(); prevSuballocIndex--; ) - { - const VmaSuballocation& prevSuballoc = suballocations2nd[prevSuballocIndex]; - if (VmaBlocksOnSamePage(prevSuballoc.offset, prevSuballoc.size, resultOffset, bufferImageGranularity)) - { - if (VmaIsBufferImageGranularityConflict(prevSuballoc.type, allocType)) - { - bufferImageGranularityConflict = true; - break; - } - } - else - // Already on previous page. - break; - } - if (bufferImageGranularityConflict) - { - resultOffset = VmaAlignUp(resultOffset, bufferImageGranularity); - } - } - - size_t index1st = m_1stNullItemsBeginCount; - - // There is enough free space at the end after alignment. - if ((index1st == suballocations1st.size() && resultOffset + allocSize + debugMargin <= blockSize) || - (index1st < suballocations1st.size() && resultOffset + allocSize + debugMargin <= suballocations1st[index1st].offset)) - { - // Check next suballocations for BufferImageGranularity conflicts. - // If conflict exists, allocation cannot be made here. - if (allocSize % bufferImageGranularity || resultOffset % bufferImageGranularity) - { - for (size_t nextSuballocIndex = index1st; - nextSuballocIndex < suballocations1st.size(); - nextSuballocIndex++) - { - const VmaSuballocation& nextSuballoc = suballocations1st[nextSuballocIndex]; - if (VmaBlocksOnSamePage(resultOffset, allocSize, nextSuballoc.offset, bufferImageGranularity)) - { - if (VmaIsBufferImageGranularityConflict(allocType, nextSuballoc.type)) - { - return false; - } - } - else - { - // Already on next page. - break; - } - } - } - - // All tests passed: Success. 
- pAllocationRequest->allocHandle = (VmaAllocHandle)(resultOffset + 1); - pAllocationRequest->type = VmaAllocationRequestType::EndOf2nd; - // pAllocationRequest->item, customData unused. - return true; - } - } - - return false; -} - -bool VmaBlockMetadata_Linear::CreateAllocationRequest_UpperAddress( - VkDeviceSize allocSize, - VkDeviceSize allocAlignment, - VmaSuballocationType allocType, - uint32_t strategy, - VmaAllocationRequest* pAllocationRequest) -{ - const VkDeviceSize blockSize = GetSize(); - const VkDeviceSize bufferImageGranularity = GetBufferImageGranularity(); - SuballocationVectorType& suballocations1st = AccessSuballocations1st(); - SuballocationVectorType& suballocations2nd = AccessSuballocations2nd(); - - if (m_2ndVectorMode == SECOND_VECTOR_RING_BUFFER) - { - VMA_ASSERT(0 && "Trying to use pool with linear algorithm as double stack, while it is already being used as ring buffer."); - return false; - } - - // Try to allocate before 2nd.back(), or end of block if 2nd.empty(). - if (allocSize > blockSize) - { - return false; - } - VkDeviceSize resultBaseOffset = blockSize - allocSize; - if (!suballocations2nd.empty()) - { - const VmaSuballocation& lastSuballoc = suballocations2nd.back(); - resultBaseOffset = lastSuballoc.offset - allocSize; - if (allocSize > lastSuballoc.offset) - { - return false; - } - } - - // Start from offset equal to end of free space. - VkDeviceSize resultOffset = resultBaseOffset; - - const VkDeviceSize debugMargin = GetDebugMargin(); - - // Apply debugMargin at the end. - if (debugMargin > 0) - { - if (resultOffset < debugMargin) - { - return false; - } - resultOffset -= debugMargin; - } - - // Apply alignment. - resultOffset = VmaAlignDown(resultOffset, allocAlignment); - - // Check next suballocations from 2nd for BufferImageGranularity conflicts. - // Make bigger alignment if necessary. - if (bufferImageGranularity > 1 && bufferImageGranularity != allocAlignment && !suballocations2nd.empty()) - { - bool bufferImageGranularityConflict = false; - for (size_t nextSuballocIndex = suballocations2nd.size(); nextSuballocIndex--; ) - { - const VmaSuballocation& nextSuballoc = suballocations2nd[nextSuballocIndex]; - if (VmaBlocksOnSamePage(resultOffset, allocSize, nextSuballoc.offset, bufferImageGranularity)) - { - if (VmaIsBufferImageGranularityConflict(nextSuballoc.type, allocType)) - { - bufferImageGranularityConflict = true; - break; - } - } - else - // Already on previous page. - break; - } - if (bufferImageGranularityConflict) - { - resultOffset = VmaAlignDown(resultOffset, bufferImageGranularity); - } - } - - // There is enough free space. - const VkDeviceSize endOf1st = !suballocations1st.empty() ? - suballocations1st.back().offset + suballocations1st.back().size : - 0; - if (endOf1st + debugMargin <= resultOffset) - { - // Check previous suballocations for BufferImageGranularity conflicts. - // If conflict exists, allocation cannot be made here. - if (bufferImageGranularity > 1) - { - for (size_t prevSuballocIndex = suballocations1st.size(); prevSuballocIndex--; ) - { - const VmaSuballocation& prevSuballoc = suballocations1st[prevSuballocIndex]; - if (VmaBlocksOnSamePage(prevSuballoc.offset, prevSuballoc.size, resultOffset, bufferImageGranularity)) - { - if (VmaIsBufferImageGranularityConflict(allocType, prevSuballoc.type)) - { - return false; - } - } - else - { - // Already on next page. - break; - } - } - } - - // All tests passed: Success. 
- pAllocationRequest->allocHandle = (VmaAllocHandle)(resultOffset + 1); - // pAllocationRequest->item unused. - pAllocationRequest->type = VmaAllocationRequestType::UpperAddress; - return true; - } - - return false; -} -#endif // _VMA_BLOCK_METADATA_LINEAR_FUNCTIONS -#endif // _VMA_BLOCK_METADATA_LINEAR - -#if 0 -#ifndef _VMA_BLOCK_METADATA_BUDDY -/* -- GetSize() is the original size of allocated memory block. -- m_UsableSize is this size aligned down to a power of two. - All allocations and calculations happen relative to m_UsableSize. -- GetUnusableSize() is the difference between them. - It is reported as separate, unused range, not available for allocations. - -Node at level 0 has size = m_UsableSize. -Each next level contains nodes with size 2 times smaller than current level. -m_LevelCount is the maximum number of levels to use in the current object. -*/ -class VmaBlockMetadata_Buddy : public VmaBlockMetadata -{ - VMA_CLASS_NO_COPY(VmaBlockMetadata_Buddy) -public: - VmaBlockMetadata_Buddy(const VkAllocationCallbacks* pAllocationCallbacks, - VkDeviceSize bufferImageGranularity, bool isVirtual); - virtual ~VmaBlockMetadata_Buddy(); - - size_t GetAllocationCount() const override { return m_AllocationCount; } - VkDeviceSize GetSumFreeSize() const override { return m_SumFreeSize + GetUnusableSize(); } - bool IsEmpty() const override { return m_Root->type == Node::TYPE_FREE; } - VkResult CheckCorruption(const void* pBlockData) override { return VK_ERROR_FEATURE_NOT_PRESENT; } - VkDeviceSize GetAllocationOffset(VmaAllocHandle allocHandle) const override { return (VkDeviceSize)allocHandle - 1; }; - void DebugLogAllAllocations() const override { DebugLogAllAllocationNode(m_Root, 0); } - - void Init(VkDeviceSize size) override; - bool Validate() const override; - - void AddDetailedStatistics(VmaDetailedStatistics& inoutStats) const override; - void AddStatistics(VmaStatistics& inoutStats) const override; - -#if VMA_STATS_STRING_ENABLED - void PrintDetailedMap(class VmaJsonWriter& json, uint32_t mapRefCount) const override; -#endif - - bool CreateAllocationRequest( - VkDeviceSize allocSize, - VkDeviceSize allocAlignment, - bool upperAddress, - VmaSuballocationType allocType, - uint32_t strategy, - VmaAllocationRequest* pAllocationRequest) override; - - void Alloc( - const VmaAllocationRequest& request, - VmaSuballocationType type, - void* userData) override; - - void Free(VmaAllocHandle allocHandle) override; - void GetAllocationInfo(VmaAllocHandle allocHandle, VmaVirtualAllocationInfo& outInfo) override; - void* GetAllocationUserData(VmaAllocHandle allocHandle) const override; - VmaAllocHandle GetAllocationListBegin() const override; - VmaAllocHandle GetNextAllocation(VmaAllocHandle prevAlloc) const override; - void Clear() override; - void SetAllocationUserData(VmaAllocHandle allocHandle, void* userData) override; - -private: - static const size_t MAX_LEVELS = 48; - - struct ValidationContext - { - size_t calculatedAllocationCount = 0; - size_t calculatedFreeCount = 0; - VkDeviceSize calculatedSumFreeSize = 0; - }; - struct Node - { - VkDeviceSize offset; - enum TYPE - { - TYPE_FREE, - TYPE_ALLOCATION, - TYPE_SPLIT, - TYPE_COUNT - } type; - Node* parent; - Node* buddy; - - union - { - struct - { - Node* prev; - Node* next; - } free; - struct - { - void* userData; - } allocation; - struct - { - Node* leftChild; - } split; - }; - }; - - // Size of the memory block aligned down to a power of two. 
- VkDeviceSize m_UsableSize; - uint32_t m_LevelCount; - VmaPoolAllocator m_NodeAllocator; - Node* m_Root; - struct - { - Node* front; - Node* back; - } m_FreeList[MAX_LEVELS]; - - // Number of nodes in the tree with type == TYPE_ALLOCATION. - size_t m_AllocationCount; - // Number of nodes in the tree with type == TYPE_FREE. - size_t m_FreeCount; - // Doesn't include space wasted due to internal fragmentation - allocation sizes are just aligned up to node sizes. - // Doesn't include unusable size. - VkDeviceSize m_SumFreeSize; - - VkDeviceSize GetUnusableSize() const { return GetSize() - m_UsableSize; } - VkDeviceSize LevelToNodeSize(uint32_t level) const { return m_UsableSize >> level; } - - VkDeviceSize AlignAllocationSize(VkDeviceSize size) const - { - if (!IsVirtual()) - { - size = VmaAlignUp(size, (VkDeviceSize)16); - } - return VmaNextPow2(size); - } - Node* FindAllocationNode(VkDeviceSize offset, uint32_t& outLevel) const; - void DeleteNodeChildren(Node* node); - bool ValidateNode(ValidationContext& ctx, const Node* parent, const Node* curr, uint32_t level, VkDeviceSize levelNodeSize) const; - uint32_t AllocSizeToLevel(VkDeviceSize allocSize) const; - void AddNodeToDetailedStatistics(VmaDetailedStatistics& inoutStats, const Node* node, VkDeviceSize levelNodeSize) const; - // Adds node to the front of FreeList at given level. - // node->type must be FREE. - // node->free.prev, next can be undefined. - void AddToFreeListFront(uint32_t level, Node* node); - // Removes node from FreeList at given level. - // node->type must be FREE. - // node->free.prev, next stay untouched. - void RemoveFromFreeList(uint32_t level, Node* node); - void DebugLogAllAllocationNode(Node* node, uint32_t level) const; - -#if VMA_STATS_STRING_ENABLED - void PrintDetailedMapNode(class VmaJsonWriter& json, const Node* node, VkDeviceSize levelNodeSize) const; -#endif -}; - -#ifndef _VMA_BLOCK_METADATA_BUDDY_FUNCTIONS -VmaBlockMetadata_Buddy::VmaBlockMetadata_Buddy(const VkAllocationCallbacks* pAllocationCallbacks, - VkDeviceSize bufferImageGranularity, bool isVirtual) - : VmaBlockMetadata(pAllocationCallbacks, bufferImageGranularity, isVirtual), - m_NodeAllocator(pAllocationCallbacks, 32), // firstBlockCapacity - m_Root(VMA_NULL), - m_AllocationCount(0), - m_FreeCount(1), - m_SumFreeSize(0) -{ - memset(m_FreeList, 0, sizeof(m_FreeList)); -} - -VmaBlockMetadata_Buddy::~VmaBlockMetadata_Buddy() -{ - DeleteNodeChildren(m_Root); - m_NodeAllocator.Free(m_Root); -} - -void VmaBlockMetadata_Buddy::Init(VkDeviceSize size) -{ - VmaBlockMetadata::Init(size); - - m_UsableSize = VmaPrevPow2(size); - m_SumFreeSize = m_UsableSize; - - // Calculate m_LevelCount. - const VkDeviceSize minNodeSize = IsVirtual() ? 1 : 16; - m_LevelCount = 1; - while (m_LevelCount < MAX_LEVELS && - LevelToNodeSize(m_LevelCount) >= minNodeSize) - { - ++m_LevelCount; - } - - Node* rootNode = m_NodeAllocator.Alloc(); - rootNode->offset = 0; - rootNode->type = Node::TYPE_FREE; - rootNode->parent = VMA_NULL; - rootNode->buddy = VMA_NULL; - - m_Root = rootNode; - AddToFreeListFront(0, rootNode); -} - -bool VmaBlockMetadata_Buddy::Validate() const -{ - // Validate tree. - ValidationContext ctx; - if (!ValidateNode(ctx, VMA_NULL, m_Root, 0, LevelToNodeSize(0))) - { - VMA_VALIDATE(false && "ValidateNode failed."); - } - VMA_VALIDATE(m_AllocationCount == ctx.calculatedAllocationCount); - VMA_VALIDATE(m_SumFreeSize == ctx.calculatedSumFreeSize); - - // Validate free node lists. 
- for (uint32_t level = 0; level < m_LevelCount; ++level) - { - VMA_VALIDATE(m_FreeList[level].front == VMA_NULL || - m_FreeList[level].front->free.prev == VMA_NULL); - - for (Node* node = m_FreeList[level].front; - node != VMA_NULL; - node = node->free.next) - { - VMA_VALIDATE(node->type == Node::TYPE_FREE); - - if (node->free.next == VMA_NULL) - { - VMA_VALIDATE(m_FreeList[level].back == node); - } - else - { - VMA_VALIDATE(node->free.next->free.prev == node); - } - } - } - - // Validate that free lists ar higher levels are empty. - for (uint32_t level = m_LevelCount; level < MAX_LEVELS; ++level) - { - VMA_VALIDATE(m_FreeList[level].front == VMA_NULL && m_FreeList[level].back == VMA_NULL); - } - - return true; -} - -void VmaBlockMetadata_Buddy::AddDetailedStatistics(VmaDetailedStatistics& inoutStats) const -{ - inoutStats.statistics.blockCount++; - inoutStats.statistics.blockBytes += GetSize(); - - AddNodeToDetailedStatistics(inoutStats, m_Root, LevelToNodeSize(0)); - - const VkDeviceSize unusableSize = GetUnusableSize(); - if (unusableSize > 0) - VmaAddDetailedStatisticsUnusedRange(inoutStats, unusableSize); -} - -void VmaBlockMetadata_Buddy::AddStatistics(VmaStatistics& inoutStats) const -{ - inoutStats.blockCount++; - inoutStats.allocationCount += (uint32_t)m_AllocationCount; - inoutStats.blockBytes += GetSize(); - inoutStats.allocationBytes += GetSize() - m_SumFreeSize; -} - -#if VMA_STATS_STRING_ENABLED -void VmaBlockMetadata_Buddy::PrintDetailedMap(class VmaJsonWriter& json, uint32_t mapRefCount) const -{ - VmaDetailedStatistics stats; - VmaClearDetailedStatistics(stats); - AddDetailedStatistics(stats); - - PrintDetailedMap_Begin( - json, - stats.statistics.blockBytes - stats.statistics.allocationBytes, - stats.statistics.allocationCount, - stats.unusedRangeCount, - mapRefCount); - - PrintDetailedMapNode(json, m_Root, LevelToNodeSize(0)); - - const VkDeviceSize unusableSize = GetUnusableSize(); - if (unusableSize > 0) - { - PrintDetailedMap_UnusedRange(json, - m_UsableSize, // offset - unusableSize); // size - } - - PrintDetailedMap_End(json); -} -#endif // VMA_STATS_STRING_ENABLED - -bool VmaBlockMetadata_Buddy::CreateAllocationRequest( - VkDeviceSize allocSize, - VkDeviceSize allocAlignment, - bool upperAddress, - VmaSuballocationType allocType, - uint32_t strategy, - VmaAllocationRequest* pAllocationRequest) -{ - VMA_ASSERT(!upperAddress && "VMA_ALLOCATION_CREATE_UPPER_ADDRESS_BIT can be used only with linear algorithm."); - - allocSize = AlignAllocationSize(allocSize); - - // Simple way to respect bufferImageGranularity. May be optimized some day. - // Whenever it might be an OPTIMAL image... 
- if (allocType == VMA_SUBALLOCATION_TYPE_UNKNOWN || - allocType == VMA_SUBALLOCATION_TYPE_IMAGE_UNKNOWN || - allocType == VMA_SUBALLOCATION_TYPE_IMAGE_OPTIMAL) - { - allocAlignment = VMA_MAX(allocAlignment, GetBufferImageGranularity()); - allocSize = VmaAlignUp(allocSize, GetBufferImageGranularity()); - } - - if (allocSize > m_UsableSize) - { - return false; - } - - const uint32_t targetLevel = AllocSizeToLevel(allocSize); - for (uint32_t level = targetLevel; level--; ) - { - for (Node* freeNode = m_FreeList[level].front; - freeNode != VMA_NULL; - freeNode = freeNode->free.next) - { - if (freeNode->offset % allocAlignment == 0) - { - pAllocationRequest->type = VmaAllocationRequestType::Normal; - pAllocationRequest->allocHandle = (VmaAllocHandle)(freeNode->offset + 1); - pAllocationRequest->size = allocSize; - pAllocationRequest->customData = (void*)(uintptr_t)level; - return true; - } - } - } - - return false; -} - -void VmaBlockMetadata_Buddy::Alloc( - const VmaAllocationRequest& request, - VmaSuballocationType type, - void* userData) -{ - VMA_ASSERT(request.type == VmaAllocationRequestType::Normal); - - const uint32_t targetLevel = AllocSizeToLevel(request.size); - uint32_t currLevel = (uint32_t)(uintptr_t)request.customData; - - Node* currNode = m_FreeList[currLevel].front; - VMA_ASSERT(currNode != VMA_NULL && currNode->type == Node::TYPE_FREE); - const VkDeviceSize offset = (VkDeviceSize)request.allocHandle - 1; - while (currNode->offset != offset) - { - currNode = currNode->free.next; - VMA_ASSERT(currNode != VMA_NULL && currNode->type == Node::TYPE_FREE); - } - - // Go down, splitting free nodes. - while (currLevel < targetLevel) - { - // currNode is already first free node at currLevel. - // Remove it from list of free nodes at this currLevel. - RemoveFromFreeList(currLevel, currNode); - - const uint32_t childrenLevel = currLevel + 1; - - // Create two free sub-nodes. - Node* leftChild = m_NodeAllocator.Alloc(); - Node* rightChild = m_NodeAllocator.Alloc(); - - leftChild->offset = currNode->offset; - leftChild->type = Node::TYPE_FREE; - leftChild->parent = currNode; - leftChild->buddy = rightChild; - - rightChild->offset = currNode->offset + LevelToNodeSize(childrenLevel); - rightChild->type = Node::TYPE_FREE; - rightChild->parent = currNode; - rightChild->buddy = leftChild; - - // Convert current currNode to split type. - currNode->type = Node::TYPE_SPLIT; - currNode->split.leftChild = leftChild; - - // Add child nodes to free list. Order is important! - AddToFreeListFront(childrenLevel, rightChild); - AddToFreeListFront(childrenLevel, leftChild); - - ++m_FreeCount; - ++currLevel; - currNode = m_FreeList[currLevel].front; - - /* - We can be sure that currNode, as left child of node previously split, - also fulfills the alignment requirement. - */ - } - - // Remove from free list. - VMA_ASSERT(currLevel == targetLevel && - currNode != VMA_NULL && - currNode->type == Node::TYPE_FREE); - RemoveFromFreeList(currLevel, currNode); - - // Convert to allocation node. 
- currNode->type = Node::TYPE_ALLOCATION; - currNode->allocation.userData = userData; - - ++m_AllocationCount; - --m_FreeCount; - m_SumFreeSize -= request.size; -} - -void VmaBlockMetadata_Buddy::GetAllocationInfo(VmaAllocHandle allocHandle, VmaVirtualAllocationInfo& outInfo) -{ - uint32_t level = 0; - outInfo.offset = (VkDeviceSize)allocHandle - 1; - const Node* const node = FindAllocationNode(outInfo.offset, level); - outInfo.size = LevelToNodeSize(level); - outInfo.pUserData = node->allocation.userData; -} - -void* VmaBlockMetadata_Buddy::GetAllocationUserData(VmaAllocHandle allocHandle) const -{ - uint32_t level = 0; - const Node* const node = FindAllocationNode((VkDeviceSize)allocHandle - 1, level); - return node->allocation.userData; -} - -VmaAllocHandle VmaBlockMetadata_Buddy::GetAllocationListBegin() const -{ - // Function only used for defragmentation, which is disabled for this algorithm - return VK_NULL_HANDLE; -} - -VmaAllocHandle VmaBlockMetadata_Buddy::GetNextAllocation(VmaAllocHandle prevAlloc) const -{ - // Function only used for defragmentation, which is disabled for this algorithm - return VK_NULL_HANDLE; -} - -void VmaBlockMetadata_Buddy::DeleteNodeChildren(Node* node) -{ - if (node->type == Node::TYPE_SPLIT) - { - DeleteNodeChildren(node->split.leftChild->buddy); - DeleteNodeChildren(node->split.leftChild); - const VkAllocationCallbacks* allocationCallbacks = GetAllocationCallbacks(); - m_NodeAllocator.Free(node->split.leftChild->buddy); - m_NodeAllocator.Free(node->split.leftChild); - } -} - -void VmaBlockMetadata_Buddy::Clear() -{ - DeleteNodeChildren(m_Root); - m_Root->type = Node::TYPE_FREE; - m_AllocationCount = 0; - m_FreeCount = 1; - m_SumFreeSize = m_UsableSize; -} - -void VmaBlockMetadata_Buddy::SetAllocationUserData(VmaAllocHandle allocHandle, void* userData) -{ - uint32_t level = 0; - Node* const node = FindAllocationNode((VkDeviceSize)allocHandle - 1, level); - node->allocation.userData = userData; -} - -VmaBlockMetadata_Buddy::Node* VmaBlockMetadata_Buddy::FindAllocationNode(VkDeviceSize offset, uint32_t& outLevel) const -{ - Node* node = m_Root; - VkDeviceSize nodeOffset = 0; - outLevel = 0; - VkDeviceSize levelNodeSize = LevelToNodeSize(0); - while (node->type == Node::TYPE_SPLIT) - { - const VkDeviceSize nextLevelNodeSize = levelNodeSize >> 1; - if (offset < nodeOffset + nextLevelNodeSize) - { - node = node->split.leftChild; - } - else - { - node = node->split.leftChild->buddy; - nodeOffset += nextLevelNodeSize; - } - ++outLevel; - levelNodeSize = nextLevelNodeSize; - } - - VMA_ASSERT(node != VMA_NULL && node->type == Node::TYPE_ALLOCATION); - return node; -} - -bool VmaBlockMetadata_Buddy::ValidateNode(ValidationContext& ctx, const Node* parent, const Node* curr, uint32_t level, VkDeviceSize levelNodeSize) const -{ - VMA_VALIDATE(level < m_LevelCount); - VMA_VALIDATE(curr->parent == parent); - VMA_VALIDATE((curr->buddy == VMA_NULL) == (parent == VMA_NULL)); - VMA_VALIDATE(curr->buddy == VMA_NULL || curr->buddy->buddy == curr); - switch (curr->type) - { - case Node::TYPE_FREE: - // curr->free.prev, next are validated separately. 
- ctx.calculatedSumFreeSize += levelNodeSize; - ++ctx.calculatedFreeCount; - break; - case Node::TYPE_ALLOCATION: - ++ctx.calculatedAllocationCount; - if (!IsVirtual()) - { - VMA_VALIDATE(curr->allocation.userData != VMA_NULL); - } - break; - case Node::TYPE_SPLIT: - { - const uint32_t childrenLevel = level + 1; - const VkDeviceSize childrenLevelNodeSize = levelNodeSize >> 1; - const Node* const leftChild = curr->split.leftChild; - VMA_VALIDATE(leftChild != VMA_NULL); - VMA_VALIDATE(leftChild->offset == curr->offset); - if (!ValidateNode(ctx, curr, leftChild, childrenLevel, childrenLevelNodeSize)) - { - VMA_VALIDATE(false && "ValidateNode for left child failed."); - } - const Node* const rightChild = leftChild->buddy; - VMA_VALIDATE(rightChild->offset == curr->offset + childrenLevelNodeSize); - if (!ValidateNode(ctx, curr, rightChild, childrenLevel, childrenLevelNodeSize)) - { - VMA_VALIDATE(false && "ValidateNode for right child failed."); - } - } - break; - default: - return false; - } - - return true; -} - -uint32_t VmaBlockMetadata_Buddy::AllocSizeToLevel(VkDeviceSize allocSize) const -{ - // I know this could be optimized somehow e.g. by using std::log2p1 from C++20. - uint32_t level = 0; - VkDeviceSize currLevelNodeSize = m_UsableSize; - VkDeviceSize nextLevelNodeSize = currLevelNodeSize >> 1; - while (allocSize <= nextLevelNodeSize && level + 1 < m_LevelCount) - { - ++level; - currLevelNodeSize >>= 1; - nextLevelNodeSize >>= 1; - } - return level; -} - -void VmaBlockMetadata_Buddy::Free(VmaAllocHandle allocHandle) -{ - uint32_t level = 0; - Node* node = FindAllocationNode((VkDeviceSize)allocHandle - 1, level); - - ++m_FreeCount; - --m_AllocationCount; - m_SumFreeSize += LevelToNodeSize(level); - - node->type = Node::TYPE_FREE; - - // Join free nodes if possible. - while (level > 0 && node->buddy->type == Node::TYPE_FREE) - { - RemoveFromFreeList(level, node->buddy); - Node* const parent = node->parent; - - m_NodeAllocator.Free(node->buddy); - m_NodeAllocator.Free(node); - parent->type = Node::TYPE_FREE; - - node = parent; - --level; - --m_FreeCount; - } - - AddToFreeListFront(level, node); -} - -void VmaBlockMetadata_Buddy::AddNodeToDetailedStatistics(VmaDetailedStatistics& inoutStats, const Node* node, VkDeviceSize levelNodeSize) const -{ - switch (node->type) - { - case Node::TYPE_FREE: - VmaAddDetailedStatisticsUnusedRange(inoutStats, levelNodeSize); - break; - case Node::TYPE_ALLOCATION: - VmaAddDetailedStatisticsAllocation(inoutStats, levelNodeSize); - break; - case Node::TYPE_SPLIT: - { - const VkDeviceSize childrenNodeSize = levelNodeSize / 2; - const Node* const leftChild = node->split.leftChild; - AddNodeToDetailedStatistics(inoutStats, leftChild, childrenNodeSize); - const Node* const rightChild = leftChild->buddy; - AddNodeToDetailedStatistics(inoutStats, rightChild, childrenNodeSize); - } - break; - default: - VMA_ASSERT(0); - } -} - -void VmaBlockMetadata_Buddy::AddToFreeListFront(uint32_t level, Node* node) -{ - VMA_ASSERT(node->type == Node::TYPE_FREE); - - // List is empty. 
- Node* const frontNode = m_FreeList[level].front; - if (frontNode == VMA_NULL) - { - VMA_ASSERT(m_FreeList[level].back == VMA_NULL); - node->free.prev = node->free.next = VMA_NULL; - m_FreeList[level].front = m_FreeList[level].back = node; - } - else - { - VMA_ASSERT(frontNode->free.prev == VMA_NULL); - node->free.prev = VMA_NULL; - node->free.next = frontNode; - frontNode->free.prev = node; - m_FreeList[level].front = node; - } -} - -void VmaBlockMetadata_Buddy::RemoveFromFreeList(uint32_t level, Node* node) -{ - VMA_ASSERT(m_FreeList[level].front != VMA_NULL); - - // It is at the front. - if (node->free.prev == VMA_NULL) - { - VMA_ASSERT(m_FreeList[level].front == node); - m_FreeList[level].front = node->free.next; - } - else - { - Node* const prevFreeNode = node->free.prev; - VMA_ASSERT(prevFreeNode->free.next == node); - prevFreeNode->free.next = node->free.next; - } - - // It is at the back. - if (node->free.next == VMA_NULL) - { - VMA_ASSERT(m_FreeList[level].back == node); - m_FreeList[level].back = node->free.prev; - } - else - { - Node* const nextFreeNode = node->free.next; - VMA_ASSERT(nextFreeNode->free.prev == node); - nextFreeNode->free.prev = node->free.prev; - } -} - -void VmaBlockMetadata_Buddy::DebugLogAllAllocationNode(Node* node, uint32_t level) const -{ - switch (node->type) - { - case Node::TYPE_FREE: - break; - case Node::TYPE_ALLOCATION: - DebugLogAllocation(node->offset, LevelToNodeSize(level), node->allocation.userData); - break; - case Node::TYPE_SPLIT: - { - ++level; - DebugLogAllAllocationNode(node->split.leftChild, level); - DebugLogAllAllocationNode(node->split.leftChild->buddy, level); - } - break; - default: - VMA_ASSERT(0); - } -} - -#if VMA_STATS_STRING_ENABLED -void VmaBlockMetadata_Buddy::PrintDetailedMapNode(class VmaJsonWriter& json, const Node* node, VkDeviceSize levelNodeSize) const -{ - switch (node->type) - { - case Node::TYPE_FREE: - PrintDetailedMap_UnusedRange(json, node->offset, levelNodeSize); - break; - case Node::TYPE_ALLOCATION: - PrintDetailedMap_Allocation(json, node->offset, levelNodeSize, node->allocation.userData); - break; - case Node::TYPE_SPLIT: - { - const VkDeviceSize childrenNodeSize = levelNodeSize / 2; - const Node* const leftChild = node->split.leftChild; - PrintDetailedMapNode(json, leftChild, childrenNodeSize); - const Node* const rightChild = leftChild->buddy; - PrintDetailedMapNode(json, rightChild, childrenNodeSize); - } - break; - default: - VMA_ASSERT(0); - } -} -#endif // VMA_STATS_STRING_ENABLED -#endif // _VMA_BLOCK_METADATA_BUDDY_FUNCTIONS -#endif // _VMA_BLOCK_METADATA_BUDDY -#endif // #if 0 - -#ifndef _VMA_BLOCK_METADATA_TLSF -// To not search current larger region if first allocation won't succeed and skip to smaller range -// use with VMA_ALLOCATION_CREATE_STRATEGY_MIN_MEMORY_BIT as strategy in CreateAllocationRequest(). -// When fragmentation and reusal of previous blocks doesn't matter then use with -// VMA_ALLOCATION_CREATE_STRATEGY_MIN_TIME_BIT for fastest alloc time possible. 
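For context, the strategy hints named in the deleted comment above correspond to the public `VMA_ALLOCATION_CREATE_STRATEGY_*` flags that eventually reach `CreateAllocationRequest()`. A minimal sketch of passing them through the public VMA 3.x API is shown below; it is an illustration only, not part of this patch — the buffer size and usage are arbitrary placeholders, and the `VmaAllocator` handle is assumed to have been created elsewhere.

```
#include "vk_mem_alloc.h"

// Sketch: ask for the fastest allocation path (MIN_TIME) for a short-lived
// buffer; swapping in VMA_ALLOCATION_CREATE_STRATEGY_MIN_MEMORY_BIT would
// instead prefer tighter packing at the cost of a longer free-list search.
// 'allocator' is assumed to be a valid VmaAllocator; size/usage are placeholders.
VkResult CreateScratchBuffer(VmaAllocator allocator, VkBuffer* outBuf, VmaAllocation* outAlloc)
{
    VkBufferCreateInfo bufInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
    bufInfo.size = 64 * 1024;                        // placeholder size
    bufInfo.usage = VK_BUFFER_USAGE_TRANSFER_SRC_BIT; // placeholder usage

    VmaAllocationCreateInfo allocInfo = {};
    allocInfo.usage = VMA_MEMORY_USAGE_AUTO;
    allocInfo.flags = VMA_ALLOCATION_CREATE_STRATEGY_MIN_TIME_BIT;

    return vmaCreateBuffer(allocator, &bufInfo, &allocInfo, outBuf, outAlloc, VMA_NULL);
}
```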
-class VmaBlockMetadata_TLSF : public VmaBlockMetadata -{ - VMA_CLASS_NO_COPY(VmaBlockMetadata_TLSF) -public: - VmaBlockMetadata_TLSF(const VkAllocationCallbacks* pAllocationCallbacks, - VkDeviceSize bufferImageGranularity, bool isVirtual); - virtual ~VmaBlockMetadata_TLSF(); - - size_t GetAllocationCount() const override { return m_AllocCount; } - size_t GetFreeRegionsCount() const override { return m_BlocksFreeCount + 1; } - VkDeviceSize GetSumFreeSize() const override { return m_BlocksFreeSize + m_NullBlock->size; } - bool IsEmpty() const override { return m_NullBlock->offset == 0; } - VkDeviceSize GetAllocationOffset(VmaAllocHandle allocHandle) const override { return ((Block*)allocHandle)->offset; }; - - void Init(VkDeviceSize size) override; - bool Validate() const override; - - void AddDetailedStatistics(VmaDetailedStatistics& inoutStats) const override; - void AddStatistics(VmaStatistics& inoutStats) const override; - -#if VMA_STATS_STRING_ENABLED - void PrintDetailedMap(class VmaJsonWriter& json) const override; -#endif - - bool CreateAllocationRequest( - VkDeviceSize allocSize, - VkDeviceSize allocAlignment, - bool upperAddress, - VmaSuballocationType allocType, - uint32_t strategy, - VmaAllocationRequest* pAllocationRequest) override; - - VkResult CheckCorruption(const void* pBlockData) override; - void Alloc( - const VmaAllocationRequest& request, - VmaSuballocationType type, - void* userData) override; - - void Free(VmaAllocHandle allocHandle) override; - void GetAllocationInfo(VmaAllocHandle allocHandle, VmaVirtualAllocationInfo& outInfo) override; - void* GetAllocationUserData(VmaAllocHandle allocHandle) const override; - VmaAllocHandle GetAllocationListBegin() const override; - VmaAllocHandle GetNextAllocation(VmaAllocHandle prevAlloc) const override; - VkDeviceSize GetNextFreeRegionSize(VmaAllocHandle alloc) const override; - void Clear() override; - void SetAllocationUserData(VmaAllocHandle allocHandle, void* userData) override; - void DebugLogAllAllocations() const override; - -private: - // According to original paper it should be preferable 4 or 5: - // M. Masmano, I. Ripoll, A. Crespo, and J. 
Real "TLSF: a New Dynamic Memory Allocator for Real-Time Systems" - // http://www.gii.upv.es/tlsf/files/ecrts04_tlsf.pdf - static const uint8_t SECOND_LEVEL_INDEX = 5; - static const uint16_t SMALL_BUFFER_SIZE = 256; - static const uint32_t INITIAL_BLOCK_ALLOC_COUNT = 16; - static const uint8_t MEMORY_CLASS_SHIFT = 7; - static const uint8_t MAX_MEMORY_CLASSES = 65 - MEMORY_CLASS_SHIFT; - - class Block - { - public: - VkDeviceSize offset; - VkDeviceSize size; - Block* prevPhysical; - Block* nextPhysical; - - void MarkFree() { prevFree = VMA_NULL; } - void MarkTaken() { prevFree = this; } - bool IsFree() const { return prevFree != this; } - void*& UserData() { VMA_HEAVY_ASSERT(!IsFree()); return userData; } - Block*& PrevFree() { return prevFree; } - Block*& NextFree() { VMA_HEAVY_ASSERT(IsFree()); return nextFree; } - - private: - Block* prevFree; // Address of the same block here indicates that block is taken - union - { - Block* nextFree; - void* userData; - }; - }; - - size_t m_AllocCount; - // Total number of free blocks besides null block - size_t m_BlocksFreeCount; - // Total size of free blocks excluding null block - VkDeviceSize m_BlocksFreeSize; - uint32_t m_IsFreeBitmap; - uint8_t m_MemoryClasses; - uint32_t m_InnerIsFreeBitmap[MAX_MEMORY_CLASSES]; - uint32_t m_ListsCount; - /* - * 0: 0-3 lists for small buffers - * 1+: 0-(2^SLI-1) lists for normal buffers - */ - Block** m_FreeList; - VmaPoolAllocator m_BlockAllocator; - Block* m_NullBlock; - VmaBlockBufferImageGranularity m_GranularityHandler; - - uint8_t SizeToMemoryClass(VkDeviceSize size) const; - uint16_t SizeToSecondIndex(VkDeviceSize size, uint8_t memoryClass) const; - uint32_t GetListIndex(uint8_t memoryClass, uint16_t secondIndex) const; - uint32_t GetListIndex(VkDeviceSize size) const; - - void RemoveFreeBlock(Block* block); - void InsertFreeBlock(Block* block); - void MergeBlock(Block* block, Block* prev); - - Block* FindFreeBlock(VkDeviceSize size, uint32_t& listIndex) const; - bool CheckBlock( - Block& block, - uint32_t listIndex, - VkDeviceSize allocSize, - VkDeviceSize allocAlignment, - VmaSuballocationType allocType, - VmaAllocationRequest* pAllocationRequest); -}; - -#ifndef _VMA_BLOCK_METADATA_TLSF_FUNCTIONS -VmaBlockMetadata_TLSF::VmaBlockMetadata_TLSF(const VkAllocationCallbacks* pAllocationCallbacks, - VkDeviceSize bufferImageGranularity, bool isVirtual) - : VmaBlockMetadata(pAllocationCallbacks, bufferImageGranularity, isVirtual), - m_AllocCount(0), - m_BlocksFreeCount(0), - m_BlocksFreeSize(0), - m_IsFreeBitmap(0), - m_MemoryClasses(0), - m_ListsCount(0), - m_FreeList(VMA_NULL), - m_BlockAllocator(pAllocationCallbacks, INITIAL_BLOCK_ALLOC_COUNT), - m_NullBlock(VMA_NULL), - m_GranularityHandler(bufferImageGranularity) {} - -VmaBlockMetadata_TLSF::~VmaBlockMetadata_TLSF() -{ - if (m_FreeList) - vma_delete_array(GetAllocationCallbacks(), m_FreeList, m_ListsCount); - m_GranularityHandler.Destroy(GetAllocationCallbacks()); -} - -void VmaBlockMetadata_TLSF::Init(VkDeviceSize size) -{ - VmaBlockMetadata::Init(size); - - if (!IsVirtual()) - m_GranularityHandler.Init(GetAllocationCallbacks(), size); - - m_NullBlock = m_BlockAllocator.Alloc(); - m_NullBlock->size = size; - m_NullBlock->offset = 0; - m_NullBlock->prevPhysical = VMA_NULL; - m_NullBlock->nextPhysical = VMA_NULL; - m_NullBlock->MarkFree(); - m_NullBlock->NextFree() = VMA_NULL; - m_NullBlock->PrevFree() = VMA_NULL; - uint8_t memoryClass = SizeToMemoryClass(size); - uint16_t sli = SizeToSecondIndex(size, memoryClass); - m_ListsCount = (memoryClass == 0 ? 
0 : (memoryClass - 1) * (1UL << SECOND_LEVEL_INDEX) + sli) + 1; - if (IsVirtual()) - m_ListsCount += 1UL << SECOND_LEVEL_INDEX; - else - m_ListsCount += 4; - - m_MemoryClasses = memoryClass + 2; - memset(m_InnerIsFreeBitmap, 0, MAX_MEMORY_CLASSES * sizeof(uint32_t)); - - m_FreeList = vma_new_array(GetAllocationCallbacks(), Block*, m_ListsCount); - memset(m_FreeList, 0, m_ListsCount * sizeof(Block*)); -} - -bool VmaBlockMetadata_TLSF::Validate() const -{ - VMA_VALIDATE(GetSumFreeSize() <= GetSize()); - - VkDeviceSize calculatedSize = m_NullBlock->size; - VkDeviceSize calculatedFreeSize = m_NullBlock->size; - size_t allocCount = 0; - size_t freeCount = 0; - - // Check integrity of free lists - for (uint32_t list = 0; list < m_ListsCount; ++list) - { - Block* block = m_FreeList[list]; - if (block != VMA_NULL) - { - VMA_VALIDATE(block->IsFree()); - VMA_VALIDATE(block->PrevFree() == VMA_NULL); - while (block->NextFree()) - { - VMA_VALIDATE(block->NextFree()->IsFree()); - VMA_VALIDATE(block->NextFree()->PrevFree() == block); - block = block->NextFree(); - } - } - } - - VkDeviceSize nextOffset = m_NullBlock->offset; - auto validateCtx = m_GranularityHandler.StartValidation(GetAllocationCallbacks(), IsVirtual()); - - VMA_VALIDATE(m_NullBlock->nextPhysical == VMA_NULL); - if (m_NullBlock->prevPhysical) - { - VMA_VALIDATE(m_NullBlock->prevPhysical->nextPhysical == m_NullBlock); - } - // Check all blocks - for (Block* prev = m_NullBlock->prevPhysical; prev != VMA_NULL; prev = prev->prevPhysical) - { - VMA_VALIDATE(prev->offset + prev->size == nextOffset); - nextOffset = prev->offset; - calculatedSize += prev->size; - - uint32_t listIndex = GetListIndex(prev->size); - if (prev->IsFree()) - { - ++freeCount; - // Check if free block belongs to free list - Block* freeBlock = m_FreeList[listIndex]; - VMA_VALIDATE(freeBlock != VMA_NULL); - - bool found = false; - do - { - if (freeBlock == prev) - found = true; - - freeBlock = freeBlock->NextFree(); - } while (!found && freeBlock != VMA_NULL); - - VMA_VALIDATE(found); - calculatedFreeSize += prev->size; - } - else - { - ++allocCount; - // Check if taken block is not on a free list - Block* freeBlock = m_FreeList[listIndex]; - while (freeBlock) - { - VMA_VALIDATE(freeBlock != prev); - freeBlock = freeBlock->NextFree(); - } - - if (!IsVirtual()) - { - VMA_VALIDATE(m_GranularityHandler.Validate(validateCtx, prev->offset, prev->size)); - } - } - - if (prev->prevPhysical) - { - VMA_VALIDATE(prev->prevPhysical->nextPhysical == prev); - } - } - - if (!IsVirtual()) - { - VMA_VALIDATE(m_GranularityHandler.FinishValidation(validateCtx)); - } - - VMA_VALIDATE(nextOffset == 0); - VMA_VALIDATE(calculatedSize == GetSize()); - VMA_VALIDATE(calculatedFreeSize == GetSumFreeSize()); - VMA_VALIDATE(allocCount == m_AllocCount); - VMA_VALIDATE(freeCount == m_BlocksFreeCount); - - return true; -} - -void VmaBlockMetadata_TLSF::AddDetailedStatistics(VmaDetailedStatistics& inoutStats) const -{ - inoutStats.statistics.blockCount++; - inoutStats.statistics.blockBytes += GetSize(); - if (m_NullBlock->size > 0) - VmaAddDetailedStatisticsUnusedRange(inoutStats, m_NullBlock->size); - - for (Block* block = m_NullBlock->prevPhysical; block != VMA_NULL; block = block->prevPhysical) - { - if (block->IsFree()) - VmaAddDetailedStatisticsUnusedRange(inoutStats, block->size); - else - VmaAddDetailedStatisticsAllocation(inoutStats, block->size); - } -} - -void VmaBlockMetadata_TLSF::AddStatistics(VmaStatistics& inoutStats) const -{ - inoutStats.blockCount++; - inoutStats.allocationCount += 
(uint32_t)m_AllocCount; - inoutStats.blockBytes += GetSize(); - inoutStats.allocationBytes += GetSize() - GetSumFreeSize(); -} - -#if VMA_STATS_STRING_ENABLED -void VmaBlockMetadata_TLSF::PrintDetailedMap(class VmaJsonWriter& json) const -{ - size_t blockCount = m_AllocCount + m_BlocksFreeCount; - VmaStlAllocator allocator(GetAllocationCallbacks()); - VmaVector> blockList(blockCount, allocator); - - size_t i = blockCount; - for (Block* block = m_NullBlock->prevPhysical; block != VMA_NULL; block = block->prevPhysical) - { - blockList[--i] = block; - } - VMA_ASSERT(i == 0); - - VmaDetailedStatistics stats; - VmaClearDetailedStatistics(stats); - AddDetailedStatistics(stats); - - PrintDetailedMap_Begin(json, - stats.statistics.blockBytes - stats.statistics.allocationBytes, - stats.statistics.allocationCount, - stats.unusedRangeCount); - - for (; i < blockCount; ++i) - { - Block* block = blockList[i]; - if (block->IsFree()) - PrintDetailedMap_UnusedRange(json, block->offset, block->size); - else - PrintDetailedMap_Allocation(json, block->offset, block->size, block->UserData()); - } - if (m_NullBlock->size > 0) - PrintDetailedMap_UnusedRange(json, m_NullBlock->offset, m_NullBlock->size); - - PrintDetailedMap_End(json); -} -#endif - -bool VmaBlockMetadata_TLSF::CreateAllocationRequest( - VkDeviceSize allocSize, - VkDeviceSize allocAlignment, - bool upperAddress, - VmaSuballocationType allocType, - uint32_t strategy, - VmaAllocationRequest* pAllocationRequest) -{ - VMA_ASSERT(allocSize > 0 && "Cannot allocate empty block!"); - VMA_ASSERT(!upperAddress && "VMA_ALLOCATION_CREATE_UPPER_ADDRESS_BIT can be used only with linear algorithm."); - - // For small granularity round up - if (!IsVirtual()) - m_GranularityHandler.RoundupAllocRequest(allocType, allocSize, allocAlignment); - - allocSize += GetDebugMargin(); - // Quick check for too small pool - if (allocSize > GetSumFreeSize()) - return false; - - // If no free blocks in pool then check only null block - if (m_BlocksFreeCount == 0) - return CheckBlock(*m_NullBlock, m_ListsCount, allocSize, allocAlignment, allocType, pAllocationRequest); - - // Round up to the next block - VkDeviceSize sizeForNextList = allocSize; - VkDeviceSize smallSizeStep = SMALL_BUFFER_SIZE / (IsVirtual() ? 
1 << SECOND_LEVEL_INDEX : 4); - if (allocSize > SMALL_BUFFER_SIZE) - { - sizeForNextList += (1ULL << (VMA_BITSCAN_MSB(allocSize) - SECOND_LEVEL_INDEX)); - } - else if (allocSize > SMALL_BUFFER_SIZE - smallSizeStep) - sizeForNextList = SMALL_BUFFER_SIZE + 1; - else - sizeForNextList += smallSizeStep; - - uint32_t nextListIndex = 0; - uint32_t prevListIndex = 0; - Block* nextListBlock = VMA_NULL; - Block* prevListBlock = VMA_NULL; - - // Check blocks according to strategies - if (strategy & VMA_ALLOCATION_CREATE_STRATEGY_MIN_TIME_BIT) - { - // Quick check for larger block first - nextListBlock = FindFreeBlock(sizeForNextList, nextListIndex); - if (nextListBlock != VMA_NULL && CheckBlock(*nextListBlock, nextListIndex, allocSize, allocAlignment, allocType, pAllocationRequest)) - return true; - - // If not fitted then null block - if (CheckBlock(*m_NullBlock, m_ListsCount, allocSize, allocAlignment, allocType, pAllocationRequest)) - return true; - - // Null block failed, search larger bucket - while (nextListBlock) - { - if (CheckBlock(*nextListBlock, nextListIndex, allocSize, allocAlignment, allocType, pAllocationRequest)) - return true; - nextListBlock = nextListBlock->NextFree(); - } - - // Failed again, check best fit bucket - prevListBlock = FindFreeBlock(allocSize, prevListIndex); - while (prevListBlock) - { - if (CheckBlock(*prevListBlock, prevListIndex, allocSize, allocAlignment, allocType, pAllocationRequest)) - return true; - prevListBlock = prevListBlock->NextFree(); - } - } - else if (strategy & VMA_ALLOCATION_CREATE_STRATEGY_MIN_MEMORY_BIT) - { - // Check best fit bucket - prevListBlock = FindFreeBlock(allocSize, prevListIndex); - while (prevListBlock) - { - if (CheckBlock(*prevListBlock, prevListIndex, allocSize, allocAlignment, allocType, pAllocationRequest)) - return true; - prevListBlock = prevListBlock->NextFree(); - } - - // If failed check null block - if (CheckBlock(*m_NullBlock, m_ListsCount, allocSize, allocAlignment, allocType, pAllocationRequest)) - return true; - - // Check larger bucket - nextListBlock = FindFreeBlock(sizeForNextList, nextListIndex); - while (nextListBlock) - { - if (CheckBlock(*nextListBlock, nextListIndex, allocSize, allocAlignment, allocType, pAllocationRequest)) - return true; - nextListBlock = nextListBlock->NextFree(); - } - } - else if (strategy & VMA_ALLOCATION_CREATE_STRATEGY_MIN_OFFSET_BIT ) - { - // Perform search from the start - VmaStlAllocator allocator(GetAllocationCallbacks()); - VmaVector> blockList(m_BlocksFreeCount, allocator); - - size_t i = m_BlocksFreeCount; - for (Block* block = m_NullBlock->prevPhysical; block != VMA_NULL; block = block->prevPhysical) - { - if (block->IsFree() && block->size >= allocSize) - blockList[--i] = block; - } - - for (; i < m_BlocksFreeCount; ++i) - { - Block& block = *blockList[i]; - if (CheckBlock(block, GetListIndex(block.size), allocSize, allocAlignment, allocType, pAllocationRequest)) - return true; - } - - // If failed check null block - if (CheckBlock(*m_NullBlock, m_ListsCount, allocSize, allocAlignment, allocType, pAllocationRequest)) - return true; - - // Whole range searched, no more memory - return false; - } - else - { - // Check larger bucket - nextListBlock = FindFreeBlock(sizeForNextList, nextListIndex); - while (nextListBlock) - { - if (CheckBlock(*nextListBlock, nextListIndex, allocSize, allocAlignment, allocType, pAllocationRequest)) - return true; - nextListBlock = nextListBlock->NextFree(); - } - - // If failed check null block - if (CheckBlock(*m_NullBlock, m_ListsCount, 
allocSize, allocAlignment, allocType, pAllocationRequest)) - return true; - - // Check best fit bucket - prevListBlock = FindFreeBlock(allocSize, prevListIndex); - while (prevListBlock) - { - if (CheckBlock(*prevListBlock, prevListIndex, allocSize, allocAlignment, allocType, pAllocationRequest)) - return true; - prevListBlock = prevListBlock->NextFree(); - } - } - - // Worst case, full search has to be done - while (++nextListIndex < m_ListsCount) - { - nextListBlock = m_FreeList[nextListIndex]; - while (nextListBlock) - { - if (CheckBlock(*nextListBlock, nextListIndex, allocSize, allocAlignment, allocType, pAllocationRequest)) - return true; - nextListBlock = nextListBlock->NextFree(); - } - } - - // No more memory sadly - return false; -} - -VkResult VmaBlockMetadata_TLSF::CheckCorruption(const void* pBlockData) -{ - for (Block* block = m_NullBlock->prevPhysical; block != VMA_NULL; block = block->prevPhysical) - { - if (!block->IsFree()) - { - if (!VmaValidateMagicValue(pBlockData, block->offset + block->size)) - { - VMA_ASSERT(0 && "MEMORY CORRUPTION DETECTED AFTER VALIDATED ALLOCATION!"); - return VK_ERROR_UNKNOWN_COPY; - } - } - } - - return VK_SUCCESS; -} - -void VmaBlockMetadata_TLSF::Alloc( - const VmaAllocationRequest& request, - VmaSuballocationType type, - void* userData) -{ - VMA_ASSERT(request.type == VmaAllocationRequestType::TLSF); - - // Get block and pop it from the free list - Block* currentBlock = (Block*)request.allocHandle; - VkDeviceSize offset = request.algorithmData; - VMA_ASSERT(currentBlock != VMA_NULL); - VMA_ASSERT(currentBlock->offset <= offset); - - if (currentBlock != m_NullBlock) - RemoveFreeBlock(currentBlock); - - VkDeviceSize debugMargin = GetDebugMargin(); - VkDeviceSize misssingAlignment = offset - currentBlock->offset; - - // Append missing alignment to prev block or create new one - if (misssingAlignment) - { - Block* prevBlock = currentBlock->prevPhysical; - VMA_ASSERT(prevBlock != VMA_NULL && "There should be no missing alignment at offset 0!"); - - if (prevBlock->IsFree() && prevBlock->size != debugMargin) - { - uint32_t oldList = GetListIndex(prevBlock->size); - prevBlock->size += misssingAlignment; - // Check if new size crosses list bucket - if (oldList != GetListIndex(prevBlock->size)) - { - prevBlock->size -= misssingAlignment; - RemoveFreeBlock(prevBlock); - prevBlock->size += misssingAlignment; - InsertFreeBlock(prevBlock); - } - else - m_BlocksFreeSize += misssingAlignment; - } - else - { - Block* newBlock = m_BlockAllocator.Alloc(); - currentBlock->prevPhysical = newBlock; - prevBlock->nextPhysical = newBlock; - newBlock->prevPhysical = prevBlock; - newBlock->nextPhysical = currentBlock; - newBlock->size = misssingAlignment; - newBlock->offset = currentBlock->offset; - newBlock->MarkTaken(); - - InsertFreeBlock(newBlock); - } - - currentBlock->size -= misssingAlignment; - currentBlock->offset += misssingAlignment; - } - - VkDeviceSize size = request.size + debugMargin; - if (currentBlock->size == size) - { - if (currentBlock == m_NullBlock) - { - // Setup new null block - m_NullBlock = m_BlockAllocator.Alloc(); - m_NullBlock->size = 0; - m_NullBlock->offset = currentBlock->offset + size; - m_NullBlock->prevPhysical = currentBlock; - m_NullBlock->nextPhysical = VMA_NULL; - m_NullBlock->MarkFree(); - m_NullBlock->PrevFree() = VMA_NULL; - m_NullBlock->NextFree() = VMA_NULL; - currentBlock->nextPhysical = m_NullBlock; - currentBlock->MarkTaken(); - } - } - else - { - VMA_ASSERT(currentBlock->size > size && "Proper block already found, 
shouldn't find smaller one!"); - - // Create new free block - Block* newBlock = m_BlockAllocator.Alloc(); - newBlock->size = currentBlock->size - size; - newBlock->offset = currentBlock->offset + size; - newBlock->prevPhysical = currentBlock; - newBlock->nextPhysical = currentBlock->nextPhysical; - currentBlock->nextPhysical = newBlock; - currentBlock->size = size; - - if (currentBlock == m_NullBlock) - { - m_NullBlock = newBlock; - m_NullBlock->MarkFree(); - m_NullBlock->NextFree() = VMA_NULL; - m_NullBlock->PrevFree() = VMA_NULL; - currentBlock->MarkTaken(); - } - else - { - newBlock->nextPhysical->prevPhysical = newBlock; - newBlock->MarkTaken(); - InsertFreeBlock(newBlock); - } - } - currentBlock->UserData() = userData; - - if (debugMargin > 0) - { - currentBlock->size -= debugMargin; - Block* newBlock = m_BlockAllocator.Alloc(); - newBlock->size = debugMargin; - newBlock->offset = currentBlock->offset + currentBlock->size; - newBlock->prevPhysical = currentBlock; - newBlock->nextPhysical = currentBlock->nextPhysical; - newBlock->MarkTaken(); - currentBlock->nextPhysical->prevPhysical = newBlock; - currentBlock->nextPhysical = newBlock; - InsertFreeBlock(newBlock); - } - - if (!IsVirtual()) - m_GranularityHandler.AllocPages((uint8_t)(uintptr_t)request.customData, - currentBlock->offset, currentBlock->size); - ++m_AllocCount; -} - -void VmaBlockMetadata_TLSF::Free(VmaAllocHandle allocHandle) -{ - Block* block = (Block*)allocHandle; - Block* next = block->nextPhysical; - VMA_ASSERT(!block->IsFree() && "Block is already free!"); - - if (!IsVirtual()) - m_GranularityHandler.FreePages(block->offset, block->size); - --m_AllocCount; - - VkDeviceSize debugMargin = GetDebugMargin(); - if (debugMargin > 0) - { - RemoveFreeBlock(next); - MergeBlock(next, block); - block = next; - next = next->nextPhysical; - } - - // Try merging - Block* prev = block->prevPhysical; - if (prev != VMA_NULL && prev->IsFree() && prev->size != debugMargin) - { - RemoveFreeBlock(prev); - MergeBlock(block, prev); - } - - if (!next->IsFree()) - InsertFreeBlock(block); - else if (next == m_NullBlock) - MergeBlock(m_NullBlock, block); - else - { - RemoveFreeBlock(next); - MergeBlock(next, block); - InsertFreeBlock(next); - } -} - -void VmaBlockMetadata_TLSF::GetAllocationInfo(VmaAllocHandle allocHandle, VmaVirtualAllocationInfo& outInfo) -{ - Block* block = (Block*)allocHandle; - VMA_ASSERT(!block->IsFree() && "Cannot get allocation info for free block!"); - outInfo.offset = block->offset; - outInfo.size = block->size; - outInfo.pUserData = block->UserData(); -} - -void* VmaBlockMetadata_TLSF::GetAllocationUserData(VmaAllocHandle allocHandle) const -{ - Block* block = (Block*)allocHandle; - VMA_ASSERT(!block->IsFree() && "Cannot get user data for free block!"); - return block->UserData(); -} - -VmaAllocHandle VmaBlockMetadata_TLSF::GetAllocationListBegin() const -{ - if (m_AllocCount == 0) - return VK_NULL_HANDLE; - - for (Block* block = m_NullBlock->prevPhysical; block; block = block->prevPhysical) - { - if (!block->IsFree()) - return (VmaAllocHandle)block; - } - VMA_ASSERT(false && "If m_AllocCount > 0 then should find any allocation!"); - return VK_NULL_HANDLE; -} - -VmaAllocHandle VmaBlockMetadata_TLSF::GetNextAllocation(VmaAllocHandle prevAlloc) const -{ - Block* startBlock = (Block*)prevAlloc; - VMA_ASSERT(!startBlock->IsFree() && "Incorrect block!"); - - for (Block* block = startBlock->prevPhysical; block; block = block->prevPhysical) - { - if (!block->IsFree()) - return (VmaAllocHandle)block; - } - return 
VK_NULL_HANDLE; -} - -VkDeviceSize VmaBlockMetadata_TLSF::GetNextFreeRegionSize(VmaAllocHandle alloc) const -{ - Block* block = (Block*)alloc; - VMA_ASSERT(!block->IsFree() && "Incorrect block!"); - - if (block->prevPhysical) - return block->prevPhysical->IsFree() ? block->prevPhysical->size : 0; - return 0; -} - -void VmaBlockMetadata_TLSF::Clear() -{ - m_AllocCount = 0; - m_BlocksFreeCount = 0; - m_BlocksFreeSize = 0; - m_IsFreeBitmap = 0; - m_NullBlock->offset = 0; - m_NullBlock->size = GetSize(); - Block* block = m_NullBlock->prevPhysical; - m_NullBlock->prevPhysical = VMA_NULL; - while (block) - { - Block* prev = block->prevPhysical; - m_BlockAllocator.Free(block); - block = prev; - } - memset(m_FreeList, 0, m_ListsCount * sizeof(Block*)); - memset(m_InnerIsFreeBitmap, 0, m_MemoryClasses * sizeof(uint32_t)); - m_GranularityHandler.Clear(); -} - -void VmaBlockMetadata_TLSF::SetAllocationUserData(VmaAllocHandle allocHandle, void* userData) -{ - Block* block = (Block*)allocHandle; - VMA_ASSERT(!block->IsFree() && "Trying to set user data for not allocated block!"); - block->UserData() = userData; -} - -void VmaBlockMetadata_TLSF::DebugLogAllAllocations() const -{ - for (Block* block = m_NullBlock->prevPhysical; block != VMA_NULL; block = block->prevPhysical) - if (!block->IsFree()) - DebugLogAllocation(block->offset, block->size, block->UserData()); -} - -uint8_t VmaBlockMetadata_TLSF::SizeToMemoryClass(VkDeviceSize size) const -{ - if (size > SMALL_BUFFER_SIZE) - return VMA_BITSCAN_MSB(size) - MEMORY_CLASS_SHIFT; - return 0; -} - -uint16_t VmaBlockMetadata_TLSF::SizeToSecondIndex(VkDeviceSize size, uint8_t memoryClass) const -{ - if (memoryClass == 0) - { - if (IsVirtual()) - return static_cast((size - 1) / 8); - else - return static_cast((size - 1) / 64); - } - return static_cast((size >> (memoryClass + MEMORY_CLASS_SHIFT - SECOND_LEVEL_INDEX)) ^ (1U << SECOND_LEVEL_INDEX)); -} - -uint32_t VmaBlockMetadata_TLSF::GetListIndex(uint8_t memoryClass, uint16_t secondIndex) const -{ - if (memoryClass == 0) - return secondIndex; - - const uint32_t index = static_cast(memoryClass - 1) * (1 << SECOND_LEVEL_INDEX) + secondIndex; - if (IsVirtual()) - return index + (1 << SECOND_LEVEL_INDEX); - else - return index + 4; -} - -uint32_t VmaBlockMetadata_TLSF::GetListIndex(VkDeviceSize size) const -{ - uint8_t memoryClass = SizeToMemoryClass(size); - return GetListIndex(memoryClass, SizeToSecondIndex(size, memoryClass)); -} - -void VmaBlockMetadata_TLSF::RemoveFreeBlock(Block* block) -{ - VMA_ASSERT(block != m_NullBlock); - VMA_ASSERT(block->IsFree()); - - if (block->NextFree() != VMA_NULL) - block->NextFree()->PrevFree() = block->PrevFree(); - if (block->PrevFree() != VMA_NULL) - block->PrevFree()->NextFree() = block->NextFree(); - else - { - uint8_t memClass = SizeToMemoryClass(block->size); - uint16_t secondIndex = SizeToSecondIndex(block->size, memClass); - uint32_t index = GetListIndex(memClass, secondIndex); - VMA_ASSERT(m_FreeList[index] == block); - m_FreeList[index] = block->NextFree(); - if (block->NextFree() == VMA_NULL) - { - m_InnerIsFreeBitmap[memClass] &= ~(1U << secondIndex); - if (m_InnerIsFreeBitmap[memClass] == 0) - m_IsFreeBitmap &= ~(1UL << memClass); - } - } - block->MarkTaken(); - block->UserData() = VMA_NULL; - --m_BlocksFreeCount; - m_BlocksFreeSize -= block->size; -} - -void VmaBlockMetadata_TLSF::InsertFreeBlock(Block* block) -{ - VMA_ASSERT(block != m_NullBlock); - VMA_ASSERT(!block->IsFree() && "Cannot insert block twice!"); - - uint8_t memClass = 
SizeToMemoryClass(block->size); - uint16_t secondIndex = SizeToSecondIndex(block->size, memClass); - uint32_t index = GetListIndex(memClass, secondIndex); - VMA_ASSERT(index < m_ListsCount); - block->PrevFree() = VMA_NULL; - block->NextFree() = m_FreeList[index]; - m_FreeList[index] = block; - if (block->NextFree() != VMA_NULL) - block->NextFree()->PrevFree() = block; - else - { - m_InnerIsFreeBitmap[memClass] |= 1U << secondIndex; - m_IsFreeBitmap |= 1UL << memClass; - } - ++m_BlocksFreeCount; - m_BlocksFreeSize += block->size; -} - -void VmaBlockMetadata_TLSF::MergeBlock(Block* block, Block* prev) -{ - VMA_ASSERT(block->prevPhysical == prev && "Cannot merge seperate physical regions!"); - VMA_ASSERT(!prev->IsFree() && "Cannot merge block that belongs to free list!"); - - block->offset = prev->offset; - block->size += prev->size; - block->prevPhysical = prev->prevPhysical; - if (block->prevPhysical) - block->prevPhysical->nextPhysical = block; - m_BlockAllocator.Free(prev); -} - -VmaBlockMetadata_TLSF::Block* VmaBlockMetadata_TLSF::FindFreeBlock(VkDeviceSize size, uint32_t& listIndex) const -{ - uint8_t memoryClass = SizeToMemoryClass(size); - uint32_t innerFreeMap = m_InnerIsFreeBitmap[memoryClass] & (~0U << SizeToSecondIndex(size, memoryClass)); - if (!innerFreeMap) - { - // Check higher levels for avaiable blocks - uint32_t freeMap = m_IsFreeBitmap & (~0UL << (memoryClass + 1)); - if (!freeMap) - return VMA_NULL; // No more memory avaible - - // Find lowest free region - memoryClass = VMA_BITSCAN_LSB(freeMap); - innerFreeMap = m_InnerIsFreeBitmap[memoryClass]; - VMA_ASSERT(innerFreeMap != 0); - } - // Find lowest free subregion - listIndex = GetListIndex(memoryClass, VMA_BITSCAN_LSB(innerFreeMap)); - VMA_ASSERT(m_FreeList[listIndex]); - return m_FreeList[listIndex]; -} - -bool VmaBlockMetadata_TLSF::CheckBlock( - Block& block, - uint32_t listIndex, - VkDeviceSize allocSize, - VkDeviceSize allocAlignment, - VmaSuballocationType allocType, - VmaAllocationRequest* pAllocationRequest) -{ - VMA_ASSERT(block.IsFree() && "Block is already taken!"); - - VkDeviceSize alignedOffset = VmaAlignUp(block.offset, allocAlignment); - if (block.size < allocSize + alignedOffset - block.offset) - return false; - - // Check for granularity conflicts - if (!IsVirtual() && - m_GranularityHandler.CheckConflictAndAlignUp(alignedOffset, allocSize, block.offset, block.size, allocType)) - return false; - - // Alloc successful - pAllocationRequest->type = VmaAllocationRequestType::TLSF; - pAllocationRequest->allocHandle = (VmaAllocHandle)█ - pAllocationRequest->size = allocSize - GetDebugMargin(); - pAllocationRequest->customData = (void*)allocType; - pAllocationRequest->algorithmData = alignedOffset; - - // Place block at the start of list if it's normal block - if (listIndex != m_ListsCount && block.PrevFree()) - { - block.PrevFree()->NextFree() = block.NextFree(); - if (block.NextFree()) - block.NextFree()->PrevFree() = block.PrevFree(); - block.PrevFree() = VMA_NULL; - block.NextFree() = m_FreeList[listIndex]; - m_FreeList[listIndex] = █ - if (block.NextFree()) - block.NextFree()->PrevFree() = █ - } - - return true; -} -#endif // _VMA_BLOCK_METADATA_TLSF_FUNCTIONS -#endif // _VMA_BLOCK_METADATA_TLSF - -#ifndef _VMA_BLOCK_VECTOR -/* -Sequence of VmaDeviceMemoryBlock. Represents memory blocks allocated for a specific -Vulkan memory type. - -Synchronized internally with a mutex. 
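The TLSF lookup above (SizeToMemoryClass, SizeToSecondIndex, GetListIndex) maps an allocation size to a first-level "memory class" taken from the most significant bit and a second-level bucket taken from the next few bits below it. A minimal standalone sketch of that indexing scheme follows; the constants are illustrative assumptions and may differ from the header's actual values:

```
// Self-contained sketch of TLSF-style free-list indexing (constants assumed).
#include <cstdint>
#include <cstdio>

namespace tlsf_sketch {

constexpr uint8_t  SECOND_LEVEL_INDEX = 5;   // log2 of buckets per memory class (assumed)
constexpr uint8_t  MEMORY_CLASS_SHIFT = 7;   // sizes with MSB below this fall into class 0 (assumed)
constexpr uint64_t SMALL_BUFFER_SIZE  = 256; // linear bucketing below this size (assumed)

inline uint8_t BitScanMSB(uint64_t v)
{
    uint8_t pos = 0;
    while (v >>= 1) ++pos;
    return pos;
}

inline uint8_t SizeToMemoryClass(uint64_t size)
{
    return size > SMALL_BUFFER_SIZE ? uint8_t(BitScanMSB(size) - MEMORY_CLASS_SHIFT) : 0;
}

inline uint16_t SizeToSecondIndex(uint64_t size, uint8_t memoryClass)
{
    if (memoryClass == 0)
        return uint16_t((size - 1) / 64); // class 0: fixed-width buckets
    // Higher classes: take SECOND_LEVEL_INDEX bits just below the most significant bit.
    return uint16_t((size >> (memoryClass + MEMORY_CLASS_SHIFT - SECOND_LEVEL_INDEX))
                    ^ (1u << SECOND_LEVEL_INDEX));
}

inline uint32_t GetListIndex(uint64_t size)
{
    const uint8_t  cls = SizeToMemoryClass(size);
    const uint16_t sli = SizeToSecondIndex(size, cls);
    if (cls == 0)
        return sli;
    return uint32_t(cls - 1) * (1u << SECOND_LEVEL_INDEX) + sli + 4; // 4 class-0 lists (assumed)
}

} // namespace tlsf_sketch

int main()
{
    for (uint64_t size : { 32ull, 200ull, 300ull, 4096ull, 1ull << 20 })
        std::printf("size %llu -> free list %u\n",
                    (unsigned long long)size, tlsf_sketch::GetListIndex(size));
}
```

Because the mapping only inspects bit positions, finding a free block of at least a given size reduces to two bit scans over the first- and second-level bitmaps, which is what FindFreeBlock does above.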
-*/ -class VmaBlockVector -{ - friend struct VmaDefragmentationContext_T; - VMA_CLASS_NO_COPY(VmaBlockVector) -public: - VmaBlockVector( - VmaAllocator hAllocator, - VmaPool hParentPool, - uint32_t memoryTypeIndex, - VkDeviceSize preferredBlockSize, - size_t minBlockCount, - size_t maxBlockCount, - VkDeviceSize bufferImageGranularity, - bool explicitBlockSize, - uint32_t algorithm, - float priority, - VkDeviceSize minAllocationAlignment, - void* pMemoryAllocateNext); - ~VmaBlockVector(); - - VmaAllocator GetAllocator() const { return m_hAllocator; } - VmaPool GetParentPool() const { return m_hParentPool; } - bool IsCustomPool() const { return m_hParentPool != VMA_NULL; } - uint32_t GetMemoryTypeIndex() const { return m_MemoryTypeIndex; } - VkDeviceSize GetPreferredBlockSize() const { return m_PreferredBlockSize; } - VkDeviceSize GetBufferImageGranularity() const { return m_BufferImageGranularity; } - uint32_t GetAlgorithm() const { return m_Algorithm; } - bool HasExplicitBlockSize() const { return m_ExplicitBlockSize; } - float GetPriority() const { return m_Priority; } - const void* GetAllocationNextPtr() const { return m_pMemoryAllocateNext; } - // To be used only while the m_Mutex is locked. Used during defragmentation. - size_t GetBlockCount() const { return m_Blocks.size(); } - // To be used only while the m_Mutex is locked. Used during defragmentation. - VmaDeviceMemoryBlock* GetBlock(size_t index) const { return m_Blocks[index]; } - VMA_RW_MUTEX &GetMutex() { return m_Mutex; } - - VkResult CreateMinBlocks(); - void AddStatistics(VmaStatistics& inoutStats); - void AddDetailedStatistics(VmaDetailedStatistics& inoutStats); - bool IsEmpty(); - bool IsCorruptionDetectionEnabled() const; - - VkResult Allocate( - VkDeviceSize size, - VkDeviceSize alignment, - const VmaAllocationCreateInfo& createInfo, - VmaSuballocationType suballocType, - size_t allocationCount, - VmaAllocation* pAllocations); - - void Free(const VmaAllocation hAllocation); - -#if VMA_STATS_STRING_ENABLED - void PrintDetailedMap(class VmaJsonWriter& json); -#endif - - VkResult CheckCorruption(); - -private: - const VmaAllocator m_hAllocator; - const VmaPool m_hParentPool; - const uint32_t m_MemoryTypeIndex; - const VkDeviceSize m_PreferredBlockSize; - const size_t m_MinBlockCount; - const size_t m_MaxBlockCount; - const VkDeviceSize m_BufferImageGranularity; - const bool m_ExplicitBlockSize; - const uint32_t m_Algorithm; - const float m_Priority; - const VkDeviceSize m_MinAllocationAlignment; - - void* const m_pMemoryAllocateNext; - VMA_RW_MUTEX m_Mutex; - // Incrementally sorted by sumFreeSize, ascending. - VmaVector> m_Blocks; - uint32_t m_NextBlockId; - bool m_IncrementalSort = true; - - void SetIncrementalSort(bool val) { m_IncrementalSort = val; } - - VkDeviceSize CalcMaxBlockSize() const; - // Finds and removes given block from vector. - void Remove(VmaDeviceMemoryBlock* pBlock); - // Performs single step in sorting m_Blocks. They may not be fully sorted - // after this call. 
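IncrementallySortBlocks, declared just below, only has to nudge m_Blocks toward ascending sumFreeSize rather than fully sort it on every call. One hypothetical way to implement such a single step (not necessarily what this header does) is a lone bubble pass that performs at most one swap:

```
// Hedged sketch of "incremental" sorting: one cheap step per call, so the vector
// converges toward ascending free-size order over many allocations without a full sort.
#include <cstddef>
#include <utility>
#include <vector>

struct BlockSketch { size_t sumFreeSize; };

inline void IncrementallySortBlocksSketch(std::vector<BlockSketch*>& blocks)
{
    for (size_t i = 1; i < blocks.size(); ++i)
    {
        if (blocks[i - 1]->sumFreeSize > blocks[i]->sumFreeSize)
        {
            std::swap(blocks[i - 1], blocks[i]);
            return; // one swap per call keeps the order only roughly sorted, by design
        }
    }
}
```

Doing one cheap step per allocation amortizes the sorting cost while keeping the "prefer fuller blocks first" forward iteration in Allocate reasonably accurate.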
- void IncrementallySortBlocks(); - void SortByFreeSize(); - - VkResult AllocatePage( - VkDeviceSize size, - VkDeviceSize alignment, - const VmaAllocationCreateInfo& createInfo, - VmaSuballocationType suballocType, - VmaAllocation* pAllocation); - - VkResult AllocateFromBlock( - VmaDeviceMemoryBlock* pBlock, - VkDeviceSize size, - VkDeviceSize alignment, - VmaAllocationCreateFlags allocFlags, - void* pUserData, - VmaSuballocationType suballocType, - uint32_t strategy, - VmaAllocation* pAllocation); - - VkResult CommitAllocationRequest( - VmaAllocationRequest& allocRequest, - VmaDeviceMemoryBlock* pBlock, - VkDeviceSize alignment, - VmaAllocationCreateFlags allocFlags, - void* pUserData, - VmaSuballocationType suballocType, - VmaAllocation* pAllocation); - - VkResult CreateBlock(VkDeviceSize blockSize, size_t* pNewBlockIndex); - bool HasEmptyBlock(); -}; -#endif // _VMA_BLOCK_VECTOR - -#ifndef _VMA_DEFRAGMENTATION_CONTEXT -struct VmaDefragmentationContext_T -{ - VMA_CLASS_NO_COPY(VmaDefragmentationContext_T) -public: - VmaDefragmentationContext_T( - VmaAllocator hAllocator, - const VmaDefragmentationInfo& info); - ~VmaDefragmentationContext_T(); - - void GetStats(VmaDefragmentationStats& outStats) { outStats = m_GlobalStats; } - - VkResult DefragmentPassBegin(VmaDefragmentationPassMoveInfo& moveInfo); - VkResult DefragmentPassEnd(VmaDefragmentationPassMoveInfo& moveInfo); - -private: - // Max number of allocations to ignore due to size constraints before ending single pass - static const uint8_t MAX_ALLOCS_TO_IGNORE = 16; - enum class CounterStatus { Pass, Ignore, End }; - - struct FragmentedBlock - { - uint32_t data; - VmaDeviceMemoryBlock* block; - }; - struct StateBalanced - { - VkDeviceSize avgFreeSize = 0; - VkDeviceSize avgAllocSize = UINT64_MAX; - }; - struct StateExtensive - { - enum class Operation : uint8_t - { - FindFreeBlockBuffer, FindFreeBlockTexture, FindFreeBlockAll, - MoveBuffers, MoveTextures, MoveAll, - Cleanup, Done - }; - - Operation operation = Operation::FindFreeBlockTexture; - size_t firstFreeBlock = SIZE_MAX; - }; - struct MoveAllocationData - { - VkDeviceSize size; - VkDeviceSize alignment; - VmaSuballocationType type; - VmaAllocationCreateFlags flags; - VmaDefragmentationMove move = {}; - }; - - const VkDeviceSize m_MaxPassBytes; - const uint32_t m_MaxPassAllocations; - - VmaStlAllocator m_MoveAllocator; - VmaVector> m_Moves; - - uint8_t m_IgnoredAllocs = 0; - uint32_t m_Algorithm; - uint32_t m_BlockVectorCount; - VmaBlockVector* m_PoolBlockVector; - VmaBlockVector** m_pBlockVectors; - size_t m_ImmovableBlockCount = 0; - VmaDefragmentationStats m_GlobalStats = { 0 }; - VmaDefragmentationStats m_PassStats = { 0 }; - void* m_AlgorithmState = VMA_NULL; - - static MoveAllocationData GetMoveData(VmaAllocHandle handle, VmaBlockMetadata* metadata); - CounterStatus CheckCounters(VkDeviceSize bytes); - bool IncrementCounters(VkDeviceSize bytes); - bool ReallocWithinBlock(VmaBlockVector& vector, VmaDeviceMemoryBlock* block); - bool AllocInOtherBlock(size_t start, size_t end, MoveAllocationData& data, VmaBlockVector& vector); - - bool ComputeDefragmentation(VmaBlockVector& vector, size_t index); - bool ComputeDefragmentation_Fast(VmaBlockVector& vector); - bool ComputeDefragmentation_Balanced(VmaBlockVector& vector, size_t index, bool update); - bool ComputeDefragmentation_Full(VmaBlockVector& vector); - bool ComputeDefragmentation_Extensive(VmaBlockVector& vector, size_t index); - - void UpdateVectorStatistics(VmaBlockVector& vector, StateBalanced& state); - bool 
MoveDataToFreeBlocks(VmaSuballocationType currentType, - VmaBlockVector& vector, size_t firstFreeBlock, - bool& texturePresent, bool& bufferPresent, bool& otherPresent); -}; -#endif // _VMA_DEFRAGMENTATION_CONTEXT - -#ifndef _VMA_POOL_T -struct VmaPool_T -{ - friend struct VmaPoolListItemTraits; - VMA_CLASS_NO_COPY(VmaPool_T) -public: - VmaBlockVector m_BlockVector; - VmaDedicatedAllocationList m_DedicatedAllocations; - - VmaPool_T( - VmaAllocator hAllocator, - const VmaPoolCreateInfo& createInfo, - VkDeviceSize preferredBlockSize); - ~VmaPool_T(); - - uint32_t GetId() const { return m_Id; } - void SetId(uint32_t id) { VMA_ASSERT(m_Id == 0); m_Id = id; } - - const char* GetName() const { return m_Name; } - void SetName(const char* pName); - -#if VMA_STATS_STRING_ENABLED - //void PrintDetailedMap(class VmaStringBuilder& sb); -#endif - -private: - uint32_t m_Id; - char* m_Name; - VmaPool_T* m_PrevPool = VMA_NULL; - VmaPool_T* m_NextPool = VMA_NULL; -}; - -struct VmaPoolListItemTraits -{ - typedef VmaPool_T ItemType; - - static ItemType* GetPrev(const ItemType* item) { return item->m_PrevPool; } - static ItemType* GetNext(const ItemType* item) { return item->m_NextPool; } - static ItemType*& AccessPrev(ItemType* item) { return item->m_PrevPool; } - static ItemType*& AccessNext(ItemType* item) { return item->m_NextPool; } -}; -#endif // _VMA_POOL_T - -#ifndef _VMA_CURRENT_BUDGET_DATA -struct VmaCurrentBudgetData -{ - VMA_ATOMIC_UINT32 m_BlockCount[VK_MAX_MEMORY_HEAPS]; - VMA_ATOMIC_UINT32 m_AllocationCount[VK_MAX_MEMORY_HEAPS]; - VMA_ATOMIC_UINT64 m_BlockBytes[VK_MAX_MEMORY_HEAPS]; - VMA_ATOMIC_UINT64 m_AllocationBytes[VK_MAX_MEMORY_HEAPS]; - -#if VMA_MEMORY_BUDGET - VMA_ATOMIC_UINT32 m_OperationsSinceBudgetFetch; - VMA_RW_MUTEX m_BudgetMutex; - uint64_t m_VulkanUsage[VK_MAX_MEMORY_HEAPS]; - uint64_t m_VulkanBudget[VK_MAX_MEMORY_HEAPS]; - uint64_t m_BlockBytesAtBudgetFetch[VK_MAX_MEMORY_HEAPS]; -#endif // VMA_MEMORY_BUDGET - - VmaCurrentBudgetData(); - - void AddAllocation(uint32_t heapIndex, VkDeviceSize allocationSize); - void RemoveAllocation(uint32_t heapIndex, VkDeviceSize allocationSize); -}; - -#ifndef _VMA_CURRENT_BUDGET_DATA_FUNCTIONS -VmaCurrentBudgetData::VmaCurrentBudgetData() -{ - for (uint32_t heapIndex = 0; heapIndex < VK_MAX_MEMORY_HEAPS; ++heapIndex) - { - m_BlockCount[heapIndex] = 0; - m_AllocationCount[heapIndex] = 0; - m_BlockBytes[heapIndex] = 0; - m_AllocationBytes[heapIndex] = 0; -#if VMA_MEMORY_BUDGET - m_VulkanUsage[heapIndex] = 0; - m_VulkanBudget[heapIndex] = 0; - m_BlockBytesAtBudgetFetch[heapIndex] = 0; -#endif - } - -#if VMA_MEMORY_BUDGET - m_OperationsSinceBudgetFetch = 0; -#endif -} - -void VmaCurrentBudgetData::AddAllocation(uint32_t heapIndex, VkDeviceSize allocationSize) -{ - m_AllocationBytes[heapIndex] += allocationSize; - ++m_AllocationCount[heapIndex]; -#if VMA_MEMORY_BUDGET - ++m_OperationsSinceBudgetFetch; -#endif -} - -void VmaCurrentBudgetData::RemoveAllocation(uint32_t heapIndex, VkDeviceSize allocationSize) -{ - VMA_ASSERT(m_AllocationBytes[heapIndex] >= allocationSize); - m_AllocationBytes[heapIndex] -= allocationSize; - VMA_ASSERT(m_AllocationCount[heapIndex] > 0); - --m_AllocationCount[heapIndex]; -#if VMA_MEMORY_BUDGET - ++m_OperationsSinceBudgetFetch; -#endif -} -#endif // _VMA_CURRENT_BUDGET_DATA_FUNCTIONS -#endif // _VMA_CURRENT_BUDGET_DATA - -#ifndef _VMA_ALLOCATION_OBJECT_ALLOCATOR -/* -Thread-safe wrapper over VmaPoolAllocator free list, for allocation of VmaAllocation_T objects. 
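VmaCurrentBudgetData above keeps plain atomic counters per heap so that AddAllocation and RemoveAllocation can run without taking the budget mutex. A self-contained sketch of the same bookkeeping pattern, with std::atomic standing in for the VMA_ATOMIC_* wrappers and an assumed heap count:

```
// Per-heap budget bookkeeping sketch; names and kMaxHeaps are stand-ins.
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr uint32_t kMaxHeaps = 16; // stand-in for VK_MAX_MEMORY_HEAPS

struct BudgetSketch {
    std::atomic<uint32_t> allocationCount[kMaxHeaps]{};
    std::atomic<uint64_t> allocationBytes[kMaxHeaps]{};

    void AddAllocation(uint32_t heapIndex, uint64_t size) {
        allocationBytes[heapIndex].fetch_add(size, std::memory_order_relaxed);
        allocationCount[heapIndex].fetch_add(1, std::memory_order_relaxed);
    }
    void RemoveAllocation(uint32_t heapIndex, uint64_t size) {
        assert(allocationBytes[heapIndex].load() >= size);
        allocationBytes[heapIndex].fetch_sub(size, std::memory_order_relaxed);
        assert(allocationCount[heapIndex].load() > 0);
        allocationCount[heapIndex].fetch_sub(1, std::memory_order_relaxed);
    }
};
```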
-*/ -class VmaAllocationObjectAllocator -{ - VMA_CLASS_NO_COPY(VmaAllocationObjectAllocator) -public: - VmaAllocationObjectAllocator(const VkAllocationCallbacks* pAllocationCallbacks) - : m_Allocator(pAllocationCallbacks, 1024) {} - - template VmaAllocation Allocate(Types&&... args); - void Free(VmaAllocation hAlloc); - -private: - VMA_MUTEX m_Mutex; - VmaPoolAllocator m_Allocator; -}; - -template -VmaAllocation VmaAllocationObjectAllocator::Allocate(Types&&... args) -{ - VmaMutexLock mutexLock(m_Mutex); - return m_Allocator.Alloc(std::forward(args)...); -} - -void VmaAllocationObjectAllocator::Free(VmaAllocation hAlloc) -{ - VmaMutexLock mutexLock(m_Mutex); - m_Allocator.Free(hAlloc); -} -#endif // _VMA_ALLOCATION_OBJECT_ALLOCATOR - -#ifndef _VMA_VIRTUAL_BLOCK_T -struct VmaVirtualBlock_T -{ - VMA_CLASS_NO_COPY(VmaVirtualBlock_T) -public: - const bool m_AllocationCallbacksSpecified; - const VkAllocationCallbacks m_AllocationCallbacks; - - VmaVirtualBlock_T(const VmaVirtualBlockCreateInfo& createInfo); - ~VmaVirtualBlock_T(); - - VkResult Init() { return VK_SUCCESS; } - bool IsEmpty() const { return m_Metadata->IsEmpty(); } - void Free(VmaVirtualAllocation allocation) { m_Metadata->Free((VmaAllocHandle)allocation); } - void SetAllocationUserData(VmaVirtualAllocation allocation, void* userData) { m_Metadata->SetAllocationUserData((VmaAllocHandle)allocation, userData); } - void Clear() { m_Metadata->Clear(); } - - const VkAllocationCallbacks* GetAllocationCallbacks() const; - void GetAllocationInfo(VmaVirtualAllocation allocation, VmaVirtualAllocationInfo& outInfo); - VkResult Allocate(const VmaVirtualAllocationCreateInfo& createInfo, VmaVirtualAllocation& outAllocation, - VkDeviceSize* outOffset); - void GetStatistics(VmaStatistics& outStats) const; - void CalculateDetailedStatistics(VmaDetailedStatistics& outStats) const; -#if VMA_STATS_STRING_ENABLED - void BuildStatsString(bool detailedMap, VmaStringBuilder& sb) const; -#endif - -private: - VmaBlockMetadata* m_Metadata; -}; - -#ifndef _VMA_VIRTUAL_BLOCK_T_FUNCTIONS -VmaVirtualBlock_T::VmaVirtualBlock_T(const VmaVirtualBlockCreateInfo& createInfo) - : m_AllocationCallbacksSpecified(createInfo.pAllocationCallbacks != VMA_NULL), - m_AllocationCallbacks(createInfo.pAllocationCallbacks != VMA_NULL ? *createInfo.pAllocationCallbacks : VmaEmptyAllocationCallbacks) -{ - const uint32_t algorithm = createInfo.flags & VMA_VIRTUAL_BLOCK_CREATE_ALGORITHM_MASK; - switch (algorithm) - { - default: - VMA_ASSERT(0); - case 0: - m_Metadata = vma_new(GetAllocationCallbacks(), VmaBlockMetadata_TLSF)(VK_NULL_HANDLE, 1, true); - break; - case VMA_VIRTUAL_BLOCK_CREATE_LINEAR_ALGORITHM_BIT: - m_Metadata = vma_new(GetAllocationCallbacks(), VmaBlockMetadata_Linear)(VK_NULL_HANDLE, 1, true); - break; - } - - m_Metadata->Init(createInfo.size); -} - -VmaVirtualBlock_T::~VmaVirtualBlock_T() -{ - // Define macro VMA_DEBUG_LOG to receive the list of the unfreed allocations - if (!m_Metadata->IsEmpty()) - m_Metadata->DebugLogAllAllocations(); - // This is the most important assert in the entire library. - // Hitting it means you have some memory leak - unreleased virtual allocations. - VMA_ASSERT(m_Metadata->IsEmpty() && "Some virtual allocations were not freed before destruction of this virtual block!"); - - vma_delete(GetAllocationCallbacks(), m_Metadata); -} - -const VkAllocationCallbacks* VmaVirtualBlock_T::GetAllocationCallbacks() const -{ - return m_AllocationCallbacksSpecified ? 
&m_AllocationCallbacks : VMA_NULL; -} - -void VmaVirtualBlock_T::GetAllocationInfo(VmaVirtualAllocation allocation, VmaVirtualAllocationInfo& outInfo) -{ - m_Metadata->GetAllocationInfo((VmaAllocHandle)allocation, outInfo); -} - -VkResult VmaVirtualBlock_T::Allocate(const VmaVirtualAllocationCreateInfo& createInfo, VmaVirtualAllocation& outAllocation, - VkDeviceSize* outOffset) -{ - VmaAllocationRequest request = {}; - if (m_Metadata->CreateAllocationRequest( - createInfo.size, // allocSize - VMA_MAX(createInfo.alignment, (VkDeviceSize)1), // allocAlignment - (createInfo.flags & VMA_VIRTUAL_ALLOCATION_CREATE_UPPER_ADDRESS_BIT) != 0, // upperAddress - VMA_SUBALLOCATION_TYPE_UNKNOWN, // allocType - unimportant - createInfo.flags & VMA_VIRTUAL_ALLOCATION_CREATE_STRATEGY_MASK, // strategy - &request)) - { - m_Metadata->Alloc(request, - VMA_SUBALLOCATION_TYPE_UNKNOWN, // type - unimportant - createInfo.pUserData); - outAllocation = (VmaVirtualAllocation)request.allocHandle; - if(outOffset) - *outOffset = m_Metadata->GetAllocationOffset(request.allocHandle); - return VK_SUCCESS; - } - outAllocation = (VmaVirtualAllocation)VK_NULL_HANDLE; - if (outOffset) - *outOffset = UINT64_MAX; - return VK_ERROR_OUT_OF_DEVICE_MEMORY; -} - -void VmaVirtualBlock_T::GetStatistics(VmaStatistics& outStats) const -{ - VmaClearStatistics(outStats); - m_Metadata->AddStatistics(outStats); -} - -void VmaVirtualBlock_T::CalculateDetailedStatistics(VmaDetailedStatistics& outStats) const -{ - VmaClearDetailedStatistics(outStats); - m_Metadata->AddDetailedStatistics(outStats); -} - -#if VMA_STATS_STRING_ENABLED -void VmaVirtualBlock_T::BuildStatsString(bool detailedMap, VmaStringBuilder& sb) const -{ - VmaJsonWriter json(GetAllocationCallbacks(), sb); - json.BeginObject(); - - VmaDetailedStatistics stats; - CalculateDetailedStatistics(stats); - - json.WriteString("Stats"); - VmaPrintDetailedStatistics(json, stats); - - if (detailedMap) - { - json.WriteString("Details"); - json.BeginObject(); - m_Metadata->PrintDetailedMap(json); - json.EndObject(); - } - - json.EndObject(); -} -#endif // VMA_STATS_STRING_ENABLED -#endif // _VMA_VIRTUAL_BLOCK_T_FUNCTIONS -#endif // _VMA_VIRTUAL_BLOCK_T - - -// Main allocator object. -struct VmaAllocator_T -{ - VMA_CLASS_NO_COPY(VmaAllocator_T) -public: - bool m_UseMutex; - uint32_t m_VulkanApiVersion; - bool m_UseKhrDedicatedAllocation; // Can be set only if m_VulkanApiVersion < VK_MAKE_VERSION(1, 1, 0). - bool m_UseKhrBindMemory2; // Can be set only if m_VulkanApiVersion < VK_MAKE_VERSION(1, 1, 0). - bool m_UseExtMemoryBudget; - bool m_UseAmdDeviceCoherentMemory; - bool m_UseKhrBufferDeviceAddress; - bool m_UseExtMemoryPriority; - VkDevice m_hDevice; - VkInstance m_hInstance; - bool m_AllocationCallbacksSpecified; - VkAllocationCallbacks m_AllocationCallbacks; - VmaDeviceMemoryCallbacks m_DeviceMemoryCallbacks; - VmaAllocationObjectAllocator m_AllocationObjectAllocator; - - // Each bit (1 << i) is set if HeapSizeLimit is enabled for that heap, so cannot allocate more than the heap size. - uint32_t m_HeapSizeLimitMask; - - VkPhysicalDeviceProperties m_PhysicalDeviceProperties; - VkPhysicalDeviceMemoryProperties m_MemProps; - - // Default pools. - VmaBlockVector* m_pBlockVectors[VK_MAX_MEMORY_TYPES]; - VmaDedicatedAllocationList m_DedicatedAllocations[VK_MAX_MEMORY_TYPES]; - - VmaCurrentBudgetData m_Budget; - VMA_ATOMIC_UINT32 m_DeviceMemoryCount; // Total number of VkDeviceMemory objects. 
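VmaVirtualBlock_T above is the backing object for VMA's virtual allocation feature: it runs the TLSF or linear metadata over a caller-defined address range without touching any VkDeviceMemory. A hedged usage sketch against the public API as it looks in VMA 3.x (verify the signatures against the header revision you actually build with):

```
// Usage sketch of the virtual-allocation API backed by VmaVirtualBlock_T.
#include "vk_mem_alloc.h"
#include <cstdio>

void VirtualBlockDemo()
{
    VmaVirtualBlockCreateInfo blockInfo = {};
    blockInfo.size = 1024 * 1024; // 1 MiB of "virtual" space, no real GPU memory

    VmaVirtualBlock block = VK_NULL_HANDLE;
    if (vmaCreateVirtualBlock(&blockInfo, &block) != VK_SUCCESS)
        return;

    VmaVirtualAllocationCreateInfo allocInfo = {};
    allocInfo.size = 4096;
    allocInfo.alignment = 256;

    VmaVirtualAllocation alloc = VK_NULL_HANDLE;
    VkDeviceSize offset = 0;
    if (vmaVirtualAllocate(block, &allocInfo, &alloc, &offset) == VK_SUCCESS)
    {
        std::printf("sub-allocated 4 KiB at offset %llu\n", (unsigned long long)offset);
        vmaVirtualFree(block, alloc);
    }
    vmaDestroyVirtualBlock(block);
}
```

Because no real memory is involved, the returned offset is typically used to sub-allocate a caller-owned buffer, descriptor range, or similar resource.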
- - VmaAllocator_T(const VmaAllocatorCreateInfo* pCreateInfo); - VkResult Init(const VmaAllocatorCreateInfo* pCreateInfo); - ~VmaAllocator_T(); - - const VkAllocationCallbacks* GetAllocationCallbacks() const - { - return m_AllocationCallbacksSpecified ? &m_AllocationCallbacks : VMA_NULL; - } - const VmaVulkanFunctions& GetVulkanFunctions() const - { - return m_VulkanFunctions; - } - - VkPhysicalDevice GetPhysicalDevice() const { return m_PhysicalDevice; } - - VkDeviceSize GetBufferImageGranularity() const - { - return VMA_MAX( - static_cast(VMA_DEBUG_MIN_BUFFER_IMAGE_GRANULARITY), - m_PhysicalDeviceProperties.limits.bufferImageGranularity); - } - - uint32_t GetMemoryHeapCount() const { return m_MemProps.memoryHeapCount; } - uint32_t GetMemoryTypeCount() const { return m_MemProps.memoryTypeCount; } - - uint32_t MemoryTypeIndexToHeapIndex(uint32_t memTypeIndex) const - { - VMA_ASSERT(memTypeIndex < m_MemProps.memoryTypeCount); - return m_MemProps.memoryTypes[memTypeIndex].heapIndex; - } - // True when specific memory type is HOST_VISIBLE but not HOST_COHERENT. - bool IsMemoryTypeNonCoherent(uint32_t memTypeIndex) const - { - return (m_MemProps.memoryTypes[memTypeIndex].propertyFlags & (VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT)) == - VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT; - } - // Minimum alignment for all allocations in specific memory type. - VkDeviceSize GetMemoryTypeMinAlignment(uint32_t memTypeIndex) const - { - return IsMemoryTypeNonCoherent(memTypeIndex) ? - VMA_MAX((VkDeviceSize)VMA_MIN_ALIGNMENT, m_PhysicalDeviceProperties.limits.nonCoherentAtomSize) : - (VkDeviceSize)VMA_MIN_ALIGNMENT; - } - - bool IsIntegratedGpu() const - { - return m_PhysicalDeviceProperties.deviceType == VK_PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU; - } - - uint32_t GetGlobalMemoryTypeBits() const { return m_GlobalMemoryTypeBits; } - - void GetBufferMemoryRequirements( - VkBuffer hBuffer, - VkMemoryRequirements& memReq, - bool& requiresDedicatedAllocation, - bool& prefersDedicatedAllocation) const; - void GetImageMemoryRequirements( - VkImage hImage, - VkMemoryRequirements& memReq, - bool& requiresDedicatedAllocation, - bool& prefersDedicatedAllocation) const; - VkResult FindMemoryTypeIndex( - uint32_t memoryTypeBits, - const VmaAllocationCreateInfo* pAllocationCreateInfo, - VkFlags bufImgUsage, // VkBufferCreateInfo::usage or VkImageCreateInfo::usage. UINT32_MAX if unknown. - uint32_t* pMemoryTypeIndex) const; - - // Main allocation function. - VkResult AllocateMemory( - const VkMemoryRequirements& vkMemReq, - bool requiresDedicatedAllocation, - bool prefersDedicatedAllocation, - VkBuffer dedicatedBuffer, - VkImage dedicatedImage, - VkFlags dedicatedBufferImageUsage, // UINT32_MAX if unknown. - const VmaAllocationCreateInfo& createInfo, - VmaSuballocationType suballocType, - size_t allocationCount, - VmaAllocation* pAllocations); - - // Main deallocation function. 
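FindMemoryTypeIndex, declared above, narrows VkPhysicalDeviceMemoryProperties down to one memory type compatible with both the resource's memoryTypeBits and the requested usage. A stripped-down sketch of the core filter (the real function additionally weighs preferred flags, budget, and buffer/image usage):

```
// Pick the first memory type allowed by memoryTypeBits whose propertyFlags
// contain all required flags; return UINT32_MAX if none qualifies.
#include <vulkan/vulkan.h>
#include <cstdint>

uint32_t FindMemoryTypeSketch(
    const VkPhysicalDeviceMemoryProperties& memProps,
    uint32_t memoryTypeBits,
    VkMemoryPropertyFlags requiredFlags)
{
    for (uint32_t i = 0; i < memProps.memoryTypeCount; ++i)
    {
        const bool allowed  = (memoryTypeBits & (1u << i)) != 0;
        const bool hasFlags =
            (memProps.memoryTypes[i].propertyFlags & requiredFlags) == requiredFlags;
        if (allowed && hasFlags)
            return i;
    }
    return UINT32_MAX; // no suitable type
}
```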
- void FreeMemory( - size_t allocationCount, - const VmaAllocation* pAllocations); - - void CalculateStatistics(VmaTotalStatistics* pStats); - - void GetHeapBudgets( - VmaBudget* outBudgets, uint32_t firstHeap, uint32_t heapCount); - -#if VMA_STATS_STRING_ENABLED - void PrintDetailedMap(class VmaJsonWriter& json); -#endif - - void GetAllocationInfo(VmaAllocation hAllocation, VmaAllocationInfo* pAllocationInfo); - - VkResult CreatePool(const VmaPoolCreateInfo* pCreateInfo, VmaPool* pPool); - void DestroyPool(VmaPool pool); - void GetPoolStatistics(VmaPool pool, VmaStatistics* pPoolStats); - void CalculatePoolStatistics(VmaPool pool, VmaDetailedStatistics* pPoolStats); - - void SetCurrentFrameIndex(uint32_t frameIndex); - uint32_t GetCurrentFrameIndex() const { return m_CurrentFrameIndex.load(); } - - VkResult CheckPoolCorruption(VmaPool hPool); - VkResult CheckCorruption(uint32_t memoryTypeBits); - - // Call to Vulkan function vkAllocateMemory with accompanying bookkeeping. - VkResult AllocateVulkanMemory(const VkMemoryAllocateInfo* pAllocateInfo, VkDeviceMemory* pMemory); - // Call to Vulkan function vkFreeMemory with accompanying bookkeeping. - void FreeVulkanMemory(uint32_t memoryType, VkDeviceSize size, VkDeviceMemory hMemory); - // Call to Vulkan function vkBindBufferMemory or vkBindBufferMemory2KHR. - VkResult BindVulkanBuffer( - VkDeviceMemory memory, - VkDeviceSize memoryOffset, - VkBuffer buffer, - const void* pNext); - // Call to Vulkan function vkBindImageMemory or vkBindImageMemory2KHR. - VkResult BindVulkanImage( - VkDeviceMemory memory, - VkDeviceSize memoryOffset, - VkImage image, - const void* pNext); - - VkResult Map(VmaAllocation hAllocation, void** ppData); - void Unmap(VmaAllocation hAllocation); - - VkResult BindBufferMemory( - VmaAllocation hAllocation, - VkDeviceSize allocationLocalOffset, - VkBuffer hBuffer, - const void* pNext); - VkResult BindImageMemory( - VmaAllocation hAllocation, - VkDeviceSize allocationLocalOffset, - VkImage hImage, - const void* pNext); - - VkResult FlushOrInvalidateAllocation( - VmaAllocation hAllocation, - VkDeviceSize offset, VkDeviceSize size, - VMA_CACHE_OPERATION op); - VkResult FlushOrInvalidateAllocations( - uint32_t allocationCount, - const VmaAllocation* allocations, - const VkDeviceSize* offsets, const VkDeviceSize* sizes, - VMA_CACHE_OPERATION op); - - void FillAllocation(const VmaAllocation hAllocation, uint8_t pattern); - - /* - Returns bit mask of memory types that can support defragmentation on GPU as - they support creation of required buffer for copy operations. - */ - uint32_t GetGpuDefragmentationMemoryTypeBits(); - -#if VMA_EXTERNAL_MEMORY - VkExternalMemoryHandleTypeFlagsKHR GetExternalMemoryHandleTypeFlags(uint32_t memTypeIndex) const - { - return m_TypeExternalMemoryHandleTypes[memTypeIndex]; - } -#endif // #if VMA_EXTERNAL_MEMORY - -private: - VkDeviceSize m_PreferredLargeHeapBlockSize; - - VkPhysicalDevice m_PhysicalDevice; - VMA_ATOMIC_UINT32 m_CurrentFrameIndex; - VMA_ATOMIC_UINT32 m_GpuDefragmentationMemoryTypeBits; // UINT32_MAX means uninitialized. -#if VMA_EXTERNAL_MEMORY - VkExternalMemoryHandleTypeFlagsKHR m_TypeExternalMemoryHandleTypes[VK_MAX_MEMORY_TYPES]; -#endif // #if VMA_EXTERNAL_MEMORY - - VMA_RW_MUTEX m_PoolsMutex; - typedef VmaIntrusiveLinkedList PoolList; - // Protected by m_PoolsMutex. - PoolList m_Pools; - uint32_t m_NextPoolId; - - VmaVulkanFunctions m_VulkanFunctions; - - // Global bit mask AND-ed with any memoryTypeBits to disallow certain memory types. 
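FlushOrInvalidateAllocation above ultimately has to hand Vulkan a VkMappedMemoryRange whose offset and size respect nonCoherentAtomSize for HOST_VISIBLE but non-COHERENT memory types. A sketch of that rounding with names of our own; the real GetFlushOrInvalidateRange also clamps against the VkDeviceMemory size and can fall back to VK_WHOLE_SIZE:

```
// Round a sub-range of a mapped allocation to nonCoherentAtomSize boundaries,
// as required by vkFlushMappedMemoryRanges / vkInvalidateMappedMemoryRanges.
#include <vulkan/vulkan.h>

VkMappedMemoryRange MakeNonCoherentRangeSketch(
    VkDeviceMemory memory,
    VkDeviceSize allocOffset,   // allocation offset inside the VkDeviceMemory
    VkDeviceSize offset,        // caller's offset inside the allocation
    VkDeviceSize size,          // caller's size
    VkDeviceSize atom)          // VkPhysicalDeviceLimits::nonCoherentAtomSize
{
    VkMappedMemoryRange range = { VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE };
    range.memory = memory;

    const VkDeviceSize begin = allocOffset + offset;
    range.offset = begin - begin % atom;                           // round down to atom
    const VkDeviceSize end = begin + size;
    range.size = ((end - range.offset) + atom - 1) / atom * atom;  // round up to atom
    return range;
}
```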
- uint32_t m_GlobalMemoryTypeBits; - - void ImportVulkanFunctions(const VmaVulkanFunctions* pVulkanFunctions); - -#if VMA_STATIC_VULKAN_FUNCTIONS == 1 - void ImportVulkanFunctions_Static(); -#endif - - void ImportVulkanFunctions_Custom(const VmaVulkanFunctions* pVulkanFunctions); - -#if VMA_DYNAMIC_VULKAN_FUNCTIONS == 1 - void ImportVulkanFunctions_Dynamic(); -#endif - - void ValidateVulkanFunctions(); - - VkDeviceSize CalcPreferredBlockSize(uint32_t memTypeIndex); - - VkResult AllocateMemoryOfType( - VmaPool pool, - VkDeviceSize size, - VkDeviceSize alignment, - bool dedicatedPreferred, - VkBuffer dedicatedBuffer, - VkImage dedicatedImage, - VkFlags dedicatedBufferImageUsage, - const VmaAllocationCreateInfo& createInfo, - uint32_t memTypeIndex, - VmaSuballocationType suballocType, - VmaDedicatedAllocationList& dedicatedAllocations, - VmaBlockVector& blockVector, - size_t allocationCount, - VmaAllocation* pAllocations); - - // Helper function only to be used inside AllocateDedicatedMemory. - VkResult AllocateDedicatedMemoryPage( - VmaPool pool, - VkDeviceSize size, - VmaSuballocationType suballocType, - uint32_t memTypeIndex, - const VkMemoryAllocateInfo& allocInfo, - bool map, - bool isUserDataString, - bool isMappingAllowed, - void* pUserData, - VmaAllocation* pAllocation); - - // Allocates and registers new VkDeviceMemory specifically for dedicated allocations. - VkResult AllocateDedicatedMemory( - VmaPool pool, - VkDeviceSize size, - VmaSuballocationType suballocType, - VmaDedicatedAllocationList& dedicatedAllocations, - uint32_t memTypeIndex, - bool map, - bool isUserDataString, - bool isMappingAllowed, - bool canAliasMemory, - void* pUserData, - float priority, - VkBuffer dedicatedBuffer, - VkImage dedicatedImage, - VkFlags dedicatedBufferImageUsage, - size_t allocationCount, - VmaAllocation* pAllocations, - const void* pNextChain = nullptr); - - void FreeDedicatedMemory(const VmaAllocation allocation); - - VkResult CalcMemTypeParams( - VmaAllocationCreateInfo& outCreateInfo, - uint32_t memTypeIndex, - VkDeviceSize size, - size_t allocationCount); - VkResult CalcAllocationParams( - VmaAllocationCreateInfo& outCreateInfo, - bool dedicatedRequired, - bool dedicatedPreferred); - - /* - Calculates and returns bit mask of memory types that can support defragmentation - on GPU as they support creation of required buffer for copy operations. 
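The comment above describes deriving a memory-type mask from the ability to create the buffer needed for copy operations. The usual way to obtain such a mask in Vulkan, and a reasonable reading of what CalculateGpuDefragmentationMemoryTypeBits does, is to create a throwaway transfer buffer and read memoryTypeBits back from its requirements; a hedged sketch:

```
// Query which memory types can back a TRANSFER_SRC|TRANSFER_DST buffer.
#include <vulkan/vulkan.h>
#include <cstdint>

uint32_t QueryCopyCapableMemoryTypesSketch(VkDevice device)
{
    VkBufferCreateInfo bufInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
    bufInfo.size = 0x10000; // arbitrary small size; only the requirements matter
    bufInfo.usage = VK_BUFFER_USAGE_TRANSFER_SRC_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT;
    bufInfo.sharingMode = VK_SHARING_MODE_EXCLUSIVE;

    VkBuffer buf = VK_NULL_HANDLE;
    if (vkCreateBuffer(device, &bufInfo, nullptr, &buf) != VK_SUCCESS)
        return 0;

    VkMemoryRequirements memReq = {};
    vkGetBufferMemoryRequirements(device, buf, &memReq);
    vkDestroyBuffer(device, buf, nullptr);
    return memReq.memoryTypeBits;
}
```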
- */ - uint32_t CalculateGpuDefragmentationMemoryTypeBits() const; - uint32_t CalculateGlobalMemoryTypeBits() const; - - bool GetFlushOrInvalidateRange( - VmaAllocation allocation, - VkDeviceSize offset, VkDeviceSize size, - VkMappedMemoryRange& outRange) const; - -#if VMA_MEMORY_BUDGET - void UpdateVulkanBudget(); -#endif // #if VMA_MEMORY_BUDGET -}; - - -#ifndef _VMA_MEMORY_FUNCTIONS -static void* VmaMalloc(VmaAllocator hAllocator, size_t size, size_t alignment) -{ - return VmaMalloc(&hAllocator->m_AllocationCallbacks, size, alignment); -} - -static void VmaFree(VmaAllocator hAllocator, void* ptr) -{ - VmaFree(&hAllocator->m_AllocationCallbacks, ptr); -} - -template -static T* VmaAllocate(VmaAllocator hAllocator) -{ - return (T*)VmaMalloc(hAllocator, sizeof(T), VMA_ALIGN_OF(T)); -} - -template -static T* VmaAllocateArray(VmaAllocator hAllocator, size_t count) -{ - return (T*)VmaMalloc(hAllocator, sizeof(T) * count, VMA_ALIGN_OF(T)); -} - -template -static void vma_delete(VmaAllocator hAllocator, T* ptr) -{ - if(ptr != VMA_NULL) - { - ptr->~T(); - VmaFree(hAllocator, ptr); - } -} - -template -static void vma_delete_array(VmaAllocator hAllocator, T* ptr, size_t count) -{ - if(ptr != VMA_NULL) - { - for(size_t i = count; i--; ) - ptr[i].~T(); - VmaFree(hAllocator, ptr); - } -} -#endif // _VMA_MEMORY_FUNCTIONS - -#ifndef _VMA_DEVICE_MEMORY_BLOCK_FUNCTIONS -VmaDeviceMemoryBlock::VmaDeviceMemoryBlock(VmaAllocator hAllocator) - : m_pMetadata(VMA_NULL), - m_MemoryTypeIndex(UINT32_MAX), - m_Id(0), - m_hMemory(VK_NULL_HANDLE), - m_MapCount(0), - m_pMappedData(VMA_NULL) {} - -VmaDeviceMemoryBlock::~VmaDeviceMemoryBlock() -{ - VMA_ASSERT(m_MapCount == 0 && "VkDeviceMemory block is being destroyed while it is still mapped."); - VMA_ASSERT(m_hMemory == VK_NULL_HANDLE); -} - -void VmaDeviceMemoryBlock::Init( - VmaAllocator hAllocator, - VmaPool hParentPool, - uint32_t newMemoryTypeIndex, - VkDeviceMemory newMemory, - VkDeviceSize newSize, - uint32_t id, - uint32_t algorithm, - VkDeviceSize bufferImageGranularity) -{ - VMA_ASSERT(m_hMemory == VK_NULL_HANDLE); - - m_hParentPool = hParentPool; - m_MemoryTypeIndex = newMemoryTypeIndex; - m_Id = id; - m_hMemory = newMemory; - - switch (algorithm) - { - case VMA_POOL_CREATE_LINEAR_ALGORITHM_BIT: - m_pMetadata = vma_new(hAllocator, VmaBlockMetadata_Linear)(hAllocator->GetAllocationCallbacks(), - bufferImageGranularity, false); // isVirtual - break; - default: - VMA_ASSERT(0); - // Fall-through. - case 0: - m_pMetadata = vma_new(hAllocator, VmaBlockMetadata_TLSF)(hAllocator->GetAllocationCallbacks(), - bufferImageGranularity, false); // isVirtual - } - m_pMetadata->Init(newSize); -} - -void VmaDeviceMemoryBlock::Destroy(VmaAllocator allocator) -{ - // Define macro VMA_DEBUG_LOG to receive the list of the unfreed allocations - if (!m_pMetadata->IsEmpty()) - m_pMetadata->DebugLogAllAllocations(); - // This is the most important assert in the entire library. - // Hitting it means you have some memory leak - unreleased VmaAllocation objects. 
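The memory helpers above (VmaAllocate, vma_delete, vma_delete_array) separate raw allocation through the VkAllocationCallbacks from object lifetime: construction is a placement-new, destruction is an explicit destructor call followed by a free. A self-contained sketch of that pattern over malloc/free, with names that are ours rather than VMA's:

```
// Allocate-construct / destroy-free pattern, simplified to malloc/free.
#include <cstdlib>
#include <new>
#include <utility>

template<typename T, typename... Args>
T* SketchNew(Args&&... args)
{
    void* mem = std::malloc(sizeof(T));
    if (!mem) throw std::bad_alloc();
    return new (mem) T(std::forward<Args>(args)...); // placement-new, like vma_new
}

template<typename T>
void SketchDelete(T* ptr)
{
    if (ptr)
    {
        ptr->~T();     // explicit destructor call, like vma_delete
        std::free(ptr);
    }
}
```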
- VMA_ASSERT(m_pMetadata->IsEmpty() && "Some allocations were not freed before destruction of this memory block!"); - - VMA_ASSERT(m_hMemory != VK_NULL_HANDLE); - allocator->FreeVulkanMemory(m_MemoryTypeIndex, m_pMetadata->GetSize(), m_hMemory); - m_hMemory = VK_NULL_HANDLE; - - vma_delete(allocator, m_pMetadata); - m_pMetadata = VMA_NULL; -} - -void VmaDeviceMemoryBlock::PostFree(VmaAllocator hAllocator) -{ - if(m_MappingHysteresis.PostFree()) - { - VMA_ASSERT(m_MappingHysteresis.GetExtraMapping() == 0); - if (m_MapCount == 0) - { - m_pMappedData = VMA_NULL; - (*hAllocator->GetVulkanFunctions().vkUnmapMemory)(hAllocator->m_hDevice, m_hMemory); - } - } -} - -bool VmaDeviceMemoryBlock::Validate() const -{ - VMA_VALIDATE((m_hMemory != VK_NULL_HANDLE) && - (m_pMetadata->GetSize() != 0)); - - return m_pMetadata->Validate(); -} - -VkResult VmaDeviceMemoryBlock::CheckCorruption(VmaAllocator hAllocator) -{ - void* pData = nullptr; - VkResult res = Map(hAllocator, 1, &pData); - if (res != VK_SUCCESS) - { - return res; - } - - res = m_pMetadata->CheckCorruption(pData); - - Unmap(hAllocator, 1); - - return res; -} - -VkResult VmaDeviceMemoryBlock::Map(VmaAllocator hAllocator, uint32_t count, void** ppData) -{ - if (count == 0) - { - return VK_SUCCESS; - } - - VmaMutexLock lock(m_MapAndBindMutex, hAllocator->m_UseMutex); - const uint32_t oldTotalMapCount = m_MapCount + m_MappingHysteresis.GetExtraMapping(); - m_MappingHysteresis.PostMap(); - if (oldTotalMapCount != 0) - { - m_MapCount += count; - VMA_ASSERT(m_pMappedData != VMA_NULL); - if (ppData != VMA_NULL) - { - *ppData = m_pMappedData; - } - return VK_SUCCESS; - } - else - { - VkResult result = (*hAllocator->GetVulkanFunctions().vkMapMemory)( - hAllocator->m_hDevice, - m_hMemory, - 0, // offset - VK_WHOLE_SIZE, - 0, // flags - &m_pMappedData); - if (result == VK_SUCCESS) - { - if (ppData != VMA_NULL) - { - *ppData = m_pMappedData; - } - m_MapCount = count; - } - return result; - } -} - -void VmaDeviceMemoryBlock::Unmap(VmaAllocator hAllocator, uint32_t count) -{ - if (count == 0) - { - return; - } - - VmaMutexLock lock(m_MapAndBindMutex, hAllocator->m_UseMutex); - if (m_MapCount >= count) - { - m_MapCount -= count; - const uint32_t totalMapCount = m_MapCount + m_MappingHysteresis.GetExtraMapping(); - if (totalMapCount == 0) - { - m_pMappedData = VMA_NULL; - (*hAllocator->GetVulkanFunctions().vkUnmapMemory)(hAllocator->m_hDevice, m_hMemory); - } - m_MappingHysteresis.PostUnmap(); - } - else - { - VMA_ASSERT(0 && "VkDeviceMemory block is being unmapped while it was not previously mapped."); - } -} - -VkResult VmaDeviceMemoryBlock::WriteMagicValueAfterAllocation(VmaAllocator hAllocator, VkDeviceSize allocOffset, VkDeviceSize allocSize) -{ - VMA_ASSERT(VMA_DEBUG_MARGIN > 0 && VMA_DEBUG_MARGIN % 4 == 0 && VMA_DEBUG_DETECT_CORRUPTION); - - void* pData; - VkResult res = Map(hAllocator, 1, &pData); - if (res != VK_SUCCESS) - { - return res; - } - - VmaWriteMagicValue(pData, allocOffset + allocSize); - - Unmap(hAllocator, 1); - return VK_SUCCESS; -} - -VkResult VmaDeviceMemoryBlock::ValidateMagicValueAfterAllocation(VmaAllocator hAllocator, VkDeviceSize allocOffset, VkDeviceSize allocSize) -{ - VMA_ASSERT(VMA_DEBUG_MARGIN > 0 && VMA_DEBUG_MARGIN % 4 == 0 && VMA_DEBUG_DETECT_CORRUPTION); - - void* pData; - VkResult res = Map(hAllocator, 1, &pData); - if (res != VK_SUCCESS) - { - return res; - } - - if (!VmaValidateMagicValue(pData, allocOffset + allocSize)) - { - VMA_ASSERT(0 && "MEMORY CORRUPTION DETECTED AFTER FREED ALLOCATION!"); - } - - 
Unmap(hAllocator, 1); - return VK_SUCCESS; -} - -VkResult VmaDeviceMemoryBlock::BindBufferMemory( - const VmaAllocator hAllocator, - const VmaAllocation hAllocation, - VkDeviceSize allocationLocalOffset, - VkBuffer hBuffer, - const void* pNext) -{ - VMA_ASSERT(hAllocation->GetType() == VmaAllocation_T::ALLOCATION_TYPE_BLOCK && - hAllocation->GetBlock() == this); - VMA_ASSERT(allocationLocalOffset < hAllocation->GetSize() && - "Invalid allocationLocalOffset. Did you forget that this offset is relative to the beginning of the allocation, not the whole memory block?"); - const VkDeviceSize memoryOffset = hAllocation->GetOffset() + allocationLocalOffset; - // This lock is important so that we don't call vkBind... and/or vkMap... simultaneously on the same VkDeviceMemory from multiple threads. - VmaMutexLock lock(m_MapAndBindMutex, hAllocator->m_UseMutex); - return hAllocator->BindVulkanBuffer(m_hMemory, memoryOffset, hBuffer, pNext); -} - -VkResult VmaDeviceMemoryBlock::BindImageMemory( - const VmaAllocator hAllocator, - const VmaAllocation hAllocation, - VkDeviceSize allocationLocalOffset, - VkImage hImage, - const void* pNext) -{ - VMA_ASSERT(hAllocation->GetType() == VmaAllocation_T::ALLOCATION_TYPE_BLOCK && - hAllocation->GetBlock() == this); - VMA_ASSERT(allocationLocalOffset < hAllocation->GetSize() && - "Invalid allocationLocalOffset. Did you forget that this offset is relative to the beginning of the allocation, not the whole memory block?"); - const VkDeviceSize memoryOffset = hAllocation->GetOffset() + allocationLocalOffset; - // This lock is important so that we don't call vkBind... and/or vkMap... simultaneously on the same VkDeviceMemory from multiple threads. - VmaMutexLock lock(m_MapAndBindMutex, hAllocator->m_UseMutex); - return hAllocator->BindVulkanImage(m_hMemory, memoryOffset, hImage, pNext); -} -#endif // _VMA_DEVICE_MEMORY_BLOCK_FUNCTIONS - -#ifndef _VMA_ALLOCATION_T_FUNCTIONS -VmaAllocation_T::VmaAllocation_T(bool mappingAllowed) - : m_Alignment{ 1 }, - m_Size{ 0 }, - m_pUserData{ VMA_NULL }, - m_pName{ VMA_NULL }, - m_MemoryTypeIndex{ 0 }, - m_Type{ (uint8_t)ALLOCATION_TYPE_NONE }, - m_SuballocationType{ (uint8_t)VMA_SUBALLOCATION_TYPE_UNKNOWN }, - m_MapCount{ 0 }, - m_Flags{ 0 } -{ - if(mappingAllowed) - m_Flags |= (uint8_t)FLAG_MAPPING_ALLOWED; - -#if VMA_STATS_STRING_ENABLED - m_BufferImageUsage = 0; -#endif -} - -VmaAllocation_T::~VmaAllocation_T() -{ - VMA_ASSERT(m_MapCount == 0 && "Allocation was not unmapped before destruction."); - - // Check if owned string was freed. - VMA_ASSERT(m_pName == VMA_NULL); -} - -void VmaAllocation_T::InitBlockAllocation( - VmaDeviceMemoryBlock* block, - VmaAllocHandle allocHandle, - VkDeviceSize alignment, - VkDeviceSize size, - uint32_t memoryTypeIndex, - VmaSuballocationType suballocationType, - bool mapped) -{ - VMA_ASSERT(m_Type == ALLOCATION_TYPE_NONE); - VMA_ASSERT(block != VMA_NULL); - m_Type = (uint8_t)ALLOCATION_TYPE_BLOCK; - m_Alignment = alignment; - m_Size = size; - m_MemoryTypeIndex = memoryTypeIndex; - if(mapped) - { - VMA_ASSERT(IsMappingAllowed() && "Mapping is not allowed on this allocation! 
Please use one of the new VMA_ALLOCATION_CREATE_HOST_ACCESS_* flags when creating it."); - m_Flags |= (uint8_t)FLAG_PERSISTENT_MAP; - } - m_SuballocationType = (uint8_t)suballocationType; - m_BlockAllocation.m_Block = block; - m_BlockAllocation.m_AllocHandle = allocHandle; -} - -void VmaAllocation_T::InitDedicatedAllocation( - VmaPool hParentPool, - uint32_t memoryTypeIndex, - VkDeviceMemory hMemory, - VmaSuballocationType suballocationType, - void* pMappedData, - VkDeviceSize size) -{ - VMA_ASSERT(m_Type == ALLOCATION_TYPE_NONE); - VMA_ASSERT(hMemory != VK_NULL_HANDLE); - m_Type = (uint8_t)ALLOCATION_TYPE_DEDICATED; - m_Alignment = 0; - m_Size = size; - m_MemoryTypeIndex = memoryTypeIndex; - m_SuballocationType = (uint8_t)suballocationType; - if(pMappedData != VMA_NULL) - { - VMA_ASSERT(IsMappingAllowed() && "Mapping is not allowed on this allocation! Please use one of the new VMA_ALLOCATION_CREATE_HOST_ACCESS_* flags when creating it."); - m_Flags |= (uint8_t)FLAG_PERSISTENT_MAP; - } - m_DedicatedAllocation.m_hParentPool = hParentPool; - m_DedicatedAllocation.m_hMemory = hMemory; - m_DedicatedAllocation.m_pMappedData = pMappedData; - m_DedicatedAllocation.m_Prev = VMA_NULL; - m_DedicatedAllocation.m_Next = VMA_NULL; -} - -void VmaAllocation_T::SetName(VmaAllocator hAllocator, const char* pName) -{ - VMA_ASSERT(pName == VMA_NULL || pName != m_pName); - - FreeName(hAllocator); - - if (pName != VMA_NULL) - m_pName = VmaCreateStringCopy(hAllocator->GetAllocationCallbacks(), pName); -} - -uint8_t VmaAllocation_T::SwapBlockAllocation(VmaAllocator hAllocator, VmaAllocation allocation) -{ - VMA_ASSERT(allocation != VMA_NULL); - VMA_ASSERT(m_Type == ALLOCATION_TYPE_BLOCK); - VMA_ASSERT(allocation->m_Type == ALLOCATION_TYPE_BLOCK); - - if (m_MapCount != 0) - m_BlockAllocation.m_Block->Unmap(hAllocator, m_MapCount); - - m_BlockAllocation.m_Block->m_pMetadata->SetAllocationUserData(m_BlockAllocation.m_AllocHandle, allocation); - VMA_SWAP(m_BlockAllocation, allocation->m_BlockAllocation); - m_BlockAllocation.m_Block->m_pMetadata->SetAllocationUserData(m_BlockAllocation.m_AllocHandle, this); - -#if VMA_STATS_STRING_ENABLED - VMA_SWAP(m_BufferImageUsage, allocation->m_BufferImageUsage); -#endif - return m_MapCount; -} - -VmaAllocHandle VmaAllocation_T::GetAllocHandle() const -{ - switch (m_Type) - { - case ALLOCATION_TYPE_BLOCK: - return m_BlockAllocation.m_AllocHandle; - case ALLOCATION_TYPE_DEDICATED: - return VK_NULL_HANDLE; - default: - VMA_ASSERT(0); - return VK_NULL_HANDLE; - } -} - -VkDeviceSize VmaAllocation_T::GetOffset() const -{ - switch (m_Type) - { - case ALLOCATION_TYPE_BLOCK: - return m_BlockAllocation.m_Block->m_pMetadata->GetAllocationOffset(m_BlockAllocation.m_AllocHandle); - case ALLOCATION_TYPE_DEDICATED: - return 0; - default: - VMA_ASSERT(0); - return 0; - } -} - -VmaPool VmaAllocation_T::GetParentPool() const -{ - switch (m_Type) - { - case ALLOCATION_TYPE_BLOCK: - return m_BlockAllocation.m_Block->GetParentPool(); - case ALLOCATION_TYPE_DEDICATED: - return m_DedicatedAllocation.m_hParentPool; - default: - VMA_ASSERT(0); - return VK_NULL_HANDLE; - } -} - -VkDeviceMemory VmaAllocation_T::GetMemory() const -{ - switch (m_Type) - { - case ALLOCATION_TYPE_BLOCK: - return m_BlockAllocation.m_Block->GetDeviceMemory(); - case ALLOCATION_TYPE_DEDICATED: - return m_DedicatedAllocation.m_hMemory; - default: - VMA_ASSERT(0); - return VK_NULL_HANDLE; - } -} - -void* VmaAllocation_T::GetMappedData() const -{ - switch (m_Type) - { - case ALLOCATION_TYPE_BLOCK: - if (m_MapCount != 0 || 
IsPersistentMap()) - { - void* pBlockData = m_BlockAllocation.m_Block->GetMappedData(); - VMA_ASSERT(pBlockData != VMA_NULL); - return (char*)pBlockData + GetOffset(); - } - else - { - return VMA_NULL; - } - break; - case ALLOCATION_TYPE_DEDICATED: - VMA_ASSERT((m_DedicatedAllocation.m_pMappedData != VMA_NULL) == (m_MapCount != 0 || IsPersistentMap())); - return m_DedicatedAllocation.m_pMappedData; - default: - VMA_ASSERT(0); - return VMA_NULL; - } -} - -void VmaAllocation_T::BlockAllocMap() -{ - VMA_ASSERT(GetType() == ALLOCATION_TYPE_BLOCK); - VMA_ASSERT(IsMappingAllowed() && "Mapping is not allowed on this allocation! Please use one of the new VMA_ALLOCATION_CREATE_HOST_ACCESS_* flags when creating it."); - - if (m_MapCount < 0xFF) - { - ++m_MapCount; - } - else - { - VMA_ASSERT(0 && "Allocation mapped too many times simultaneously."); - } -} - -void VmaAllocation_T::BlockAllocUnmap() -{ - VMA_ASSERT(GetType() == ALLOCATION_TYPE_BLOCK); - - if (m_MapCount > 0) - { - --m_MapCount; - } - else - { - VMA_ASSERT(0 && "Unmapping allocation not previously mapped."); - } -} - -VkResult VmaAllocation_T::DedicatedAllocMap(VmaAllocator hAllocator, void** ppData) -{ - VMA_ASSERT(GetType() == ALLOCATION_TYPE_DEDICATED); - VMA_ASSERT(IsMappingAllowed() && "Mapping is not allowed on this allocation! Please use one of the new VMA_ALLOCATION_CREATE_HOST_ACCESS_* flags when creating it."); - - if (m_MapCount != 0 || IsPersistentMap()) - { - if (m_MapCount < 0xFF) - { - VMA_ASSERT(m_DedicatedAllocation.m_pMappedData != VMA_NULL); - *ppData = m_DedicatedAllocation.m_pMappedData; - ++m_MapCount; - return VK_SUCCESS; - } - else - { - VMA_ASSERT(0 && "Dedicated allocation mapped too many times simultaneously."); - return VK_ERROR_MEMORY_MAP_FAILED; - } - } - else - { - VkResult result = (*hAllocator->GetVulkanFunctions().vkMapMemory)( - hAllocator->m_hDevice, - m_DedicatedAllocation.m_hMemory, - 0, // offset - VK_WHOLE_SIZE, - 0, // flags - ppData); - if (result == VK_SUCCESS) - { - m_DedicatedAllocation.m_pMappedData = *ppData; - m_MapCount = 1; - } - return result; - } -} - -void VmaAllocation_T::DedicatedAllocUnmap(VmaAllocator hAllocator) -{ - VMA_ASSERT(GetType() == ALLOCATION_TYPE_DEDICATED); - - if (m_MapCount > 0) - { - --m_MapCount; - if (m_MapCount == 0 && !IsPersistentMap()) - { - m_DedicatedAllocation.m_pMappedData = VMA_NULL; - (*hAllocator->GetVulkanFunctions().vkUnmapMemory)( - hAllocator->m_hDevice, - m_DedicatedAllocation.m_hMemory); - } - } - else - { - VMA_ASSERT(0 && "Unmapping dedicated allocation not previously mapped."); - } -} - -#if VMA_STATS_STRING_ENABLED -void VmaAllocation_T::InitBufferImageUsage(uint32_t bufferImageUsage) -{ - VMA_ASSERT(m_BufferImageUsage == 0); - m_BufferImageUsage = bufferImageUsage; -} - -void VmaAllocation_T::PrintParameters(class VmaJsonWriter& json) const -{ - json.WriteString("Type"); - json.WriteString(VMA_SUBALLOCATION_TYPE_NAMES[m_SuballocationType]); - - json.WriteString("Size"); - json.WriteNumber(m_Size); - json.WriteString("Usage"); - json.WriteNumber(m_BufferImageUsage); - - if (m_pUserData != VMA_NULL) - { - json.WriteString("CustomData"); - json.BeginString(); - json.ContinueString_Pointer(m_pUserData); - json.EndString(); - } - if (m_pName != VMA_NULL) - { - json.WriteString("Name"); - json.WriteString(m_pName); - } -} -#endif // VMA_STATS_STRING_ENABLED - -void VmaAllocation_T::FreeName(VmaAllocator hAllocator) -{ - if(m_pName) - { - VmaFreeString(hAllocator->GetAllocationCallbacks(), m_pName); - m_pName = VMA_NULL; - } -} -#endif // 
_VMA_ALLOCATION_T_FUNCTIONS - -#ifndef _VMA_BLOCK_VECTOR_FUNCTIONS -VmaBlockVector::VmaBlockVector( - VmaAllocator hAllocator, - VmaPool hParentPool, - uint32_t memoryTypeIndex, - VkDeviceSize preferredBlockSize, - size_t minBlockCount, - size_t maxBlockCount, - VkDeviceSize bufferImageGranularity, - bool explicitBlockSize, - uint32_t algorithm, - float priority, - VkDeviceSize minAllocationAlignment, - void* pMemoryAllocateNext) - : m_hAllocator(hAllocator), - m_hParentPool(hParentPool), - m_MemoryTypeIndex(memoryTypeIndex), - m_PreferredBlockSize(preferredBlockSize), - m_MinBlockCount(minBlockCount), - m_MaxBlockCount(maxBlockCount), - m_BufferImageGranularity(bufferImageGranularity), - m_ExplicitBlockSize(explicitBlockSize), - m_Algorithm(algorithm), - m_Priority(priority), - m_MinAllocationAlignment(minAllocationAlignment), - m_pMemoryAllocateNext(pMemoryAllocateNext), - m_Blocks(VmaStlAllocator(hAllocator->GetAllocationCallbacks())), - m_NextBlockId(0) {} - -VmaBlockVector::~VmaBlockVector() -{ - for (size_t i = m_Blocks.size(); i--; ) - { - m_Blocks[i]->Destroy(m_hAllocator); - vma_delete(m_hAllocator, m_Blocks[i]); - } -} - -VkResult VmaBlockVector::CreateMinBlocks() -{ - for (size_t i = 0; i < m_MinBlockCount; ++i) - { - VkResult res = CreateBlock(m_PreferredBlockSize, VMA_NULL); - if (res != VK_SUCCESS) - { - return res; - } - } - return VK_SUCCESS; -} - -void VmaBlockVector::AddStatistics(VmaStatistics& inoutStats) -{ - VmaMutexLockRead lock(m_Mutex, m_hAllocator->m_UseMutex); - - const size_t blockCount = m_Blocks.size(); - for (uint32_t blockIndex = 0; blockIndex < blockCount; ++blockIndex) - { - const VmaDeviceMemoryBlock* const pBlock = m_Blocks[blockIndex]; - VMA_ASSERT(pBlock); - VMA_HEAVY_ASSERT(pBlock->Validate()); - pBlock->m_pMetadata->AddStatistics(inoutStats); - } -} - -void VmaBlockVector::AddDetailedStatistics(VmaDetailedStatistics& inoutStats) -{ - VmaMutexLockRead lock(m_Mutex, m_hAllocator->m_UseMutex); - - const size_t blockCount = m_Blocks.size(); - for (uint32_t blockIndex = 0; blockIndex < blockCount; ++blockIndex) - { - const VmaDeviceMemoryBlock* const pBlock = m_Blocks[blockIndex]; - VMA_ASSERT(pBlock); - VMA_HEAVY_ASSERT(pBlock->Validate()); - pBlock->m_pMetadata->AddDetailedStatistics(inoutStats); - } -} - -bool VmaBlockVector::IsEmpty() -{ - VmaMutexLockRead lock(m_Mutex, m_hAllocator->m_UseMutex); - return m_Blocks.empty(); -} - -bool VmaBlockVector::IsCorruptionDetectionEnabled() const -{ - const uint32_t requiredMemFlags = VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT; - return (VMA_DEBUG_DETECT_CORRUPTION != 0) && - (VMA_DEBUG_MARGIN > 0) && - (m_Algorithm == 0 || m_Algorithm == VMA_POOL_CREATE_LINEAR_ALGORITHM_BIT) && - (m_hAllocator->m_MemProps.memoryTypes[m_MemoryTypeIndex].propertyFlags & requiredMemFlags) == requiredMemFlags; -} - -VkResult VmaBlockVector::Allocate( - VkDeviceSize size, - VkDeviceSize alignment, - const VmaAllocationCreateInfo& createInfo, - VmaSuballocationType suballocType, - size_t allocationCount, - VmaAllocation* pAllocations) -{ - size_t allocIndex; - VkResult res = VK_SUCCESS; - - alignment = VMA_MAX(alignment, m_MinAllocationAlignment); - - if (IsCorruptionDetectionEnabled()) - { - size = VmaAlignUp(size, sizeof(VMA_CORRUPTION_DETECTION_MAGIC_VALUE)); - alignment = VmaAlignUp(alignment, sizeof(VMA_CORRUPTION_DETECTION_MAGIC_VALUE)); - } - - { - VmaMutexLockWrite lock(m_Mutex, m_hAllocator->m_UseMutex); - for (allocIndex = 0; allocIndex < allocationCount; ++allocIndex) - { - res = 
AllocatePage( - size, - alignment, - createInfo, - suballocType, - pAllocations + allocIndex); - if (res != VK_SUCCESS) - { - break; - } - } - } - - if (res != VK_SUCCESS) - { - // Free all already created allocations. - while (allocIndex--) - Free(pAllocations[allocIndex]); - memset(pAllocations, 0, sizeof(VmaAllocation) * allocationCount); - } - - return res; -} - -VkResult VmaBlockVector::AllocatePage( - VkDeviceSize size, - VkDeviceSize alignment, - const VmaAllocationCreateInfo& createInfo, - VmaSuballocationType suballocType, - VmaAllocation* pAllocation) -{ - const bool isUpperAddress = (createInfo.flags & VMA_ALLOCATION_CREATE_UPPER_ADDRESS_BIT) != 0; - - VkDeviceSize freeMemory; - { - const uint32_t heapIndex = m_hAllocator->MemoryTypeIndexToHeapIndex(m_MemoryTypeIndex); - VmaBudget heapBudget = {}; - m_hAllocator->GetHeapBudgets(&heapBudget, heapIndex, 1); - freeMemory = (heapBudget.usage < heapBudget.budget) ? (heapBudget.budget - heapBudget.usage) : 0; - } - - const bool canFallbackToDedicated = !HasExplicitBlockSize() && - (createInfo.flags & VMA_ALLOCATION_CREATE_NEVER_ALLOCATE_BIT) == 0; - const bool canCreateNewBlock = - ((createInfo.flags & VMA_ALLOCATION_CREATE_NEVER_ALLOCATE_BIT) == 0) && - (m_Blocks.size() < m_MaxBlockCount) && - (freeMemory >= size || !canFallbackToDedicated); - uint32_t strategy = createInfo.flags & VMA_ALLOCATION_CREATE_STRATEGY_MASK; - - // Upper address can only be used with linear allocator and within single memory block. - if (isUpperAddress && - (m_Algorithm != VMA_POOL_CREATE_LINEAR_ALGORITHM_BIT || m_MaxBlockCount > 1)) - { - return VK_ERROR_FEATURE_NOT_PRESENT; - } - - // Early reject: requested allocation size is larger that maximum block size for this block vector. - if (size + VMA_DEBUG_MARGIN > m_PreferredBlockSize) - { - return VK_ERROR_OUT_OF_DEVICE_MEMORY; - } - - // 1. Search existing allocations. Try to allocate. - if (m_Algorithm == VMA_POOL_CREATE_LINEAR_ALGORITHM_BIT) - { - // Use only last block. - if (!m_Blocks.empty()) - { - VmaDeviceMemoryBlock* const pCurrBlock = m_Blocks.back(); - VMA_ASSERT(pCurrBlock); - VkResult res = AllocateFromBlock( - pCurrBlock, size, alignment, createInfo.flags, createInfo.pUserData, suballocType, strategy, pAllocation); - if (res == VK_SUCCESS) - { - VMA_DEBUG_LOG(" Returned from last block #%u", pCurrBlock->GetId()); - IncrementallySortBlocks(); - return VK_SUCCESS; - } - } - } - else - { - if (strategy != VMA_ALLOCATION_CREATE_STRATEGY_MIN_TIME_BIT) // MIN_MEMORY or default - { - const bool isHostVisible = - (m_hAllocator->m_MemProps.memoryTypes[m_MemoryTypeIndex].propertyFlags & VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT) != 0; - if(isHostVisible) - { - const bool isMappingAllowed = (createInfo.flags & - (VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT | VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT)) != 0; - /* - For non-mappable allocations, check blocks that are not mapped first. - For mappable allocations, check blocks that are already mapped first. - This way, having many blocks, we will separate mappable and non-mappable allocations, - hopefully limiting the number of blocks that are mapped, which will help tools like RenderDoc. - */ - for(size_t mappingI = 0; mappingI < 2; ++mappingI) - { - // Forward order in m_Blocks - prefer blocks with smallest amount of free space. 
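The comment above describes a two-pass preference: mappable requests look at already-mapped blocks first, everything else looks at unmapped blocks first. A minimal standalone sketch of the predicate used in the loop that follows (illustrative only, with made-up block states; it is not part of vk_mem_alloc.h):

```
// Illustrative sketch only, not part of vk_mem_alloc.h. It mirrors the predicate
// used in the loop below, (mappingI == 0) == (isMappingAllowed == isBlockMapped),
// with made-up block states.
#include <cstdio>
#include <vector>

int main()
{
    const bool isMappingAllowed = true;  // assume the request asked for host access
    const std::vector<bool> blockIsMapped = { false, true, false, true };

    for (int mappingI = 0; mappingI < 2; ++mappingI)
        for (size_t i = 0; i < blockIsMapped.size(); ++i)
            if ((mappingI == 0) == (isMappingAllowed == blockIsMapped[i]))
                std::printf("pass %d: try block %zu (mapped=%d)\n",
                            mappingI, i, static_cast<int>(blockIsMapped[i]));
    // Pass 0 tries blocks 1 and 3 (already mapped), pass 1 tries blocks 0 and 2,
    // so mappable and non-mappable allocations tend to end up in separate blocks.
    return 0;
}
```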
- for (size_t blockIndex = 0; blockIndex < m_Blocks.size(); ++blockIndex) - { - VmaDeviceMemoryBlock* const pCurrBlock = m_Blocks[blockIndex]; - VMA_ASSERT(pCurrBlock); - const bool isBlockMapped = pCurrBlock->GetMappedData() != VMA_NULL; - if((mappingI == 0) == (isMappingAllowed == isBlockMapped)) - { - VkResult res = AllocateFromBlock( - pCurrBlock, size, alignment, createInfo.flags, createInfo.pUserData, suballocType, strategy, pAllocation); - if (res == VK_SUCCESS) - { - VMA_DEBUG_LOG(" Returned from existing block #%u", pCurrBlock->GetId()); - IncrementallySortBlocks(); - return VK_SUCCESS; - } - } - } - } - } - else - { - // Forward order in m_Blocks - prefer blocks with smallest amount of free space. - for (size_t blockIndex = 0; blockIndex < m_Blocks.size(); ++blockIndex) - { - VmaDeviceMemoryBlock* const pCurrBlock = m_Blocks[blockIndex]; - VMA_ASSERT(pCurrBlock); - VkResult res = AllocateFromBlock( - pCurrBlock, size, alignment, createInfo.flags, createInfo.pUserData, suballocType, strategy, pAllocation); - if (res == VK_SUCCESS) - { - VMA_DEBUG_LOG(" Returned from existing block #%u", pCurrBlock->GetId()); - IncrementallySortBlocks(); - return VK_SUCCESS; - } - } - } - } - else // VMA_ALLOCATION_CREATE_STRATEGY_MIN_TIME_BIT - { - // Backward order in m_Blocks - prefer blocks with largest amount of free space. - for (size_t blockIndex = m_Blocks.size(); blockIndex--; ) - { - VmaDeviceMemoryBlock* const pCurrBlock = m_Blocks[blockIndex]; - VMA_ASSERT(pCurrBlock); - VkResult res = AllocateFromBlock(pCurrBlock, size, alignment, createInfo.flags, createInfo.pUserData, suballocType, strategy, pAllocation); - if (res == VK_SUCCESS) - { - VMA_DEBUG_LOG(" Returned from existing block #%u", pCurrBlock->GetId()); - IncrementallySortBlocks(); - return VK_SUCCESS; - } - } - } - } - - // 2. Try to create new block. - if (canCreateNewBlock) - { - // Calculate optimal size for new block. - VkDeviceSize newBlockSize = m_PreferredBlockSize; - uint32_t newBlockSizeShift = 0; - const uint32_t NEW_BLOCK_SIZE_SHIFT_MAX = 3; - - if (!m_ExplicitBlockSize) - { - // Allocate 1/8, 1/4, 1/2 as first blocks. - const VkDeviceSize maxExistingBlockSize = CalcMaxBlockSize(); - for (uint32_t i = 0; i < NEW_BLOCK_SIZE_SHIFT_MAX; ++i) - { - const VkDeviceSize smallerNewBlockSize = newBlockSize / 2; - if (smallerNewBlockSize > maxExistingBlockSize && smallerNewBlockSize >= size * 2) - { - newBlockSize = smallerNewBlockSize; - ++newBlockSizeShift; - } - else - { - break; - } - } - } - - size_t newBlockIndex = 0; - VkResult res = (newBlockSize <= freeMemory || !canFallbackToDedicated) ? - CreateBlock(newBlockSize, &newBlockIndex) : VK_ERROR_OUT_OF_DEVICE_MEMORY; - // Allocation of this size failed? Try 1/2, 1/4, 1/8 of m_PreferredBlockSize. - if (!m_ExplicitBlockSize) - { - while (res < 0 && newBlockSizeShift < NEW_BLOCK_SIZE_SHIFT_MAX) - { - const VkDeviceSize smallerNewBlockSize = newBlockSize / 2; - if (smallerNewBlockSize >= size) - { - newBlockSize = smallerNewBlockSize; - ++newBlockSizeShift; - res = (newBlockSize <= freeMemory || !canFallbackToDedicated) ? 
- CreateBlock(newBlockSize, &newBlockIndex) : VK_ERROR_OUT_OF_DEVICE_MEMORY; - } - else - { - break; - } - } - } - - if (res == VK_SUCCESS) - { - VmaDeviceMemoryBlock* const pBlock = m_Blocks[newBlockIndex]; - VMA_ASSERT(pBlock->m_pMetadata->GetSize() >= size); - - res = AllocateFromBlock( - pBlock, size, alignment, createInfo.flags, createInfo.pUserData, suballocType, strategy, pAllocation); - if (res == VK_SUCCESS) - { - VMA_DEBUG_LOG(" Created new block #%u Size=%llu", pBlock->GetId(), newBlockSize); - IncrementallySortBlocks(); - return VK_SUCCESS; - } - else - { - // Allocation from new block failed, possibly due to VMA_DEBUG_MARGIN or alignment. - return VK_ERROR_OUT_OF_DEVICE_MEMORY; - } - } - } - - return VK_ERROR_OUT_OF_DEVICE_MEMORY; -} - -void VmaBlockVector::Free(const VmaAllocation hAllocation) -{ - VmaDeviceMemoryBlock* pBlockToDelete = VMA_NULL; - - bool budgetExceeded = false; - { - const uint32_t heapIndex = m_hAllocator->MemoryTypeIndexToHeapIndex(m_MemoryTypeIndex); - VmaBudget heapBudget = {}; - m_hAllocator->GetHeapBudgets(&heapBudget, heapIndex, 1); - budgetExceeded = heapBudget.usage >= heapBudget.budget; - } - - // Scope for lock. - { - VmaMutexLockWrite lock(m_Mutex, m_hAllocator->m_UseMutex); - - VmaDeviceMemoryBlock* pBlock = hAllocation->GetBlock(); - - if (IsCorruptionDetectionEnabled()) - { - VkResult res = pBlock->ValidateMagicValueAfterAllocation(m_hAllocator, hAllocation->GetOffset(), hAllocation->GetSize()); - VMA_ASSERT(res == VK_SUCCESS && "Couldn't map block memory to validate magic value."); - } - - if (hAllocation->IsPersistentMap()) - { - pBlock->Unmap(m_hAllocator, 1); - } - - const bool hadEmptyBlockBeforeFree = HasEmptyBlock(); - pBlock->m_pMetadata->Free(hAllocation->GetAllocHandle()); - pBlock->PostFree(m_hAllocator); - VMA_HEAVY_ASSERT(pBlock->Validate()); - - VMA_DEBUG_LOG(" Freed from MemoryTypeIndex=%u", m_MemoryTypeIndex); - - const bool canDeleteBlock = m_Blocks.size() > m_MinBlockCount; - // pBlock became empty after this deallocation. - if (pBlock->m_pMetadata->IsEmpty()) - { - // Already had empty block. We don't want to have two, so delete this one. - if ((hadEmptyBlockBeforeFree || budgetExceeded) && canDeleteBlock) - { - pBlockToDelete = pBlock; - Remove(pBlock); - } - // else: We now have one empty block - leave it. A hysteresis to avoid allocating whole block back and forth. - } - // pBlock didn't become empty, but we have another empty block - find and free that one. - // (This is optional, heuristics.) - else if (hadEmptyBlockBeforeFree && canDeleteBlock) - { - VmaDeviceMemoryBlock* pLastBlock = m_Blocks.back(); - if (pLastBlock->m_pMetadata->IsEmpty()) - { - pBlockToDelete = pLastBlock; - m_Blocks.pop_back(); - } - } - - IncrementallySortBlocks(); - } - - // Destruction of a free block. Deferred until this point, outside of mutex - // lock, for performance reason. 
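The block-size fallback completed above can be illustrated with a small standalone sketch, using made-up sizes rather than library code: when the preferred-size block cannot be allocated, the allocator retries with 1/2, 1/4 and 1/8 of the preferred size, as long as the halved size still fits the request.

```
// Sketch of the block-size fallback above, with assumed example sizes.
#include <cstdint>
#include <cstdio>

int main()
{
    const uint64_t preferredBlockSize = 256ull << 20;  // assumed 256 MiB preferred block size
    const uint64_t requestSize        = 20ull << 20;   // assumed 20 MiB allocation request
    const uint32_t NEW_BLOCK_SIZE_SHIFT_MAX = 3;       // same limit as in the code above

    uint64_t newBlockSize = preferredBlockSize;
    for (uint32_t shift = 0; shift < NEW_BLOCK_SIZE_SHIFT_MAX; ++shift)
    {
        const uint64_t smaller = newBlockSize / 2;
        if (smaller < requestSize)
            break;                      // halving again would no longer fit the request
        newBlockSize = smaller;
        std::printf("retry with %llu MiB block\n",
                    (unsigned long long)(newBlockSize >> 20));
    }
    // Prints 128, 64 and 32 MiB: the candidate sizes tried after the full-size block fails.
    return 0;
}
```

Capping the shrink at 1/8 of the preferred size keeps the number of VkDeviceMemory objects small while still giving the allocator a way out when device memory is nearly exhausted.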
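Free() above also validates a magic value written past each allocation when corruption detection is compiled in. A hedged sketch of how that path is usually enabled at build time; the margin value is arbitrary, and vmaCheckCorruption is assumed to be the public wrapper that ends up in VmaBlockVector::CheckCorruption (it does not appear in this hunk):

```
// Sketch, with assumed macro values: enabling the corruption-detection path used above.
// Define these in exactly one translation unit, before the VMA implementation.
#define VMA_DEBUG_MARGIN 16            // reserve a margin after every allocation
#define VMA_DEBUG_DETECT_CORRUPTION 1  // fill the margin with a magic value and verify it
#define VMA_IMPLEMENTATION
#include "vk_mem_alloc.h"

#include <cstdint>

VkResult CheckHostVisibleMemory(VmaAllocator allocator)
{
    // Only HOST_VISIBLE | HOST_COHERENT memory types are checked, matching
    // VmaBlockVector::IsCorruptionDetectionEnabled() above.
    return vmaCheckCorruption(allocator, UINT32_MAX);
}
```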
- if (pBlockToDelete != VMA_NULL) - { - VMA_DEBUG_LOG(" Deleted empty block #%u", pBlockToDelete->GetId()); - pBlockToDelete->Destroy(m_hAllocator); - vma_delete(m_hAllocator, pBlockToDelete); - } - - m_hAllocator->m_Budget.RemoveAllocation(m_hAllocator->MemoryTypeIndexToHeapIndex(m_MemoryTypeIndex), hAllocation->GetSize()); - m_hAllocator->m_AllocationObjectAllocator.Free(hAllocation); -} - -VkDeviceSize VmaBlockVector::CalcMaxBlockSize() const -{ - VkDeviceSize result = 0; - for (size_t i = m_Blocks.size(); i--; ) - { - result = VMA_MAX(result, m_Blocks[i]->m_pMetadata->GetSize()); - if (result >= m_PreferredBlockSize) - { - break; - } - } - return result; -} - -void VmaBlockVector::Remove(VmaDeviceMemoryBlock* pBlock) -{ - for (uint32_t blockIndex = 0; blockIndex < m_Blocks.size(); ++blockIndex) - { - if (m_Blocks[blockIndex] == pBlock) - { - VmaVectorRemove(m_Blocks, blockIndex); - return; - } - } - VMA_ASSERT(0); -} - -void VmaBlockVector::IncrementallySortBlocks() -{ - if (!m_IncrementalSort) - return; - if (m_Algorithm != VMA_POOL_CREATE_LINEAR_ALGORITHM_BIT) - { - // Bubble sort only until first swap. - for (size_t i = 1; i < m_Blocks.size(); ++i) - { - if (m_Blocks[i - 1]->m_pMetadata->GetSumFreeSize() > m_Blocks[i]->m_pMetadata->GetSumFreeSize()) - { - VMA_SWAP(m_Blocks[i - 1], m_Blocks[i]); - return; - } - } - } -} - -void VmaBlockVector::SortByFreeSize() -{ - VMA_SORT(m_Blocks.begin(), m_Blocks.end(), - [](VmaDeviceMemoryBlock* b1, VmaDeviceMemoryBlock* b2) -> bool - { - return b1->m_pMetadata->GetSumFreeSize() < b2->m_pMetadata->GetSumFreeSize(); - }); -} - -VkResult VmaBlockVector::AllocateFromBlock( - VmaDeviceMemoryBlock* pBlock, - VkDeviceSize size, - VkDeviceSize alignment, - VmaAllocationCreateFlags allocFlags, - void* pUserData, - VmaSuballocationType suballocType, - uint32_t strategy, - VmaAllocation* pAllocation) -{ - const bool isUpperAddress = (allocFlags & VMA_ALLOCATION_CREATE_UPPER_ADDRESS_BIT) != 0; - - VmaAllocationRequest currRequest = {}; - if (pBlock->m_pMetadata->CreateAllocationRequest( - size, - alignment, - isUpperAddress, - suballocType, - strategy, - &currRequest)) - { - return CommitAllocationRequest(currRequest, pBlock, alignment, allocFlags, pUserData, suballocType, pAllocation); - } - return VK_ERROR_OUT_OF_DEVICE_MEMORY; -} - -VkResult VmaBlockVector::CommitAllocationRequest( - VmaAllocationRequest& allocRequest, - VmaDeviceMemoryBlock* pBlock, - VkDeviceSize alignment, - VmaAllocationCreateFlags allocFlags, - void* pUserData, - VmaSuballocationType suballocType, - VmaAllocation* pAllocation) -{ - const bool mapped = (allocFlags & VMA_ALLOCATION_CREATE_MAPPED_BIT) != 0; - const bool isUserDataString = (allocFlags & VMA_ALLOCATION_CREATE_USER_DATA_COPY_STRING_BIT) != 0; - const bool isMappingAllowed = (allocFlags & - (VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT | VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT)) != 0; - - pBlock->PostAlloc(); - // Allocate from pCurrBlock. - if (mapped) - { - VkResult res = pBlock->Map(m_hAllocator, 1, VMA_NULL); - if (res != VK_SUCCESS) - { - return res; - } - } - - *pAllocation = m_hAllocator->m_AllocationObjectAllocator.Allocate(isMappingAllowed); - pBlock->m_pMetadata->Alloc(allocRequest, suballocType, *pAllocation); - (*pAllocation)->InitBlockAllocation( - pBlock, - allocRequest.allocHandle, - alignment, - allocRequest.size, // Not size, as actual allocation size may be larger than requested! 
- m_MemoryTypeIndex, - suballocType, - mapped); - VMA_HEAVY_ASSERT(pBlock->Validate()); - if (isUserDataString) - (*pAllocation)->SetName(m_hAllocator, (const char*)pUserData); - else - (*pAllocation)->SetUserData(m_hAllocator, pUserData); - m_hAllocator->m_Budget.AddAllocation(m_hAllocator->MemoryTypeIndexToHeapIndex(m_MemoryTypeIndex), allocRequest.size); - if (VMA_DEBUG_INITIALIZE_ALLOCATIONS) - { - m_hAllocator->FillAllocation(*pAllocation, VMA_ALLOCATION_FILL_PATTERN_CREATED); - } - if (IsCorruptionDetectionEnabled()) - { - VkResult res = pBlock->WriteMagicValueAfterAllocation(m_hAllocator, (*pAllocation)->GetOffset(), allocRequest.size); - VMA_ASSERT(res == VK_SUCCESS && "Couldn't map block memory to write magic value."); - } - return VK_SUCCESS; -} - -VkResult VmaBlockVector::CreateBlock(VkDeviceSize blockSize, size_t* pNewBlockIndex) -{ - VkMemoryAllocateInfo allocInfo = { VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO }; - allocInfo.pNext = m_pMemoryAllocateNext; - allocInfo.memoryTypeIndex = m_MemoryTypeIndex; - allocInfo.allocationSize = blockSize; - -#if VMA_BUFFER_DEVICE_ADDRESS - // Every standalone block can potentially contain a buffer with VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT - always enable the feature. - VkMemoryAllocateFlagsInfoKHR allocFlagsInfo = { VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_FLAGS_INFO_KHR }; - if (m_hAllocator->m_UseKhrBufferDeviceAddress) - { - allocFlagsInfo.flags = VK_MEMORY_ALLOCATE_DEVICE_ADDRESS_BIT_KHR; - VmaPnextChainPushFront(&allocInfo, &allocFlagsInfo); - } -#endif // VMA_BUFFER_DEVICE_ADDRESS - -#if VMA_MEMORY_PRIORITY - VkMemoryPriorityAllocateInfoEXT priorityInfo = { VK_STRUCTURE_TYPE_MEMORY_PRIORITY_ALLOCATE_INFO_EXT }; - if (m_hAllocator->m_UseExtMemoryPriority) - { - VMA_ASSERT(m_Priority >= 0.f && m_Priority <= 1.f); - priorityInfo.priority = m_Priority; - VmaPnextChainPushFront(&allocInfo, &priorityInfo); - } -#endif // VMA_MEMORY_PRIORITY - -#if VMA_EXTERNAL_MEMORY - // Attach VkExportMemoryAllocateInfoKHR if necessary. - VkExportMemoryAllocateInfoKHR exportMemoryAllocInfo = { VK_STRUCTURE_TYPE_EXPORT_MEMORY_ALLOCATE_INFO_KHR }; - exportMemoryAllocInfo.handleTypes = m_hAllocator->GetExternalMemoryHandleTypeFlags(m_MemoryTypeIndex); - if (exportMemoryAllocInfo.handleTypes != 0) - { - VmaPnextChainPushFront(&allocInfo, &exportMemoryAllocInfo); - } -#endif // VMA_EXTERNAL_MEMORY - - VkDeviceMemory mem = VK_NULL_HANDLE; - VkResult res = m_hAllocator->AllocateVulkanMemory(&allocInfo, &mem); - if (res < 0) - { - return res; - } - - // New VkDeviceMemory successfully created. - - // Create new Allocation for it. 
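CreateBlock() above builds the VkMemoryAllocateInfo::pNext chain conditionally. The same chain can be sketched in plain Vulkan without the VMA helpers; this is a simplified sketch, since the real code only attaches each struct when the matching extension was enabled on the allocator:

```
// Sketch of the pNext chain assembled in CreateBlock() above, in plain Vulkan.
// Simplified: the real code attaches each struct only when the corresponding
// extension or feature is enabled.
#include <vulkan/vulkan.h>

VkResult AllocateBlockMemory(VkDevice device, uint32_t memoryTypeIndex,
                             VkDeviceSize blockSize, VkDeviceMemory* pMemory)
{
    VkMemoryAllocateFlagsInfoKHR flagsInfo = { VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_FLAGS_INFO_KHR };
    flagsInfo.flags = VK_MEMORY_ALLOCATE_DEVICE_ADDRESS_BIT_KHR;  // buffer device address support

    VkMemoryPriorityAllocateInfoEXT priorityInfo = { VK_STRUCTURE_TYPE_MEMORY_PRIORITY_ALLOCATE_INFO_EXT };
    priorityInfo.priority = 0.5f;                                 // default priority used above

    VkMemoryAllocateInfo allocInfo = { VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO };
    allocInfo.memoryTypeIndex = memoryTypeIndex;
    allocInfo.allocationSize  = blockSize;

    // Push each extension struct to the front of the chain, like VmaPnextChainPushFront().
    priorityInfo.pNext = allocInfo.pNext;
    allocInfo.pNext    = &priorityInfo;
    flagsInfo.pNext    = allocInfo.pNext;
    allocInfo.pNext    = &flagsInfo;

    return vkAllocateMemory(device, &allocInfo, nullptr, pMemory);
}
```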
- VmaDeviceMemoryBlock* const pBlock = vma_new(m_hAllocator, VmaDeviceMemoryBlock)(m_hAllocator); - pBlock->Init( - m_hAllocator, - m_hParentPool, - m_MemoryTypeIndex, - mem, - allocInfo.allocationSize, - m_NextBlockId++, - m_Algorithm, - m_BufferImageGranularity); - - m_Blocks.push_back(pBlock); - if (pNewBlockIndex != VMA_NULL) - { - *pNewBlockIndex = m_Blocks.size() - 1; - } - - return VK_SUCCESS; -} - -bool VmaBlockVector::HasEmptyBlock() -{ - for (size_t index = 0, count = m_Blocks.size(); index < count; ++index) - { - VmaDeviceMemoryBlock* const pBlock = m_Blocks[index]; - if (pBlock->m_pMetadata->IsEmpty()) - { - return true; - } - } - return false; -} - -#if VMA_STATS_STRING_ENABLED -void VmaBlockVector::PrintDetailedMap(class VmaJsonWriter& json) -{ - VmaMutexLockRead lock(m_Mutex, m_hAllocator->m_UseMutex); - - - json.BeginObject(); - for (size_t i = 0; i < m_Blocks.size(); ++i) - { - json.BeginString(); - json.ContinueString(m_Blocks[i]->GetId()); - json.EndString(); - - json.BeginObject(); - json.WriteString("MapRefCount"); - json.WriteNumber(m_Blocks[i]->GetMapRefCount()); - - m_Blocks[i]->m_pMetadata->PrintDetailedMap(json); - json.EndObject(); - } - json.EndObject(); -} -#endif // VMA_STATS_STRING_ENABLED - -VkResult VmaBlockVector::CheckCorruption() -{ - if (!IsCorruptionDetectionEnabled()) - { - return VK_ERROR_FEATURE_NOT_PRESENT; - } - - VmaMutexLockRead lock(m_Mutex, m_hAllocator->m_UseMutex); - for (uint32_t blockIndex = 0; blockIndex < m_Blocks.size(); ++blockIndex) - { - VmaDeviceMemoryBlock* const pBlock = m_Blocks[blockIndex]; - VMA_ASSERT(pBlock); - VkResult res = pBlock->CheckCorruption(m_hAllocator); - if (res != VK_SUCCESS) - { - return res; - } - } - return VK_SUCCESS; -} - -#endif // _VMA_BLOCK_VECTOR_FUNCTIONS - -#ifndef _VMA_DEFRAGMENTATION_CONTEXT_FUNCTIONS -VmaDefragmentationContext_T::VmaDefragmentationContext_T( - VmaAllocator hAllocator, - const VmaDefragmentationInfo& info) - : m_MaxPassBytes(info.maxBytesPerPass == 0 ? VK_WHOLE_SIZE : info.maxBytesPerPass), - m_MaxPassAllocations(info.maxAllocationsPerPass == 0 ? 
UINT32_MAX : info.maxAllocationsPerPass), - m_MoveAllocator(hAllocator->GetAllocationCallbacks()), - m_Moves(m_MoveAllocator) -{ - m_Algorithm = info.flags & VMA_DEFRAGMENTATION_FLAG_ALGORITHM_MASK; - - if (info.pool != VMA_NULL) - { - m_BlockVectorCount = 1; - m_PoolBlockVector = &info.pool->m_BlockVector; - m_pBlockVectors = &m_PoolBlockVector; - m_PoolBlockVector->SetIncrementalSort(false); - m_PoolBlockVector->SortByFreeSize(); - } - else - { - m_BlockVectorCount = hAllocator->GetMemoryTypeCount(); - m_PoolBlockVector = VMA_NULL; - m_pBlockVectors = hAllocator->m_pBlockVectors; - for (uint32_t i = 0; i < m_BlockVectorCount; ++i) - { - VmaBlockVector* vector = m_pBlockVectors[i]; - if (vector != VMA_NULL) - { - vector->SetIncrementalSort(false); - vector->SortByFreeSize(); - } - } - } - - switch (m_Algorithm) - { - case 0: // Default algorithm - m_Algorithm = VMA_DEFRAGMENTATION_FLAG_ALGORITHM_BALANCED_BIT; - case VMA_DEFRAGMENTATION_FLAG_ALGORITHM_BALANCED_BIT: - { - m_AlgorithmState = vma_new_array(hAllocator, StateBalanced, m_BlockVectorCount); - break; - } - case VMA_DEFRAGMENTATION_FLAG_ALGORITHM_EXTENSIVE_BIT: - { - if (hAllocator->GetBufferImageGranularity() > 1) - { - m_AlgorithmState = vma_new_array(hAllocator, StateExtensive, m_BlockVectorCount); - } - break; - } - } -} - -VmaDefragmentationContext_T::~VmaDefragmentationContext_T() -{ - if (m_PoolBlockVector != VMA_NULL) - { - m_PoolBlockVector->SetIncrementalSort(true); - } - else - { - for (uint32_t i = 0; i < m_BlockVectorCount; ++i) - { - VmaBlockVector* vector = m_pBlockVectors[i]; - if (vector != VMA_NULL) - vector->SetIncrementalSort(true); - } - } - - if (m_AlgorithmState) - { - switch (m_Algorithm) - { - case VMA_DEFRAGMENTATION_FLAG_ALGORITHM_BALANCED_BIT: - vma_delete_array(m_MoveAllocator.m_pCallbacks, reinterpret_cast(m_AlgorithmState), m_BlockVectorCount); - break; - case VMA_DEFRAGMENTATION_FLAG_ALGORITHM_EXTENSIVE_BIT: - vma_delete_array(m_MoveAllocator.m_pCallbacks, reinterpret_cast(m_AlgorithmState), m_BlockVectorCount); - break; - default: - VMA_ASSERT(0); - } - } -} - -VkResult VmaDefragmentationContext_T::DefragmentPassBegin(VmaDefragmentationPassMoveInfo& moveInfo) -{ - if (m_PoolBlockVector != VMA_NULL) - { - VmaMutexLockWrite lock(m_PoolBlockVector->GetMutex(), m_PoolBlockVector->GetAllocator()->m_UseMutex); - - if (m_PoolBlockVector->GetBlockCount() > 1) - ComputeDefragmentation(*m_PoolBlockVector, 0); - else if (m_PoolBlockVector->GetBlockCount() == 1) - ReallocWithinBlock(*m_PoolBlockVector, m_PoolBlockVector->GetBlock(0)); - } - else - { - for (uint32_t i = 0; i < m_BlockVectorCount; ++i) - { - if (m_pBlockVectors[i] != VMA_NULL) - { - VmaMutexLockWrite lock(m_pBlockVectors[i]->GetMutex(), m_pBlockVectors[i]->GetAllocator()->m_UseMutex); - - if (m_pBlockVectors[i]->GetBlockCount() > 1) - { - if (ComputeDefragmentation(*m_pBlockVectors[i], i)) - break; - } - else if (m_pBlockVectors[i]->GetBlockCount() == 1) - { - if (ReallocWithinBlock(*m_pBlockVectors[i], m_pBlockVectors[i]->GetBlock(0))) - break; - } - } - } - } - - moveInfo.moveCount = static_cast(m_Moves.size()); - if (moveInfo.moveCount > 0) - { - moveInfo.pMoves = m_Moves.data(); - return VK_INCOMPLETE; - } - - moveInfo.pMoves = VMA_NULL; - return VK_SUCCESS; -} - -VkResult VmaDefragmentationContext_T::DefragmentPassEnd(VmaDefragmentationPassMoveInfo& moveInfo) -{ - VMA_ASSERT(moveInfo.moveCount > 0 ? 
moveInfo.pMoves != VMA_NULL : true); - - VkResult result = VK_SUCCESS; - VmaStlAllocator blockAllocator(m_MoveAllocator.m_pCallbacks); - VmaVector> immovableBlocks(blockAllocator); - VmaVector> mappedBlocks(blockAllocator); - - VmaAllocator allocator = VMA_NULL; - for (uint32_t i = 0; i < moveInfo.moveCount; ++i) - { - VmaDefragmentationMove& move = moveInfo.pMoves[i]; - size_t prevCount = 0, currentCount = 0; - VkDeviceSize freedBlockSize = 0; - - uint32_t vectorIndex; - VmaBlockVector* vector; - if (m_PoolBlockVector != VMA_NULL) - { - vectorIndex = 0; - vector = m_PoolBlockVector; - } - else - { - vectorIndex = move.srcAllocation->GetMemoryTypeIndex(); - vector = m_pBlockVectors[vectorIndex]; - VMA_ASSERT(vector != VMA_NULL); - } - - switch (move.operation) - { - case VMA_DEFRAGMENTATION_MOVE_OPERATION_COPY: - { - uint8_t mapCount = move.srcAllocation->SwapBlockAllocation(vector->m_hAllocator, move.dstTmpAllocation); - if (mapCount > 0) - { - allocator = vector->m_hAllocator; - VmaDeviceMemoryBlock* newMapBlock = move.srcAllocation->GetBlock(); - bool notPresent = true; - for (FragmentedBlock& block : mappedBlocks) - { - if (block.block == newMapBlock) - { - notPresent = false; - block.data += mapCount; - break; - } - } - if (notPresent) - mappedBlocks.push_back({ mapCount, newMapBlock }); - } - - // Scope for locks, Free have it's own lock - { - VmaMutexLockRead lock(vector->GetMutex(), vector->GetAllocator()->m_UseMutex); - prevCount = vector->GetBlockCount(); - freedBlockSize = move.dstTmpAllocation->GetBlock()->m_pMetadata->GetSize(); - } - vector->Free(move.dstTmpAllocation); - { - VmaMutexLockRead lock(vector->GetMutex(), vector->GetAllocator()->m_UseMutex); - currentCount = vector->GetBlockCount(); - } - - result = VK_INCOMPLETE; - break; - } - case VMA_DEFRAGMENTATION_MOVE_OPERATION_IGNORE: - { - m_PassStats.bytesMoved -= move.srcAllocation->GetSize(); - --m_PassStats.allocationsMoved; - vector->Free(move.dstTmpAllocation); - - VmaDeviceMemoryBlock* newBlock = move.srcAllocation->GetBlock(); - bool notPresent = true; - for (const FragmentedBlock& block : immovableBlocks) - { - if (block.block == newBlock) - { - notPresent = false; - break; - } - } - if (notPresent) - immovableBlocks.push_back({ vectorIndex, newBlock }); - break; - } - case VMA_DEFRAGMENTATION_MOVE_OPERATION_DESTROY: - { - m_PassStats.bytesMoved -= move.srcAllocation->GetSize(); - --m_PassStats.allocationsMoved; - // Scope for locks, Free have it's own lock - { - VmaMutexLockRead lock(vector->GetMutex(), vector->GetAllocator()->m_UseMutex); - prevCount = vector->GetBlockCount(); - freedBlockSize = move.srcAllocation->GetBlock()->m_pMetadata->GetSize(); - } - vector->Free(move.srcAllocation); - { - VmaMutexLockRead lock(vector->GetMutex(), vector->GetAllocator()->m_UseMutex); - currentCount = vector->GetBlockCount(); - } - freedBlockSize *= prevCount - currentCount; - - VkDeviceSize dstBlockSize; - { - VmaMutexLockRead lock(vector->GetMutex(), vector->GetAllocator()->m_UseMutex); - dstBlockSize = move.dstTmpAllocation->GetBlock()->m_pMetadata->GetSize(); - } - vector->Free(move.dstTmpAllocation); - { - VmaMutexLockRead lock(vector->GetMutex(), vector->GetAllocator()->m_UseMutex); - freedBlockSize += dstBlockSize * (currentCount - vector->GetBlockCount()); - currentCount = vector->GetBlockCount(); - } - - result = VK_INCOMPLETE; - break; - } - default: - VMA_ASSERT(0); - } - - if (prevCount > currentCount) - { - size_t freedBlocks = prevCount - currentCount; - m_PassStats.deviceMemoryBlocksFreed += 
static_cast(freedBlocks); - m_PassStats.bytesFreed += freedBlockSize; - } - - switch (m_Algorithm) - { - case VMA_DEFRAGMENTATION_FLAG_ALGORITHM_EXTENSIVE_BIT: - { - if (m_AlgorithmState != VMA_NULL) - { - // Avoid unnecessary tries to allocate when new free block is avaiable - StateExtensive& state = reinterpret_cast(m_AlgorithmState)[vectorIndex]; - if (state.firstFreeBlock != SIZE_MAX) - { - const size_t diff = prevCount - currentCount; - if (state.firstFreeBlock >= diff) - { - state.firstFreeBlock -= diff; - if (state.firstFreeBlock != 0) - state.firstFreeBlock -= vector->GetBlock(state.firstFreeBlock - 1)->m_pMetadata->IsEmpty(); - } - else - state.firstFreeBlock = 0; - } - } - } - } - } - moveInfo.moveCount = 0; - moveInfo.pMoves = VMA_NULL; - m_Moves.clear(); - - // Update stats - m_GlobalStats.allocationsMoved += m_PassStats.allocationsMoved; - m_GlobalStats.bytesFreed += m_PassStats.bytesFreed; - m_GlobalStats.bytesMoved += m_PassStats.bytesMoved; - m_GlobalStats.deviceMemoryBlocksFreed += m_PassStats.deviceMemoryBlocksFreed; - m_PassStats = { 0 }; - - // Move blocks with immovable allocations according to algorithm - if (immovableBlocks.size() > 0) - { - switch (m_Algorithm) - { - case VMA_DEFRAGMENTATION_FLAG_ALGORITHM_EXTENSIVE_BIT: - { - if (m_AlgorithmState != VMA_NULL) - { - bool swapped = false; - // Move to the start of free blocks range - for (const FragmentedBlock& block : immovableBlocks) - { - StateExtensive& state = reinterpret_cast(m_AlgorithmState)[block.data]; - if (state.operation != StateExtensive::Operation::Cleanup) - { - VmaBlockVector* vector = m_pBlockVectors[block.data]; - VmaMutexLockWrite lock(vector->GetMutex(), vector->GetAllocator()->m_UseMutex); - - for (size_t i = 0, count = vector->GetBlockCount() - m_ImmovableBlockCount; i < count; ++i) - { - if (vector->GetBlock(i) == block.block) - { - VMA_SWAP(vector->m_Blocks[i], vector->m_Blocks[vector->GetBlockCount() - ++m_ImmovableBlockCount]); - if (state.firstFreeBlock != SIZE_MAX) - { - if (i + 1 < state.firstFreeBlock) - { - if (state.firstFreeBlock > 1) - VMA_SWAP(vector->m_Blocks[i], vector->m_Blocks[--state.firstFreeBlock]); - else - --state.firstFreeBlock; - } - } - swapped = true; - break; - } - } - } - } - if (swapped) - result = VK_INCOMPLETE; - break; - } - } - default: - { - // Move to the begining - for (const FragmentedBlock& block : immovableBlocks) - { - VmaBlockVector* vector = m_pBlockVectors[block.data]; - VmaMutexLockWrite lock(vector->GetMutex(), vector->GetAllocator()->m_UseMutex); - - for (size_t i = m_ImmovableBlockCount; i < vector->GetBlockCount(); ++i) - { - if (vector->GetBlock(i) == block.block) - { - VMA_SWAP(vector->m_Blocks[i], vector->m_Blocks[m_ImmovableBlockCount++]); - break; - } - } - } - break; - } - } - } - - // Bulk-map destination blocks - for (const FragmentedBlock& block : mappedBlocks) - { - VkResult res = block.block->Map(allocator, block.data, VMA_NULL); - VMA_ASSERT(res == VK_SUCCESS); - } - return result; -} - -bool VmaDefragmentationContext_T::ComputeDefragmentation(VmaBlockVector& vector, size_t index) -{ - switch (m_Algorithm) - { - case VMA_DEFRAGMENTATION_FLAG_ALGORITHM_FAST_BIT: - return ComputeDefragmentation_Fast(vector); - default: - VMA_ASSERT(0); - case VMA_DEFRAGMENTATION_FLAG_ALGORITHM_BALANCED_BIT: - return ComputeDefragmentation_Balanced(vector, index, true); - case VMA_DEFRAGMENTATION_FLAG_ALGORITHM_FULL_BIT: - return ComputeDefragmentation_Full(vector); - case VMA_DEFRAGMENTATION_FLAG_ALGORITHM_EXTENSIVE_BIT: - return 
ComputeDefragmentation_Extensive(vector, index); - } -} - -VmaDefragmentationContext_T::MoveAllocationData VmaDefragmentationContext_T::GetMoveData( - VmaAllocHandle handle, VmaBlockMetadata* metadata) -{ - MoveAllocationData moveData; - moveData.move.srcAllocation = (VmaAllocation)metadata->GetAllocationUserData(handle); - moveData.size = moveData.move.srcAllocation->GetSize(); - moveData.alignment = moveData.move.srcAllocation->GetAlignment(); - moveData.type = moveData.move.srcAllocation->GetSuballocationType(); - moveData.flags = 0; - - if (moveData.move.srcAllocation->IsPersistentMap()) - moveData.flags |= VMA_ALLOCATION_CREATE_MAPPED_BIT; - if (moveData.move.srcAllocation->IsMappingAllowed()) - moveData.flags |= VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT | VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT; - - return moveData; -} - -VmaDefragmentationContext_T::CounterStatus VmaDefragmentationContext_T::CheckCounters(VkDeviceSize bytes) -{ - // Ignore allocation if will exceed max size for copy - if (m_PassStats.bytesMoved + bytes > m_MaxPassBytes) - { - if (++m_IgnoredAllocs < MAX_ALLOCS_TO_IGNORE) - return CounterStatus::Ignore; - else - return CounterStatus::End; - } - return CounterStatus::Pass; -} - -bool VmaDefragmentationContext_T::IncrementCounters(VkDeviceSize bytes) -{ - m_PassStats.bytesMoved += bytes; - // Early return when max found - if (++m_PassStats.allocationsMoved >= m_MaxPassAllocations || m_PassStats.bytesMoved >= m_MaxPassBytes) - { - VMA_ASSERT(m_PassStats.allocationsMoved == m_MaxPassAllocations || - m_PassStats.bytesMoved == m_MaxPassBytes && "Exceeded maximal pass threshold!"); - return true; - } - return false; -} - -bool VmaDefragmentationContext_T::ReallocWithinBlock(VmaBlockVector& vector, VmaDeviceMemoryBlock* block) -{ - VmaBlockMetadata* metadata = block->m_pMetadata; - - for (VmaAllocHandle handle = metadata->GetAllocationListBegin(); - handle != VK_NULL_HANDLE; - handle = metadata->GetNextAllocation(handle)) - { - MoveAllocationData moveData = GetMoveData(handle, metadata); - // Ignore newly created allocations by defragmentation algorithm - if (moveData.move.srcAllocation->GetUserData() == this) - continue; - switch (CheckCounters(moveData.move.srcAllocation->GetSize())) - { - case CounterStatus::Ignore: - continue; - case CounterStatus::End: - return true; - default: - VMA_ASSERT(0); - case CounterStatus::Pass: - break; - } - - VkDeviceSize offset = moveData.move.srcAllocation->GetOffset(); - if (offset != 0 && metadata->GetSumFreeSize() >= moveData.size) - { - VmaAllocationRequest request = {}; - if (metadata->CreateAllocationRequest( - moveData.size, - moveData.alignment, - false, - moveData.type, - VMA_ALLOCATION_CREATE_STRATEGY_MIN_OFFSET_BIT, - &request)) - { - if (metadata->GetAllocationOffset(request.allocHandle) < offset) - { - if (vector.CommitAllocationRequest( - request, - block, - moveData.alignment, - moveData.flags, - this, - moveData.type, - &moveData.move.dstTmpAllocation) == VK_SUCCESS) - { - m_Moves.push_back(moveData.move); - if (IncrementCounters(moveData.size)) - return true; - } - } - } - } - } - return false; -} - -bool VmaDefragmentationContext_T::AllocInOtherBlock(size_t start, size_t end, MoveAllocationData& data, VmaBlockVector& vector) -{ - for (; start < end; ++start) - { - VmaDeviceMemoryBlock* dstBlock = vector.GetBlock(start); - if (dstBlock->m_pMetadata->GetSumFreeSize() >= data.size) - { - if (vector.AllocateFromBlock(dstBlock, - data.size, - data.alignment, - data.flags, - this, - data.type, - 0, - 
&data.move.dstTmpAllocation) == VK_SUCCESS) - { - m_Moves.push_back(data.move); - if (IncrementCounters(data.size)) - return true; - break; - } - } - } - return false; -} - -bool VmaDefragmentationContext_T::ComputeDefragmentation_Fast(VmaBlockVector& vector) -{ - // Move only between blocks - - // Go through allocations in last blocks and try to fit them inside first ones - for (size_t i = vector.GetBlockCount() - 1; i > m_ImmovableBlockCount; --i) - { - VmaBlockMetadata* metadata = vector.GetBlock(i)->m_pMetadata; - - for (VmaAllocHandle handle = metadata->GetAllocationListBegin(); - handle != VK_NULL_HANDLE; - handle = metadata->GetNextAllocation(handle)) - { - MoveAllocationData moveData = GetMoveData(handle, metadata); - // Ignore newly created allocations by defragmentation algorithm - if (moveData.move.srcAllocation->GetUserData() == this) - continue; - switch (CheckCounters(moveData.move.srcAllocation->GetSize())) - { - case CounterStatus::Ignore: - continue; - case CounterStatus::End: - return true; - default: - VMA_ASSERT(0); - case CounterStatus::Pass: - break; - } - - // Check all previous blocks for free space - if (AllocInOtherBlock(0, i, moveData, vector)) - return true; - } - } - return false; -} - -bool VmaDefragmentationContext_T::ComputeDefragmentation_Balanced(VmaBlockVector& vector, size_t index, bool update) -{ - // Go over every allocation and try to fit it in previous blocks at lowest offsets, - // if not possible: realloc within single block to minimize offset (exclude offset == 0), - // but only if there are noticable gaps between them (some heuristic, ex. average size of allocation in block) - VMA_ASSERT(m_AlgorithmState != VMA_NULL); - - StateBalanced& vectorState = reinterpret_cast(m_AlgorithmState)[index]; - if (update && vectorState.avgAllocSize == UINT64_MAX) - UpdateVectorStatistics(vector, vectorState); - - const size_t startMoveCount = m_Moves.size(); - VkDeviceSize minimalFreeRegion = vectorState.avgFreeSize / 2; - for (size_t i = vector.GetBlockCount() - 1; i > m_ImmovableBlockCount; --i) - { - VmaDeviceMemoryBlock* block = vector.GetBlock(i); - VmaBlockMetadata* metadata = block->m_pMetadata; - VkDeviceSize prevFreeRegionSize = 0; - - for (VmaAllocHandle handle = metadata->GetAllocationListBegin(); - handle != VK_NULL_HANDLE; - handle = metadata->GetNextAllocation(handle)) - { - MoveAllocationData moveData = GetMoveData(handle, metadata); - // Ignore newly created allocations by defragmentation algorithm - if (moveData.move.srcAllocation->GetUserData() == this) - continue; - switch (CheckCounters(moveData.move.srcAllocation->GetSize())) - { - case CounterStatus::Ignore: - continue; - case CounterStatus::End: - return true; - default: - VMA_ASSERT(0); - case CounterStatus::Pass: - break; - } - - // Check all previous blocks for free space - const size_t prevMoveCount = m_Moves.size(); - if (AllocInOtherBlock(0, i, moveData, vector)) - return true; - - VkDeviceSize nextFreeRegionSize = metadata->GetNextFreeRegionSize(handle); - // If no room found then realloc within block for lower offset - VkDeviceSize offset = moveData.move.srcAllocation->GetOffset(); - if (prevMoveCount == m_Moves.size() && offset != 0 && metadata->GetSumFreeSize() >= moveData.size) - { - // Check if realloc will make sense - if (prevFreeRegionSize >= minimalFreeRegion || - nextFreeRegionSize >= minimalFreeRegion || - moveData.size <= vectorState.avgFreeSize || - moveData.size <= vectorState.avgAllocSize) - { - VmaAllocationRequest request = {}; - if 
(metadata->CreateAllocationRequest( - moveData.size, - moveData.alignment, - false, - moveData.type, - VMA_ALLOCATION_CREATE_STRATEGY_MIN_OFFSET_BIT, - &request)) - { - if (metadata->GetAllocationOffset(request.allocHandle) < offset) - { - if (vector.CommitAllocationRequest( - request, - block, - moveData.alignment, - moveData.flags, - this, - moveData.type, - &moveData.move.dstTmpAllocation) == VK_SUCCESS) - { - m_Moves.push_back(moveData.move); - if (IncrementCounters(moveData.size)) - return true; - } - } - } - } - } - prevFreeRegionSize = nextFreeRegionSize; - } - } - - // No moves perfomed, update statistics to current vector state - if (startMoveCount == m_Moves.size() && !update) - { - vectorState.avgAllocSize = UINT64_MAX; - return ComputeDefragmentation_Balanced(vector, index, false); - } - return false; -} - -bool VmaDefragmentationContext_T::ComputeDefragmentation_Full(VmaBlockVector& vector) -{ - // Go over every allocation and try to fit it in previous blocks at lowest offsets, - // if not possible: realloc within single block to minimize offset (exclude offset == 0) - - for (size_t i = vector.GetBlockCount() - 1; i > m_ImmovableBlockCount; --i) - { - VmaDeviceMemoryBlock* block = vector.GetBlock(i); - VmaBlockMetadata* metadata = block->m_pMetadata; - - for (VmaAllocHandle handle = metadata->GetAllocationListBegin(); - handle != VK_NULL_HANDLE; - handle = metadata->GetNextAllocation(handle)) - { - MoveAllocationData moveData = GetMoveData(handle, metadata); - // Ignore newly created allocations by defragmentation algorithm - if (moveData.move.srcAllocation->GetUserData() == this) - continue; - switch (CheckCounters(moveData.move.srcAllocation->GetSize())) - { - case CounterStatus::Ignore: - continue; - case CounterStatus::End: - return true; - default: - VMA_ASSERT(0); - case CounterStatus::Pass: - break; - } - - // Check all previous blocks for free space - const size_t prevMoveCount = m_Moves.size(); - if (AllocInOtherBlock(0, i, moveData, vector)) - return true; - - // If no room found then realloc within block for lower offset - VkDeviceSize offset = moveData.move.srcAllocation->GetOffset(); - if (prevMoveCount == m_Moves.size() && offset != 0 && metadata->GetSumFreeSize() >= moveData.size) - { - VmaAllocationRequest request = {}; - if (metadata->CreateAllocationRequest( - moveData.size, - moveData.alignment, - false, - moveData.type, - VMA_ALLOCATION_CREATE_STRATEGY_MIN_OFFSET_BIT, - &request)) - { - if (metadata->GetAllocationOffset(request.allocHandle) < offset) - { - if (vector.CommitAllocationRequest( - request, - block, - moveData.alignment, - moveData.flags, - this, - moveData.type, - &moveData.move.dstTmpAllocation) == VK_SUCCESS) - { - m_Moves.push_back(moveData.move); - if (IncrementCounters(moveData.size)) - return true; - } - } - } - } - } - } - return false; -} - -bool VmaDefragmentationContext_T::ComputeDefragmentation_Extensive(VmaBlockVector& vector, size_t index) -{ - // First free single block, then populate it to the brim, then free another block, and so on - - // Fallback to previous algorithm since without granularity conflicts it can achieve max packing - if (vector.m_BufferImageGranularity == 1) - return ComputeDefragmentation_Full(vector); - - VMA_ASSERT(m_AlgorithmState != VMA_NULL); - - StateExtensive& vectorState = reinterpret_cast(m_AlgorithmState)[index]; - - bool texturePresent = false, bufferPresent = false, otherPresent = false; - switch (vectorState.operation) - { - case StateExtensive::Operation::Done: // Vector defragmented - return 
false; - case StateExtensive::Operation::FindFreeBlockBuffer: - case StateExtensive::Operation::FindFreeBlockTexture: - case StateExtensive::Operation::FindFreeBlockAll: - { - // No more blocks to free, just perform fast realloc and move to cleanup - if (vectorState.firstFreeBlock == 0) - { - vectorState.operation = StateExtensive::Operation::Cleanup; - return ComputeDefragmentation_Fast(vector); - } - - // No free blocks, have to clear last one - size_t last = (vectorState.firstFreeBlock == SIZE_MAX ? vector.GetBlockCount() : vectorState.firstFreeBlock) - 1; - VmaBlockMetadata* freeMetadata = vector.GetBlock(last)->m_pMetadata; - - const size_t prevMoveCount = m_Moves.size(); - for (VmaAllocHandle handle = freeMetadata->GetAllocationListBegin(); - handle != VK_NULL_HANDLE; - handle = freeMetadata->GetNextAllocation(handle)) - { - MoveAllocationData moveData = GetMoveData(handle, freeMetadata); - switch (CheckCounters(moveData.move.srcAllocation->GetSize())) - { - case CounterStatus::Ignore: - continue; - case CounterStatus::End: - return true; - default: - VMA_ASSERT(0); - case CounterStatus::Pass: - break; - } - - // Check all previous blocks for free space - if (AllocInOtherBlock(0, last, moveData, vector)) - { - // Full clear performed already - if (prevMoveCount != m_Moves.size() && freeMetadata->GetNextAllocation(handle) == VK_NULL_HANDLE) - reinterpret_cast(m_AlgorithmState)[index] = last; - return true; - } - } - - if (prevMoveCount == m_Moves.size()) - { - // Cannot perform full clear, have to move data in other blocks around - if (last != 0) - { - for (size_t i = last - 1; i; --i) - { - if (ReallocWithinBlock(vector, vector.GetBlock(i))) - return true; - } - } - - if (prevMoveCount == m_Moves.size()) - { - // No possible reallocs within blocks, try to move them around fast - return ComputeDefragmentation_Fast(vector); - } - } - else - { - switch (vectorState.operation) - { - case StateExtensive::Operation::FindFreeBlockBuffer: - vectorState.operation = StateExtensive::Operation::MoveBuffers; - break; - default: - VMA_ASSERT(0); - case StateExtensive::Operation::FindFreeBlockTexture: - vectorState.operation = StateExtensive::Operation::MoveTextures; - break; - case StateExtensive::Operation::FindFreeBlockAll: - vectorState.operation = StateExtensive::Operation::MoveAll; - break; - } - vectorState.firstFreeBlock = last; - // Nothing done, block found without reallocations, can perform another reallocs in same pass - return ComputeDefragmentation_Extensive(vector, index); - } - break; - } - case StateExtensive::Operation::MoveTextures: - { - if (MoveDataToFreeBlocks(VMA_SUBALLOCATION_TYPE_IMAGE_OPTIMAL, vector, - vectorState.firstFreeBlock, texturePresent, bufferPresent, otherPresent)) - { - if (texturePresent) - { - vectorState.operation = StateExtensive::Operation::FindFreeBlockTexture; - return ComputeDefragmentation_Extensive(vector, index); - } - - if (!bufferPresent && !otherPresent) - { - vectorState.operation = StateExtensive::Operation::Cleanup; - break; - } - - // No more textures to move, check buffers - vectorState.operation = StateExtensive::Operation::MoveBuffers; - bufferPresent = false; - otherPresent = false; - } - else - break; - } - case StateExtensive::Operation::MoveBuffers: - { - if (MoveDataToFreeBlocks(VMA_SUBALLOCATION_TYPE_BUFFER, vector, - vectorState.firstFreeBlock, texturePresent, bufferPresent, otherPresent)) - { - if (bufferPresent) - { - vectorState.operation = StateExtensive::Operation::FindFreeBlockBuffer; - return 
ComputeDefragmentation_Extensive(vector, index); - } - - if (!otherPresent) - { - vectorState.operation = StateExtensive::Operation::Cleanup; - break; - } - - // No more buffers to move, check all others - vectorState.operation = StateExtensive::Operation::MoveAll; - otherPresent = false; - } - else - break; - } - case StateExtensive::Operation::MoveAll: - { - if (MoveDataToFreeBlocks(VMA_SUBALLOCATION_TYPE_FREE, vector, - vectorState.firstFreeBlock, texturePresent, bufferPresent, otherPresent)) - { - if (otherPresent) - { - vectorState.operation = StateExtensive::Operation::FindFreeBlockBuffer; - return ComputeDefragmentation_Extensive(vector, index); - } - // Everything moved - vectorState.operation = StateExtensive::Operation::Cleanup; - } - break; - } - case StateExtensive::Operation::Cleanup: - // Cleanup is handled below so that other operations may reuse the cleanup code. This case is here to prevent the unhandled enum value warning (C4062). - break; - } - - if (vectorState.operation == StateExtensive::Operation::Cleanup) - { - // All other work done, pack data in blocks even tighter if possible - const size_t prevMoveCount = m_Moves.size(); - for (size_t i = 0; i < vector.GetBlockCount(); ++i) - { - if (ReallocWithinBlock(vector, vector.GetBlock(i))) - return true; - } - - if (prevMoveCount == m_Moves.size()) - vectorState.operation = StateExtensive::Operation::Done; - } - return false; -} - -void VmaDefragmentationContext_T::UpdateVectorStatistics(VmaBlockVector& vector, StateBalanced& state) -{ - size_t allocCount = 0; - size_t freeCount = 0; - state.avgFreeSize = 0; - state.avgAllocSize = 0; - - for (size_t i = 0; i < vector.GetBlockCount(); ++i) - { - VmaBlockMetadata* metadata = vector.GetBlock(i)->m_pMetadata; - - allocCount += metadata->GetAllocationCount(); - freeCount += metadata->GetFreeRegionsCount(); - state.avgFreeSize += metadata->GetSumFreeSize(); - state.avgAllocSize += metadata->GetSize(); - } - - state.avgAllocSize = (state.avgAllocSize - state.avgFreeSize) / allocCount; - state.avgFreeSize /= freeCount; -} - -bool VmaDefragmentationContext_T::MoveDataToFreeBlocks(VmaSuballocationType currentType, - VmaBlockVector& vector, size_t firstFreeBlock, - bool& texturePresent, bool& bufferPresent, bool& otherPresent) -{ - const size_t prevMoveCount = m_Moves.size(); - for (size_t i = firstFreeBlock ; i;) - { - VmaDeviceMemoryBlock* block = vector.GetBlock(--i); - VmaBlockMetadata* metadata = block->m_pMetadata; - - for (VmaAllocHandle handle = metadata->GetAllocationListBegin(); - handle != VK_NULL_HANDLE; - handle = metadata->GetNextAllocation(handle)) - { - MoveAllocationData moveData = GetMoveData(handle, metadata); - // Ignore newly created allocations by defragmentation algorithm - if (moveData.move.srcAllocation->GetUserData() == this) - continue; - switch (CheckCounters(moveData.move.srcAllocation->GetSize())) - { - case CounterStatus::Ignore: - continue; - case CounterStatus::End: - return true; - default: - VMA_ASSERT(0); - case CounterStatus::Pass: - break; - } - - // Move only single type of resources at once - if (!VmaIsBufferImageGranularityConflict(moveData.type, currentType)) - { - // Try to fit allocation into free blocks - if (AllocInOtherBlock(firstFreeBlock, vector.GetBlockCount(), moveData, vector)) - return false; - } - - if (!VmaIsBufferImageGranularityConflict(moveData.type, VMA_SUBALLOCATION_TYPE_IMAGE_OPTIMAL)) - texturePresent = true; - else if (!VmaIsBufferImageGranularityConflict(moveData.type, VMA_SUBALLOCATION_TYPE_BUFFER)) - bufferPresent = 
true; - else - otherPresent = true; - } - } - return prevMoveCount == m_Moves.size(); -} -#endif // _VMA_DEFRAGMENTATION_CONTEXT_FUNCTIONS - -#ifndef _VMA_POOL_T_FUNCTIONS -VmaPool_T::VmaPool_T( - VmaAllocator hAllocator, - const VmaPoolCreateInfo& createInfo, - VkDeviceSize preferredBlockSize) - : m_BlockVector( - hAllocator, - this, // hParentPool - createInfo.memoryTypeIndex, - createInfo.blockSize != 0 ? createInfo.blockSize : preferredBlockSize, - createInfo.minBlockCount, - createInfo.maxBlockCount, - (createInfo.flags& VMA_POOL_CREATE_IGNORE_BUFFER_IMAGE_GRANULARITY_BIT) != 0 ? 1 : hAllocator->GetBufferImageGranularity(), - createInfo.blockSize != 0, // explicitBlockSize - createInfo.flags & VMA_POOL_CREATE_ALGORITHM_MASK, // algorithm - createInfo.priority, - VMA_MAX(hAllocator->GetMemoryTypeMinAlignment(createInfo.memoryTypeIndex), createInfo.minAllocationAlignment), - createInfo.pMemoryAllocateNext), - m_Id(0), - m_Name(VMA_NULL) {} - -VmaPool_T::~VmaPool_T() -{ - VMA_ASSERT(m_PrevPool == VMA_NULL && m_NextPool == VMA_NULL); -} - -void VmaPool_T::SetName(const char* pName) -{ - const VkAllocationCallbacks* allocs = m_BlockVector.GetAllocator()->GetAllocationCallbacks(); - VmaFreeString(allocs, m_Name); - - if (pName != VMA_NULL) - { - m_Name = VmaCreateStringCopy(allocs, pName); - } - else - { - m_Name = VMA_NULL; - } -} -#endif // _VMA_POOL_T_FUNCTIONS - -#ifndef _VMA_ALLOCATOR_T_FUNCTIONS -VmaAllocator_T::VmaAllocator_T(const VmaAllocatorCreateInfo* pCreateInfo) : - m_UseMutex((pCreateInfo->flags & VMA_ALLOCATOR_CREATE_EXTERNALLY_SYNCHRONIZED_BIT) == 0), - m_VulkanApiVersion(pCreateInfo->vulkanApiVersion != 0 ? pCreateInfo->vulkanApiVersion : VK_API_VERSION_1_0), - m_UseKhrDedicatedAllocation((pCreateInfo->flags & VMA_ALLOCATOR_CREATE_KHR_DEDICATED_ALLOCATION_BIT) != 0), - m_UseKhrBindMemory2((pCreateInfo->flags & VMA_ALLOCATOR_CREATE_KHR_BIND_MEMORY2_BIT) != 0), - m_UseExtMemoryBudget((pCreateInfo->flags & VMA_ALLOCATOR_CREATE_EXT_MEMORY_BUDGET_BIT) != 0), - m_UseAmdDeviceCoherentMemory((pCreateInfo->flags & VMA_ALLOCATOR_CREATE_AMD_DEVICE_COHERENT_MEMORY_BIT) != 0), - m_UseKhrBufferDeviceAddress((pCreateInfo->flags & VMA_ALLOCATOR_CREATE_BUFFER_DEVICE_ADDRESS_BIT) != 0), - m_UseExtMemoryPriority((pCreateInfo->flags & VMA_ALLOCATOR_CREATE_EXT_MEMORY_PRIORITY_BIT) != 0), - m_hDevice(pCreateInfo->device), - m_hInstance(pCreateInfo->instance), - m_AllocationCallbacksSpecified(pCreateInfo->pAllocationCallbacks != VMA_NULL), - m_AllocationCallbacks(pCreateInfo->pAllocationCallbacks ? - *pCreateInfo->pAllocationCallbacks : VmaEmptyAllocationCallbacks), - m_AllocationObjectAllocator(&m_AllocationCallbacks), - m_HeapSizeLimitMask(0), - m_DeviceMemoryCount(0), - m_PreferredLargeHeapBlockSize(0), - m_PhysicalDevice(pCreateInfo->physicalDevice), - m_GpuDefragmentationMemoryTypeBits(UINT32_MAX), - m_NextPoolId(0), - m_GlobalMemoryTypeBits(UINT32_MAX) -{ - if(m_VulkanApiVersion >= VK_MAKE_VERSION(1, 1, 0)) - { - m_UseKhrDedicatedAllocation = false; - m_UseKhrBindMemory2 = false; - } - - if(VMA_DEBUG_DETECT_CORRUPTION) - { - // Needs to be multiply of uint32_t size because we are going to write VMA_CORRUPTION_DETECTION_MAGIC_VALUE to it. 
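DefragmentPassBegin() and DefragmentPassEnd() above are driven from the application as a begin/pass/end loop. The sketch below assumes the usual public entry points vmaBeginDefragmentation, vmaBeginDefragmentationPass, vmaEndDefragmentationPass and vmaEndDefragmentation (none of them appear in this hunk) and elides the actual data copies between source and destination allocations:

```
// Sketch of a defragmentation driver loop; the vmaBegin*/vmaEnd* entry points are
// assumed, and the per-move data copies are elided.
#include "vk_mem_alloc.h"

void DefragmentAll(VmaAllocator allocator)
{
    VmaDefragmentationInfo defragInfo = {};
    defragInfo.flags = VMA_DEFRAGMENTATION_FLAG_ALGORITHM_BALANCED_BIT;
    defragInfo.maxBytesPerPass = 0;        // 0 = unlimited, becomes VK_WHOLE_SIZE above
    defragInfo.maxAllocationsPerPass = 0;  // 0 = unlimited, becomes UINT32_MAX above

    VmaDefragmentationContext defragCtx = VK_NULL_HANDLE;
    if (vmaBeginDefragmentation(allocator, &defragInfo, &defragCtx) != VK_SUCCESS)
        return;

    for (;;)
    {
        VmaDefragmentationPassMoveInfo pass = {};
        VkResult res = vmaBeginDefragmentationPass(allocator, defragCtx, &pass);
        if (res == VK_SUCCESS)
            break;                         // nothing left to move
        // res == VK_INCOMPLETE: copy each pass.pMoves[i].srcAllocation into its
        // dstTmpAllocation (e.g. vkCmdCopyBuffer + submit + wait), or set
        // pass.pMoves[i].operation to IGNORE / DESTROY, then end the pass.
        res = vmaEndDefragmentationPass(allocator, defragCtx, &pass);
        if (res == VK_SUCCESS)
            break;
    }

    VmaDefragmentationStats stats = {};
    vmaEndDefragmentation(allocator, defragCtx, &stats);
}
```

VK_INCOMPLETE from the pass functions signals that more work remains, matching the return values produced in the code above.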
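VmaPool_T above is configured entirely from a VmaPoolCreateInfo, so on the application side this corresponds to creating a custom pool. The sketch assumes the usual public helpers vmaFindMemoryTypeIndexForBufferInfo and vmaCreatePool (not part of this hunk) and uses arbitrary sizes:

```
// Sketch of creating a custom pool; the vmaFindMemoryTypeIndexForBufferInfo and
// vmaCreatePool entry points are assumed, and all sizes are arbitrary.
#include "vk_mem_alloc.h"

VkResult CreateLinearUploadPool(VmaAllocator allocator, VmaPool* pPool)
{
    // Pick a memory type the same way a buffer allocation would.
    VkBufferCreateInfo sampleBufInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
    sampleBufInfo.size  = 1024;
    sampleBufInfo.usage = VK_BUFFER_USAGE_TRANSFER_SRC_BIT;

    VmaAllocationCreateInfo sampleAllocInfo = {};
    sampleAllocInfo.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT;
    sampleAllocInfo.usage = VMA_MEMORY_USAGE_AUTO;

    uint32_t memTypeIndex = 0;
    VkResult res = vmaFindMemoryTypeIndexForBufferInfo(
        allocator, &sampleBufInfo, &sampleAllocInfo, &memTypeIndex);
    if (res != VK_SUCCESS)
        return res;

    VmaPoolCreateInfo poolInfo = {};
    poolInfo.memoryTypeIndex = memTypeIndex;
    poolInfo.flags = VMA_POOL_CREATE_LINEAR_ALGORITHM_BIT;  // ring-buffer style pool
    poolInfo.blockSize = 64ull << 20;   // explicit 64 MiB blocks (0 would mean "preferred size")
    poolInfo.minBlockCount = 1;         // keep one block alive permanently
    poolInfo.maxBlockCount = 4;
    poolInfo.priority = 0.5f;           // only used with VK_EXT_memory_priority

    return vmaCreatePool(allocator, &poolInfo, pPool);
}
```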
- VMA_ASSERT(VMA_DEBUG_MARGIN % sizeof(uint32_t) == 0); - } - - VMA_ASSERT(pCreateInfo->physicalDevice && pCreateInfo->device && pCreateInfo->instance); - - if(m_VulkanApiVersion < VK_MAKE_VERSION(1, 1, 0)) - { -#if !(VMA_DEDICATED_ALLOCATION) - if((pCreateInfo->flags & VMA_ALLOCATOR_CREATE_KHR_DEDICATED_ALLOCATION_BIT) != 0) - { - VMA_ASSERT(0 && "VMA_ALLOCATOR_CREATE_KHR_DEDICATED_ALLOCATION_BIT set but required extensions are disabled by preprocessor macros."); - } -#endif -#if !(VMA_BIND_MEMORY2) - if((pCreateInfo->flags & VMA_ALLOCATOR_CREATE_KHR_BIND_MEMORY2_BIT) != 0) - { - VMA_ASSERT(0 && "VMA_ALLOCATOR_CREATE_KHR_BIND_MEMORY2_BIT set but required extension is disabled by preprocessor macros."); - } -#endif - } -#if !(VMA_MEMORY_BUDGET) - if((pCreateInfo->flags & VMA_ALLOCATOR_CREATE_EXT_MEMORY_BUDGET_BIT) != 0) - { - VMA_ASSERT(0 && "VMA_ALLOCATOR_CREATE_EXT_MEMORY_BUDGET_BIT set but required extension is disabled by preprocessor macros."); - } -#endif -#if !(VMA_BUFFER_DEVICE_ADDRESS) - if(m_UseKhrBufferDeviceAddress) - { - VMA_ASSERT(0 && "VMA_ALLOCATOR_CREATE_BUFFER_DEVICE_ADDRESS_BIT is set but required extension or Vulkan 1.2 is not available in your Vulkan header or its support in VMA has been disabled by a preprocessor macro."); - } -#endif -#if VMA_VULKAN_VERSION < 1002000 - if(m_VulkanApiVersion >= VK_MAKE_VERSION(1, 2, 0)) - { - VMA_ASSERT(0 && "vulkanApiVersion >= VK_API_VERSION_1_2 but required Vulkan version is disabled by preprocessor macros."); - } -#endif -#if VMA_VULKAN_VERSION < 1001000 - if(m_VulkanApiVersion >= VK_MAKE_VERSION(1, 1, 0)) - { - VMA_ASSERT(0 && "vulkanApiVersion >= VK_API_VERSION_1_1 but required Vulkan version is disabled by preprocessor macros."); - } -#endif -#if !(VMA_MEMORY_PRIORITY) - if(m_UseExtMemoryPriority) - { - VMA_ASSERT(0 && "VMA_ALLOCATOR_CREATE_EXT_MEMORY_PRIORITY_BIT is set but required extension is not available in your Vulkan header or its support in VMA has been disabled by a preprocessor macro."); - } -#endif - - memset(&m_DeviceMemoryCallbacks, 0 ,sizeof(m_DeviceMemoryCallbacks)); - memset(&m_PhysicalDeviceProperties, 0, sizeof(m_PhysicalDeviceProperties)); - memset(&m_MemProps, 0, sizeof(m_MemProps)); - - memset(&m_pBlockVectors, 0, sizeof(m_pBlockVectors)); - memset(&m_VulkanFunctions, 0, sizeof(m_VulkanFunctions)); - -#if VMA_EXTERNAL_MEMORY - memset(&m_TypeExternalMemoryHandleTypes, 0, sizeof(m_TypeExternalMemoryHandleTypes)); -#endif // #if VMA_EXTERNAL_MEMORY - - if(pCreateInfo->pDeviceMemoryCallbacks != VMA_NULL) - { - m_DeviceMemoryCallbacks.pUserData = pCreateInfo->pDeviceMemoryCallbacks->pUserData; - m_DeviceMemoryCallbacks.pfnAllocate = pCreateInfo->pDeviceMemoryCallbacks->pfnAllocate; - m_DeviceMemoryCallbacks.pfnFree = pCreateInfo->pDeviceMemoryCallbacks->pfnFree; - } - - ImportVulkanFunctions(pCreateInfo->pVulkanFunctions); - - (*m_VulkanFunctions.vkGetPhysicalDeviceProperties)(m_PhysicalDevice, &m_PhysicalDeviceProperties); - (*m_VulkanFunctions.vkGetPhysicalDeviceMemoryProperties)(m_PhysicalDevice, &m_MemProps); - - VMA_ASSERT(VmaIsPow2(VMA_MIN_ALIGNMENT)); - VMA_ASSERT(VmaIsPow2(VMA_DEBUG_MIN_BUFFER_IMAGE_GRANULARITY)); - VMA_ASSERT(VmaIsPow2(m_PhysicalDeviceProperties.limits.bufferImageGranularity)); - VMA_ASSERT(VmaIsPow2(m_PhysicalDeviceProperties.limits.nonCoherentAtomSize)); - - m_PreferredLargeHeapBlockSize = (pCreateInfo->preferredLargeHeapBlockSize != 0) ? 
- pCreateInfo->preferredLargeHeapBlockSize : static_cast(VMA_DEFAULT_LARGE_HEAP_BLOCK_SIZE); - - m_GlobalMemoryTypeBits = CalculateGlobalMemoryTypeBits(); - -#if VMA_EXTERNAL_MEMORY - if(pCreateInfo->pTypeExternalMemoryHandleTypes != VMA_NULL) - { - memcpy(m_TypeExternalMemoryHandleTypes, pCreateInfo->pTypeExternalMemoryHandleTypes, - sizeof(VkExternalMemoryHandleTypeFlagsKHR) * GetMemoryTypeCount()); - } -#endif // #if VMA_EXTERNAL_MEMORY - - if(pCreateInfo->pHeapSizeLimit != VMA_NULL) - { - for(uint32_t heapIndex = 0; heapIndex < GetMemoryHeapCount(); ++heapIndex) - { - const VkDeviceSize limit = pCreateInfo->pHeapSizeLimit[heapIndex]; - if(limit != VK_WHOLE_SIZE) - { - m_HeapSizeLimitMask |= 1u << heapIndex; - if(limit < m_MemProps.memoryHeaps[heapIndex].size) - { - m_MemProps.memoryHeaps[heapIndex].size = limit; - } - } - } - } - - for(uint32_t memTypeIndex = 0; memTypeIndex < GetMemoryTypeCount(); ++memTypeIndex) - { - // Create only supported types - if((m_GlobalMemoryTypeBits & (1u << memTypeIndex)) != 0) - { - const VkDeviceSize preferredBlockSize = CalcPreferredBlockSize(memTypeIndex); - m_pBlockVectors[memTypeIndex] = vma_new(this, VmaBlockVector)( - this, - VK_NULL_HANDLE, // hParentPool - memTypeIndex, - preferredBlockSize, - 0, - SIZE_MAX, - GetBufferImageGranularity(), - false, // explicitBlockSize - 0, // algorithm - 0.5f, // priority (0.5 is the default per Vulkan spec) - GetMemoryTypeMinAlignment(memTypeIndex), // minAllocationAlignment - VMA_NULL); // // pMemoryAllocateNext - // No need to call m_pBlockVectors[memTypeIndex][blockVectorTypeIndex]->CreateMinBlocks here, - // becase minBlockCount is 0. - } - } -} - -VkResult VmaAllocator_T::Init(const VmaAllocatorCreateInfo* pCreateInfo) -{ - VkResult res = VK_SUCCESS; - -#if VMA_MEMORY_BUDGET - if(m_UseExtMemoryBudget) - { - UpdateVulkanBudget(); - } -#endif // #if VMA_MEMORY_BUDGET - - return res; -} - -VmaAllocator_T::~VmaAllocator_T() -{ - VMA_ASSERT(m_Pools.IsEmpty()); - - for(size_t memTypeIndex = GetMemoryTypeCount(); memTypeIndex--; ) - { - vma_delete(this, m_pBlockVectors[memTypeIndex]); - } -} - -void VmaAllocator_T::ImportVulkanFunctions(const VmaVulkanFunctions* pVulkanFunctions) -{ -#if VMA_STATIC_VULKAN_FUNCTIONS == 1 - ImportVulkanFunctions_Static(); -#endif - - if(pVulkanFunctions != VMA_NULL) - { - ImportVulkanFunctions_Custom(pVulkanFunctions); - } - -#if VMA_DYNAMIC_VULKAN_FUNCTIONS == 1 - ImportVulkanFunctions_Dynamic(); -#endif - - ValidateVulkanFunctions(); -} - -#if VMA_STATIC_VULKAN_FUNCTIONS == 1 - -void VmaAllocator_T::ImportVulkanFunctions_Static() -{ - // Vulkan 1.0 - m_VulkanFunctions.vkGetInstanceProcAddr = (PFN_vkGetInstanceProcAddr)vkGetInstanceProcAddr; - m_VulkanFunctions.vkGetDeviceProcAddr = (PFN_vkGetDeviceProcAddr)vkGetDeviceProcAddr; - m_VulkanFunctions.vkGetPhysicalDeviceProperties = (PFN_vkGetPhysicalDeviceProperties)vkGetPhysicalDeviceProperties; - m_VulkanFunctions.vkGetPhysicalDeviceMemoryProperties = (PFN_vkGetPhysicalDeviceMemoryProperties)vkGetPhysicalDeviceMemoryProperties; - m_VulkanFunctions.vkAllocateMemory = (PFN_vkAllocateMemory)vkAllocateMemory; - m_VulkanFunctions.vkFreeMemory = (PFN_vkFreeMemory)vkFreeMemory; - m_VulkanFunctions.vkMapMemory = (PFN_vkMapMemory)vkMapMemory; - m_VulkanFunctions.vkUnmapMemory = (PFN_vkUnmapMemory)vkUnmapMemory; - m_VulkanFunctions.vkFlushMappedMemoryRanges = (PFN_vkFlushMappedMemoryRanges)vkFlushMappedMemoryRanges; - m_VulkanFunctions.vkInvalidateMappedMemoryRanges = (PFN_vkInvalidateMappedMemoryRanges)vkInvalidateMappedMemoryRanges; - 
m_VulkanFunctions.vkBindBufferMemory = (PFN_vkBindBufferMemory)vkBindBufferMemory; - m_VulkanFunctions.vkBindImageMemory = (PFN_vkBindImageMemory)vkBindImageMemory; - m_VulkanFunctions.vkGetBufferMemoryRequirements = (PFN_vkGetBufferMemoryRequirements)vkGetBufferMemoryRequirements; - m_VulkanFunctions.vkGetImageMemoryRequirements = (PFN_vkGetImageMemoryRequirements)vkGetImageMemoryRequirements; - m_VulkanFunctions.vkCreateBuffer = (PFN_vkCreateBuffer)vkCreateBuffer; - m_VulkanFunctions.vkDestroyBuffer = (PFN_vkDestroyBuffer)vkDestroyBuffer; - m_VulkanFunctions.vkCreateImage = (PFN_vkCreateImage)vkCreateImage; - m_VulkanFunctions.vkDestroyImage = (PFN_vkDestroyImage)vkDestroyImage; - m_VulkanFunctions.vkCmdCopyBuffer = (PFN_vkCmdCopyBuffer)vkCmdCopyBuffer; - - // Vulkan 1.1 -#if VMA_VULKAN_VERSION >= 1001000 - if(m_VulkanApiVersion >= VK_MAKE_VERSION(1, 1, 0)) - { - m_VulkanFunctions.vkGetBufferMemoryRequirements2KHR = (PFN_vkGetBufferMemoryRequirements2)vkGetBufferMemoryRequirements2; - m_VulkanFunctions.vkGetImageMemoryRequirements2KHR = (PFN_vkGetImageMemoryRequirements2)vkGetImageMemoryRequirements2; - m_VulkanFunctions.vkBindBufferMemory2KHR = (PFN_vkBindBufferMemory2)vkBindBufferMemory2; - m_VulkanFunctions.vkBindImageMemory2KHR = (PFN_vkBindImageMemory2)vkBindImageMemory2; - m_VulkanFunctions.vkGetPhysicalDeviceMemoryProperties2KHR = (PFN_vkGetPhysicalDeviceMemoryProperties2)vkGetPhysicalDeviceMemoryProperties2; - } -#endif - -#if VMA_VULKAN_VERSION >= 1003000 - if(m_VulkanApiVersion >= VK_MAKE_VERSION(1, 3, 0)) - { - m_VulkanFunctions.vkGetDeviceBufferMemoryRequirements = (PFN_vkGetDeviceBufferMemoryRequirements)vkGetDeviceBufferMemoryRequirements; - m_VulkanFunctions.vkGetDeviceImageMemoryRequirements = (PFN_vkGetDeviceImageMemoryRequirements)vkGetDeviceImageMemoryRequirements; - } -#endif -} - -#endif // VMA_STATIC_VULKAN_FUNCTIONS == 1 - -void VmaAllocator_T::ImportVulkanFunctions_Custom(const VmaVulkanFunctions* pVulkanFunctions) -{ - VMA_ASSERT(pVulkanFunctions != VMA_NULL); - -#define VMA_COPY_IF_NOT_NULL(funcName) \ - if(pVulkanFunctions->funcName != VMA_NULL) m_VulkanFunctions.funcName = pVulkanFunctions->funcName; - - VMA_COPY_IF_NOT_NULL(vkGetInstanceProcAddr); - VMA_COPY_IF_NOT_NULL(vkGetDeviceProcAddr); - VMA_COPY_IF_NOT_NULL(vkGetPhysicalDeviceProperties); - VMA_COPY_IF_NOT_NULL(vkGetPhysicalDeviceMemoryProperties); - VMA_COPY_IF_NOT_NULL(vkAllocateMemory); - VMA_COPY_IF_NOT_NULL(vkFreeMemory); - VMA_COPY_IF_NOT_NULL(vkMapMemory); - VMA_COPY_IF_NOT_NULL(vkUnmapMemory); - VMA_COPY_IF_NOT_NULL(vkFlushMappedMemoryRanges); - VMA_COPY_IF_NOT_NULL(vkInvalidateMappedMemoryRanges); - VMA_COPY_IF_NOT_NULL(vkBindBufferMemory); - VMA_COPY_IF_NOT_NULL(vkBindImageMemory); - VMA_COPY_IF_NOT_NULL(vkGetBufferMemoryRequirements); - VMA_COPY_IF_NOT_NULL(vkGetImageMemoryRequirements); - VMA_COPY_IF_NOT_NULL(vkCreateBuffer); - VMA_COPY_IF_NOT_NULL(vkDestroyBuffer); - VMA_COPY_IF_NOT_NULL(vkCreateImage); - VMA_COPY_IF_NOT_NULL(vkDestroyImage); - VMA_COPY_IF_NOT_NULL(vkCmdCopyBuffer); - -#if VMA_DEDICATED_ALLOCATION || VMA_VULKAN_VERSION >= 1001000 - VMA_COPY_IF_NOT_NULL(vkGetBufferMemoryRequirements2KHR); - VMA_COPY_IF_NOT_NULL(vkGetImageMemoryRequirements2KHR); -#endif - -#if VMA_BIND_MEMORY2 || VMA_VULKAN_VERSION >= 1001000 - VMA_COPY_IF_NOT_NULL(vkBindBufferMemory2KHR); - VMA_COPY_IF_NOT_NULL(vkBindImageMemory2KHR); -#endif - -#if VMA_MEMORY_BUDGET - VMA_COPY_IF_NOT_NULL(vkGetPhysicalDeviceMemoryProperties2KHR); -#endif - -#if VMA_VULKAN_VERSION >= 1003000 - 
VMA_COPY_IF_NOT_NULL(vkGetDeviceBufferMemoryRequirements); - VMA_COPY_IF_NOT_NULL(vkGetDeviceImageMemoryRequirements); -#endif - -#undef VMA_COPY_IF_NOT_NULL -} - -#if VMA_DYNAMIC_VULKAN_FUNCTIONS == 1 - -void VmaAllocator_T::ImportVulkanFunctions_Dynamic() -{ - VMA_ASSERT(m_VulkanFunctions.vkGetInstanceProcAddr && m_VulkanFunctions.vkGetDeviceProcAddr && - "To use VMA_DYNAMIC_VULKAN_FUNCTIONS in new versions of VMA you now have to pass " - "VmaVulkanFunctions::vkGetInstanceProcAddr and vkGetDeviceProcAddr as VmaAllocatorCreateInfo::pVulkanFunctions. " - "Other members can be null."); - -#define VMA_FETCH_INSTANCE_FUNC(memberName, functionPointerType, functionNameString) \ - if(m_VulkanFunctions.memberName == VMA_NULL) \ - m_VulkanFunctions.memberName = \ - (functionPointerType)m_VulkanFunctions.vkGetInstanceProcAddr(m_hInstance, functionNameString); -#define VMA_FETCH_DEVICE_FUNC(memberName, functionPointerType, functionNameString) \ - if(m_VulkanFunctions.memberName == VMA_NULL) \ - m_VulkanFunctions.memberName = \ - (functionPointerType)m_VulkanFunctions.vkGetDeviceProcAddr(m_hDevice, functionNameString); - - VMA_FETCH_INSTANCE_FUNC(vkGetPhysicalDeviceProperties, PFN_vkGetPhysicalDeviceProperties, "vkGetPhysicalDeviceProperties"); - VMA_FETCH_INSTANCE_FUNC(vkGetPhysicalDeviceMemoryProperties, PFN_vkGetPhysicalDeviceMemoryProperties, "vkGetPhysicalDeviceMemoryProperties"); - VMA_FETCH_DEVICE_FUNC(vkAllocateMemory, PFN_vkAllocateMemory, "vkAllocateMemory"); - VMA_FETCH_DEVICE_FUNC(vkFreeMemory, PFN_vkFreeMemory, "vkFreeMemory"); - VMA_FETCH_DEVICE_FUNC(vkMapMemory, PFN_vkMapMemory, "vkMapMemory"); - VMA_FETCH_DEVICE_FUNC(vkUnmapMemory, PFN_vkUnmapMemory, "vkUnmapMemory"); - VMA_FETCH_DEVICE_FUNC(vkFlushMappedMemoryRanges, PFN_vkFlushMappedMemoryRanges, "vkFlushMappedMemoryRanges"); - VMA_FETCH_DEVICE_FUNC(vkInvalidateMappedMemoryRanges, PFN_vkInvalidateMappedMemoryRanges, "vkInvalidateMappedMemoryRanges"); - VMA_FETCH_DEVICE_FUNC(vkBindBufferMemory, PFN_vkBindBufferMemory, "vkBindBufferMemory"); - VMA_FETCH_DEVICE_FUNC(vkBindImageMemory, PFN_vkBindImageMemory, "vkBindImageMemory"); - VMA_FETCH_DEVICE_FUNC(vkGetBufferMemoryRequirements, PFN_vkGetBufferMemoryRequirements, "vkGetBufferMemoryRequirements"); - VMA_FETCH_DEVICE_FUNC(vkGetImageMemoryRequirements, PFN_vkGetImageMemoryRequirements, "vkGetImageMemoryRequirements"); - VMA_FETCH_DEVICE_FUNC(vkCreateBuffer, PFN_vkCreateBuffer, "vkCreateBuffer"); - VMA_FETCH_DEVICE_FUNC(vkDestroyBuffer, PFN_vkDestroyBuffer, "vkDestroyBuffer"); - VMA_FETCH_DEVICE_FUNC(vkCreateImage, PFN_vkCreateImage, "vkCreateImage"); - VMA_FETCH_DEVICE_FUNC(vkDestroyImage, PFN_vkDestroyImage, "vkDestroyImage"); - VMA_FETCH_DEVICE_FUNC(vkCmdCopyBuffer, PFN_vkCmdCopyBuffer, "vkCmdCopyBuffer"); - -#if VMA_VULKAN_VERSION >= 1001000 - if(m_VulkanApiVersion >= VK_MAKE_VERSION(1, 1, 0)) - { - VMA_FETCH_DEVICE_FUNC(vkGetBufferMemoryRequirements2KHR, PFN_vkGetBufferMemoryRequirements2, "vkGetBufferMemoryRequirements2"); - VMA_FETCH_DEVICE_FUNC(vkGetImageMemoryRequirements2KHR, PFN_vkGetImageMemoryRequirements2, "vkGetImageMemoryRequirements2"); - VMA_FETCH_DEVICE_FUNC(vkBindBufferMemory2KHR, PFN_vkBindBufferMemory2, "vkBindBufferMemory2"); - VMA_FETCH_DEVICE_FUNC(vkBindImageMemory2KHR, PFN_vkBindImageMemory2, "vkBindImageMemory2"); - VMA_FETCH_INSTANCE_FUNC(vkGetPhysicalDeviceMemoryProperties2KHR, PFN_vkGetPhysicalDeviceMemoryProperties2, "vkGetPhysicalDeviceMemoryProperties2"); - } -#endif - -#if VMA_DEDICATED_ALLOCATION - if(m_UseKhrDedicatedAllocation) - { - 
VMA_FETCH_DEVICE_FUNC(vkGetBufferMemoryRequirements2KHR, PFN_vkGetBufferMemoryRequirements2KHR, "vkGetBufferMemoryRequirements2KHR"); - VMA_FETCH_DEVICE_FUNC(vkGetImageMemoryRequirements2KHR, PFN_vkGetImageMemoryRequirements2KHR, "vkGetImageMemoryRequirements2KHR"); - } -#endif - -#if VMA_BIND_MEMORY2 - if(m_UseKhrBindMemory2) - { - VMA_FETCH_DEVICE_FUNC(vkBindBufferMemory2KHR, PFN_vkBindBufferMemory2KHR, "vkBindBufferMemory2KHR"); - VMA_FETCH_DEVICE_FUNC(vkBindImageMemory2KHR, PFN_vkBindImageMemory2KHR, "vkBindImageMemory2KHR"); - } -#endif // #if VMA_BIND_MEMORY2 - -#if VMA_MEMORY_BUDGET - if(m_UseExtMemoryBudget) - { - VMA_FETCH_INSTANCE_FUNC(vkGetPhysicalDeviceMemoryProperties2KHR, PFN_vkGetPhysicalDeviceMemoryProperties2KHR, "vkGetPhysicalDeviceMemoryProperties2KHR"); - } -#endif // #if VMA_MEMORY_BUDGET - -#if VMA_VULKAN_VERSION >= 1003000 - if(m_VulkanApiVersion >= VK_MAKE_VERSION(1, 3, 0)) - { - VMA_FETCH_DEVICE_FUNC(vkGetDeviceBufferMemoryRequirements, PFN_vkGetDeviceBufferMemoryRequirements, "vkGetDeviceBufferMemoryRequirements"); - VMA_FETCH_DEVICE_FUNC(vkGetDeviceImageMemoryRequirements, PFN_vkGetDeviceImageMemoryRequirements, "vkGetDeviceImageMemoryRequirements"); - } -#endif - -#undef VMA_FETCH_DEVICE_FUNC -#undef VMA_FETCH_INSTANCE_FUNC -} - -#endif // VMA_DYNAMIC_VULKAN_FUNCTIONS == 1 - -void VmaAllocator_T::ValidateVulkanFunctions() -{ - VMA_ASSERT(m_VulkanFunctions.vkGetPhysicalDeviceProperties != VMA_NULL); - VMA_ASSERT(m_VulkanFunctions.vkGetPhysicalDeviceMemoryProperties != VMA_NULL); - VMA_ASSERT(m_VulkanFunctions.vkAllocateMemory != VMA_NULL); - VMA_ASSERT(m_VulkanFunctions.vkFreeMemory != VMA_NULL); - VMA_ASSERT(m_VulkanFunctions.vkMapMemory != VMA_NULL); - VMA_ASSERT(m_VulkanFunctions.vkUnmapMemory != VMA_NULL); - VMA_ASSERT(m_VulkanFunctions.vkFlushMappedMemoryRanges != VMA_NULL); - VMA_ASSERT(m_VulkanFunctions.vkInvalidateMappedMemoryRanges != VMA_NULL); - VMA_ASSERT(m_VulkanFunctions.vkBindBufferMemory != VMA_NULL); - VMA_ASSERT(m_VulkanFunctions.vkBindImageMemory != VMA_NULL); - VMA_ASSERT(m_VulkanFunctions.vkGetBufferMemoryRequirements != VMA_NULL); - VMA_ASSERT(m_VulkanFunctions.vkGetImageMemoryRequirements != VMA_NULL); - VMA_ASSERT(m_VulkanFunctions.vkCreateBuffer != VMA_NULL); - VMA_ASSERT(m_VulkanFunctions.vkDestroyBuffer != VMA_NULL); - VMA_ASSERT(m_VulkanFunctions.vkCreateImage != VMA_NULL); - VMA_ASSERT(m_VulkanFunctions.vkDestroyImage != VMA_NULL); - VMA_ASSERT(m_VulkanFunctions.vkCmdCopyBuffer != VMA_NULL); - -#if VMA_DEDICATED_ALLOCATION || VMA_VULKAN_VERSION >= 1001000 - if(m_VulkanApiVersion >= VK_MAKE_VERSION(1, 1, 0) || m_UseKhrDedicatedAllocation) - { - VMA_ASSERT(m_VulkanFunctions.vkGetBufferMemoryRequirements2KHR != VMA_NULL); - VMA_ASSERT(m_VulkanFunctions.vkGetImageMemoryRequirements2KHR != VMA_NULL); - } -#endif - -#if VMA_BIND_MEMORY2 || VMA_VULKAN_VERSION >= 1001000 - if(m_VulkanApiVersion >= VK_MAKE_VERSION(1, 1, 0) || m_UseKhrBindMemory2) - { - VMA_ASSERT(m_VulkanFunctions.vkBindBufferMemory2KHR != VMA_NULL); - VMA_ASSERT(m_VulkanFunctions.vkBindImageMemory2KHR != VMA_NULL); - } -#endif - -#if VMA_MEMORY_BUDGET || VMA_VULKAN_VERSION >= 1001000 - if(m_UseExtMemoryBudget || m_VulkanApiVersion >= VK_MAKE_VERSION(1, 1, 0)) - { - VMA_ASSERT(m_VulkanFunctions.vkGetPhysicalDeviceMemoryProperties2KHR != VMA_NULL); - } -#endif - -#if VMA_VULKAN_VERSION >= 1003000 - if(m_VulkanApiVersion >= VK_MAKE_VERSION(1, 3, 0)) - { - VMA_ASSERT(m_VulkanFunctions.vkGetDeviceBufferMemoryRequirements != VMA_NULL); - 
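The asserts in `ValidateVulkanFunctions()` only require the extension, 1.1, or 1.3 entry points when the corresponding capability was actually requested at allocator creation. A hedged sketch of the create-time flags involved, using names from the VMA 3.x public API (the application still has to enable the matching Vulkan extensions itself):
```
// Illustrative only -- not part of this patch.
VmaAllocatorCreateInfo allocatorInfo = {};
allocatorInfo.vulkanApiVersion = VK_API_VERSION_1_0;     // 1.1+/1.3+ enable several of these implicitly
allocatorInfo.flags =
    VMA_ALLOCATOR_CREATE_KHR_DEDICATED_ALLOCATION_BIT |  // needs vkGet*MemoryRequirements2KHR
    VMA_ALLOCATOR_CREATE_KHR_BIND_MEMORY2_BIT |          // needs vkBind*Memory2KHR
    VMA_ALLOCATOR_CREATE_EXT_MEMORY_BUDGET_BIT;          // needs vkGetPhysicalDeviceMemoryProperties2KHR
```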
VMA_ASSERT(m_VulkanFunctions.vkGetDeviceImageMemoryRequirements != VMA_NULL); - } -#endif -} - -VkDeviceSize VmaAllocator_T::CalcPreferredBlockSize(uint32_t memTypeIndex) -{ - const uint32_t heapIndex = MemoryTypeIndexToHeapIndex(memTypeIndex); - const VkDeviceSize heapSize = m_MemProps.memoryHeaps[heapIndex].size; - const bool isSmallHeap = heapSize <= VMA_SMALL_HEAP_MAX_SIZE; - return VmaAlignUp(isSmallHeap ? (heapSize / 8) : m_PreferredLargeHeapBlockSize, (VkDeviceSize)32); -} - -VkResult VmaAllocator_T::AllocateMemoryOfType( - VmaPool pool, - VkDeviceSize size, - VkDeviceSize alignment, - bool dedicatedPreferred, - VkBuffer dedicatedBuffer, - VkImage dedicatedImage, - VkFlags dedicatedBufferImageUsage, - const VmaAllocationCreateInfo& createInfo, - uint32_t memTypeIndex, - VmaSuballocationType suballocType, - VmaDedicatedAllocationList& dedicatedAllocations, - VmaBlockVector& blockVector, - size_t allocationCount, - VmaAllocation* pAllocations) -{ - VMA_ASSERT(pAllocations != VMA_NULL); - VMA_DEBUG_LOG(" AllocateMemory: MemoryTypeIndex=%u, AllocationCount=%zu, Size=%llu", memTypeIndex, allocationCount, size); - - VmaAllocationCreateInfo finalCreateInfo = createInfo; - VkResult res = CalcMemTypeParams( - finalCreateInfo, - memTypeIndex, - size, - allocationCount); - if(res != VK_SUCCESS) - return res; - - if((finalCreateInfo.flags & VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT) != 0) - { - return AllocateDedicatedMemory( - pool, - size, - suballocType, - dedicatedAllocations, - memTypeIndex, - (finalCreateInfo.flags & VMA_ALLOCATION_CREATE_MAPPED_BIT) != 0, - (finalCreateInfo.flags & VMA_ALLOCATION_CREATE_USER_DATA_COPY_STRING_BIT) != 0, - (finalCreateInfo.flags & - (VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT | VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT)) != 0, - (finalCreateInfo.flags & VMA_ALLOCATION_CREATE_CAN_ALIAS_BIT) != 0, - finalCreateInfo.pUserData, - finalCreateInfo.priority, - dedicatedBuffer, - dedicatedImage, - dedicatedBufferImageUsage, - allocationCount, - pAllocations, - blockVector.GetAllocationNextPtr()); - } - else - { - const bool canAllocateDedicated = - (finalCreateInfo.flags & VMA_ALLOCATION_CREATE_NEVER_ALLOCATE_BIT) == 0 && - (pool == VK_NULL_HANDLE || !blockVector.HasExplicitBlockSize()); - - if(canAllocateDedicated) - { - // Heuristics: Allocate dedicated memory if requested size if greater than half of preferred block size. - if(size > blockVector.GetPreferredBlockSize() / 2) - { - dedicatedPreferred = true; - } - // Protection against creating each allocation as dedicated when we reach or exceed heap size/budget, - // which can quickly deplete maxMemoryAllocationCount: Don't prefer dedicated allocations when above - // 3/4 of the maximum allocation count. 
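            // Illustrative numbers (not from the source): with the common limit
            // maxMemoryAllocationCount = 4096, the check below stops preferring
            // dedicated allocations once more than 3072 VkDeviceMemory objects are
            // live, so further large requests are suballocated from existing blocks
            // rather than consuming the remaining vkAllocateMemory budget.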
- if(m_DeviceMemoryCount.load() > m_PhysicalDeviceProperties.limits.maxMemoryAllocationCount * 3 / 4) - { - dedicatedPreferred = false; - } - - if(dedicatedPreferred) - { - res = AllocateDedicatedMemory( - pool, - size, - suballocType, - dedicatedAllocations, - memTypeIndex, - (finalCreateInfo.flags & VMA_ALLOCATION_CREATE_MAPPED_BIT) != 0, - (finalCreateInfo.flags & VMA_ALLOCATION_CREATE_USER_DATA_COPY_STRING_BIT) != 0, - (finalCreateInfo.flags & - (VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT | VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT)) != 0, - (finalCreateInfo.flags & VMA_ALLOCATION_CREATE_CAN_ALIAS_BIT) != 0, - finalCreateInfo.pUserData, - finalCreateInfo.priority, - dedicatedBuffer, - dedicatedImage, - dedicatedBufferImageUsage, - allocationCount, - pAllocations, - blockVector.GetAllocationNextPtr()); - if(res == VK_SUCCESS) - { - // Succeeded: AllocateDedicatedMemory function already filld pMemory, nothing more to do here. - VMA_DEBUG_LOG(" Allocated as DedicatedMemory"); - return VK_SUCCESS; - } - } - } - - res = blockVector.Allocate( - size, - alignment, - finalCreateInfo, - suballocType, - allocationCount, - pAllocations); - if(res == VK_SUCCESS) - return VK_SUCCESS; - - // Try dedicated memory. - if(canAllocateDedicated && !dedicatedPreferred) - { - res = AllocateDedicatedMemory( - pool, - size, - suballocType, - dedicatedAllocations, - memTypeIndex, - (finalCreateInfo.flags & VMA_ALLOCATION_CREATE_MAPPED_BIT) != 0, - (finalCreateInfo.flags & VMA_ALLOCATION_CREATE_USER_DATA_COPY_STRING_BIT) != 0, - (finalCreateInfo.flags & - (VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT | VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT)) != 0, - (finalCreateInfo.flags & VMA_ALLOCATION_CREATE_CAN_ALIAS_BIT) != 0, - finalCreateInfo.pUserData, - finalCreateInfo.priority, - dedicatedBuffer, - dedicatedImage, - dedicatedBufferImageUsage, - allocationCount, - pAllocations, - blockVector.GetAllocationNextPtr()); - if(res == VK_SUCCESS) - { - // Succeeded: AllocateDedicatedMemory function already filld pMemory, nothing more to do here. - VMA_DEBUG_LOG(" Allocated as DedicatedMemory"); - return VK_SUCCESS; - } - } - // Everything failed: Return error code. 
- VMA_DEBUG_LOG(" vkAllocateMemory FAILED"); - return res; - } -} - -VkResult VmaAllocator_T::AllocateDedicatedMemory( - VmaPool pool, - VkDeviceSize size, - VmaSuballocationType suballocType, - VmaDedicatedAllocationList& dedicatedAllocations, - uint32_t memTypeIndex, - bool map, - bool isUserDataString, - bool isMappingAllowed, - bool canAliasMemory, - void* pUserData, - float priority, - VkBuffer dedicatedBuffer, - VkImage dedicatedImage, - VkFlags dedicatedBufferImageUsage, - size_t allocationCount, - VmaAllocation* pAllocations, - const void* pNextChain) -{ - VMA_ASSERT(allocationCount > 0 && pAllocations); - - VkMemoryAllocateInfo allocInfo = { VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO }; - allocInfo.memoryTypeIndex = memTypeIndex; - allocInfo.allocationSize = size; - allocInfo.pNext = pNextChain; - -#if VMA_DEDICATED_ALLOCATION || VMA_VULKAN_VERSION >= 1001000 - VkMemoryDedicatedAllocateInfoKHR dedicatedAllocInfo = { VK_STRUCTURE_TYPE_MEMORY_DEDICATED_ALLOCATE_INFO_KHR }; - if(!canAliasMemory) - { - if(m_UseKhrDedicatedAllocation || m_VulkanApiVersion >= VK_MAKE_VERSION(1, 1, 0)) - { - if(dedicatedBuffer != VK_NULL_HANDLE) - { - VMA_ASSERT(dedicatedImage == VK_NULL_HANDLE); - dedicatedAllocInfo.buffer = dedicatedBuffer; - VmaPnextChainPushFront(&allocInfo, &dedicatedAllocInfo); - } - else if(dedicatedImage != VK_NULL_HANDLE) - { - dedicatedAllocInfo.image = dedicatedImage; - VmaPnextChainPushFront(&allocInfo, &dedicatedAllocInfo); - } - } - } -#endif // #if VMA_DEDICATED_ALLOCATION || VMA_VULKAN_VERSION >= 1001000 - -#if VMA_BUFFER_DEVICE_ADDRESS - VkMemoryAllocateFlagsInfoKHR allocFlagsInfo = { VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_FLAGS_INFO_KHR }; - if(m_UseKhrBufferDeviceAddress) - { - bool canContainBufferWithDeviceAddress = true; - if(dedicatedBuffer != VK_NULL_HANDLE) - { - canContainBufferWithDeviceAddress = dedicatedBufferImageUsage == UINT32_MAX || // Usage flags unknown - (dedicatedBufferImageUsage & VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT_EXT) != 0; - } - else if(dedicatedImage != VK_NULL_HANDLE) - { - canContainBufferWithDeviceAddress = false; - } - if(canContainBufferWithDeviceAddress) - { - allocFlagsInfo.flags = VK_MEMORY_ALLOCATE_DEVICE_ADDRESS_BIT_KHR; - VmaPnextChainPushFront(&allocInfo, &allocFlagsInfo); - } - } -#endif // #if VMA_BUFFER_DEVICE_ADDRESS - -#if VMA_MEMORY_PRIORITY - VkMemoryPriorityAllocateInfoEXT priorityInfo = { VK_STRUCTURE_TYPE_MEMORY_PRIORITY_ALLOCATE_INFO_EXT }; - if(m_UseExtMemoryPriority) - { - VMA_ASSERT(priority >= 0.f && priority <= 1.f); - priorityInfo.priority = priority; - VmaPnextChainPushFront(&allocInfo, &priorityInfo); - } -#endif // #if VMA_MEMORY_PRIORITY - -#if VMA_EXTERNAL_MEMORY - // Attach VkExportMemoryAllocateInfoKHR if necessary. 
- VkExportMemoryAllocateInfoKHR exportMemoryAllocInfo = { VK_STRUCTURE_TYPE_EXPORT_MEMORY_ALLOCATE_INFO_KHR }; - exportMemoryAllocInfo.handleTypes = GetExternalMemoryHandleTypeFlags(memTypeIndex); - if(exportMemoryAllocInfo.handleTypes != 0) - { - VmaPnextChainPushFront(&allocInfo, &exportMemoryAllocInfo); - } -#endif // #if VMA_EXTERNAL_MEMORY - - size_t allocIndex; - VkResult res = VK_SUCCESS; - for(allocIndex = 0; allocIndex < allocationCount; ++allocIndex) - { - res = AllocateDedicatedMemoryPage( - pool, - size, - suballocType, - memTypeIndex, - allocInfo, - map, - isUserDataString, - isMappingAllowed, - pUserData, - pAllocations + allocIndex); - if(res != VK_SUCCESS) - { - break; - } - } - - if(res == VK_SUCCESS) - { - for (allocIndex = 0; allocIndex < allocationCount; ++allocIndex) - { - dedicatedAllocations.Register(pAllocations[allocIndex]); - } - VMA_DEBUG_LOG(" Allocated DedicatedMemory Count=%zu, MemoryTypeIndex=#%u", allocationCount, memTypeIndex); - } - else - { - // Free all already created allocations. - while(allocIndex--) - { - VmaAllocation currAlloc = pAllocations[allocIndex]; - VkDeviceMemory hMemory = currAlloc->GetMemory(); - - /* - There is no need to call this, because Vulkan spec allows to skip vkUnmapMemory - before vkFreeMemory. - - if(currAlloc->GetMappedData() != VMA_NULL) - { - (*m_VulkanFunctions.vkUnmapMemory)(m_hDevice, hMemory); - } - */ - - FreeVulkanMemory(memTypeIndex, currAlloc->GetSize(), hMemory); - m_Budget.RemoveAllocation(MemoryTypeIndexToHeapIndex(memTypeIndex), currAlloc->GetSize()); - m_AllocationObjectAllocator.Free(currAlloc); - } - - memset(pAllocations, 0, sizeof(VmaAllocation) * allocationCount); - } - - return res; -} - -VkResult VmaAllocator_T::AllocateDedicatedMemoryPage( - VmaPool pool, - VkDeviceSize size, - VmaSuballocationType suballocType, - uint32_t memTypeIndex, - const VkMemoryAllocateInfo& allocInfo, - bool map, - bool isUserDataString, - bool isMappingAllowed, - void* pUserData, - VmaAllocation* pAllocation) -{ - VkDeviceMemory hMemory = VK_NULL_HANDLE; - VkResult res = AllocateVulkanMemory(&allocInfo, &hMemory); - if(res < 0) - { - VMA_DEBUG_LOG(" vkAllocateMemory FAILED"); - return res; - } - - void* pMappedData = VMA_NULL; - if(map) - { - res = (*m_VulkanFunctions.vkMapMemory)( - m_hDevice, - hMemory, - 0, - VK_WHOLE_SIZE, - 0, - &pMappedData); - if(res < 0) - { - VMA_DEBUG_LOG(" vkMapMemory FAILED"); - FreeVulkanMemory(memTypeIndex, size, hMemory); - return res; - } - } - - *pAllocation = m_AllocationObjectAllocator.Allocate(isMappingAllowed); - (*pAllocation)->InitDedicatedAllocation(pool, memTypeIndex, hMemory, suballocType, pMappedData, size); - if (isUserDataString) - (*pAllocation)->SetName(this, (const char*)pUserData); - else - (*pAllocation)->SetUserData(this, pUserData); - m_Budget.AddAllocation(MemoryTypeIndexToHeapIndex(memTypeIndex), size); - if(VMA_DEBUG_INITIALIZE_ALLOCATIONS) - { - FillAllocation(*pAllocation, VMA_ALLOCATION_FILL_PATTERN_CREATED); - } - - return VK_SUCCESS; -} - -void VmaAllocator_T::GetBufferMemoryRequirements( - VkBuffer hBuffer, - VkMemoryRequirements& memReq, - bool& requiresDedicatedAllocation, - bool& prefersDedicatedAllocation) const -{ -#if VMA_DEDICATED_ALLOCATION || VMA_VULKAN_VERSION >= 1001000 - if(m_UseKhrDedicatedAllocation || m_VulkanApiVersion >= VK_MAKE_VERSION(1, 1, 0)) - { - VkBufferMemoryRequirementsInfo2KHR memReqInfo = { VK_STRUCTURE_TYPE_BUFFER_MEMORY_REQUIREMENTS_INFO_2_KHR }; - memReqInfo.buffer = hBuffer; - - VkMemoryDedicatedRequirementsKHR memDedicatedReq = { 
VK_STRUCTURE_TYPE_MEMORY_DEDICATED_REQUIREMENTS_KHR }; - - VkMemoryRequirements2KHR memReq2 = { VK_STRUCTURE_TYPE_MEMORY_REQUIREMENTS_2_KHR }; - VmaPnextChainPushFront(&memReq2, &memDedicatedReq); - - (*m_VulkanFunctions.vkGetBufferMemoryRequirements2KHR)(m_hDevice, &memReqInfo, &memReq2); - - memReq = memReq2.memoryRequirements; - requiresDedicatedAllocation = (memDedicatedReq.requiresDedicatedAllocation != VK_FALSE); - prefersDedicatedAllocation = (memDedicatedReq.prefersDedicatedAllocation != VK_FALSE); - } - else -#endif // #if VMA_DEDICATED_ALLOCATION || VMA_VULKAN_VERSION >= 1001000 - { - (*m_VulkanFunctions.vkGetBufferMemoryRequirements)(m_hDevice, hBuffer, &memReq); - requiresDedicatedAllocation = false; - prefersDedicatedAllocation = false; - } -} - -void VmaAllocator_T::GetImageMemoryRequirements( - VkImage hImage, - VkMemoryRequirements& memReq, - bool& requiresDedicatedAllocation, - bool& prefersDedicatedAllocation) const -{ -#if VMA_DEDICATED_ALLOCATION || VMA_VULKAN_VERSION >= 1001000 - if(m_UseKhrDedicatedAllocation || m_VulkanApiVersion >= VK_MAKE_VERSION(1, 1, 0)) - { - VkImageMemoryRequirementsInfo2KHR memReqInfo = { VK_STRUCTURE_TYPE_IMAGE_MEMORY_REQUIREMENTS_INFO_2_KHR }; - memReqInfo.image = hImage; - - VkMemoryDedicatedRequirementsKHR memDedicatedReq = { VK_STRUCTURE_TYPE_MEMORY_DEDICATED_REQUIREMENTS_KHR }; - - VkMemoryRequirements2KHR memReq2 = { VK_STRUCTURE_TYPE_MEMORY_REQUIREMENTS_2_KHR }; - VmaPnextChainPushFront(&memReq2, &memDedicatedReq); - - (*m_VulkanFunctions.vkGetImageMemoryRequirements2KHR)(m_hDevice, &memReqInfo, &memReq2); - - memReq = memReq2.memoryRequirements; - requiresDedicatedAllocation = (memDedicatedReq.requiresDedicatedAllocation != VK_FALSE); - prefersDedicatedAllocation = (memDedicatedReq.prefersDedicatedAllocation != VK_FALSE); - } - else -#endif // #if VMA_DEDICATED_ALLOCATION || VMA_VULKAN_VERSION >= 1001000 - { - (*m_VulkanFunctions.vkGetImageMemoryRequirements)(m_hDevice, hImage, &memReq); - requiresDedicatedAllocation = false; - prefersDedicatedAllocation = false; - } -} - -VkResult VmaAllocator_T::FindMemoryTypeIndex( - uint32_t memoryTypeBits, - const VmaAllocationCreateInfo* pAllocationCreateInfo, - VkFlags bufImgUsage, - uint32_t* pMemoryTypeIndex) const -{ - memoryTypeBits &= GetGlobalMemoryTypeBits(); - - if(pAllocationCreateInfo->memoryTypeBits != 0) - { - memoryTypeBits &= pAllocationCreateInfo->memoryTypeBits; - } - - VkMemoryPropertyFlags requiredFlags = 0, preferredFlags = 0, notPreferredFlags = 0; - if(!FindMemoryPreferences( - IsIntegratedGpu(), - *pAllocationCreateInfo, - bufImgUsage, - requiredFlags, preferredFlags, notPreferredFlags)) - { - return VK_ERROR_FEATURE_NOT_PRESENT; - } - - *pMemoryTypeIndex = UINT32_MAX; - uint32_t minCost = UINT32_MAX; - for(uint32_t memTypeIndex = 0, memTypeBit = 1; - memTypeIndex < GetMemoryTypeCount(); - ++memTypeIndex, memTypeBit <<= 1) - { - // This memory type is acceptable according to memoryTypeBits bitmask. - if((memTypeBit & memoryTypeBits) != 0) - { - const VkMemoryPropertyFlags currFlags = - m_MemProps.memoryTypes[memTypeIndex].propertyFlags; - // This memory type contains requiredFlags. - if((requiredFlags & ~currFlags) == 0) - { - // Calculate cost as number of bits from preferredFlags not present in this memory type. - uint32_t currCost = VMA_COUNT_BITS_SET(preferredFlags & ~currFlags) + - VMA_COUNT_BITS_SET(currFlags & notPreferredFlags); - // Remember memory type with lowest cost. 
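                // Worked example (flag values hypothetical): with requiredFlags = HOST_VISIBLE,
                // preferredFlags = HOST_COHERENT | HOST_CACHED and notPreferredFlags = DEVICE_LOCAL,
                // a HOST_VISIBLE|HOST_COHERENT type scores 1 (HOST_CACHED missing), while a
                // HOST_VISIBLE|HOST_COHERENT|DEVICE_LOCAL type scores 1 + 1 = 2, so the former is
                // kept below unless some type scores 0, which short-circuits with VK_SUCCESS.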
- if(currCost < minCost) - { - *pMemoryTypeIndex = memTypeIndex; - if(currCost == 0) - { - return VK_SUCCESS; - } - minCost = currCost; - } - } - } - } - return (*pMemoryTypeIndex != UINT32_MAX) ? VK_SUCCESS : VK_ERROR_FEATURE_NOT_PRESENT; -} - -VkResult VmaAllocator_T::CalcMemTypeParams( - VmaAllocationCreateInfo& inoutCreateInfo, - uint32_t memTypeIndex, - VkDeviceSize size, - size_t allocationCount) -{ - // If memory type is not HOST_VISIBLE, disable MAPPED. - if((inoutCreateInfo.flags & VMA_ALLOCATION_CREATE_MAPPED_BIT) != 0 && - (m_MemProps.memoryTypes[memTypeIndex].propertyFlags & VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT) == 0) - { - inoutCreateInfo.flags &= ~VMA_ALLOCATION_CREATE_MAPPED_BIT; - } - - if((inoutCreateInfo.flags & VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT) != 0 && - (inoutCreateInfo.flags & VMA_ALLOCATION_CREATE_WITHIN_BUDGET_BIT) != 0) - { - const uint32_t heapIndex = MemoryTypeIndexToHeapIndex(memTypeIndex); - VmaBudget heapBudget = {}; - GetHeapBudgets(&heapBudget, heapIndex, 1); - if(heapBudget.usage + size * allocationCount > heapBudget.budget) - { - return VK_ERROR_OUT_OF_DEVICE_MEMORY; - } - } - return VK_SUCCESS; -} - -VkResult VmaAllocator_T::CalcAllocationParams( - VmaAllocationCreateInfo& inoutCreateInfo, - bool dedicatedRequired, - bool dedicatedPreferred) -{ - VMA_ASSERT((inoutCreateInfo.flags & - (VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT | VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT)) != - (VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT | VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT) && - "Specifying both flags VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT and VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT is incorrect."); - VMA_ASSERT((((inoutCreateInfo.flags & VMA_ALLOCATION_CREATE_HOST_ACCESS_ALLOW_TRANSFER_INSTEAD_BIT) == 0 || - (inoutCreateInfo.flags & (VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT | VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT)) != 0)) && - "Specifying VMA_ALLOCATION_CREATE_HOST_ACCESS_ALLOW_TRANSFER_INSTEAD_BIT requires also VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT or VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT."); - if(inoutCreateInfo.usage == VMA_MEMORY_USAGE_AUTO || inoutCreateInfo.usage == VMA_MEMORY_USAGE_AUTO_PREFER_DEVICE || inoutCreateInfo.usage == VMA_MEMORY_USAGE_AUTO_PREFER_HOST) - { - if((inoutCreateInfo.flags & VMA_ALLOCATION_CREATE_MAPPED_BIT) != 0) - { - VMA_ASSERT((inoutCreateInfo.flags & (VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT | VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT)) != 0 && - "When using VMA_ALLOCATION_CREATE_MAPPED_BIT and usage = VMA_MEMORY_USAGE_AUTO*, you must also specify VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT or VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT."); - } - } - - // If memory is lazily allocated, it should be always dedicated. 
- if(dedicatedRequired || - inoutCreateInfo.usage == VMA_MEMORY_USAGE_GPU_LAZILY_ALLOCATED) - { - inoutCreateInfo.flags |= VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT; - } - - if(inoutCreateInfo.pool != VK_NULL_HANDLE) - { - if(inoutCreateInfo.pool->m_BlockVector.HasExplicitBlockSize() && - (inoutCreateInfo.flags & VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT) != 0) - { - VMA_ASSERT(0 && "Specifying VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT while current custom pool doesn't support dedicated allocations."); - return VK_ERROR_FEATURE_NOT_PRESENT; - } - inoutCreateInfo.priority = inoutCreateInfo.pool->m_BlockVector.GetPriority(); - } - - if((inoutCreateInfo.flags & VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT) != 0 && - (inoutCreateInfo.flags & VMA_ALLOCATION_CREATE_NEVER_ALLOCATE_BIT) != 0) - { - VMA_ASSERT(0 && "Specifying VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT together with VMA_ALLOCATION_CREATE_NEVER_ALLOCATE_BIT makes no sense."); - return VK_ERROR_FEATURE_NOT_PRESENT; - } - - if(VMA_DEBUG_ALWAYS_DEDICATED_MEMORY && - (inoutCreateInfo.flags & VMA_ALLOCATION_CREATE_NEVER_ALLOCATE_BIT) != 0) - { - inoutCreateInfo.flags |= VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT; - } - - // Non-auto USAGE values imply HOST_ACCESS flags. - // And so does VMA_MEMORY_USAGE_UNKNOWN because it is used with custom pools. - // Which specific flag is used doesn't matter. They change things only when used with VMA_MEMORY_USAGE_AUTO*. - // Otherwise they just protect from assert on mapping. - if(inoutCreateInfo.usage != VMA_MEMORY_USAGE_AUTO && - inoutCreateInfo.usage != VMA_MEMORY_USAGE_AUTO_PREFER_DEVICE && - inoutCreateInfo.usage != VMA_MEMORY_USAGE_AUTO_PREFER_HOST) - { - if((inoutCreateInfo.flags & (VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT | VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT)) == 0) - { - inoutCreateInfo.flags |= VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT; - } - } - - return VK_SUCCESS; -} - -VkResult VmaAllocator_T::AllocateMemory( - const VkMemoryRequirements& vkMemReq, - bool requiresDedicatedAllocation, - bool prefersDedicatedAllocation, - VkBuffer dedicatedBuffer, - VkImage dedicatedImage, - VkFlags dedicatedBufferImageUsage, - const VmaAllocationCreateInfo& createInfo, - VmaSuballocationType suballocType, - size_t allocationCount, - VmaAllocation* pAllocations) -{ - memset(pAllocations, 0, sizeof(VmaAllocation) * allocationCount); - - VMA_ASSERT(VmaIsPow2(vkMemReq.alignment)); - - if(vkMemReq.size == 0) - { - return VK_ERROR_INITIALIZATION_FAILED; - } - - VmaAllocationCreateInfo createInfoFinal = createInfo; - VkResult res = CalcAllocationParams(createInfoFinal, requiresDedicatedAllocation, prefersDedicatedAllocation); - if(res != VK_SUCCESS) - return res; - - if(createInfoFinal.pool != VK_NULL_HANDLE) - { - VmaBlockVector& blockVector = createInfoFinal.pool->m_BlockVector; - return AllocateMemoryOfType( - createInfoFinal.pool, - vkMemReq.size, - vkMemReq.alignment, - prefersDedicatedAllocation, - dedicatedBuffer, - dedicatedImage, - dedicatedBufferImageUsage, - createInfoFinal, - blockVector.GetMemoryTypeIndex(), - suballocType, - createInfoFinal.pool->m_DedicatedAllocations, - blockVector, - allocationCount, - pAllocations); - } - else - { - // Bit mask of memory Vulkan types acceptable for this allocation. - uint32_t memoryTypeBits = vkMemReq.memoryTypeBits; - uint32_t memTypeIndex = UINT32_MAX; - res = FindMemoryTypeIndex(memoryTypeBits, &createInfoFinal, dedicatedBufferImageUsage, &memTypeIndex); - // Can't find any single memory type matching requirements. 
res is VK_ERROR_FEATURE_NOT_PRESENT. - if(res != VK_SUCCESS) - return res; - do - { - VmaBlockVector* blockVector = m_pBlockVectors[memTypeIndex]; - VMA_ASSERT(blockVector && "Trying to use unsupported memory type!"); - res = AllocateMemoryOfType( - VK_NULL_HANDLE, - vkMemReq.size, - vkMemReq.alignment, - requiresDedicatedAllocation || prefersDedicatedAllocation, - dedicatedBuffer, - dedicatedImage, - dedicatedBufferImageUsage, - createInfoFinal, - memTypeIndex, - suballocType, - m_DedicatedAllocations[memTypeIndex], - *blockVector, - allocationCount, - pAllocations); - // Allocation succeeded - if(res == VK_SUCCESS) - return VK_SUCCESS; - - // Remove old memTypeIndex from list of possibilities. - memoryTypeBits &= ~(1u << memTypeIndex); - // Find alternative memTypeIndex. - res = FindMemoryTypeIndex(memoryTypeBits, &createInfoFinal, dedicatedBufferImageUsage, &memTypeIndex); - } while(res == VK_SUCCESS); - - // No other matching memory type index could be found. - // Not returning res, which is VK_ERROR_FEATURE_NOT_PRESENT, because we already failed to allocate once. - return VK_ERROR_OUT_OF_DEVICE_MEMORY; - } -} - -void VmaAllocator_T::FreeMemory( - size_t allocationCount, - const VmaAllocation* pAllocations) -{ - VMA_ASSERT(pAllocations); - - for(size_t allocIndex = allocationCount; allocIndex--; ) - { - VmaAllocation allocation = pAllocations[allocIndex]; - - if(allocation != VK_NULL_HANDLE) - { - if(VMA_DEBUG_INITIALIZE_ALLOCATIONS) - { - FillAllocation(allocation, VMA_ALLOCATION_FILL_PATTERN_DESTROYED); - } - - allocation->FreeName(this); - - switch(allocation->GetType()) - { - case VmaAllocation_T::ALLOCATION_TYPE_BLOCK: - { - VmaBlockVector* pBlockVector = VMA_NULL; - VmaPool hPool = allocation->GetParentPool(); - if(hPool != VK_NULL_HANDLE) - { - pBlockVector = &hPool->m_BlockVector; - } - else - { - const uint32_t memTypeIndex = allocation->GetMemoryTypeIndex(); - pBlockVector = m_pBlockVectors[memTypeIndex]; - VMA_ASSERT(pBlockVector && "Trying to free memory of unsupported type!"); - } - pBlockVector->Free(allocation); - } - break; - case VmaAllocation_T::ALLOCATION_TYPE_DEDICATED: - FreeDedicatedMemory(allocation); - break; - default: - VMA_ASSERT(0); - } - } - } -} - -void VmaAllocator_T::CalculateStatistics(VmaTotalStatistics* pStats) -{ - // Initialize. - VmaClearDetailedStatistics(pStats->total); - for(uint32_t i = 0; i < VK_MAX_MEMORY_TYPES; ++i) - VmaClearDetailedStatistics(pStats->memoryType[i]); - for(uint32_t i = 0; i < VK_MAX_MEMORY_HEAPS; ++i) - VmaClearDetailedStatistics(pStats->memoryHeap[i]); - - // Process default pools. - for(uint32_t memTypeIndex = 0; memTypeIndex < GetMemoryTypeCount(); ++memTypeIndex) - { - VmaBlockVector* const pBlockVector = m_pBlockVectors[memTypeIndex]; - if (pBlockVector != VMA_NULL) - pBlockVector->AddDetailedStatistics(pStats->memoryType[memTypeIndex]); - } - - // Process custom pools. - { - VmaMutexLockRead lock(m_PoolsMutex, m_UseMutex); - for(VmaPool pool = m_Pools.Front(); pool != VMA_NULL; pool = m_Pools.GetNext(pool)) - { - VmaBlockVector& blockVector = pool->m_BlockVector; - const uint32_t memTypeIndex = blockVector.GetMemoryTypeIndex(); - blockVector.AddDetailedStatistics(pStats->memoryType[memTypeIndex]); - pool->m_DedicatedAllocations.AddDetailedStatistics(pStats->memoryType[memTypeIndex]); - } - } - - // Process dedicated allocations. 
- for(uint32_t memTypeIndex = 0; memTypeIndex < GetMemoryTypeCount(); ++memTypeIndex) - { - m_DedicatedAllocations[memTypeIndex].AddDetailedStatistics(pStats->memoryType[memTypeIndex]); - } - - // Sum from memory types to memory heaps. - for(uint32_t memTypeIndex = 0; memTypeIndex < GetMemoryTypeCount(); ++memTypeIndex) - { - const uint32_t memHeapIndex = m_MemProps.memoryTypes[memTypeIndex].heapIndex; - VmaAddDetailedStatistics(pStats->memoryHeap[memHeapIndex], pStats->memoryType[memTypeIndex]); - } - - // Sum from memory heaps to total. - for(uint32_t memHeapIndex = 0; memHeapIndex < GetMemoryHeapCount(); ++memHeapIndex) - VmaAddDetailedStatistics(pStats->total, pStats->memoryHeap[memHeapIndex]); - - VMA_ASSERT(pStats->total.statistics.allocationCount == 0 || - pStats->total.allocationSizeMax >= pStats->total.allocationSizeMin); - VMA_ASSERT(pStats->total.unusedRangeCount == 0 || - pStats->total.unusedRangeSizeMax >= pStats->total.unusedRangeSizeMin); -} - -void VmaAllocator_T::GetHeapBudgets(VmaBudget* outBudgets, uint32_t firstHeap, uint32_t heapCount) -{ -#if VMA_MEMORY_BUDGET - if(m_UseExtMemoryBudget) - { - if(m_Budget.m_OperationsSinceBudgetFetch < 30) - { - VmaMutexLockRead lockRead(m_Budget.m_BudgetMutex, m_UseMutex); - for(uint32_t i = 0; i < heapCount; ++i, ++outBudgets) - { - const uint32_t heapIndex = firstHeap + i; - - outBudgets->statistics.blockCount = m_Budget.m_BlockCount[heapIndex]; - outBudgets->statistics.allocationCount = m_Budget.m_AllocationCount[heapIndex]; - outBudgets->statistics.blockBytes = m_Budget.m_BlockBytes[heapIndex]; - outBudgets->statistics.allocationBytes = m_Budget.m_AllocationBytes[heapIndex]; - - if(m_Budget.m_VulkanUsage[heapIndex] + outBudgets->statistics.blockBytes > m_Budget.m_BlockBytesAtBudgetFetch[heapIndex]) - { - outBudgets->usage = m_Budget.m_VulkanUsage[heapIndex] + - outBudgets->statistics.blockBytes - m_Budget.m_BlockBytesAtBudgetFetch[heapIndex]; - } - else - { - outBudgets->usage = 0; - } - - // Have to take MIN with heap size because explicit HeapSizeLimit is included in it. - outBudgets->budget = VMA_MIN( - m_Budget.m_VulkanBudget[heapIndex], m_MemProps.memoryHeaps[heapIndex].size); - } - } - else - { - UpdateVulkanBudget(); // Outside of mutex lock - GetHeapBudgets(outBudgets, firstHeap, heapCount); // Recursion - } - } - else -#endif - { - for(uint32_t i = 0; i < heapCount; ++i, ++outBudgets) - { - const uint32_t heapIndex = firstHeap + i; - - outBudgets->statistics.blockCount = m_Budget.m_BlockCount[heapIndex]; - outBudgets->statistics.allocationCount = m_Budget.m_AllocationCount[heapIndex]; - outBudgets->statistics.blockBytes = m_Budget.m_BlockBytes[heapIndex]; - outBudgets->statistics.allocationBytes = m_Budget.m_AllocationBytes[heapIndex]; - - outBudgets->usage = outBudgets->statistics.blockBytes; - outBudgets->budget = m_MemProps.memoryHeaps[heapIndex].size * 8 / 10; // 80% heuristics. 
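On the application side, the numbers filled in by `GetHeapBudgets()` above surface through `vmaGetHeapBudgets()`. A minimal sketch against the VMA 3.x public API; when `VK_EXT_memory_budget` is not enabled, the budget falls back to the 80%-of-heap heuristic shown above:
```
// Illustrative only -- not part of this patch. Assumes vk_mem_alloc.h is included.
#include <cstdio>

void LogHeapBudgets(VmaAllocator allocator, const VkPhysicalDeviceMemoryProperties& memProps, uint32_t frameIndex)
{
    // Advancing the frame index also refreshes the cached VK_EXT_memory_budget numbers.
    vmaSetCurrentFrameIndex(allocator, frameIndex);

    VmaBudget budgets[VK_MAX_MEMORY_HEAPS] = {};
    vmaGetHeapBudgets(allocator, budgets);

    for (uint32_t heapIndex = 0; heapIndex < memProps.memoryHeapCount; ++heapIndex)
    {
        std::printf("heap %u: usage %llu / budget %llu (VMA block bytes: %llu)\n",
            heapIndex,
            (unsigned long long)budgets[heapIndex].usage,
            (unsigned long long)budgets[heapIndex].budget,
            (unsigned long long)budgets[heapIndex].statistics.blockBytes);
    }
}
```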
- } - } -} - -void VmaAllocator_T::GetAllocationInfo(VmaAllocation hAllocation, VmaAllocationInfo* pAllocationInfo) -{ - pAllocationInfo->memoryType = hAllocation->GetMemoryTypeIndex(); - pAllocationInfo->deviceMemory = hAllocation->GetMemory(); - pAllocationInfo->offset = hAllocation->GetOffset(); - pAllocationInfo->size = hAllocation->GetSize(); - pAllocationInfo->pMappedData = hAllocation->GetMappedData(); - pAllocationInfo->pUserData = hAllocation->GetUserData(); - pAllocationInfo->pName = hAllocation->GetName(); -} - -VkResult VmaAllocator_T::CreatePool(const VmaPoolCreateInfo* pCreateInfo, VmaPool* pPool) -{ - VMA_DEBUG_LOG(" CreatePool: MemoryTypeIndex=%u, flags=%u", pCreateInfo->memoryTypeIndex, pCreateInfo->flags); - - VmaPoolCreateInfo newCreateInfo = *pCreateInfo; - - // Protection against uninitialized new structure member. If garbage data are left there, this pointer dereference would crash. - if(pCreateInfo->pMemoryAllocateNext) - { - VMA_ASSERT(((const VkBaseInStructure*)pCreateInfo->pMemoryAllocateNext)->sType != 0); - } - - if(newCreateInfo.maxBlockCount == 0) - { - newCreateInfo.maxBlockCount = SIZE_MAX; - } - if(newCreateInfo.minBlockCount > newCreateInfo.maxBlockCount) - { - return VK_ERROR_INITIALIZATION_FAILED; - } - // Memory type index out of range or forbidden. - if(pCreateInfo->memoryTypeIndex >= GetMemoryTypeCount() || - ((1u << pCreateInfo->memoryTypeIndex) & m_GlobalMemoryTypeBits) == 0) - { - return VK_ERROR_FEATURE_NOT_PRESENT; - } - if(newCreateInfo.minAllocationAlignment > 0) - { - VMA_ASSERT(VmaIsPow2(newCreateInfo.minAllocationAlignment)); - } - - const VkDeviceSize preferredBlockSize = CalcPreferredBlockSize(newCreateInfo.memoryTypeIndex); - - *pPool = vma_new(this, VmaPool_T)(this, newCreateInfo, preferredBlockSize); - - VkResult res = (*pPool)->m_BlockVector.CreateMinBlocks(); - if(res != VK_SUCCESS) - { - vma_delete(this, *pPool); - *pPool = VMA_NULL; - return res; - } - - // Add to m_Pools. - { - VmaMutexLockWrite lock(m_PoolsMutex, m_UseMutex); - (*pPool)->SetId(m_NextPoolId++); - m_Pools.PushBack(*pPool); - } - - return VK_SUCCESS; -} - -void VmaAllocator_T::DestroyPool(VmaPool pool) -{ - // Remove from m_Pools. - { - VmaMutexLockWrite lock(m_PoolsMutex, m_UseMutex); - m_Pools.Remove(pool); - } - - vma_delete(this, pool); -} - -void VmaAllocator_T::GetPoolStatistics(VmaPool pool, VmaStatistics* pPoolStats) -{ - VmaClearStatistics(*pPoolStats); - pool->m_BlockVector.AddStatistics(*pPoolStats); - pool->m_DedicatedAllocations.AddStatistics(*pPoolStats); -} - -void VmaAllocator_T::CalculatePoolStatistics(VmaPool pool, VmaDetailedStatistics* pPoolStats) -{ - VmaClearDetailedStatistics(*pPoolStats); - pool->m_BlockVector.AddDetailedStatistics(*pPoolStats); - pool->m_DedicatedAllocations.AddDetailedStatistics(*pPoolStats); -} - -void VmaAllocator_T::SetCurrentFrameIndex(uint32_t frameIndex) -{ - m_CurrentFrameIndex.store(frameIndex); - -#if VMA_MEMORY_BUDGET - if(m_UseExtMemoryBudget) - { - UpdateVulkanBudget(); - } -#endif // #if VMA_MEMORY_BUDGET -} - -VkResult VmaAllocator_T::CheckPoolCorruption(VmaPool hPool) -{ - return hPool->m_BlockVector.CheckCorruption(); -} - -VkResult VmaAllocator_T::CheckCorruption(uint32_t memoryTypeBits) -{ - VkResult finalRes = VK_ERROR_FEATURE_NOT_PRESENT; - - // Process default pools. 
- for(uint32_t memTypeIndex = 0; memTypeIndex < GetMemoryTypeCount(); ++memTypeIndex) - { - VmaBlockVector* const pBlockVector = m_pBlockVectors[memTypeIndex]; - if(pBlockVector != VMA_NULL) - { - VkResult localRes = pBlockVector->CheckCorruption(); - switch(localRes) - { - case VK_ERROR_FEATURE_NOT_PRESENT: - break; - case VK_SUCCESS: - finalRes = VK_SUCCESS; - break; - default: - return localRes; - } - } - } - - // Process custom pools. - { - VmaMutexLockRead lock(m_PoolsMutex, m_UseMutex); - for(VmaPool pool = m_Pools.Front(); pool != VMA_NULL; pool = m_Pools.GetNext(pool)) - { - if(((1u << pool->m_BlockVector.GetMemoryTypeIndex()) & memoryTypeBits) != 0) - { - VkResult localRes = pool->m_BlockVector.CheckCorruption(); - switch(localRes) - { - case VK_ERROR_FEATURE_NOT_PRESENT: - break; - case VK_SUCCESS: - finalRes = VK_SUCCESS; - break; - default: - return localRes; - } - } - } - } - - return finalRes; -} - -VkResult VmaAllocator_T::AllocateVulkanMemory(const VkMemoryAllocateInfo* pAllocateInfo, VkDeviceMemory* pMemory) -{ - AtomicTransactionalIncrement deviceMemoryCountIncrement; - const uint64_t prevDeviceMemoryCount = deviceMemoryCountIncrement.Increment(&m_DeviceMemoryCount); -#if VMA_DEBUG_DONT_EXCEED_MAX_MEMORY_ALLOCATION_COUNT - if(prevDeviceMemoryCount >= m_PhysicalDeviceProperties.limits.maxMemoryAllocationCount) - { - return VK_ERROR_TOO_MANY_OBJECTS; - } -#endif - - const uint32_t heapIndex = MemoryTypeIndexToHeapIndex(pAllocateInfo->memoryTypeIndex); - - // HeapSizeLimit is in effect for this heap. - if((m_HeapSizeLimitMask & (1u << heapIndex)) != 0) - { - const VkDeviceSize heapSize = m_MemProps.memoryHeaps[heapIndex].size; - VkDeviceSize blockBytes = m_Budget.m_BlockBytes[heapIndex]; - for(;;) - { - const VkDeviceSize blockBytesAfterAllocation = blockBytes + pAllocateInfo->allocationSize; - if(blockBytesAfterAllocation > heapSize) - { - return VK_ERROR_OUT_OF_DEVICE_MEMORY; - } - if(m_Budget.m_BlockBytes[heapIndex].compare_exchange_strong(blockBytes, blockBytesAfterAllocation)) - { - break; - } - } - } - else - { - m_Budget.m_BlockBytes[heapIndex] += pAllocateInfo->allocationSize; - } - ++m_Budget.m_BlockCount[heapIndex]; - - // VULKAN CALL vkAllocateMemory. - VkResult res = (*m_VulkanFunctions.vkAllocateMemory)(m_hDevice, pAllocateInfo, GetAllocationCallbacks(), pMemory); - - if(res == VK_SUCCESS) - { -#if VMA_MEMORY_BUDGET - ++m_Budget.m_OperationsSinceBudgetFetch; -#endif - - // Informative callback. - if(m_DeviceMemoryCallbacks.pfnAllocate != VMA_NULL) - { - (*m_DeviceMemoryCallbacks.pfnAllocate)(this, pAllocateInfo->memoryTypeIndex, *pMemory, pAllocateInfo->allocationSize, m_DeviceMemoryCallbacks.pUserData); - } - - deviceMemoryCountIncrement.Commit(); - } - else - { - --m_Budget.m_BlockCount[heapIndex]; - m_Budget.m_BlockBytes[heapIndex] -= pAllocateInfo->allocationSize; - } - - return res; -} - -void VmaAllocator_T::FreeVulkanMemory(uint32_t memoryType, VkDeviceSize size, VkDeviceMemory hMemory) -{ - // Informative callback. - if(m_DeviceMemoryCallbacks.pfnFree != VMA_NULL) - { - (*m_DeviceMemoryCallbacks.pfnFree)(this, memoryType, hMemory, size, m_DeviceMemoryCallbacks.pUserData); - } - - // VULKAN CALL vkFreeMemory. 
- (*m_VulkanFunctions.vkFreeMemory)(m_hDevice, hMemory, GetAllocationCallbacks()); - - const uint32_t heapIndex = MemoryTypeIndexToHeapIndex(memoryType); - --m_Budget.m_BlockCount[heapIndex]; - m_Budget.m_BlockBytes[heapIndex] -= size; - - --m_DeviceMemoryCount; -} - -VkResult VmaAllocator_T::BindVulkanBuffer( - VkDeviceMemory memory, - VkDeviceSize memoryOffset, - VkBuffer buffer, - const void* pNext) -{ - if(pNext != VMA_NULL) - { -#if VMA_VULKAN_VERSION >= 1001000 || VMA_BIND_MEMORY2 - if((m_UseKhrBindMemory2 || m_VulkanApiVersion >= VK_MAKE_VERSION(1, 1, 0)) && - m_VulkanFunctions.vkBindBufferMemory2KHR != VMA_NULL) - { - VkBindBufferMemoryInfoKHR bindBufferMemoryInfo = { VK_STRUCTURE_TYPE_BIND_BUFFER_MEMORY_INFO_KHR }; - bindBufferMemoryInfo.pNext = pNext; - bindBufferMemoryInfo.buffer = buffer; - bindBufferMemoryInfo.memory = memory; - bindBufferMemoryInfo.memoryOffset = memoryOffset; - return (*m_VulkanFunctions.vkBindBufferMemory2KHR)(m_hDevice, 1, &bindBufferMemoryInfo); - } - else -#endif // #if VMA_VULKAN_VERSION >= 1001000 || VMA_BIND_MEMORY2 - { - return VK_ERROR_EXTENSION_NOT_PRESENT; - } - } - else - { - return (*m_VulkanFunctions.vkBindBufferMemory)(m_hDevice, buffer, memory, memoryOffset); - } -} - -VkResult VmaAllocator_T::BindVulkanImage( - VkDeviceMemory memory, - VkDeviceSize memoryOffset, - VkImage image, - const void* pNext) -{ - if(pNext != VMA_NULL) - { -#if VMA_VULKAN_VERSION >= 1001000 || VMA_BIND_MEMORY2 - if((m_UseKhrBindMemory2 || m_VulkanApiVersion >= VK_MAKE_VERSION(1, 1, 0)) && - m_VulkanFunctions.vkBindImageMemory2KHR != VMA_NULL) - { - VkBindImageMemoryInfoKHR bindBufferMemoryInfo = { VK_STRUCTURE_TYPE_BIND_IMAGE_MEMORY_INFO_KHR }; - bindBufferMemoryInfo.pNext = pNext; - bindBufferMemoryInfo.image = image; - bindBufferMemoryInfo.memory = memory; - bindBufferMemoryInfo.memoryOffset = memoryOffset; - return (*m_VulkanFunctions.vkBindImageMemory2KHR)(m_hDevice, 1, &bindBufferMemoryInfo); - } - else -#endif // #if VMA_BIND_MEMORY2 - { - return VK_ERROR_EXTENSION_NOT_PRESENT; - } - } - else - { - return (*m_VulkanFunctions.vkBindImageMemory)(m_hDevice, image, memory, memoryOffset); - } -} - -VkResult VmaAllocator_T::Map(VmaAllocation hAllocation, void** ppData) -{ - switch(hAllocation->GetType()) - { - case VmaAllocation_T::ALLOCATION_TYPE_BLOCK: - { - VmaDeviceMemoryBlock* const pBlock = hAllocation->GetBlock(); - char *pBytes = VMA_NULL; - VkResult res = pBlock->Map(this, 1, (void**)&pBytes); - if(res == VK_SUCCESS) - { - *ppData = pBytes + (ptrdiff_t)hAllocation->GetOffset(); - hAllocation->BlockAllocMap(); - } - return res; - } - case VmaAllocation_T::ALLOCATION_TYPE_DEDICATED: - return hAllocation->DedicatedAllocMap(this, ppData); - default: - VMA_ASSERT(0); - return VK_ERROR_MEMORY_MAP_FAILED; - } -} - -void VmaAllocator_T::Unmap(VmaAllocation hAllocation) -{ - switch(hAllocation->GetType()) - { - case VmaAllocation_T::ALLOCATION_TYPE_BLOCK: - { - VmaDeviceMemoryBlock* const pBlock = hAllocation->GetBlock(); - hAllocation->BlockAllocUnmap(); - pBlock->Unmap(this, 1); - } - break; - case VmaAllocation_T::ALLOCATION_TYPE_DEDICATED: - hAllocation->DedicatedAllocUnmap(this); - break; - default: - VMA_ASSERT(0); - } -} - -VkResult VmaAllocator_T::BindBufferMemory( - VmaAllocation hAllocation, - VkDeviceSize allocationLocalOffset, - VkBuffer hBuffer, - const void* pNext) -{ - VkResult res = VK_SUCCESS; - switch(hAllocation->GetType()) - { - case VmaAllocation_T::ALLOCATION_TYPE_DEDICATED: - res = BindVulkanBuffer(hAllocation->GetMemory(), 
allocationLocalOffset, hBuffer, pNext); - break; - case VmaAllocation_T::ALLOCATION_TYPE_BLOCK: - { - VmaDeviceMemoryBlock* const pBlock = hAllocation->GetBlock(); - VMA_ASSERT(pBlock && "Binding buffer to allocation that doesn't belong to any block."); - res = pBlock->BindBufferMemory(this, hAllocation, allocationLocalOffset, hBuffer, pNext); - break; - } - default: - VMA_ASSERT(0); - } - return res; -} - -VkResult VmaAllocator_T::BindImageMemory( - VmaAllocation hAllocation, - VkDeviceSize allocationLocalOffset, - VkImage hImage, - const void* pNext) -{ - VkResult res = VK_SUCCESS; - switch(hAllocation->GetType()) - { - case VmaAllocation_T::ALLOCATION_TYPE_DEDICATED: - res = BindVulkanImage(hAllocation->GetMemory(), allocationLocalOffset, hImage, pNext); - break; - case VmaAllocation_T::ALLOCATION_TYPE_BLOCK: - { - VmaDeviceMemoryBlock* pBlock = hAllocation->GetBlock(); - VMA_ASSERT(pBlock && "Binding image to allocation that doesn't belong to any block."); - res = pBlock->BindImageMemory(this, hAllocation, allocationLocalOffset, hImage, pNext); - break; - } - default: - VMA_ASSERT(0); - } - return res; -} - -VkResult VmaAllocator_T::FlushOrInvalidateAllocation( - VmaAllocation hAllocation, - VkDeviceSize offset, VkDeviceSize size, - VMA_CACHE_OPERATION op) -{ - VkResult res = VK_SUCCESS; - - VkMappedMemoryRange memRange = {}; - if(GetFlushOrInvalidateRange(hAllocation, offset, size, memRange)) - { - switch(op) - { - case VMA_CACHE_FLUSH: - res = (*GetVulkanFunctions().vkFlushMappedMemoryRanges)(m_hDevice, 1, &memRange); - break; - case VMA_CACHE_INVALIDATE: - res = (*GetVulkanFunctions().vkInvalidateMappedMemoryRanges)(m_hDevice, 1, &memRange); - break; - default: - VMA_ASSERT(0); - } - } - // else: Just ignore this call. - return res; -} - -VkResult VmaAllocator_T::FlushOrInvalidateAllocations( - uint32_t allocationCount, - const VmaAllocation* allocations, - const VkDeviceSize* offsets, const VkDeviceSize* sizes, - VMA_CACHE_OPERATION op) -{ - typedef VmaStlAllocator RangeAllocator; - typedef VmaSmallVector RangeVector; - RangeVector ranges = RangeVector(RangeAllocator(GetAllocationCallbacks())); - - for(uint32_t allocIndex = 0; allocIndex < allocationCount; ++allocIndex) - { - const VmaAllocation alloc = allocations[allocIndex]; - const VkDeviceSize offset = offsets != VMA_NULL ? offsets[allocIndex] : 0; - const VkDeviceSize size = sizes != VMA_NULL ? sizes[allocIndex] : VK_WHOLE_SIZE; - VkMappedMemoryRange newRange; - if(GetFlushOrInvalidateRange(alloc, offset, size, newRange)) - { - ranges.push_back(newRange); - } - } - - VkResult res = VK_SUCCESS; - if(!ranges.empty()) - { - switch(op) - { - case VMA_CACHE_FLUSH: - res = (*GetVulkanFunctions().vkFlushMappedMemoryRanges)(m_hDevice, (uint32_t)ranges.size(), ranges.data()); - break; - case VMA_CACHE_INVALIDATE: - res = (*GetVulkanFunctions().vkInvalidateMappedMemoryRanges)(m_hDevice, (uint32_t)ranges.size(), ranges.data()); - break; - default: - VMA_ASSERT(0); - } - } - // else: Just ignore this call. 
- return res; -} - -void VmaAllocator_T::FreeDedicatedMemory(const VmaAllocation allocation) -{ - VMA_ASSERT(allocation && allocation->GetType() == VmaAllocation_T::ALLOCATION_TYPE_DEDICATED); - - const uint32_t memTypeIndex = allocation->GetMemoryTypeIndex(); - VmaPool parentPool = allocation->GetParentPool(); - if(parentPool == VK_NULL_HANDLE) - { - // Default pool - m_DedicatedAllocations[memTypeIndex].Unregister(allocation); - } - else - { - // Custom pool - parentPool->m_DedicatedAllocations.Unregister(allocation); - } - - VkDeviceMemory hMemory = allocation->GetMemory(); - - /* - There is no need to call this, because Vulkan spec allows to skip vkUnmapMemory - before vkFreeMemory. - - if(allocation->GetMappedData() != VMA_NULL) - { - (*m_VulkanFunctions.vkUnmapMemory)(m_hDevice, hMemory); - } - */ - - FreeVulkanMemory(memTypeIndex, allocation->GetSize(), hMemory); - - m_Budget.RemoveAllocation(MemoryTypeIndexToHeapIndex(allocation->GetMemoryTypeIndex()), allocation->GetSize()); - m_AllocationObjectAllocator.Free(allocation); - - VMA_DEBUG_LOG(" Freed DedicatedMemory MemoryTypeIndex=%u", memTypeIndex); -} - -uint32_t VmaAllocator_T::CalculateGpuDefragmentationMemoryTypeBits() const -{ - VkBufferCreateInfo dummyBufCreateInfo; - VmaFillGpuDefragmentationBufferCreateInfo(dummyBufCreateInfo); - - uint32_t memoryTypeBits = 0; - - // Create buffer. - VkBuffer buf = VK_NULL_HANDLE; - VkResult res = (*GetVulkanFunctions().vkCreateBuffer)( - m_hDevice, &dummyBufCreateInfo, GetAllocationCallbacks(), &buf); - if(res == VK_SUCCESS) - { - // Query for supported memory types. - VkMemoryRequirements memReq; - (*GetVulkanFunctions().vkGetBufferMemoryRequirements)(m_hDevice, buf, &memReq); - memoryTypeBits = memReq.memoryTypeBits; - - // Destroy buffer. - (*GetVulkanFunctions().vkDestroyBuffer)(m_hDevice, buf, GetAllocationCallbacks()); - } - - return memoryTypeBits; -} - -uint32_t VmaAllocator_T::CalculateGlobalMemoryTypeBits() const -{ - // Make sure memory information is already fetched. - VMA_ASSERT(GetMemoryTypeCount() > 0); - - uint32_t memoryTypeBits = UINT32_MAX; - - if(!m_UseAmdDeviceCoherentMemory) - { - // Exclude memory types that have VK_MEMORY_PROPERTY_DEVICE_COHERENT_BIT_AMD. 
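        // m_UseAmdDeviceCoherentMemory comes from VMA_ALLOCATOR_CREATE_AMD_DEVICE_COHERENT_MEMORY_BIT
        // at allocator creation, so unless the application opted in, DEVICE_COHERENT_AMD /
        // DEVICE_UNCACHED_AMD memory types are masked out of the global bits here and are
        // never returned by FindMemoryTypeIndex().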
- for(uint32_t memTypeIndex = 0; memTypeIndex < GetMemoryTypeCount(); ++memTypeIndex) - { - if((m_MemProps.memoryTypes[memTypeIndex].propertyFlags & VK_MEMORY_PROPERTY_DEVICE_COHERENT_BIT_AMD_COPY) != 0) - { - memoryTypeBits &= ~(1u << memTypeIndex); - } - } - } - - return memoryTypeBits; -} - -bool VmaAllocator_T::GetFlushOrInvalidateRange( - VmaAllocation allocation, - VkDeviceSize offset, VkDeviceSize size, - VkMappedMemoryRange& outRange) const -{ - const uint32_t memTypeIndex = allocation->GetMemoryTypeIndex(); - if(size > 0 && IsMemoryTypeNonCoherent(memTypeIndex)) - { - const VkDeviceSize nonCoherentAtomSize = m_PhysicalDeviceProperties.limits.nonCoherentAtomSize; - const VkDeviceSize allocationSize = allocation->GetSize(); - VMA_ASSERT(offset <= allocationSize); - - outRange.sType = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE; - outRange.pNext = VMA_NULL; - outRange.memory = allocation->GetMemory(); - - switch(allocation->GetType()) - { - case VmaAllocation_T::ALLOCATION_TYPE_DEDICATED: - outRange.offset = VmaAlignDown(offset, nonCoherentAtomSize); - if(size == VK_WHOLE_SIZE) - { - outRange.size = allocationSize - outRange.offset; - } - else - { - VMA_ASSERT(offset + size <= allocationSize); - outRange.size = VMA_MIN( - VmaAlignUp(size + (offset - outRange.offset), nonCoherentAtomSize), - allocationSize - outRange.offset); - } - break; - case VmaAllocation_T::ALLOCATION_TYPE_BLOCK: - { - // 1. Still within this allocation. - outRange.offset = VmaAlignDown(offset, nonCoherentAtomSize); - if(size == VK_WHOLE_SIZE) - { - size = allocationSize - offset; - } - else - { - VMA_ASSERT(offset + size <= allocationSize); - } - outRange.size = VmaAlignUp(size + (offset - outRange.offset), nonCoherentAtomSize); - - // 2. Adjust to whole block. - const VkDeviceSize allocationOffset = allocation->GetOffset(); - VMA_ASSERT(allocationOffset % nonCoherentAtomSize == 0); - const VkDeviceSize blockSize = allocation->GetBlock()->m_pMetadata->GetSize(); - outRange.offset += allocationOffset; - outRange.size = VMA_MIN(outRange.size, blockSize - outRange.offset); - - break; - } - default: - VMA_ASSERT(0); - } - return true; - } - return false; -} - -#if VMA_MEMORY_BUDGET -void VmaAllocator_T::UpdateVulkanBudget() -{ - VMA_ASSERT(m_UseExtMemoryBudget); - - VkPhysicalDeviceMemoryProperties2KHR memProps = { VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MEMORY_PROPERTIES_2_KHR }; - - VkPhysicalDeviceMemoryBudgetPropertiesEXT budgetProps = { VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MEMORY_BUDGET_PROPERTIES_EXT }; - VmaPnextChainPushFront(&memProps, &budgetProps); - - GetVulkanFunctions().vkGetPhysicalDeviceMemoryProperties2KHR(m_PhysicalDevice, &memProps); - - { - VmaMutexLockWrite lockWrite(m_Budget.m_BudgetMutex, m_UseMutex); - - for(uint32_t heapIndex = 0; heapIndex < GetMemoryHeapCount(); ++heapIndex) - { - m_Budget.m_VulkanUsage[heapIndex] = budgetProps.heapUsage[heapIndex]; - m_Budget.m_VulkanBudget[heapIndex] = budgetProps.heapBudget[heapIndex]; - m_Budget.m_BlockBytesAtBudgetFetch[heapIndex] = m_Budget.m_BlockBytes[heapIndex].load(); - - // Some bugged drivers return the budget incorrectly, e.g. 0 or much bigger than heap size. - if(m_Budget.m_VulkanBudget[heapIndex] == 0) - { - m_Budget.m_VulkanBudget[heapIndex] = m_MemProps.memoryHeaps[heapIndex].size * 8 / 10; // 80% heuristics. 
- } - else if(m_Budget.m_VulkanBudget[heapIndex] > m_MemProps.memoryHeaps[heapIndex].size) - { - m_Budget.m_VulkanBudget[heapIndex] = m_MemProps.memoryHeaps[heapIndex].size; - } - if(m_Budget.m_VulkanUsage[heapIndex] == 0 && m_Budget.m_BlockBytesAtBudgetFetch[heapIndex] > 0) - { - m_Budget.m_VulkanUsage[heapIndex] = m_Budget.m_BlockBytesAtBudgetFetch[heapIndex]; - } - } - m_Budget.m_OperationsSinceBudgetFetch = 0; - } -} -#endif // VMA_MEMORY_BUDGET - -void VmaAllocator_T::FillAllocation(const VmaAllocation hAllocation, uint8_t pattern) -{ - if(VMA_DEBUG_INITIALIZE_ALLOCATIONS && - hAllocation->IsMappingAllowed() && - (m_MemProps.memoryTypes[hAllocation->GetMemoryTypeIndex()].propertyFlags & VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT) != 0) - { - void* pData = VMA_NULL; - VkResult res = Map(hAllocation, &pData); - if(res == VK_SUCCESS) - { - memset(pData, (int)pattern, (size_t)hAllocation->GetSize()); - FlushOrInvalidateAllocation(hAllocation, 0, VK_WHOLE_SIZE, VMA_CACHE_FLUSH); - Unmap(hAllocation); - } - else - { - VMA_ASSERT(0 && "VMA_DEBUG_INITIALIZE_ALLOCATIONS is enabled, but couldn't map memory to fill allocation."); - } - } -} - -uint32_t VmaAllocator_T::GetGpuDefragmentationMemoryTypeBits() -{ - uint32_t memoryTypeBits = m_GpuDefragmentationMemoryTypeBits.load(); - if(memoryTypeBits == UINT32_MAX) - { - memoryTypeBits = CalculateGpuDefragmentationMemoryTypeBits(); - m_GpuDefragmentationMemoryTypeBits.store(memoryTypeBits); - } - return memoryTypeBits; -} - -#if VMA_STATS_STRING_ENABLED -void VmaAllocator_T::PrintDetailedMap(VmaJsonWriter& json) -{ - json.WriteString("DefaultPools"); - json.BeginObject(); - { - for (uint32_t memTypeIndex = 0; memTypeIndex < GetMemoryTypeCount(); ++memTypeIndex) - { - VmaBlockVector* pBlockVector = m_pBlockVectors[memTypeIndex]; - VmaDedicatedAllocationList& dedicatedAllocList = m_DedicatedAllocations[memTypeIndex]; - if (pBlockVector != VMA_NULL) - { - json.BeginString("Type "); - json.ContinueString(memTypeIndex); - json.EndString(); - json.BeginObject(); - { - json.WriteString("PreferredBlockSize"); - json.WriteNumber(pBlockVector->GetPreferredBlockSize()); - - json.WriteString("Blocks"); - pBlockVector->PrintDetailedMap(json); - - json.WriteString("DedicatedAllocations"); - dedicatedAllocList.BuildStatsString(json); - } - json.EndObject(); - } - } - } - json.EndObject(); - - json.WriteString("CustomPools"); - json.BeginObject(); - { - VmaMutexLockRead lock(m_PoolsMutex, m_UseMutex); - if (!m_Pools.IsEmpty()) - { - for (uint32_t memTypeIndex = 0; memTypeIndex < GetMemoryTypeCount(); ++memTypeIndex) - { - bool displayType = true; - size_t index = 0; - for (VmaPool pool = m_Pools.Front(); pool != VMA_NULL; pool = m_Pools.GetNext(pool)) - { - VmaBlockVector& blockVector = pool->m_BlockVector; - if (blockVector.GetMemoryTypeIndex() == memTypeIndex) - { - if (displayType) - { - json.BeginString("Type "); - json.ContinueString(memTypeIndex); - json.EndString(); - json.BeginArray(); - displayType = false; - } - - json.BeginObject(); - { - json.WriteString("Name"); - json.BeginString(); - json.ContinueString_Size(index++); - if (pool->GetName()) - { - json.ContinueString(" - "); - json.ContinueString(pool->GetName()); - } - json.EndString(); - - json.WriteString("PreferredBlockSize"); - json.WriteNumber(blockVector.GetPreferredBlockSize()); - - json.WriteString("Blocks"); - blockVector.PrintDetailedMap(json); - - json.WriteString("DedicatedAllocations"); - pool->m_DedicatedAllocations.BuildStatsString(json); - } - json.EndObject(); - } - } - - if (!displayType) 
- json.EndArray(); - } - } - } - json.EndObject(); -} -#endif // VMA_STATS_STRING_ENABLED -#endif // _VMA_ALLOCATOR_T_FUNCTIONS - - -#ifndef _VMA_PUBLIC_INTERFACE -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCreateAllocator( - const VmaAllocatorCreateInfo* pCreateInfo, - VmaAllocator* pAllocator) -{ - VMA_ASSERT(pCreateInfo && pAllocator); - VMA_ASSERT(pCreateInfo->vulkanApiVersion == 0 || - (VK_VERSION_MAJOR(pCreateInfo->vulkanApiVersion) == 1 && VK_VERSION_MINOR(pCreateInfo->vulkanApiVersion) <= 3)); - VMA_DEBUG_LOG("vmaCreateAllocator"); - *pAllocator = vma_new(pCreateInfo->pAllocationCallbacks, VmaAllocator_T)(pCreateInfo); - VkResult result = (*pAllocator)->Init(pCreateInfo); - if(result < 0) - { - vma_delete(pCreateInfo->pAllocationCallbacks, *pAllocator); - *pAllocator = VK_NULL_HANDLE; - } - return result; -} - -VMA_CALL_PRE void VMA_CALL_POST vmaDestroyAllocator( - VmaAllocator allocator) -{ - if(allocator != VK_NULL_HANDLE) - { - VMA_DEBUG_LOG("vmaDestroyAllocator"); - VkAllocationCallbacks allocationCallbacks = allocator->m_AllocationCallbacks; // Have to copy the callbacks when destroying. - vma_delete(&allocationCallbacks, allocator); - } -} - -VMA_CALL_PRE void VMA_CALL_POST vmaGetAllocatorInfo(VmaAllocator allocator, VmaAllocatorInfo* pAllocatorInfo) -{ - VMA_ASSERT(allocator && pAllocatorInfo); - pAllocatorInfo->instance = allocator->m_hInstance; - pAllocatorInfo->physicalDevice = allocator->GetPhysicalDevice(); - pAllocatorInfo->device = allocator->m_hDevice; -} - -VMA_CALL_PRE void VMA_CALL_POST vmaGetPhysicalDeviceProperties( - VmaAllocator allocator, - const VkPhysicalDeviceProperties **ppPhysicalDeviceProperties) -{ - VMA_ASSERT(allocator && ppPhysicalDeviceProperties); - *ppPhysicalDeviceProperties = &allocator->m_PhysicalDeviceProperties; -} - -VMA_CALL_PRE void VMA_CALL_POST vmaGetMemoryProperties( - VmaAllocator allocator, - const VkPhysicalDeviceMemoryProperties** ppPhysicalDeviceMemoryProperties) -{ - VMA_ASSERT(allocator && ppPhysicalDeviceMemoryProperties); - *ppPhysicalDeviceMemoryProperties = &allocator->m_MemProps; -} - -VMA_CALL_PRE void VMA_CALL_POST vmaGetMemoryTypeProperties( - VmaAllocator allocator, - uint32_t memoryTypeIndex, - VkMemoryPropertyFlags* pFlags) -{ - VMA_ASSERT(allocator && pFlags); - VMA_ASSERT(memoryTypeIndex < allocator->GetMemoryTypeCount()); - *pFlags = allocator->m_MemProps.memoryTypes[memoryTypeIndex].propertyFlags; -} - -VMA_CALL_PRE void VMA_CALL_POST vmaSetCurrentFrameIndex( - VmaAllocator allocator, - uint32_t frameIndex) -{ - VMA_ASSERT(allocator); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - allocator->SetCurrentFrameIndex(frameIndex); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaCalculateStatistics( - VmaAllocator allocator, - VmaTotalStatistics* pStats) -{ - VMA_ASSERT(allocator && pStats); - VMA_DEBUG_GLOBAL_MUTEX_LOCK - allocator->CalculateStatistics(pStats); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaGetHeapBudgets( - VmaAllocator allocator, - VmaBudget* pBudgets) -{ - VMA_ASSERT(allocator && pBudgets); - VMA_DEBUG_GLOBAL_MUTEX_LOCK - allocator->GetHeapBudgets(pBudgets, 0, allocator->GetMemoryHeapCount()); -} - -#if VMA_STATS_STRING_ENABLED - -VMA_CALL_PRE void VMA_CALL_POST vmaBuildStatsString( - VmaAllocator allocator, - char** ppStatsString, - VkBool32 detailedMap) -{ - VMA_ASSERT(allocator && ppStatsString); - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - VmaStringBuilder sb(allocator->GetAllocationCallbacks()); - { - VmaBudget budgets[VK_MAX_MEMORY_HEAPS]; - allocator->GetHeapBudgets(budgets, 0, allocator->GetMemoryHeapCount()); - - 
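A typical call site for the `vmaBuildStatsString()` implementation being removed here, sketched against the VMA 3.x public API; the resulting JSON is what tools such as VMA's VmaDumpVis consume:
```
// Illustrative only -- not part of this patch.
#include <cstdio>

void DumpVmaStats(VmaAllocator allocator, const char* path)
{
    char* statsString = nullptr;
    vmaBuildStatsString(allocator, &statsString, VK_TRUE); // VK_TRUE adds the "DefaultPools"/"CustomPools" detail
    if (std::FILE* f = std::fopen(path, "w"))
    {
        std::fputs(statsString, f);
        std::fclose(f);
    }
    vmaFreeStatsString(allocator, statsString);
}
```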
VmaTotalStatistics stats; - allocator->CalculateStatistics(&stats); - - VmaJsonWriter json(allocator->GetAllocationCallbacks(), sb); - json.BeginObject(); - { - json.WriteString("General"); - json.BeginObject(); - { - const VkPhysicalDeviceProperties& deviceProperties = allocator->m_PhysicalDeviceProperties; - const VkPhysicalDeviceMemoryProperties& memoryProperties = allocator->m_MemProps; - - json.WriteString("API"); - json.WriteString("Vulkan"); - - json.WriteString("apiVersion"); - json.BeginString(); - json.ContinueString(VK_API_VERSION_MAJOR(deviceProperties.apiVersion)); - json.ContinueString("."); - json.ContinueString(VK_API_VERSION_MINOR(deviceProperties.apiVersion)); - json.ContinueString("."); - json.ContinueString(VK_API_VERSION_PATCH(deviceProperties.apiVersion)); - json.EndString(); - - json.WriteString("GPU"); - json.WriteString(deviceProperties.deviceName); - json.WriteString("deviceType"); - json.WriteNumber(static_cast(deviceProperties.deviceType)); - - json.WriteString("maxMemoryAllocationCount"); - json.WriteNumber(deviceProperties.limits.maxMemoryAllocationCount); - json.WriteString("bufferImageGranularity"); - json.WriteNumber(deviceProperties.limits.bufferImageGranularity); - json.WriteString("nonCoherentAtomSize"); - json.WriteNumber(deviceProperties.limits.nonCoherentAtomSize); - - json.WriteString("memoryHeapCount"); - json.WriteNumber(memoryProperties.memoryHeapCount); - json.WriteString("memoryTypeCount"); - json.WriteNumber(memoryProperties.memoryTypeCount); - } - json.EndObject(); - } - { - json.WriteString("Total"); - VmaPrintDetailedStatistics(json, stats.total); - } - { - json.WriteString("MemoryInfo"); - json.BeginObject(); - { - for (uint32_t heapIndex = 0; heapIndex < allocator->GetMemoryHeapCount(); ++heapIndex) - { - json.BeginString("Heap "); - json.ContinueString(heapIndex); - json.EndString(); - json.BeginObject(); - { - const VkMemoryHeap& heapInfo = allocator->m_MemProps.memoryHeaps[heapIndex]; - json.WriteString("Flags"); - json.BeginArray(true); - { - if (heapInfo.flags & VK_MEMORY_HEAP_DEVICE_LOCAL_BIT) - json.WriteString("DEVICE_LOCAL"); - #if VMA_VULKAN_VERSION >= 1001000 - if (heapInfo.flags & VK_MEMORY_HEAP_MULTI_INSTANCE_BIT) - json.WriteString("MULTI_INSTANCE"); - #endif - - VkMemoryHeapFlags flags = heapInfo.flags & - ~(VK_MEMORY_HEAP_DEVICE_LOCAL_BIT - #if VMA_VULKAN_VERSION >= 1001000 - | VK_MEMORY_HEAP_MULTI_INSTANCE_BIT - #endif - ); - if (flags != 0) - json.WriteNumber(flags); - } - json.EndArray(); - - json.WriteString("Size"); - json.WriteNumber(heapInfo.size); - - json.WriteString("Budget"); - json.BeginObject(); - { - json.WriteString("BudgetBytes"); - json.WriteNumber(budgets[heapIndex].budget); - json.WriteString("UsageBytes"); - json.WriteNumber(budgets[heapIndex].usage); - } - json.EndObject(); - - json.WriteString("Stats"); - VmaPrintDetailedStatistics(json, stats.memoryHeap[heapIndex]); - - json.WriteString("MemoryPools"); - json.BeginObject(); - { - for (uint32_t typeIndex = 0; typeIndex < allocator->GetMemoryTypeCount(); ++typeIndex) - { - if (allocator->MemoryTypeIndexToHeapIndex(typeIndex) == heapIndex) - { - json.BeginString("Type "); - json.ContinueString(typeIndex); - json.EndString(); - json.BeginObject(); - { - json.WriteString("Flags"); - json.BeginArray(true); - { - VkMemoryPropertyFlags flags = allocator->m_MemProps.memoryTypes[typeIndex].propertyFlags; - if (flags & VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT) - json.WriteString("DEVICE_LOCAL"); - if (flags & VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT) - 
json.WriteString("HOST_VISIBLE"); - if (flags & VK_MEMORY_PROPERTY_HOST_COHERENT_BIT) - json.WriteString("HOST_COHERENT"); - if (flags & VK_MEMORY_PROPERTY_HOST_CACHED_BIT) - json.WriteString("HOST_CACHED"); - if (flags & VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT) - json.WriteString("LAZILY_ALLOCATED"); - #if VMA_VULKAN_VERSION >= 1001000 - if (flags & VK_MEMORY_PROPERTY_PROTECTED_BIT) - json.WriteString("PROTECTED"); - #endif - #if VK_AMD_device_coherent_memory - if (flags & VK_MEMORY_PROPERTY_DEVICE_COHERENT_BIT_AMD_COPY) - json.WriteString("DEVICE_COHERENT_AMD"); - if (flags & VK_MEMORY_PROPERTY_DEVICE_UNCACHED_BIT_AMD_COPY) - json.WriteString("DEVICE_UNCACHED_AMD"); - #endif - - flags &= ~(VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT - #if VMA_VULKAN_VERSION >= 1001000 - | VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT - #endif - #if VK_AMD_device_coherent_memory - | VK_MEMORY_PROPERTY_DEVICE_COHERENT_BIT_AMD_COPY - | VK_MEMORY_PROPERTY_DEVICE_UNCACHED_BIT_AMD_COPY - #endif - | VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT - | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT - | VK_MEMORY_PROPERTY_HOST_CACHED_BIT); - if (flags != 0) - json.WriteNumber(flags); - } - json.EndArray(); - - json.WriteString("Stats"); - VmaPrintDetailedStatistics(json, stats.memoryType[typeIndex]); - } - json.EndObject(); - } - } - - } - json.EndObject(); - } - json.EndObject(); - } - } - json.EndObject(); - } - - if (detailedMap == VK_TRUE) - allocator->PrintDetailedMap(json); - - json.EndObject(); - } - - *ppStatsString = VmaCreateStringCopy(allocator->GetAllocationCallbacks(), sb.GetData(), sb.GetLength()); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaFreeStatsString( - VmaAllocator allocator, - char* pStatsString) -{ - if(pStatsString != VMA_NULL) - { - VMA_ASSERT(allocator); - VmaFreeString(allocator->GetAllocationCallbacks(), pStatsString); - } -} - -#endif // VMA_STATS_STRING_ENABLED - -/* -This function is not protected by any mutex because it just reads immutable data. 
-*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaFindMemoryTypeIndex( - VmaAllocator allocator, - uint32_t memoryTypeBits, - const VmaAllocationCreateInfo* pAllocationCreateInfo, - uint32_t* pMemoryTypeIndex) -{ - VMA_ASSERT(allocator != VK_NULL_HANDLE); - VMA_ASSERT(pAllocationCreateInfo != VMA_NULL); - VMA_ASSERT(pMemoryTypeIndex != VMA_NULL); - - return allocator->FindMemoryTypeIndex(memoryTypeBits, pAllocationCreateInfo, UINT32_MAX, pMemoryTypeIndex); -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaFindMemoryTypeIndexForBufferInfo( - VmaAllocator allocator, - const VkBufferCreateInfo* pBufferCreateInfo, - const VmaAllocationCreateInfo* pAllocationCreateInfo, - uint32_t* pMemoryTypeIndex) -{ - VMA_ASSERT(allocator != VK_NULL_HANDLE); - VMA_ASSERT(pBufferCreateInfo != VMA_NULL); - VMA_ASSERT(pAllocationCreateInfo != VMA_NULL); - VMA_ASSERT(pMemoryTypeIndex != VMA_NULL); - - const VkDevice hDev = allocator->m_hDevice; - const VmaVulkanFunctions* funcs = &allocator->GetVulkanFunctions(); - VkResult res; - -#if VMA_VULKAN_VERSION >= 1003000 - if(funcs->vkGetDeviceBufferMemoryRequirements) - { - // Can query straight from VkBufferCreateInfo :) - VkDeviceBufferMemoryRequirements devBufMemReq = {VK_STRUCTURE_TYPE_DEVICE_BUFFER_MEMORY_REQUIREMENTS}; - devBufMemReq.pCreateInfo = pBufferCreateInfo; - - VkMemoryRequirements2 memReq = {VK_STRUCTURE_TYPE_MEMORY_REQUIREMENTS_2}; - (*funcs->vkGetDeviceBufferMemoryRequirements)(hDev, &devBufMemReq, &memReq); - - res = allocator->FindMemoryTypeIndex( - memReq.memoryRequirements.memoryTypeBits, pAllocationCreateInfo, pBufferCreateInfo->usage, pMemoryTypeIndex); - } - else -#endif // #if VMA_VULKAN_VERSION >= 1003000 - { - // Must create a dummy buffer to query :( - VkBuffer hBuffer = VK_NULL_HANDLE; - res = funcs->vkCreateBuffer( - hDev, pBufferCreateInfo, allocator->GetAllocationCallbacks(), &hBuffer); - if(res == VK_SUCCESS) - { - VkMemoryRequirements memReq = {}; - funcs->vkGetBufferMemoryRequirements(hDev, hBuffer, &memReq); - - res = allocator->FindMemoryTypeIndex( - memReq.memoryTypeBits, pAllocationCreateInfo, pBufferCreateInfo->usage, pMemoryTypeIndex); - - funcs->vkDestroyBuffer( - hDev, hBuffer, allocator->GetAllocationCallbacks()); - } - } - return res; -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaFindMemoryTypeIndexForImageInfo( - VmaAllocator allocator, - const VkImageCreateInfo* pImageCreateInfo, - const VmaAllocationCreateInfo* pAllocationCreateInfo, - uint32_t* pMemoryTypeIndex) -{ - VMA_ASSERT(allocator != VK_NULL_HANDLE); - VMA_ASSERT(pImageCreateInfo != VMA_NULL); - VMA_ASSERT(pAllocationCreateInfo != VMA_NULL); - VMA_ASSERT(pMemoryTypeIndex != VMA_NULL); - - const VkDevice hDev = allocator->m_hDevice; - const VmaVulkanFunctions* funcs = &allocator->GetVulkanFunctions(); - VkResult res; - -#if VMA_VULKAN_VERSION >= 1003000 - if(funcs->vkGetDeviceImageMemoryRequirements) - { - // Can query straight from VkImageCreateInfo :) - VkDeviceImageMemoryRequirements devImgMemReq = {VK_STRUCTURE_TYPE_DEVICE_IMAGE_MEMORY_REQUIREMENTS}; - devImgMemReq.pCreateInfo = pImageCreateInfo; - VMA_ASSERT(pImageCreateInfo->tiling != VK_IMAGE_TILING_DRM_FORMAT_MODIFIER_EXT_COPY && (pImageCreateInfo->flags & VK_IMAGE_CREATE_DISJOINT_BIT_COPY) == 0 && - "Cannot use this VkImageCreateInfo with vmaFindMemoryTypeIndexForImageInfo as I don't know what to pass as VkDeviceImageMemoryRequirements::planeAspect."); - - VkMemoryRequirements2 memReq = {VK_STRUCTURE_TYPE_MEMORY_REQUIREMENTS_2}; - (*funcs->vkGetDeviceImageMemoryRequirements)(hDev, &devImgMemReq, &memReq); - - res = 
allocator->FindMemoryTypeIndex( - memReq.memoryRequirements.memoryTypeBits, pAllocationCreateInfo, pImageCreateInfo->usage, pMemoryTypeIndex); - } - else -#endif // #if VMA_VULKAN_VERSION >= 1003000 - { - // Must create a dummy image to query :( - VkImage hImage = VK_NULL_HANDLE; - res = funcs->vkCreateImage( - hDev, pImageCreateInfo, allocator->GetAllocationCallbacks(), &hImage); - if(res == VK_SUCCESS) - { - VkMemoryRequirements memReq = {}; - funcs->vkGetImageMemoryRequirements(hDev, hImage, &memReq); - - res = allocator->FindMemoryTypeIndex( - memReq.memoryTypeBits, pAllocationCreateInfo, pImageCreateInfo->usage, pMemoryTypeIndex); - - funcs->vkDestroyImage( - hDev, hImage, allocator->GetAllocationCallbacks()); - } - } - return res; -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCreatePool( - VmaAllocator allocator, - const VmaPoolCreateInfo* pCreateInfo, - VmaPool* pPool) -{ - VMA_ASSERT(allocator && pCreateInfo && pPool); - - VMA_DEBUG_LOG("vmaCreatePool"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - return allocator->CreatePool(pCreateInfo, pPool); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaDestroyPool( - VmaAllocator allocator, - VmaPool pool) -{ - VMA_ASSERT(allocator); - - if(pool == VK_NULL_HANDLE) - { - return; - } - - VMA_DEBUG_LOG("vmaDestroyPool"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - allocator->DestroyPool(pool); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaGetPoolStatistics( - VmaAllocator allocator, - VmaPool pool, - VmaStatistics* pPoolStats) -{ - VMA_ASSERT(allocator && pool && pPoolStats); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - allocator->GetPoolStatistics(pool, pPoolStats); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaCalculatePoolStatistics( - VmaAllocator allocator, - VmaPool pool, - VmaDetailedStatistics* pPoolStats) -{ - VMA_ASSERT(allocator && pool && pPoolStats); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - allocator->CalculatePoolStatistics(pool, pPoolStats); -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCheckPoolCorruption(VmaAllocator allocator, VmaPool pool) -{ - VMA_ASSERT(allocator && pool); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - VMA_DEBUG_LOG("vmaCheckPoolCorruption"); - - return allocator->CheckPoolCorruption(pool); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaGetPoolName( - VmaAllocator allocator, - VmaPool pool, - const char** ppName) -{ - VMA_ASSERT(allocator && pool && ppName); - - VMA_DEBUG_LOG("vmaGetPoolName"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - *ppName = pool->GetName(); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaSetPoolName( - VmaAllocator allocator, - VmaPool pool, - const char* pName) -{ - VMA_ASSERT(allocator && pool); - - VMA_DEBUG_LOG("vmaSetPoolName"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - pool->SetName(pName); -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaAllocateMemory( - VmaAllocator allocator, - const VkMemoryRequirements* pVkMemoryRequirements, - const VmaAllocationCreateInfo* pCreateInfo, - VmaAllocation* pAllocation, - VmaAllocationInfo* pAllocationInfo) -{ - VMA_ASSERT(allocator && pVkMemoryRequirements && pCreateInfo && pAllocation); - - VMA_DEBUG_LOG("vmaAllocateMemory"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - VkResult result = allocator->AllocateMemory( - *pVkMemoryRequirements, - false, // requiresDedicatedAllocation - false, // prefersDedicatedAllocation - VK_NULL_HANDLE, // dedicatedBuffer - VK_NULL_HANDLE, // dedicatedImage - UINT32_MAX, // dedicatedBufferImageUsage - *pCreateInfo, - VMA_SUBALLOCATION_TYPE_UNKNOWN, - 1, // allocationCount - pAllocation); - - if(pAllocationInfo != VMA_NULL && result == VK_SUCCESS) - { - allocator->GetAllocationInfo(*pAllocation, 
pAllocationInfo); - } - - return result; -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaAllocateMemoryPages( - VmaAllocator allocator, - const VkMemoryRequirements* pVkMemoryRequirements, - const VmaAllocationCreateInfo* pCreateInfo, - size_t allocationCount, - VmaAllocation* pAllocations, - VmaAllocationInfo* pAllocationInfo) -{ - if(allocationCount == 0) - { - return VK_SUCCESS; - } - - VMA_ASSERT(allocator && pVkMemoryRequirements && pCreateInfo && pAllocations); - - VMA_DEBUG_LOG("vmaAllocateMemoryPages"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - VkResult result = allocator->AllocateMemory( - *pVkMemoryRequirements, - false, // requiresDedicatedAllocation - false, // prefersDedicatedAllocation - VK_NULL_HANDLE, // dedicatedBuffer - VK_NULL_HANDLE, // dedicatedImage - UINT32_MAX, // dedicatedBufferImageUsage - *pCreateInfo, - VMA_SUBALLOCATION_TYPE_UNKNOWN, - allocationCount, - pAllocations); - - if(pAllocationInfo != VMA_NULL && result == VK_SUCCESS) - { - for(size_t i = 0; i < allocationCount; ++i) - { - allocator->GetAllocationInfo(pAllocations[i], pAllocationInfo + i); - } - } - - return result; -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaAllocateMemoryForBuffer( - VmaAllocator allocator, - VkBuffer buffer, - const VmaAllocationCreateInfo* pCreateInfo, - VmaAllocation* pAllocation, - VmaAllocationInfo* pAllocationInfo) -{ - VMA_ASSERT(allocator && buffer != VK_NULL_HANDLE && pCreateInfo && pAllocation); - - VMA_DEBUG_LOG("vmaAllocateMemoryForBuffer"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - VkMemoryRequirements vkMemReq = {}; - bool requiresDedicatedAllocation = false; - bool prefersDedicatedAllocation = false; - allocator->GetBufferMemoryRequirements(buffer, vkMemReq, - requiresDedicatedAllocation, - prefersDedicatedAllocation); - - VkResult result = allocator->AllocateMemory( - vkMemReq, - requiresDedicatedAllocation, - prefersDedicatedAllocation, - buffer, // dedicatedBuffer - VK_NULL_HANDLE, // dedicatedImage - UINT32_MAX, // dedicatedBufferImageUsage - *pCreateInfo, - VMA_SUBALLOCATION_TYPE_BUFFER, - 1, // allocationCount - pAllocation); - - if(pAllocationInfo && result == VK_SUCCESS) - { - allocator->GetAllocationInfo(*pAllocation, pAllocationInfo); - } - - return result; -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaAllocateMemoryForImage( - VmaAllocator allocator, - VkImage image, - const VmaAllocationCreateInfo* pCreateInfo, - VmaAllocation* pAllocation, - VmaAllocationInfo* pAllocationInfo) -{ - VMA_ASSERT(allocator && image != VK_NULL_HANDLE && pCreateInfo && pAllocation); - - VMA_DEBUG_LOG("vmaAllocateMemoryForImage"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - VkMemoryRequirements vkMemReq = {}; - bool requiresDedicatedAllocation = false; - bool prefersDedicatedAllocation = false; - allocator->GetImageMemoryRequirements(image, vkMemReq, - requiresDedicatedAllocation, prefersDedicatedAllocation); - - VkResult result = allocator->AllocateMemory( - vkMemReq, - requiresDedicatedAllocation, - prefersDedicatedAllocation, - VK_NULL_HANDLE, // dedicatedBuffer - image, // dedicatedImage - UINT32_MAX, // dedicatedBufferImageUsage - *pCreateInfo, - VMA_SUBALLOCATION_TYPE_IMAGE_UNKNOWN, - 1, // allocationCount - pAllocation); - - if(pAllocationInfo && result == VK_SUCCESS) - { - allocator->GetAllocationInfo(*pAllocation, pAllocationInfo); - } - - return result; -} - -VMA_CALL_PRE void VMA_CALL_POST vmaFreeMemory( - VmaAllocator allocator, - VmaAllocation allocation) -{ - VMA_ASSERT(allocator); - - if(allocation == VK_NULL_HANDLE) - { - return; - } - - VMA_DEBUG_LOG("vmaFreeMemory"); - - 
VMA_DEBUG_GLOBAL_MUTEX_LOCK - - allocator->FreeMemory( - 1, // allocationCount - &allocation); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaFreeMemoryPages( - VmaAllocator allocator, - size_t allocationCount, - const VmaAllocation* pAllocations) -{ - if(allocationCount == 0) - { - return; - } - - VMA_ASSERT(allocator); - - VMA_DEBUG_LOG("vmaFreeMemoryPages"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - allocator->FreeMemory(allocationCount, pAllocations); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaGetAllocationInfo( - VmaAllocator allocator, - VmaAllocation allocation, - VmaAllocationInfo* pAllocationInfo) -{ - VMA_ASSERT(allocator && allocation && pAllocationInfo); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - allocator->GetAllocationInfo(allocation, pAllocationInfo); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaSetAllocationUserData( - VmaAllocator allocator, - VmaAllocation allocation, - void* pUserData) -{ - VMA_ASSERT(allocator && allocation); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - allocation->SetUserData(allocator, pUserData); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaSetAllocationName( - VmaAllocator VMA_NOT_NULL allocator, - VmaAllocation VMA_NOT_NULL allocation, - const char* VMA_NULLABLE pName) -{ - allocation->SetName(allocator, pName); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaGetAllocationMemoryProperties( - VmaAllocator VMA_NOT_NULL allocator, - VmaAllocation VMA_NOT_NULL allocation, - VkMemoryPropertyFlags* VMA_NOT_NULL pFlags) -{ - VMA_ASSERT(allocator && allocation && pFlags); - const uint32_t memTypeIndex = allocation->GetMemoryTypeIndex(); - *pFlags = allocator->m_MemProps.memoryTypes[memTypeIndex].propertyFlags; -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaMapMemory( - VmaAllocator allocator, - VmaAllocation allocation, - void** ppData) -{ - VMA_ASSERT(allocator && allocation && ppData); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - return allocator->Map(allocation, ppData); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaUnmapMemory( - VmaAllocator allocator, - VmaAllocation allocation) -{ - VMA_ASSERT(allocator && allocation); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - allocator->Unmap(allocation); -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaFlushAllocation( - VmaAllocator allocator, - VmaAllocation allocation, - VkDeviceSize offset, - VkDeviceSize size) -{ - VMA_ASSERT(allocator && allocation); - - VMA_DEBUG_LOG("vmaFlushAllocation"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - const VkResult res = allocator->FlushOrInvalidateAllocation(allocation, offset, size, VMA_CACHE_FLUSH); - - return res; -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaInvalidateAllocation( - VmaAllocator allocator, - VmaAllocation allocation, - VkDeviceSize offset, - VkDeviceSize size) -{ - VMA_ASSERT(allocator && allocation); - - VMA_DEBUG_LOG("vmaInvalidateAllocation"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - const VkResult res = allocator->FlushOrInvalidateAllocation(allocation, offset, size, VMA_CACHE_INVALIDATE); - - return res; -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaFlushAllocations( - VmaAllocator allocator, - uint32_t allocationCount, - const VmaAllocation* allocations, - const VkDeviceSize* offsets, - const VkDeviceSize* sizes) -{ - VMA_ASSERT(allocator); - - if(allocationCount == 0) - { - return VK_SUCCESS; - } - - VMA_ASSERT(allocations); - - VMA_DEBUG_LOG("vmaFlushAllocations"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - const VkResult res = allocator->FlushOrInvalidateAllocations(allocationCount, allocations, offsets, sizes, VMA_CACHE_FLUSH); - - return res; -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaInvalidateAllocations( - VmaAllocator allocator, 
- uint32_t allocationCount, - const VmaAllocation* allocations, - const VkDeviceSize* offsets, - const VkDeviceSize* sizes) -{ - VMA_ASSERT(allocator); - - if(allocationCount == 0) - { - return VK_SUCCESS; - } - - VMA_ASSERT(allocations); - - VMA_DEBUG_LOG("vmaInvalidateAllocations"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - const VkResult res = allocator->FlushOrInvalidateAllocations(allocationCount, allocations, offsets, sizes, VMA_CACHE_INVALIDATE); - - return res; -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCheckCorruption( - VmaAllocator allocator, - uint32_t memoryTypeBits) -{ - VMA_ASSERT(allocator); - - VMA_DEBUG_LOG("vmaCheckCorruption"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - return allocator->CheckCorruption(memoryTypeBits); -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaBeginDefragmentation( - VmaAllocator allocator, - const VmaDefragmentationInfo* pInfo, - VmaDefragmentationContext* pContext) -{ - VMA_ASSERT(allocator && pInfo && pContext); - - VMA_DEBUG_LOG("vmaBeginDefragmentation"); - - if (pInfo->pool != VMA_NULL) - { - // Check if run on supported algorithms - if (pInfo->pool->m_BlockVector.GetAlgorithm() & VMA_POOL_CREATE_LINEAR_ALGORITHM_BIT) - return VK_ERROR_FEATURE_NOT_PRESENT; - } - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - *pContext = vma_new(allocator, VmaDefragmentationContext_T)(allocator, *pInfo); - return VK_SUCCESS; -} - -VMA_CALL_PRE void VMA_CALL_POST vmaEndDefragmentation( - VmaAllocator allocator, - VmaDefragmentationContext context, - VmaDefragmentationStats* pStats) -{ - VMA_ASSERT(allocator && context); - - VMA_DEBUG_LOG("vmaEndDefragmentation"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - if (pStats) - context->GetStats(*pStats); - vma_delete(allocator, context); -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaBeginDefragmentationPass( - VmaAllocator VMA_NOT_NULL allocator, - VmaDefragmentationContext VMA_NOT_NULL context, - VmaDefragmentationPassMoveInfo* VMA_NOT_NULL pPassInfo) -{ - VMA_ASSERT(context && pPassInfo); - - VMA_DEBUG_LOG("vmaBeginDefragmentationPass"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - return context->DefragmentPassBegin(*pPassInfo); -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaEndDefragmentationPass( - VmaAllocator VMA_NOT_NULL allocator, - VmaDefragmentationContext VMA_NOT_NULL context, - VmaDefragmentationPassMoveInfo* VMA_NOT_NULL pPassInfo) -{ - VMA_ASSERT(context && pPassInfo); - - VMA_DEBUG_LOG("vmaEndDefragmentationPass"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - return context->DefragmentPassEnd(*pPassInfo); -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaBindBufferMemory( - VmaAllocator allocator, - VmaAllocation allocation, - VkBuffer buffer) -{ - VMA_ASSERT(allocator && allocation && buffer); - - VMA_DEBUG_LOG("vmaBindBufferMemory"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - return allocator->BindBufferMemory(allocation, 0, buffer, VMA_NULL); -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaBindBufferMemory2( - VmaAllocator allocator, - VmaAllocation allocation, - VkDeviceSize allocationLocalOffset, - VkBuffer buffer, - const void* pNext) -{ - VMA_ASSERT(allocator && allocation && buffer); - - VMA_DEBUG_LOG("vmaBindBufferMemory2"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - return allocator->BindBufferMemory(allocation, allocationLocalOffset, buffer, pNext); -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaBindImageMemory( - VmaAllocator allocator, - VmaAllocation allocation, - VkImage image) -{ - VMA_ASSERT(allocator && allocation && image); - - VMA_DEBUG_LOG("vmaBindImageMemory"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - return allocator->BindImageMemory(allocation, 0, image, 
VMA_NULL); -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaBindImageMemory2( - VmaAllocator allocator, - VmaAllocation allocation, - VkDeviceSize allocationLocalOffset, - VkImage image, - const void* pNext) -{ - VMA_ASSERT(allocator && allocation && image); - - VMA_DEBUG_LOG("vmaBindImageMemory2"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - return allocator->BindImageMemory(allocation, allocationLocalOffset, image, pNext); -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCreateBuffer( - VmaAllocator allocator, - const VkBufferCreateInfo* pBufferCreateInfo, - const VmaAllocationCreateInfo* pAllocationCreateInfo, - VkBuffer* pBuffer, - VmaAllocation* pAllocation, - VmaAllocationInfo* pAllocationInfo) -{ - VMA_ASSERT(allocator && pBufferCreateInfo && pAllocationCreateInfo && pBuffer && pAllocation); - - if(pBufferCreateInfo->size == 0) - { - return VK_ERROR_INITIALIZATION_FAILED; - } - if((pBufferCreateInfo->usage & VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT_COPY) != 0 && - !allocator->m_UseKhrBufferDeviceAddress) - { - VMA_ASSERT(0 && "Creating a buffer with VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT is not valid if VMA_ALLOCATOR_CREATE_BUFFER_DEVICE_ADDRESS_BIT was not used."); - return VK_ERROR_INITIALIZATION_FAILED; - } - - VMA_DEBUG_LOG("vmaCreateBuffer"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - *pBuffer = VK_NULL_HANDLE; - *pAllocation = VK_NULL_HANDLE; - - // 1. Create VkBuffer. - VkResult res = (*allocator->GetVulkanFunctions().vkCreateBuffer)( - allocator->m_hDevice, - pBufferCreateInfo, - allocator->GetAllocationCallbacks(), - pBuffer); - if(res >= 0) - { - // 2. vkGetBufferMemoryRequirements. - VkMemoryRequirements vkMemReq = {}; - bool requiresDedicatedAllocation = false; - bool prefersDedicatedAllocation = false; - allocator->GetBufferMemoryRequirements(*pBuffer, vkMemReq, - requiresDedicatedAllocation, prefersDedicatedAllocation); - - // 3. Allocate memory using allocator. - res = allocator->AllocateMemory( - vkMemReq, - requiresDedicatedAllocation, - prefersDedicatedAllocation, - *pBuffer, // dedicatedBuffer - VK_NULL_HANDLE, // dedicatedImage - pBufferCreateInfo->usage, // dedicatedBufferImageUsage - *pAllocationCreateInfo, - VMA_SUBALLOCATION_TYPE_BUFFER, - 1, // allocationCount - pAllocation); - - if(res >= 0) - { - // 3. Bind buffer with memory. - if((pAllocationCreateInfo->flags & VMA_ALLOCATION_CREATE_DONT_BIND_BIT) == 0) - { - res = allocator->BindBufferMemory(*pAllocation, 0, *pBuffer, VMA_NULL); - } - if(res >= 0) - { - // All steps succeeded. 
- #if VMA_STATS_STRING_ENABLED - (*pAllocation)->InitBufferImageUsage(pBufferCreateInfo->usage); - #endif - if(pAllocationInfo != VMA_NULL) - { - allocator->GetAllocationInfo(*pAllocation, pAllocationInfo); - } - - return VK_SUCCESS; - } - allocator->FreeMemory( - 1, // allocationCount - pAllocation); - *pAllocation = VK_NULL_HANDLE; - (*allocator->GetVulkanFunctions().vkDestroyBuffer)(allocator->m_hDevice, *pBuffer, allocator->GetAllocationCallbacks()); - *pBuffer = VK_NULL_HANDLE; - return res; - } - (*allocator->GetVulkanFunctions().vkDestroyBuffer)(allocator->m_hDevice, *pBuffer, allocator->GetAllocationCallbacks()); - *pBuffer = VK_NULL_HANDLE; - return res; - } - return res; -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCreateBufferWithAlignment( - VmaAllocator allocator, - const VkBufferCreateInfo* pBufferCreateInfo, - const VmaAllocationCreateInfo* pAllocationCreateInfo, - VkDeviceSize minAlignment, - VkBuffer* pBuffer, - VmaAllocation* pAllocation, - VmaAllocationInfo* pAllocationInfo) -{ - VMA_ASSERT(allocator && pBufferCreateInfo && pAllocationCreateInfo && VmaIsPow2(minAlignment) && pBuffer && pAllocation); - - if(pBufferCreateInfo->size == 0) - { - return VK_ERROR_INITIALIZATION_FAILED; - } - if((pBufferCreateInfo->usage & VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT_COPY) != 0 && - !allocator->m_UseKhrBufferDeviceAddress) - { - VMA_ASSERT(0 && "Creating a buffer with VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT is not valid if VMA_ALLOCATOR_CREATE_BUFFER_DEVICE_ADDRESS_BIT was not used."); - return VK_ERROR_INITIALIZATION_FAILED; - } - - VMA_DEBUG_LOG("vmaCreateBufferWithAlignment"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - *pBuffer = VK_NULL_HANDLE; - *pAllocation = VK_NULL_HANDLE; - - // 1. Create VkBuffer. - VkResult res = (*allocator->GetVulkanFunctions().vkCreateBuffer)( - allocator->m_hDevice, - pBufferCreateInfo, - allocator->GetAllocationCallbacks(), - pBuffer); - if(res >= 0) - { - // 2. vkGetBufferMemoryRequirements. - VkMemoryRequirements vkMemReq = {}; - bool requiresDedicatedAllocation = false; - bool prefersDedicatedAllocation = false; - allocator->GetBufferMemoryRequirements(*pBuffer, vkMemReq, - requiresDedicatedAllocation, prefersDedicatedAllocation); - - // 2a. Include minAlignment - vkMemReq.alignment = VMA_MAX(vkMemReq.alignment, minAlignment); - - // 3. Allocate memory using allocator. - res = allocator->AllocateMemory( - vkMemReq, - requiresDedicatedAllocation, - prefersDedicatedAllocation, - *pBuffer, // dedicatedBuffer - VK_NULL_HANDLE, // dedicatedImage - pBufferCreateInfo->usage, // dedicatedBufferImageUsage - *pAllocationCreateInfo, - VMA_SUBALLOCATION_TYPE_BUFFER, - 1, // allocationCount - pAllocation); - - if(res >= 0) - { - // 3. Bind buffer with memory. - if((pAllocationCreateInfo->flags & VMA_ALLOCATION_CREATE_DONT_BIND_BIT) == 0) - { - res = allocator->BindBufferMemory(*pAllocation, 0, *pBuffer, VMA_NULL); - } - if(res >= 0) - { - // All steps succeeded. 
- #if VMA_STATS_STRING_ENABLED - (*pAllocation)->InitBufferImageUsage(pBufferCreateInfo->usage); - #endif - if(pAllocationInfo != VMA_NULL) - { - allocator->GetAllocationInfo(*pAllocation, pAllocationInfo); - } - - return VK_SUCCESS; - } - allocator->FreeMemory( - 1, // allocationCount - pAllocation); - *pAllocation = VK_NULL_HANDLE; - (*allocator->GetVulkanFunctions().vkDestroyBuffer)(allocator->m_hDevice, *pBuffer, allocator->GetAllocationCallbacks()); - *pBuffer = VK_NULL_HANDLE; - return res; - } - (*allocator->GetVulkanFunctions().vkDestroyBuffer)(allocator->m_hDevice, *pBuffer, allocator->GetAllocationCallbacks()); - *pBuffer = VK_NULL_HANDLE; - return res; - } - return res; -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCreateAliasingBuffer( - VmaAllocator VMA_NOT_NULL allocator, - VmaAllocation VMA_NOT_NULL allocation, - const VkBufferCreateInfo* VMA_NOT_NULL pBufferCreateInfo, - VkBuffer VMA_NULLABLE_NON_DISPATCHABLE* VMA_NOT_NULL pBuffer) -{ - VMA_ASSERT(allocator && pBufferCreateInfo && pBuffer && allocation); - - VMA_DEBUG_LOG("vmaCreateAliasingBuffer"); - - *pBuffer = VK_NULL_HANDLE; - - if (pBufferCreateInfo->size == 0) - { - return VK_ERROR_INITIALIZATION_FAILED; - } - if ((pBufferCreateInfo->usage & VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT_COPY) != 0 && - !allocator->m_UseKhrBufferDeviceAddress) - { - VMA_ASSERT(0 && "Creating a buffer with VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT is not valid if VMA_ALLOCATOR_CREATE_BUFFER_DEVICE_ADDRESS_BIT was not used."); - return VK_ERROR_INITIALIZATION_FAILED; - } - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - // 1. Create VkBuffer. - VkResult res = (*allocator->GetVulkanFunctions().vkCreateBuffer)( - allocator->m_hDevice, - pBufferCreateInfo, - allocator->GetAllocationCallbacks(), - pBuffer); - if (res >= 0) - { - // 2. Bind buffer with memory. - res = allocator->BindBufferMemory(allocation, 0, *pBuffer, VMA_NULL); - if (res >= 0) - { - return VK_SUCCESS; - } - (*allocator->GetVulkanFunctions().vkDestroyBuffer)(allocator->m_hDevice, *pBuffer, allocator->GetAllocationCallbacks()); - } - return res; -} - -VMA_CALL_PRE void VMA_CALL_POST vmaDestroyBuffer( - VmaAllocator allocator, - VkBuffer buffer, - VmaAllocation allocation) -{ - VMA_ASSERT(allocator); - - if(buffer == VK_NULL_HANDLE && allocation == VK_NULL_HANDLE) - { - return; - } - - VMA_DEBUG_LOG("vmaDestroyBuffer"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - if(buffer != VK_NULL_HANDLE) - { - (*allocator->GetVulkanFunctions().vkDestroyBuffer)(allocator->m_hDevice, buffer, allocator->GetAllocationCallbacks()); - } - - if(allocation != VK_NULL_HANDLE) - { - allocator->FreeMemory( - 1, // allocationCount - &allocation); - } -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCreateImage( - VmaAllocator allocator, - const VkImageCreateInfo* pImageCreateInfo, - const VmaAllocationCreateInfo* pAllocationCreateInfo, - VkImage* pImage, - VmaAllocation* pAllocation, - VmaAllocationInfo* pAllocationInfo) -{ - VMA_ASSERT(allocator && pImageCreateInfo && pAllocationCreateInfo && pImage && pAllocation); - - if(pImageCreateInfo->extent.width == 0 || - pImageCreateInfo->extent.height == 0 || - pImageCreateInfo->extent.depth == 0 || - pImageCreateInfo->mipLevels == 0 || - pImageCreateInfo->arrayLayers == 0) - { - return VK_ERROR_INITIALIZATION_FAILED; - } - - VMA_DEBUG_LOG("vmaCreateImage"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - *pImage = VK_NULL_HANDLE; - *pAllocation = VK_NULL_HANDLE; - - // 1. Create VkImage. 
- VkResult res = (*allocator->GetVulkanFunctions().vkCreateImage)( - allocator->m_hDevice, - pImageCreateInfo, - allocator->GetAllocationCallbacks(), - pImage); - if(res >= 0) - { - VmaSuballocationType suballocType = pImageCreateInfo->tiling == VK_IMAGE_TILING_OPTIMAL ? - VMA_SUBALLOCATION_TYPE_IMAGE_OPTIMAL : - VMA_SUBALLOCATION_TYPE_IMAGE_LINEAR; - - // 2. Allocate memory using allocator. - VkMemoryRequirements vkMemReq = {}; - bool requiresDedicatedAllocation = false; - bool prefersDedicatedAllocation = false; - allocator->GetImageMemoryRequirements(*pImage, vkMemReq, - requiresDedicatedAllocation, prefersDedicatedAllocation); - - res = allocator->AllocateMemory( - vkMemReq, - requiresDedicatedAllocation, - prefersDedicatedAllocation, - VK_NULL_HANDLE, // dedicatedBuffer - *pImage, // dedicatedImage - pImageCreateInfo->usage, // dedicatedBufferImageUsage - *pAllocationCreateInfo, - suballocType, - 1, // allocationCount - pAllocation); - - if(res >= 0) - { - // 3. Bind image with memory. - if((pAllocationCreateInfo->flags & VMA_ALLOCATION_CREATE_DONT_BIND_BIT) == 0) - { - res = allocator->BindImageMemory(*pAllocation, 0, *pImage, VMA_NULL); - } - if(res >= 0) - { - // All steps succeeded. - #if VMA_STATS_STRING_ENABLED - (*pAllocation)->InitBufferImageUsage(pImageCreateInfo->usage); - #endif - if(pAllocationInfo != VMA_NULL) - { - allocator->GetAllocationInfo(*pAllocation, pAllocationInfo); - } - - return VK_SUCCESS; - } - allocator->FreeMemory( - 1, // allocationCount - pAllocation); - *pAllocation = VK_NULL_HANDLE; - (*allocator->GetVulkanFunctions().vkDestroyImage)(allocator->m_hDevice, *pImage, allocator->GetAllocationCallbacks()); - *pImage = VK_NULL_HANDLE; - return res; - } - (*allocator->GetVulkanFunctions().vkDestroyImage)(allocator->m_hDevice, *pImage, allocator->GetAllocationCallbacks()); - *pImage = VK_NULL_HANDLE; - return res; - } - return res; -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCreateAliasingImage( - VmaAllocator VMA_NOT_NULL allocator, - VmaAllocation VMA_NOT_NULL allocation, - const VkImageCreateInfo* VMA_NOT_NULL pImageCreateInfo, - VkImage VMA_NULLABLE_NON_DISPATCHABLE* VMA_NOT_NULL pImage) -{ - VMA_ASSERT(allocator && pImageCreateInfo && pImage && allocation); - - *pImage = VK_NULL_HANDLE; - - VMA_DEBUG_LOG("vmaCreateImage"); - - if (pImageCreateInfo->extent.width == 0 || - pImageCreateInfo->extent.height == 0 || - pImageCreateInfo->extent.depth == 0 || - pImageCreateInfo->mipLevels == 0 || - pImageCreateInfo->arrayLayers == 0) - { - return VK_ERROR_INITIALIZATION_FAILED; - } - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - // 1. Create VkImage. - VkResult res = (*allocator->GetVulkanFunctions().vkCreateImage)( - allocator->m_hDevice, - pImageCreateInfo, - allocator->GetAllocationCallbacks(), - pImage); - if (res >= 0) - { - // 2. Bind image with memory. 
- res = allocator->BindImageMemory(allocation, 0, *pImage, VMA_NULL); - if (res >= 0) - { - return VK_SUCCESS; - } - (*allocator->GetVulkanFunctions().vkDestroyImage)(allocator->m_hDevice, *pImage, allocator->GetAllocationCallbacks()); - } - return res; -} - -VMA_CALL_PRE void VMA_CALL_POST vmaDestroyImage( - VmaAllocator VMA_NOT_NULL allocator, - VkImage VMA_NULLABLE_NON_DISPATCHABLE image, - VmaAllocation VMA_NULLABLE allocation) -{ - VMA_ASSERT(allocator); - - if(image == VK_NULL_HANDLE && allocation == VK_NULL_HANDLE) - { - return; - } - - VMA_DEBUG_LOG("vmaDestroyImage"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - if(image != VK_NULL_HANDLE) - { - (*allocator->GetVulkanFunctions().vkDestroyImage)(allocator->m_hDevice, image, allocator->GetAllocationCallbacks()); - } - if(allocation != VK_NULL_HANDLE) - { - allocator->FreeMemory( - 1, // allocationCount - &allocation); - } -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCreateVirtualBlock( - const VmaVirtualBlockCreateInfo* VMA_NOT_NULL pCreateInfo, - VmaVirtualBlock VMA_NULLABLE * VMA_NOT_NULL pVirtualBlock) -{ - VMA_ASSERT(pCreateInfo && pVirtualBlock); - VMA_ASSERT(pCreateInfo->size > 0); - VMA_DEBUG_LOG("vmaCreateVirtualBlock"); - VMA_DEBUG_GLOBAL_MUTEX_LOCK; - *pVirtualBlock = vma_new(pCreateInfo->pAllocationCallbacks, VmaVirtualBlock_T)(*pCreateInfo); - VkResult res = (*pVirtualBlock)->Init(); - if(res < 0) - { - vma_delete(pCreateInfo->pAllocationCallbacks, *pVirtualBlock); - *pVirtualBlock = VK_NULL_HANDLE; - } - return res; -} - -VMA_CALL_PRE void VMA_CALL_POST vmaDestroyVirtualBlock(VmaVirtualBlock VMA_NULLABLE virtualBlock) -{ - if(virtualBlock != VK_NULL_HANDLE) - { - VMA_DEBUG_LOG("vmaDestroyVirtualBlock"); - VMA_DEBUG_GLOBAL_MUTEX_LOCK; - VkAllocationCallbacks allocationCallbacks = virtualBlock->m_AllocationCallbacks; // Have to copy the callbacks when destroying. - vma_delete(&allocationCallbacks, virtualBlock); - } -} - -VMA_CALL_PRE VkBool32 VMA_CALL_POST vmaIsVirtualBlockEmpty(VmaVirtualBlock VMA_NOT_NULL virtualBlock) -{ - VMA_ASSERT(virtualBlock != VK_NULL_HANDLE); - VMA_DEBUG_LOG("vmaIsVirtualBlockEmpty"); - VMA_DEBUG_GLOBAL_MUTEX_LOCK; - return virtualBlock->IsEmpty() ? 
VK_TRUE : VK_FALSE; -} - -VMA_CALL_PRE void VMA_CALL_POST vmaGetVirtualAllocationInfo(VmaVirtualBlock VMA_NOT_NULL virtualBlock, - VmaVirtualAllocation VMA_NOT_NULL_NON_DISPATCHABLE allocation, VmaVirtualAllocationInfo* VMA_NOT_NULL pVirtualAllocInfo) -{ - VMA_ASSERT(virtualBlock != VK_NULL_HANDLE && pVirtualAllocInfo != VMA_NULL); - VMA_DEBUG_LOG("vmaGetVirtualAllocationInfo"); - VMA_DEBUG_GLOBAL_MUTEX_LOCK; - virtualBlock->GetAllocationInfo(allocation, *pVirtualAllocInfo); -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaVirtualAllocate(VmaVirtualBlock VMA_NOT_NULL virtualBlock, - const VmaVirtualAllocationCreateInfo* VMA_NOT_NULL pCreateInfo, VmaVirtualAllocation VMA_NULLABLE_NON_DISPATCHABLE* VMA_NOT_NULL pAllocation, - VkDeviceSize* VMA_NULLABLE pOffset) -{ - VMA_ASSERT(virtualBlock != VK_NULL_HANDLE && pCreateInfo != VMA_NULL && pAllocation != VMA_NULL); - VMA_DEBUG_LOG("vmaVirtualAllocate"); - VMA_DEBUG_GLOBAL_MUTEX_LOCK; - return virtualBlock->Allocate(*pCreateInfo, *pAllocation, pOffset); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaVirtualFree(VmaVirtualBlock VMA_NOT_NULL virtualBlock, VmaVirtualAllocation VMA_NULLABLE_NON_DISPATCHABLE allocation) -{ - if(allocation != VK_NULL_HANDLE) - { - VMA_ASSERT(virtualBlock != VK_NULL_HANDLE); - VMA_DEBUG_LOG("vmaVirtualFree"); - VMA_DEBUG_GLOBAL_MUTEX_LOCK; - virtualBlock->Free(allocation); - } -} - -VMA_CALL_PRE void VMA_CALL_POST vmaClearVirtualBlock(VmaVirtualBlock VMA_NOT_NULL virtualBlock) -{ - VMA_ASSERT(virtualBlock != VK_NULL_HANDLE); - VMA_DEBUG_LOG("vmaClearVirtualBlock"); - VMA_DEBUG_GLOBAL_MUTEX_LOCK; - virtualBlock->Clear(); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaSetVirtualAllocationUserData(VmaVirtualBlock VMA_NOT_NULL virtualBlock, - VmaVirtualAllocation VMA_NOT_NULL_NON_DISPATCHABLE allocation, void* VMA_NULLABLE pUserData) -{ - VMA_ASSERT(virtualBlock != VK_NULL_HANDLE); - VMA_DEBUG_LOG("vmaSetVirtualAllocationUserData"); - VMA_DEBUG_GLOBAL_MUTEX_LOCK; - virtualBlock->SetAllocationUserData(allocation, pUserData); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaGetVirtualBlockStatistics(VmaVirtualBlock VMA_NOT_NULL virtualBlock, - VmaStatistics* VMA_NOT_NULL pStats) -{ - VMA_ASSERT(virtualBlock != VK_NULL_HANDLE && pStats != VMA_NULL); - VMA_DEBUG_LOG("vmaGetVirtualBlockStatistics"); - VMA_DEBUG_GLOBAL_MUTEX_LOCK; - virtualBlock->GetStatistics(*pStats); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaCalculateVirtualBlockStatistics(VmaVirtualBlock VMA_NOT_NULL virtualBlock, - VmaDetailedStatistics* VMA_NOT_NULL pStats) -{ - VMA_ASSERT(virtualBlock != VK_NULL_HANDLE && pStats != VMA_NULL); - VMA_DEBUG_LOG("vmaCalculateVirtualBlockStatistics"); - VMA_DEBUG_GLOBAL_MUTEX_LOCK; - virtualBlock->CalculateDetailedStatistics(*pStats); -} - -#if VMA_STATS_STRING_ENABLED - -VMA_CALL_PRE void VMA_CALL_POST vmaBuildVirtualBlockStatsString(VmaVirtualBlock VMA_NOT_NULL virtualBlock, - char* VMA_NULLABLE * VMA_NOT_NULL ppStatsString, VkBool32 detailedMap) -{ - VMA_ASSERT(virtualBlock != VK_NULL_HANDLE && ppStatsString != VMA_NULL); - VMA_DEBUG_GLOBAL_MUTEX_LOCK; - const VkAllocationCallbacks* allocationCallbacks = virtualBlock->GetAllocationCallbacks(); - VmaStringBuilder sb(allocationCallbacks); - virtualBlock->BuildStatsString(detailedMap != VK_FALSE, sb); - *ppStatsString = VmaCreateStringCopy(allocationCallbacks, sb.GetData(), sb.GetLength()); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaFreeVirtualBlockStatsString(VmaVirtualBlock VMA_NOT_NULL virtualBlock, - char* VMA_NULLABLE pStatsString) -{ - if(pStatsString != VMA_NULL) - { - VMA_ASSERT(virtualBlock != 
VK_NULL_HANDLE);
-        VMA_DEBUG_GLOBAL_MUTEX_LOCK;
-        VmaFreeString(virtualBlock->GetAllocationCallbacks(), pStatsString);
-    }
-}
-#endif // VMA_STATS_STRING_ENABLED
-#endif // _VMA_PUBLIC_INTERFACE
-#endif // VMA_IMPLEMENTATION
-
-/**
-\page quick_start Quick start
-
-\section quick_start_project_setup Project setup
-
-Vulkan Memory Allocator comes in the form of a "stb-style" single header file.
-You don't need to build it as a separate library project.
-You can add this file directly to your project and submit it to your code repository next to your other source files.
-
-"Single header" doesn't mean that everything is contained in C/C++ declarations,
-like it tends to be in the case of inline functions or C++ templates.
-It means that the implementation is bundled with the interface in a single file and needs to be extracted using a preprocessor macro.
-If you don't do it properly, you will get linker errors.
-
-To do it properly:
-
--# Include the "vk_mem_alloc.h" file in each CPP file where you want to use the library.
-   This includes declarations of all members of the library.
--# In exactly one CPP file define the following macro before this include.
-   It also enables internal definitions.
-
-\code
-#define VMA_IMPLEMENTATION
-#include "vk_mem_alloc.h"
-\endcode
-
-It may be a good idea to create a dedicated CPP file just for this purpose.
-
-This library includes header `<vulkan/vulkan.h>`, which in turn
-includes `<windows.h>` on Windows. If you need some specific macros defined
-before including these headers (like `WIN32_LEAN_AND_MEAN` or
-`WINVER` for Windows, `VK_USE_PLATFORM_WIN32_KHR` for Vulkan), you must define
-them before every `#include` of this library.
-
-This library is written in C++, but has a C-compatible interface.
-Thus you can include and use vk_mem_alloc.h in C or C++ code, but the full
-implementation with the `VMA_IMPLEMENTATION` macro must be compiled as C++, NOT as C.
-Some features of C++14 are used. STL containers, RTTI, and C++ exceptions are not used.
-
-
-\section quick_start_initialization Initialization
-
-At program startup:
-
--# Initialize Vulkan to have `VkPhysicalDevice`, `VkDevice` and `VkInstance` objects.
--# Fill the VmaAllocatorCreateInfo structure and create a #VmaAllocator object by
-   calling vmaCreateAllocator().
-
-Only the members `physicalDevice`, `device`, `instance` are required.
-However, you should inform the library which Vulkan version you use by setting
-VmaAllocatorCreateInfo::vulkanApiVersion and which extensions you enabled
-by setting VmaAllocatorCreateInfo::flags (like #VMA_ALLOCATOR_CREATE_BUFFER_DEVICE_ADDRESS_BIT for VK_KHR_buffer_device_address).
-Otherwise, VMA would use only the features of Vulkan 1.0 core with no extensions.
-
-You may need to configure how Vulkan functions are imported. There are 3 ways to do this:
-
--# **If you link with the Vulkan static library** (e.g. "vulkan-1.lib" on Windows):
-   - You don't need to do anything.
-   - VMA will use these, as the macro `VMA_STATIC_VULKAN_FUNCTIONS` is defined to 1 by default.
--# **If you want VMA to fetch pointers to Vulkan functions dynamically** using `vkGetInstanceProcAddr`,
-   `vkGetDeviceProcAddr` (this is the option presented in the example below):
-   - Define `VMA_STATIC_VULKAN_FUNCTIONS` to 0, `VMA_DYNAMIC_VULKAN_FUNCTIONS` to 1.
-   - Provide pointers to these two functions via VmaVulkanFunctions::vkGetInstanceProcAddr,
-     VmaVulkanFunctions::vkGetDeviceProcAddr.
-   - The library will fetch pointers to all other functions it needs internally.
--# **If you fetch pointers to all Vulkan functions in a custom way**, e.g. using some loader like
-   [Volk](https://github.com/zeux/volk):
-   - Define `VMA_STATIC_VULKAN_FUNCTIONS` and `VMA_DYNAMIC_VULKAN_FUNCTIONS` to 0.
-   - Pass these pointers via structure #VmaVulkanFunctions.
-
-\code
-VmaVulkanFunctions vulkanFunctions = {};
-vulkanFunctions.vkGetInstanceProcAddr = &vkGetInstanceProcAddr;
-vulkanFunctions.vkGetDeviceProcAddr = &vkGetDeviceProcAddr;
-
-VmaAllocatorCreateInfo allocatorCreateInfo = {};
-allocatorCreateInfo.vulkanApiVersion = VK_API_VERSION_1_2;
-allocatorCreateInfo.physicalDevice = physicalDevice;
-allocatorCreateInfo.device = device;
-allocatorCreateInfo.instance = instance;
-allocatorCreateInfo.pVulkanFunctions = &vulkanFunctions;
-
-VmaAllocator allocator;
-vmaCreateAllocator(&allocatorCreateInfo, &allocator);
-\endcode
-
-
-\section quick_start_resource_allocation Resource allocation
-
-When you want to create a buffer or image:
-
--# Fill the `VkBufferCreateInfo` / `VkImageCreateInfo` structure.
--# Fill the VmaAllocationCreateInfo structure.
--# Call vmaCreateBuffer() / vmaCreateImage() to get `VkBuffer`/`VkImage` with memory
-   already allocated and bound to it, plus a #VmaAllocation object that represents its underlying memory.
-
-\code
-VkBufferCreateInfo bufferInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
-bufferInfo.size = 65536;
-bufferInfo.usage = VK_BUFFER_USAGE_VERTEX_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT;
-
-VmaAllocationCreateInfo allocInfo = {};
-allocInfo.usage = VMA_MEMORY_USAGE_AUTO;
-
-VkBuffer buffer;
-VmaAllocation allocation;
-vmaCreateBuffer(allocator, &bufferInfo, &allocInfo, &buffer, &allocation, nullptr);
-\endcode
-
-Don't forget to destroy your objects when no longer needed:
-
-\code
-vmaDestroyBuffer(allocator, buffer, allocation);
-vmaDestroyAllocator(allocator);
-\endcode
-
-
-\page choosing_memory_type Choosing memory type
-
-Physical devices in Vulkan support various combinations of memory heaps and
-types. Help with choosing the correct and optimal memory type for your specific
-resource is one of the key features of this library. You can use it by filling
-appropriate members of the VmaAllocationCreateInfo structure, as described below.
-You can also combine multiple methods.
-
--# If you just want to find the memory type index that meets your requirements, you
-   can use the functions: vmaFindMemoryTypeIndexForBufferInfo(),
-   vmaFindMemoryTypeIndexForImageInfo(), vmaFindMemoryTypeIndex().
--# If you want to allocate a region of device memory without association with any
-   specific image or buffer, you can use the function vmaAllocateMemory(). Usage of
-   this function is not recommended and usually not needed.
-   The vmaAllocateMemoryPages() function is also provided for creating multiple allocations at once,
-   which may be useful for sparse binding.
--# If you already have a buffer or an image created, want to allocate memory
-   for it, and will bind it yourself, you can use the functions
-   vmaAllocateMemoryForBuffer(), vmaAllocateMemoryForImage().
-   For binding you should use the functions: vmaBindBufferMemory(), vmaBindImageMemory()
-   or their extended versions: vmaBindBufferMemory2(), vmaBindImageMemory2().
--# **This is the easiest and recommended way to use this library:**
-   If you want to create a buffer or an image, allocate memory for it and bind
-   them together, all in one call, you can use the functions vmaCreateBuffer(),
-   vmaCreateImage().
-
-When using 3. or 4., the library internally queries Vulkan for memory types
-supported for that buffer or image (function `vkGetBufferMemoryRequirements()`)
-and uses only one of these types.
-
-If no memory type can be found that meets all the requirements, these functions
-return `VK_ERROR_FEATURE_NOT_PRESENT`.
-
-You can leave the VmaAllocationCreateInfo structure completely filled with zeros.
-It means no requirements are specified for the memory type.
-It is valid, although not very useful.
-
-\section choosing_memory_type_usage Usage
-
-The easiest way to specify memory requirements is to fill the member
-VmaAllocationCreateInfo::usage using one of the values of enum #VmaMemoryUsage.
-It defines high-level, common usage types.
-Since version 3 of the library, it is recommended to use #VMA_MEMORY_USAGE_AUTO to let it select the best memory type for your resource automatically.
-
-For example, if you want to create a uniform buffer that will be filled using
-transfer only once or infrequently and then used for rendering every frame, you can
-do it using the following code. The buffer will most likely end up in a memory type with
-`VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT` so that it is fast to access by the GPU device.
-
-\code
-VkBufferCreateInfo bufferInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
-bufferInfo.size = 65536;
-bufferInfo.usage = VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT;
-
-VmaAllocationCreateInfo allocInfo = {};
-allocInfo.usage = VMA_MEMORY_USAGE_AUTO;
-
-VkBuffer buffer;
-VmaAllocation allocation;
-vmaCreateBuffer(allocator, &bufferInfo, &allocInfo, &buffer, &allocation, nullptr);
-\endcode
-
-If you have a preference for putting the resource in GPU (device) memory or CPU (host) memory
-on systems with a discrete graphics card where these memories are separate, you can use
-#VMA_MEMORY_USAGE_AUTO_PREFER_DEVICE or #VMA_MEMORY_USAGE_AUTO_PREFER_HOST.
-
-When using `VMA_MEMORY_USAGE_AUTO*` values and you want to map the allocated memory,
-you also need to specify one of the host access flags:
-#VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT or #VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT.
-This helps the library choose a preferred memory type that has `VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT`,
-so you can map it.
-
-For example, a staging buffer that will be filled via a mapped pointer and then
-used as a source of transfer to the buffer described previously can be created like this.
-It will likely end up in a memory type that is `HOST_VISIBLE` and `HOST_COHERENT`
-but not `HOST_CACHED` (meaning uncached, write-combined) and not `DEVICE_LOCAL` (meaning system RAM).
-
-\code
-VkBufferCreateInfo stagingBufferInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
-stagingBufferInfo.size = 65536;
-stagingBufferInfo.usage = VK_BUFFER_USAGE_TRANSFER_SRC_BIT;
-
-VmaAllocationCreateInfo stagingAllocInfo = {};
-stagingAllocInfo.usage = VMA_MEMORY_USAGE_AUTO;
-stagingAllocInfo.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT;
-
-VkBuffer stagingBuffer;
-VmaAllocation stagingAllocation;
-vmaCreateBuffer(allocator, &stagingBufferInfo, &stagingAllocInfo, &stagingBuffer, &stagingAllocation, nullptr);
-\endcode
-
-For more examples of creating different kinds of resources, see chapter \ref usage_patterns.
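As an additional sketch (not part of the original text) of the `AUTO_PREFER_*` and host-access flags mentioned above, a hypothetical readback buffer that the GPU fills via transfer and the CPU then reads could be created like this, assuming the same `allocator` as in the previous examples:

\code
// Hypothetical readback buffer: written by the GPU via transfer, then read on the CPU.
VkBufferCreateInfo readbackBufferInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
readbackBufferInfo.size = 65536;
readbackBufferInfo.usage = VK_BUFFER_USAGE_TRANSFER_DST_BIT;

VmaAllocationCreateInfo readbackAllocInfo = {};
readbackAllocInfo.usage = VMA_MEMORY_USAGE_AUTO_PREFER_HOST;              // prefer CPU (host) memory
readbackAllocInfo.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT;   // CPU will read from it

VkBuffer readbackBuffer;
VmaAllocation readbackAllocation;
vmaCreateBuffer(allocator, &readbackBufferInfo, &readbackAllocInfo,
    &readbackBuffer, &readbackAllocation, nullptr);
\endcode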
-
-Usage values `VMA_MEMORY_USAGE_AUTO*` are legal to use only when the library knows
-about the resource being created by having `VkBufferCreateInfo` / `VkImageCreateInfo` passed,
-so they work with functions like: vmaCreateBuffer(), vmaCreateImage(), vmaFindMemoryTypeIndexForBufferInfo() etc.
-If you allocate raw memory using the function vmaAllocateMemory(), you have to use other means of selecting
-a memory type, as described below.
-
-\note
-Old usage values (`VMA_MEMORY_USAGE_GPU_ONLY`, `VMA_MEMORY_USAGE_CPU_ONLY`,
-`VMA_MEMORY_USAGE_CPU_TO_GPU`, `VMA_MEMORY_USAGE_GPU_TO_CPU`, `VMA_MEMORY_USAGE_CPU_COPY`)
-are still available and work the same way as in previous versions of the library
-for backward compatibility, but they are not recommended.
-
-\section choosing_memory_type_required_preferred_flags Required and preferred flags
-
-You can specify more detailed requirements by filling the members
-VmaAllocationCreateInfo::requiredFlags and VmaAllocationCreateInfo::preferredFlags
-with a combination of bits from enum `VkMemoryPropertyFlags`. For example,
-if you want to create a buffer that will be persistently mapped on the host (so it
-must be `HOST_VISIBLE`) and preferably will also be `HOST_COHERENT` and `HOST_CACHED`,
-use the following code:
-
-\code
-VmaAllocationCreateInfo allocInfo = {};
-allocInfo.requiredFlags = VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT;
-allocInfo.preferredFlags = VK_MEMORY_PROPERTY_HOST_COHERENT_BIT | VK_MEMORY_PROPERTY_HOST_CACHED_BIT;
-allocInfo.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT | VMA_ALLOCATION_CREATE_MAPPED_BIT;
-
-VkBuffer buffer;
-VmaAllocation allocation;
-vmaCreateBuffer(allocator, &bufferInfo, &allocInfo, &buffer, &allocation, nullptr);
-\endcode
-
-A memory type is chosen that has all the required flags and as many preferred
-flags set as possible.
-
-The value passed in VmaAllocationCreateInfo::usage is internally converted to a set of required and preferred flags,
-plus some extra "magic" (heuristics).
-
-\section choosing_memory_type_explicit_memory_types Explicit memory types
-
-If you inspected the memory types available on the physical device and you have
-a preference for the memory types that you want to use, you can fill the member
-VmaAllocationCreateInfo::memoryTypeBits. It is a bit mask, where each bit set
-means that a memory type with that index is allowed to be used for the
-allocation. The special value 0, just like `UINT32_MAX`, means there are no
-restrictions on the memory type index.
-
-Please note that this member is NOT just a memory type index.
-Still, you can use it to choose just one specific memory type.
-For example, if you already determined that your buffer should be created in
-memory type 2, use the following code:
-
-\code
-uint32_t memoryTypeIndex = 2;
-
-VmaAllocationCreateInfo allocInfo = {};
-allocInfo.memoryTypeBits = 1u << memoryTypeIndex;
-
-VkBuffer buffer;
-VmaAllocation allocation;
-vmaCreateBuffer(allocator, &bufferInfo, &allocInfo, &buffer, &allocation, nullptr);
-\endcode
-
-
-\section choosing_memory_type_custom_memory_pools Custom memory pools
-
-If you allocate from a custom memory pool, none of the ways of specifying memory
-requirements described above are applicable, and the aforementioned members
-of the VmaAllocationCreateInfo structure are ignored. The memory type is selected
-explicitly when creating the pool and is then used for all the allocations made from
-that pool. For further details, see \ref custom_memory_pools.
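As a rough sketch of that idea (the authoritative description is in \ref custom_memory_pools), creating a pool for an explicitly chosen memory type and then allocating from it could look like this; `memoryTypeIndex` is a hypothetical variable assumed to have been selected beforehand, e.g. with vmaFindMemoryTypeIndexForBufferInfo(), and `bufferInfo` is like the one from the earlier examples:

\code
// Sketch: create a pool tied to one explicitly chosen memory type.
VmaPoolCreateInfo poolCreateInfo = {};
poolCreateInfo.memoryTypeIndex = memoryTypeIndex; // chosen beforehand (hypothetical variable)

VmaPool pool;
vmaCreatePool(allocator, &poolCreateInfo, &pool);

// Allocations taken from the pool ignore usage/requiredFlags/preferredFlags/memoryTypeBits.
VmaAllocationCreateInfo allocCreateInfo = {};
allocCreateInfo.pool = pool;

VkBuffer buffer;
VmaAllocation allocation;
vmaCreateBuffer(allocator, &bufferInfo, &allocCreateInfo, &buffer, &allocation, nullptr);

// Destroy the pool with vmaDestroyPool() after all of its allocations have been freed.
\endcode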
-
-\section choosing_memory_type_dedicated_allocations Dedicated allocations
-
-Memory for allocations is reserved out of a larger block of `VkDeviceMemory`
-allocated from Vulkan internally. That is the main feature of this whole library.
-You can still request a separate memory block to be created for an allocation,
-just like you would do in a trivial solution without using any allocator.
-In that case, a buffer or image is always bound to that memory at offset 0.
-This is called a "dedicated allocation".
-You can explicitly request it by using the flag #VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT.
-The library can also internally decide to use a dedicated allocation in some cases, e.g.:
-
-- When the size of the allocation is large.
-- When the [VK_KHR_dedicated_allocation](@ref vk_khr_dedicated_allocation) extension is enabled
-  and it reports that a dedicated allocation is required or recommended for the resource.
-- When allocation of the next big memory block fails due to insufficient device memory,
-  but an allocation with the exact requested size succeeds.
-
-
-\page memory_mapping Memory mapping
-
-To "map memory" in Vulkan means to obtain a CPU pointer to `VkDeviceMemory`,
-to be able to read from it or write to it in CPU code.
-Mapping is possible only for memory allocated from a memory type that has
-the `VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT` flag.
-The functions `vkMapMemory()`, `vkUnmapMemory()` are designed for this purpose.
-You can use them directly with memory allocated by this library,
-but it is not recommended because of the following issue:
-mapping the same `VkDeviceMemory` block multiple times is illegal; only one mapping at a time is allowed.
-This includes mapping disjoint regions. Mapping is not reference-counted internally by Vulkan.
-Because of this, Vulkan Memory Allocator provides the following facilities:
-
-\note If you want to be able to map an allocation, you need to specify one of the flags
-#VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT or #VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT
-in VmaAllocationCreateInfo::flags. These flags are required for an allocation to be mappable
-when using #VMA_MEMORY_USAGE_AUTO or other `VMA_MEMORY_USAGE_AUTO*` enum values.
-For other usage values they are ignored and every such allocation made in a `HOST_VISIBLE` memory type is mappable,
-but they can still be used for consistency.
-
-\section memory_mapping_mapping_functions Mapping functions
-
-The library provides the following functions for mapping a specific #VmaAllocation: vmaMapMemory(), vmaUnmapMemory().
-They are safer and more convenient to use than the standard Vulkan functions.
-You can map an allocation multiple times simultaneously; mapping is reference-counted internally.
-You can also map different allocations simultaneously regardless of whether they use the same `VkDeviceMemory` block.
-The way it is implemented is that the library always maps the entire memory block, not just the region of the allocation.
-For further details, see the description of the vmaMapMemory() function.
-Example:
-
-\code
-// Having these objects initialized:
-struct ConstantBuffer
-{
-    ...
-};
-ConstantBuffer constantBufferData = ...
-
-VmaAllocator allocator = ...
-VkBuffer constantBuffer = ...
-VmaAllocation constantBufferAllocation = ...
-
-// You can map and fill your buffer using the following code:
-
-void* mappedData;
-vmaMapMemory(allocator, constantBufferAllocation, &mappedData);
-memcpy(mappedData, &constantBufferData, sizeof(constantBufferData));
-vmaUnmapMemory(allocator, constantBufferAllocation);
-\endcode
-
-When mapping, you may see a warning from the Vulkan validation layer similar to this one:
-
-Mapping an image with layout VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL can result in undefined behavior if this memory is used by the device. Only GENERAL or PREINITIALIZED should be used.
-
-It happens because the library maps the entire `VkDeviceMemory` block, where different
-types of images and buffers may end up together, especially on GPUs with unified memory like Intel.
-You can safely ignore it if you are sure you access only memory of the intended
-object that you wanted to map.
-
-
-\section memory_mapping_persistently_mapped_memory Persistently mapped memory
-
-Keeping your memory persistently mapped is generally OK in Vulkan.
-You don't need to unmap it before using its data on the GPU.
-The library provides a special feature designed for that:
-Allocations made with #VMA_ALLOCATION_CREATE_MAPPED_BIT flag set in
-VmaAllocationCreateInfo::flags stay mapped all the time,
-so you can just access the CPU pointer to them at any time
-without a need to call any "map" or "unmap" function.
-Example:
-
-\code
-VkBufferCreateInfo bufCreateInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
-bufCreateInfo.size = sizeof(ConstantBuffer);
-bufCreateInfo.usage = VK_BUFFER_USAGE_TRANSFER_SRC_BIT;
-
-VmaAllocationCreateInfo allocCreateInfo = {};
-allocCreateInfo.usage = VMA_MEMORY_USAGE_AUTO;
-allocCreateInfo.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT |
-    VMA_ALLOCATION_CREATE_MAPPED_BIT;
-
-VkBuffer buf;
-VmaAllocation alloc;
-VmaAllocationInfo allocInfo;
-vmaCreateBuffer(allocator, &bufCreateInfo, &allocCreateInfo, &buf, &alloc, &allocInfo);
-
-// Buffer is already mapped. You can access its memory.
-memcpy(allocInfo.pMappedData, &constantBufferData, sizeof(constantBufferData));
-\endcode
-
-\note #VMA_ALLOCATION_CREATE_MAPPED_BIT by itself doesn't guarantee that the allocation will end up
-in a mappable memory type.
-For this, you need to also specify #VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT or
-#VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT.
-#VMA_ALLOCATION_CREATE_MAPPED_BIT only guarantees that if the memory is `HOST_VISIBLE`, the allocation will be mapped on creation.
-For an example of how to make use of this fact, see section \ref usage_patterns_advanced_data_uploading.
-
-\section memory_mapping_cache_control Cache flush and invalidate
-
-Memory in Vulkan doesn't need to be unmapped before using it on the GPU,
-but unless a memory type has the `VK_MEMORY_PROPERTY_HOST_COHERENT_BIT` flag set,
-you need to manually **invalidate** the cache before reading from a mapped pointer
-and **flush** the cache after writing to a mapped pointer.
-Map/unmap operations don't do that automatically.
-Vulkan provides the following functions for this purpose: `vkFlushMappedMemoryRanges()`,
-`vkInvalidateMappedMemoryRanges()`, but this library provides more convenient
-functions that refer to a given allocation object: vmaFlushAllocation(),
-vmaInvalidateAllocation(),
-or multiple objects at once: vmaFlushAllocations(), vmaInvalidateAllocations().
-
-Regions of memory specified for flush/invalidate must be aligned to
-`VkPhysicalDeviceLimits::nonCoherentAtomSize`. This is automatically ensured by the library.
-In any memory type that is `HOST_VISIBLE` but not `HOST_COHERENT`, all allocations
-within blocks are aligned to this value, so their offsets are always multiples of
-`nonCoherentAtomSize` and two different allocations never share the same "line" of this size.
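-
-As a minimal sketch - assuming `alloc` is an allocation created with #VMA_ALLOCATION_CREATE_MAPPED_BIT
-as in the example above, and `myData` / `myDataSize` are hypothetical - writing through the mapped pointer
-and then flushing the whole allocation could look like this:
-
-\code
-// Write to the persistently mapped pointer first...
-memcpy(allocInfo.pMappedData, myData, myDataSize);
-// ...then flush the written range (offset 0, whole allocation) so the writes become
-// visible to the GPU even on memory types that are not HOST_COHERENT.
-vmaFlushAllocation(allocator, alloc, 0, VK_WHOLE_SIZE);
-\endcode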
-
-Also, Windows drivers from all 3 PC GPU vendors (AMD, Intel, NVIDIA)
-currently provide the `HOST_COHERENT` flag on all memory types that are
-`HOST_VISIBLE`, so on PC you may not need to bother.
-
-
-\page staying_within_budget Staying within budget
-
-When developing a graphics-intensive game or program, it is important to avoid allocating
-more GPU memory than is physically available. When the memory is over-committed,
-various bad things can happen, depending on the specific GPU, graphics driver, and
-operating system:
-
-- It may just work without any problems.
-- The application may slow down because some memory blocks are moved to system RAM
-  and the GPU has to access them through the PCI Express bus.
-- A new allocation may take a very long time to complete, even a few seconds, and possibly
-  freeze the entire system.
-- The new allocation may fail with `VK_ERROR_OUT_OF_DEVICE_MEMORY`.
-- It may even result in a GPU crash (TDR), observed as `VK_ERROR_DEVICE_LOST`
-  returned somewhere later.
-
-\section staying_within_budget_querying_for_budget Querying for budget
-
-To query for current memory usage and available budget, use function vmaGetHeapBudgets().
-The returned structure #VmaBudget contains quantities expressed in bytes, per Vulkan memory heap.
-
-Please note that this function returns different information and works faster than
-vmaCalculateStatistics(). vmaGetHeapBudgets() can be called every frame or even before every
-allocation, while vmaCalculateStatistics() is intended to be used rarely,
-only to obtain statistical information, e.g. for debugging purposes.
-
-It is recommended to use the VK_EXT_memory_budget device extension to obtain information
-about the budget from the Vulkan device. VMA is able to use this extension automatically.
-When not enabled, the allocator behaves the same way, but then it estimates current usage
-and available budget based on its internal information and Vulkan memory heap sizes,
-which may be less precise. In order to use this extension:
-
-1. Make sure extensions VK_EXT_memory_budget and VK_KHR_get_physical_device_properties2
-   required by it are available and enable them. Please note that the first is a device
-   extension and the second is an instance extension!
-2. Use flag #VMA_ALLOCATOR_CREATE_EXT_MEMORY_BUDGET_BIT when creating #VmaAllocator object.
-3. Make sure to call vmaSetCurrentFrameIndex() every frame. Budget is queried from
-   Vulkan inside of it to avoid overhead of querying it with every allocation.
-
-\section staying_within_budget_controlling_memory_usage Controlling memory usage
-
-There are many ways in which you can try to stay within the budget.
-
-First, when making a new allocation requires allocating a new memory block, the library
-tries not to exceed the budget automatically. If a block with the default recommended size
-(e.g. 256 MB) would go over budget, a smaller block is allocated, possibly even
-dedicated memory for just this resource.
-
-If the size of the requested resource plus current memory usage is more than the
-budget, by default the library still tries to create it, leaving it to the Vulkan
-implementation whether the allocation succeeds or fails. You can change this behavior
-by using the #VMA_ALLOCATION_CREATE_WITHIN_BUDGET_BIT flag. With it, the allocation is
-not made if it would exceed the budget or if the budget is already exceeded.
-VMA then tries to make the allocation from the next eligible Vulkan memory type.
-If all of them fail, the call fails with `VK_ERROR_OUT_OF_DEVICE_MEMORY`.
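-
-As a small sketch - assuming `bufCreateInfo` is a hypothetical, already filled `VkBufferCreateInfo` -
-requesting that an allocation stay within budget only takes this one extra flag:
-
-\code
-VmaAllocationCreateInfo allocCreateInfo = {};
-allocCreateInfo.usage = VMA_MEMORY_USAGE_AUTO;
-// Fail instead of going over budget.
-allocCreateInfo.flags = VMA_ALLOCATION_CREATE_WITHIN_BUDGET_BIT;
-
-VkBuffer buf;
-VmaAllocation alloc;
-VkResult res = vmaCreateBuffer(allocator, &bufCreateInfo, &allocCreateInfo, &buf, &alloc, nullptr);
-// res == VK_ERROR_OUT_OF_DEVICE_MEMORY if the allocation would not fit within the budget.
-\endcode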
-
-An example usage pattern may be to pass the #VMA_ALLOCATION_CREATE_WITHIN_BUDGET_BIT flag
-when creating resources that are not essential for the application (e.g. the texture
-of a specific object) and not to pass it when creating critically important resources
-(e.g. render targets).
-
-On AMD graphics cards there is a custom vendor extension available: VK_AMD_memory_overallocation_behavior
-that allows controlling the behavior of the Vulkan implementation in out-of-memory cases -
-whether it should fail with an error code or still allow the allocation.
-Usage of this extension involves only passing an extra structure on Vulkan device creation,
-so it is out of scope of this library.
-
-Finally, you can also use the #VMA_ALLOCATION_CREATE_NEVER_ALLOCATE_BIT flag to make sure
-a new allocation is created only when it fits inside one of the existing memory blocks.
-If it would require allocating a new block, it fails instead with `VK_ERROR_OUT_OF_DEVICE_MEMORY`.
-This also ensures that the function call is very fast because it never goes to Vulkan
-to obtain a new block.
-
-\note Creating \ref custom_memory_pools with VmaPoolCreateInfo::minBlockCount
-set to more than 0 will currently try to allocate memory blocks without checking whether they
-fit within budget.
-
-
-\page resource_aliasing Resource aliasing (overlap)
-
-New explicit graphics APIs (Vulkan and Direct3D 12), thanks to manual memory
-management, give an opportunity to alias (overlap) multiple resources in the
-same region of memory - a feature not available in the old APIs (Direct3D 11, OpenGL).
-It can be useful to save video memory, but it must be used with caution.
-
-For example, if you know the flow of your whole render frame in advance, you
-are going to use some intermediate textures or buffers only during a small range of render passes,
-and you know these ranges don't overlap in time, you can bind these resources to
-the same place in memory, even if they have completely different parameters (width, height, format etc.).
-
-![Resource aliasing (overlap)](../gfx/Aliasing.png)
-
-Such a scenario is possible using VMA, but you need to create your images manually.
-Then you need to calculate parameters of an allocation to be made using the formula:
-
-- allocation size = max(size of each image)
-- allocation alignment = max(alignment of each image)
-- allocation memoryTypeBits = bitwise AND(memoryTypeBits of each image)
-
-The following example shows two different images bound to the same place in memory,
-allocated to fit the largest of them.
-
-\code
-// A 512x512 texture to be sampled.
-VkImageCreateInfo img1CreateInfo = { VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO };
-img1CreateInfo.imageType = VK_IMAGE_TYPE_2D;
-img1CreateInfo.extent.width = 512;
-img1CreateInfo.extent.height = 512;
-img1CreateInfo.extent.depth = 1;
-img1CreateInfo.mipLevels = 10;
-img1CreateInfo.arrayLayers = 1;
-img1CreateInfo.format = VK_FORMAT_R8G8B8A8_SRGB;
-img1CreateInfo.tiling = VK_IMAGE_TILING_OPTIMAL;
-img1CreateInfo.initialLayout = VK_IMAGE_LAYOUT_UNDEFINED;
-img1CreateInfo.usage = VK_IMAGE_USAGE_TRANSFER_DST_BIT | VK_IMAGE_USAGE_SAMPLED_BIT;
-img1CreateInfo.samples = VK_SAMPLE_COUNT_1_BIT;
-
-// A full screen texture to be used as color attachment.
-VkImageCreateInfo img2CreateInfo = { VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO };
-img2CreateInfo.imageType = VK_IMAGE_TYPE_2D;
-img2CreateInfo.extent.width = 1920;
-img2CreateInfo.extent.height = 1080;
-img2CreateInfo.extent.depth = 1;
-img2CreateInfo.mipLevels = 1;
-img2CreateInfo.arrayLayers = 1;
-img2CreateInfo.format = VK_FORMAT_R8G8B8A8_UNORM;
-img2CreateInfo.tiling = VK_IMAGE_TILING_OPTIMAL;
-img2CreateInfo.initialLayout = VK_IMAGE_LAYOUT_UNDEFINED;
-img2CreateInfo.usage = VK_IMAGE_USAGE_SAMPLED_BIT | VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT;
-img2CreateInfo.samples = VK_SAMPLE_COUNT_1_BIT;
-
-VkImage img1;
-res = vkCreateImage(device, &img1CreateInfo, nullptr, &img1);
-VkImage img2;
-res = vkCreateImage(device, &img2CreateInfo, nullptr, &img2);
-
-VkMemoryRequirements img1MemReq;
-vkGetImageMemoryRequirements(device, img1, &img1MemReq);
-VkMemoryRequirements img2MemReq;
-vkGetImageMemoryRequirements(device, img2, &img2MemReq);
-
-VkMemoryRequirements finalMemReq = {};
-finalMemReq.size = std::max(img1MemReq.size, img2MemReq.size);
-finalMemReq.alignment = std::max(img1MemReq.alignment, img2MemReq.alignment);
-finalMemReq.memoryTypeBits = img1MemReq.memoryTypeBits & img2MemReq.memoryTypeBits;
-// Validate: if(finalMemReq.memoryTypeBits != 0)
-
-VmaAllocationCreateInfo allocCreateInfo = {};
-allocCreateInfo.preferredFlags = VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT;
-
-VmaAllocation alloc;
-res = vmaAllocateMemory(allocator, &finalMemReq, &allocCreateInfo, &alloc, nullptr);
-
-res = vmaBindImageMemory(allocator, alloc, img1);
-res = vmaBindImageMemory(allocator, alloc, img2);
-
-// You can use img1, img2 here, but not at the same time!
-
-vmaFreeMemory(allocator, alloc);
-vkDestroyImage(device, img2, nullptr);
-vkDestroyImage(device, img1, nullptr);
-\endcode
-
-Remember that using resources that alias in memory requires proper synchronization.
-You need to issue a memory barrier to make sure commands that use `img1` and `img2`
-don't overlap on the GPU timeline.
-You also need to treat a resource after aliasing as uninitialized - containing garbage data.
-For example, if you use `img1` and then want to use `img2`, you need to issue
-an image memory barrier for `img2` with `oldLayout` = `VK_IMAGE_LAYOUT_UNDEFINED`.
-
-Additional considerations:
-
-- Vulkan also allows interpreting contents of memory between aliasing resources consistently in some cases.
-See chapter 11.8. "Memory Aliasing" of the Vulkan specification or the `VK_IMAGE_CREATE_ALIAS_BIT` flag.
-- You can create a more complex layout where different images and buffers are bound
-at different offsets inside one large allocation. For example, one can imagine
-a big texture used in some render passes, aliasing with a set of many small buffers
-used in some further passes. To bind a resource at a non-zero offset in an allocation,
-use vmaBindBufferMemory2() / vmaBindImageMemory2().
-- Before allocating memory for the resources you want to alias, check `memoryTypeBits`
-returned in memory requirements of each resource to make sure the bits overlap.
-Some GPUs may expose multiple memory types suitable e.g. only for buffers or
-images with `COLOR_ATTACHMENT` usage, so the sets of memory types supported by your
-resources may be disjoint. Aliasing them is not possible in that case.
-
-
-\page custom_memory_pools Custom memory pools
-
-A memory pool contains a number of `VkDeviceMemory` blocks.
-The library automatically creates and manages a default pool for each memory type available on the device.
-The default memory pool automatically grows in size.
-Size of allocated blocks is also variable and managed automatically. - -You can create custom pool and allocate memory out of it. -It can be useful if you want to: - -- Keep certain kind of allocations separate from others. -- Enforce particular, fixed size of Vulkan memory blocks. -- Limit maximum amount of Vulkan memory allocated for that pool. -- Reserve minimum or fixed amount of Vulkan memory always preallocated for that pool. -- Use extra parameters for a set of your allocations that are available in #VmaPoolCreateInfo but not in - #VmaAllocationCreateInfo - e.g., custom minimum alignment, custom `pNext` chain. -- Perform defragmentation on a specific subset of your allocations. - -To use custom memory pools: - --# Fill VmaPoolCreateInfo structure. --# Call vmaCreatePool() to obtain #VmaPool handle. --# When making an allocation, set VmaAllocationCreateInfo::pool to this handle. - You don't need to specify any other parameters of this structure, like `usage`. - -Example: - -\code -// Find memoryTypeIndex for the pool. -VkBufferCreateInfo sampleBufCreateInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO }; -sampleBufCreateInfo.size = 0x10000; // Doesn't matter. -sampleBufCreateInfo.usage = VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT; - -VmaAllocationCreateInfo sampleAllocCreateInfo = {}; -sampleAllocCreateInfo.usage = VMA_MEMORY_USAGE_AUTO; - -uint32_t memTypeIndex; -VkResult res = vmaFindMemoryTypeIndexForBufferInfo(allocator, - &sampleBufCreateInfo, &sampleAllocCreateInfo, &memTypeIndex); -// Check res... - -// Create a pool that can have at most 2 blocks, 128 MiB each. -VmaPoolCreateInfo poolCreateInfo = {}; -poolCreateInfo.memoryTypeIndex = memTypeIndex; -poolCreateInfo.blockSize = 128ull * 1024 * 1024; -poolCreateInfo.maxBlockCount = 2; - -VmaPool pool; -res = vmaCreatePool(allocator, &poolCreateInfo, &pool); -// Check res... - -// Allocate a buffer out of it. -VkBufferCreateInfo bufCreateInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO }; -bufCreateInfo.size = 1024; -bufCreateInfo.usage = VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT; - -VmaAllocationCreateInfo allocCreateInfo = {}; -allocCreateInfo.pool = pool; - -VkBuffer buf; -VmaAllocation alloc; -res = vmaCreateBuffer(allocator, &bufCreateInfo, &allocCreateInfo, &buf, &alloc, nullptr); -// Check res... -\endcode - -You have to free all allocations made from this pool before destroying it. - -\code -vmaDestroyBuffer(allocator, buf, alloc); -vmaDestroyPool(allocator, pool); -\endcode - -New versions of this library support creating dedicated allocations in custom pools. -It is supported only when VmaPoolCreateInfo::blockSize = 0. -To use this feature, set VmaAllocationCreateInfo::pool to the pointer to your custom pool and -VmaAllocationCreateInfo::flags to #VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT. - -\note Excessive use of custom pools is a common mistake when using this library. -Custom pools may be useful for special purposes - when you want to -keep certain type of resources separate e.g. to reserve minimum amount of memory -for them or limit maximum amount of memory they can occupy. For most -resources this is not needed and so it is not recommended to create #VmaPool -objects and allocations out of them. Allocating from the default pool is sufficient. - - -\section custom_memory_pools_MemTypeIndex Choosing memory type index - -When creating a pool, you must explicitly specify memory type index. 
-To find the one suitable for your buffers or images, you can use helper functions -vmaFindMemoryTypeIndexForBufferInfo(), vmaFindMemoryTypeIndexForImageInfo(). -You need to provide structures with example parameters of buffers or images -that you are going to create in that pool. - -\code -VkBufferCreateInfo exampleBufCreateInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO }; -exampleBufCreateInfo.size = 1024; // Doesn't matter -exampleBufCreateInfo.usage = VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT; - -VmaAllocationCreateInfo allocCreateInfo = {}; -allocCreateInfo.usage = VMA_MEMORY_USAGE_AUTO; - -uint32_t memTypeIndex; -vmaFindMemoryTypeIndexForBufferInfo(allocator, &exampleBufCreateInfo, &allocCreateInfo, &memTypeIndex); - -VmaPoolCreateInfo poolCreateInfo = {}; -poolCreateInfo.memoryTypeIndex = memTypeIndex; -// ... -\endcode - -When creating buffers/images allocated in that pool, provide following parameters: - -- `VkBufferCreateInfo`: Prefer to pass same parameters as above. - Otherwise you risk creating resources in a memory type that is not suitable for them, which may result in undefined behavior. - Using different `VK_BUFFER_USAGE_` flags may work, but you shouldn't create images in a pool intended for buffers - or the other way around. -- VmaAllocationCreateInfo: You don't need to pass same parameters. Fill only `pool` member. - Other members are ignored anyway. - -\section linear_algorithm Linear allocation algorithm - -Each Vulkan memory block managed by this library has accompanying metadata that -keeps track of used and unused regions. By default, the metadata structure and -algorithm tries to find best place for new allocations among free regions to -optimize memory usage. This way you can allocate and free objects in any order. - -![Default allocation algorithm](../gfx/Linear_allocator_1_algo_default.png) - -Sometimes there is a need to use simpler, linear allocation algorithm. You can -create custom pool that uses such algorithm by adding flag -#VMA_POOL_CREATE_LINEAR_ALGORITHM_BIT to VmaPoolCreateInfo::flags while creating -#VmaPool object. Then an alternative metadata management is used. It always -creates new allocations after last one and doesn't reuse free regions after -allocations freed in the middle. It results in better allocation performance and -less memory consumed by metadata. - -![Linear allocation algorithm](../gfx/Linear_allocator_2_algo_linear.png) - -With this one flag, you can create a custom pool that can be used in many ways: -free-at-once, stack, double stack, and ring buffer. See below for details. -You don't need to specify explicitly which of these options you are going to use - it is detected automatically. - -\subsection linear_algorithm_free_at_once Free-at-once - -In a pool that uses linear algorithm, you still need to free all the allocations -individually, e.g. by using vmaFreeMemory() or vmaDestroyBuffer(). You can free -them in any order. New allocations are always made after last one - free space -in the middle is not reused. However, when you release all the allocation and -the pool becomes empty, allocation starts from the beginning again. This way you -can use linear algorithm to speed up creation of allocations that you are going -to release all at once. - -![Free-at-once](../gfx/Linear_allocator_3_free_at_once.png) - -This mode is also available for pools created with VmaPoolCreateInfo::maxBlockCount -value that allows multiple memory blocks. 
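-
-For reference, a minimal sketch of this free-at-once usage - assuming `memTypeIndex` was found
-as shown in \ref custom_memory_pools_MemTypeIndex - could look like this:
-
-\code
-VmaPoolCreateInfo poolCreateInfo = {};
-poolCreateInfo.memoryTypeIndex = memTypeIndex;
-poolCreateInfo.flags = VMA_POOL_CREATE_LINEAR_ALGORITHM_BIT;
-
-VmaPool linearPool;
-VkResult res = vmaCreatePool(allocator, &poolCreateInfo, &linearPool);
-// Check res...
-
-// Make allocations with VmaAllocationCreateInfo::pool = linearPool, use them,
-// then free all of them, e.g. with vmaDestroyBuffer() / vmaFreeMemory().
-// Once the pool becomes empty, allocation starts from the beginning of the block again.
-\endcode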
- -\subsection linear_algorithm_stack Stack - -When you free an allocation that was created last, its space can be reused. -Thanks to this, if you always release allocations in the order opposite to their -creation (LIFO - Last In First Out), you can achieve behavior of a stack. - -![Stack](../gfx/Linear_allocator_4_stack.png) - -This mode is also available for pools created with VmaPoolCreateInfo::maxBlockCount -value that allows multiple memory blocks. - -\subsection linear_algorithm_double_stack Double stack - -The space reserved by a custom pool with linear algorithm may be used by two -stacks: - -- First, default one, growing up from offset 0. -- Second, "upper" one, growing down from the end towards lower offsets. - -To make allocation from the upper stack, add flag #VMA_ALLOCATION_CREATE_UPPER_ADDRESS_BIT -to VmaAllocationCreateInfo::flags. - -![Double stack](../gfx/Linear_allocator_7_double_stack.png) - -Double stack is available only in pools with one memory block - -VmaPoolCreateInfo::maxBlockCount must be 1. Otherwise behavior is undefined. - -When the two stacks' ends meet so there is not enough space between them for a -new allocation, such allocation fails with usual -`VK_ERROR_OUT_OF_DEVICE_MEMORY` error. - -\subsection linear_algorithm_ring_buffer Ring buffer - -When you free some allocations from the beginning and there is not enough free space -for a new one at the end of a pool, allocator's "cursor" wraps around to the -beginning and starts allocation there. Thanks to this, if you always release -allocations in the same order as you created them (FIFO - First In First Out), -you can achieve behavior of a ring buffer / queue. - -![Ring buffer](../gfx/Linear_allocator_5_ring_buffer.png) - -Ring buffer is available only in pools with one memory block - -VmaPoolCreateInfo::maxBlockCount must be 1. Otherwise behavior is undefined. - -\note \ref defragmentation is not supported in custom pools created with #VMA_POOL_CREATE_LINEAR_ALGORITHM_BIT. - - -\page defragmentation Defragmentation - -Interleaved allocations and deallocations of many objects of varying size can -cause fragmentation over time, which can lead to a situation where the library is unable -to find a continuous range of free memory for a new allocation despite there is -enough free space, just scattered across many small free ranges between existing -allocations. - -To mitigate this problem, you can use defragmentation feature. -It doesn't happen automatically though and needs your cooperation, -because VMA is a low level library that only allocates memory. -It cannot recreate buffers and images in a new place as it doesn't remember the contents of `VkBufferCreateInfo` / `VkImageCreateInfo` structures. -It cannot copy their contents as it doesn't record any commands to a command buffer. - -Example: - -\code -VmaDefragmentationInfo defragInfo = {}; -defragInfo.pool = myPool; -defragInfo.flags = VMA_DEFRAGMENTATION_FLAG_ALGORITHM_FAST_BIT; - -VmaDefragmentationContext defragCtx; -VkResult res = vmaBeginDefragmentation(allocator, &defragInfo, &defragCtx); -// Check res... - -for(;;) -{ - VmaDefragmentationPassMoveInfo pass; - res = vmaBeginDefragmentationPass(allocator, defragCtx, &pass); - if(res == VK_SUCCESS) - break; - else if(res != VK_INCOMPLETE) - // Handle error... - - for(uint32_t i = 0; i < pass.moveCount; ++i) - { - // Inspect pass.pMoves[i].srcAllocation, identify what buffer/image it represents. 
-        VmaAllocationInfo allocInfo;
-        vmaGetAllocationInfo(allocator, pass.pMoves[i].srcAllocation, &allocInfo);
-        MyEngineResourceData* resData = (MyEngineResourceData*)allocInfo.pUserData;
-
-        // Recreate and bind this buffer/image at: pass.pMoves[i].dstMemory, pass.pMoves[i].dstOffset.
-        VkImageCreateInfo imgCreateInfo = ...
-        VkImage newImg;
-        res = vkCreateImage(device, &imgCreateInfo, nullptr, &newImg);
-        // Check res...
-        res = vmaBindImageMemory(allocator, pass.pMoves[i].dstTmpAllocation, newImg);
-        // Check res...
-
-        // Issue a vkCmdCopyBuffer/vkCmdCopyImage to copy its content to the new place.
-        vkCmdCopyImage(cmdBuf, resData->img, ..., newImg, ...);
-    }
-
-    // Make sure the copy commands finished executing.
-    vkWaitForFences(...);
-
-    // Destroy old buffers/images bound with pass.pMoves[i].srcAllocation.
-    for(uint32_t i = 0; i < pass.moveCount; ++i)
-    {
-        // ...
-        vkDestroyImage(device, resData->img, nullptr);
-    }
-
-    // Update appropriate descriptors to point to the new places...
-
-    res = vmaEndDefragmentationPass(allocator, defragCtx, &pass);
-    if(res == VK_SUCCESS)
-        break;
-    else if(res != VK_INCOMPLETE)
-        // Handle error...
-}
-
-vmaEndDefragmentation(allocator, defragCtx, nullptr);
-\endcode
-
-Although functions like vmaCreateBuffer(), vmaCreateImage(), vmaDestroyBuffer(), vmaDestroyImage()
-create/destroy an allocation and a buffer/image at once, these are just a shortcut for
-creating the resource, allocating memory, and binding them together.
-Defragmentation works on memory allocations only. You must handle the rest manually.
-Defragmentation is an iterative process that should repeat "passes" as long as related functions
-return `VK_INCOMPLETE`, not `VK_SUCCESS`.
-In each pass:
-
-1. vmaBeginDefragmentationPass() function call:
-   - Calculates and returns the list of allocations to be moved in this pass.
-     Note this can be a time-consuming process.
-   - Reserves destination memory for them by creating temporary destination allocations
-     that you can query for their `VkDeviceMemory` + offset using vmaGetAllocationInfo().
-2. Inside the pass, **you should**:
-   - Inspect the returned list of allocations to be moved.
-   - Create new buffers/images and bind them at the returned destination temporary allocations.
-   - Copy data from source to destination resources if necessary.
-   - Destroy the source buffers/images, but NOT their allocations.
-3. vmaEndDefragmentationPass() function call:
-   - Frees the source memory reserved for the allocations that are moved.
-   - Modifies source #VmaAllocation objects that are moved to point to the destination reserved memory.
-   - Frees `VkDeviceMemory` blocks that became empty.
-
-Unlike in previous iterations of the defragmentation API, there is no list of "movable" allocations passed as a parameter.
-The defragmentation algorithm tries to move all suitable allocations.
-You can, however, refuse to move some of them inside a defragmentation pass, by setting
-`pass.pMoves[i].operation` to #VMA_DEFRAGMENTATION_MOVE_OPERATION_IGNORE.
-This is not recommended and may result in suboptimal packing of the allocations after defragmentation.
-If you cannot ensure any allocation can be moved, it is better to keep movable allocations separate in a custom pool.
-
-Inside a pass, for each allocation that should be moved:
-
-- You should copy its data from the source to the destination place by calling e.g. `vkCmdCopyBuffer()`, `vkCmdCopyImage()`.
-  - You need to make sure these commands finished executing before destroying the source buffers/images and before calling vmaEndDefragmentationPass().
-- If a resource doesn't contain any meaningful data, e.g. it is a transient color attachment image to be cleared,
-  filled, and used temporarily in each rendering frame, you can just recreate this image
-  without copying its data.
-- If the resource is in `HOST_VISIBLE` and `HOST_CACHED` memory, you can copy its data on the CPU
-  using `memcpy()`.
-- If you cannot move the allocation, you can set `pass.pMoves[i].operation` to #VMA_DEFRAGMENTATION_MOVE_OPERATION_IGNORE.
-  This will cancel the move.
-  - vmaEndDefragmentationPass() will then free the destination memory - not the source memory of the allocation, leaving it unchanged.
-- If you decide the allocation is unimportant and can be destroyed instead of moved (e.g. it wasn't used for a long time),
-  you can set `pass.pMoves[i].operation` to #VMA_DEFRAGMENTATION_MOVE_OPERATION_DESTROY.
-  - vmaEndDefragmentationPass() will then free both source and destination memory, and will destroy the source #VmaAllocation object.
-
-You can defragment a specific custom pool by setting VmaDefragmentationInfo::pool
-(like in the example above) or all the default pools by setting this member to null.
-
-Defragmentation is always performed in each pool separately.
-Allocations are never moved between different Vulkan memory types.
-The size of the destination memory reserved for a moved allocation is the same as the original one.
-The alignment of an allocation as it was determined using `vkGetBufferMemoryRequirements()` etc. is also respected after defragmentation.
-Buffers/images should be recreated with the same `VkBufferCreateInfo` / `VkImageCreateInfo` parameters as the original ones.
-
-You can perform the defragmentation incrementally to limit the number of allocations and bytes to be moved
-in each pass, e.g. to call it in sync with render frames and avoid too big hitches.
-See members: VmaDefragmentationInfo::maxBytesPerPass, VmaDefragmentationInfo::maxAllocationsPerPass.
-
-It is also safe to perform the defragmentation asynchronously to render frames and other Vulkan and VMA
-usage, possibly from multiple threads, with the exception that allocations
-returned in VmaDefragmentationPassMoveInfo::pMoves shouldn't be destroyed until the defragmentation pass is ended.
-
-Mapping is preserved on allocations that are moved during defragmentation.
-Whether mapped through #VMA_ALLOCATION_CREATE_MAPPED_BIT or vmaMapMemory(), the allocations
-are mapped at their new place. Of course, the pointer to the mapped data changes, so it needs to be queried
-using VmaAllocationInfo::pMappedData.
-
-\note Defragmentation is not supported in custom pools created with #VMA_POOL_CREATE_LINEAR_ALGORITHM_BIT.
-
-
-\page statistics Statistics
-
-This library contains several functions that return information about its internal state,
-especially the amount of memory allocated from Vulkan.
-
-\section statistics_numeric_statistics Numeric statistics
-
-If you need to obtain basic statistics about memory usage per heap, together with current budget,
-you can call function vmaGetHeapBudgets() and inspect structure #VmaBudget.
-This is useful to keep track of memory usage and stay within budget
-(see also \ref staying_within_budget).
-Example:
-
-\code
-uint32_t heapIndex = ...
-
-VmaBudget budgets[VK_MAX_MEMORY_HEAPS];
-vmaGetHeapBudgets(allocator, budgets);
-
-printf("My heap currently has %u allocations taking %llu B,\n",
-    budgets[heapIndex].statistics.allocationCount,
-    budgets[heapIndex].statistics.allocationBytes);
-printf("allocated out of %u Vulkan device memory blocks taking %llu B,\n",
-    budgets[heapIndex].statistics.blockCount,
-    budgets[heapIndex].statistics.blockBytes);
-printf("Vulkan reports total usage %llu B with budget %llu B.\n",
-    budgets[heapIndex].usage,
-    budgets[heapIndex].budget);
-\endcode
-
-You can query for more detailed statistics per memory heap, type, and totals,
-including minimum and maximum allocation size and unused range size,
-by calling function vmaCalculateStatistics() and inspecting structure #VmaTotalStatistics.
-This function is slower though, as it has to traverse all the internal data structures,
-so it should be used only for debugging purposes.
-
-You can query for statistics of a custom pool using function vmaGetPoolStatistics()
-or vmaCalculatePoolStatistics().
-
-You can query for information about a specific allocation using function vmaGetAllocationInfo().
-It fills the structure #VmaAllocationInfo.
-
-\section statistics_json_dump JSON dump
-
-You can dump the internal state of the allocator to a string in JSON format using function vmaBuildStatsString().
-The result is guaranteed to be correct JSON.
-It uses ANSI encoding.
-Any strings provided by the user (see [Allocation names](@ref allocation_names))
-are copied as-is and properly escaped for JSON, so if they use UTF-8, ISO-8859-2 or any other encoding,
-this JSON string can be treated as using this encoding.
-It must be freed using function vmaFreeStatsString().
-
-The format of this JSON string is not part of the official documentation of the library,
-but it will not change in a backward-incompatible way without increasing the library's major version number
-and an appropriate mention in the changelog.
-
-The JSON string contains all the data that can be obtained using vmaCalculateStatistics().
-It can also contain a detailed map of allocated memory blocks and their regions -
-free and occupied by allocations.
-This allows e.g. visualizing the memory or assessing fragmentation.
-
-
-\page allocation_annotation Allocation names and user data
-
-\section allocation_user_data Allocation user data
-
-You can annotate allocations with your own information, e.g. for debugging purposes.
-To do that, fill the VmaAllocationCreateInfo::pUserData field when creating
-an allocation. It is an opaque `void*` pointer. You can use it e.g. as a pointer,
-some handle, index, key, ordinal number or any other value that would associate
-the allocation with your custom metadata.
-It is useful to identify appropriate data structures in your engine given a #VmaAllocation,
-e.g. when doing \ref defragmentation.
-
-\code
-VkBufferCreateInfo bufCreateInfo = ...
-
-MyBufferMetadata* pMetadata = CreateBufferMetadata();
-
-VmaAllocationCreateInfo allocCreateInfo = {};
-allocCreateInfo.usage = VMA_MEMORY_USAGE_AUTO;
-allocCreateInfo.pUserData = pMetadata;
-
-VkBuffer buffer;
-VmaAllocation allocation;
-vmaCreateBuffer(allocator, &bufCreateInfo, &allocCreateInfo, &buffer, &allocation, nullptr);
-\endcode
-
-The pointer may be later retrieved as VmaAllocationInfo::pUserData:
-
-\code
-VmaAllocationInfo allocInfo;
-vmaGetAllocationInfo(allocator, allocation, &allocInfo);
-MyBufferMetadata* pMetadata = (MyBufferMetadata*)allocInfo.pUserData;
-\endcode
-
-It can also be changed using function vmaSetAllocationUserData().
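-
-A minimal sketch of doing that later - reusing `allocation` and the hypothetical metadata factory
-from the example above:
-
-\code
-// Replace the previously set pointer with a different metadata object.
-MyBufferMetadata* pNewMetadata = CreateBufferMetadata();
-vmaSetAllocationUserData(allocator, allocation, pNewMetadata);
-\endcode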
- -Values of (non-zero) allocations' `pUserData` are printed in JSON report created by -vmaBuildStatsString() in hexadecimal form. - -\section allocation_names Allocation names - -An allocation can also carry a null-terminated string, giving a name to the allocation. -To set it, call vmaSetAllocationName(). -The library creates internal copy of the string, so the pointer you pass doesn't need -to be valid for whole lifetime of the allocation. You can free it after the call. - -\code -std::string imageName = "Texture: "; -imageName += fileName; -vmaSetAllocationName(allocator, allocation, imageName.c_str()); -\endcode - -The string can be later retrieved by inspecting VmaAllocationInfo::pName. -It is also printed in JSON report created by vmaBuildStatsString(). - -\note Setting string name to VMA allocation doesn't automatically set it to the Vulkan buffer or image created with it. -You must do it manually using an extension like VK_EXT_debug_utils, which is independent of this library. - - -\page virtual_allocator Virtual allocator - -As an extra feature, the core allocation algorithm of the library is exposed through a simple and convenient API of "virtual allocator". -It doesn't allocate any real GPU memory. It just keeps track of used and free regions of a "virtual block". -You can use it to allocate your own memory or other objects, even completely unrelated to Vulkan. -A common use case is sub-allocation of pieces of one large GPU buffer. - -\section virtual_allocator_creating_virtual_block Creating virtual block - -To use this functionality, there is no main "allocator" object. -You don't need to have #VmaAllocator object created. -All you need to do is to create a separate #VmaVirtualBlock object for each block of memory you want to be managed by the allocator: - --# Fill in #VmaVirtualBlockCreateInfo structure. --# Call vmaCreateVirtualBlock(). Get new #VmaVirtualBlock object. - -Example: - -\code -VmaVirtualBlockCreateInfo blockCreateInfo = {}; -blockCreateInfo.size = 1048576; // 1 MB - -VmaVirtualBlock block; -VkResult res = vmaCreateVirtualBlock(&blockCreateInfo, &block); -\endcode - -\section virtual_allocator_making_virtual_allocations Making virtual allocations - -#VmaVirtualBlock object contains internal data structure that keeps track of free and occupied regions -using the same code as the main Vulkan memory allocator. -Similarly to #VmaAllocation for standard GPU allocations, there is #VmaVirtualAllocation type -that represents an opaque handle to an allocation withing the virtual block. - -In order to make such allocation: - --# Fill in #VmaVirtualAllocationCreateInfo structure. --# Call vmaVirtualAllocate(). Get new #VmaVirtualAllocation object that represents the allocation. - You can also receive `VkDeviceSize offset` that was assigned to the allocation. - -Example: - -\code -VmaVirtualAllocationCreateInfo allocCreateInfo = {}; -allocCreateInfo.size = 4096; // 4 KB - -VmaVirtualAllocation alloc; -VkDeviceSize offset; -res = vmaVirtualAllocate(block, &allocCreateInfo, &alloc, &offset); -if(res == VK_SUCCESS) -{ - // Use the 4 KB of your memory starting at offset. -} -else -{ - // Allocation failed - no space for it could be found. Handle this error! -} -\endcode - -\section virtual_allocator_deallocation Deallocation - -When no longer needed, an allocation can be freed by calling vmaVirtualFree(). -You can only pass to this function an allocation that was previously returned by vmaVirtualAllocate() -called for the same #VmaVirtualBlock. 
-
-When the whole block is no longer needed, the block object can be released by calling vmaDestroyVirtualBlock().
-All allocations must be freed before the block is destroyed, which is checked internally by an assert.
-However, if you don't want to call vmaVirtualFree() for each allocation, you can use vmaClearVirtualBlock() to free them all at once -
-a feature not available in the normal Vulkan memory allocator. Example:
-
-\code
-vmaVirtualFree(block, alloc);
-vmaDestroyVirtualBlock(block);
-\endcode
-
-\section virtual_allocator_allocation_parameters Allocation parameters
-
-You can attach a custom pointer to each allocation by using vmaSetVirtualAllocationUserData().
-Its default value is null.
-It can be used to store any data that needs to be associated with that allocation - e.g. an index, a handle, or a pointer to some
-larger data structure containing more information. Example:
-
-\code
-struct CustomAllocData
-{
-    std::string m_AllocName;
-};
-CustomAllocData* allocData = new CustomAllocData();
-allocData->m_AllocName = "My allocation 1";
-vmaSetVirtualAllocationUserData(block, alloc, allocData);
-\endcode
-
-The pointer can later be fetched, along with allocation offset and size, by passing the allocation handle to function
-vmaGetVirtualAllocationInfo() and inspecting the returned structure #VmaVirtualAllocationInfo.
-If you allocated a new object to be used as the custom pointer, don't forget to delete that object before freeing the allocation!
-Example:
-
-\code
-VmaVirtualAllocationInfo allocInfo;
-vmaGetVirtualAllocationInfo(block, alloc, &allocInfo);
-delete (CustomAllocData*)allocInfo.pUserData;
-
-vmaVirtualFree(block, alloc);
-\endcode
-
-\section virtual_allocator_alignment_and_units Alignment and units
-
-It feels natural to express sizes and offsets in bytes.
-If an offset of an allocation needs to be aligned to a multiple of some number (e.g. 4 bytes), you can fill optional member
-VmaVirtualAllocationCreateInfo::alignment to request it. Example:
-
-\code
-VmaVirtualAllocationCreateInfo allocCreateInfo = {};
-allocCreateInfo.size = 4096; // 4 KB
-allocCreateInfo.alignment = 4; // Returned offset must be a multiple of 4 B
-
-VmaVirtualAllocation alloc;
-res = vmaVirtualAllocate(block, &allocCreateInfo, &alloc, nullptr);
-\endcode
-
-Alignments of different allocations made from one block may vary.
-However, if all alignments and sizes are always multiples of some size e.g. 4 B or `sizeof(MyDataStruct)`,
-you can express all sizes, alignments, and offsets in multiples of that size instead of individual bytes.
-It might be more convenient, but you need to make sure to use this new unit consistently in all the places:
-
-- VmaVirtualBlockCreateInfo::size
-- VmaVirtualAllocationCreateInfo::size and VmaVirtualAllocationCreateInfo::alignment
-- Using offset returned by vmaVirtualAllocate() or in VmaVirtualAllocationInfo::offset
-
-\section virtual_allocator_statistics Statistics
-
-You can obtain statistics of a virtual block using vmaGetVirtualBlockStatistics()
-(to get brief statistics that are fast to calculate)
-or vmaCalculateVirtualBlockStatistics() (to get more detailed statistics, slower to calculate).
-The functions fill the structures #VmaStatistics and #VmaDetailedStatistics respectively - the same as used by the normal Vulkan memory allocator.
-Example: - -\code -VmaStatistics stats; -vmaGetVirtualBlockStatistics(block, &stats); -printf("My virtual block has %llu bytes used by %u virtual allocations\n", - stats.allocationBytes, stats.allocationCount); -\endcode - -You can also request a full list of allocations and free regions as a string in JSON format by calling -vmaBuildVirtualBlockStatsString(). -Returned string must be later freed using vmaFreeVirtualBlockStatsString(). -The format of this string differs from the one returned by the main Vulkan allocator, but it is similar. - -\section virtual_allocator_additional_considerations Additional considerations - -The "virtual allocator" functionality is implemented on a level of individual memory blocks. -Keeping track of a whole collection of blocks, allocating new ones when out of free space, -deleting empty ones, and deciding which one to try first for a new allocation must be implemented by the user. - -Alternative allocation algorithms are supported, just like in custom pools of the real GPU memory. -See enum #VmaVirtualBlockCreateFlagBits to learn how to specify them (e.g. #VMA_VIRTUAL_BLOCK_CREATE_LINEAR_ALGORITHM_BIT). -You can find their description in chapter \ref custom_memory_pools. -Allocation strategies are also supported. -See enum #VmaVirtualAllocationCreateFlagBits to learn how to specify them (e.g. #VMA_VIRTUAL_ALLOCATION_CREATE_STRATEGY_MIN_TIME_BIT). - -Following features are supported only by the allocator of the real GPU memory and not by virtual allocations: -buffer-image granularity, `VMA_DEBUG_MARGIN`, `VMA_MIN_ALIGNMENT`. - - -\page debugging_memory_usage Debugging incorrect memory usage - -If you suspect a bug with memory usage, like usage of uninitialized memory or -memory being overwritten out of bounds of an allocation, -you can use debug features of this library to verify this. - -\section debugging_memory_usage_initialization Memory initialization - -If you experience a bug with incorrect and nondeterministic data in your program and you suspect uninitialized memory to be used, -you can enable automatic memory initialization to verify this. -To do it, define macro `VMA_DEBUG_INITIALIZE_ALLOCATIONS` to 1. - -\code -#define VMA_DEBUG_INITIALIZE_ALLOCATIONS 1 -#include "vk_mem_alloc.h" -\endcode - -It makes memory of new allocations initialized to bit pattern `0xDCDCDCDC`. -Before an allocation is destroyed, its memory is filled with bit pattern `0xEFEFEFEF`. -Memory is automatically mapped and unmapped if necessary. - -If you find these values while debugging your program, good chances are that you incorrectly -read Vulkan memory that is allocated but not initialized, or already freed, respectively. - -Memory initialization works only with memory types that are `HOST_VISIBLE` and with allocations that can be mapped. -It works also with dedicated allocations. - -\section debugging_memory_usage_margins Margins - -By default, allocations are laid out in memory blocks next to each other if possible -(considering required alignment, `bufferImageGranularity`, and `nonCoherentAtomSize`). - -![Allocations without margin](../gfx/Margins_1.png) - -Define macro `VMA_DEBUG_MARGIN` to some non-zero value (e.g. 16) to enforce specified -number of bytes as a margin after every allocation. - -\code -#define VMA_DEBUG_MARGIN 16 -#include "vk_mem_alloc.h" -\endcode - -![Allocations with margin](../gfx/Margins_2.png) - -If your bug goes away after enabling margins, it means it may be caused by memory -being overwritten outside of allocation boundaries. 
It is not 100% certain, though.
-A change in application behavior may also be caused by a different order and distribution
-of allocations across memory blocks after margins are applied.
-
-Margins work with all types of memory.
-
-The margin is applied only to allocations made out of memory blocks and not to dedicated
-allocations, which have their own memory block of a specific size.
-It is thus not applied to allocations made using the #VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT flag
-or those automatically put into dedicated allocations, e.g. due to their
-large size or as recommended by the VK_KHR_dedicated_allocation extension.
-
-Margins appear in the [JSON dump](@ref statistics_json_dump) as part of free space.
-
-Note that enabling margins increases memory usage and fragmentation.
-
-Margins do not apply to \ref virtual_allocator.
-
-\section debugging_memory_usage_corruption_detection Corruption detection
-
-You can additionally define macro `VMA_DEBUG_DETECT_CORRUPTION` to 1 to enable validation
-of contents of the margins.
-
-\code
-#define VMA_DEBUG_MARGIN 16
-#define VMA_DEBUG_DETECT_CORRUPTION 1
-#include "vk_mem_alloc.h"
-\endcode
-
-When this feature is enabled, the number of bytes specified as `VMA_DEBUG_MARGIN`
-(it must be a multiple of 4) after every allocation is filled with a magic number.
-This idea is also known as a "canary".
-Memory is automatically mapped and unmapped if necessary.
-
-This number is validated automatically when the allocation is destroyed.
-If it is not equal to the expected value, `VMA_ASSERT()` is executed.
-It clearly means that either the CPU or the GPU overwrote the memory outside the boundaries of the allocation,
-which indicates a serious bug.
-
-You can also explicitly request checking margins of all allocations in all memory blocks
-that belong to specified memory types by using function vmaCheckCorruption(),
-or in memory blocks that belong to a specified custom pool, by using function
-vmaCheckPoolCorruption().
-
-Margin validation (corruption detection) works only for memory types that are
-`HOST_VISIBLE` and `HOST_COHERENT`.
-
-
-\page opengl_interop OpenGL Interop
-
-VMA provides some features that help with interoperability with OpenGL.
-
-\section opengl_interop_exporting_memory Exporting memory
-
-If you want to attach a `VkExportMemoryAllocateInfoKHR` structure to the `pNext` chain of memory allocations made by the library:
-
-It is recommended to create \ref custom_memory_pools for such allocations.
-Define and fill in your `VkExportMemoryAllocateInfoKHR` structure and attach it to VmaPoolCreateInfo::pMemoryAllocateNext
-while creating the custom pool.
-Please note that the structure must remain alive and unchanged for the whole lifetime of the #VmaPool,
-not only while creating it, as no copy of the structure is made,
-but its original pointer is used for each allocation instead.
-
-If you want to export all memory allocated by the library from certain memory types,
-including dedicated allocations and other allocations made from default pools,
-an alternative solution is to fill in VmaAllocatorCreateInfo::pTypeExternalMemoryHandleTypes.
-It should point to an array with `VkExternalMemoryHandleTypeFlagsKHR` to be automatically passed by the library
-through `VkExportMemoryAllocateInfoKHR` on each allocation made from a specific memory type.
-Please note that new versions of the library also support dedicated allocations created in custom pools.
-
-You should not mix these two methods in a way that allows applying both to the same memory type.
-Otherwise, `VkExportMemoryAllocateInfoKHR` structure would be attached twice to the `pNext` chain of `VkMemoryAllocateInfo`. - - -\section opengl_interop_custom_alignment Custom alignment - -Buffers or images exported to a different API like OpenGL may require a different alignment, -higher than the one used by the library automatically, queried from functions like `vkGetBufferMemoryRequirements`. -To impose such alignment: - -It is recommended to create \ref custom_memory_pools for such allocations. -Set VmaPoolCreateInfo::minAllocationAlignment member to the minimum alignment required for each allocation -to be made out of this pool. -The alignment actually used will be the maximum of this member and the alignment returned for the specific buffer or image -from a function like `vkGetBufferMemoryRequirements`, which is called by VMA automatically. - -If you want to create a buffer with a specific minimum alignment out of default pools, -use special function vmaCreateBufferWithAlignment(), which takes additional parameter `minAlignment`. - -Note the problem of alignment affects only resources placed inside bigger `VkDeviceMemory` blocks and not dedicated -allocations, as these, by definition, always have alignment = 0 because the resource is bound to the beginning of its dedicated block. -Contrary to Direct3D 12, Vulkan doesn't have a concept of alignment of the entire memory block passed on its allocation. - - -\page usage_patterns Recommended usage patterns - -Vulkan gives great flexibility in memory allocation. -This chapter shows the most common patterns. - -See also slides from talk: -[Sawicki, Adam. Advanced Graphics Techniques Tutorial: Memory management in Vulkan and DX12. Game Developers Conference, 2018](https://www.gdcvault.com/play/1025458/Advanced-Graphics-Techniques-Tutorial-New) - - -\section usage_patterns_gpu_only GPU-only resource - -When: -Any resources that you frequently write and read on GPU, -e.g. images used as color attachments (aka "render targets"), depth-stencil attachments, -images/buffers used as storage image/buffer (aka "Unordered Access View (UAV)"). - -What to do: -Let the library select the optimal memory type, which will likely have `VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT`. - -\code -VkImageCreateInfo imgCreateInfo = { VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO }; -imgCreateInfo.imageType = VK_IMAGE_TYPE_2D; -imgCreateInfo.extent.width = 3840; -imgCreateInfo.extent.height = 2160; -imgCreateInfo.extent.depth = 1; -imgCreateInfo.mipLevels = 1; -imgCreateInfo.arrayLayers = 1; -imgCreateInfo.format = VK_FORMAT_R8G8B8A8_UNORM; -imgCreateInfo.tiling = VK_IMAGE_TILING_OPTIMAL; -imgCreateInfo.initialLayout = VK_IMAGE_LAYOUT_UNDEFINED; -imgCreateInfo.usage = VK_IMAGE_USAGE_SAMPLED_BIT | VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT; -imgCreateInfo.samples = VK_SAMPLE_COUNT_1_BIT; - -VmaAllocationCreateInfo allocCreateInfo = {}; -allocCreateInfo.usage = VMA_MEMORY_USAGE_AUTO; -allocCreateInfo.flags = VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT; -allocCreateInfo.priority = 1.0f; - -VkImage img; -VmaAllocation alloc; -vmaCreateImage(allocator, &imgCreateInfo, &allocCreateInfo, &img, &alloc, nullptr); -\endcode - -Also consider: -Consider creating them as dedicated allocations using #VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT, -especially if they are large or if you plan to destroy and recreate them with different sizes -e.g. when display resolution changes. -Prefer to create such resources first and all other GPU resources (like textures and vertex buffers) later. 
-
-When the VK_EXT_memory_priority extension is enabled, it is also worth setting a high priority for such an allocation
-to decrease its chances of being evicted to system memory by the operating system.
-
-\section usage_patterns_staging_copy_upload Staging copy for upload
-
-When:
-A "staging" buffer that you want to map and fill from CPU code, then use as a source of transfer
-to some GPU resource.
-
-What to do:
-Use flag #VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT.
-Let the library select the optimal memory type, which will always have `VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT`.
-
-\code
-VkBufferCreateInfo bufCreateInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
-bufCreateInfo.size = 65536;
-bufCreateInfo.usage = VK_BUFFER_USAGE_TRANSFER_SRC_BIT;
-
-VmaAllocationCreateInfo allocCreateInfo = {};
-allocCreateInfo.usage = VMA_MEMORY_USAGE_AUTO;
-allocCreateInfo.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT |
-    VMA_ALLOCATION_CREATE_MAPPED_BIT;
-
-VkBuffer buf;
-VmaAllocation alloc;
-VmaAllocationInfo allocInfo;
-vmaCreateBuffer(allocator, &bufCreateInfo, &allocCreateInfo, &buf, &alloc, &allocInfo);
-
-...
-
-memcpy(allocInfo.pMappedData, myData, myDataSize);
-\endcode
-
-Also consider:
-You can map the allocation using vmaMapMemory() or you can create it as persistently mapped
-using #VMA_ALLOCATION_CREATE_MAPPED_BIT, as in the example above.
-
-
-\section usage_patterns_readback Readback
-
-When:
-Buffers for data written by or transferred from the GPU that you want to read back on the CPU,
-e.g. results of some computations.
-
-What to do:
-Use flag #VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT.
-Let the library select the optimal memory type, which will always have `VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT`
-and `VK_MEMORY_PROPERTY_HOST_CACHED_BIT`.
-
-\code
-VkBufferCreateInfo bufCreateInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
-bufCreateInfo.size = 65536;
-bufCreateInfo.usage = VK_BUFFER_USAGE_TRANSFER_DST_BIT;
-
-VmaAllocationCreateInfo allocCreateInfo = {};
-allocCreateInfo.usage = VMA_MEMORY_USAGE_AUTO;
-allocCreateInfo.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT |
-    VMA_ALLOCATION_CREATE_MAPPED_BIT;
-
-VkBuffer buf;
-VmaAllocation alloc;
-VmaAllocationInfo allocInfo;
-vmaCreateBuffer(allocator, &bufCreateInfo, &allocCreateInfo, &buf, &alloc, &allocInfo);
-
-...
-
-const float* downloadedData = (const float*)allocInfo.pMappedData;
-\endcode
-
-
-\section usage_patterns_advanced_data_uploading Advanced data uploading
-
-For resources that you frequently write on the CPU via a mapped pointer and
-frequently read on the GPU e.g. as a uniform buffer (also called "dynamic"), multiple options are possible:
-
--# The easiest solution is to have one copy of the resource in `HOST_VISIBLE` memory,
-   even if it means system RAM (not `DEVICE_LOCAL`) on systems with a discrete graphics card,
-   and make the device reach out to that resource directly.
-   - Reads performed by the device will then go through the PCI Express bus.
-     The performance of this access may be limited, but it may be fine depending on the size
-     of this resource (whether it is small enough to quickly end up in GPU cache) and the sparsity
-     of access.
--# On systems with unified memory (e.g. AMD APU or Intel integrated graphics, mobile chips),
-   a memory type may be available that is both `HOST_VISIBLE` (available for mapping) and `DEVICE_LOCAL`
-   (fast to access from the GPU). Then, it is likely the best choice for such a type of resource.
--# Systems with a discrete graphics card and separate video memory may or may not expose
-   a memory type that is both `HOST_VISIBLE` and `DEVICE_LOCAL`, also known as Base Address Register (BAR).
-   If they do, it represents a piece of VRAM (or entire VRAM, if ReBAR is enabled in the motherboard BIOS)
-   that is available to the CPU for mapping.
-   - Writes performed by the host to that memory go through the PCI Express bus.
-     The performance of these writes may be limited, but it may be fine, especially on PCIe 4.0,
-     as long as the rules of using uncached and write-combined memory are followed - only sequential writes and no reads.
--# Finally, you may need or prefer to create a separate copy of the resource in `DEVICE_LOCAL` memory,
-   a separate "staging" copy in `HOST_VISIBLE` memory and perform an explicit transfer command between them.
-
-Thankfully, VMA offers an aid to create and use such resources in the way optimal
-for the current Vulkan device. To help the library make the best choice,
-use flag #VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT together with
-#VMA_ALLOCATION_CREATE_HOST_ACCESS_ALLOW_TRANSFER_INSTEAD_BIT.
-It will then prefer a memory type that is both `DEVICE_LOCAL` and `HOST_VISIBLE` (integrated memory or BAR),
-but if no such memory type is available or allocation from it fails
-(PC graphics cards have only 256 MB of BAR by default, unless ReBAR is supported and enabled in BIOS),
-it will fall back to `DEVICE_LOCAL` memory for fast GPU access.
-It is then up to you to detect that the allocation ended up in a memory type that is not `HOST_VISIBLE`,
-so you need to create another "staging" allocation and perform explicit transfers.
-
-\code
-VkBufferCreateInfo bufCreateInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
-bufCreateInfo.size = 65536;
-bufCreateInfo.usage = VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT;
-
-VmaAllocationCreateInfo allocCreateInfo = {};
-allocCreateInfo.usage = VMA_MEMORY_USAGE_AUTO;
-allocCreateInfo.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT |
-    VMA_ALLOCATION_CREATE_HOST_ACCESS_ALLOW_TRANSFER_INSTEAD_BIT |
-    VMA_ALLOCATION_CREATE_MAPPED_BIT;
-
-VkBuffer buf;
-VmaAllocation alloc;
-VmaAllocationInfo allocInfo;
-vmaCreateBuffer(allocator, &bufCreateInfo, &allocCreateInfo, &buf, &alloc, &allocInfo);
-
-VkMemoryPropertyFlags memPropFlags;
-vmaGetAllocationMemoryProperties(allocator, alloc, &memPropFlags);
-
-if(memPropFlags & VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT)
-{
-    // Allocation ended up in a mappable memory and is already mapped - write to it directly.
-
-    // [Executed in runtime]:
-    memcpy(allocInfo.pMappedData, myData, myDataSize);
-}
-else
-{
-    // Allocation ended up in a non-mappable memory - need to transfer.
-    VkBufferCreateInfo stagingBufCreateInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
-    stagingBufCreateInfo.size = 65536;
-    stagingBufCreateInfo.usage = VK_BUFFER_USAGE_TRANSFER_SRC_BIT;
-
-    VmaAllocationCreateInfo stagingAllocCreateInfo = {};
-    stagingAllocCreateInfo.usage = VMA_MEMORY_USAGE_AUTO;
-    stagingAllocCreateInfo.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT |
-        VMA_ALLOCATION_CREATE_MAPPED_BIT;
-
-    VkBuffer stagingBuf;
-    VmaAllocation stagingAlloc;
-    VmaAllocationInfo stagingAllocInfo;
-    vmaCreateBuffer(allocator, &stagingBufCreateInfo, &stagingAllocCreateInfo,
-        &stagingBuf, &stagingAlloc, &stagingAllocInfo);
-
-    // [Executed in runtime]:
-    memcpy(stagingAllocInfo.pMappedData, myData, myDataSize);
-    //vkCmdPipelineBarrier: VK_ACCESS_HOST_WRITE_BIT --> VK_ACCESS_TRANSFER_READ_BIT
-    VkBufferCopy bufCopy = {
-        0, // srcOffset
-        0, // dstOffset,
-        myDataSize }; // size
-    vkCmdCopyBuffer(cmdBuf, stagingBuf, buf, 1, &bufCopy);
-}
-\endcode
-
-\section usage_patterns_other_use_cases Other use cases
-
-Here are some other, less obvious use cases and their recommended settings:
-
-- An image that is used only as transfer source and destination, but it should stay on the device,
-  as it is used to temporarily store a copy of some texture, e.g. from the current to the next frame,
-  for temporal antialiasing or other temporal effects.
-  - Use `VkImageCreateInfo::usage = VK_IMAGE_USAGE_TRANSFER_SRC_BIT | VK_IMAGE_USAGE_TRANSFER_DST_BIT`
-  - Use VmaAllocationCreateInfo::usage = #VMA_MEMORY_USAGE_AUTO
-- An image that is used only as transfer source and destination, but it should be placed
-  in the system RAM even though it doesn't need to be mapped, because it serves as a "swap" copy to evict
-  least recently used textures from VRAM.
-  - Use `VkImageCreateInfo::usage = VK_IMAGE_USAGE_TRANSFER_SRC_BIT | VK_IMAGE_USAGE_TRANSFER_DST_BIT`
-  - Use VmaAllocationCreateInfo::usage = #VMA_MEMORY_USAGE_AUTO_PREFER_HOST,
-    as VMA needs a hint here to differentiate from the previous case.
-- A buffer that you want to map and write from the CPU, directly read from the GPU
-  (e.g. as a uniform or vertex buffer), but you have a clear preference to place it in device or
-  host memory due to its large size.
-  - Use `VkBufferCreateInfo::usage = VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT`
-  - Use VmaAllocationCreateInfo::usage = #VMA_MEMORY_USAGE_AUTO_PREFER_DEVICE or #VMA_MEMORY_USAGE_AUTO_PREFER_HOST
-  - Use VmaAllocationCreateInfo::flags = #VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT
-
-
-\page configuration Configuration
-
-Please check "CONFIGURATION SECTION" in the code to find macros that you can define
-before each include of this file or change directly in this file to provide
-your own implementation of basic facilities like assert, `min()` and `max()` functions,
-mutex, atomic etc.
-The library uses its own implementation of containers by default, but you can switch to using
-STL containers instead.
-
-For example, define `VMA_ASSERT(expr)` before including the library to provide
-a custom implementation of the assertion, compatible with your project.
-By default it is defined to standard C `assert(expr)` in `_DEBUG` configuration
-and empty otherwise.
-
-\section config_Vulkan_functions Pointers to Vulkan functions
-
-There are multiple ways to import pointers to Vulkan functions in the library.
-In the simplest case you don't need to do anything.
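-
-As a rough sketch of that simplest case (assuming `instance`, `physicalDevice`, and `device`
-are valid handles created elsewhere and the Vulkan functions are linked statically),
-creating the allocator could look like this:
-
-\code
-VmaAllocatorCreateInfo allocatorCreateInfo = {};
-allocatorCreateInfo.vulkanApiVersion = VK_API_VERSION_1_2;
-allocatorCreateInfo.instance = instance;
-allocatorCreateInfo.physicalDevice = physicalDevice;
-allocatorCreateInfo.device = device;
-// No VmaAllocatorCreateInfo::pVulkanFunctions needed here - the statically
-// linked entry points are picked up automatically.
-
-VmaAllocator allocator;
-vmaCreateAllocator(&allocatorCreateInfo, &allocator);
-\endcode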
-If the compilation or linking of your program or the initialization of the #VmaAllocator
-doesn't work for you, you can try to reconfigure it.
-
-First, the allocator tries to fetch pointers to Vulkan functions linked statically,
-like this:
-
-\code
-m_VulkanFunctions.vkAllocateMemory = (PFN_vkAllocateMemory)vkAllocateMemory;
-\endcode
-
-If you want to disable this feature, set the configuration macro: `#define VMA_STATIC_VULKAN_FUNCTIONS 0`.
-
-Second, you can provide the pointers yourself by setting member VmaAllocatorCreateInfo::pVulkanFunctions.
-You can fetch them e.g. using functions `vkGetInstanceProcAddr` and `vkGetDeviceProcAddr` or
-by using a helper library like [volk](https://github.com/zeux/volk).
-
-Third, VMA tries to fetch remaining pointers that are still null by calling
-`vkGetInstanceProcAddr` and `vkGetDeviceProcAddr` on its own.
-You only need to fill in VmaVulkanFunctions::vkGetInstanceProcAddr and VmaVulkanFunctions::vkGetDeviceProcAddr.
-Other pointers will be fetched automatically.
-If you want to disable this feature, set the configuration macro: `#define VMA_DYNAMIC_VULKAN_FUNCTIONS 0`.
-
-Finally, all the function pointers required by the library (considering the selected
-Vulkan version and enabled extensions) are checked with `VMA_ASSERT` to make sure they are not null.
-
-
-\section custom_memory_allocator Custom host memory allocator
-
-If you use a custom allocator for CPU memory rather than the default C++ operators `new`
-and `delete`, you can make this library use your allocator as well
-by filling the optional member VmaAllocatorCreateInfo::pAllocationCallbacks. These
-functions will be passed to Vulkan, as well as used by the library itself to
-make any CPU-side allocations.
-
-\section allocation_callbacks Device memory allocation callbacks
-
-The library makes calls to `vkAllocateMemory()` and `vkFreeMemory()` internally.
-You can set up callbacks to be informed about these calls, e.g. for the purpose
-of gathering some statistics. To do it, fill the optional member
-VmaAllocatorCreateInfo::pDeviceMemoryCallbacks.
-
-\section heap_memory_limit Device heap memory limit
-
-When device memory of a certain heap runs out of free space, new allocations may
-fail (returning an error code) or they may succeed, silently pushing some existing
-memory blocks from GPU VRAM to system RAM (which degrades performance). This
-behavior is implementation-dependent - it depends on the GPU vendor and graphics
-driver.
-
-On AMD cards it can be controlled while creating the Vulkan device object by using the
-VK_AMD_memory_overallocation_behavior extension, if available.
-
-Alternatively, if you want to test how your program behaves with a limited amount of Vulkan device
-memory available without switching your graphics card to one that really has
-smaller VRAM, you can use a feature of this library intended for this purpose.
-To do it, fill the optional member VmaAllocatorCreateInfo::pHeapSizeLimit.
-
-
-
-\page vk_khr_dedicated_allocation VK_KHR_dedicated_allocation
-
-VK_KHR_dedicated_allocation is a Vulkan extension which can be used to improve
-performance on some GPUs. It augments the Vulkan API with the possibility to query
-the driver whether it prefers a particular buffer or image to have its own, dedicated
-allocation (separate `VkDeviceMemory` block) for better efficiency - to be able
-to do some internal optimizations. The extension is supported by this library.
-It will be used automatically when enabled.
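-
-For illustration, the query added by this extension (shown here in its core Vulkan 1.1 form) looks
-roughly like the following on the Vulkan side. VMA performs an equivalent check internally, so you
-do not need to write it yourself; this sketch assumes `device` and `buf` are an existing `VkDevice`
-and `VkBuffer`:
-
-\code
-VkBufferMemoryRequirementsInfo2 bufReqInfo = { VK_STRUCTURE_TYPE_BUFFER_MEMORY_REQUIREMENTS_INFO_2 };
-bufReqInfo.buffer = buf;
-
-VkMemoryDedicatedRequirements dedicatedReqs = { VK_STRUCTURE_TYPE_MEMORY_DEDICATED_REQUIREMENTS };
-VkMemoryRequirements2 memReqs2 = { VK_STRUCTURE_TYPE_MEMORY_REQUIREMENTS_2 };
-memReqs2.pNext = &dedicatedReqs;
-
-vkGetBufferMemoryRequirements2(device, &bufReqInfo, &memReqs2);
-
-if(dedicatedReqs.prefersDedicatedAllocation || dedicatedReqs.requiresDedicatedAllocation)
-{
-    // The driver prefers (or requires) a separate VkDeviceMemory block for this buffer.
-}
-\endcode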
-
-It has been promoted to core Vulkan 1.1, so if you use an eligible Vulkan version
-and inform VMA about it by setting VmaAllocatorCreateInfo::vulkanApiVersion,
-you are all set.
-
-Otherwise, if you want to use it as an extension:
-
-1. When creating the Vulkan device, check if the following 2 device extensions are
-supported (call `vkEnumerateDeviceExtensionProperties()`).
-If yes, enable them (fill `VkDeviceCreateInfo::ppEnabledExtensionNames`).
-
-- VK_KHR_get_memory_requirements2
-- VK_KHR_dedicated_allocation
-
-If you enabled these extensions:
-
-2. Use the #VMA_ALLOCATOR_CREATE_KHR_DEDICATED_ALLOCATION_BIT flag when creating
-your #VmaAllocator to inform the library that you enabled the required extensions
-and you want the library to use them.
-
-\code
-allocatorInfo.flags |= VMA_ALLOCATOR_CREATE_KHR_DEDICATED_ALLOCATION_BIT;
-
-vmaCreateAllocator(&allocatorInfo, &allocator);
-\endcode
-
-That is all. The extension will be automatically used whenever you create a
-buffer using vmaCreateBuffer() or an image using vmaCreateImage().
-
-When using the extension together with the Vulkan Validation Layer, you will receive
-warnings like this:
-
-_vkBindBufferMemory(): Binding memory to buffer 0x33 but vkGetBufferMemoryRequirements() has not been called on that buffer._
-
-This is OK - just ignore it. It happens because you use the function
-`vkGetBufferMemoryRequirements2KHR()` instead of the standard
-`vkGetBufferMemoryRequirements()`, while the validation layer seems to be
-unaware of it.
-
-To learn more about this extension, see:
-
-- [VK_KHR_dedicated_allocation in Vulkan specification](https://www.khronos.org/registry/vulkan/specs/1.2-extensions/html/chap50.html#VK_KHR_dedicated_allocation)
-- [VK_KHR_dedicated_allocation unofficial manual](http://asawicki.info/articles/VK_KHR_dedicated_allocation.php5)
-
-
-
-\page vk_ext_memory_priority VK_EXT_memory_priority
-
-VK_EXT_memory_priority is a device extension that allows passing an additional "priority"
-value to Vulkan memory allocations, which the implementation may use to prefer certain
-buffers and images that are critical for performance to stay in device-local memory
-in cases when the memory is over-subscribed, while some others may be moved to system memory.
-
-VMA offers convenient usage of this extension.
-If you enable it, you can pass a "priority" parameter when creating allocations or custom pools
-and the library automatically passes the value to Vulkan using this extension.
-
-If you want to use this extension in connection with VMA, follow these steps:
-
-\section vk_ext_memory_priority_initialization Initialization
-
-1) Call `vkEnumerateDeviceExtensionProperties` for the physical device.
-Check if the extension is supported - check whether the returned array of `VkExtensionProperties` contains "VK_EXT_memory_priority".
-
-2) Call `vkGetPhysicalDeviceFeatures2` for the physical device instead of the old `vkGetPhysicalDeviceFeatures`.
-Attach the additional structure `VkPhysicalDeviceMemoryPriorityFeaturesEXT` to `VkPhysicalDeviceFeatures2::pNext` to be returned.
-Check if the device feature is really supported - check if `VkPhysicalDeviceMemoryPriorityFeaturesEXT::memoryPriority` is true.
-
-3) While creating the device with `vkCreateDevice`, enable this extension - add "VK_EXT_memory_priority"
-to the list passed as `VkDeviceCreateInfo::ppEnabledExtensionNames`.
-
-4) While creating the device, also don't set `VkDeviceCreateInfo::pEnabledFeatures`.
-Fill in the `VkPhysicalDeviceFeatures2` structure instead and pass it as `VkDeviceCreateInfo::pNext`.
-Enable this device feature - attach the additional structure `VkPhysicalDeviceMemoryPriorityFeaturesEXT` to the
-`VkPhysicalDeviceFeatures2::pNext` chain and set its member `memoryPriority` to `VK_TRUE`.
-
-5) While creating the #VmaAllocator with vmaCreateAllocator(), inform VMA that you
-have enabled this extension and feature - add #VMA_ALLOCATOR_CREATE_EXT_MEMORY_PRIORITY_BIT
-to VmaAllocatorCreateInfo::flags.
-
-\section vk_ext_memory_priority_usage Usage
-
-When using this extension, you should initialize the following members:
-
-- VmaAllocationCreateInfo::priority when creating a dedicated allocation with #VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT.
-- VmaPoolCreateInfo::priority when creating a custom pool.
-
-It should be a floating-point value between `0.0f` and `1.0f`, where the recommended default is `0.5f`.
-Memory allocated with a higher value can be treated by the Vulkan implementation as higher priority
-and so it can have lower chances of being pushed out to system memory, experiencing degraded performance.
-
-It might be a good idea to create performance-critical resources like color-attachment or depth-stencil images
-as dedicated and set a high priority for them. For example:
-
-\code
-VkImageCreateInfo imgCreateInfo = { VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO };
-imgCreateInfo.imageType = VK_IMAGE_TYPE_2D;
-imgCreateInfo.extent.width = 3840;
-imgCreateInfo.extent.height = 2160;
-imgCreateInfo.extent.depth = 1;
-imgCreateInfo.mipLevels = 1;
-imgCreateInfo.arrayLayers = 1;
-imgCreateInfo.format = VK_FORMAT_R8G8B8A8_UNORM;
-imgCreateInfo.tiling = VK_IMAGE_TILING_OPTIMAL;
-imgCreateInfo.initialLayout = VK_IMAGE_LAYOUT_UNDEFINED;
-imgCreateInfo.usage = VK_IMAGE_USAGE_SAMPLED_BIT | VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT;
-imgCreateInfo.samples = VK_SAMPLE_COUNT_1_BIT;
-
-VmaAllocationCreateInfo allocCreateInfo = {};
-allocCreateInfo.usage = VMA_MEMORY_USAGE_AUTO;
-allocCreateInfo.flags = VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT;
-allocCreateInfo.priority = 1.0f;
-
-VkImage img;
-VmaAllocation alloc;
-vmaCreateImage(allocator, &imgCreateInfo, &allocCreateInfo, &img, &alloc, nullptr);
-\endcode
-
-The `priority` member is ignored in the following situations:
-
-- Allocations created in custom pools: They inherit the priority, along with all other allocation parameters,
-  from the parameters passed in #VmaPoolCreateInfo when the pool was created.
-- Allocations created in default pools: They inherit the priority from the parameters
-  VMA used when creating default pools, which means `priority == 0.5f`.
-
-
-\page vk_amd_device_coherent_memory VK_AMD_device_coherent_memory
-
-VK_AMD_device_coherent_memory is a device extension that enables access to
-additional memory types with the `VK_MEMORY_PROPERTY_DEVICE_COHERENT_BIT_AMD` and
-`VK_MEMORY_PROPERTY_DEVICE_UNCACHED_BIT_AMD` flags. It is useful mostly for
-allocation of buffers intended for writing "breadcrumb markers" in between passes
-or draw calls, which in turn are useful for debugging GPU crash/hang/TDR cases.
-
-When the extension is available but has not been enabled, the Vulkan physical device
-still exposes those memory types, but their usage is forbidden. VMA automatically
-takes care of that - it returns `VK_ERROR_FEATURE_NOT_PRESENT` when an attempt
-to allocate memory of such a type is made.
-
-If you want to use this extension in connection with VMA, follow these steps:
-
-\section vk_amd_device_coherent_memory_initialization Initialization
-
-1) Call `vkEnumerateDeviceExtensionProperties` for the physical device.
-Check if the extension is supported - check whether the returned array of `VkExtensionProperties` contains "VK_AMD_device_coherent_memory".
-
-2) Call `vkGetPhysicalDeviceFeatures2` for the physical device instead of the old `vkGetPhysicalDeviceFeatures`.
-Attach the additional structure `VkPhysicalDeviceCoherentMemoryFeaturesAMD` to `VkPhysicalDeviceFeatures2::pNext` to be returned.
-Check if the device feature is really supported - check if `VkPhysicalDeviceCoherentMemoryFeaturesAMD::deviceCoherentMemory` is true.
-
-3) While creating the device with `vkCreateDevice`, enable this extension - add "VK_AMD_device_coherent_memory"
-to the list passed as `VkDeviceCreateInfo::ppEnabledExtensionNames`.
-
-4) While creating the device, also don't set `VkDeviceCreateInfo::pEnabledFeatures`.
-Fill in the `VkPhysicalDeviceFeatures2` structure instead and pass it as `VkDeviceCreateInfo::pNext`.
-Enable this device feature - attach the additional structure `VkPhysicalDeviceCoherentMemoryFeaturesAMD` to
-`VkPhysicalDeviceFeatures2::pNext` and set its member `deviceCoherentMemory` to `VK_TRUE`.
-
-5) While creating the #VmaAllocator with vmaCreateAllocator(), inform VMA that you
-have enabled this extension and feature - add #VMA_ALLOCATOR_CREATE_AMD_DEVICE_COHERENT_MEMORY_BIT
-to VmaAllocatorCreateInfo::flags.
-
-\section vk_amd_device_coherent_memory_usage Usage
-
-After following the steps described above, you can create VMA allocations and custom pools
-out of the special `DEVICE_COHERENT` and `DEVICE_UNCACHED` memory types on eligible
-devices. There are multiple ways to do it, for example:
-
-- You can request or prefer to allocate out of such memory types by adding
-  `VK_MEMORY_PROPERTY_DEVICE_UNCACHED_BIT_AMD` to VmaAllocationCreateInfo::requiredFlags
-  or VmaAllocationCreateInfo::preferredFlags. Those flags can be freely mixed with
-  other ways of \ref choosing_memory_type, like setting VmaAllocationCreateInfo::usage.
-- If you manually found the memory type index to use for this purpose, force allocation
-  from this specific index by setting VmaAllocationCreateInfo::memoryTypeBits `= 1u << index`.
-
-\section vk_amd_device_coherent_memory_more_information More information
-
-To learn more about this extension, see [VK_AMD_device_coherent_memory in Vulkan specification](https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VK_AMD_device_coherent_memory.html)
-
-Example use of this extension can be found in the code of the sample and test suite
-accompanying this library.
-
-
-\page enabling_buffer_device_address Enabling buffer device address
-
-The device extension VK_KHR_buffer_device_address
-allows fetching a raw GPU pointer to a buffer and passing it for use in shader code.
-It has been promoted to core Vulkan 1.2.
-
-If you want to use this feature in connection with VMA, follow these steps:
-
-\section enabling_buffer_device_address_initialization Initialization
-
-1) (For Vulkan version < 1.2) Call `vkEnumerateDeviceExtensionProperties` for the physical device.
-Check if the extension is supported - check whether the returned array of `VkExtensionProperties` contains
-"VK_KHR_buffer_device_address".
-
-2) Call `vkGetPhysicalDeviceFeatures2` for the physical device instead of the old `vkGetPhysicalDeviceFeatures`.
-Attach the additional structure `VkPhysicalDeviceBufferDeviceAddressFeatures*` to `VkPhysicalDeviceFeatures2::pNext` to be returned.
-Check if the device feature is really supported - check if `VkPhysicalDeviceBufferDeviceAddressFeatures::bufferDeviceAddress` is true.
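-
-A sketch of this feature query (assuming `physicalDevice` is the `VkPhysicalDevice` in question;
-on Vulkan versions before 1.2 the `*KHR` variants of the structure and entry points provided by
-the extension would be used instead):
-
-\code
-VkPhysicalDeviceBufferDeviceAddressFeatures bdaFeatures =
-    { VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_BUFFER_DEVICE_ADDRESS_FEATURES };
-
-VkPhysicalDeviceFeatures2 features2 = { VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FEATURES_2 };
-features2.pNext = &bdaFeatures;
-
-vkGetPhysicalDeviceFeatures2(physicalDevice, &features2);
-
-if(bdaFeatures.bufferDeviceAddress == VK_TRUE)
-{
-    // The feature is supported and can be enabled at device creation.
-}
-\endcode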
- -3) (For Vulkan version < 1.2) While creating device with `vkCreateDevice`, enable this extension - add -"VK_KHR_buffer_device_address" to the list passed as `VkDeviceCreateInfo::ppEnabledExtensionNames`. - -4) While creating the device, also don't set `VkDeviceCreateInfo::pEnabledFeatures`. -Fill in `VkPhysicalDeviceFeatures2` structure instead and pass it as `VkDeviceCreateInfo::pNext`. -Enable this device feature - attach additional structure `VkPhysicalDeviceBufferDeviceAddressFeatures*` to -`VkPhysicalDeviceFeatures2::pNext` and set its member `bufferDeviceAddress` to `VK_TRUE`. - -5) While creating #VmaAllocator with vmaCreateAllocator() inform VMA that you -have enabled this feature - add #VMA_ALLOCATOR_CREATE_BUFFER_DEVICE_ADDRESS_BIT -to VmaAllocatorCreateInfo::flags. - -\section enabling_buffer_device_address_usage Usage - -After following steps described above, you can create buffers with `VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT*` using VMA. -The library automatically adds `VK_MEMORY_ALLOCATE_DEVICE_ADDRESS_BIT*` to -allocated memory blocks wherever it might be needed. - -Please note that the library supports only `VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT*`. -The second part of this functionality related to "capture and replay" is not supported, -as it is intended for usage in debugging tools like RenderDoc, not in everyday Vulkan usage. - -\section enabling_buffer_device_address_more_information More information - -To learn more about this extension, see [VK_KHR_buffer_device_address in Vulkan specification](https://www.khronos.org/registry/vulkan/specs/1.2-extensions/html/chap46.html#VK_KHR_buffer_device_address) - -Example use of this extension can be found in the code of the sample and test suite -accompanying this library. - -\page general_considerations General considerations - -\section general_considerations_thread_safety Thread safety - -- The library has no global state, so separate #VmaAllocator objects can be used - independently. - There should be no need to create multiple such objects though - one per `VkDevice` is enough. -- By default, all calls to functions that take #VmaAllocator as first parameter - are safe to call from multiple threads simultaneously because they are - synchronized internally when needed. - This includes allocation and deallocation from default memory pool, as well as custom #VmaPool. -- When the allocator is created with #VMA_ALLOCATOR_CREATE_EXTERNALLY_SYNCHRONIZED_BIT - flag, calls to functions that take such #VmaAllocator object must be - synchronized externally. -- Access to a #VmaAllocation object must be externally synchronized. For example, - you must not call vmaGetAllocationInfo() and vmaMapMemory() from different - threads at the same time if you pass the same #VmaAllocation object to these - functions. -- #VmaVirtualBlock is not safe to be used from multiple threads simultaneously. - -\section general_considerations_versioning_and_compatibility Versioning and compatibility - -The library uses [**Semantic Versioning**](https://semver.org/), -which means version numbers follow convention: Major.Minor.Patch (e.g. 2.3.0), where: - -- Incremented Patch version means a release is backward- and forward-compatible, - introducing only some internal improvements, bug fixes, optimizations etc. - or changes that are out of scope of the official API described in this documentation. 
-- Incremented Minor version means a release is backward-compatible,
-  so existing code that uses the library should continue to work, while some new
-  symbols could have been added: new structures, functions, new values in existing
-  enums and bit flags, new structure members, but not new function parameters.
-- Incrementing Major version means a release could break some backward compatibility.
-
-All changes between official releases are documented in the file "CHANGELOG.md".
-
-\warning Backward compatibility is considered on the level of C++ source code, not binary linkage.
-Adding new members to existing structures is treated as backward compatible if initializing
-the new members to binary zero results in the old behavior.
-You should always fully initialize all library structures to zeros and not rely on their
-exact binary size.
-
-\section general_considerations_validation_layer_warnings Validation layer warnings
-
-When using this library, you may encounter the following types of warnings issued by
-the Vulkan validation layer. They don't necessarily indicate a bug, so you may need
-to just ignore them.
-
-- *vkBindBufferMemory(): Binding memory to buffer 0xeb8e4 but vkGetBufferMemoryRequirements() has not been called on that buffer.*
-  - It happens when the VK_KHR_dedicated_allocation extension is enabled.
-    The `vkGetBufferMemoryRequirements2KHR` function is used instead, while the validation layer seems to be unaware of it.
-- *Mapping an image with layout VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL can result in undefined behavior if this memory is used by the device. Only GENERAL or PREINITIALIZED should be used.*
-  - It happens when you map a buffer or image, because the library maps the entire
-    `VkDeviceMemory` block, where different types of images and buffers may end
-    up together, especially on GPUs with unified memory like Intel.
-- *Non-linear image 0xebc91 is aliased with linear buffer 0xeb8e4 which may indicate a bug.*
-  - It may happen when you use [defragmentation](@ref defragmentation).
-
-\section general_considerations_allocation_algorithm Allocation algorithm
-
-The library uses the following algorithm for allocation, in order:
-
--# Try to find a free range of memory in existing blocks.
--# If failed, try to create a new block of `VkDeviceMemory`, with the preferred block size.
--# If failed, try to create such a block with size / 2, size / 4, size / 8.
--# If failed, try to allocate separate `VkDeviceMemory` for this allocation,
-   just like when you use #VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT.
--# If failed, choose another memory type that meets the requirements specified in
-   VmaAllocationCreateInfo and go to point 1.
--# If failed, return `VK_ERROR_OUT_OF_DEVICE_MEMORY`.
-
-\section general_considerations_features_not_supported Features not supported
-
-Features deliberately excluded from the scope of this library:
-
--# **Data transfer.** Uploading (streaming) and downloading data of buffers and images
-   between CPU and GPU memory and related synchronization is the responsibility of the user.
-   Defining some "texture" object that would automatically stream its data from a
-   staging copy in CPU memory to GPU memory would rather be a feature of another,
-   higher-level library implemented on top of VMA.
-   VMA doesn't record any commands to a `VkCommandBuffer`. It just allocates memory.
--# **Recreation of buffers and images.** Although the library has functions for
-   buffer and image creation: vmaCreateBuffer(), vmaCreateImage(), you need to
-   recreate these objects yourself after defragmentation.
That is because the big - structures `VkBufferCreateInfo`, `VkImageCreateInfo` are not stored in - #VmaAllocation object. --# **Handling CPU memory allocation failures.** When dynamically creating small C++ - objects in CPU memory (not Vulkan memory), allocation failures are not checked - and handled gracefully, because that would complicate code significantly and - is usually not needed in desktop PC applications anyway. - Success of an allocation is just checked with an assert. --# **Code free of any compiler warnings.** Maintaining the library to compile and - work correctly on so many different platforms is hard enough. Being free of - any warnings, on any version of any compiler, is simply not feasible. - There are many preprocessor macros that make some variables unused, function parameters unreferenced, - or conditional expressions constant in some configurations. - The code of this library should not be bigger or more complicated just to silence these warnings. - It is recommended to disable such warnings instead. --# This is a C++ library with C interface. **Bindings or ports to any other programming languages** are welcome as external projects but - are not going to be included into this repository. -*/ diff --git a/aten/src/ATen/native/vulkan/glsl/add.glsl b/aten/src/ATen/native/vulkan/glsl/add.glsl index 68864dd45d9c1..95e63f3a25afc 100644 --- a/aten/src/ATen/native/vulkan/glsl/add.glsl +++ b/aten/src/ATen/native/vulkan/glsl/add.glsl @@ -12,7 +12,7 @@ layout(set = 0, binding = 2) uniform PRECISION sample layout(set = 0, binding = 3) uniform PRECISION restrict Block { ivec4 size; ivec4 isize0; - ivec3 isize1; + ivec4 isize1; float alpha; } uBlock; @@ -24,9 +24,12 @@ void main() { if (all(lessThan(pos, uBlock.size.xyz))) { const ivec3 input0_pos = pos % uBlock.isize0.xyz; const ivec3 input1_pos = pos % uBlock.isize1.xyz; - imageStore( - uOutput, - pos, - texelFetch(uInput0, input0_pos, 0) + uBlock.alpha * texelFetch(uInput1, input1_pos, 0)); + const vec4 v0 = uBlock.isize0.w == 1 + ? texelFetch(uInput0, input0_pos, 0).xxxx + : texelFetch(uInput0, input0_pos, 0); + const vec4 v1 = uBlock.isize1.w == 1 + ? texelFetch(uInput1, input1_pos, 0).xxxx + : texelFetch(uInput1, input1_pos, 0); + imageStore(uOutput, pos, v0 + uBlock.alpha * v1); } } diff --git a/aten/src/ATen/native/vulkan/glsl/add_.glsl b/aten/src/ATen/native/vulkan/glsl/add_.glsl index d25d3bdcf85e4..1fe72bb7a878a 100644 --- a/aten/src/ATen/native/vulkan/glsl/add_.glsl +++ b/aten/src/ATen/native/vulkan/glsl/add_.glsl @@ -7,10 +7,10 @@ layout(std430) buffer; /* Qualifiers: layout - storage - precision - memory */ layout(set = 0, binding = 0, FORMAT) uniform PRECISION restrict image3D uOutput; -layout(set = 0, binding = 1) uniform PRECISION sampler3D uInput0; +layout(set = 0, binding = 1) uniform PRECISION sampler3D uInput; layout(set = 0, binding = 2) uniform PRECISION restrict Block { ivec4 size; - ivec3 isize; + ivec4 isize; float alpha; } uBlock; @@ -21,9 +21,12 @@ void main() { if (all(lessThan(pos, uBlock.size.xyz))) { const ivec3 input_pos = pos % uBlock.isize.xyz; + const vec4 v = uBlock.isize.w == 1 + ? 
texelFetch(uInput, input_pos, 0).xxxx + : texelFetch(uInput, input_pos, 0); imageStore( uOutput, pos, - imageLoad(uOutput, pos) + uBlock.alpha * texelFetch(uInput0, input_pos, 0)); + imageLoad(uOutput, pos) + uBlock.alpha * v); } } diff --git a/aten/src/ATen/native/vulkan/glsl/div.glsl b/aten/src/ATen/native/vulkan/glsl/div.glsl index 43c6e942a3e15..e7ba6de74ca8c 100644 --- a/aten/src/ATen/native/vulkan/glsl/div.glsl +++ b/aten/src/ATen/native/vulkan/glsl/div.glsl @@ -12,7 +12,7 @@ layout(set = 0, binding = 2) uniform PRECISION sample layout(set = 0, binding = 3) uniform PRECISION restrict Block { ivec4 size; ivec4 isize0; - ivec3 isize1; + ivec4 isize1; float alpha; } uBlock; @@ -24,9 +24,12 @@ void main() { if (all(lessThan(pos, uBlock.size.xyz))) { const ivec3 input0_pos = pos % uBlock.isize0.xyz; const ivec3 input1_pos = pos % uBlock.isize1.xyz; - imageStore( - uOutput, - pos, - texelFetch(uInput0, input0_pos, 0) / texelFetch(uInput1, input1_pos, 0)); + const vec4 v0 = uBlock.isize0.w == 1 + ? texelFetch(uInput0, input0_pos, 0).xxxx + : texelFetch(uInput0, input0_pos, 0); + const vec4 v1 = uBlock.isize1.w == 1 + ? texelFetch(uInput1, input1_pos, 0).xxxx + : texelFetch(uInput1, input1_pos, 0); + imageStore(uOutput, pos, v0 / v1); } } diff --git a/aten/src/ATen/native/vulkan/glsl/div_.glsl b/aten/src/ATen/native/vulkan/glsl/div_.glsl index 90bfbad3cfdf2..56065a3839312 100644 --- a/aten/src/ATen/native/vulkan/glsl/div_.glsl +++ b/aten/src/ATen/native/vulkan/glsl/div_.glsl @@ -7,10 +7,10 @@ layout(std430) buffer; /* Qualifiers: layout - storage - precision - memory */ layout(set = 0, binding = 0, FORMAT) uniform PRECISION restrict image3D uOutput; -layout(set = 0, binding = 1) uniform PRECISION sampler3D uInput0; +layout(set = 0, binding = 1) uniform PRECISION sampler3D uInput; layout(set = 0, binding = 2) uniform PRECISION restrict Block { ivec4 size; - ivec3 isize; + ivec4 isize; float alpha; } uBlock; @@ -21,9 +21,12 @@ void main() { if (all(lessThan(pos, uBlock.size.xyz))) { const ivec3 input_pos = pos % uBlock.isize.xyz; + const vec4 v = uBlock.isize.w == 1 + ? texelFetch(uInput, input_pos, 0).xxxx + : texelFetch(uInput, input_pos, 0); imageStore( uOutput, pos, - imageLoad(uOutput, pos) / texelFetch(uInput0, input_pos, 0)); + imageLoad(uOutput, pos) / v); } } diff --git a/aten/src/ATen/native/vulkan/glsl/mul.glsl b/aten/src/ATen/native/vulkan/glsl/mul.glsl index 43fcd444d11cd..c5aa2c7f522a2 100644 --- a/aten/src/ATen/native/vulkan/glsl/mul.glsl +++ b/aten/src/ATen/native/vulkan/glsl/mul.glsl @@ -12,7 +12,7 @@ layout(set = 0, binding = 2) uniform PRECISION sample layout(set = 0, binding = 3) uniform PRECISION restrict Block { ivec4 size; ivec4 isize0; - ivec3 isize1; + ivec4 isize1; float alpha; } uBlock; @@ -24,9 +24,12 @@ void main() { if (all(lessThan(pos, uBlock.size.xyz))) { const ivec3 input0_pos = pos % uBlock.isize0.xyz; const ivec3 input1_pos = pos % uBlock.isize1.xyz; - imageStore( - uOutput, - pos, - texelFetch(uInput0, input0_pos, 0) * texelFetch(uInput1, input1_pos, 0)); + const vec4 v0 = uBlock.isize0.w == 1 + ? texelFetch(uInput0, input0_pos, 0).xxxx + : texelFetch(uInput0, input0_pos, 0); + const vec4 v1 = uBlock.isize1.w == 1 + ? 
texelFetch(uInput1, input1_pos, 0).xxxx + : texelFetch(uInput1, input1_pos, 0); + imageStore(uOutput, pos, v0 * v1); } } diff --git a/aten/src/ATen/native/vulkan/glsl/mul_.glsl b/aten/src/ATen/native/vulkan/glsl/mul_.glsl index af23e678d9871..6487c6c52760d 100644 --- a/aten/src/ATen/native/vulkan/glsl/mul_.glsl +++ b/aten/src/ATen/native/vulkan/glsl/mul_.glsl @@ -7,10 +7,10 @@ layout(std430) buffer; /* Qualifiers: layout - storage - precision - memory */ layout(set = 0, binding = 0, FORMAT) uniform PRECISION restrict image3D uOutput; -layout(set = 0, binding = 1) uniform PRECISION sampler3D uInput0; +layout(set = 0, binding = 1) uniform PRECISION sampler3D uInput; layout(set = 0, binding = 2) uniform PRECISION restrict Block { ivec4 size; - ivec3 isize; + ivec4 isize; float alpha; } uBlock; @@ -21,9 +21,12 @@ void main() { if (all(lessThan(pos, uBlock.size.xyz))) { const ivec3 input_pos = pos % uBlock.isize.xyz; + const vec4 v = uBlock.isize.w == 1 + ? texelFetch(uInput, input_pos, 0).xxxx + : texelFetch(uInput, input_pos, 0); imageStore( uOutput, pos, - imageLoad(uOutput, pos) * texelFetch(uInput0, input_pos, 0)); + imageLoad(uOutput, pos) * v); } } diff --git a/aten/src/ATen/native/vulkan/glsl/sub.glsl b/aten/src/ATen/native/vulkan/glsl/sub.glsl index 28ce580abfcd5..9dc89551ea957 100644 --- a/aten/src/ATen/native/vulkan/glsl/sub.glsl +++ b/aten/src/ATen/native/vulkan/glsl/sub.glsl @@ -12,7 +12,7 @@ layout(set = 0, binding = 2) uniform PRECISION sample layout(set = 0, binding = 3) uniform PRECISION restrict Block { ivec4 size; ivec4 isize0; - ivec3 isize1; + ivec4 isize1; float alpha; } uBlock; @@ -24,9 +24,12 @@ void main() { if (all(lessThan(pos, uBlock.size.xyz))) { const ivec3 input0_pos = pos % uBlock.isize0.xyz; const ivec3 input1_pos = pos % uBlock.isize1.xyz; - imageStore( - uOutput, - pos, - texelFetch(uInput0, input0_pos, 0) - uBlock.alpha * texelFetch(uInput1, input1_pos, 0)); + const vec4 v0 = uBlock.isize0.w == 1 + ? texelFetch(uInput0, input0_pos, 0).xxxx + : texelFetch(uInput0, input0_pos, 0); + const vec4 v1 = uBlock.isize1.w == 1 + ? texelFetch(uInput1, input1_pos, 0).xxxx + : texelFetch(uInput1, input1_pos, 0); + imageStore(uOutput, pos, v0 - uBlock.alpha * v1); } } diff --git a/aten/src/ATen/native/vulkan/glsl/sub_.glsl b/aten/src/ATen/native/vulkan/glsl/sub_.glsl index 6baaaf0a4238c..a68e6f9dc2286 100644 --- a/aten/src/ATen/native/vulkan/glsl/sub_.glsl +++ b/aten/src/ATen/native/vulkan/glsl/sub_.glsl @@ -7,10 +7,10 @@ layout(std430) buffer; /* Qualifiers: layout - storage - precision - memory */ layout(set = 0, binding = 0, FORMAT) uniform PRECISION restrict image3D uOutput; -layout(set = 0, binding = 1) uniform PRECISION sampler3D uInput0; +layout(set = 0, binding = 1) uniform PRECISION sampler3D uInput; layout(set = 0, binding = 2) uniform PRECISION restrict Block { ivec4 size; - ivec3 isize; + ivec4 isize; float alpha; } uBlock; @@ -21,9 +21,12 @@ void main() { if (all(lessThan(pos, uBlock.size.xyz))) { const ivec3 input_pos = pos % uBlock.isize.xyz; + const vec4 v = uBlock.isize.w == 1 + ? 
texelFetch(uInput, input_pos, 0).xxxx + : texelFetch(uInput, input_pos, 0); imageStore( uOutput, pos, - imageLoad(uOutput, pos) - uBlock.alpha * texelFetch(uInput0, input_pos, 0)); + imageLoad(uOutput, pos) - uBlock.alpha * v); } } diff --git a/aten/src/ATen/native/vulkan/ops/Arithmetic.cpp b/aten/src/ATen/native/vulkan/ops/Arithmetic.cpp index 1551bd49b7766..65fc758b065d7 100644 --- a/aten/src/ATen/native/vulkan/ops/Arithmetic.cpp +++ b/aten/src/ATen/native/vulkan/ops/Arithmetic.cpp @@ -10,63 +10,84 @@ namespace vulkan { namespace ops { namespace { -bool broadcast_input(const Tensor& input1, const Tensor& input2) { - return ((height_size(input1) > 1 && height_size(input2) == 1) || - (height_size(input2) > 1 && height_size(input1) == 1) || - (height_size(input1) == height_size(input2))) && - ((width_size(input1) > 1 && width_size(input2) == 1) || - (width_size(input2) > 1 && width_size(input1) == 1) || - (width_size(input1) == width_size(input2))); -} - void check_inputs(const Tensor& input1, const Tensor& input2) { - TORCH_CHECK( - channels_size(input1) == channels_size(input2), - "Vulkan binary elementwise ops require channel dimension to be equal!"); - if (batch_size(input1) != batch_size(input2)) { + const std::string broadcast_error_msg = + "Incompatible dimensions for broadcasting for binary elementwise op!"; + if (get_dim(input1) != get_dim(input2)) { TORCH_CHECK( - channels_size(input1) % 4 == 0, - "Vulkan binary elementwise ops require channel to be a multiple of 4 to broadcast along batch dimension!") + get_dim(input1) == 1 || get_dim(input2), + broadcast_error_msg); + TORCH_CHECK( + (get_dim(input1) == get_dim(input2) && + get_dim(input1) % 4 == 0) || + get_dim(input1) * get_dim(input1) == + 1 || + get_dim(input2) * get_dim(input2) == + 1, + "Invalid broadcasting for Vulkan binary elementwise op! " + "If batch dimensions aren't equal, then channel dimensions must be " + "equal and multiple of 4 or one of the inputs must have " + "channel and batch dimensions both equal to 1!"); + } + if (get_dim(input1) != get_dim(input2)) { + TORCH_CHECK( + get_dim(input1) == 1 || get_dim(input2), + broadcast_error_msg); + TORCH_CHECK( + get_dim(input1) * get_dim(input1) == 1 || + get_dim(input2) * get_dim(input2) == + 1, + "Invalid broadcasting for Vulkan binary elementwise op! 
" + "If channel dimensions aren't equal, then one of the inputs must have " + "channel and batch dimensions both equal to 1!"); + } + if (get_dim(input1) != get_dim(input2)) { + TORCH_CHECK( + get_dim(input1) == 1 || get_dim(input2), + broadcast_error_msg); + } + if (get_dim(input1) != get_dim(input2)) { + TORCH_CHECK( + get_dim(input1) == 1 || get_dim(input2), + broadcast_error_msg); } - - const std::string broadcast_error_msg = - "Incompatible input dimensions for broadcasting for Vulkan binary elementwise op!"; - - TORCH_CHECK(broadcast_input(input1, input2), broadcast_error_msg); } -std::vector broadcast_size( - const Tensor& input1, - const Tensor& input2) { - std::vector out = {}; - int input1_size = input1.sizes().size(); - int input2_size = input2.sizes().size(); - if (input1_size > input2_size) { - for (int i = 0; i < input1_size; i++) { - out.push_back(input1.sizes()[i]); +std::vector broadcast_size(const Tensor& t1, const Tensor& t2) { + int64_t t1_size = t1.dim(); + int64_t t2_size = t2.dim(); + + std::vector out; + if (t1_size > t2_size) { + for (int64_t i = 0; i < t1_size; i++) { + out.push_back(t1.sizes()[i]); } } else { - for (int i = 0; i < input2_size; i++) { - out.push_back(input2.sizes()[i]); + for (int64_t i = 0; i < t2_size; i++) { + out.push_back(t2.sizes()[i]); } } - if (width_size(input1) > 1 && width_size(input2) == 1) { - out[out.size() - 1] = width_size(input1); - } else if (width_size(input2) > 1 && width_size(input1) == 1) { - out[out.size() - 1] = width_size(input2); + if (out.size() > 0) { + out[out.size() - 1] = + std::max(get_dim(t1), get_dim(t2)); } - if (out.size() > 1) { - if (height_size(input1) > 1 && height_size(input2) == 1) { - out[out.size() - 2] = height_size(input1); - } else if (height_size(input2) > 1 && height_size(input1) == 1) { - out[out.size() - 2] = height_size(input2); - } + out[out.size() - 2] = + std::max(get_dim(t1), get_dim(t2)); + } + if (out.size() > 2) { + out[out.size() - 3] = + std::max(get_dim(t1), get_dim(t2)); + } + if (out.size() > 3) { + out[out.size() - 4] = + std::max(get_dim(t1), get_dim(t2)); } return out; } + } // namespace using namespace api::utils; @@ -195,15 +216,17 @@ Tensor arithmetic_tensor( uvec3 extents; uint32_t fill_0; uvec3 input1_extents; - uint32_t fill_1; + uint32_t channel_batch_size_1; uvec3 input2_extents; + uint32_t channel_batch_size_2; float alpha; } block{ v_output.extents(), 0u, v_self.extents(), - 0u, + get_dim(self) * get_dim(self), v_other.extents(), + get_dim(other) * get_dim(other), alpha, }; @@ -326,6 +349,16 @@ Tensor& arithmetic_tensor_( const Tensor& other_arg, const c10::optional& alpha_arg, const api::ShaderSource& shader_descriptor) { + TORCH_CHECK( + get_dim(self_arg) >= get_dim(other_arg) && + get_dim(self_arg) >= + get_dim(other_arg) && + get_dim(self_arg) >= + get_dim(other_arg) && + get_dim(self_arg) >= get_dim(other_arg), + "Dimensions of input tensor to Vulkan in-place binary elementwise op " + "must be less than or equal the dimensions of the underlying tensor."); + check_inputs(self_arg, other_arg); TORCH_CHECK( @@ -344,11 +377,13 @@ Tensor& arithmetic_tensor_( uvec3 extents; uint32_t fill_0; uvec3 input_extents; + uint32_t channel_batch_size_other; float alpha; } block{ v_self.extents(), 0u, v_other.extents(), + get_dim(other) * get_dim(other), alpha, }; @@ -431,13 +466,6 @@ Tensor add_tensor( const Tensor& self_arg, const Tensor& other_arg, const Scalar& alpha) { - if (other_arg.sizes().size() == 0) { - return arithmetic_scalar( - self_arg, - other_arg.item(), - 
c10::optional(alpha.to()), - VK_KERNEL(add_scalar)); - } return arithmetic_tensor( self_arg, other_arg, c10::optional(alpha), VK_KERNEL(add)); } @@ -473,13 +501,6 @@ Tensor sub_tensor( const Tensor& self_arg, const Tensor& other_arg, const Scalar& alpha) { - if (other_arg.sizes().size() == 0) { - return arithmetic_scalar( - self_arg, - other_arg.item(), - c10::optional(-1 * alpha.to()), - VK_KERNEL(add_scalar)); - } return arithmetic_tensor( self_arg, other_arg, c10::optional(alpha), VK_KERNEL(sub)); } @@ -503,13 +524,6 @@ Tensor& mul_scalar_(Tensor& self, const Scalar& other) { } Tensor mul_tensor(const Tensor& self_arg, const Tensor& other_arg) { - if (other_arg.sizes().size() == 0) { - return arithmetic_scalar( - self_arg, - other_arg.item(), - c10::optional(), - VK_KERNEL(mul_scalar)); - } return arithmetic_tensor( self_arg, other_arg, c10::optional(), VK_KERNEL(mul)); } @@ -536,13 +550,6 @@ Tensor& div_scalar_(Tensor& self, const Scalar& other) { } Tensor div_tensor(const Tensor& self_arg, const Tensor& other_arg) { - if (other_arg.sizes().size() == 0) { - return arithmetic_scalar( - self_arg, - 1.0 / other_arg.item(), - c10::optional(), - VK_KERNEL(mul_scalar)); - } return arithmetic_tensor( self_arg, other_arg, c10::optional(), VK_KERNEL(div)); } diff --git a/aten/src/ATen/native/vulkan/ops/Batchnorm.cpp b/aten/src/ATen/native/vulkan/ops/Batchnorm.cpp index 30407e8cec38a..84828aa60468c 100644 --- a/aten/src/ATen/native/vulkan/ops/Batchnorm.cpp +++ b/aten/src/ATen/native/vulkan/ops/Batchnorm.cpp @@ -31,7 +31,7 @@ Tensor batch_norm( "running_var must be defined in evaluation mode."); TORCH_CHECK(input_arg.dim() == 4, "Vulkan batchnorm expects 4-dim input!"); TORCH_CHECK( - channels_size(input_arg) % 4 == 0, + get_dim(input_arg) % 4 == 0, "Vulkan batchnorm expects channel dim to be multiple of 4!"); const Tensor input = input_arg.is_vulkan() ? 
input_arg : input_arg.vulkan(); diff --git a/aten/src/ATen/native/vulkan/ops/Common.cpp b/aten/src/ATen/native/vulkan/ops/Common.cpp index 9336291840967..5a3daeb074288 100644 --- a/aten/src/ATen/native/vulkan/ops/Common.cpp +++ b/aten/src/ATen/native/vulkan/ops/Common.cpp @@ -5,42 +5,6 @@ namespace native { namespace vulkan { namespace ops { -uint32_t batch_size(const Tensor& tensor) { - const IntArrayRef sizes = tensor.sizes(); - const uint32_t dims = sizes.size(); - if (dims < 4) { - return 1; - } - return sizes[dims - 4]; -} - -uint32_t channels_size(const Tensor& tensor) { - const IntArrayRef sizes = tensor.sizes(); - const uint32_t dims = sizes.size(); - if (dims < 3) { - return 1; - } - return sizes[dims - 3]; -} - -uint32_t height_size(const Tensor& tensor) { - const IntArrayRef sizes = tensor.sizes(); - const uint32_t dims = sizes.size(); - if (dims < 2) { - return 1; - } - return sizes[dims - 2]; -} - -uint32_t width_size(const Tensor& tensor) { - const IntArrayRef sizes = tensor.sizes(); - const uint32_t dims = sizes.size(); - if (dims < 1) { - return 1; - } - return sizes[dims - 1]; -} - api::utils::uvec3 adaptive_work_group_size( const api::utils::uvec3& global_work_group) { api::utils::uvec3 local_group_size = {4, 4, 4}; diff --git a/aten/src/ATen/native/vulkan/ops/Common.h b/aten/src/ATen/native/vulkan/ops/Common.h index 2cb6159038bb1..913a2feb80d3a 100644 --- a/aten/src/ATen/native/vulkan/ops/Common.h +++ b/aten/src/ATen/native/vulkan/ops/Common.h @@ -43,10 +43,54 @@ struct Layout final { }; }; -uint32_t batch_size(const Tensor& tensor); -uint32_t channels_size(const Tensor& tensor); -uint32_t height_size(const Tensor& tensor); -uint32_t width_size(const Tensor& tensor); +/* + * Maps a semantic dimension name to an integer that corresponds to its + * innermost ordering in a 4D tensor in NCHW format. Width is the innermost + * dimension, so it corresponds to 1, height is the next innermost, so it + * corresponds to 2, and so on. + */ +struct Dim4D { + static constexpr uint32_t Width = 1u; + static constexpr uint32_t Height = 2u; + static constexpr uint32_t Channel = 3u; + static constexpr uint32_t Batch = 4u; +}; + +/* + * The functions below safely return the size of the dimension at the N-th + * innermost index. If the dimensionality of the size array is not sufficient + * then 1 will be returned. The structs above are intended to be used with + * these functions. + */ +template +uint32_t get_dim(const IntArrayRef sizes) { + const uint32_t dims = sizes.size(); + return dims < N ? 1 : sizes[dims - N]; +} + +template +uint32_t get_dim(const Tensor& t_in) { + return get_dim(t_in.sizes()); +} + +template +uint32_t get_dim(const vTensor& v_in) { + return get_dim(v_in.sizes()); +} + +inline c10::optional get_optional_tensor( + const c10::impl::GenericList& gen_list, + const uint32_t idx) { + return gen_list.get(idx).isTensor() ? gen_list.get(idx).toTensor() + : c10::optional(); +} + +inline c10::optional get_optional_scalar( + const c10::impl::GenericList& gen_list, + const uint32_t idx) { + return gen_list.get(idx).isScalar() ? 
gen_list.get(idx).toScalar() + : c10::optional(); +} api::utils::uvec3 adaptive_work_group_size( const api::utils::uvec3& global_work_group); diff --git a/aten/src/ATen/native/vulkan/ops/Concat.cpp b/aten/src/ATen/native/vulkan/ops/Concat.cpp index d0c2c0cf6afe6..4ab543f5527f0 100644 --- a/aten/src/ATen/native/vulkan/ops/Concat.cpp +++ b/aten/src/ATen/native/vulkan/ops/Concat.cpp @@ -114,7 +114,7 @@ Tensor cat_feature_mult4ch(const TensorList tensors, vTensor& v_output) { api::PipelineBarrier pipeline_barrier{}; - context->submit_texture_copy( + context->submit_copy( // pipeline barrier pipeline_barrier, // images @@ -152,7 +152,7 @@ Tensor cat_height(const TensorList tensors, vTensor& v_output) { api::PipelineBarrier pipeline_barrier{}; - context->submit_texture_copy( + context->submit_copy( // pipeline barrier pipeline_barrier, // images diff --git a/aten/src/ATen/native/vulkan/ops/Convolution.cpp b/aten/src/ATen/native/vulkan/ops/Convolution.cpp index e375a887e0e2e..1f81c2a7ef19f 100644 --- a/aten/src/ATen/native/vulkan/ops/Convolution.cpp +++ b/aten/src/ATen/native/vulkan/ops/Convolution.cpp @@ -5,7 +5,6 @@ #include #include #include -#include #include namespace at { @@ -41,12 +40,19 @@ Conv2dMethod determine_method( const IntArrayRef stride, const IntArrayRef padding, const IntArrayRef dilation, - const int64_t groups) { - if (is_depthwise(filter, groups)) - return Conv2dDepthwise; - if (is_pointwise(filter)) - return Conv2dPointwise; - return Conv2dSlidingWindow; + const int64_t groups, + const bool transposed, + const bool quantized) { + if (transposed) { + return TConv2dSlidingWindow; + } + if (is_depthwise(filter, groups)) { + return quantized ? QConv2dDepthwise : Conv2dDepthwise; + } + if (is_pointwise(filter)) { + return quantized ? QConv2dPointwise : Conv2dPointwise; + } + return quantized ? 
QConv2dSlidingWindow : Conv2dSlidingWindow; } vTensor pack_weights_dw(api::Context* const context, const Tensor& weight) { @@ -77,7 +83,7 @@ vTensor pack_weights_dw(api::Context* const context, const Tensor& weight) { weight.options(), }; - api::StagingBuffer staging(context, v_weight.buffer_bytes()); + api::StorageBuffer staging(context, at::kFloat, v_weight.numcells()); { api::MemoryMap mapping(staging.buffer(), api::MemoryAccessType::WRITE); @@ -111,7 +117,10 @@ vTensor pack_weights_dw(api::Context* const context, const Tensor& weight) { return v_weight; } -vTensor pack_weights_2d(api::Context* const context, const Tensor& weight) { +vTensor pack_weights_2d( + api::Context* const context, + const Tensor& weight, + bool reversed) { /* Source */ const IntArrayRef src_filter = weight.sizes(); const float* const src_weight_ptr = weight.data_ptr(); @@ -142,7 +151,7 @@ vTensor pack_weights_2d(api::Context* const context, const Tensor& weight) { weight.options(), }; - api::StagingBuffer staging(context, v_weight.buffer_bytes()); + api::StorageBuffer staging(context, at::kFloat, v_weight.numcells()); { api::MemoryMap mapping(staging.buffer(), api::MemoryAccessType::WRITE); @@ -161,6 +170,163 @@ vTensor pack_weights_2d(api::Context* const context, const Tensor& weight) { float* const dst_weight_c_ptr = dst_weight_ptr + dst_c * dst_kernel_sz; + if (reversed) { + for (const auto src_ic : + c10::irange(src_filter[Layout::Filter::input])) { + for (const auto src_ih : c10::irange(src_kh_sz)) { + const int64_t dst_h = src_kh_sz - 1 - src_ih; + for (const auto src_iw : c10::irange(src_kw_sz)) { + const int64_t dst_w = src_kw_sz - 1 - src_iw; + const int64_t dst_w_offset = dst_w * stack_depth; + memcpy( + dst_weight_c_ptr + (dst_oh * src_kh_sz + dst_h) * dst_kw_sz + + src_ic + dst_w_offset, + src_weight_oc_ptr + src_ic * src_kernel_sz + + src_ih * src_kw_sz + src_iw, + sizeof(float)); + } + } + } + } else { + for (const auto src_ic : + c10::irange(src_filter[Layout::Filter::input])) { + const int64_t dst_ic4 = src_ic / 4; + for (const auto src_ih : c10::irange(src_kh_sz)) { + for (const auto src_iw : c10::irange(src_kw_sz)) { + memcpy( + dst_weight_c_ptr + (dst_oh * src_kh_sz + src_ih) * dst_kw_sz + + dst_ic4 * src_kw_sz * 4 + src_iw * 4 + src_ic % 4, + src_weight_oc_ptr + src_ic * src_kernel_sz + + src_ih * src_kw_sz + src_iw, + sizeof(float)); + } + } + } + } + } + } + utils::pack_staging_to_vtensor(staging.buffer(), v_weight); + + return v_weight; +} + +vTensor pack_weights_dw_q(api::Context* const context, const Tensor& weight) { + /* Source */ + const IntArrayRef src_filter = weight.sizes(); + const c10::quint8* const src_weight_ptr = weight.data_ptr(); + + const int64_t src_kw_sz = src_filter[Layout::Filter::width]; + const int64_t src_kh_sz = src_filter[Layout::Filter::height]; + const int64_t src_kernel_sz = src_kw_sz * src_kh_sz; + const int64_t src_block_sz = + src_kernel_sz * src_filter[Layout::Filter::input]; + const int64_t num_stacks = + div_up(src_filter[Layout::Filter::output], INT64_C(4)); + + /* Destination */ + const int64_t dst_kw_sz = src_kernel_sz; + const int64_t dst_kh_sz = num_stacks; + const int64_t dst_kernel_sz = dst_kw_sz * dst_kh_sz; + + vTensor v_weight{ + context, + { + 4, + dst_kh_sz, + dst_kw_sz, + }, + weight.options(), + weight.q_scale(), + weight.q_zero_point(), + }; + + api::StorageBuffer staging(context, at::kFloat, v_weight.numcells()); + { + api::MemoryMap mapping(staging.buffer(), api::MemoryAccessType::WRITE); + + c10::quint8* dst_weight_ptr = mapping.template 
data(); + + memset(dst_weight_ptr, 0, v_weight.nbytes()); + + for (const auto src_oc : c10::irange(src_filter[Layout::Filter::output])) { + /* Source */ + const c10::quint8* const src_weight_oc_ptr = + src_weight_ptr + src_oc * src_block_sz; + + /* Destination */ + const int64_t dst_oh = src_oc / 4; + const int64_t dst_c = src_oc % 4; + + c10::quint8* const dst_weight_c_ptr = + dst_weight_ptr + dst_c * dst_kernel_sz + dst_oh * dst_kw_sz; + + for (const auto src_ih : + c10::irange(src_filter[Layout::Filter::height])) { + memcpy( + dst_weight_c_ptr + src_ih * src_kw_sz, + src_weight_oc_ptr + src_ih * src_kw_sz, + sizeof(c10::quint8) * src_kw_sz); + } + } + } + ops::utils::pack_staging_to_vtensor(staging.buffer(), v_weight); + + return v_weight; +} + +vTensor pack_weights_2d_q(api::Context* const context, const Tensor& weight) { + /* Source */ + const IntArrayRef src_filter = weight.sizes(); + const c10::quint8* const src_weight_ptr = weight.data_ptr(); + + const int64_t src_kw_sz = src_filter[Layout::Filter::width]; + const int64_t src_kh_sz = src_filter[Layout::Filter::height]; + const int64_t src_kernel_sz = src_kw_sz * src_kh_sz; + const int64_t src_block_sz = + src_kernel_sz * src_filter[Layout::Filter::input]; + + const int64_t num_stacks = + div_up(src_filter[Layout::Filter::output], INT64_C(4)); + const int64_t stack_depth = + api::utils::align_up(src_filter[Layout::Filter::input], INT64_C(4)); + + /* Destination */ + const int64_t dst_kw_sz = src_kw_sz * stack_depth; + const int64_t dst_kh_sz = src_kh_sz * num_stacks; + const int64_t dst_kernel_sz = dst_kw_sz * dst_kh_sz; + + vTensor v_weight{ + context, + { + 4, + dst_kh_sz, + dst_kw_sz, + }, + weight.options(), + weight.q_scale(), + weight.q_zero_point(), + }; + + api::StorageBuffer staging(context, at::kFloat, v_weight.numcells()); + { + api::MemoryMap mapping(staging.buffer(), api::MemoryAccessType::WRITE); + + c10::quint8* dst_weight_ptr = mapping.template data(); + + memset(dst_weight_ptr, 0, v_weight.nbytes()); + + for (const auto src_oc : c10::irange(src_filter[Layout::Filter::output])) { + /* Source */ + const c10::quint8* const src_weight_oc_ptr = + src_weight_ptr + src_oc * src_block_sz; + + /* Destination */ + const int64_t dst_oh = src_oc / 4; + const int64_t dst_c = src_oc % 4; + + c10::quint8* const dst_weight_c_ptr = + dst_weight_ptr + dst_c * dst_kernel_sz; + for (const auto src_ic : c10::irange(src_filter[Layout::Filter::input])) { const int64_t dst_ic4 = src_ic / 4; @@ -171,7 +337,7 @@ vTensor pack_weights_2d(api::Context* const context, const Tensor& weight) { dst_ic4 * src_kw_sz * 4 + src_iw * 4 + src_ic % 4, src_weight_oc_ptr + src_ic * src_kernel_sz + src_ih * src_kw_sz + src_iw, - sizeof(float)); + sizeof(c10::quint8)); } } } @@ -182,30 +348,48 @@ vTensor pack_weights_2d(api::Context* const context, const Tensor& weight) { return v_weight; } -vTensor pack_weights(const Tensor& weight_arg, const Conv2dMethod conv_method) { +vTensor pack_weights( + const Tensor& weight_arg, + const bool transposed, + const bool quantized, + const Conv2dMethod conv_method) { if (weight_arg.is_vulkan()) { return convert(weight_arg); } api::Context* const context = api::context(); - const Tensor weight = weight_arg.contiguous(); + const Tensor weight = transposed + ? 
at::permute(weight_arg, {1, 0, 2, 3}).contiguous() + : weight_arg.contiguous(); + if (transposed) { + return pack_weights_2d(context, weight, true); + } + if (quantized) { + if (conv_method == QConv2dDepthwise) { + return pack_weights_dw_q(context, weight); + } + return pack_weights_2d_q(context, weight); + } if (conv_method == Conv2dDepthwise) { return pack_weights_dw(context, weight); } - - return pack_weights_2d(context, weight); + return pack_weights_2d(context, weight, false); } -vTensor pack_biases(const c10::optional& bias, const Tensor& weight) { +vTensor pack_biases_reg( + const c10::optional& bias, + const Tensor& weight, + const bool transposed) { if (bias && bias->is_vulkan()) { return convert(*bias); } api::Context* const context = api::context(); - const int64_t src_w = weight.size(Layout::Filter::output); + const int64_t src_w = weight.size( + transposed ? Layout::TransposedFilter::output : Layout::Filter::output); const int64_t packed_w = div_up(src_w, INT64_C(4)); vTensor v_bias{ context, @@ -217,7 +401,7 @@ vTensor pack_biases(const c10::optional& bias, const Tensor& weight) { weight.options(), }; - api::StagingBuffer staging(context, v_bias.buffer_bytes()); + api::StorageBuffer staging(context, at::kFloat, v_bias.numcells()); { api::MemoryMap mapping(staging.buffer(), api::MemoryAccessType::WRITE); @@ -242,14 +426,78 @@ vTensor pack_biases(const c10::optional& bias, const Tensor& weight) { v_bias.nbytes()); } } + utils::pack_staging_to_vtensor(staging.buffer(), v_bias); + + return v_bias; +} + +vTensor pack_biases_q(const c10::optional& bias, const Tensor& weight) { + if (bias && bias->is_vulkan()) { + return convert(*bias); + } + + api::Context* const context = api::context(); + + const int64_t src_w = weight.size(Layout::Filter::output); + const int64_t packed_w = div_up(src_w, INT64_C(4)); + vTensor v_bias{ + context, + { + 4, + 1, + packed_w, + }, + weight.options(), + weight.q_scale(), + weight.q_zero_point(), + }; + + api::StorageBuffer staging(context, at::kFloat, v_bias.numcells()); + { + api::MemoryMap mapping(staging.buffer(), api::MemoryAccessType::WRITE); + + c10::quint8* dst_bias_ptr = mapping.template data(); + + if (bias) { + const c10::quint8* const src_bias_ptr = + bias->contiguous().data_ptr(); + + memset(dst_bias_ptr, 0, v_bias.nbytes()); + for (const auto i : c10::irange(src_w)) { + const int64_t c = i % 4; + const int64_t x = i / 4; + dst_bias_ptr[c * packed_w + x] = src_bias_ptr[i]; + } + } else { + memset( + dst_bias_ptr, + // 2's complement integers and IEEE-754 floating point numbers both + // have identical bit representations for 0, so can use memset which + // only accepts uint8_t parameter. + 0, + v_bias.nbytes()); + } + } ops::utils::pack_staging_to_vtensor(staging.buffer(), v_bias); return v_bias; } +vTensor pack_biases( + const c10::optional& bias, + const Tensor& weight, + const bool transposed, + const bool quantized) { + if (quantized) { + return pack_biases_q(bias, weight); + } + return pack_biases_reg(bias, weight, transposed); +} + std::array pack_filter( const Tensor& weight, - const IntArrayRef dilation) { + const IntArrayRef dilation, + const bool transposed) { const IntArrayRef filter = weight.sizes(); const auto effective = [](const int64_t k, const int64_t d) { @@ -257,8 +505,14 @@ std::array pack_filter( }; return { - align_up(filter[Layout::Filter::output], INT64_C(4)), - align_up(filter[Layout::Filter::input], INT64_C(4)), + align_up( + transposed ? 
filter[Layout::TransposedFilter::output] + : filter[Layout::Filter::output], + INT64_C(4)), + align_up( + transposed ? filter[Layout::TransposedFilter::input] + : filter[Layout::Filter::input], + INT64_C(4)), effective( filter[Layout::Filter::height], dilation[Layout::Parameter::height]), effective( @@ -275,6 +529,35 @@ std::array pack_params(const std::vector& vector) { }; } +bool weight_valid(const Tensor& weight, const bool quantized) { + return (4 == weight.ndimension()) && + (weight.size(Layout::Filter::height) > 0) && + (weight.size(Layout::Filter::width) > 0) && + ((weight.device().is_cpu()) || + (c10::DeviceType::Vulkan == weight.device().type())) && + (kFloat == weight.scalar_type() || + (quantized && c10::kQUInt8 == weight.scalar_type())); +} + +bool bias_valid( + const c10::optional& bias, + const Tensor& weight, + const bool transposed, + const bool quantized) { + if (bias && bias->defined()) { + return (1 == bias->ndimension()) && + ((bias->device().is_cpu()) || + (c10::DeviceType::Vulkan == bias->device().type())) && + (kFloat == bias->scalar_type() || + (quantized && c10::kQUInt8 == bias->scalar_type())) && + (transposed ? (weight.size(Layout::TransposedFilter::output) == + bias->size(Layout::Filter::output)) + : (weight.size(Layout::Filter::output) == + bias->size(Layout::Filter::output))); + } + return true; +} + bool available( const Tensor& weight, const c10::optional& bias, @@ -282,27 +565,16 @@ bool available( const IntArrayRef padding, const IntArrayRef dilation, const bool transposed, + const bool quantized, const IntArrayRef /* output_padding */, const int64_t groups, const c10::optional& output_min, const c10::optional& output_max) { return api::available() && // Weight - (4 == weight.ndimension()) && (weight.size(Layout::Filter::height) > 0) && - (weight.size(Layout::Filter::width) > 0) && - ((weight.device().is_cpu()) || - (c10::DeviceType::Vulkan == weight.device().type())) && - (kFloat == weight.scalar_type()) && + weight_valid(weight, quantized) && // Bias - ((bias && bias->defined()) - ? ((1 == bias->ndimension()) && - ((bias->device().is_cpu()) || - (c10::DeviceType::Vulkan == bias->device().type())) && - (kFloat == bias->scalar_type()) && - (transposed ? false /* to be addded in the future */ - : (weight.size(Layout::Filter::output) == - bias->size(Layout::Filter::output)))) - : true) && + bias_valid(bias, weight, transposed, quantized) && // Stride (stride[Layout::Parameter::height] > 0) && (stride[Layout::Parameter::width] > 0) && @@ -310,8 +582,10 @@ bool available( (padding[Layout::Parameter::height] >= 0) && (padding[Layout::Parameter::width] >= 0) && // Dilation - (dilation[Layout::Parameter::height] > 0) && - (dilation[Layout::Parameter::width] > 0) && + (transposed ? 
(dilation[Layout::Parameter::height] == 1) && + (dilation[Layout::Parameter::width] == 1) + : (dilation[Layout::Parameter::height] > 0) && + (dilation[Layout::Parameter::width] > 0)) && // Groups (groups > 0) && // Input @@ -325,11 +599,12 @@ bool available( (!output_max || output_max->isFloatingPoint()) && true; } -bool usable(const Tensor& input) { +bool usable(const Tensor& input, const bool quantized) { // Input return (4 == input.ndimension()) && (c10::DeviceType::Vulkan == input.device().type()) && - (kFloat == input.scalar_type()) && + (kFloat == input.scalar_type() || + (quantized && c10::kQUInt8 == input.scalar_type())) && (input.size(Layout::Activation4D::batch) >= 0) && (input.size(Layout::Activation4D::channels) > 0) && (input.size(Layout::Activation4D::height) > 0) && @@ -337,77 +612,22 @@ bool usable(const Tensor& input) { true; } -} // namespace - -VulkanOpContext conv2d_context_create( - const Tensor& weight, - const c10::optional& bias, - const IntArrayRef stride_arg, - const IntArrayRef padding_arg, - const IntArrayRef dilation_arg, - const bool transposed, - const IntArrayRef output_padding_arg, - const int64_t groups, - const c10::optional& output_min, - const c10::optional& output_max) { - const auto stride = expand_param_if_needed(stride_arg, "stride", 2); - const auto padding = expand_param_if_needed(padding_arg, "padding", 2); - const auto dilation = expand_param_if_needed(dilation_arg, "dilation", 2); - const auto output_padding = output_padding_arg; // TODO: Deconvolutions - - TORCH_CHECK( - available( - weight, - bias, - stride, - padding, - dilation, - transposed, - output_padding, - groups, - output_min, - output_max), - "Vulkan::convolution not available! " - "Reason: The provided (weight, bias, stride, padding, dilation, groups, " - "transposed, output_padding, output_min, output_max) parameters are either " - "invalid individually or their combination is not supported by Vulkan impl."); - - const auto method = - determine_method(weight.sizes(), stride, padding, dilation, groups); - - c10::impl::GenericList packed_context{c10::AnyType::get()}; - packed_context.reserve(10); - packed_context.emplace_back(convert(pack_weights(weight, method))); - packed_context.emplace_back(convert(pack_biases(bias, weight))); - packed_context.emplace_back(pack_filter(weight, dilation)); - packed_context.emplace_back(pack_params(stride)); - packed_context.emplace_back(pack_params(padding)); - packed_context.emplace_back(output_padding); - packed_context.emplace_back(pack_params(dilation)); - packed_context.emplace_back(safe_downcast(groups)); - packed_context.emplace_back( - output_min ? output_min->template to() - : -std::numeric_limits::infinity()); - packed_context.emplace_back( - output_max ? 
output_max->template to() - : +std::numeric_limits::infinity()); - packed_context.emplace_back(method); - - c10::impl::GenericList unpacked_context{c10::AnyType::get()}; - unpacked_context.reserve(10); - unpacked_context.emplace_back(weight); - unpacked_context.emplace_back(bias); - unpacked_context.emplace_back(weight.sizes().vec()); - unpacked_context.emplace_back(stride_arg.vec()); - unpacked_context.emplace_back(padding_arg.vec()); - unpacked_context.emplace_back(output_padding_arg.vec()); - unpacked_context.emplace_back(dilation_arg.vec()); - unpacked_context.emplace_back(groups); - unpacked_context.emplace_back(output_min); - unpacked_context.emplace_back(output_max); - unpacked_context.emplace_back(method); - - return VulkanOpContext::create(packed_context, unpacked_context); +static inline std::vector get_conv_transpose_output_size( + IntArrayRef input_size, + IntArrayRef weight_size, + IntArrayRef padding, + IntArrayRef output_padding, + IntArrayRef stride, + IntArrayRef dilation = IntArrayRef()) { + auto dim = input_size.size(); + std::vector output_size(dim); + output_size[0] = input_size[input_batch_size_dim]; + output_size[1] = weight_size[weight_input_channels_dim]; + for (const auto d : c10::irange(2, dim)) { + output_size[d] = stride[d - 2] * (input_size[d] - 1) + weight_size[d] - + 2 * padding[d - 2] + output_padding[d - 2]; + } + return output_size; } void conv2d_sliding_window( @@ -438,7 +658,8 @@ void conv2d_sliding_window( ivec4 src_filter; } block{ v_output.extents(), - safe_downcast(packed_filter[Layout::Filter::input]), + safe_downcast( + packed_filter[Layout::Filter::input]), /* this is aligned up */ { safe_downcast(packed_filter[Layout::Filter::width]), safe_downcast(packed_filter[Layout::Filter::height]), @@ -503,44 +724,441 @@ void conv2d_sliding_window( params.buffer()); } -Tensor conv2d_context_run( - const Tensor& input_arg, - const c10::impl::GenericList& packed_context, - const c10::impl::GenericList& unpacked_context) { +void conv2d_sliding_window_q( + const api::ShaderSource& shader, + vTensor& v_output, + const vTensor& v_input, + const vTensor& packed_v_weight, + const vTensor& packed_v_bias, + const IntArrayRef packed_filter, + const IntArrayRef packed_stride, + const IntArrayRef packed_padding, + const IntArrayRef packed_dilation, + const float packed_output_min, + const float packed_output_max, + const IntArrayRef unpacked_filter, + const Conv2dMethod method_, + const double scale, + const int64_t zero_point) { api::Context* const context = api::context(); - const Tensor input = input_arg.is_vulkan() ? 
input_arg : input_arg.vulkan(); - const vTensor& v_input = convert(input); + const double scale_out = v_output.get_scale(); + const int64_t zero_point_out = v_output.get_zero_point(); - const vTensor& packed_v_weight = convert(packed_context.get(0).toTensor()); - const vTensor& packed_v_bias = convert(packed_context.get(1).toTensor()); + const double weight_scale = packed_v_weight.get_scale(); + const int64_t weight_zero_point = packed_v_weight.get_zero_point(); - const auto packed_filter = packed_context.get(2).toIntVector(); - const auto packed_stride = packed_context.get(3).toIntVector(); - const auto packed_padding = packed_context.get(4).toIntVector(); - const auto packed_dilation = packed_context.get(6).toIntVector(); - const float packed_output_min = packed_context.get(8).toDouble(); - const float packed_output_max = packed_context.get(9).toDouble(); - const auto unpacked_filter = unpacked_context.get(2).toIntVector(); - const Conv2dMethod method_ = (Conv2dMethod)unpacked_context.get(10).toInt(); - - TORCH_CHECK( - usable(input), - "Vulkan Convolution not usable! " - "Reason: The provided input tensor is either invalid or unsupported by Vulkan impl."); + const double bias_scale = packed_v_bias.get_scale(); + const int64_t bias_zero_point = packed_v_bias.get_zero_point(); - vTensor v_output{ - context, - conv_output_size( - v_input.sizes(), - unpacked_filter, - packed_padding, - packed_stride, - packed_dilation), + const struct Block final { + uvec3 extents; + int32_t ic4; + ivec4 kernel; + float scale_out; + float scale; + int32_t zero_point_out; + int32_t zero_point; + float weight_scale; + float bias_scale; + int32_t weight_zero_point; + int32_t bias_zero_point; + ivec2 ikernel; + ivec2 stride; + ivec2 padding; + ivec2 dilate; + vec2 clamp; + } block{ + v_output.extents(), + safe_downcast(packed_filter[Layout::Filter::input]), + { + safe_downcast(packed_filter[Layout::Filter::width]), + safe_downcast(packed_filter[Layout::Filter::height]), + safe_downcast(v_input.sizes()[Layout::Activation4D::width]), + safe_downcast(v_input.sizes()[Layout::Activation4D::height]), + }, + safe_downcast(scale_out), + safe_downcast(scale), + safe_downcast(zero_point_out), + safe_downcast(zero_point), + safe_downcast(weight_scale), + safe_downcast(bias_scale), + safe_downcast(weight_zero_point), + safe_downcast(bias_zero_point), + { + safe_downcast(unpacked_filter[Layout::Filter::width]), + safe_downcast(unpacked_filter[Layout::Filter::height]), + }, + { + safe_downcast(packed_stride[Layout::Parameter::width]), + safe_downcast(packed_stride[Layout::Parameter::height]), + }, + { + safe_downcast(packed_padding[Layout::Parameter::width]), + safe_downcast(packed_padding[Layout::Parameter::height]), + }, + { + safe_downcast(packed_dilation[Layout::Parameter::width]), + safe_downcast(packed_dilation[Layout::Parameter::height]), + }, + { + packed_output_min, + packed_output_max, + }, + }; + + uvec3 global_size = v_output.extents(); + if (method_ == QConv2dPointwise) { + global_size = { + safe_downcast( + div_up(v_output.sizes()[Layout::Filter::width], INT64_C(2))), + safe_downcast( + div_up(v_output.sizes()[Layout::Filter::height], INT64_C(2))), + v_output.extents().data[2u]}; + } + + api::UniformParamsBuffer params(context, block); + api::PipelineBarrier pipeline_barrier{}; + + context->submit_compute_job( + // shader descriptor + shader, + // pipeline barrier + pipeline_barrier, + // global work group size + global_size, + // local work group size + adaptive_work_group_size(global_size), + // fence handle 
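+      // Illustrative note (added for exposition, not part of this patch): the
+      // pointwise branch above halves the global size in x and y, so a
+      // hypothetical output of width 8, height 6 with a depth extent of 4 is
+      // dispatched as {div_up(8, 2), div_up(6, 2), 4} = {4, 3, 4} invocations,
+      // each presumably covering a 2x2 tile of output texels.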
+ VK_NULL_HANDLE, + // shader arguments + v_output.image( + pipeline_barrier, + api::PipelineStage::COMPUTE, + api::MemoryAccessType::WRITE), + v_input.image(pipeline_barrier, api::PipelineStage::COMPUTE), + packed_v_weight.image(pipeline_barrier, api::PipelineStage::COMPUTE), + packed_v_bias.image(pipeline_barrier, api::PipelineStage::COMPUTE), + // params buffer + params.buffer()); +} + +Tensor convolution( + const Tensor& input, + const Tensor& weight, + const c10::optional& bias, + const IntArrayRef stride, + const IntArrayRef padding, + const IntArrayRef dilation, + const bool transposed, + const IntArrayRef output_padding, + const int64_t groups) { + Conv2dPackedContext conv_context = Conv2dPackedContext( + weight, + bias, + stride, + padding, + dilation, + transposed, + false, + output_padding, + groups); + + return run_conv2d_context( + input, c10::make_intrusive(conv_context)); +} + +Tensor quantized_convolution( + const Tensor& input, + const Tensor& weight, + const c10::optional& bias, + const IntArrayRef stride, + const IntArrayRef padding, + const IntArrayRef dilation, + const bool transposed, + const IntArrayRef output_padding, + const int64_t groups, + const double out_scale, + const int64_t out_zero_point) { + if (transposed) { + return run_tconv2d_context( + input, + c10::make_intrusive(Conv2dPackedContext( + weight, + bias, + stride, + padding, + dilation, + transposed, + false, + output_padding, + groups))); + } + + Conv2dPackedContext conv_context = Conv2dPackedContext( + weight, + bias, + stride, + padding, + dilation, + transposed, + true, + output_padding, + groups); + + return run_qconv2d_context( + input, + out_scale, + out_zero_point, + c10::make_intrusive(conv_context)); +} + +} // namespace + +Conv2dPackedContext::Conv2dPackedContext( + const Tensor& weight, + const c10::optional& bias, + const IntArrayRef stride_arg, + const IntArrayRef padding_arg, + const IntArrayRef dilation_arg, + const bool transposed, + const bool quantized, + const IntArrayRef output_padding_arg, + const int64_t groups, + const c10::optional& output_min, + const c10::optional& output_max) + : unpacked_{c10::AnyType::get()} { + const auto stride = expand_param_if_needed(stride_arg, "stride", 2); + const auto padding = expand_param_if_needed(padding_arg, "padding", 2); + const auto dilation = expand_param_if_needed(dilation_arg, "dilation", 2); + const auto output_padding = + expand_param_if_needed(output_padding_arg, "output_padding", 2); + + TORCH_CHECK( + available( + weight, + bias, + stride, + padding, + dilation, + transposed, + quantized, + output_padding, + groups, + output_min, + output_max), + "Vulkan::convolution not available! 
" + "Reason: The provided (weight, bias, stride, padding, dilation, groups, " + "transposed, output_padding, output_min, output_max) parameters are either " + "invalid individually or their combination is not supported by Vulkan impl."); + + const auto method = determine_method( + weight.sizes(), stride, padding, dilation, groups, transposed, quantized); + + packed_.reserve(Packed::NumArgs); + packed_.emplace_back( + convert(pack_weights(weight, transposed, quantized, method))); + packed_.emplace_back( + convert(pack_biases(bias, weight, transposed, quantized))); + packed_.emplace_back(pack_filter(weight, dilation, transposed)); + packed_.emplace_back(pack_params(stride)); + packed_.emplace_back(pack_params(padding)); + packed_.emplace_back(output_padding); + packed_.emplace_back(pack_params(dilation)); + packed_.emplace_back(transposed); + packed_.emplace_back(quantized); + packed_.emplace_back(safe_downcast(groups)); + packed_.emplace_back( + output_min ? output_min->template to() + : -std::numeric_limits::infinity()); + packed_.emplace_back( + output_max ? output_max->template to() + : +std::numeric_limits::infinity()); + packed_.emplace_back(method); + packed_.emplace_back(weight.sizes().vec()); + + if (!at::globalContext().releaseWeightsWhenPrepacking()) { + unpacked_.reserve(Unpacked::NumArgs); + unpacked_.emplace_back(weight); + unpacked_.emplace_back(bias); + unpacked_.emplace_back(stride_arg.vec()); + unpacked_.emplace_back(padding_arg.vec()); + unpacked_.emplace_back(dilation_arg.vec()); + unpacked_.emplace_back(transposed); + unpacked_.emplace_back(quantized); + unpacked_.emplace_back(output_padding_arg.vec()); + unpacked_.emplace_back(groups); + unpacked_.emplace_back(output_min); + unpacked_.emplace_back(output_max); + } +} + +Conv2dPackedContext Conv2dPackedContext::pack(c10::impl::GenericList unpacked) { + return Conv2dPackedContext( + unpacked.get(Unpacked::Weight).toTensor(), + get_optional_tensor(unpacked, Unpacked::Bias), + unpacked.get(Unpacked::Stride).toIntVector(), + unpacked.get(Unpacked::Padding).toIntVector(), + unpacked.get(Unpacked::Dilation).toIntVector(), + unpacked.get(Unpacked::isTransposed).toBool(), + unpacked.get(Unpacked::isQuantized).toBool(), + unpacked.get(Unpacked::OutputPadding).toIntVector(), + unpacked.get(Unpacked::Groups).toInt(), + get_optional_scalar(unpacked, Unpacked::OutputMin), + get_optional_scalar(unpacked, Unpacked::OutputMax)); +} + +c10::intrusive_ptr create_conv2d_context( + Tensor&& weight, + c10::optional&& bias, + std::vector&& stride, + std::vector&& padding, + std::vector&& dilation, + const int64_t groups, + const c10::optional& output_min, + const c10::optional& output_max) { + return c10::make_intrusive(Conv2dPackedContext( + weight, + bias, + stride, + padding, + dilation, + /* transposed = */ false, + /* quantized = */ false, + /* output_padding_arg = */ {0}, + groups, + output_min, + output_max)); +} + +c10::intrusive_ptr create_tconv2d_context( + Tensor&& weight, + c10::optional&& bias, + std::vector&& stride, + std::vector&& padding, + std::vector&& output_padding, + std::vector&& dilation, + const int64_t groups, + const c10::optional& output_min, + const c10::optional& output_max) { + return c10::make_intrusive(Conv2dPackedContext( + weight, + bias, + stride, + padding, + dilation, + /* transposed = */ true, + /* quantized = */ false, + output_padding, + groups, + output_min, + output_max)); +} + +c10::intrusive_ptr create_qconv2d_context( + Tensor&& weight, + c10::optional&& bias, + std::vector&& stride, + 
std::vector&& padding, + std::vector&& dilation, + const int64_t groups, + const c10::optional& output_min, + const c10::optional& output_max) { + return c10::make_intrusive(Conv2dPackedContext( + weight, + bias, + stride, + padding, + dilation, + /* transposed = */ false, + /* quantized = */ true, + /* output_padding_arg = */ {}, + groups, + output_min, + output_max)); +} + +Tensor run_conv2d_context( + const Tensor& input_arg, + const c10::intrusive_ptr& conv_context) { + api::Context* const context = api::context(); + + const Tensor input = input_arg.is_vulkan() ? input_arg : input_arg.vulkan(); + const vTensor& v_input = convert(input); + + const vTensor& packed_v_weight = convert( + conv_context->get_val(Conv2dPackedContext::Packed::Weight).toTensor()); + const vTensor& packed_v_bias = convert( + conv_context->get_val(Conv2dPackedContext::Packed::Bias).toTensor()); + const auto packed_filter = + conv_context->get_val(Conv2dPackedContext::Packed::FilterSizes) + .toIntVector(); + const auto packed_stride = + conv_context->get_val(Conv2dPackedContext::Packed::Stride).toIntVector(); + const auto packed_padding = + conv_context->get_val(Conv2dPackedContext::Packed::Padding).toIntVector(); + const auto packed_output_padding = + conv_context->get_val(Conv2dPackedContext::Packed::OutputPadding) + .toIntVector(); + const auto packed_dilation = + conv_context->get_val(Conv2dPackedContext::Packed::Dilation) + .toIntVector(); + const auto transposed = + conv_context->get_val(Conv2dPackedContext::Packed::isTransposed).toBool(); + const auto quantized = + conv_context->get_val(Conv2dPackedContext::Packed::isQuantized).toBool(); + const float packed_output_min = safe_downcast( + conv_context->get_val(Conv2dPackedContext::Packed::OutputMin).toDouble()); + const float packed_output_max = safe_downcast( + conv_context->get_val(Conv2dPackedContext::Packed::OutputMax).toDouble()); + const Conv2dMethod method_ = + (Conv2dMethod)conv_context + ->get_val(Conv2dPackedContext::Packed::ConvMethod) + .toInt(); + const auto unpacked_filter = + conv_context->get_val(Conv2dPackedContext::Packed::WeightSizes) + .toIntVector(); + + TORCH_CHECK( + usable(input, quantized), + "Vulkan Convolution not usable! " + "Reason: The provided input tensor is either invalid or unsupported by Vulkan impl."); + + vTensor v_output{ + context, + transposed ? 
get_conv_transpose_output_size( + v_input.sizes(), + unpacked_filter, + packed_padding, + packed_output_padding, + packed_stride, + packed_dilation) + : conv_output_size( + v_input.sizes(), + unpacked_filter, + packed_padding, + packed_stride, + packed_dilation), input.options(), }; switch (method_) { + case TConv2dSlidingWindow: + conv2d_sliding_window( + VK_KERNEL(conv_transpose2d), + v_output, + v_input, + packed_v_weight, + packed_v_bias, + packed_filter, + packed_stride, + packed_padding, + packed_dilation, + packed_output_min, + packed_output_max, + unpacked_filter, + method_); + break; case Conv2dDepthwise: conv2d_sliding_window( VK_KERNEL(conv2d_dw), @@ -594,38 +1212,161 @@ Tensor conv2d_context_run( return convert(v_output); } -c10::intrusive_ptr create_conv2d_clamp_context( - Tensor&& weight, - c10::optional&& bias, - std::vector&& stride, - std::vector&& padding, - std::vector&& dilation, - const int64_t groups, - const c10::optional& output_min, - const c10::optional& output_max) { - return c10::make_intrusive(conv2d_context_create( +Tensor run_tconv2d_context( + const Tensor& input_arg, + const c10::intrusive_ptr& conv_context) { + return run_conv2d_context(input_arg, conv_context); +} + +// TODO: this can probably be consolidated with the other run method +Tensor run_qconv2d_context( + const Tensor& input_arg, + double scale, + int64_t zero_point, + const c10::intrusive_ptr& conv_context) { + api::Context* const context = api::context(); + + const Tensor input = input_arg.is_vulkan() ? input_arg : input_arg.vulkan(); + const vTensor& v_input = convert(input); + + const vTensor& packed_v_weight = convert( + conv_context->get_val(Conv2dPackedContext::Packed::Weight).toTensor()); + const vTensor& packed_v_bias = convert( + conv_context->get_val(Conv2dPackedContext::Packed::Bias).toTensor()); + const auto packed_filter = + conv_context->get_val(Conv2dPackedContext::Packed::FilterSizes) + .toIntVector(); + const auto packed_stride = + conv_context->get_val(Conv2dPackedContext::Packed::Stride).toIntVector(); + const auto packed_padding = + conv_context->get_val(Conv2dPackedContext::Packed::Padding).toIntVector(); + const auto packed_output_padding = + conv_context->get_val(Conv2dPackedContext::Packed::OutputPadding) + .toIntVector(); + const auto packed_dilation = + conv_context->get_val(Conv2dPackedContext::Packed::Dilation) + .toIntVector(); + const auto quantized = + conv_context->get_val(Conv2dPackedContext::Packed::isQuantized).toBool(); + const float packed_output_min = safe_downcast( + conv_context->get_val(Conv2dPackedContext::Packed::OutputMin).toDouble()); + const float packed_output_max = safe_downcast( + conv_context->get_val(Conv2dPackedContext::Packed::OutputMax).toDouble()); + const Conv2dMethod method_ = + (Conv2dMethod)conv_context + ->get_val(Conv2dPackedContext::Packed::ConvMethod) + .toInt(); + const auto unpacked_filter = + conv_context->get_val(Conv2dPackedContext::Packed::WeightSizes) + .toIntVector(); + + TORCH_CHECK( + usable(input, quantized), + "Vulkan Convolution not usable! 
" + "Reason: The provided input tensor is either invalid or unsupported by Vulkan impl."); + + vTensor v_output{ + context, + conv_output_size( + v_input.sizes(), + unpacked_filter, + packed_padding, + packed_stride, + packed_dilation), + input.options(), + scale, + zero_point, + }; + + switch (method_) { + case QConv2dSlidingWindow: + conv2d_sliding_window_q( + VK_KERNEL(quantized_conv2d), + v_output, + v_input, + packed_v_weight, + packed_v_bias, + packed_filter, + packed_stride, + packed_padding, + packed_dilation, + packed_output_min, + packed_output_max, + unpacked_filter, + method_, + v_input.get_scale(), + v_input.get_zero_point()); + break; + case QConv2dPointwise: + conv2d_sliding_window_q( + VK_KERNEL(quantized_conv2d_pw_2x2), + v_output, + v_input, + packed_v_weight, + packed_v_bias, + packed_filter, + packed_stride, + packed_padding, + packed_dilation, + packed_output_min, + packed_output_max, + unpacked_filter, + method_, + v_input.get_scale(), + v_input.get_zero_point()); + break; + case QConv2dDepthwise: + conv2d_sliding_window_q( + VK_KERNEL(quantized_conv2d_dw), + v_output, + v_input, + packed_v_weight, + packed_v_bias, + packed_filter, + packed_stride, + packed_padding, + packed_dilation, + packed_output_min, + packed_output_max, + unpacked_filter, + method_, + v_input.get_scale(), + v_input.get_zero_point()); + break; + default: + TORCH_CHECK(false, "Invalid Method"); + } + + return convert_quantized(v_output); +} + +Tensor conv2d( + const Tensor& input, + const Tensor& weight, + const c10::optional& bias, + IntArrayRef stride, + IntArrayRef padding, + IntArrayRef dilation, + int64_t groups, + double out_scale, + int64_t out_zero_point) { + return quantized_convolution( + input, weight, bias, stride, padding, dilation, - /* transposed = */ false, - /* output_padding_arg = */ {}, + false, + {{0, 0}}, groups, - output_min, - output_max)); -} - -Tensor run_conv2d_clamp_context( - const Tensor& input, - const c10::intrusive_ptr& vulkan_context) { - return conv2d_context_run( - input, vulkan_context->get_packed(), vulkan_context->get_unpacked()); + out_scale, + out_zero_point); } /* Backwards compatibility */ -Conv2dOpContext::Conv2dOpContext(VulkanOpContext vulkan_context) - : vulkan_context_{std::move(vulkan_context)} {} +Conv2dOpContext::Conv2dOpContext(Conv2dPackedContext conv_context) + : conv_context_{std::move(conv_context)} {} Conv2dOpContext Conv2dOpContext::create( const Tensor& weight, @@ -638,13 +1379,14 @@ Conv2dOpContext Conv2dOpContext::create( const int64_t groups, const c10::optional& output_min, const c10::optional& output_max) { - return Conv2dOpContext{conv2d_context_create( + return Conv2dOpContext{Conv2dPackedContext( weight, bias, stride_arg, padding_arg, dilation_arg, transposed, + /* quantized = */ false, output_padding_arg, groups, output_min, @@ -652,36 +1394,24 @@ Conv2dOpContext Conv2dOpContext::create( } Tensor Conv2dOpContext::run(const Tensor& input_arg) const { - return conv2d_context_run( - input_arg, vulkan_context_.get_packed(), vulkan_context_.get_unpacked()); + return run_conv2d_context( + input_arg, c10::make_intrusive(conv_context_)); } Conv2dOpContext::State Conv2dOpContext::unpack() const { - const c10::impl::GenericList unpacked_ = - std::get<1>(vulkan_context_.get_state()); - const Tensor unpacked_weight = unpacked_.get(0).toTensor(); - const c10::optional unpacked_bias = unpacked_.get(1).isTensor() - ? 
unpacked_.get(1).toTensor() - : (c10::optional&)c10::nullopt; - const std::vector unpacked_stride = unpacked_.get(2).toIntVector(); - const std::vector unpacked_padding = unpacked_.get(3).toIntVector(); - const std::vector unpacked_dilation = unpacked_.get(4).toIntVector(); - const int64_t unpacked_groups = unpacked_.get(5).toInt(); - const c10::optional unpacked_output_min = unpacked_.get(6).isScalar() - ? unpacked_.get(6).toScalar() - : (c10::optional)c10::nullopt; - const c10::optional unpacked_output_max = unpacked_.get(6).isScalar() - ? unpacked_.get(7).toScalar() - : (c10::optional)c10::nullopt; - return Conv2dOpContext::State{ - unpacked_weight, - unpacked_bias, - unpacked_stride, - unpacked_padding, - unpacked_dilation, - unpacked_groups, - unpacked_output_min, - unpacked_output_max}; + const c10::impl::GenericList unpacked_ = conv_context_.unpack(); + + TORCH_CHECK(unpacked_.size() > 0u, "unpacked_ does not have any elements!"); + + return Conv2dOpContext::State( + unpacked_.get(Conv2dPackedContext::Unpacked::Weight).toTensor(), + get_optional_tensor(unpacked_, Conv2dPackedContext::Unpacked::Bias), + unpacked_.get(Conv2dPackedContext::Unpacked::Stride).toIntVector(), + unpacked_.get(Conv2dPackedContext::Unpacked::Padding).toIntVector(), + unpacked_.get(Conv2dPackedContext::Unpacked::Dilation).toIntVector(), + unpacked_.get(Conv2dPackedContext::Unpacked::Groups).toInt(), + get_optional_scalar(unpacked_, Conv2dPackedContext::Unpacked::OutputMin), + get_optional_scalar(unpacked_, Conv2dPackedContext::Unpacked::OutputMax)); } c10::intrusive_ptr conv2d_clamp_prepack( @@ -700,7 +1430,7 @@ c10::intrusive_ptr conv2d_clamp_prepack( std::move(padding), std::move(dilation), /* transposed = */ false, - /* output_padding = */ {}, + /* output_padding = */ {0}, groups, output_min, output_max)); @@ -712,6 +1442,10 @@ Tensor conv2d_clamp_run( return context->run(input); } +TORCH_LIBRARY_IMPL(aten, Vulkan, m) { + m.impl("convolution_overrideable", convolution); +} + } // namespace ops } // namespace vulkan } // namespace native diff --git a/aten/src/ATen/native/vulkan/ops/Convolution.h b/aten/src/ATen/native/vulkan/ops/Convolution.h index 69680a4b167b4..745d6064def17 100644 --- a/aten/src/ATen/native/vulkan/ops/Convolution.h +++ b/aten/src/ATen/native/vulkan/ops/Convolution.h @@ -3,7 +3,7 @@ #ifdef USE_VULKAN_API #include -#include +#include namespace at { namespace native { @@ -14,61 +14,125 @@ enum Conv2dMethod { Conv2dDepthwise, Conv2dPointwise, Conv2dSlidingWindow, + TConv2dSlidingWindow, + QConv2dDepthwise, + QConv2dPointwise, + QConv2dSlidingWindow, }; -// private: -// packed -// vTensor v_weight -// vTensor v_bias -// std::array filter -// std::array stride -// std::array padding -// std::array dilation -// int32_t groups -// float output_min -// float output_max - -// unpacked -// Tensor weight -// c10::optional bias -// std::vector filter -// std::vector stride -// std::vector padding -// std::vector dilation -// int64_t groups -// c10::optional output_min -// c10::optional output_max - -VulkanOpContext conv2d_context_create( - const Tensor& weight, - const c10::optional& bias, - const IntArrayRef stride_arg, - const IntArrayRef padding_arg, - const IntArrayRef dilation_arg, - const bool transposed, - const IntArrayRef output_padding_arg, +class Conv2dPackedContext final : virtual public VulkanPackedContext, + public torch::jit::CustomClassHolder { + private: + c10::impl::GenericList unpacked_; + + public: + Conv2dPackedContext( + const Tensor& weight, + const c10::optional& bias, + const 
IntArrayRef stride_arg, + const IntArrayRef padding_arg, + const IntArrayRef dilation_arg, + const bool transposed, + const bool quantized, + const IntArrayRef output_padding_arg, + const int64_t groups, + const c10::optional& output_min = c10::nullopt, + const c10::optional& output_max = c10::nullopt); + + /* + * Assigns a name to each index in the unpacked list. + */ + struct Unpacked final { + static constexpr uint32_t Weight = 0u; + static constexpr uint32_t Bias = 1u; + static constexpr uint32_t Stride = 2u; + static constexpr uint32_t Padding = 3u; + static constexpr uint32_t Dilation = 4u; + static constexpr uint32_t isTransposed = 5u; + static constexpr uint32_t isQuantized = 6u; + static constexpr uint32_t OutputPadding = 7u; + static constexpr uint32_t Groups = 8u; + static constexpr uint32_t OutputMin = 9u; + static constexpr uint32_t OutputMax = 10u; + + static constexpr uint32_t NumArgs = 11u; + }; + + /* + * Assigns a name to each index in the packed list. + */ + struct Packed final { + static constexpr uint32_t Weight = 0u; + static constexpr uint32_t Bias = 1u; + static constexpr uint32_t FilterSizes = 2u; + static constexpr uint32_t Stride = 3u; + static constexpr uint32_t Padding = 4u; + static constexpr uint32_t OutputPadding = 5u; + static constexpr uint32_t Dilation = 6u; + static constexpr uint32_t isTransposed = 7u; + static constexpr uint32_t isQuantized = 8u; + static constexpr uint32_t Groups = 9u; + static constexpr uint32_t OutputMin = 10u; + static constexpr uint32_t OutputMax = 11u; + static constexpr uint32_t ConvMethod = 12u; + static constexpr uint32_t WeightSizes = 13u; + + static constexpr uint32_t NumArgs = 14u; + }; + + static Conv2dPackedContext pack(c10::impl::GenericList); + + const c10::impl::GenericList unpack() const override { + TORCH_CHECK(unpacked_.size() > 0u, "unpacked_ does not have any elements!"); + + return unpacked_; + } +}; + +c10::intrusive_ptr create_conv2d_context( + Tensor&& weight, + c10::optional&& bias, + std::vector&& stride, + std::vector&& padding, + std::vector&& dilation, const int64_t groups, const c10::optional& output_min = c10::nullopt, const c10::optional& output_max = c10::nullopt); -Tensor conv2d_context_run( - const Tensor& input_arg, - const c10::impl::GenericList& packed_context, - const c10::impl::GenericList& unpacked_context); +Tensor run_conv2d_context( + const Tensor& input, + const c10::intrusive_ptr& context); + +c10::intrusive_ptr create_tconv2d_context( + Tensor&& weight, + c10::optional&& bias, + std::vector&& stride, + std::vector&& padding, + std::vector&& output_padding, + std::vector&& dilation, + const int64_t groups, + const c10::optional& output_min = c10::nullopt, + const c10::optional& output_max = c10::nullopt); -Tensor run_conv2d_clamp_context( +Tensor run_tconv2d_context( const Tensor& input, - const c10::intrusive_ptr& context); + const c10::intrusive_ptr& context); -c10::intrusive_ptr create_conv2d_clamp_context( +c10::intrusive_ptr create_qconv2d_context( Tensor&& weight, c10::optional&& bias, std::vector&& stride, std::vector&& padding, std::vector&& dilation, const int64_t groups, - const c10::optional& output_min, - const c10::optional& output_max); + const c10::optional& output_min = c10::nullopt, + const c10::optional& output_max = c10::nullopt); + +Tensor run_qconv2d_context( + const Tensor& input_arg, + double scale, + int64_t zero_point, + const c10::intrusive_ptr& conv_context); // Backwards compatibility class Conv2dOpContext final : public torch::jit::CustomClassHolder { @@ -99,8 
+163,8 @@ class Conv2dOpContext final : public torch::jit::CustomClassHolder { State unpack() const; private: - explicit Conv2dOpContext(VulkanOpContext vulkan_context); - VulkanOpContext vulkan_context_; + explicit Conv2dOpContext(Conv2dPackedContext conv_context); + Conv2dPackedContext conv_context_; }; Tensor conv2d_clamp_run( diff --git a/aten/src/ATen/native/vulkan/ops/Copy.cpp b/aten/src/ATen/native/vulkan/ops/Copy.cpp index fb4db712a8ad9..dbac25e0c7ee3 100644 --- a/aten/src/ATen/native/vulkan/ops/Copy.cpp +++ b/aten/src/ATen/native/vulkan/ops/Copy.cpp @@ -1,4 +1,4 @@ -#include +#include #include namespace at { @@ -6,12 +6,112 @@ namespace native { namespace vulkan { namespace ops { -void copy_vulkan_to_vulkan(vTensor& src, vTensor& dst) { +// +// Utility functions for memcpy +// + +void memcpy_to_mapping(const Tensor& src, api::MemoryMap& dst_mapping) { + if (src.dtype() == at::kFloat) { + memcpy_to_mapping_impl(src, dst_mapping); + } else if (src.dtype() == at::kHalf) { + memcpy_to_mapping_impl(src, dst_mapping); + } else if (src.dtype() == c10::kQUInt8) { + memcpy_to_mapping_impl(src, dst_mapping); + } else { + TORCH_CHECK( + false, + "Invalid Data Type: expected c10::QUint8, at::kHalf or at::Float but got ", + src.dtype()); + } +} + +void memcpy_from_mapping(api::MemoryMap& src_mapping, Tensor& dst) { + if (dst.dtype() == at::kFloat) { + memcpy_from_mapping_impl(src_mapping, dst); + } else if (dst.dtype() == at::kHalf) { + memcpy_from_mapping_impl(src_mapping, dst); + } else if (dst.dtype() == c10::kQUInt8) { + memcpy_from_mapping_impl(src_mapping, dst); + } else { + TORCH_CHECK( + false, + "Invalid Data Type: expected c10::QUint8, at::kHalf or Float but got ", + dst.dtype()); + } +} + +// +// CPU <-> GPU copy implementations (these functions use Transfer commands) +// + +void transfer_cpu_to_vulkan(const Tensor& src, vTensor& v_dst) { + api::Context* const context = api::context(); + + // Convert to dtype corresponding to the image format of the texture to + // ensure that byte alignment is consistent when copying. In some cases + // a 16 bit format will be used for at::kFloat. + Tensor src_nc4hw = utils::nchw_to_nc4hw(src).to(v_dst.texture_dtype()); + + api::StorageBuffer staging(context, v_dst.texture_dtype(), v_dst.numcells()); + // Copy data into the staging buffer + { + api::MemoryMap mapping(staging.buffer(), api::MemoryAccessType::WRITE); + mapping.invalidate(); + + memcpy_to_mapping(src_nc4hw, mapping); + } + + api::PipelineBarrier pipeline_barrier{}; + utils::copy_buffer_to_vtensor(staging.buffer(), v_dst, pipeline_barrier); +} + +void transfer_vulkan_to_cpu(vTensor& v_src, Tensor& dst) { + api::Context* const context = api::context(); + + // Temporary tensor to receive copied NC4HW data + at::Tensor dst_tmp = utils::create_staging_tensor(v_src); + + api::StorageBuffer staging(context, v_src.texture_dtype(), v_src.numcells()); + + api::VulkanFence fence = context->fences().get_fence(); + + { + // Refer to comment in submit_compute_job. When syncing with the GPU, the + // context must not allow other threads to record dispatches into it between + // between calling vkQueueSubmit and flushing the context. Therefore, + // cmd_mutex_ must be manually managed by the calling thread. 
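+    // Illustrative outline (added for exposition, not part of this patch):
+    // the block below takes the dispatch lock, records and submits the
+    // texture-to-buffer copy with a fence, waits on that fence, and only then
+    // flushes the context, so the staging buffer is mapped for reading after
+    // the GPU has finished writing it.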
+ std::unique_lock context_lock(context->dispatch_lock()); + + api::PipelineBarrier pipeline_barrier{}; + utils::copy_vtensor_to_buffer( + v_src, staging.buffer(), pipeline_barrier, fence.get_submit_handle()); + + fence.wait(); + + context->flush(); + // cmd_mutex_ will be released when exiting this scope. + } + + // Copy data from buffer back to CPU tensor. + { + api::MemoryMap mapping(staging.buffer(), api::MemoryAccessType::READ); + mapping.invalidate(); + + memcpy_from_mapping(mapping, dst_tmp); + } + + context->fences().return_fence(fence); + + dst = + utils::nc4hw_to_nchw(dst_tmp, v_src.sizes()).to(v_src.options().dtype()); +} + +void transfer_vulkan_to_vulkan(vTensor& src, vTensor& dst) { api::Context* const context = api::context(); api::PipelineBarrier pipeline_barrier{}; - context->submit_texture_copy( + context->submit_copy( // pipeline barrier pipeline_barrier, // images @@ -28,34 +128,42 @@ void copy_vulkan_to_vulkan(vTensor& src, vTensor& dst) { VK_NULL_HANDLE); } -void copy_cpu_to_vulkan(const Tensor& src, vTensor& dst) { +// +// CPU <-> GPU copy implementations (these functions use compute shaders) +// + +void pack_cpu_to_vulkan(const Tensor& src, vTensor& dst) { api::Context* const context = api::context(); - api::StagingBuffer staging(context, dst.buffer_bytes()); + // Note that the float data type has been enforced for the storage buffer + // below. The reason for this is that the nchw_to_image and image_to_nchw + // shaders which perform the transfer to/from an image texture expect a buffer + // of floats as input. GLSL/Vulkan does not natively support 16 bit arithmetic + // types, so for now storage buffers created for compute shaders must define + // floats as their base data type. + api::StorageBuffer staging(context, at::kFloat, dst.numcells()); { api::MemoryMap mapping(staging.buffer(), api::MemoryAccessType::WRITE); - if (src.dtype() == c10::kQUInt8) { - c10::quint8* data_ptr = mapping.template data(); - memcpy( - data_ptr, - src.contiguous().data_ptr(), - std::min(src.nbytes(), src.nbytes())); + // If the dtype() of src is at::kHalf, then first convert it to 32 bit + // float. This is required since the nchw_to_image shader uses a float + // buffer as input (note that at::kFloat is used to create the StorageBuffer + // above). + if (src.dtype() == at::kHalf) { + memcpy_to_mapping(src.to(at::kFloat), mapping); } else { - float* data_ptr = mapping.template data(); - memcpy( - data_ptr, - src.contiguous().data_ptr(), - std::min(src.nbytes(), src.nbytes())); + memcpy_to_mapping(src, mapping); } } utils::pack_staging_to_vtensor(staging.buffer(), dst); } -void copy_vulkan_to_cpu(vTensor& src, Tensor& dst) { +void pack_vulkan_to_cpu(vTensor& src, Tensor& dst) { api::Context* const context = api::context(); - api::StagingBuffer staging(context, src.buffer_bytes()); + // Refer to the comment in pack_cpu_to_vulkan for why at::kFloat is specified + // for the storage buffer below. + api::StorageBuffer staging(context, at::kFloat, src.numcells()); api::VulkanFence fence = context->fences().get_fence(); @@ -80,46 +188,41 @@ void copy_vulkan_to_cpu(vTensor& src, Tensor& dst) { api::MemoryMap mapping(staging.buffer(), api::MemoryAccessType::READ); mapping.invalidate(); - if (dst.is_quantized()) { - c10::quint8* data_ptr = mapping.template data(); - memcpy( - dst.data_ptr(), - data_ptr, - std::min(src.nbytes(), dst.nbytes())); + // If the dtype() of dst is at::kHalf, then copy the data into a float + // version of it first, similar to pack_cpu_to_vulkan(). 
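+    // Illustrative note (added for exposition, not part of this patch):
+    // because the staging buffer above is created with at::kFloat, a dst of
+    // dtype at::kHalf is filled via a temporary float tensor (the float
+    // overload of memcpy_from_mapping) and then narrowed back to 16-bit with
+    // .to(at::kHalf), rather than copying half-precision bytes directly.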
+ if (dst.dtype() == at::kHalf) { + Tensor dst_float = dst.to(at::kFloat); + memcpy_from_mapping(mapping, dst_float); + dst = dst_float.to(at::kHalf); } else { - float* data_ptr = mapping.template data(); - memcpy( - dst.data_ptr(), - data_ptr, - std::min(src.nbytes(), dst.nbytes())); + memcpy_from_mapping(mapping, dst); } } context->fences().return_fence(fence); } -Tensor& copy_(Tensor& self, const Tensor& src) { +// +// Copy op implementations +// + +Tensor& copy_(Tensor& dst, const Tensor& src) { // Check that sizes are equal TORCH_CHECK( - self.sizes() == src.sizes(), - "Vulkan copy_: Tensor sizes are mismatched!"); + dst.sizes() == src.sizes(), "Vulkan copy_: Tensor sizes are mismatched!"); // X -> Vulkan - if (at::kVulkan == self.device().type()) { - vTensor& v_self = convert(self); + if (at::kVulkan == dst.device().type()) { + vTensor& v_self = convert(dst); // Vulkan -> Vulkan if (at::kVulkan == src.device().type()) { vTensor& v_src = convert(src); - copy_vulkan_to_vulkan(v_src, v_self); + transfer_vulkan_to_vulkan(v_src, v_self); } // CPU -> Vulkan else { - TORCH_CHECK( - src.dtype() == c10::kQUInt8 || src.dtype() == at::kFloat, - "Invalid Data Type: expected QUint8 or Float but got ", - src.dtype()); - copy_cpu_to_vulkan(src, v_self); + pack_cpu_to_vulkan(src, v_self); } } // Vulkan -> X @@ -127,12 +230,8 @@ Tensor& copy_(Tensor& self, const Tensor& src) { vTensor& v_src = convert(src); // Vulkan -> CPU - if (self.device().is_cpu()) { - TORCH_CHECK( - self.dtype() == c10::kQUInt8 || self.dtype() == at::kFloat, - "Invalid Data Type: expected QUint8 or Float but got ", - self.dtype()); - copy_vulkan_to_cpu(v_src, self); + if (dst.device().is_cpu()) { + pack_vulkan_to_cpu(v_src, dst); } else { TORCH_CHECK(false, "Unsupported!"); } @@ -143,7 +242,7 @@ Tensor& copy_(Tensor& self, const Tensor& src) { "was expected to be Vulkan a tensor! 
Incorrect dispatch?"); } - return self; + return dst; } } // namespace ops diff --git a/aten/src/ATen/native/vulkan/ops/Copy.h b/aten/src/ATen/native/vulkan/ops/Copy.h index e69af06357c5a..1493af6e629bd 100644 --- a/aten/src/ATen/native/vulkan/ops/Copy.h +++ b/aten/src/ATen/native/vulkan/ops/Copy.h @@ -9,7 +9,37 @@ namespace native { namespace vulkan { namespace ops { -Tensor& copy_(Tensor& self, const Tensor& src); +void transfer_cpu_to_vulkan(const Tensor&, vTensor&); + +void transfer_vulkan_to_cpu(vTensor&, Tensor&); + +Tensor& copy_(Tensor& dst, const Tensor& src); + +// +// Utility functions for memcpy +// + +template +void memcpy_to_mapping_impl(const Tensor& src, api::MemoryMap& dst_mapping) { + T* data_ptr = dst_mapping.template data(); + memcpy( + data_ptr, + src.contiguous().data_ptr(), + std::min(src.nbytes(), dst_mapping.nbytes())); +} + +template +void memcpy_from_mapping_impl(api::MemoryMap& src_mapping, Tensor& dst) { + T* data_ptr = src_mapping.template data(); + memcpy( + dst.data_ptr(), + data_ptr, + std::min(src_mapping.nbytes(), dst.nbytes())); +} + +void memcpy_to_mapping(const Tensor& src, api::MemoryMap& dst_mapping); + +void memcpy_from_mapping(api::MemoryMap& src_mapping, Tensor& dst); } // namespace ops } // namespace vulkan diff --git a/aten/src/ATen/native/vulkan/ops/Glu.cpp b/aten/src/ATen/native/vulkan/ops/Glu.cpp index 1778813bce57b..1a1f58b6dce5d 100644 --- a/aten/src/ATen/native/vulkan/ops/Glu.cpp +++ b/aten/src/ATen/native/vulkan/ops/Glu.cpp @@ -16,7 +16,7 @@ Tensor glu(const at::Tensor& input_arg, const int64_t dim = -1) { "Vulkan glu only supports GLU for dim = 1, but got dim = ", dim); TORCH_CHECK( - channels_size(input_arg) % 2 == 0, + get_dim(input_arg) % 2 == 0, "Vulkan glu expects channel dim to be multiple of 2!"); const Tensor input = input_arg.is_vulkan() ? input_arg : input_arg.vulkan(); diff --git a/aten/src/ATen/native/vulkan/ops/Gru.cpp b/aten/src/ATen/native/vulkan/ops/Gru.cpp index e29c6b59fd9fb..9be247499d416 100644 --- a/aten/src/ATen/native/vulkan/ops/Gru.cpp +++ b/aten/src/ATen/native/vulkan/ops/Gru.cpp @@ -1,6 +1,5 @@ #include #include -#include #include namespace at { @@ -9,14 +8,19 @@ namespace vulkan { namespace ops { namespace { // -// input_vk: input tensor of shape (L, N, H_in) when batch_first=False -// (N, L, H_in) when batch_first=True containing -// the features of the input sequence -// hx_vk: initial hidden state for each element in the batch. tensor of shape (D -// * num_layers, N, H_out) output: tensor of shape (N, L, D * H_out)) when -// batch_first=True h_n: tensor of shape (D * num_layers, N, H_out) +// input_vk: input tensor containing the features of the input sequence +// tensor of shape (N, L, H_in) when batch_first=True +// (L, N, H_in) when batch_first=False // -// where +// hx_vk: initial hidden state for each element in the batch. 
+// tensor of shape (D * num_layers, N, H_out) +// +// output: tensor of shape (N, L, D * H_out) when batch_first=True +// (L, N, D * H_out) when batch_first=False +// +// h_n: tensor of shape (D * num_layers, N, H_out) +// +// where // L = sequence length // N = batch size // D = 2 if bidirectional=True otherwise 1 @@ -46,18 +50,22 @@ std::tuple gru_input( TORCH_INTERNAL_ASSERT(!train, "Vulkan gru expects 'train' to be false."); TORCH_INTERNAL_ASSERT( !bidirectional, "Vulkan gru expects 'bidirectional' to be false."); - TORCH_INTERNAL_ASSERT( - batch_first, "Vulkan gru expects 'batch_first' to be true."); TORCH_INTERNAL_ASSERT( dropout < std::numeric_limits::epsilon() * 1000, "Vulkan gru expects 'dropout' to be 0.0."); + const auto batch_size = input_vk.size(0); + const auto seq_length = input_vk.size(1); + + TORCH_INTERNAL_ASSERT( + (batch_size == 1 && seq_length == 1) || batch_first, + "Vulkan gru expects batch-first input"); + const auto hidden_size = hx_vk.size(2); std::vector h_n_list; // hidden output // reshape to 2D due to Vulkan at::mm op accepts only 2D - auto x = - input_vk.reshape({input_vk.size(0) * input_vk.size(1), input_vk.size(2)}); + auto x = input_vk.reshape({batch_size * seq_length, input_vk.size(2)}); for (int64_t i = 0; i < num_layers; ++i) { // extract each hidden state and squeeze into 2D dim @@ -100,6 +108,7 @@ std::tuple gru_input( } auto h_n = at::cat(h_n_list, 1); + x = x.reshape({batch_size, seq_length, x.size(1)}); h_n = h_n.reshape({h_n.size(0) * h_n.size(1), h_n.size(2), h_n.size(3)}); return std::tuple(x, h_n); } @@ -114,13 +123,19 @@ TORCH_LIBRARY_IMPL(aten, Vulkan, m) { } // namespace -std::vector> pack_linear_op_contexts( +std::vector> pack_linear_op_contexts( const std::vector& params_cpu, int64_t num_layers) { TORCH_CHECK( static_cast(params_cpu.size()) == 4 * num_layers, - "Vulkan gru expects 'params_cpu' size to be 4 * 'num_layers'."); - std::vector> linear_op_contexts; + "Vulkan gru expects 'params_cpu' size to be 4 * 'num_layers'." 
+ " But 'params_cpu' has size: ", + params_cpu.size(), + " and 'num_layers' is: ", + num_layers); + std::vector> linear_op_contexts; + linear_op_contexts.reserve(num_layers * 6); + for (int64_t i = 0; i < num_layers; ++i) { const auto& w_ih = params_cpu.at(i * 4); const auto& w_hh = params_cpu.at(i * 4 + 1); @@ -156,7 +171,7 @@ std::vector> pack_linear_op_contexts( return linear_op_contexts; } -VulkanOpContext gru_context_create( +GruPackedContext::GruPackedContext( const std::vector& params_cpu, // weights/biases (cpu) bool has_biases, int64_t num_layers, @@ -169,99 +184,151 @@ VulkanOpContext gru_context_create( TORCH_INTERNAL_ASSERT(!train, "Vulkan gru expects 'train' to be false."); TORCH_INTERNAL_ASSERT( !bidirectional, "Vulkan gru expects 'bidirectional' to be false."); - TORCH_INTERNAL_ASSERT( - batch_first, "Vulkan gru expects 'batch_first' to be true."); TORCH_INTERNAL_ASSERT( dropout < std::numeric_limits::epsilon() * 1000, "Vulkan gru expects 'dropout' to be 0.0."); - c10::impl::GenericList packed_context{c10::AnyType::get()}; - packed_context.reserve(7); - packed_context.emplace_back(pack_linear_op_contexts(params_cpu, num_layers)); - packed_context.emplace_back(has_biases); - packed_context.emplace_back(num_layers); - packed_context.emplace_back(dropout); - packed_context.emplace_back(train); - packed_context.emplace_back(bidirectional); - packed_context.emplace_back(batch_first); - - c10::impl::GenericList unpacked_context{c10::AnyType::get()}; - unpacked_context.reserve(7); - unpacked_context.emplace_back(params_cpu); - unpacked_context.emplace_back(has_biases); - unpacked_context.emplace_back(num_layers); - unpacked_context.emplace_back(dropout); - unpacked_context.emplace_back(train); - unpacked_context.emplace_back(bidirectional); - unpacked_context.emplace_back(batch_first); - - return VulkanOpContext::create(packed_context, unpacked_context); + packed_.reserve(Packed::NumArgs); + packed_.emplace_back(pack_linear_op_contexts(params_cpu, num_layers)); + packed_.emplace_back(has_biases); + packed_.emplace_back(num_layers); + packed_.emplace_back(dropout); + packed_.emplace_back(train); + packed_.emplace_back(bidirectional); + packed_.emplace_back(batch_first); +} + +GruPackedContext GruPackedContext::pack(c10::impl::GenericList unpacked) { + return GruPackedContext( + unpacked.get(Unpacked::Params).toTensorVector(), + unpacked.get(Unpacked::hasBiases).toBool(), + unpacked.get(Unpacked::NumLayers).toInt(), + unpacked.get(Unpacked::Dropout).toDouble(), + unpacked.get(Unpacked::Train).toBool(), + unpacked.get(Unpacked::Bidirectional).toBool(), + unpacked.get(Unpacked::BatchFirst).toBool()); +} + +const c10::impl::GenericList GruPackedContext::unpack() const { + c10::impl::GenericList unpacked_gru_context{c10::AnyType::get()}; + unpacked_gru_context.reserve(Unpacked::NumArgs); + + const c10::List packed_linear_contexts = + get_val(Packed::LinearContexts).toList(); + + const int64_t num_layers = get_val(Packed::NumLayers).toInt(); + const int64_t linear_contexts_per_layer = 6; + + std::vector params_cpu; + params_cpu.reserve(num_layers * linear_contexts_per_layer); + + for (c10::IValue packed_linear_context : packed_linear_contexts) { + const c10::impl::GenericList unpacked_linear_context = + packed_linear_context.toCustomClass()->unpack(); + + TORCH_CHECK( + unpacked_linear_context.size() > 0u, + "unpacked_linear_context does not have any elements!"); + + params_cpu.emplace_back( + unpacked_linear_context.get(LinearPackedContext::Unpacked::Weight) + .toTensor() + .t()); + 
params_cpu.emplace_back( + unpacked_linear_context.get(LinearPackedContext::Unpacked::Bias) + .toTensor()); + } + unpacked_gru_context.emplace_back(params_cpu); + for (int64_t i = 1; i < Unpacked::NumArgs; ++i) { + unpacked_gru_context.emplace_back(get_val(i)); + } + + return unpacked_gru_context; } -std::tuple gru_context_run( +c10::intrusive_ptr create_gru_context( + std::vector&& params_cpu, + bool has_biases, + int64_t num_layers, + double dropout, + bool train, + bool bidirectional, + bool batch_first) { + return c10::make_intrusive(GruPackedContext( + params_cpu, + has_biases, + num_layers, + dropout, + train, + bidirectional, + batch_first)); +} + +std::tuple run_gru_context( const Tensor& input_vk, // input sequence (vulkan) const Tensor& hx_vk, // initial hidden state (vulkan) - const c10::impl::GenericList& packed_context, - const c10::impl::GenericList& unpacked_context) { + const c10::intrusive_ptr& gru_context) { TORCH_INTERNAL_ASSERT( input_vk.sizes().size() == 3, "Vulkan gru expects 'input_vk' dims to be 3."); TORCH_INTERNAL_ASSERT( hx_vk.sizes().size() == 3, "Vulkan gru expects 'hx_vk' dims to be 3."); - const c10::List packed_linear_op_contexts = - packed_context.get(0).toList(); - const int64_t packed_num_layers = packed_context.get(2).toInt(); + const int64_t num_layers = + gru_context->get_val(GruPackedContext::Packed::NumLayers).toInt(); + const bool batch_first = + gru_context->get_val(GruPackedContext::Packed::BatchFirst).toBool(); + const auto batch_size = input_vk.size(0); + const auto seq_length = input_vk.size(1); + + TORCH_INTERNAL_ASSERT( + (batch_size == 1 && seq_length == 1) || batch_first, + "Vulkan gru expects batch-first input"); + + const c10::List packed_linear_contexts = + gru_context->get_val(GruPackedContext::Packed::LinearContexts).toList(); - const int64_t linear_op_contexts_per_layer = - 6; // (b_ir, w_ir), (b_hr, w_hr), (b_iz, w_iz), (b_hz, w_hz), (b_in, - // w_in), (b_hn, w_hn) + const int64_t linear_contexts_per_layer = 6; + // (b_ir, w_ir), (b_hr, w_hr), (b_iz, w_iz), + // (b_hz, w_hz), (b_in,cw_in), (b_hn, w_hn) std::vector h_n_list; // hidden output // reshape to 2D due to Vulkan at::mm op accepts only 2D - auto x = - input_vk.reshape({input_vk.size(0) * input_vk.size(1), input_vk.size(2)}); + auto x = input_vk.reshape({batch_size * seq_length, input_vk.size(2)}); - for (int64_t i = 0; i < packed_num_layers; ++i) { + for (int64_t i = 0; i < num_layers; ++i) { // extract each hidden state and squeeze into 2D dim auto h = at::slice(hx_vk, 0, i, i + 1, 1); h = h.reshape({h.size(0) * h.size(1), h.size(2)}); const auto& cxt_ir = - packed_linear_op_contexts[i * linear_op_contexts_per_layer + 0] - .toCustomClass(); + packed_linear_contexts[i * linear_contexts_per_layer + 0] + .toCustomClass(); const auto& cxt_hr = - packed_linear_op_contexts[i * linear_op_contexts_per_layer + 1] - .toCustomClass(); + packed_linear_contexts[i * linear_contexts_per_layer + 1] + .toCustomClass(); const auto& cxt_iz = - packed_linear_op_contexts[i * linear_op_contexts_per_layer + 2] - .toCustomClass(); + packed_linear_contexts[i * linear_contexts_per_layer + 2] + .toCustomClass(); const auto& cxt_hz = - packed_linear_op_contexts[i * linear_op_contexts_per_layer + 3] - .toCustomClass(); + packed_linear_contexts[i * linear_contexts_per_layer + 3] + .toCustomClass(); const auto& cxt_in = - packed_linear_op_contexts[i * linear_op_contexts_per_layer + 4] - .toCustomClass(); + packed_linear_contexts[i * linear_contexts_per_layer + 4] + .toCustomClass(); const auto& cxt_hn = - 
packed_linear_op_contexts[i * linear_op_contexts_per_layer + 5] - .toCustomClass(); + packed_linear_contexts[i * linear_contexts_per_layer + 5] + .toCustomClass(); const auto& r = at::sigmoid( - linear_context_run( - x, cxt_ir->get_packed(), cxt_ir->get_unpacked(), 1.0f, 1.0f) + - linear_context_run( - h, cxt_hr->get_packed(), cxt_hr->get_unpacked(), 1.0f, 1.0f)); + run_linear_context(x, cxt_ir) + run_linear_context(h, cxt_hr)); + // cxt_ir->run(x, 1.0f, 1.0f) + cxt_hr->run(h, 1.0f, 1.0f)); const auto& z = at::sigmoid( - linear_context_run( - x, cxt_iz->get_packed(), cxt_iz->get_unpacked(), 1.0f, 1.0f) + - linear_context_run( - h, cxt_hz->get_packed(), cxt_hz->get_unpacked(), 1.0f, 1.0f)); + run_linear_context(x, cxt_iz) + run_linear_context(h, cxt_hz)); + // cxt_iz->run(x, 1.0f, 1.0f) + cxt_hz->run(h, 1.0f, 1.0f)); const auto& n = at::tanh( - linear_context_run( - x, cxt_in->get_packed(), cxt_in->get_unpacked(), 1.0f, 1.0f) + - r * - (linear_context_run( - h, cxt_hn->get_packed(), cxt_hn->get_unpacked(), 1.0f, 1.0f))); + run_linear_context(x, cxt_in) + r * run_linear_context(h, cxt_hn)); + // cxt_in->run(x, 1.0f, 1.0f) + r * (cxt_hn->run(h, 1.0f, 1.0f))); h = (z * (-1) + 1) * n + z * h; x = h; // next input h_n_list.emplace_back( @@ -269,118 +336,11 @@ std::tuple gru_context_run( } auto h_n = at::cat(h_n_list, 1); + x = x.reshape({batch_size, seq_length, x.size(1)}); h_n = h_n.reshape({h_n.size(0) * h_n.size(1), h_n.size(2), h_n.size(3)}); return std::tuple(x, h_n); } -c10::intrusive_ptr create_gru_context( - std::vector&& params_cpu, - bool has_biases, - int64_t num_layers, - double dropout, - bool train, - bool bidirectional, - bool batch_first) { - return c10::make_intrusive(gru_context_create( - params_cpu, - has_biases, - num_layers, - dropout, - train, - bidirectional, - batch_first)); -} - -std::tuple run_gru_context( - const Tensor& input_vk, - const Tensor& hx_vk, - const c10::intrusive_ptr& vulkan_context) { - return gru_context_run( - input_vk, - hx_vk, - vulkan_context->get_packed(), - vulkan_context->get_unpacked()); -} - -/* Backwards compatibility */ -GruOpContext::GruOpContext(VulkanOpContext vulkan_context) - : vulkan_context_{std::move(vulkan_context)} {} - -GruOpContext GruOpContext::create( - const std::vector& params_cpu, // weights/biases (cpu) - bool has_biases, - int64_t num_layers, - double dropout, - bool train, - bool bidirectional, - bool batch_first) { - return GruOpContext{gru_context_create( - params_cpu, - has_biases, - num_layers, - dropout, - train, - bidirectional, - batch_first)}; -} - -std::tuple GruOpContext::run( - const Tensor& input_vk, // input sequence (vulkan) - const Tensor& hx_vk) const { // initial hidden state (vulkan) - return gru_context_run( - input_vk, - hx_vk, - vulkan_context_.get_packed(), - vulkan_context_.get_unpacked()); -} - -GruOpContext::State GruOpContext::unpack() const { - const c10::impl::GenericList unpacked_ = - std::get<1>(vulkan_context_.get_state()); - const std::vector unpacked_params_cpu = - unpacked_.get(0).toTensorVector(); - const bool unpacked_has_biases = unpacked_.get(1).toBool(); - const int64_t unpacked_num_layers = unpacked_.get(2).toInt(); - const double unpacked_dropout = unpacked_.get(3).toDouble(); - const bool unpacked_train = unpacked_.get(4).toBool(); - const bool unpacked_bidirectional = unpacked_.get(5).toBool(); - const bool unpacked_batch_first = unpacked_.get(6).toBool(); - return GruOpContext::State{ - unpacked_params_cpu, - unpacked_has_biases, - unpacked_num_layers, - unpacked_dropout, - 
unpacked_train, - unpacked_bidirectional, - unpacked_batch_first, - }; -} - -c10::intrusive_ptr gru_prepack( - std::vector&& params_cpu, - bool has_biases, - int64_t num_layers, - double dropout, - bool train, - bool bidirectional, - bool batch_first) { - return c10::make_intrusive(GruOpContext::create( - params_cpu, - has_biases, - num_layers, - dropout, - train, - bidirectional, - batch_first)); -} - -std::tuple gru_run( - const Tensor& input_vk, - const Tensor& hx_vk, - const c10::intrusive_ptr& context) { - return context->run(input_vk, hx_vk); -} - } // namespace ops } // namespace vulkan } // namespace native diff --git a/aten/src/ATen/native/vulkan/ops/Gru.h b/aten/src/ATen/native/vulkan/ops/Gru.h index 304ce822a0e9a..922ac02fc2d09 100644 --- a/aten/src/ATen/native/vulkan/ops/Gru.h +++ b/aten/src/ATen/native/vulkan/ops/Gru.h @@ -3,7 +3,7 @@ #ifdef USE_VULKAN_API #include -#include +#include #include namespace at { @@ -11,81 +11,10 @@ namespace native { namespace vulkan { namespace ops { -// packed -// std::vector> linear_op_contexts; // -// {{ op context for b_ir, w_ir, op context for b_hr, w_hr, -// // -// op -// context -// for -// b_iz, -// w_iz, -// op -// context -// for -// b_hz, -// w_hz, -// // -// op -// context -// for -// b_in, -// w_in, -// op -// context -// for -// b_hn, -// w_hn,}, -// ...} -// bool has_biases{}; -// int64_t num_layers{}; -// double dropout{}; -// bool train{}; -// bool bidirectional{}; -// bool batch_first{}; - -// unpacked -// std::vector params_cpu // weights/biases (cpu) -// bool has_biases -// int64_t num_layers -// double dropout -// bool train -// bool bidirectional -// bool batch_first - -VulkanOpContext gru_context_create( - const std::vector& params_cpu, // weights/biases (cpu) - bool has_biases, - int64_t num_layers, - double dropout, - bool train, - bool bidirectional, - bool batch_first); - -std::tuple gru_context_run( - const Tensor& input_vk, // input sequence (vulkan) - const Tensor& hx_vk, // initial hidden state (vulkan) - const c10::impl::GenericList& packed_context, - const c10::impl::GenericList& unpacked_context); - -c10::intrusive_ptr create_gru_context( - std::vector&& params_cpu, // weights/biases (cpu) - bool has_biases, - int64_t num_layers, - double dropout, - bool train, - bool bidirectional, - bool batch_first); - -std::tuple run_gru_context( - const Tensor& input_vk, - const Tensor& hx_vk, - const c10::intrusive_ptr& vulkan_context); - -// Backwards compatibility -class GruOpContext final : public torch::jit::CustomClassHolder { +class GruPackedContext final : virtual public VulkanPackedContext, + public torch::jit::CustomClassHolder { public: - static GruOpContext create( + GruPackedContext( const std::vector& params_cpu, // weights/biases (cpu) bool has_biases, int64_t num_layers, @@ -94,19 +23,42 @@ class GruOpContext final : public torch::jit::CustomClassHolder { bool bidirectional, bool batch_first); - using State = - std::tuple, bool, int64_t, double, bool, bool, bool>; - - std::tuple run(const Tensor& input_vk, const Tensor& hx_vk) - const; - State unpack() const; - - private: - explicit GruOpContext(VulkanOpContext vulkan_context); - VulkanOpContext vulkan_context_; + /* + * Assigns a name to each index in the unpacked list. 
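+   *
+   * For example (illustrative): GruPackedContext::pack() reads the number of
+   * layers back out of an unpacked list with
+   * unpacked.get(Unpacked::NumLayers).toInt().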
+ */ + struct Unpacked final { + static constexpr uint32_t Params = 0u; + static constexpr uint32_t hasBiases = 1u; + static constexpr uint32_t NumLayers = 2u; + static constexpr uint32_t Dropout = 3u; + static constexpr uint32_t Train = 4u; + static constexpr uint32_t Bidirectional = 5u; + static constexpr uint32_t BatchFirst = 6u; + + static constexpr uint32_t NumArgs = 7u; + }; + + /* + * Assigns a name to each index in the packed list. + */ + struct Packed final { + static constexpr uint32_t LinearContexts = 0u; + static constexpr uint32_t hasBiases = 1u; + static constexpr uint32_t NumLayers = 2u; + static constexpr uint32_t Dropout = 3u; + static constexpr uint32_t Train = 4u; + static constexpr uint32_t Bidirectional = 5u; + static constexpr uint32_t BatchFirst = 6u; + + static constexpr uint32_t NumArgs = 7u; + }; + + static GruPackedContext pack(c10::impl::GenericList); + + const c10::impl::GenericList unpack() const override; }; -c10::intrusive_ptr gru_prepack( +c10::intrusive_ptr create_gru_context( std::vector&& params_cpu, // weights/biases (cpu) bool has_biases, int64_t num_layers, @@ -115,10 +67,10 @@ c10::intrusive_ptr gru_prepack( bool bidirectional, bool batch_first); -std::tuple gru_run( +std::tuple run_gru_context( const Tensor& input_vk, const Tensor& hx_vk, - const c10::intrusive_ptr& context); + const c10::intrusive_ptr& vulkan_context); } // namespace ops } // namespace vulkan diff --git a/aten/src/ATen/native/vulkan/ops/Lerp.cpp b/aten/src/ATen/native/vulkan/ops/Lerp.cpp index 67240f64b2ccd..28921ef897640 100644 --- a/aten/src/ATen/native/vulkan/ops/Lerp.cpp +++ b/aten/src/ATen/native/vulkan/ops/Lerp.cpp @@ -11,18 +11,18 @@ using namespace api::utils; void check_inputs_elementwise_op(const Tensor& input1, const Tensor& input2) { TORCH_CHECK( - channels_size(input1) == channels_size(input2), + get_dim(input1) == get_dim(input2), "Vulkan elementwise ops require channel dimension to be equal!"); - if (batch_size(input1) != batch_size(input2)) { + if (get_dim(input1) != get_dim(input2)) { TORCH_CHECK( - channels_size(input1) % 4 == 0, + get_dim(input1) % 4 == 0, "Vulkan elementwise ops require channel to be a multiple of 4 to broadcast along batch dimension!") } - const uint32_t input1_h = height_size(input1); - const uint32_t input1_w = width_size(input1); - const uint32_t input2_h = height_size(input2); - const uint32_t input2_w = width_size(input2); + const uint32_t input1_h = get_dim(input1); + const uint32_t input1_w = get_dim(input1); + const uint32_t input2_h = get_dim(input2); + const uint32_t input2_w = get_dim(input2); const std::string broadcast_error_msg = "Incompatible input dimensions for broadcasting for Vulkan elementwise op!"; diff --git a/aten/src/ATen/native/vulkan/ops/Lstm.cpp b/aten/src/ATen/native/vulkan/ops/Lstm.cpp index c86583621c3bb..831b97175d45c 100644 --- a/aten/src/ATen/native/vulkan/ops/Lstm.cpp +++ b/aten/src/ATen/native/vulkan/ops/Lstm.cpp @@ -1,6 +1,5 @@ -#include +#include #include -#include #include namespace at { @@ -9,19 +8,25 @@ namespace vulkan { namespace ops { namespace { // -// input_vk: input tensor of shape (L, N, H_in) when batch_first=False or (N, L, -// H_in) when batch_first=True -// containing the features of the input sequence +// input_vk: input tensor of shape (L, N, H_in) when batch_first=False or +// (N, L, H_in) when batch_first=True containing the features of the input +// sequence +// // hx_vk: tensor of shape (D * num_layers, N, H_out) containing the initial -// hidden state for each element in the input 
sequence. cx_vk: tensor of shape -// (D * num_layers, N, H_cell) containing the initial cell state for each -// element in the input sequence. output: tensor of shape (L, N, D * H_out) when -// batch_first=False or (N, L, D * H_out) when batch_first=True -// containing the output features (h_t) from the last layer of the LSTM, -// for each t +// hidden state for each element in the input sequence. +// +// cx_vk: tensor of shape (D * num_layers, N, H_cell) containing the initial +// cell state for each element in the input sequence. +// +// output: tensor of shape (L, N, D * H_out) when batch_first=False or +// (N, L, D * H_out) when batch_first=True, containing the output features +// (h_t) from the last layer of the LSTM, for each t +// // h_n: tensor of shape (D * num_layers, N, H_out) containing the final hidden -// state for each element in the sequence. c_n: tensor of shape (D * num_layers, -// N, H_cell) containing the final cell state for each element in the sequence. +// state for each element in the sequence. +// +// c_n: tensor of shape (D * num_layers, N, H_cell) containing the final cell +// state for each element in the sequence. // // where // L = sequence length @@ -61,12 +66,17 @@ std::tuple lstm_input( TORCH_INTERNAL_ASSERT(!train, "Vulkan LSTM expects 'train' to be false."); TORCH_INTERNAL_ASSERT( !bidirectional, "Vulkan LSTM expects 'bidirectional' to be false."); - TORCH_INTERNAL_ASSERT( - batch_first, "Vulkan LSTM expects 'batch_first' to be true."); TORCH_INTERNAL_ASSERT( dropout < std::numeric_limits::epsilon() * 1000, "Vulkan LSTM expects 'dropout' to be 0.0."); + const auto batch_size = input_vk.size(0); + const auto seq_length = input_vk.size(1); + + TORCH_INTERNAL_ASSERT( + (batch_size == 1 && seq_length == 1) || batch_first, + "Vulkan gru expects batch-first input"); + const Tensor& hx_vk = hx[0]; const Tensor& cx_vk = hx[1]; @@ -75,8 +85,7 @@ std::tuple lstm_input( std::vector c_n_list; // cell state output // reshape to 2D due to Vulkan at::mm op accepts only 2D - auto x = - input_vk.reshape({input_vk.size(0) * input_vk.size(1), input_vk.size(2)}); + auto x = input_vk.reshape({batch_size * seq_length, input_vk.size(2)}); h_n_list.reserve(num_layers); c_n_list.reserve(num_layers); @@ -135,6 +144,7 @@ std::tuple lstm_input( auto h_n = at::cat(h_n_list, 1); auto c_n = at::cat(c_n_list, 1); + x = x.reshape({batch_size, seq_length, x.size(1)}); h_n = h_n.reshape({h_n.size(0) * h_n.size(1), h_n.size(2), h_n.size(3)}); c_n = c_n.reshape({c_n.size(0) * c_n.size(1), c_n.size(2), c_n.size(3)}); return std::tuple(x, h_n, c_n); @@ -150,13 +160,18 @@ TORCH_LIBRARY_IMPL(aten, Vulkan, m) { } // namespace -std::vector> pack_lstm_linear_op_contexts( +std::vector> +pack_lstm_linear_op_contexts( const std::vector& params_cpu, int64_t num_layers) { TORCH_CHECK( static_cast(params_cpu.size()) == 4 * num_layers, - "Vulkan LSTM expects 'params_cpu' size to be 4 * 'num_layers'."); - std::vector> linear_op_contexts; + "Vulkan LSTM expects 'params_cpu' size to be 4 * 'num_layers'." 
+ " But 'params_cpu' has size: ", + params_cpu.size(), + " and 'num_layers' is: ", + num_layers); + std::vector> linear_op_contexts; linear_op_contexts.reserve(num_layers * 8); for (int64_t l = 0; l < num_layers; ++l) { @@ -200,7 +215,7 @@ std::vector> pack_lstm_linear_op_contexts( return linear_op_contexts; } -VulkanOpContext lstm_context_create( +LstmPackedContext::LstmPackedContext( const std::vector& params_cpu, // weights/biases (cpu) bool has_biases, int64_t num_layers, @@ -213,42 +228,91 @@ VulkanOpContext lstm_context_create( TORCH_INTERNAL_ASSERT(!train, "Vulkan LSTM expects 'train' to be false."); TORCH_INTERNAL_ASSERT( !bidirectional, "Vulkan LSTM expects 'bidirectional' to be false."); - TORCH_INTERNAL_ASSERT( - batch_first, "Vulkan LSTM expects 'batch_first' to be true."); TORCH_INTERNAL_ASSERT( dropout < std::numeric_limits::epsilon() * 1000, "Vulkan LSTM expects 'dropout' to be 0.0."); - c10::impl::GenericList packed_context{c10::AnyType::get()}; - packed_context.reserve(7); - packed_context.emplace_back( - pack_lstm_linear_op_contexts(params_cpu, num_layers)); - packed_context.emplace_back(has_biases); - packed_context.emplace_back(num_layers); - packed_context.emplace_back(dropout); - packed_context.emplace_back(train); - packed_context.emplace_back(bidirectional); - packed_context.emplace_back(batch_first); - - c10::impl::GenericList unpacked_context{c10::AnyType::get()}; - unpacked_context.reserve(7); - unpacked_context.emplace_back(params_cpu); - unpacked_context.emplace_back(has_biases); - unpacked_context.emplace_back(num_layers); - unpacked_context.emplace_back(dropout); - unpacked_context.emplace_back(train); - unpacked_context.emplace_back(bidirectional); - unpacked_context.emplace_back(batch_first); - - return VulkanOpContext::create(packed_context, unpacked_context); + packed_.reserve(Packed::NumArgs); + packed_.emplace_back(pack_lstm_linear_op_contexts(params_cpu, num_layers)); + packed_.emplace_back(has_biases); + packed_.emplace_back(num_layers); + packed_.emplace_back(dropout); + packed_.emplace_back(train); + packed_.emplace_back(bidirectional); + packed_.emplace_back(batch_first); +} + +LstmPackedContext LstmPackedContext::pack(c10::impl::GenericList unpacked) { + return LstmPackedContext( + unpacked.get(Unpacked::Params).toTensorVector(), + unpacked.get(Unpacked::hasBiases).toBool(), + unpacked.get(Unpacked::NumLayers).toInt(), + unpacked.get(Unpacked::Dropout).toDouble(), + unpacked.get(Unpacked::Train).toBool(), + unpacked.get(Unpacked::Bidirectional).toBool(), + unpacked.get(Unpacked::BatchFirst).toBool()); +} + +const c10::impl::GenericList LstmPackedContext::unpack() const { + c10::impl::GenericList unpacked_lstm_context{c10::AnyType::get()}; + unpacked_lstm_context.reserve(Unpacked::NumArgs); + + const c10::List packed_linear_contexts = + get_val(Packed::LinearContexts).toList(); + + const int64_t num_layers = get_val(Packed::NumLayers).toInt(); + const int64_t linear_contexts_per_layer = 8; + + std::vector params_cpu; + params_cpu.reserve(num_layers * linear_contexts_per_layer); + + for (c10::IValue packed_linear_context : packed_linear_contexts) { + const c10::impl::GenericList unpacked_linear_context = + packed_linear_context.toCustomClass()->unpack(); + + TORCH_CHECK( + unpacked_linear_context.size() > 0u, + "unpacked_linear_context does not have any elements!"); + + params_cpu.emplace_back( + unpacked_linear_context.get(LinearPackedContext::Unpacked::Weight) + .toTensor() + .t()); + params_cpu.emplace_back( + 
unpacked_linear_context.get(LinearPackedContext::Unpacked::Bias) + .toTensor()); + } + unpacked_lstm_context.emplace_back(params_cpu); + for (int64_t i = 1; i < 7; ++i) { + unpacked_lstm_context.emplace_back(get_val(i)); + } + + return unpacked_lstm_context; } -std::tuple lstm_context_run( +c10::intrusive_ptr create_lstm_context( + std::vector&& params_cpu, + bool has_biases, + int64_t num_layers, + double dropout, + bool train, + bool bidirectional, + bool batch_first) { + return c10::make_intrusive(LstmPackedContext( + params_cpu, + has_biases, + num_layers, + dropout, + train, + bidirectional, + batch_first)); +} + +std::tuple run_lstm_context( const Tensor& input_vk, // input sequence (vulkan) const Tensor& hx_vk, // initial hidden state (vulkan) const Tensor& cx_vk, // initial cell state (vulkan) - const c10::impl::GenericList& packed_context, - const c10::impl::GenericList& unpacked_context) { + const c10::intrusive_ptr& lstm_context) { TORCH_INTERNAL_ASSERT( input_vk.sizes().size() == 3, "Vulkan LSTM expects input dims to be 3."); TORCH_INTERNAL_ASSERT( @@ -258,24 +322,34 @@ std::tuple lstm_context_run( cx_vk.sizes().size() == 3, "Vulkan LSTM expects cell state dims to be 3."); + const int64_t num_layers = + lstm_context->get_val(LstmPackedContext::Packed::NumLayers).toInt(); + const bool batch_first = + lstm_context->get_val(LstmPackedContext::Packed::BatchFirst).toBool(); + const auto batch_size = input_vk.size(0); + const auto seq_length = input_vk.size(1); + + TORCH_INTERNAL_ASSERT( + (batch_size == 1 && seq_length == 1) || batch_first, + "Vulkan gru expects batch-first input"); + const c10::List packed_linear_op_contexts = - packed_context.get(0).toList(); - const int64_t packed_num_layers = packed_context.get(2).toInt(); + lstm_context->get_val(LstmPackedContext::Packed::LinearContexts).toList(); + + const int64_t linear_op_contexts_per_layer = 8; + // (b_ii, w_ii), (b_hi, w_hi), (b_if, w_if), (b_hf, w_hf), + // (b_ig, w_ig), (b_hg, w_hg), (b_io, w_io), (b_ho, w_ho) - const int64_t linear_op_contexts_per_layer = - 8; // (b_ii, w_ii), (b_hi, w_hi), (b_if, w_if), (b_hf, w_hf), (b_ig, - // w_ig), (b_hg, w_hg), (b_io, w_io), (b_ho, w_ho) std::vector h_n_list; // hidden state output std::vector c_n_list; // cell state output // reshape to 2D due to Vulkan at::mm op accepts only 2D - auto x = - input_vk.reshape({input_vk.size(0) * input_vk.size(1), input_vk.size(2)}); + auto x = input_vk.reshape({batch_size * seq_length, input_vk.size(2)}); - h_n_list.reserve(packed_num_layers); - c_n_list.reserve(packed_num_layers); + h_n_list.reserve(num_layers); + c_n_list.reserve(num_layers); - for (int64_t l = 0; l < packed_num_layers; ++l) { + for (int64_t l = 0; l < num_layers; ++l) { // extract each hidden state and squeeze into 2D dim auto h = at::slice(hx_vk, 0, l, l + 1, 1); h = h.reshape({h.size(0) * h.size(1), h.size(2)}); @@ -285,49 +359,41 @@ std::tuple lstm_context_run( const auto& cxt_ii = packed_linear_op_contexts[l * linear_op_contexts_per_layer + 0] - .toCustomClass(); + .toCustomClass(); const auto& cxt_hi = packed_linear_op_contexts[l * linear_op_contexts_per_layer + 1] - .toCustomClass(); + .toCustomClass(); const auto& cxt_if = packed_linear_op_contexts[l * linear_op_contexts_per_layer + 2] - .toCustomClass(); + .toCustomClass(); const auto& cxt_hf = packed_linear_op_contexts[l * linear_op_contexts_per_layer + 3] - .toCustomClass(); + .toCustomClass(); const auto& cxt_ig = packed_linear_op_contexts[l * linear_op_contexts_per_layer + 4] - .toCustomClass(); + .toCustomClass(); 
const auto& cxt_hg = packed_linear_op_contexts[l * linear_op_contexts_per_layer + 5] - .toCustomClass(); + .toCustomClass(); const auto& cxt_io = packed_linear_op_contexts[l * linear_op_contexts_per_layer + 6] - .toCustomClass(); + .toCustomClass(); const auto& cxt_ho = packed_linear_op_contexts[l * linear_op_contexts_per_layer + 7] - .toCustomClass(); + .toCustomClass(); const auto& i = at::sigmoid( - linear_context_run( - x, cxt_ii->get_packed(), cxt_ii->get_unpacked(), 1.0f, 1.0f) + - linear_context_run( - h, cxt_hi->get_packed(), cxt_hi->get_unpacked(), 1.0f, 1.0f)); + run_linear_context(x, cxt_ii) + run_linear_context(h, cxt_hi)); + // cxt_ii->run(x, 1.0f, 1.0f) + cxt_hi->run(h, 1.0f, 1.0f)); const auto& f = at::sigmoid( - linear_context_run( - x, cxt_if->get_packed(), cxt_if->get_unpacked(), 1.0f, 1.0f) + - linear_context_run( - h, cxt_hf->get_packed(), cxt_hf->get_unpacked(), 1.0f, 1.0f)); - const auto& g = at::tanh( - linear_context_run( - x, cxt_ig->get_packed(), cxt_ig->get_unpacked(), 1.0f, 1.0f) + - linear_context_run( - h, cxt_hg->get_packed(), cxt_hg->get_unpacked(), 1.0f, 1.0f)); + run_linear_context(x, cxt_if) + run_linear_context(h, cxt_hf)); + // cxt_if->run(x, 1.0f, 1.0f) + cxt_hf->run(h, 1.0f, 1.0f)); + const auto& g = + at::tanh(run_linear_context(x, cxt_ig) + run_linear_context(h, cxt_hg)); + // cxt_ig->run(x, 1.0f, 1.0f) + cxt_hg->run(h, 1.0f, 1.0f)); const auto& o = at::sigmoid( - linear_context_run( - x, cxt_io->get_packed(), cxt_io->get_unpacked(), 1.0f, 1.0f) + - linear_context_run( - h, cxt_ho->get_packed(), cxt_ho->get_unpacked(), 1.0f, 1.0f)); + run_linear_context(x, cxt_io) + run_linear_context(h, cxt_ho)); + // cxt_io->run(x, 1.0f, 1.0f) + cxt_ho->run(h, 1.0f, 1.0f)); c = f * c + i * g; h = o * at::tanh(c); x = h; // next input @@ -339,42 +405,12 @@ std::tuple lstm_context_run( auto h_n = at::cat(h_n_list, 1); auto c_n = at::cat(c_n_list, 1); + x = x.reshape({batch_size, seq_length, x.size(1)}); h_n = h_n.reshape({h_n.size(0) * h_n.size(1), h_n.size(2), h_n.size(3)}); c_n = c_n.reshape({c_n.size(0) * c_n.size(1), c_n.size(2), c_n.size(3)}); return std::tuple(x, h_n, c_n); } -c10::intrusive_ptr create_lstm_context( - std::vector&& params_cpu, - bool has_biases, - int64_t num_layers, - double dropout, - bool train, - bool bidirectional, - bool batch_first) { - return c10::make_intrusive(lstm_context_create( - params_cpu, - has_biases, - num_layers, - dropout, - train, - bidirectional, - batch_first)); -} - -std::tuple run_lstm_context( - const Tensor& input_vk, // input sequence (vulkan) - const Tensor& hx_vk, // initial hidden state (vulkan) - const Tensor& cx_vk, // initial cell state (vulkan) - const c10::intrusive_ptr& vulkan_context) { - return lstm_context_run( - input_vk, - hx_vk, - cx_vk, - vulkan_context->get_packed(), - vulkan_context->get_unpacked()); -} - } // namespace ops } // namespace vulkan } // namespace native diff --git a/aten/src/ATen/native/vulkan/ops/Lstm.h b/aten/src/ATen/native/vulkan/ops/Lstm.h index e793ad1d00a75..5f4006c67d2f3 100644 --- a/aten/src/ATen/native/vulkan/ops/Lstm.h +++ b/aten/src/ATen/native/vulkan/ops/Lstm.h @@ -3,7 +3,7 @@ #ifdef USE_VULKAN_API #include -#include +#include #include namespace at { @@ -11,60 +11,54 @@ namespace native { namespace vulkan { namespace ops { -// packed -// std::vector> linear_op_contexts; // -// {{ op context for b_ii, w_ii, op context for b_hi, w_hi, -// // -// op -// context -// for -// b_if, -// w_if, -// op -// context -// for -// b_hf, -// w_hf, -// // -// op -// context -// for -// 
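Similarly, the i/f/g/o gates computed above in run_lstm_context follow the standard LSTM cell equations, with run_linear_context(x, cxt_ii) evaluating W_{ii} x + b_{ii} and so on (a sketch of the math only):

```
i_t = \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi})
f_t = \sigma(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf})
g_t = \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{t-1} + b_{hg})
o_t = \sigma(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho})
c_t = f_t \odot c_{t-1} + i_t \odot g_t
h_t = o_t \odot \tanh(c_t)
```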
b_ig, -// w_ig, -// op -// context -// for -// b_hg, -// w_hg, -// // -// op -// context -// for -// b_io, -// w_io, -// op -// context -// for -// b_ho, -// w_ho,}, -// ...} -// bool has_biases{}; -// int64_t num_layers{}; -// double dropout{}; -// bool train{}; -// bool bidirectional{}; -// bool batch_first{}; +class LstmPackedContext final : virtual public VulkanPackedContext, + public torch::jit::CustomClassHolder { + public: + LstmPackedContext( + const std::vector& params_cpu, // weights/biases (cpu) + bool has_biases, + int64_t num_layers, + double dropout, + bool train, + bool bidirectional, + bool batch_first); -// unpacked -// std::vector params_cpu // weights/biases (cpu) -// bool has_biases -// int64_t num_layers -// double dropout -// bool train -// bool bidirectional -// bool batch_first + /* + * Assigns a name to each index in the unpacked list. + */ + struct Unpacked final { + static constexpr uint32_t Params = 0u; + static constexpr uint32_t hasBiases = 1u; + static constexpr uint32_t NumLayers = 2u; + static constexpr uint32_t Dropout = 3u; + static constexpr uint32_t Train = 4u; + static constexpr uint32_t Bidirectional = 5u; + static constexpr uint32_t BatchFirst = 6u; -c10::intrusive_ptr create_lstm_context( + static constexpr uint32_t NumArgs = 7u; + }; + + /* + * Assigns a name to each index in the packed list. + */ + struct Packed final { + static constexpr uint32_t LinearContexts = 0u; + static constexpr uint32_t hasBiases = 1u; + static constexpr uint32_t NumLayers = 2u; + static constexpr uint32_t Dropout = 3u; + static constexpr uint32_t Train = 4u; + static constexpr uint32_t Bidirectional = 5u; + static constexpr uint32_t BatchFirst = 6u; + + static constexpr uint32_t NumArgs = 7u; + }; + + static LstmPackedContext pack(c10::impl::GenericList); + + const c10::impl::GenericList unpack() const override; +}; + +c10::intrusive_ptr create_lstm_context( std::vector&& params_cpu, // weights/biases (cpu) bool has_biases, int64_t num_layers, @@ -77,7 +71,7 @@ std::tuple run_lstm_context( const Tensor& input_vk, // input sequence (vulkan) const Tensor& hx_vk, // initial hidden state (vulkan) const Tensor& cx_vk, // initial cell state (vulkan) - const c10::intrusive_ptr& vulkan_context); + const c10::intrusive_ptr& vulkan_context); } // namespace ops } // namespace vulkan diff --git a/aten/src/ATen/native/vulkan/ops/Mm.cpp b/aten/src/ATen/native/vulkan/ops/Mm.cpp index 0587a0a95a0ae..80b6ccb34ade6 100644 --- a/aten/src/ATen/native/vulkan/ops/Mm.cpp +++ b/aten/src/ATen/native/vulkan/ops/Mm.cpp @@ -1,6 +1,5 @@ #include #include -#include #include namespace at { @@ -42,7 +41,7 @@ vTensor pack_weights(const Tensor& weight_arg) { weight.options(), }; - api::StagingBuffer staging(context, v_weight.buffer_bytes()); + api::StorageBuffer staging(context, at::kFloat, v_weight.numcells()); { api::MemoryMap mapping(staging.buffer(), api::MemoryAccessType::WRITE); @@ -105,7 +104,7 @@ vTensor pack_biases( bias_arg->options(), }; - api::StagingBuffer staging(context, v_bias.buffer_bytes()); + api::StorageBuffer staging(context, at::kFloat, v_bias.numcells()); { api::MemoryMap mapping(staging.buffer(), api::MemoryAccessType::WRITE); @@ -134,7 +133,7 @@ vTensor pack_biases( weight_arg.options(), }; - api::StagingBuffer staging(context, v_bias.buffer_bytes()); + api::StorageBuffer staging(context, at::kFloat, v_bias.numcells()); { api::MemoryMap mapping(staging.buffer(), api::MemoryAccessType::WRITE); @@ -178,40 +177,15 @@ bool available(const Tensor& weight, const c10::optional& bias) { 
true; } -bool usable( - const Tensor& input, - const Tensor& weight, - const c10::optional& /* bias */) { +bool usable(const Tensor& input, const IntArrayRef unpacked_weight_sizes) { return (2 == input.ndimension()) && (c10::DeviceType::Vulkan == input.device().type()) && (kFloat == input.scalar_type()) && (input.size(Layout::Parameter::width) == - weight.size(Layout::Parameter::height)) && + unpacked_weight_sizes[Layout::Parameter::height]) && !input.requires_grad() && true; } -VulkanOpContext context_create( - const Tensor& weight, - const c10::optional& bias) { - TORCH_CHECK( - available(weight, bias), - "Vulkan Linear not available! " - "Reason: The provided (weight, bias) parameters are either invalid " - "individually or their combination is not supported by Vulkan Impl."); - - c10::impl::GenericList packed_context{c10::AnyType::get()}; - packed_context.reserve(2); - packed_context.emplace_back(convert(pack_weights(weight))); - packed_context.emplace_back(convert(pack_biases(weight, bias))); - - c10::impl::GenericList unpacked_context{c10::AnyType::get()}; - unpacked_context.reserve(2); - unpacked_context.emplace_back(weight); - unpacked_context.emplace_back(bias); - - return VulkanOpContext::create(packed_context, unpacked_context); -} - static Tensor reshape_to_2d(const Tensor& input_arg) { TORCH_CHECK( input_arg.dim() >= 2, @@ -222,12 +196,11 @@ static Tensor reshape_to_2d(const Tensor& input_arg) { return input_arg.reshape({d, input_arg.size(-1)}); } -Tensor context_run( +Tensor run_addmm_context( const Tensor& input_arg, - const c10::impl::GenericList& packed_context, - const c10::impl::GenericList& unpacked_context, const float alpha, - const float beta) { + const float beta, + const c10::intrusive_ptr& linear_context) { api::Context* const context = api::context(); const Tensor input_arg_2d = @@ -236,15 +209,19 @@ Tensor context_run( input_arg_2d.is_vulkan() ? input_arg_2d : input_arg_2d.vulkan(); const vTensor& v_input = convert(input); - const vTensor& packed_v_weight = convert(packed_context.get(0).toTensor()); - const vTensor& packed_v_bias = convert(packed_context.get(1).toTensor()); - const Tensor& unpacked_weight = unpacked_context.get(0).toTensor(); - const c10::optional& unpacked_bias = - unpacked_context.get(1).isTensor() ? unpacked_context.get(1).toTensor() - : c10::optional(); + const vTensor& packed_v_weight = convert( + linear_context->get_val(LinearPackedContext::Packed::Weight).toTensor()); + const vTensor& packed_v_bias = convert( + linear_context->get_val(LinearPackedContext::Packed::Bias).toTensor()); + const std::vector unpacked_weight_sizes = + linear_context->get_val(LinearPackedContext::Packed::WeightSizes) + .toIntVector(); + const bool bias_defined = + linear_context->get_val(LinearPackedContext::Packed::BiasDefined) + .toBool(); TORCH_CHECK( - usable(input, unpacked_weight, unpacked_bias), + usable(input, unpacked_weight_sizes), "Vulkan Linear not usable! 
" "Reason: The provided input tensor is either invalid on its own, or its " "combination with the provided weight and bias tensors are unsupported by " @@ -254,12 +231,12 @@ Tensor context_run( context, { v_input.sizes()[Layout::Parameter::height], - unpacked_weight.sizes()[Layout::Parameter::width], + unpacked_weight_sizes[Layout::Parameter::width], }, input.options(), }; - if (unpacked_bias && unpacked_bias->defined()) { + if (bias_defined) { const struct { uvec3 size; int32_t K; @@ -285,7 +262,7 @@ Tensor context_run( // global work group size { safe_downcast(div_up( - unpacked_weight.sizes()[Layout::Parameter::width], INT64_C(2))), + unpacked_weight_sizes[Layout::Parameter::width], INT64_C(2))), safe_downcast( div_up(v_input.sizes()[Layout::Parameter::height], INT64_C(2))), 1, @@ -325,7 +302,7 @@ Tensor context_run( // global work group size { safe_downcast(div_up( - unpacked_weight.sizes()[Layout::Parameter::width], INT64_C(2))), + unpacked_weight_sizes[Layout::Parameter::width], INT64_C(2))), safe_downcast( div_up(v_input.sizes()[Layout::Parameter::height], INT64_C(2))), 1, @@ -364,26 +341,21 @@ Tensor addmm( const Tensor& weight, const Scalar& beta, const Scalar& alpha) { - VulkanOpContext vulkan_context = context_create(weight, bias); - - return context_run( + return run_addmm_context( input, - vulkan_context.get_packed(), - vulkan_context.get_unpacked(), alpha.to(), - beta.to()); + beta.to(), + c10::make_intrusive( + LinearPackedContext(weight, bias))); } Tensor mm(const Tensor& mat1_arg, const Tensor& mat2_arg) { - VulkanOpContext vulkan_context = - context_create(mat2_arg, c10::optional()); - - return context_run( + return run_addmm_context( mat1_arg, - vulkan_context.get_packed(), - vulkan_context.get_unpacked(), 1.0f, - 1.0f); + 1.0f, + c10::make_intrusive( + LinearPackedContext(mat2_arg, c10::optional()))); } #ifdef USE_VULKAN_API @@ -397,82 +369,46 @@ TORCH_LIBRARY_IMPL(aten, Vulkan, m) { } // namespace -VulkanOpContext linear_context_create( +LinearPackedContext::LinearPackedContext( const Tensor& weight, - const c10::optional& bias) { - return context_create(weight, bias); -} - -Tensor linear_context_run( - const Tensor& input_arg, - const c10::impl::GenericList& packed_context, - const c10::impl::GenericList& unpacked_context, - const float alpha, - const float beta) { - return context_run(input_arg, packed_context, unpacked_context, alpha, beta); -} - -c10::intrusive_ptr create_linear_context( - Tensor&& weight, - c10::optional&& bias) { - return c10::make_intrusive( - linear_context_create(weight, bias)); -} - -Tensor run_linear_context( - const Tensor& input, - const c10::intrusive_ptr& vulkan_context) { - return linear_context_run( - input, - vulkan_context->get_packed(), - vulkan_context->get_unpacked(), - 1.0f, - 1.0f); -} - -/* Backwards compatibility */ -LinearOpContext::LinearOpContext(VulkanOpContext vulkan_context) - : vulkan_context_{std::move(vulkan_context)} {} + const c10::optional& bias) + : unpacked_{c10::AnyType::get()} { + TORCH_CHECK( + available(weight, bias), + "Vulkan Linear not available! 
" + "Reason: The provided (weight, bias) parameters are either invalid " + "individually or their combination is not supported by Vulkan Impl."); -LinearOpContext LinearOpContext::create( - const Tensor& weight, - const c10::optional& bias) { - return LinearOpContext{linear_context_create(weight, bias)}; -} + packed_.reserve(Packed::NumArgs); + packed_.emplace_back(convert(pack_weights(weight))); + packed_.emplace_back(convert(pack_biases(weight, bias))); + packed_.emplace_back(weight.sizes()); + packed_.emplace_back(bias && bias->defined()); -Tensor LinearOpContext::run( - const Tensor& input_arg, - const float alpha, - const float beta) const { - return linear_context_run( - input_arg, - vulkan_context_.get_packed(), - vulkan_context_.get_unpacked(), - alpha, - beta); + if (!at::globalContext().releaseWeightsWhenPrepacking()) { + unpacked_.reserve(Unpacked::NumArgs); + unpacked_.emplace_back(weight); + unpacked_.emplace_back(bias); + } } -LinearOpContext::State LinearOpContext::unpack() const { - const c10::impl::GenericList unpacked_ = - std::get<1>(vulkan_context_.get_state()); - const Tensor unpacked_weight = unpacked_.get(0).toTensor(); - const c10::optional unpacked_bias = unpacked_.get(1).isTensor() - ? unpacked_.get(1).toTensor() - : c10::optional(); - return LinearOpContext::State{unpacked_weight, unpacked_bias}; +LinearPackedContext LinearPackedContext::pack(c10::impl::GenericList unpacked) { + return LinearPackedContext( + unpacked.get(Unpacked::Weight).toTensor(), + get_optional_tensor(unpacked, Unpacked::Bias)); } -c10::intrusive_ptr linear_prepack( +c10::intrusive_ptr create_linear_context( Tensor&& weight, c10::optional&& bias) { - return c10::make_intrusive( - LinearOpContext::create(std::move(weight), std::move(bias))); + return c10::make_intrusive( + LinearPackedContext(weight, bias)); } -Tensor linear_run( +Tensor run_linear_context( const Tensor& input, - const c10::intrusive_ptr& context) { - return context->run(input, 1.0, 1.0); + const c10::intrusive_ptr& linear_context) { + return run_addmm_context(input, 1.0f, 1.0f, linear_context); } } // namespace ops diff --git a/aten/src/ATen/native/vulkan/ops/Mm.h b/aten/src/ATen/native/vulkan/ops/Mm.h index 4d573b575bd40..17909eab6d4e6 100644 --- a/aten/src/ATen/native/vulkan/ops/Mm.h +++ b/aten/src/ATen/native/vulkan/ops/Mm.h @@ -3,7 +3,7 @@ #ifdef USE_VULKAN_API #include -#include +#include #include namespace at { @@ -11,57 +11,52 @@ namespace native { namespace vulkan { namespace ops { -// packed -// vTensor v_weight -// vTensor v_bias - -// unpacked -// Tensor weight -// c10::optional bias +class LinearPackedContext final : virtual public VulkanPackedContext, + public torch::jit::CustomClassHolder { + private: + c10::impl::GenericList unpacked_; -VulkanOpContext linear_context_create( - const Tensor& weight, - const c10::optional& bias); + public: + LinearPackedContext(const Tensor& weight, const c10::optional& bias); -Tensor linear_context_run( - const Tensor& input_arg, - const c10::impl::GenericList& packed_context, - const c10::impl::GenericList& unpacked_context, - const float alpha, - const float beta); + /* + * Assigns a name to each index in the unpacked list. 
+ */ + struct Unpacked final { + static constexpr uint32_t Weight = 0u; + static constexpr uint32_t Bias = 1u; -c10::intrusive_ptr create_linear_context( - Tensor&& weight, - c10::optional&& bias); + static constexpr uint32_t NumArgs = 2u; + }; -Tensor run_linear_context( - const Tensor& input, - const c10::intrusive_ptr& context); + /* + * Assigns a name to each index in the packed list. + */ + struct Packed final { + static constexpr uint32_t Weight = 0u; + static constexpr uint32_t Bias = 1u; + static constexpr uint32_t WeightSizes = 2u; + static constexpr uint32_t BiasDefined = 3u; -// Backwards compatibility -class LinearOpContext final : public torch::jit::CustomClassHolder { - public: - static LinearOpContext create( - const Tensor& weight, - const c10::optional& bias); + static constexpr uint32_t NumArgs = 4u; + }; - using State = std::tuple>; + static LinearPackedContext pack(c10::impl::GenericList); - Tensor run(const Tensor& input, float beta, float alpha) const; - State unpack() const; + const c10::impl::GenericList unpack() const override { + TORCH_CHECK(unpacked_.size() > 0u, "unpacked_ does not have any elements!"); - private: - explicit LinearOpContext(VulkanOpContext vulkan_context); - VulkanOpContext vulkan_context_; + return unpacked_; + } }; -c10::intrusive_ptr linear_prepack( +c10::intrusive_ptr create_linear_context( Tensor&& weight, c10::optional&& bias); -Tensor linear_run( +Tensor run_linear_context( const Tensor& input, - const c10::intrusive_ptr& context); + const c10::intrusive_ptr& context); } // namespace ops } // namespace vulkan diff --git a/aten/src/ATen/native/vulkan/ops/QuantizedConvolution.cpp b/aten/src/ATen/native/vulkan/ops/QuantizedConvolution.cpp deleted file mode 100644 index 283967fb9087a..0000000000000 --- a/aten/src/ATen/native/vulkan/ops/QuantizedConvolution.cpp +++ /dev/null @@ -1,648 +0,0 @@ -#include -#include -#include -#include -#include -#include -#include -#include - -namespace at { -namespace native { -namespace vulkan { -namespace ops { -namespace { - -using namespace api::utils; -using namespace at::native::vulkan::ops; - -inline bool is_depthwise(const IntArrayRef filter, const int64_t groups) { - return (filter[Layout::Filter::output] == groups) && - // Only K == 1 supported. 
- (filter[Layout::Filter::input] == 1); -} - -inline bool is_pointwise(const IntArrayRef filter) { - return (1 == filter[Layout::Filter::height]) && - (1 == filter[Layout::Filter::width]); -} - -bool all_lessthan(const IntArrayRef arr, const int t) { - bool retval = true; - for (const auto i : c10::irange(arr.size())) { - retval = retval && (arr[i] < t); - } - return retval; -} - -Conv2dQMethod determine_method( - const IntArrayRef filter, - const IntArrayRef stride, - const IntArrayRef padding, - const IntArrayRef dilation, - const int64_t groups) { - if (is_depthwise(filter, groups)) - return Conv2dQDepthwise; - if (is_pointwise(filter)) - return Conv2dQPointwise; - return Conv2dQSlidingWindow; -} - -vTensor pack_weights_dw_q(api::Context* const context, const Tensor& weight) { - /* Source */ - const IntArrayRef src_filter = weight.sizes(); - const c10::quint8* const src_weight_ptr = weight.data_ptr(); - - const int64_t src_kw_sz = src_filter[Layout::Filter::width]; - const int64_t src_kh_sz = src_filter[Layout::Filter::height]; - const int64_t src_kernel_sz = src_kw_sz * src_kh_sz; - const int64_t src_block_sz = - src_kernel_sz * src_filter[Layout::Filter::input]; - const int64_t num_stacks = - div_up(src_filter[Layout::Filter::output], INT64_C(4)); - - /* Destination */ - const int64_t dst_kw_sz = src_kernel_sz; - const int64_t dst_kh_sz = num_stacks; - const int64_t dst_kernel_sz = dst_kw_sz * dst_kh_sz; - - vTensor v_weight{ - context, - { - 4, - dst_kh_sz, - dst_kw_sz, - }, - weight.options(), - weight.q_scale(), - weight.q_zero_point(), - }; - api::StagingBuffer staging(context, v_weight.buffer_bytes()); - { - api::MemoryMap mapping(staging.buffer(), api::MemoryAccessType::WRITE); - - c10::quint8* dst_weight_ptr = mapping.template data(); - - memset(dst_weight_ptr, 0, v_weight.nbytes()); - - for (const auto src_oc : c10::irange(src_filter[Layout::Filter::output])) { - /* Source */ - const c10::quint8* const src_weight_oc_ptr = - src_weight_ptr + src_oc * src_block_sz; - - /* Destination */ - const int64_t dst_oh = src_oc / 4; - const int64_t dst_c = src_oc % 4; - - c10::quint8* const dst_weight_c_ptr = - dst_weight_ptr + dst_c * dst_kernel_sz + dst_oh * dst_kw_sz; - - for (const auto src_ih : - c10::irange(src_filter[Layout::Filter::height])) { - memcpy( - dst_weight_c_ptr + src_ih * src_kw_sz, - src_weight_oc_ptr + src_ih * src_kw_sz, - sizeof(c10::quint8) * src_kw_sz); - } - } - } - ops::utils::pack_staging_to_vtensor(staging.buffer(), v_weight); - - return v_weight; -} - -vTensor pack_weights_2d_q(api::Context* const context, const Tensor& weight) { - /* Source */ - const IntArrayRef src_filter = weight.sizes(); - const c10::quint8* const src_weight_ptr = weight.data_ptr(); - - const int64_t src_kw_sz = src_filter[Layout::Filter::width]; - const int64_t src_kh_sz = src_filter[Layout::Filter::height]; - const int64_t src_kernel_sz = src_kw_sz * src_kh_sz; - const int64_t src_block_sz = - src_kernel_sz * src_filter[Layout::Filter::input]; - - const int64_t num_stacks = - div_up(src_filter[Layout::Filter::output], INT64_C(4)); - const int64_t stack_depth = - api::utils::align_up(src_filter[Layout::Filter::input], INT64_C(4)); - - /* Destination */ - const int64_t dst_kw_sz = src_kw_sz * stack_depth; - const int64_t dst_kh_sz = src_kh_sz * num_stacks; - const int64_t dst_kernel_sz = dst_kw_sz * dst_kh_sz; - - vTensor v_weight{ - context, - { - 4, - dst_kh_sz, - dst_kw_sz, - }, - weight.options(), - weight.q_scale(), - weight.q_zero_point(), - }; - - api::StagingBuffer 
staging(context, v_weight.buffer_bytes()); - { - api::MemoryMap mapping(staging.buffer(), api::MemoryAccessType::WRITE); - - c10::quint8* dst_weight_ptr = mapping.template data(); - - memset(dst_weight_ptr, 0, v_weight.nbytes()); - - for (const auto src_oc : c10::irange(src_filter[Layout::Filter::output])) { - /* Source */ - const c10::quint8* const src_weight_oc_ptr = - src_weight_ptr + src_oc * src_block_sz; - - /* Destination */ - const int64_t dst_oh = src_oc / 4; - const int64_t dst_c = src_oc % 4; - - c10::quint8* const dst_weight_c_ptr = - dst_weight_ptr + dst_c * dst_kernel_sz; - - for (const auto src_ic : c10::irange(src_filter[Layout::Filter::input])) { - const int64_t dst_ic4 = src_ic / 4; - - for (const auto src_ih : c10::irange(src_kh_sz)) { - for (const auto src_iw : c10::irange(src_kw_sz)) { - memcpy( - dst_weight_c_ptr + (dst_oh * src_kh_sz + src_ih) * dst_kw_sz + - dst_ic4 * src_kw_sz * 4 + src_iw * 4 + src_ic % 4, - src_weight_oc_ptr + src_ic * src_kernel_sz + - src_ih * src_kw_sz + src_iw, - sizeof(c10::quint8)); - } - } - } - } - } - ops::utils::pack_staging_to_vtensor(staging.buffer(), v_weight); - - return v_weight; -} - -vTensor pack_weights_q( - const Tensor& weight_arg, - const Conv2dQMethod conv_method) { - if (weight_arg.is_vulkan()) { - return convert(weight_arg); - } - - api::Context* const context = api::context(); - - const Tensor weight = weight_arg.contiguous(); - - if (conv_method == Conv2dQDepthwise) { - return pack_weights_dw_q(context, weight); - } - - return pack_weights_2d_q(context, weight); -} - -vTensor pack_biases_q(const c10::optional& bias, const Tensor& weight) { - if (bias && bias->is_vulkan()) { - return convert(*bias); - } - - api::Context* const context = api::context(); - - const int64_t src_w = weight.size(Layout::Filter::output); - const int64_t packed_w = div_up(src_w, INT64_C(4)); - vTensor v_bias{ - context, - { - 4, - 1, - packed_w, - }, - weight.options(), - weight.q_scale(), - weight.q_zero_point(), - }; - - api::StagingBuffer staging(context, v_bias.buffer_bytes()); - { - api::MemoryMap mapping(staging.buffer(), api::MemoryAccessType::WRITE); - - c10::quint8* dst_bias_ptr = mapping.template data(); - - if (bias) { - const c10::quint8* const src_bias_ptr = - bias->contiguous().data_ptr(); - - memset(dst_bias_ptr, 0, v_bias.nbytes()); - for (const auto i : c10::irange(src_w)) { - const int64_t c = i % 4; - const int64_t x = i / 4; - dst_bias_ptr[c * packed_w + x] = src_bias_ptr[i]; - } - } else { - memset( - dst_bias_ptr, - // 2's complement integers and IEEE-754 floating point numbers both - // have identical bit representations for 0, so can use memset which - // only accepts uint8_t parameter. 
- 0, - v_bias.nbytes()); - } - } - ops::utils::pack_staging_to_vtensor(staging.buffer(), v_bias); - - return v_bias; -} - -std::array pack_filter( - const Tensor& weight, - const IntArrayRef dilation) { - const IntArrayRef filter = weight.sizes(); - - const auto effective = [](const int64_t k, const int64_t d) { - return k + (k - 1) * (d - 1); - }; - - return { - align_up(filter[Layout::Filter::output], INT64_C(4)), - align_up(filter[Layout::Filter::input], INT64_C(4)), - effective( - filter[Layout::Filter::height], dilation[Layout::Parameter::height]), - effective( - filter[Layout::Filter::width], dilation[Layout::Parameter::width]), - }; -} - -std::array pack_params(const std::vector& vector) { - TORCH_INTERNAL_ASSERT(2u == vector.size(), "Invalid usage!"); - - return { - vector[0], - vector[1], - }; -} - -bool available( - const Tensor& weight, - const c10::optional& bias, - const IntArrayRef stride, - const IntArrayRef padding, - const IntArrayRef dilation, - const bool transposed, - const IntArrayRef /* output_padding */, - const int64_t groups, - const c10::optional& output_min, - const c10::optional& output_max) { - return api::available() && - // Weight - (4 == weight.ndimension()) && (weight.size(Layout::Filter::height) > 0) && - (weight.size(Layout::Filter::width) > 0) && - ((weight.device().is_cpu()) || - (c10::DeviceType::Vulkan == weight.device().type())) && - (kFloat == weight.scalar_type() || - c10::kQUInt8 == weight.scalar_type()) && - // Bias - ((bias && bias->defined()) - ? ((1 == bias->ndimension()) && - ((bias->device().is_cpu()) || - (c10::DeviceType::Vulkan == bias->device().type())) && - (kFloat == bias->scalar_type() || - c10::kQUInt8 == bias->scalar_type()) && - (transposed ? false /* to be addded in the future */ - : (weight.size(Layout::Filter::output) == - bias->size(Layout::Filter::output)))) - : true) && - // Stride - (stride[Layout::Parameter::height] > 0) && - (stride[Layout::Parameter::width] > 0) && - // Padding - (padding[Layout::Parameter::height] >= 0) && - (padding[Layout::Parameter::width] >= 0) && - // Dilation - (dilation[Layout::Parameter::height] > 0) && - (dilation[Layout::Parameter::width] > 0) && - // Groups - (groups > 0) && - // Input - (weight.size(Layout::Filter::input) > 0) && - // Output - (weight.size(Layout::Filter::output) > 0) && - // Output - Groups - ((weight.size(Layout::Filter::output) % groups) == 0) && - // Output Min / Max - (!output_min || output_min->isFloatingPoint()) && - (!output_max || output_max->isFloatingPoint()) && true; -} - -bool usable(const Tensor& input) { - // Input - return (4 == input.ndimension()) && - (c10::DeviceType::Vulkan == input.device().type()) && - (kFloat == input.scalar_type() || c10::kQUInt8 == input.scalar_type()) && - (input.size(Layout::Activation4D::batch) >= 0) && - (input.size(Layout::Activation4D::channels) > 0) && - (input.size(Layout::Activation4D::height) > 0) && - (input.size(Layout::Activation4D::width) > 0) && !input.requires_grad() && - true; -} - -} // namespace - -VulkanOpContext conv2d_context_create_q( - const Tensor& weight, - const c10::optional& bias, - const IntArrayRef stride_arg, - const IntArrayRef padding_arg, - const IntArrayRef dilation_arg, - const bool transposed, - const IntArrayRef output_padding_arg, - const int64_t groups, - const c10::optional& output_min, - const c10::optional& output_max) { - const auto stride = expand_param_if_needed(stride_arg, "stride", 2); - const auto padding = expand_param_if_needed(padding_arg, "padding", 2); - const auto dilation = 
expand_param_if_needed(dilation_arg, "dilation", 2); - const auto output_padding = output_padding_arg; // TODO: Deconvolutions - - TORCH_CHECK( - available( - weight, - bias, - stride, - padding, - dilation, - transposed, - output_padding, - groups, - output_min, - output_max), - "Vulkan::convolution not available! " - "Reason: The provided (weight, bias, stride, padding, dilation, groups, " - "transposed, output_padding, output_min, output_max) parameters are either " - "invalid individually or their combination is not supported by Vulkan impl."); - - TORCH_CHECK(weight.is_quantized(), "Weight Tensor is not Quantized"); - TORCH_CHECK(bias->is_quantized(), "Bias Tensor is not Quantized"); - - auto method = - determine_method(weight.sizes(), stride, padding, dilation, groups); - - c10::impl::GenericList packed_context{c10::AnyType::get()}; - packed_context.reserve(10); - packed_context.emplace_back(convert(pack_weights_q(weight, method))); - packed_context.emplace_back(convert(pack_biases_q(bias, weight))); - packed_context.emplace_back(pack_filter(weight, dilation)); - packed_context.emplace_back(pack_params(stride)); - packed_context.emplace_back(pack_params(padding)); - packed_context.emplace_back(output_padding); - packed_context.emplace_back(pack_params(dilation)); - packed_context.emplace_back(safe_downcast(groups)); - packed_context.emplace_back( - output_min ? output_min->template to() - : -std::numeric_limits::infinity()); - packed_context.emplace_back( - output_max ? output_max->template to() - : +std::numeric_limits::infinity()); - packed_context.emplace_back(method); - - c10::impl::GenericList unpacked_context{c10::AnyType::get()}; - unpacked_context.reserve(10); - unpacked_context.emplace_back(weight); - unpacked_context.emplace_back(bias); - unpacked_context.emplace_back(weight.sizes().vec()); - unpacked_context.emplace_back(stride_arg.vec()); - unpacked_context.emplace_back(padding_arg.vec()); - unpacked_context.emplace_back(output_padding_arg.vec()); - unpacked_context.emplace_back(dilation_arg.vec()); - unpacked_context.emplace_back(groups); - unpacked_context.emplace_back(output_min); - unpacked_context.emplace_back(output_max); - unpacked_context.emplace_back(method); - return VulkanOpContext::create(packed_context, unpacked_context); -} - -void conv2d_sliding_window_q( - const api::ShaderSource& shader, - vTensor& v_output, - const vTensor& v_input, - const vTensor& packed_v_weight, - const vTensor& packed_v_bias, - const IntArrayRef packed_filter, - const IntArrayRef packed_stride, - const IntArrayRef packed_padding, - const IntArrayRef packed_dilation, - const float packed_output_min, - const float packed_output_max, - const IntArrayRef unpacked_filter, - const Conv2dQMethod method_, - const double scale, - const int64_t zero_point) { - api::Context* const context = api::context(); - - const double scale_out = v_output.get_scale(); - const int64_t zero_point_out = v_output.get_zero_point(); - - const double weight_scale = packed_v_weight.get_scale(); - const int64_t weight_zero_point = packed_v_weight.get_zero_point(); - - const double bias_scale = packed_v_bias.get_scale(); - const int64_t bias_zero_point = packed_v_bias.get_zero_point(); - - const struct Block final { - uvec3 extents; - int32_t ic4; - ivec4 kernel; - float scale_out; - float scale; - int32_t zero_point_out; - int32_t zero_point; - float weight_scale; - float bias_scale; - int32_t weight_zero_point; - int32_t bias_zero_point; - ivec2 ikernel; - ivec2 stride; - ivec2 padding; - ivec2 dilate; - vec2 
clamp; - } block{ - v_output.extents(), - safe_downcast(packed_filter[Layout::Filter::input]), - { - safe_downcast(packed_filter[Layout::Filter::width]), - safe_downcast(packed_filter[Layout::Filter::height]), - safe_downcast(v_input.sizes()[Layout::Activation4D::width]), - safe_downcast(v_input.sizes()[Layout::Activation4D::height]), - }, - safe_downcast(scale_out), - safe_downcast(scale), - safe_downcast(zero_point_out), - safe_downcast(zero_point), - safe_downcast(weight_scale), - safe_downcast(bias_scale), - safe_downcast(weight_zero_point), - safe_downcast(bias_zero_point), - { - safe_downcast(unpacked_filter[Layout::Filter::width]), - safe_downcast(unpacked_filter[Layout::Filter::height]), - }, - { - safe_downcast(packed_stride[Layout::Parameter::width]), - safe_downcast(packed_stride[Layout::Parameter::height]), - }, - { - safe_downcast(packed_padding[Layout::Parameter::width]), - safe_downcast(packed_padding[Layout::Parameter::height]), - }, - { - safe_downcast(packed_dilation[Layout::Parameter::width]), - safe_downcast(packed_dilation[Layout::Parameter::height]), - }, - { - packed_output_min, - packed_output_max, - }, - }; - - uvec3 global_size = v_output.extents(); - if (method_ == Conv2dQPointwise) { - global_size = { - safe_downcast( - div_up(v_output.sizes()[Layout::Filter::width], INT64_C(2))), - safe_downcast( - div_up(v_output.sizes()[Layout::Filter::height], INT64_C(2))), - v_output.extents().data[2u]}; - } - - api::UniformParamsBuffer params(context, block); - api::PipelineBarrier pipeline_barrier{}; - - context->submit_compute_job( - // shader descriptor - shader, - // pipeline barrier - pipeline_barrier, - // global work group size - global_size, - // local work group size - adaptive_work_group_size(global_size), - // fence handle - VK_NULL_HANDLE, - // shader arguments - v_output.image( - pipeline_barrier, - api::PipelineStage::COMPUTE, - api::MemoryAccessType::WRITE), - v_input.image(pipeline_barrier, api::PipelineStage::COMPUTE), - packed_v_weight.image(pipeline_barrier, api::PipelineStage::COMPUTE), - packed_v_bias.image(pipeline_barrier, api::PipelineStage::COMPUTE), - // params buffer - params.buffer()); -} - -Tensor conv2d_context_run_q( - const Tensor& input_arg, - const c10::impl::GenericList& packed_context, - const c10::impl::GenericList& unpacked_context, - double scale, - int64_t zero_point) { - api::Context* const context = api::context(); - - const Tensor input = input_arg.is_vulkan() ? input_arg : input_arg.vulkan(); - const vTensor& v_input = convert(input); - - const vTensor& packed_v_weight = convert(packed_context.get(0).toTensor()); - const vTensor& packed_v_bias = convert(packed_context.get(1).toTensor()); - - const auto packed_filter = packed_context.get(2).toIntVector(); - const auto packed_stride = packed_context.get(3).toIntVector(); - const auto packed_padding = packed_context.get(4).toIntVector(); - const auto packed_dilation = packed_context.get(6).toIntVector(); - const float packed_output_min = - safe_downcast(packed_context.get(8).toDouble()); - const float packed_output_max = - safe_downcast(packed_context.get(9).toDouble()); - const auto unpacked_filter = unpacked_context.get(2).toIntVector(); - const Conv2dQMethod method_ = (Conv2dQMethod)unpacked_context.get(10).toInt(); - - TORCH_CHECK( - usable(input), - "Vulkan Convolution not usable! 
" - "Reason: The provided input tensor is either invalid or unsupported by Vulkan impl."); - - vTensor v_output{ - context, - conv_output_size( - v_input.sizes(), - unpacked_filter, - packed_padding, - packed_stride, - packed_dilation), - input.options(), - scale, - zero_point, - }; - - if (method_ == Conv2dQSlidingWindow) { - conv2d_sliding_window_q( - VK_KERNEL(quantized_conv2d), - v_output, - v_input, - packed_v_weight, - packed_v_bias, - packed_filter, - packed_stride, - packed_padding, - packed_dilation, - packed_output_min, - packed_output_max, - unpacked_filter, - method_, - v_input.get_scale(), - v_input.get_zero_point()); - } else if (method_ == Conv2dQPointwise) { - conv2d_sliding_window_q( - VK_KERNEL(quantized_conv2d_pw_2x2), - v_output, - v_input, - packed_v_weight, - packed_v_bias, - packed_filter, - packed_stride, - packed_padding, - packed_dilation, - packed_output_min, - packed_output_max, - unpacked_filter, - method_, - v_input.get_scale(), - v_input.get_zero_point()); - } else if (method_ == Conv2dQDepthwise) { - conv2d_sliding_window_q( - VK_KERNEL(quantized_conv2d_dw), - v_output, - v_input, - packed_v_weight, - packed_v_bias, - packed_filter, - packed_stride, - packed_padding, - packed_dilation, - packed_output_min, - packed_output_max, - unpacked_filter, - method_, - v_input.get_scale(), - v_input.get_zero_point()); - } else { - TORCH_CHECK(false, "Invalid Method"); - } - - return convert_quantized(v_output); -} - -} // namespace ops -} // namespace vulkan -} // namespace native -} // namespace at diff --git a/aten/src/ATen/native/vulkan/ops/QuantizedConvolution.h b/aten/src/ATen/native/vulkan/ops/QuantizedConvolution.h deleted file mode 100644 index 4853623a7fa37..0000000000000 --- a/aten/src/ATen/native/vulkan/ops/QuantizedConvolution.h +++ /dev/null @@ -1,44 +0,0 @@ -#pragma once - -#ifdef USE_VULKAN_API - -#include -#include -#include - -namespace at { -namespace native { -namespace vulkan { -namespace ops { - -enum Conv2dQMethod { - Conv2dQDepthwise, - Conv2dQPointwise, - Conv2dQSlidingWindow, -}; - -VulkanOpContext conv2d_context_create_q( - const Tensor& weight, - const c10::optional& bias, - const IntArrayRef stride_arg, - const IntArrayRef padding_arg, - const IntArrayRef dilation_arg, - const bool transposed, - const IntArrayRef output_padding_arg, - const int64_t groups, - const c10::optional& output_min = c10::nullopt, - const c10::optional& output_max = c10::nullopt); - -Tensor conv2d_context_run_q( - const Tensor& input_arg, - const c10::impl::GenericList& packed_context, - const c10::impl::GenericList& unpacked_context, - double scale, - int64_t zero_point); - -} // namespace ops -} // namespace vulkan -} // namespace native -} // namespace at - -#endif /* USE_VULKAN_API */ diff --git a/aten/src/ATen/native/vulkan/ops/Register.cpp b/aten/src/ATen/native/vulkan/ops/Register.cpp index 4cc1ba4e8bb6b..18d5a6facfaed 100644 --- a/aten/src/ATen/native/vulkan/ops/Register.cpp +++ b/aten/src/ATen/native/vulkan/ops/Register.cpp @@ -5,10 +5,7 @@ #include #include #include -#include #include -#include -#include #include #include @@ -19,133 +16,110 @@ namespace ops { namespace { TORCH_LIBRARY(vulkan, m) { - m.class_("VulkanOpContext") + m.class_("LinearPackedContext") .def_pickle( // __getstate__ - [](const c10::intrusive_ptr& context) { - return context->get_state(); + [](const c10::intrusive_ptr& context) { + // context is packed + return context->unpack(); }, // __setstate__ - [](VulkanOpContext::State state) { - return 
c10::make_intrusive(VulkanOpContext::create( - std::get<0>(state), std::get<1>(state))); + [](c10::impl::GenericList state) { + // state is unpacked + return c10::make_intrusive( + LinearPackedContext::pack(state)); }); - // To maintain backwards compatibility. - m.class_("Conv2dOpContext") + m.class_("GruPackedContext") .def_pickle( // __getstate__ - [](const c10::intrusive_ptr& context) { + [](const c10::intrusive_ptr& context) { + // context is packed return context->unpack(); }, // __setstate__ - [](Conv2dOpContext::State state) { - return conv2d_clamp_prepack( - std::move(std::get<0>(state)), - std::move(std::get<1>(state)), - std::move(std::get<2>(state)), - std::move(std::get<3>(state)), - std::move(std::get<4>(state)), - std::get<5>(state), - std::get<6>(state), - std::get<7>(state)); + [](c10::impl::GenericList state) { + // state is unpacked + return c10::make_intrusive( + GruPackedContext::pack(state)); }); - // To maintain backwards compatibility. - m.class_("TransposeConv2dOpContext") + m.class_("LstmPackedContext") .def_pickle( // __getstate__ - [](const c10::intrusive_ptr& context) { + [](const c10::intrusive_ptr& context) { + // context is packed return context->unpack(); }, // __setstate__ - [](TransposeConv2dOpContext::State state) { - return conv2d_transpose_clamp_prepack( - std::move(std::get<0>(state)), - std::move(std::get<1>(state)), - std::move(std::get<2>(state)), - std::move(std::get<3>(state)), - std::move(std::get<4>(state)), - std::move(std::get<5>(state)), - std::get<6>(state), - std::get<7>(state), - std::get<8>(state)); + [](c10::impl::GenericList state) { + // state is unpacked + return c10::make_intrusive( + LstmPackedContext::pack(state)); }); - // To maintain backwards compatibility. - m.class_("LinearOpContext") + m.class_("Conv2dPackedContext") .def_pickle( // __getstate__ - [](const c10::intrusive_ptr& context) { + [](const c10::intrusive_ptr& context) { + // context is packed return context->unpack(); }, // __setstate__ - [](LinearOpContext::State state) { - return linear_prepack( - std::move(std::get<0>(state)), std::move(std::get<1>(state))); + [](c10::impl::GenericList state) { + // state is unpacked + return c10::make_intrusive( + Conv2dPackedContext::pack(state)); }); // To maintain backwards compatibility. - m.class_("GruOpContext") + m.class_("Conv2dOpContext") .def_pickle( // __getstate__ - [](const c10::intrusive_ptr& context) { + [](const c10::intrusive_ptr& context) { return context->unpack(); }, // __setstate__ - [](GruOpContext::State state) { - return gru_prepack( + [](Conv2dOpContext::State state) { + return conv2d_clamp_prepack( std::move(std::get<0>(state)), - std::get<1>(state), - std::get<2>(state), - std::get<3>(state), - std::get<4>(state), + std::move(std::get<1>(state)), + std::move(std::get<2>(state)), + std::move(std::get<3>(state)), + std::move(std::get<4>(state)), std::get<5>(state), - std::get<6>(state)); + std::get<6>(state), + std::get<7>(state)); }); } TORCH_LIBRARY(vulkan_prepack, m) { m.def(TORCH_SELECTIVE_SCHEMA( - "vulkan_prepack::create_conv2d_clamp_context(Tensor W, Tensor? B, int[2] stride, " + "vulkan_prepack::create_conv2d_context(Tensor W, Tensor? B, int[2] stride, " "int[2] padding, int[2] dilation, int groups, " "Scalar? output_min=None, Scalar? output_max=None) " - "-> __torch__.torch.classes.vulkan.VulkanOpContext")); + "-> __torch__.torch.classes.vulkan.Conv2dPackedContext")); m.def(TORCH_SELECTIVE_SCHEMA( // Backwards compatibility "vulkan_prepack::conv2d_clamp_prepack(Tensor W, Tensor? 
B, int[2] stride, " "int[2] padding, int[2] dilation, int groups, " "Scalar? output_min=None, Scalar? output_max=None) " "-> __torch__.torch.classes.vulkan.Conv2dOpContext")); m.def(TORCH_SELECTIVE_SCHEMA( - "vulkan_prepack::run_conv2d_clamp_context(Tensor X, " - "__torch__.torch.classes.vulkan.VulkanOpContext W_prepack) -> Tensor Y")); + "vulkan_prepack::run_conv2d_context(Tensor X, " + "__torch__.torch.classes.vulkan.Conv2dPackedContext W_prepack) -> Tensor Y")); m.def(TORCH_SELECTIVE_SCHEMA( // Backwards compatibility "vulkan_prepack::conv2d_clamp_run(Tensor X, " "__torch__.torch.classes.vulkan.Conv2dOpContext W_prepack) -> Tensor Y")); m.def(TORCH_SELECTIVE_SCHEMA( - "vulkan_prepack::create_conv2d_transpose_clamp_context(Tensor W, Tensor? B, int[2] stride, " + "vulkan_prepack::create_tconv2d_context(Tensor W, Tensor? B, int[2] stride, " "int[2] padding, int[2] output_padding, int[2] dilation, int groups, " "Scalar? output_min=None, Scalar? output_max=None) " - "-> __torch__.torch.classes.vulkan.VulkanOpContext")); - m.def(TORCH_SELECTIVE_SCHEMA( // Backwards compatibility - "vulkan_prepack::conv2d_transpose_clamp_prepack(Tensor W, Tensor? B, int[2] stride, " - "int[2] padding, int[2] output_padding, int[2] dilation, int groups, " - "Scalar? output_min=None, Scalar? output_max=None) " - "-> __torch__.torch.classes.vulkan.TransposeConv2dOpContext")); + "-> __torch__.torch.classes.vulkan.Conv2dPackedContext")); m.def(TORCH_SELECTIVE_SCHEMA( - "vulkan_prepack::run_conv2d_transpose_clamp_context(Tensor X, " - "__torch__.torch.classes.vulkan.VulkanOpContext W_prepack) -> Tensor Y")); - m.def(TORCH_SELECTIVE_SCHEMA( // Backwards compatibility - "vulkan_prepack::conv2d_transpose_clamp_run(Tensor X, " - "__torch__.torch.classes.vulkan.TransposeConv2dOpContext W_prepack) -> Tensor Y")); + "vulkan_prepack::run_tconv2d_context(Tensor X, " + "__torch__.torch.classes.vulkan.Conv2dPackedContext W_prepack) -> Tensor Y")); m.def(TORCH_SELECTIVE_SCHEMA( "vulkan_prepack::create_linear_context(Tensor W, Tensor? B) " - "-> __torch__.torch.classes.vulkan.VulkanOpContext")); - m.def(TORCH_SELECTIVE_SCHEMA( // Backwards compatibility - "vulkan_prepack::linear_prepack(Tensor W, Tensor? 
B) " - "-> __torch__.torch.classes.vulkan.LinearOpContext")); + "-> __torch__.torch.classes.vulkan.LinearPackedContext")); m.def(TORCH_SELECTIVE_SCHEMA( "vulkan_prepack::run_linear_context(Tensor X, " - "__torch__.torch.classes.vulkan.VulkanOpContext BW_prepack) -> Tensor Y")); - m.def(TORCH_SELECTIVE_SCHEMA( // Backwards compatibility - "vulkan_prepack::linear_run(Tensor X, " - "__torch__.torch.classes.vulkan.LinearOpContext BW_prepack) -> Tensor Y")); + "__torch__.torch.classes.vulkan.LinearPackedContext BW_prepack) -> Tensor Y")); m.def(TORCH_SELECTIVE_SCHEMA( "vulkan_prepack::create_gru_context(Tensor[] params_cpu, " "bool has_biases, " @@ -154,24 +128,11 @@ TORCH_LIBRARY(vulkan_prepack, m) { "bool train, " "bool bidirectional, " "bool batch_first) " - "-> __torch__.torch.classes.vulkan.VulkanOpContext")); - m.def(TORCH_SELECTIVE_SCHEMA( // Backwards compatibility - "vulkan_prepack::gru_prepack(Tensor[] params_cpu, " - "bool has_biases, " - "int num_layers, " - "float dropout, " - "bool train, " - "bool bidirectional, " - "bool batch_first) " - "-> __torch__.torch.classes.vulkan.GruOpContext")); + "-> __torch__.torch.classes.vulkan.GruPackedContext")); m.def(TORCH_SELECTIVE_SCHEMA( "vulkan_prepack::run_gru_context(Tensor input_vk, " "Tensor hx_vk, " - "__torch__.torch.classes.vulkan.VulkanOpContext G_prepack) -> (Tensor next_input, Tensor hidden_layer)")); - m.def(TORCH_SELECTIVE_SCHEMA( // Backwards compatibility - "vulkan_prepack::gru_run(Tensor input_vk, " - "Tensor hx_vk, " - "__torch__.torch.classes.vulkan.GruOpContext G_prepack) -> (Tensor next_input, Tensor hidden_layer)")); + "__torch__.torch.classes.vulkan.GruPackedContext G_prepack) -> (Tensor next_input, Tensor hidden_layer)")); m.def(TORCH_SELECTIVE_SCHEMA( "vulkan_prepack::create_lstm_context(Tensor[] params_cpu, " "bool has_biases, " @@ -180,40 +141,30 @@ TORCH_LIBRARY(vulkan_prepack, m) { "bool train, " "bool bidirectional, " "bool batch_first) " - "-> __torch__.torch.classes.vulkan.VulkanOpContext")); + "-> __torch__.torch.classes.vulkan.LstmPackedContext")); m.def(TORCH_SELECTIVE_SCHEMA( "vulkan_prepack::run_lstm_context(Tensor input_vk, " "Tensor hx_vk, " "Tensor cx_vk, " - "__torch__.torch.classes.vulkan.VulkanOpContext L_prepack) -> (Tensor next_input, Tensor hidden_state, Tensor cell_state)")); + "__torch__.torch.classes.vulkan.LstmPackedContext L_prepack) -> (Tensor next_input, Tensor hidden_state, Tensor cell_state)")); } TORCH_LIBRARY_IMPL(vulkan_prepack, CPU, m) { m.impl( - TORCH_SELECTIVE_NAME("vulkan_prepack::create_conv2d_clamp_context"), - TORCH_FN(create_conv2d_clamp_context)); + TORCH_SELECTIVE_NAME("vulkan_prepack::create_conv2d_context"), + TORCH_FN(create_conv2d_context)); m.impl( TORCH_SELECTIVE_NAME("vulkan_prepack::conv2d_clamp_prepack"), TORCH_FN(conv2d_clamp_prepack)); // Backwards compatibility m.impl( - TORCH_SELECTIVE_NAME( - "vulkan_prepack::create_conv2d_transpose_clamp_context"), - TORCH_FN(create_conv2d_transpose_clamp_context)); - m.impl( - TORCH_SELECTIVE_NAME("vulkan_prepack::conv2d_transpose_clamp_prepack"), - TORCH_FN(conv2d_transpose_clamp_prepack)); // Backwards compatibility + TORCH_SELECTIVE_NAME("vulkan_prepack::create_tconv2d_context"), + TORCH_FN(create_tconv2d_context)); m.impl( TORCH_SELECTIVE_NAME("vulkan_prepack::create_linear_context"), TORCH_FN(create_linear_context)); - m.impl( - TORCH_SELECTIVE_NAME("vulkan_prepack::linear_prepack"), - TORCH_FN(linear_prepack)); // Backwards compatibility m.impl( TORCH_SELECTIVE_NAME("vulkan_prepack::create_gru_context"), 
TORCH_FN(create_gru_context)); - m.impl( - TORCH_SELECTIVE_NAME("vulkan_prepack::gru_prepack"), - TORCH_FN(gru_prepack)); // Backwards compatibility m.impl( TORCH_SELECTIVE_NAME("vulkan_prepack::create_lstm_context"), TORCH_FN(create_lstm_context)); @@ -221,161 +172,26 @@ TORCH_LIBRARY_IMPL(vulkan_prepack, CPU, m) { TORCH_LIBRARY_IMPL(vulkan_prepack, Vulkan, m) { m.impl( - TORCH_SELECTIVE_NAME("vulkan_prepack::run_conv2d_clamp_context"), - TORCH_FN(run_conv2d_clamp_context)); + TORCH_SELECTIVE_NAME("vulkan_prepack::run_conv2d_context"), + TORCH_FN(run_conv2d_context)); m.impl( TORCH_SELECTIVE_NAME("vulkan_prepack::conv2d_clamp_run"), TORCH_FN(conv2d_clamp_run)); // Backwards compatibility m.impl( - TORCH_SELECTIVE_NAME( - "vulkan_prepack::run_conv2d_transpose_clamp_context"), - TORCH_FN(run_conv2d_transpose_clamp_context)); - m.impl( - TORCH_SELECTIVE_NAME("vulkan_prepack::conv2d_transpose_clamp_run"), - TORCH_FN(conv2d_transpose_clamp_run)); // Backwards compatibility + TORCH_SELECTIVE_NAME("vulkan_prepack::run_tconv2d_context"), + TORCH_FN(run_tconv2d_context)); m.impl( TORCH_SELECTIVE_NAME("vulkan_prepack::run_linear_context"), TORCH_FN(run_linear_context)); - m.impl( - TORCH_SELECTIVE_NAME("vulkan_prepack::linear_run"), - TORCH_FN(linear_run)); // Backwards compatibility m.impl( TORCH_SELECTIVE_NAME("vulkan_prepack::run_gru_context"), TORCH_FN(run_gru_context)); - m.impl( - TORCH_SELECTIVE_NAME("vulkan_prepack::gru_run"), - TORCH_FN(gru_run)); // Backwards compatibility m.impl( TORCH_SELECTIVE_NAME("vulkan_prepack::run_lstm_context"), TORCH_FN(run_lstm_context)); } -Tensor convolution( - const Tensor& input, - const Tensor& weight, - const c10::optional& bias, - const IntArrayRef stride, - const IntArrayRef padding, - const IntArrayRef dilation, - const bool transposed, - const IntArrayRef output_padding, - const int64_t groups) { - if (transposed) { - VulkanOpContext vulkan_context = conv2d_transpose_context_create( - weight, bias, stride, padding, output_padding, dilation, groups); - return conv2d_transpose_context_run( - input, vulkan_context.get_packed(), vulkan_context.get_unpacked()); - } - VulkanOpContext vulkan_context = conv2d_context_create( - weight, - bias, - stride, - padding, - dilation, - transposed, - output_padding, - groups); - return conv2d_context_run( - input, vulkan_context.get_packed(), vulkan_context.get_unpacked()); -} - -Tensor quantized_convolution( - const Tensor& input, - const Tensor& weight, - const c10::optional& bias, - const IntArrayRef stride, - const IntArrayRef padding, - const IntArrayRef dilation, - const bool transposed, - const IntArrayRef output_padding, - const int64_t groups, - const double out_scale, - const int64_t out_zero_point) { - if (transposed) { - VulkanOpContext vulkan_context = conv2d_transpose_context_create( - weight, bias, stride, padding, output_padding, dilation, groups); - return conv2d_transpose_context_run( - input, vulkan_context.get_packed(), vulkan_context.get_unpacked()); - } - VulkanOpContext vulkan_context = conv2d_context_create_q( - weight, - bias, - stride, - padding, - dilation, - transposed, - output_padding, - groups, - c10::nullopt, - c10::nullopt); - return conv2d_context_run_q( - input, - vulkan_context.get_packed(), - vulkan_context.get_unpacked(), - out_scale, - out_zero_point); -} } // namespace - -static std::tuple batchify( - const Tensor& input, - const int64_t num_spatial_dims, - const std::string& func_name) { - const auto dim_count_no_batch = num_spatial_dims + 1; - const auto dim_count_batch = 
dim_count_no_batch + 1; - const auto is_batched = (input.dim() == dim_count_batch); - TORCH_CHECK( - input.dim() == dim_count_no_batch || is_batched, - "Expected ", - dim_count_no_batch, - "D (unbatched) or ", - dim_count_batch, - "D (batched) input to ", - func_name, - ", but got input of size: ", - input.sizes()); - return std::make_tuple(is_batched ? input : input.unsqueeze(0), is_batched); -} - -Tensor conv2d( - const Tensor& input_, - const Tensor& weight, - const c10::optional& bias_opt, - IntArrayRef stride, - IntArrayRef padding, - IntArrayRef dilation, - int64_t groups, - double out_scale, - int64_t out_zero_point) { - // See [Note: hacky wrapper removal for optional tensor] - c10::MaybeOwned bias_maybe_owned = - at::borrow_from_optional_tensor(bias_opt); - const Tensor& bias = *bias_maybe_owned; - - Tensor input; - bool is_batched; - std::tie(input, is_batched) = - batchify(input_, /*num_spatial_dims=*/2, "conv2d"); - Tensor output; - output = quantized_convolution( - input, - weight, - bias, - stride, - padding, - dilation, - false, - {{0, 0}}, - groups, - out_scale, - out_zero_point); - return is_batched ? output : output.squeeze(0); -} - -TORCH_LIBRARY_IMPL(aten, Vulkan, m) { - m.impl("convolution_overrideable", convolution); -} - } // namespace ops } // namespace vulkan } // namespace native diff --git a/aten/src/ATen/native/vulkan/ops/Shape.cpp b/aten/src/ATen/native/vulkan/ops/Shape.cpp index f5c47187c3bda..14b32c2eea179 100644 --- a/aten/src/ATen/native/vulkan/ops/Shape.cpp +++ b/aten/src/ATen/native/vulkan/ops/Shape.cpp @@ -22,7 +22,7 @@ Tensor view_internal(const Tensor& self_arg, const IntArrayRef shape) { self.options(), }; - api::StagingBuffer buffer(context, v_self.buffer_bytes(), true); + api::StorageBuffer buffer(context, at::kFloat, v_self.numcells(), true); utils::pack_vtensor_to_staging(v_self, buffer.buffer()); diff --git a/aten/src/ATen/native/vulkan/ops/Slice.cpp b/aten/src/ATen/native/vulkan/ops/Slice.cpp index d45bff6af4066..a6c0beb965b42 100644 --- a/aten/src/ATen/native/vulkan/ops/Slice.cpp +++ b/aten/src/ATen/native/vulkan/ops/Slice.cpp @@ -95,7 +95,7 @@ Tensor slice_width( api::PipelineBarrier pipeline_barrier{}; - context->submit_texture_copy( + context->submit_copy( // pipeline barrier pipeline_barrier, // images @@ -126,7 +126,7 @@ Tensor slice_width( api::PipelineBarrier pipeline_barrier{}; - context->submit_texture_copy( + context->submit_copy( // pipeline barrier pipeline_barrier, // images @@ -171,7 +171,7 @@ Tensor slice_height( api::PipelineBarrier pipeline_barrier{}; - context->submit_texture_copy( + context->submit_copy( // pipeline barrier pipeline_barrier, // images @@ -200,7 +200,7 @@ Tensor slice_height( api::PipelineBarrier pipeline_barrier{}; - context->submit_texture_copy( + context->submit_copy( // pipeline barrier pipeline_barrier, // images diff --git a/aten/src/ATen/native/vulkan/ops/Tensor.h b/aten/src/ATen/native/vulkan/ops/Tensor.h index ecf99ceb9f375..16dbc9887355c 100644 --- a/aten/src/ATen/native/vulkan/ops/Tensor.h +++ b/aten/src/ATen/native/vulkan/ops/Tensor.h @@ -80,6 +80,11 @@ class vTensorStorage final { // Validation void verify() const; + + public: + inline VkFormat texture_format() { + return image_.format(); + } }; class vTensor final { @@ -139,6 +144,13 @@ class vTensor final { return view_->extents_; } + /* + * Get a c10::ScalarType that corresponds to the image format of the texture + */ + inline c10::ScalarType texture_dtype() const { + return api::c10_scalartype(view_->texture_format()); + } + inline const 
TensorOptions& options() const { return view_->options_; } @@ -168,11 +180,29 @@ class vTensor final { c10::multiply_integers(sizes()); } - inline VkDeviceSize buffer_bytes() { - return c10::elementSize(c10::typeMetaToScalarType(options().dtype())) * - view_->extents_.data[0u] * view_->extents_.data[1u] * + /* + * Number of texels in the image texture. + */ + inline VkDeviceSize numtexels() { + return view_->extents_.data[0u] * view_->extents_.data[1u] * + view_->extents_.data[2u]; + } + + /* + * Number of "cells" in the image texture. 4 cells make up a texel. + */ + inline VkDeviceSize numcells() { + return view_->extents_.data[0u] * view_->extents_.data[1u] * (4u * view_->extents_.data[2u]); } + + /* + * Number of bytes needed for a buffer to receive all data in the texture + */ + inline VkDeviceSize buffer_bytes() { + return c10::elementSize(this->texture_dtype()) * view_->extents_.data[0u] * + view_->extents_.data[1u] * (4u * view_->extents_.data[2u]); + } }; void add_buffer_barrier( diff --git a/aten/src/ATen/native/vulkan/ops/TransposeConvolution2d.cpp b/aten/src/ATen/native/vulkan/ops/TransposeConvolution2d.cpp deleted file mode 100644 index 125efa803f3c1..0000000000000 --- a/aten/src/ATen/native/vulkan/ops/TransposeConvolution2d.cpp +++ /dev/null @@ -1,600 +0,0 @@ -#include -#include - -#include -#include -#include -#include -#include -#include - -namespace at { -namespace native { -namespace vulkan { -namespace ops { -namespace { - -using namespace api::utils; -using namespace at::native::vulkan::ops; - -vTensor pack_weights_2d_reverse( - api::Context* const context, - const Tensor& weight, - bool reversed) { - /* Source */ - const IntArrayRef src_filter = weight.sizes(); - const float* const src_weight_ptr = weight.data_ptr(); - - const int64_t src_kw_sz = src_filter[Layout::Filter::width]; - const int64_t src_kh_sz = src_filter[Layout::Filter::height]; - const int64_t src_kernel_sz = src_kw_sz * src_kh_sz; - const int64_t src_block_sz = - src_kernel_sz * src_filter[Layout::Filter::input]; - - const int64_t num_stacks = - div_up(src_filter[Layout::Filter::output], INT64_C(4)); - const int64_t stack_depth = - api::utils::align_up(src_filter[Layout::Filter::input], INT64_C(4)); - - /* Destination */ - const int64_t dst_kw_sz = src_kw_sz * stack_depth; - const int64_t dst_kh_sz = src_kh_sz * num_stacks; - const int64_t dst_kernel_sz = dst_kw_sz * dst_kh_sz; - - vTensor v_weight{ - context, - { - 4, - dst_kh_sz, - dst_kw_sz, - }, - weight.options(), - }; - - api::StagingBuffer staging(context, v_weight.buffer_bytes()); - { - api::MemoryMap mapping(staging.buffer(), api::MemoryAccessType::WRITE); - - float* dst_weight_ptr = mapping.template data(); - - memset(dst_weight_ptr, 0, v_weight.nbytes()); - - for (const auto src_oc : c10::irange(src_filter[Layout::Filter::output])) { - /* Source */ - const float* const src_weight_oc_ptr = - src_weight_ptr + src_oc * src_block_sz; - - /* Destination */ - const int64_t dst_oh = src_oc / 4; - const int64_t dst_c = src_oc % 4; - - float* const dst_weight_c_ptr = dst_weight_ptr + dst_c * dst_kernel_sz; - - for (const auto src_ic : c10::irange(src_filter[Layout::Filter::input])) { - for (const auto src_ih : c10::irange(src_kh_sz)) { - const int64_t dst_h = reversed ? (src_kh_sz - 1 - src_ih) : src_ih; - for (const auto src_iw : c10::irange(src_kw_sz)) { - const int64_t dst_w = reversed ? 
(src_kw_sz - 1 - src_iw) : src_iw; - const int64_t dst_w_offset = dst_w * stack_depth; - memcpy( - dst_weight_c_ptr + (dst_oh * src_kh_sz + dst_h) * dst_kw_sz + - src_ic + dst_w_offset, - src_weight_oc_ptr + src_ic * src_kernel_sz + - src_ih * src_kw_sz + src_iw, - sizeof(float)); - } - } - } - } - } - utils::pack_staging_to_vtensor(staging.buffer(), v_weight); - - return v_weight; -} - -vTensor pack_weights(const Tensor& weight_arg) { - if (weight_arg.is_vulkan()) { - return convert(weight_arg); - } - - api::Context* const context = api::context(); - - const Tensor weight = at::permute(weight_arg, {1, 0, 2, 3}).contiguous(); - - return pack_weights_2d_reverse(context, weight, true); -} - -vTensor pack_biases(const c10::optional& bias, const Tensor& weight) { - if (bias && bias->is_vulkan()) { - return convert(*bias); - } - - api::Context* const context = api::context(); - - const int64_t src_w = weight.size(Layout::TransposedFilter::output); - const int64_t packed_w = div_up(src_w, INT64_C(4)); - vTensor v_bias{ - context, - { - 4, - 1, - packed_w, - }, - weight.options(), - }; - - api::StagingBuffer staging(context, v_bias.buffer_bytes()); - { - api::MemoryMap mapping(staging.buffer(), api::MemoryAccessType::WRITE); - - float* dst_bias_ptr = mapping.template data(); - - if (bias) { - const float* const src_bias_ptr = bias->contiguous().data_ptr(); - - memset(dst_bias_ptr, 0, v_bias.nbytes()); - for (const auto i : c10::irange(src_w)) { - const int64_t c = i % 4; - const int64_t x = i / 4; - dst_bias_ptr[c * packed_w + x] = src_bias_ptr[i]; - } - } else { - memset( - dst_bias_ptr, - // 2's complement integers and IEEE-754 floating point numbers both - // have identical bit representations for 0, so can use memset which - // only accepts uint8_t parameter. - 0, - v_bias.nbytes()); - } - } - utils::pack_staging_to_vtensor(staging.buffer(), v_bias); - - return v_bias; -} - -std::array pack_filter( - const Tensor& weight, - const IntArrayRef dilation) { - const IntArrayRef filter = weight.sizes(); - - const auto effective = [](const int64_t k, const int64_t d) { - return k + (k - 1) * (d - 1); - }; - - return { - align_up(filter[Layout::TransposedFilter::output], INT64_C(4)), - align_up(filter[Layout::TransposedFilter::input], INT64_C(4)), - effective( - filter[Layout::Filter::height], dilation[Layout::Parameter::height]), - effective( - filter[Layout::Filter::width], dilation[Layout::Parameter::width]), - }; -} - -std::array pack_params(const std::vector& vector) { - TORCH_INTERNAL_ASSERT(2u == vector.size(), "Invalid usage!"); - - return { - vector[0], - vector[1], - }; -} - -bool available( - const Tensor& weight, - const c10::optional& bias, - const IntArrayRef stride, - const IntArrayRef padding, - const IntArrayRef /* output_padding */, - const IntArrayRef dilation, - const bool transposed, - const int64_t groups, - const c10::optional& output_min, - const c10::optional& output_max) { - return api::available() && - // Weight - (4 == weight.ndimension()) && (weight.size(Layout::Filter::height) > 0) && - (weight.size(Layout::Filter::width) > 0) && - ((weight.device().is_cpu()) || - (c10::DeviceType::Vulkan == weight.device().type())) && - (kFloat == weight.scalar_type()) && - // Bias - ((bias && bias->defined()) - ? ((1 == bias->ndimension()) && - ((bias->device().is_cpu()) || - (c10::DeviceType::Vulkan == bias->device().type())) && - (kFloat == bias->scalar_type()) && - (transposed ? 
(weight.size(Layout::TransposedFilter::output) == - bias->size(Layout::Filter::output)) - : (weight.size(Layout::Filter::output) == - bias->size(Layout::Filter::output)))) - : true) && - // Stride - (stride[Layout::Parameter::height] > 0) && - (stride[Layout::Parameter::width] > 0) && - // Padding - (padding[Layout::Parameter::height] >= 0) && - (padding[Layout::Parameter::width] >= 0) && - // Dilation - (transposed ? (dilation[Layout::Parameter::height] == 1) && - (dilation[Layout::Parameter::width] == 1) - : (dilation[Layout::Parameter::height] > 0) && - (dilation[Layout::Parameter::width] > 0)) && - // Groups - (groups > 0) && - // Input - (weight.size(Layout::Filter::input) > 0) && - // Output - (weight.size(Layout::Filter::output) > 0) && - // Output - Groups - ((weight.size(Layout::Filter::output) % groups) == 0) && - // Output Min / Max - (!output_min || output_min->isFloatingPoint()) && - (!output_max || output_max->isFloatingPoint()) && true; -} - -bool usable(const Tensor& input) { - // Input - return (4 == input.ndimension()) && - (c10::DeviceType::Vulkan == input.device().type()) && - (kFloat == input.scalar_type()) && - (input.size(Layout::Activation4D::batch) >= 0) && - (input.size(Layout::Activation4D::channels) > 0) && - (input.size(Layout::Activation4D::height) > 0) && - (input.size(Layout::Activation4D::width) > 0) && !input.requires_grad() && - true; -} - -static inline std::vector get_conv_transpose_output_size( - IntArrayRef input_size, - IntArrayRef weight_size, - IntArrayRef padding, - IntArrayRef output_padding, - IntArrayRef stride, - IntArrayRef dilation = IntArrayRef()) { - auto dim = input_size.size(); - std::vector output_size(dim); - output_size[0] = input_size[input_batch_size_dim]; - output_size[1] = weight_size[weight_input_channels_dim]; - for (const auto d : c10::irange(2, dim)) { - output_size[d] = stride[d - 2] * (input_size[d] - 1) + weight_size[d] - - 2 * padding[d - 2] + output_padding[d - 2]; - } - return output_size; -} - -} // namespace - -VulkanOpContext conv2d_transpose_context_create( - const Tensor& weight, - const c10::optional& bias, - const IntArrayRef stride_arg, - const IntArrayRef padding_arg, - const IntArrayRef output_padding_arg, - const IntArrayRef dilation_arg, - const int64_t groups, - const c10::optional& output_min, - const c10::optional& output_max) { - const auto stride = expand_param_if_needed(stride_arg, "stride", 2); - const auto padding = expand_param_if_needed(padding_arg, "padding", 2); - const auto dilation = expand_param_if_needed(dilation_arg, "dilation", 2); - const auto output_padding = - expand_param_if_needed(output_padding_arg, "output_padding", 2); - - TORCH_CHECK( - available( - weight, - bias, - stride, - padding, - output_padding, - dilation, - true, - groups, - output_min, - output_max), - "Vulkan::convolution not available! 
" - "Reason: The provided (weight, bias, stride, padding, dilation, groups, " - "transposed, output_padding, output_min, output_max) parameters are either " - "invalid individually or their combination is not supported by Vulkan impl."); - - c10::impl::GenericList packed_context{c10::AnyType::get()}; - packed_context.reserve(10); - packed_context.emplace_back(convert(pack_weights(weight))); - packed_context.emplace_back(convert(pack_biases(bias, weight))); - packed_context.emplace_back(pack_filter(weight, dilation)); - packed_context.emplace_back(pack_params(stride)); - packed_context.emplace_back(pack_params(padding)); - packed_context.emplace_back(pack_params(output_padding)); - packed_context.emplace_back(pack_params(dilation)); - packed_context.emplace_back(safe_downcast(groups)); - packed_context.emplace_back( - output_min ? output_min->template to() - : -std::numeric_limits::infinity()); - packed_context.emplace_back( - output_max ? output_max->template to() - : +std::numeric_limits::infinity()); - - c10::impl::GenericList unpacked_context{c10::AnyType::get()}; - unpacked_context.reserve(10); - unpacked_context.emplace_back(weight); - unpacked_context.emplace_back(bias); - unpacked_context.emplace_back(weight.sizes().vec()); - unpacked_context.emplace_back(stride_arg.vec()); - unpacked_context.emplace_back(padding_arg.vec()); - unpacked_context.emplace_back(output_padding_arg.vec()); - unpacked_context.emplace_back(dilation_arg.vec()); - unpacked_context.emplace_back(groups); - unpacked_context.emplace_back(output_min); - unpacked_context.emplace_back(output_max); - - return VulkanOpContext::create(packed_context, unpacked_context); -} - -void conv2d_transpose_sliding_window( - const api::ShaderSource& shader, - vTensor& v_output, - const vTensor& v_input, - const vTensor& packed_v_weight, - const vTensor& packed_v_bias, - const IntArrayRef packed_filter, - const IntArrayRef packed_stride, - const IntArrayRef packed_padding, - const IntArrayRef packed_dilation, - const float packed_output_min, - const float packed_output_max, - const IntArrayRef unpacked_filter) { - api::Context* const context = api::context(); - - const struct Block final { - uvec3 extents; - int32_t ic4; - ivec4 kernel; - ivec2 ikernel; - ivec2 stride; - ivec2 padding; - ivec2 dilate; - vec2 clamp; - ivec4 src_filter; - } block{ - v_output.extents(), - safe_downcast( - packed_filter[Layout::Filter::input]), /* this is aligned up */ - { - safe_downcast(packed_filter[Layout::Filter::width]), - safe_downcast(packed_filter[Layout::Filter::height]), - safe_downcast(v_input.sizes()[Layout::Activation4D::width]), - safe_downcast(v_input.sizes()[Layout::Activation4D::height]), - }, - { - safe_downcast(unpacked_filter[Layout::Filter::width]), - safe_downcast(unpacked_filter[Layout::Filter::height]), - }, - { - safe_downcast(packed_stride[Layout::Parameter::width]), - safe_downcast(packed_stride[Layout::Parameter::height]), - }, - { - safe_downcast(packed_padding[Layout::Parameter::width]), - safe_downcast(packed_padding[Layout::Parameter::height]), - }, - { - safe_downcast(packed_dilation[Layout::Parameter::width]), - safe_downcast(packed_dilation[Layout::Parameter::height]), - }, - { - packed_output_min, - packed_output_max, - }, - }; - - uvec3 global_size = v_output.extents(); - - api::UniformParamsBuffer params(context, block); - api::PipelineBarrier pipeline_barrier{}; - - context->submit_compute_job( - // shader descriptor - shader, - // pipeline barrier - pipeline_barrier, - // global work group size - global_size, - 
// local work group size - adaptive_work_group_size(global_size), - // fence handle - VK_NULL_HANDLE, - // shader arguments - v_output.image( - pipeline_barrier, - api::PipelineStage::COMPUTE, - api::MemoryAccessType::WRITE), - v_input.image(pipeline_barrier, api::PipelineStage::COMPUTE), - packed_v_weight.image(pipeline_barrier, api::PipelineStage::COMPUTE), - packed_v_bias.image(pipeline_barrier, api::PipelineStage::COMPUTE), - // params buffer - params.buffer()); -} - -Tensor conv2d_transpose_context_run( - const Tensor& input_arg, - const c10::impl::GenericList& packed_context, - const c10::impl::GenericList& unpacked_context) { - api::Context* const context = api::context(); - - const Tensor input = input_arg.is_vulkan() ? input_arg : input_arg.vulkan(); - const vTensor& v_input = convert(input); - - const vTensor& packed_v_weight = convert(packed_context.get(0).toTensor()); - const vTensor& packed_v_bias = convert(packed_context.get(1).toTensor()); - - const auto packed_filter = packed_context.get(2).toIntVector(); - const auto packed_stride = packed_context.get(3).toIntVector(); - const auto packed_padding = packed_context.get(4).toIntVector(); - const auto packed_output_padding = packed_context.get(5).toIntVector(); - const auto packed_dilation = packed_context.get(6).toIntVector(); - const float packed_output_min = packed_context.get(8).toDouble(); - const float packed_output_max = packed_context.get(9).toDouble(); - const auto unpacked_filter = unpacked_context.get(2).toIntVector(); - - TORCH_CHECK( - usable(input), - "Vulkan Convolution not usable! " - "Reason: The provided input tensor is either invalid or unsupported by Vulkan impl."); - - vTensor v_output{ - context, - get_conv_transpose_output_size( - v_input.sizes(), - unpacked_filter, - packed_padding, - packed_output_padding, - packed_stride, - packed_dilation), - input.options(), - }; - - conv2d_transpose_sliding_window( - VK_KERNEL(conv_transpose2d), - v_output, - v_input, - packed_v_weight, - packed_v_bias, - packed_filter, - packed_stride, - packed_padding, - packed_dilation, - packed_output_min, - packed_output_max, - unpacked_filter); - - return convert(v_output); -} - -c10::intrusive_ptr create_conv2d_transpose_clamp_context( - Tensor&& weight, - c10::optional&& bias, - std::vector&& stride, - std::vector&& padding, - std::vector&& output_padding, - std::vector&& dilation, - const int64_t groups, - const c10::optional& output_min, - const c10::optional& output_max) { - return c10::make_intrusive(conv2d_transpose_context_create( - weight, - bias, - stride, - padding, - output_padding, - dilation, - groups, - output_min, - output_max)); -} - -Tensor run_conv2d_transpose_clamp_context( - const Tensor& input, - const c10::intrusive_ptr& vulkan_context) { - return conv2d_transpose_context_run( - input, vulkan_context->get_packed(), vulkan_context->get_unpacked()); -} - -/* Backwards compatibility */ -TransposeConv2dOpContext::TransposeConv2dOpContext( - VulkanOpContext vulkan_context) - : vulkan_context_{std::move(vulkan_context)} {} - -TransposeConv2dOpContext TransposeConv2dOpContext::create( - const Tensor& weight, - const c10::optional& bias, - const IntArrayRef stride_arg, - const IntArrayRef padding_arg, - const IntArrayRef output_padding_arg, - const IntArrayRef dilation_arg, - const int64_t groups, - const c10::optional& output_min, - const c10::optional& output_max) { - return TransposeConv2dOpContext{conv2d_transpose_context_create( - weight, - bias, - stride_arg, - padding_arg, - output_padding_arg, - 
dilation_arg, - groups, - output_min, - output_max)}; -} - -Tensor TransposeConv2dOpContext::run(const Tensor& input_arg) const { - return conv2d_transpose_context_run( - input_arg, vulkan_context_.get_packed(), vulkan_context_.get_unpacked()); -} - -TransposeConv2dOpContext::State TransposeConv2dOpContext::unpack() const { - const c10::impl::GenericList unpacked_ = - std::get<1>(vulkan_context_.get_state()); - const Tensor unpacked_weight = unpacked_.get(0).toTensor(); - const c10::optional unpacked_bias = unpacked_.get(1).isTensor() - ? unpacked_.get(1).toTensor() - : (c10::optional&)c10::nullopt; - const std::vector unpacked_stride = unpacked_.get(3).toIntVector(); - const std::vector unpacked_padding = unpacked_.get(4).toIntVector(); - const std::vector unpacked_output_padding = - unpacked_.get(5).toIntVector(); - const std::vector unpacked_dilation = unpacked_.get(6).toIntVector(); - const int64_t unpacked_groups = unpacked_.get(7).toInt(); - const c10::optional unpacked_output_min = unpacked_.get(6).isScalar() - ? unpacked_.get(8).toScalar() - : (c10::optional)c10::nullopt; - const c10::optional unpacked_output_max = unpacked_.get(6).isScalar() - ? unpacked_.get(9).toScalar() - : (c10::optional)c10::nullopt; - return TransposeConv2dOpContext::State{ - unpacked_weight, - unpacked_bias, - unpacked_stride, - unpacked_padding, - unpacked_output_padding, - unpacked_dilation, - unpacked_groups, - unpacked_output_min, - unpacked_output_max, - }; -} - -c10::intrusive_ptr conv2d_transpose_clamp_prepack( - Tensor&& weight, - c10::optional&& bias, - std::vector&& stride, - std::vector&& padding, - std::vector&& output_padding, - std::vector&& dilation, - const int64_t groups, - const c10::optional& output_min, - const c10::optional& output_max) { - return c10::make_intrusive( - TransposeConv2dOpContext::create( - std::move(weight), - std::move(bias), - std::move(stride), - std::move(padding), - std::move(output_padding), - std::move(dilation), - groups, - output_min, - output_max)); -} - -Tensor conv2d_transpose_clamp_run( - const Tensor& input, - const c10::intrusive_ptr& context) { - return context->run(input); -} - -} // namespace ops -} // namespace vulkan -} // namespace native -} // namespace at diff --git a/aten/src/ATen/native/vulkan/ops/TransposeConvolution2d.h b/aten/src/ATen/native/vulkan/ops/TransposeConvolution2d.h deleted file mode 100644 index b440e243b57d0..0000000000000 --- a/aten/src/ATen/native/vulkan/ops/TransposeConvolution2d.h +++ /dev/null @@ -1,125 +0,0 @@ -#pragma once - -#ifdef USE_VULKAN_API - -#include -#include - -namespace at { -namespace native { -namespace vulkan { -namespace ops { - -enum TransposeConv2dMethod { - TransposeConv2dSlidingWindow, -}; - -// packed -// vTensor v_weight -// vTensor v_bias -// std::array filter -// std::array stride -// std::array padding -// std::array output_padding -// std::array dilation -// int32_t groups -// float output_min -// float output_max - -// unpacked -// Tensor weight -// c10::optional bias -// std::vector filter -// std::vector stride -// std::vector padding -// std::vector output_padding -// std::vector dilation -// int64_t groups -// c10::optional output_min -// c10::optional output_max - -Tensor conv2d_transpose_context_run( - const Tensor& input_arg, - const c10::impl::GenericList& packed_context, - const c10::impl::GenericList& unpacked_context); - -VulkanOpContext conv2d_transpose_context_create( - const Tensor& weight, - const c10::optional& bias, - const IntArrayRef stride_arg, - const IntArrayRef padding_arg, 
- const IntArrayRef output_padding_arg, - const IntArrayRef dilation_arg, - const int64_t groups, - const c10::optional& output_min = c10::nullopt, - const c10::optional& output_max = c10::nullopt); - -Tensor run_conv2d_transpose_clamp_context( - const Tensor& input, - const c10::intrusive_ptr& context); - -c10::intrusive_ptr create_conv2d_transpose_clamp_context( - Tensor&& weight, - c10::optional&& bias, - std::vector&& stride, - std::vector&& padding, - std::vector&& output_padding, - std::vector&& dilation, - const int64_t groups, - const c10::optional& output_min, - const c10::optional& output_max); - -// Backwards compatibility -class TransposeConv2dOpContext final : public torch::jit::CustomClassHolder { - public: - static TransposeConv2dOpContext create( - const Tensor& weight, - const c10::optional& bias, - IntArrayRef stride, - IntArrayRef padding, - IntArrayRef output_padding, - IntArrayRef dilation, - int64_t groups, - const c10::optional& output_min = c10::nullopt, - const c10::optional& output_max = c10::nullopt); - - using State = std::tuple< - Tensor, - c10::optional, - std::vector, - std::vector, - std::vector, - std::vector, - int64_t, - c10::optional, - c10::optional>; - - Tensor run(const Tensor& input) const; - State unpack() const; - - private: - explicit TransposeConv2dOpContext(VulkanOpContext vulkan_context); - VulkanOpContext vulkan_context_; -}; - -Tensor conv2d_transpose_clamp_run( - const Tensor& input, - const c10::intrusive_ptr& context); - -c10::intrusive_ptr conv2d_transpose_clamp_prepack( - Tensor&& weight, - c10::optional&& bias, - std::vector&& stride, - std::vector&& padding, - std::vector&& output_padding, - std::vector&& dilation, - const int64_t groups, - const c10::optional& output_min, - const c10::optional& output_max); - -} // namespace ops -} // namespace vulkan -} // namespace native -} // namespace at - -#endif /* USE_VULKAN_API */ diff --git a/aten/src/ATen/native/vulkan/ops/Utils.cpp b/aten/src/ATen/native/vulkan/ops/Utils.cpp index 0a255f9915bd3..0ad893db11a7d 100644 --- a/aten/src/ATen/native/vulkan/ops/Utils.cpp +++ b/aten/src/ATen/native/vulkan/ops/Utils.cpp @@ -6,6 +6,156 @@ namespace vulkan { namespace ops { namespace utils { +/* + * This function formats an input tensor in NCHW layout to NC4HW layout such + * that the buffer of the formatted tensor can be directly copied into a GPU + * texture. Conceptually, the formatting can be achieved via the following + * steps: + * + * 1. Given that the src tensor has size {N,C,H,W} + * + * 2. Combine the batch and channel dims by reshaping to {N*C, H, W} + * + * 3. Determine the amount of padding to add: determine how many channels to add + * in order to align N*C to the next multiple of 4 + * + * 4. Add padding to the tensor so that the batch-channel dimension is a + * multiple of four; the shape of the tensor is now {NC_aligned, H, W} + * + * 5. Split the batch-channel dimension into groups of 4 by reshaping the tensor + * to size {NC_aligned/4, 4, H, W} + * + * 6. The groups of 4 channels (dim 1) should be contiguous. Therefore, permute + * the dims of the tensor in the order {0, 2, 3, 1} + * + * 7. Finally, return a contiguous version of the tensor. 
The final shape of the
+ * tensor would be {NC_aligned/4, H, W, 4}
+ */
+Tensor nchw_to_nc4hw(const Tensor& src) {
+  uint32_t N = get_dim(src.sizes());
+  uint32_t C = get_dim(src.sizes());
+  uint32_t H = get_dim(src.sizes());
+  uint32_t W = get_dim(src.sizes());
+
+  uint32_t NC4 = api::utils::div_up(N * C, 4u);
+  uint32_t NC_aligned = api::utils::align_up(N * C, 4u);
+
+  // Add padding to the tensor so that the batch-channel dim is a multiple of 4
+  Tensor padding = at::zeros({NC_aligned - N * C, H, W}, src.options());
+  Tensor src_padded = at::cat({src.reshape({N * C, H, W}), padding});
+  // Reshape to group channels into groups of 4 and permute so that the groups
+  // are in the first dimension so that they are contiguous
+  Tensor src_NC4HW = src_padded.reshape({NC4, 4, H, W}).permute({0, 2, 3, 1});
+
+  // Return a contiguous version of the tensor
+  return src_NC4HW.contiguous();
+}
+
+/*
+ * Creates a staging tensor into which texture data, which will be in NC4HW
+ * format, can be copied directly. The shape of the staging tensor will be the
+ * same as the tensor produced by a call to nchw_to_nc4hw().
+ */
+Tensor create_staging_tensor(const vTensor& v_in) {
+  uint32_t N = get_dim(v_in.sizes());
+  uint32_t C = get_dim(v_in.sizes());
+  uint32_t H = get_dim(v_in.sizes());
+  uint32_t W = get_dim(v_in.sizes());
+
+  uint32_t NC4 = api::utils::div_up(N * C, 4u);
+
+  // Note that the dtype corresponding to the texture format of the vTensor is
+  // used instead of options().dtype(). This is to ensure the number of bytes in
+  // the staging tensor matches the number of bytes in the image texture. Refer
+  // to comments for api::vk_format()
+  return at::empty(
+      {NC4, H, W, 4}, at::device(at::kCPU).dtype(v_in.texture_dtype()));
+}
+
+/*
+ * After copying texture data, which will be in NC4HW format, to a staging
+ * tensor created in create_staging_tensor(), this function reformats the tensor
+ * to NCHW format. It essentially reverses the transformations made by
+ * nchw_to_nc4hw().
+ *
+ * Note that the sizes of the original tensor must be passed in to fully restore
+ * the properties of the original tensor.
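+ *
+ * A minimal round-trip sketch (the shapes below are illustrative, not taken
+ * from this patch), using only the two helpers defined in this file:
+ *
+ *   at::Tensor t = at::rand({1, 6, 2, 2});      // NCHW, so N*C == 6
+ *   at::Tensor packed = nchw_to_nc4hw(t);       // {2, 2, 2, 4}; two zero
+ *                                               // channels pad N*C up to 8
+ *   at::Tensor restored = nc4hw_to_nchw(packed, t.sizes());
+ *   // restored should compare equal to t: the padding channels are narrowed
+ *   // away and the original sizes are restored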
+ */ +Tensor nc4hw_to_nchw(const Tensor& t_in, IntArrayRef sizes) { + uint32_t N = get_dim(sizes); + uint32_t C = get_dim(sizes); + uint32_t H = get_dim(sizes); + uint32_t W = get_dim(sizes); + + uint32_t NC_aligned = api::utils::align_up(N * C, 4u); + + // Undo the permute step and channel grouping step + Tensor t_in_padded = t_in.permute({0, 3, 1, 2}).reshape({NC_aligned, H, W}); + // Remove the padding channels + Tensor t_in_shaved = + at::narrow(t_in_padded, /*dim=*/0, /*start*/ 0, /*end*/ N * C); + + // Reshape to original sizing and dtype and return a contiguous Tensor + return t_in_shaved.reshape(sizes).contiguous(); +} + +void copy_buffer_to_vtensor( + api::VulkanBuffer& src_buffer, + vTensor& v_dst, + api::PipelineBarrier& pipeline_barrier) { + api::Context* const context = api::context(); + + TORCH_CHECK( + src_buffer.mem_size() == v_dst.buffer_bytes(), + "Vulkan copy_buffer_to_vtensor: source buffer and destination texture " + "do not have the same number of bytes"); + + context->submit_copy( + // pipeline barrier + pipeline_barrier, + // resources + src_buffer, + v_dst.image( + pipeline_barrier, + api::PipelineStage::TRANSFER, + api::MemoryAccessType::WRITE), + // copy details + v_dst.extents(), + {0u, 0u, 0u}, + {0u, 0u, 0u}, + // fence handle + VK_NULL_HANDLE); +} + +void copy_vtensor_to_buffer( + vTensor& v_src, + api::VulkanBuffer& dst_buffer, + api::PipelineBarrier& pipeline_barrier, + const VkFence fence_handle) { + api::Context* const context = api::context(); + + TORCH_CHECK( + v_src.buffer_bytes() == dst_buffer.mem_size(), + "Vulkan copy_vtensor_to_buffer: source texture and destination buffer " + "do not have the same number of bytes"); + + context->submit_copy( + // pipeline barrier + pipeline_barrier, + // resources + v_src.image( + pipeline_barrier, + api::PipelineStage::TRANSFER, + api::MemoryAccessType::READ), + dst_buffer, + // copy details + v_src.extents(), + {0u, 0u, 0u}, + {0u, 0u, 0u}, + // fence handle + fence_handle); +} + void pack_buffer_to_vtensor( api::VulkanBuffer& buffer, vTensor& v_self, @@ -85,17 +235,21 @@ void pack_vtensor_to_staging( }, }; + bool is_quantized = v_self.is_quantized(); + + api::utils::uvec3 pack_extents = extents; + if (is_quantized) { + pack_extents.data[0u] = 1; + pack_extents.data[1u] = 1; + pack_extents.data[2u] = + api::utils::safe_downcast(v_self.numtexels()); + } + api::UniformParamsBuffer params(context, block); api::PipelineBarrier pipeline_barrier{}; - bool is_quantized = v_self.is_quantized(); - api::utils::uvec3 copy_extents; - copy_extents.data[0u] = 1; - copy_extents.data[1u] = 1; - copy_extents.data[2u] = - ((v_self.sizes()[1] * v_self.sizes()[2] * v_self.sizes()[3]) / 4); + api::ShaderSource kernel = is_quantized ? VK_KERNEL(image_to_nchw_quantized) : VK_KERNEL(image_to_nchw); - api::utils::uvec3 extents_to_use = is_quantized ? 
copy_extents : extents; context->submit_compute_job( // shader descriptor @@ -103,9 +257,9 @@ void pack_vtensor_to_staging( // pipeline barrier pipeline_barrier, // global work group size - extents_to_use, + pack_extents, // local work group size - adaptive_work_group_size(extents_to_use), + adaptive_work_group_size(pack_extents), // fence handle fence_handle, // shader arguments diff --git a/aten/src/ATen/native/vulkan/ops/Utils.h b/aten/src/ATen/native/vulkan/ops/Utils.h index 59358ee173eb0..f9b85521bab0a 100644 --- a/aten/src/ATen/native/vulkan/ops/Utils.h +++ b/aten/src/ATen/native/vulkan/ops/Utils.h @@ -10,6 +10,23 @@ namespace vulkan { namespace ops { namespace utils { +Tensor nchw_to_nc4hw(const Tensor&); + +Tensor create_staging_tensor(const vTensor&); + +Tensor nc4hw_to_nchw(const Tensor&, IntArrayRef); + +void copy_buffer_to_vtensor( + api::VulkanBuffer&, + vTensor&, + api::PipelineBarrier&); + +void copy_vtensor_to_buffer( + vTensor&, + api::VulkanBuffer&, + api::PipelineBarrier&, + const VkFence fence_handle = VK_NULL_HANDLE); + inline int64_t normalize(const int64_t dimension, const int64_t n) { return (dimension % n + n) % n; } diff --git a/aten/src/ATen/native/vulkan/ops/VulkanOpContext.cpp b/aten/src/ATen/native/vulkan/ops/VulkanOpContext.cpp deleted file mode 100644 index 58f07b0d43c4f..0000000000000 --- a/aten/src/ATen/native/vulkan/ops/VulkanOpContext.cpp +++ /dev/null @@ -1,34 +0,0 @@ -#include - -namespace at { -namespace native { -namespace vulkan { -namespace ops { - -VulkanOpContext::VulkanOpContext( - c10::impl::GenericList packed_context, - c10::impl::GenericList unpacked_context) - : packed_(packed_context), unpacked_(unpacked_context) {} - -VulkanOpContext VulkanOpContext::create( - c10::impl::GenericList packed_context, - c10::impl::GenericList unpacked_context) { - return VulkanOpContext{packed_context, unpacked_context}; -} - -VulkanOpContext::State VulkanOpContext::get_state() const { - return VulkanOpContext::State{packed_, unpacked_}; -} - -const c10::impl::GenericList& VulkanOpContext::get_packed() const { - return packed_; -} - -const c10::impl::GenericList& VulkanOpContext::get_unpacked() const { - return unpacked_; -} - -} // namespace ops -} // namespace vulkan -} // namespace native -} // namespace at diff --git a/aten/src/ATen/native/vulkan/ops/VulkanOpContext.h b/aten/src/ATen/native/vulkan/ops/VulkanOpContext.h deleted file mode 100644 index 8907b486d50ca..0000000000000 --- a/aten/src/ATen/native/vulkan/ops/VulkanOpContext.h +++ /dev/null @@ -1,35 +0,0 @@ -#pragma once - -#ifdef USE_VULKAN_API - -#include - -namespace at { -namespace native { -namespace vulkan { -namespace ops { - -class VulkanOpContext final : public torch::jit::CustomClassHolder { - public: - static VulkanOpContext create( - c10::impl::GenericList packed_context, - c10::impl::GenericList unpacked_context); - using State = std::tuple; - State get_state() const; - const c10::impl::GenericList& get_packed() const; - const c10::impl::GenericList& get_unpacked() const; - - private: - VulkanOpContext( - c10::impl::GenericList packed_context, - c10::impl::GenericList unpacked_context); - c10::impl::GenericList packed_; - c10::impl::GenericList unpacked_; -}; - -} // namespace ops -} // namespace vulkan -} // namespace native -} // namespace at - -#endif /* USE_VULKAN_API */ diff --git a/aten/src/ATen/native/vulkan/ops/VulkanPackedContext.h b/aten/src/ATen/native/vulkan/ops/VulkanPackedContext.h new file mode 100644 index 0000000000000..f137bf5d5e785 --- /dev/null +++ 
b/aten/src/ATen/native/vulkan/ops/VulkanPackedContext.h @@ -0,0 +1,33 @@ +#pragma once + +#ifdef USE_VULKAN_API + +#include + +namespace at { +namespace native { +namespace vulkan { +namespace ops { + +class VulkanPackedContext { + protected: + c10::impl::GenericList packed_; + + public: + VulkanPackedContext() : packed_{c10::AnyType::get()} {} + + inline const c10::IValue get_val(int64_t i) const { + return packed_.get(i); + } + + virtual const c10::impl::GenericList unpack() const = 0; + + virtual ~VulkanPackedContext() = default; +}; + +} // namespace ops +} // namespace vulkan +} // namespace native +} // namespace at + +#endif /* USE_VULKAN_API */ diff --git a/aten/src/ATen/native/vulkan/ops/cumsum.cpp b/aten/src/ATen/native/vulkan/ops/cumsum.cpp index fd84d3304f396..679201532c21e 100644 --- a/aten/src/ATen/native/vulkan/ops/cumsum.cpp +++ b/aten/src/ATen/native/vulkan/ops/cumsum.cpp @@ -18,7 +18,8 @@ Tensor cumsum( input_arg.dim() <= 4, "Vulkan cumsum expects input dimension <= 4!"); TORCH_CHECK( - batch_size(input_arg) == 1, "Vulkan cumsum expects batch size <= 1!"); + get_dim(input_arg) == 1, + "Vulkan cumsum expects batch size <= 1!"); TORCH_CHECK(dim < 4, "Vulkan cumsum expects dim < 4!"); diff --git a/aten/src/ATen/templates/DispatchKeyFunctions_inl.h b/aten/src/ATen/templates/DispatchKeyFunctions_inl.h index 73bc1008a4f54..fbb71c2cb123c 100644 --- a/aten/src/ATen/templates/DispatchKeyFunctions_inl.h +++ b/aten/src/ATen/templates/DispatchKeyFunctions_inl.h @@ -18,10 +18,5 @@ ${DispatchKeyFunctions_inl_includes} -namespace at { -namespace ${dispatch_namespace} { ${dispatch_namespaced_declarations} - -} // namespace ${dispatch_namespace} -} // namespace at diff --git a/aten/src/ATen/templates/RegisterDispatchDefinitions.ini b/aten/src/ATen/templates/RegisterDispatchDefinitions.ini new file mode 100644 index 0000000000000..3bf7f9b1bb321 --- /dev/null +++ b/aten/src/ATen/templates/RegisterDispatchDefinitions.ini @@ -0,0 +1,24 @@ +${ns_prologue} + +// NB: TORCH_LIBRARY_IMPL must be in an anonymous namespace to avoid +// ambiguity with conflicting identifiers that may have been defined in +// at namespace already. +namespace { + +${dispatch_helpers} + +${dispatch_anonymous_definitions} + +${static_init_dispatch_registrations} + +} // anonymous namespace + +${deferred_dispatch_registrations} + +namespace ${dispatch_namespace} { + +${dispatch_namespaced_definitions} + +} // namespace ${dispatch_namespace} + +${ns_epilogue} diff --git a/aten/src/ATen/templates/RegisterDispatchKey.cpp b/aten/src/ATen/templates/RegisterDispatchKey.cpp index df00c0d0e4a32..7a1584d505f5a 100644 --- a/aten/src/ATen/templates/RegisterDispatchKey.cpp +++ b/aten/src/ATen/templates/RegisterDispatchKey.cpp @@ -50,28 +50,5 @@ $dispatch_headers $ops_headers - -namespace at { - -// NB: TORCH_LIBRARY_IMPL must be in an anonymous namespace to avoid -// ambiguity with conflicting identifiers that may have been defined in -// at namespace already. 
-namespace { - -${dispatch_helpers} - -${dispatch_anonymous_definitions} - -${static_init_dispatch_registrations} - -} // anonymous namespace - -${deferred_dispatch_registrations} - -namespace ${dispatch_namespace} { - -${dispatch_namespaced_definitions} - -} // namespace ${dispatch_namespace} - -} // namespace at +// See template file RegisterDispatchDefinitions.ini +$dispatch_definitions diff --git a/aten/src/ATen/test/cpu_generator_test.cpp b/aten/src/ATen/test/cpu_generator_test.cpp index db392b6ead260..6cf3431c66c0e 100644 --- a/aten/src/ATen/test/cpu_generator_test.cpp +++ b/aten/src/ATen/test/cpu_generator_test.cpp @@ -144,8 +144,8 @@ TEST(CPUGeneratorImpl, TestPhiloxEngineReproducibility) { // launch on same thread index and create two engines. // Given same seed, idx and offset, assert that the engines // should be aligned and have the same sequence. - at::Philox4_32_10 engine1(0, 0, 4); - at::Philox4_32_10 engine2(0, 0, 4); + at::Philox4_32 engine1(0, 0, 4); + at::Philox4_32 engine2(0, 0, 4); ASSERT_EQ(engine1(), engine2()); } @@ -156,11 +156,11 @@ TEST(CPUGeneratorImpl, TestPhiloxEngineOffset1) { // make another engine increment to until the // first 8 values. Assert that the first call // of engine2 and the 9th call of engine1 are equal. - at::Philox4_32_10 engine1(123, 1, 0); + at::Philox4_32 engine1(123, 1, 0); // Note: offset is a multiple of 4. // So if you want to skip 8 values, offset would // be 2, since 2*4=8. - at::Philox4_32_10 engine2(123, 1, 2); + at::Philox4_32 engine2(123, 1, 2); for (const auto i : c10::irange(8)) { (void)i; // Suppress unused variable warning // Note: instead of using the engine() call 8 times @@ -179,8 +179,8 @@ TEST(CPUGeneratorImpl, TestPhiloxEngineOffset2) { // make engine2 skip to the 2^64th 128 bit while being at 2^64th thread // Assert that engine2 should be increment_val+1 steps behind engine1. unsigned long long increment_val = std::numeric_limits::max(); - at::Philox4_32_10 engine1(123, 0, increment_val); - at::Philox4_32_10 engine2(123, increment_val, increment_val); + at::Philox4_32 engine1(123, 0, increment_val); + at::Philox4_32 engine2(123, increment_val, increment_val); engine2.incr_n(increment_val); engine2.incr(); @@ -195,8 +195,8 @@ TEST(CPUGeneratorImpl, TestPhiloxEngineOffset3) { // start engine2 at thread 1, with offset 0 // Assert that engine1 is 1 step behind engine2. unsigned long long increment_val = std::numeric_limits::max(); - at::Philox4_32_10 engine1(123, 0, increment_val); - at::Philox4_32_10 engine2(123, 1, 0); + at::Philox4_32 engine1(123, 0, increment_val); + at::Philox4_32 engine2(123, 1, 0); engine1.incr(); ASSERT_EQ(engine1(), engine2()); } @@ -206,8 +206,8 @@ TEST(CPUGeneratorImpl, TestPhiloxEngineIndex) { // Tests if thread indexing is working properly. // create two engines with different thread index but same offset. // Assert that the engines have different sequences. - at::Philox4_32_10 engine1(123456, 0, 4); - at::Philox4_32_10 engine2(123456, 1, 4); + at::Philox4_32 engine1(123456, 0, 4); + at::Philox4_32 engine2(123456, 1, 4); ASSERT_NE(engine1(), engine2()); } @@ -247,3 +247,19 @@ TEST(CPUGeneratorImpl, TestMT19937EngineReproducibility) { } } + +TEST(CPUGeneratorImpl, TestPhiloxEngineReproducibilityRandN) { + at::Philox4_32 engine1(0, 0, 4); + at::Philox4_32 engine2(0, 0, 4); + ASSERT_EQ(engine1.randn(1), engine2.randn(1)); +} + +TEST(CPUGeneratorImpl, TestPhiloxDeterministic) { + at::Philox4_32 engine1(0, 0, 4); + ASSERT_EQ(engine1(), 4013802324); // Determinism! 
+ ASSERT_EQ(engine1(), 2979262830); // Determinism! + + at::Philox4_32 engine2(10, 0, 1); + ASSERT_EQ(engine2(), 2007330488); // Determinism! + ASSERT_EQ(engine2(), 2354548925); // Determinism! +} diff --git a/aten/src/ATen/test/cuda_generator_test.cu b/aten/src/ATen/test/cuda_generator_test.cu index 1ea5c2ebb0077..f82db6de1d5b8 100644 --- a/aten/src/ATen/test/cuda_generator_test.cu +++ b/aten/src/ATen/test/cuda_generator_test.cu @@ -21,8 +21,8 @@ using namespace at; __global__ void testEngineReproducibility(){ int idx = blockIdx.x * blockDim.x + threadIdx.x; - at::Philox4_32_10 engine1(0, idx, 4); - at::Philox4_32_10 engine2(0, idx, 4); + at::Philox4_32 engine1(0, idx, 4); + at::Philox4_32 engine2(0, idx, 4); assert(engine1() == engine2()); } @@ -45,11 +45,11 @@ TEST(CUDAGeneratorImpl, TestPhiloxEngineReproducibility) { } __global__ void testEngineOffset1(){ - at::Philox4_32_10 engine1(123, 1, 0); + at::Philox4_32 engine1(123, 1, 0); // Note: offset is a multiple of 4. // So if you want to skip 8 values, offset would // be 2, since 2*4=8. - at::Philox4_32_10 engine2(123, 1, 2); + at::Philox4_32 engine2(123, 1, 2); for(int i = 0; i < 8; i++){ // Note: instead of using the engine() call 8 times // we could have achieved the same functionality by @@ -81,8 +81,8 @@ TEST(CUDAGeneratorImpl, TestPhiloxEngineOffset1) { __global__ void testEngineOffset2(){ unsigned long long increment_val = ::ldexp(1.0, 64); - at::Philox4_32_10 engine1(123, 0, increment_val); - at::Philox4_32_10 engine2(123, increment_val, increment_val); + at::Philox4_32 engine1(123, 0, increment_val); + at::Philox4_32 engine2(123, increment_val, increment_val); engine2.incr_n(increment_val); engine2.incr(); @@ -110,8 +110,8 @@ TEST(CUDAGeneratorImpl, TestPhiloxEngineOffset2) { __global__ void testEngineOffset3(){ unsigned long long increment_val = ::ldexp(1.0, 64); - at::Philox4_32_10 engine1(123, 0, increment_val); - at::Philox4_32_10 engine2(123, 1, 0); + at::Philox4_32 engine1(123, 0, increment_val); + at::Philox4_32 engine2(123, 1, 0); engine1.incr(); assert(engine1() == engine2()); } @@ -136,8 +136,8 @@ TEST(CUDAGeneratorImpl, TestPhiloxEngineOffset3) { } __global__ void testEngineThreadIndex(){ - at::Philox4_32_10 engine1(123456, 0, 4); - at::Philox4_32_10 engine2(123456, 1, 4); + at::Philox4_32 engine1(123456, 0, 4); + at::Philox4_32 engine2(123456, 1, 4); assert(engine1() != engine2()); } diff --git a/aten/src/ATen/test/vulkan_api_test.cpp b/aten/src/ATen/test/vulkan_api_test.cpp index 7276261738593..70abc0aa59281 100644 --- a/aten/src/ATen/test/vulkan_api_test.cpp +++ b/aten/src/ATen/test/vulkan_api_test.cpp @@ -4,6 +4,7 @@ #include #include #include +#include #include // TODO: These functions should move to a common place. 
@@ -248,6 +249,31 @@ class VulkanAPITest : public ::testing::Test { } }; +TEST_F(VulkanAPITest, copy_to_texture) { + at::Tensor test_tensors[] = { + // 4D + at::rand({7, 17, 134, 213}, at::TensorOptions(at::kCPU).dtype(at::kFloat)), + // 3D + at::rand({67, 134, 213}, at::TensorOptions(at::kCPU).dtype(at::kFloat)), + // 2D + at::rand({229, 213}, at::TensorOptions(at::kCPU).dtype(at::kFloat)), + // 1D + at::rand({1902}, at::TensorOptions(at::kCPU).dtype(at::kFloat)), + }; + + for (auto in_cpu : test_tensors) { + at::Tensor in_vk_copied = in_cpu.vulkan(); + at::Tensor out_copied = in_vk_copied.cpu(); + + const auto check_copy = almostEqual(out_copied, in_cpu); + + if(!check_copy) { + std::cout << "Copy failed on size " << in_cpu.sizes() + << "with dtype" << in_cpu.dtype() << std::endl; + } + } +} + TEST_F(VulkanAPITest, adaptive_avg_pool2d) { c10::InferenceMode mode; @@ -336,6 +362,43 @@ TEST_F(VulkanAPITest, add_broadcast2) { ASSERT_TRUE(check); } +TEST_F(VulkanAPITest, add_broadcast3) { + + const auto a_cpu = at::rand({3, 4, 41, 53}, at::device(at::kCPU).dtype(at::kFloat)); + const auto a_vulkan = a_cpu.vulkan(); + + const auto b_cpu = at::rand({1, 1, 41, 53}, at::device(at::kCPU).dtype(at::kFloat)); + const auto b_vulkan = b_cpu.vulkan(); + + const auto c_cpu = at::add(a_cpu, b_cpu, 2.5f); + const auto c_vulkan = at::add(a_vulkan, b_vulkan, 2.5f); + + const auto check = almostEqual(c_cpu, c_vulkan.cpu()); + if (!check) { + showRtol(c_cpu, c_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + +TEST_F(VulkanAPITest, add_broadcast4) { + const auto a_cpu = at::rand({3, 4, 41, 1}, at::device(at::kCPU).dtype(at::kFloat)); + const auto a_vulkan = a_cpu.vulkan(); + + const auto b_cpu = at::rand({1, 41, 53}, at::device(at::kCPU).dtype(at::kFloat)); + const auto b_vulkan = b_cpu.vulkan(); + + const auto c_cpu = at::add(a_cpu, b_cpu, 2.5f); + const auto c_vulkan = at::add(a_vulkan, b_vulkan, 2.5f); + + const auto check = almostEqual(c_cpu, c_vulkan.cpu()); + if (!check) { + showRtol(c_cpu, c_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + TEST_F(VulkanAPITest, add_) { auto a_cpu = at::rand({61, 17, 29, 83}, at::device(at::kCPU).dtype(at::kFloat)); auto a_vulkan = a_cpu.vulkan(); @@ -424,6 +487,69 @@ TEST_F(VulkanAPITest, add_scalar_) { ASSERT_TRUE(check); } +TEST_F(VulkanAPITest, add_scalar_wrapped) { + if (!at::is_vulkan_available()) { + return; + } + + const auto a_cpu = at::rand({13, 23, 59, 73}, at::device(at::kCPU).dtype(at::kFloat)); + const auto a_vulkan = a_cpu.vulkan(); + + const auto b_scalar = at::rand({1}, at::device(at::kCPU).dtype(at::kFloat)); + + const auto c_cpu = at::add(a_cpu, b_scalar, 2.1f); + const auto c_vulkan = at::add(a_vulkan, b_scalar, 2.1f); + + const auto check = almostEqual(c_cpu, c_vulkan.cpu()); + if (!check) { + showRtol(c_cpu, c_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + +TEST_F(VulkanAPITest, add_scalar_wrapped_) { + if (!at::is_vulkan_available()) { + return; + } + + auto a_cpu = at::rand({47, 2, 23, 97}, at::device(at::kCPU).dtype(at::kFloat)); + auto a_vulkan = a_cpu.vulkan(); + + const auto b_scalar = at::rand({1}, at::device(at::kCPU).dtype(at::kFloat)); + + a_cpu.add_(b_scalar, 2.1f); + a_vulkan.add_(b_scalar, 2.1f); + + const auto check = almostEqual(a_cpu, a_vulkan.cpu()); + if (!check) { + showRtol(a_cpu, a_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + +TEST_F(VulkanAPITest, add_to_scalar_wrapped) { + if (!at::is_vulkan_available()) { + return; + } + + const auto a = at::rand({1}, at::device(at::kCPU).dtype(at::kFloat)); + + const auto b_cpu = 
at::rand({11, 7, 139, 109}, at::device(at::kCPU).dtype(at::kFloat)); + const auto b_vulkan = b_cpu.vulkan(); + + const auto c_cpu = at::add(a, b_cpu, 2.1f); + const auto c_vulkan = at::add(a, b_vulkan, 2.1f); + + const auto check = almostEqual(c_cpu, c_vulkan.cpu()); + if (!check) { + showRtol(c_cpu, c_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + TEST_F(VulkanAPITest, addmm) { constexpr float alpha = 2.1f; constexpr float beta = 103.24; @@ -1033,6 +1159,42 @@ TEST_F(VulkanAPITest, div_broadcast2) { ASSERT_TRUE(check); } +TEST_F(VulkanAPITest, div_broadcast3) { + const auto a_cpu = at::rand({3, 4, 179, 221}, at::device(at::kCPU).dtype(at::kFloat))+0.01; + const auto a_vulkan = a_cpu.vulkan(); + + const auto b_cpu = at::rand({1, 1, 179, 221}, at::device(at::kCPU).dtype(at::kFloat))+0.01; + const auto b_vulkan = b_cpu.vulkan(); + + const auto c_cpu = at::div(a_cpu, b_cpu); + const auto c_vulkan = at::div(a_vulkan, b_vulkan); + + const auto check = almostEqual(c_cpu, c_vulkan.cpu()); + if (!check) { + showRtol(c_cpu, c_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + +TEST_F(VulkanAPITest, div_broadcast4) { + const auto a_cpu = at::rand({3, 4, 41, 1}, at::device(at::kCPU).dtype(at::kFloat)); + const auto a_vulkan = a_cpu.vulkan(); + + const auto b_cpu = at::rand({1, 41, 53}, at::device(at::kCPU).dtype(at::kFloat)); + const auto b_vulkan = b_cpu.vulkan(); + + const auto c_cpu = at::div(a_cpu, b_cpu); + const auto c_vulkan = at::div(a_vulkan, b_vulkan); + + const auto check = almostEqual(c_cpu, c_vulkan.cpu()); + if (!check) { + showRtol(c_cpu, c_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + TEST_F(VulkanAPITest, div_) { auto a_cpu = at::rand({61, 17, 29, 83}, at::device(at::kCPU).dtype(at::kFloat))+0.01; auto a_vulkan = a_cpu.vulkan(); @@ -1122,6 +1284,69 @@ TEST_F(VulkanAPITest, div_scalar_) { ASSERT_TRUE(check); } +TEST_F(VulkanAPITest, div_scalar_wrapped) { + if (!at::is_vulkan_available()) { + return; + } + + const auto a_cpu = at::rand({17, 213, 213, 7}, at::device(at::kCPU).dtype(at::kFloat)); + const auto a_vulkan = a_cpu.vulkan(); + + const auto b_scalar = at::rand({1}, at::device(at::kCPU).dtype(at::kFloat)); + + const auto c_cpu = at::div(a_cpu, b_scalar); + const auto c_vulkan = at::div(a_vulkan, b_scalar); + + const auto check = almostEqual(c_cpu, c_vulkan.cpu()); + if (!check) { + showRtol(c_cpu, c_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + +TEST_F(VulkanAPITest, div_scalar_wrapped_) { + if (!at::is_vulkan_available()) { + return; + } + + auto a_cpu = at::rand({11, 7, 139, 109}, at::device(at::kCPU).dtype(at::kFloat)); + auto a_vulkan = a_cpu.vulkan(); + + const auto b_scalar = at::rand({1}, at::device(at::kCPU).dtype(at::kFloat)); + + a_cpu.div_(b_scalar); + a_vulkan.div_(b_scalar); + + const auto check = almostEqual(a_cpu, a_vulkan.cpu()); + if (!check) { + showRtol(a_cpu, a_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + +TEST_F(VulkanAPITest, div_to_scalar_wrapped) { + if (!at::is_vulkan_available()) { + return; + } + + const auto a = at::rand({1}, at::device(at::kCPU).dtype(at::kFloat)); + + const auto b_cpu = at::rand({2, 3, 5, 7}, at::device(at::kCPU).dtype(at::kFloat)); + const auto b_vulkan = b_cpu.vulkan(); + + const auto c_cpu = at::div(a, b_cpu); + const auto c_vulkan = at::div(a, b_vulkan); + + const auto check = almostEqual(c_cpu, c_vulkan.cpu()); + if (!check) { + showRtol(c_cpu, c_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + TEST_F(VulkanAPITest, empty) { ASSERT_NO_THROW(at::empty({1, 17, 41, 53}, at::device(at::kVulkan).dtype(at::kFloat))); @@ -1816,6 
+2041,42 @@ TEST_F(VulkanAPITest, mul_broadcast2) { ASSERT_TRUE(check); } +TEST_F(VulkanAPITest, mul_broadcast3) { + const auto a_cpu = at::rand({3, 4, 179, 221}, at::device(at::kCPU).dtype(at::kFloat)); + const auto a_vulkan = a_cpu.vulkan(); + + const auto b_cpu = at::rand({1, 1, 179, 221}, at::device(at::kCPU).dtype(at::kFloat)); + const auto b_vulkan = b_cpu.vulkan(); + + const auto c_cpu = at::mul(a_cpu, b_cpu); + const auto c_vulkan = at::mul(a_vulkan, b_vulkan); + + const auto check = almostEqual(c_cpu, c_vulkan.cpu()); + if (!check) { + showRtol(c_cpu, c_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + +TEST_F(VulkanAPITest, mul_broadcast4) { + const auto a_cpu = at::rand({3, 4, 179, 1}, at::device(at::kCPU).dtype(at::kFloat)); + const auto a_vulkan = a_cpu.vulkan(); + + const auto b_cpu = at::rand({1, 179, 221}, at::device(at::kCPU).dtype(at::kFloat)); + const auto b_vulkan = b_cpu.vulkan(); + + const auto c_cpu = at::mul(a_cpu, b_cpu); + const auto c_vulkan = at::mul(a_vulkan, b_vulkan); + + const auto check = almostEqual(c_cpu, c_vulkan.cpu()); + if (!check) { + showRtol(c_cpu, c_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + TEST_F(VulkanAPITest, mul_) { auto a_cpu = at::rand({61, 17, 29, 83}, at::device(at::kCPU).dtype(at::kFloat)); auto a_vulkan = a_cpu.vulkan(); @@ -1904,6 +2165,69 @@ TEST_F(VulkanAPITest, mul_scalar_) { ASSERT_TRUE(check); } +TEST_F(VulkanAPITest, mul_scalar_wrapped) { + if (!at::is_vulkan_available()) { + return; + } + + const auto a_cpu = at::rand({17, 213, 213, 7}, at::device(at::kCPU).dtype(at::kFloat)); + const auto a_vulkan = a_cpu.vulkan(); + + const auto b_scalar = at::rand({1}, at::device(at::kCPU).dtype(at::kFloat)); + + const auto c_cpu = at::mul(a_cpu, b_scalar); + const auto c_vulkan = at::mul(a_vulkan, b_scalar); + + const auto check = almostEqual(c_cpu, c_vulkan.cpu()); + if (!check) { + showRtol(c_cpu, c_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + +TEST_F(VulkanAPITest, mul_scalar_wrapped_) { + if (!at::is_vulkan_available()) { + return; + } + + auto a_cpu = at::rand({11, 7, 139, 109}, at::device(at::kCPU).dtype(at::kFloat)); + auto a_vulkan = a_cpu.vulkan(); + + const auto b_scalar = at::rand({1}, at::device(at::kCPU).dtype(at::kFloat)); + + a_cpu.mul_(b_scalar); + a_vulkan.mul_(b_scalar); + + const auto check = almostEqual(a_cpu, a_vulkan.cpu()); + if (!check) { + showRtol(a_cpu, a_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + +TEST_F(VulkanAPITest, mul_to_scalar_wrapped) { + if (!at::is_vulkan_available()) { + return; + } + + const auto a = at::rand({1}, at::device(at::kCPU).dtype(at::kFloat)); + + const auto b_cpu = at::rand({11, 7, 139, 109}, at::device(at::kCPU).dtype(at::kFloat)); + const auto b_vulkan = b_cpu.vulkan(); + + const auto c_cpu = at::mul(a, b_cpu); + const auto c_vulkan = at::mul(a, b_vulkan); + + const auto check = almostEqual(c_cpu, c_vulkan.cpu()); + if (!check) { + showRtol(c_cpu, c_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + TEST_F(VulkanAPITest, reflection_pad2d) { const auto a_cpu = at::rand({2, 3, 47, 63}, at::device(at::kCPU).dtype(at::kFloat)); const auto a_vulkan = a_cpu.vulkan(); @@ -2182,6 +2506,42 @@ TEST_F(VulkanAPITest, sub_broadcast2) { ASSERT_TRUE(check); } +TEST_F(VulkanAPITest, sub_broadcast3) { + const auto a_cpu = at::rand({3, 4, 179, 221}, at::device(at::kCPU).dtype(at::kFloat)); + const auto a_vulkan = a_cpu.vulkan(); + + const auto b_cpu = at::rand({1, 1, 179, 221}, at::device(at::kCPU).dtype(at::kFloat)); + const auto b_vulkan = b_cpu.vulkan(); + + const auto c_cpu = at::sub(a_cpu, 
b_cpu, 2.5f); + const auto c_vulkan = at::sub(a_vulkan, b_vulkan, 2.5f); + + const auto check = almostEqual(c_cpu, c_vulkan.cpu()); + if (!check) { + showRtol(c_cpu, c_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + +TEST_F(VulkanAPITest, sub_broadcast4) { + const auto a_cpu = at::rand({3, 4, 179, 1}, at::device(at::kCPU).dtype(at::kFloat)); + const auto a_vulkan = a_cpu.vulkan(); + + const auto b_cpu = at::rand({1, 179, 221}, at::device(at::kCPU).dtype(at::kFloat)); + const auto b_vulkan = b_cpu.vulkan(); + + const auto c_cpu = at::sub(a_cpu, b_cpu, 2.5f); + const auto c_vulkan = at::sub(a_vulkan, b_vulkan, 2.5f); + + const auto check = almostEqual(c_cpu, c_vulkan.cpu()); + if (!check) { + showRtol(c_cpu, c_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + TEST_F(VulkanAPITest, sub_) { auto a_cpu = at::rand({61, 17, 29, 83}, at::device(at::kCPU).dtype(at::kFloat)); auto a_vulkan = a_cpu.vulkan(); @@ -2236,6 +2596,111 @@ TEST_F(VulkanAPITest, sub_broadcast1_) { ASSERT_TRUE(check); } +TEST_F(VulkanAPITest, sub_scalar) { + if (!at::is_vulkan_available()) { + return; + } + + const auto a_cpu = at::rand({13, 23, 59, 73}, at::device(at::kCPU).dtype(at::kFloat)); + const auto a_vulkan = a_cpu.vulkan(); + + const float b_scalar = 3.1415f; + + const auto c_cpu = at::sub(a_cpu, b_scalar, 2.1f); + const auto c_vulkan = at::sub(a_vulkan, b_scalar, 2.1f); + + const auto check = almostEqual(c_cpu, c_vulkan.cpu()); + if (!check) { + showRtol(c_cpu, c_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + +TEST_F(VulkanAPITest, sub_scalar_) { + if (!at::is_vulkan_available()) { + return; + } + + auto a_cpu = at::rand({47, 2, 23, 97}, at::device(at::kCPU).dtype(at::kFloat)); + auto a_vulkan = a_cpu.vulkan(); + + const float b_scalar = 3.1415f; + + a_cpu.sub_(b_scalar, 2.1f); + a_vulkan.sub_(b_scalar, 2.1f); + + const auto check = almostEqual(a_cpu, a_vulkan.cpu()); + if (!check) { + showRtol(a_cpu, a_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + +TEST_F(VulkanAPITest, sub_scalar_wrapped) { + if (!at::is_vulkan_available()) { + return; + } + + const auto a_cpu = at::rand({13, 23, 59, 73}, at::device(at::kCPU).dtype(at::kFloat)); + const auto a_vulkan = a_cpu.vulkan(); + + const auto b_scalar = at::rand({1}, at::device(at::kCPU).dtype(at::kFloat)); + + const auto c_cpu = at::sub(a_cpu, b_scalar, 2.1f); + const auto c_vulkan = at::sub(a_vulkan, b_scalar, 2.1f); + + const auto check = almostEqual(c_cpu, c_vulkan.cpu()); + if (!check) { + showRtol(c_cpu, c_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + +TEST_F(VulkanAPITest, sub_scalar_wrapped_) { + if (!at::is_vulkan_available()) { + return; + } + + auto a_cpu = at::rand({47, 2, 23, 97}, at::device(at::kCPU).dtype(at::kFloat)); + auto a_vulkan = a_cpu.vulkan(); + + const auto b_scalar = at::rand({1}, at::device(at::kCPU).dtype(at::kFloat)); + + a_cpu.sub_(b_scalar, 2.1f); + a_vulkan.sub_(b_scalar, 2.1f); + + const auto check = almostEqual(a_cpu, a_vulkan.cpu()); + if (!check) { + showRtol(a_cpu, a_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + +TEST_F(VulkanAPITest, sub_to_scalar_wrapped) { + if (!at::is_vulkan_available()) { + return; + } + + const auto a = at::rand({1}, at::device(at::kCPU).dtype(at::kFloat)); + + const auto b_cpu = at::rand({11, 7, 139, 109}, at::device(at::kCPU).dtype(at::kFloat)); + const auto b_vulkan = b_cpu.vulkan(); + + const auto c_cpu = at::sub(a, b_cpu, 2.1f); + const auto c_vulkan = at::sub(a, b_vulkan, 2.1f); + + const auto check = almostEqual(c_cpu, c_vulkan.cpu()); + if (!check) { + showRtol(c_cpu, c_vulkan.cpu()); + } + + 
ASSERT_TRUE(check); +} + TEST_F(VulkanAPITest, transposed_conv2d) { // Arrange constexpr int64_t groups = 1; @@ -3368,13 +3833,15 @@ TEST_F(VulkanAPITest, gru_success) { const int H_in = 5; // input_size const int H_out = 7; // hidden_size const int num_layers = 3; + const int L = 1; + const int N = 1; const double gru_dropout = .0; const bool has_biases = true; const bool train = false; const bool bidirectional = false; const bool batch_first = true; - const auto in_cpu = at::rand({1, 1, H_in}, at::device(at::kCPU).dtype(at::kFloat)); - const auto h0_cpu = at::rand({num_layers, 1, H_out}, at::device(at::kCPU).dtype(at::kFloat)); + const auto in_cpu = at::rand({N, L, H_in}, at::device(at::kCPU).dtype(at::kFloat)); + const auto h0_cpu = at::rand({num_layers, N, H_out}, at::device(at::kCPU).dtype(at::kFloat)); c10::List weight_ih_l; // shape (3 * hidden_size, input_size) c10::List weight_hh_l; // shape (3 * hidden_size, hidden_size) @@ -3435,13 +3902,15 @@ TEST_F(VulkanAPITest, gru_mclareninputs_success) { const int H_in = 384; // input_size const int H_out = 384; // hidden_size const int num_layers = 2; + const int L = 1; + const int N = 1; const double gru_dropout = .0; const bool has_biases = true; const bool train = false; const bool bidirectional = false; const bool batch_first = true; - const auto in_cpu = at::rand({1, 1, H_in}, at::device(at::kCPU).dtype(at::kFloat)); - const auto h0_cpu = at::rand({num_layers, 1, H_out}, at::device(at::kCPU).dtype(at::kFloat)); + const auto in_cpu = at::rand({N, L, H_in}, at::device(at::kCPU).dtype(at::kFloat)); + const auto h0_cpu = at::rand({num_layers, N, H_out}, at::device(at::kCPU).dtype(at::kFloat)); c10::List weight_ih_l; // shape (3 * hidden_size, input_size) c10::List weight_hh_l; // shape (3 * hidden_size, hidden_size) @@ -3498,13 +3967,15 @@ TEST_F(VulkanAPITest, gru_invalidinputs_exceptions) { const int H_in = 17; // input_size const int H_out = 50; // hidden_size const int num_layers = 2; + const int L = 5; + const int N = 4; const double gru_dropout = .0; const bool has_biases = true; const bool train = false; const bool bidirectional = false; const bool batch_first = true; - const auto in_cpu = at::rand({1, 1, H_in}, at::device(at::kCPU).dtype(at::kFloat)); - const auto h0_cpu = at::rand({num_layers, 1, H_out}, at::device(at::kCPU).dtype(at::kFloat)); + const auto in_cpu = at::rand({N, L, H_in}, at::device(at::kCPU).dtype(at::kFloat)); + const auto h0_cpu = at::rand({num_layers, N, H_out}, at::device(at::kCPU).dtype(at::kFloat)); c10::List weight_ih_l; // shape (3 * hidden_size, input_size) c10::List weight_hh_l; // shape (3 * hidden_size, hidden_size) @@ -3591,13 +4062,15 @@ TEST_F(VulkanAPITest, gru_prepack_success) { const int H_in = 81; // input_size const int H_out = 10; // hidden_size const int num_layers = 2; + const int L = 1; + const int N = 1; const double gru_dropout = .0; const bool has_biases = true; const bool train = false; const bool bidirectional = false; const bool batch_first = true; - const auto in_cpu = at::rand({1, 1, H_in}, at::device(at::kCPU).dtype(at::kFloat)); - const auto h0_cpu = at::rand({num_layers, 1, H_out}, at::device(at::kCPU).dtype(at::kFloat)); + const auto in_cpu = at::rand({N, L, H_in}, at::device(at::kCPU).dtype(at::kFloat)); + const auto h0_cpu = at::rand({num_layers, N, H_out}, at::device(at::kCPU).dtype(at::kFloat)); c10::List weight_ih_l; // shape (3 * hidden_size, input_size) c10::List weight_hh_l; // shape (3 * hidden_size, hidden_size) @@ -3626,13 +4099,13 @@ TEST_F(VulkanAPITest, 
gru_prepack_success) { has_biases, num_layers, gru_dropout, train, bidirectional, batch_first); auto prepack = callOpByName( - "vulkan_prepack::gru_prepack", + "vulkan_prepack::create_gru_context", "", std::vector({ weight_ih_l.get(0), weight_hh_l.get(0), bias_ih_l.get(0), bias_hh_l.get(0), weight_ih_l.get(1), weight_hh_l.get(1), bias_ih_l.get(1), bias_hh_l.get(1) }), has_biases, num_layers, gru_dropout, train, bidirectional, batch_first); auto out_vulkan = callOpByName( - "vulkan_prepack::gru_run", + "vulkan_prepack::run_gru_context", "", in_cpu.vulkan(), h0_cpu.vulkan(), prepack[0]); @@ -3660,13 +4133,15 @@ TEST_F(VulkanAPITest, gru_prepack_invalidinputs_exceptions) { const int H_in = 70; // input_size const int H_out = 2; // hidden_size const int num_layers = 2; + const int L = 3; + const int N = 5; const double gru_dropout = .0; const bool has_biases = true; const bool train = false; const bool bidirectional = false; const bool batch_first = true; - const auto in_cpu = at::rand({1, 1, H_in}, at::device(at::kCPU).dtype(at::kFloat)); - const auto h0_cpu = at::rand({num_layers, 1, H_out}, at::device(at::kCPU).dtype(at::kFloat)); + const auto in_cpu = at::rand({N, L, H_in}, at::device(at::kCPU).dtype(at::kFloat)); + const auto h0_cpu = at::rand({num_layers, N, H_out}, at::device(at::kCPU).dtype(at::kFloat)); c10::List weight_ih_l; // shape (3 * hidden_size, input_size) c10::List weight_hh_l; // shape (3 * hidden_size, hidden_size) @@ -3692,7 +4167,7 @@ TEST_F(VulkanAPITest, gru_prepack_invalidinputs_exceptions) { // Act: incorrect # of weights/biases EXPECT_THROW({ auto prepack = callOpByName( - "vulkan_prepack::gru_prepack", + "vulkan_prepack::create_gru_context", "", std::vector({ weight_ih_l.get(0), weight_hh_l.get(0), bias_ih_l.get(0), bias_hh_l.get(0), weight_ih_l.get(1), weight_hh_l.get(1), bias_ih_l.get(1) }), @@ -3703,13 +4178,13 @@ TEST_F(VulkanAPITest, gru_prepack_invalidinputs_exceptions) { EXPECT_THROW({ const auto in_cpu_2d = at::rand({1, H_in}, at::device(at::kCPU).dtype(at::kFloat)); auto prepack = callOpByName( - "vulkan_prepack::gru_prepack", + "vulkan_prepack::create_gru_context", "", std::vector({ weight_ih_l.get(0), weight_hh_l.get(0), bias_ih_l.get(0), bias_hh_l.get(0), weight_ih_l.get(1), weight_hh_l.get(1), bias_ih_l.get(1), bias_hh_l.get(1) }), has_biases, num_layers, gru_dropout, train, bidirectional, batch_first); auto out_vulkan = callOpByName( - "vulkan_prepack::gru_run", + "vulkan_prepack::run_gru_context", "", in_cpu_2d.vulkan(), h0_cpu.vulkan(), prepack[0]); }, ::c10::Error); @@ -3718,13 +4193,13 @@ TEST_F(VulkanAPITest, gru_prepack_invalidinputs_exceptions) { EXPECT_THROW({ const auto h0_cpu_2d = at::rand({num_layers, H_out}, at::device(at::kCPU).dtype(at::kFloat)); auto prepack = callOpByName( - "vulkan_prepack::gru_prepack", + "vulkan_prepack::create_gru_context", "", std::vector({ weight_ih_l.get(0), weight_hh_l.get(0), bias_ih_l.get(0), bias_hh_l.get(0), weight_ih_l.get(1), weight_hh_l.get(1), bias_ih_l.get(1), bias_hh_l.get(1) }), has_biases, num_layers, gru_dropout, train, bidirectional, batch_first); auto out_vulkan = callOpByName( - "vulkan_prepack::gru_run", + "vulkan_prepack::run_gru_context", "", in_cpu.vulkan(), h0_cpu_2d.vulkan(), prepack[0]); }, ::c10::Error); @@ -3732,7 +4207,7 @@ TEST_F(VulkanAPITest, gru_prepack_invalidinputs_exceptions) { // Act: has_biases should be true EXPECT_THROW({ auto prepack = callOpByName( - "vulkan_prepack::gru_prepack", + "vulkan_prepack::create_gru_context", "", std::vector({ weight_ih_l.get(0), 
weight_hh_l.get(0), bias_ih_l.get(0), bias_hh_l.get(0), weight_ih_l.get(1), weight_hh_l.get(1), bias_ih_l.get(1), bias_hh_l.get(1) }), @@ -3742,7 +4217,7 @@ TEST_F(VulkanAPITest, gru_prepack_invalidinputs_exceptions) { // Act: train should be false EXPECT_THROW({ auto prepack = callOpByName( - "vulkan_prepack::gru_prepack", + "vulkan_prepack::create_gru_context", "", std::vector({ weight_ih_l.get(0), weight_hh_l.get(0), bias_ih_l.get(0), bias_hh_l.get(0), weight_ih_l.get(1), weight_hh_l.get(1), bias_ih_l.get(1), bias_hh_l.get(1) }), @@ -3752,7 +4227,7 @@ TEST_F(VulkanAPITest, gru_prepack_invalidinputs_exceptions) { // Act: bidirectional should be false EXPECT_THROW({ auto prepack = callOpByName( - "vulkan_prepack::gru_prepack", + "vulkan_prepack::create_gru_context", "", std::vector({ weight_ih_l.get(0), weight_hh_l.get(0), bias_ih_l.get(0), bias_hh_l.get(0), weight_ih_l.get(1), weight_hh_l.get(1), bias_ih_l.get(1), bias_hh_l.get(1) }), @@ -3762,17 +4237,21 @@ TEST_F(VulkanAPITest, gru_prepack_invalidinputs_exceptions) { // Act: batch_first should be true EXPECT_THROW({ auto prepack = callOpByName( - "vulkan_prepack::gru_prepack", + "vulkan_prepack::create_gru_context", "", std::vector({ weight_ih_l.get(0), weight_hh_l.get(0), bias_ih_l.get(0), bias_hh_l.get(0), weight_ih_l.get(1), weight_hh_l.get(1), bias_ih_l.get(1), bias_hh_l.get(1) }), has_biases, num_layers, gru_dropout, train, bidirectional, false); + auto out_vulkan = callOpByName( + "vulkan_prepack::run_gru_context", + "", + in_cpu.vulkan(), h0_cpu.vulkan(), prepack[0]); }, ::c10::Error); // Act: dropout should be 0.0 EXPECT_THROW({ auto prepack = callOpByName( - "vulkan_prepack::gru_prepack", + "vulkan_prepack::create_gru_context", "", std::vector({ weight_ih_l.get(0), weight_hh_l.get(0), bias_ih_l.get(0), bias_hh_l.get(0), weight_ih_l.get(1), weight_hh_l.get(1), bias_ih_l.get(1), bias_hh_l.get(1) }), diff --git a/aten/src/ATen/test/vulkan_quantized_api_test.cpp b/aten/src/ATen/test/vulkan_quantized_api_test.cpp index 9519b079d35e8..3b86b472fffdf 100644 --- a/aten/src/ATen/test/vulkan_quantized_api_test.cpp +++ b/aten/src/ATen/test/vulkan_quantized_api_test.cpp @@ -34,9 +34,11 @@ bool almostEqual(const at::Tensor& a, const at::Tensor& b) { return checkRtol(a - b, {a, b}); } +/* Unused function bool exactlyEqual(const at::Tensor& a, const at::Tensor& b) { return (a - b).abs().max().item() == 0.0f; } +*/ void showRtol(const at::Tensor& a, const at::Tensor& b) { const auto diff = (a - b).abs(); diff --git a/aten/src/ATen/test/xnnpack_test.cpp b/aten/src/ATen/test/xnnpack_test.cpp index 27e87545aa280..a936273bbcecc 100644 --- a/aten/src/ATen/test/xnnpack_test.cpp +++ b/aten/src/ATen/test/xnnpack_test.cpp @@ -3,11 +3,11 @@ #include #include -#include -#include -#include #include +#include #include +#include +#include #if defined(C10_MOBILE) && defined(USE_XNNPACK) @@ -31,7 +31,8 @@ void test_hardswish(const at::Tensor& input, const at::Tensor& expected) { auto result = at::native::xnnpack::hardswish(input); auto check = almostEqual(expected, result); ASSERT_TRUE(check); - ASSERT_TRUE(expected.suggest_memory_format() == input.suggest_memory_format()); + ASSERT_TRUE( + expected.suggest_memory_format() == input.suggest_memory_format()); } void test_hardswish_(at::Tensor input, const at::Tensor& expected) { @@ -39,7 +40,8 @@ void test_hardswish_(at::Tensor input, const at::Tensor& expected) { at::native::xnnpack::hardswish_(input); auto check = almostEqual(expected, input); ASSERT_TRUE(check); - ASSERT_TRUE(expected.suggest_memory_format() 
== input.suggest_memory_format()); + ASSERT_TRUE( + expected.suggest_memory_format() == input.suggest_memory_format()); } void test_global_average_pool(at::Tensor input, const at::Tensor& expected) { @@ -49,58 +51,133 @@ void test_global_average_pool(at::Tensor input, const at::Tensor& expected) { ASSERT_TRUE(check); } -// Since XNNPACK path is only taken #if defined(C10_MOBILE) && defined(USE_XNNPACK) -// We can't compare regular CPU path with XNNPACK path in the same test binary -// Instead we precompute regular results and compare with XNNPACK path here +// Since XNNPACK path is only taken #if defined(C10_MOBILE) && +// defined(USE_XNNPACK) We can't compare regular CPU path with XNNPACK path in +// the same test binary Instead we precompute regular results and compare with +// XNNPACK path here +TEST(TestXNNPackOps, TestLinear) { + constexpr std::array input_shape{1, 37}; + constexpr std::array weight_shape{41, 37}; + constexpr std::array bias_shape{1, 41}; + const auto input_cpu = + at::rand(input_shape, at::device(at::kCPU).dtype(at::kFloat)); + const auto weight = + at::rand(weight_shape, at::device(at::kCPU).dtype(at::kFloat)); + const auto bias = + at::rand(bias_shape, at::device(at::kCPU).dtype(at::kFloat)); + + const auto out_cpu = at::linear(input_cpu, weight, bias); + + const auto xnnpack_bias = bias.view({41}); + ASSERT_TRUE(at::native::xnnpack::use_linear(input_cpu, weight, xnnpack_bias)); + const auto result = + at::native::xnnpack::linear(input_cpu, weight, xnnpack_bias); + + auto check = almostEqual(out_cpu, result); + ASSERT_TRUE(check); +} + +TEST(TestXNNPackOps, TestMaxPool2d) { + const auto in_cpu = + at::rand({5, 13, 55, 68}, at::TensorOptions(at::kCPU).dtype(at::kFloat)); + const auto out_cpu = + at::max_pool2d(in_cpu, {3, 4}, {2, 1}, {1, 1}, {1, 1}, false); + ASSERT_TRUE(at::native::xnnpack::use_max_pool2d( + in_cpu, {3, 4}, {1, 1}, {2, 1}, {1, 1}, false)); + const auto result = at::native::xnnpack::max_pool2d( + in_cpu, {3, 4}, {1, 1}, {2, 1}, {1, 1}, false); + + auto check = almostEqual(out_cpu, result); + ASSERT_TRUE(check); +} + +TEST(TestXNNPackOps, TestConvolution2d) { + constexpr int64_t groups = 1; + constexpr std::array stride{2, 2}; + constexpr std::array padding{1, 1}; + constexpr std::array dilation{1, 1}; + + constexpr struct { + uint32_t batches; + uint32_t channels; + uint32_t width; + uint32_t height; + + std::array size() const { + return { + batches, + channels, + width, + height, + }; + } + } input{1, 3, 8, 8}; + + constexpr struct { + uint32_t output_channels; + uint32_t input_channels; + uint32_t width; + uint32_t height; + + std::array size() const { + return { + output_channels, + input_channels, + width, + height, + }; + } + } weights{1, input.channels, 3, 3}; + + const auto input_cpu = + at::randn(input.size(), at::device(at::kCPU).dtype(at::kFloat)); + const auto weights_cpu = + at::randn(weights.size(), at::device(at::kCPU).dtype(at::kFloat)); + const auto bias_cpu = at::randn( + {weights.output_channels}, at::device(at::kCPU).dtype(at::kFloat)); + + const auto output_cpu = at::conv2d( + input_cpu, weights_cpu, bias_cpu, stride, padding, dilation, groups); + + ASSERT_TRUE(at::native::xnnpack::use_convolution2d( + input_cpu, + weights_cpu, + weights.output_channels, + padding, + stride, + dilation, + groups, + false)); + const auto result = at::native::xnnpack::convolution2d( + input_cpu, weights_cpu, bias_cpu, padding, stride, dilation, groups); + auto check = almostEqual(output_cpu, result); + ASSERT_TRUE(check); +} + TEST(TestXNNPackOps, 
TestHardSwish) { // input, expected_result pair auto in = torch::tensor({{1, 1}, {1, 1}}, {torch::kFloat32}); auto in_slice = in.index({"...", 0}); std::vector> input_result_pairs = { - { - torch::tensor({1, 2, 3, 4, 5}, {torch::kFloat32}), - torch::tensor({0.6667, 1.6667, 3.0000, 4.0000, 5.0000}, {torch::kFloat32}) - }, - { - torch::tensor({0.3330}, {torch::kFloat32}), - torch::tensor({0.1850}, {torch::kFloat32}) - }, - { - torch::tensor({ - {0.4523, 0.8131, 0.9829}, - {0.0782, 0.7395, 0.0787} - }), - torch::tensor({ - {0.2602, 0.5167, 0.6525}, - {0.0401, 0.4609, 0.0404} - }) - }, - { - in_slice, - torch::tensor({0.6667, 0.6667}, {torch::kFloat32}) - }, - { - torch::tensor( - {{{{0.4993, 0.3835}, - {0.3163, 0.2348}}, - {{0.4705, 0.4129}, - {0.9314, 0.0631}}}, - {{{0.0030, 0.5656}, - {0.1413, 0.1943}}, - {{0.1380, 0.1985}, - {0.2746, 0.8109}}}}).contiguous(at::MemoryFormat::ChannelsLast), - torch::tensor( - {{{{0.2912, 0.2163}, - {0.1748, 0.1266}}, - {{0.2722, 0.2349}, - {0.6103, 0.0322}}}, - {{{0.0015, 0.3361}, - {0.0740, 0.1034}}, - {{0.0722, 0.1058}, - {0.1499, 0.5150}}}}).contiguous(at::MemoryFormat::ChannelsLast) - } - }; + {torch::tensor({1, 2, 3, 4, 5}, {torch::kFloat32}), + torch::tensor( + {0.6667, 1.6667, 3.0000, 4.0000, 5.0000}, {torch::kFloat32})}, + {torch::tensor({0.3330}, {torch::kFloat32}), + torch::tensor({0.1850}, {torch::kFloat32})}, + {torch::tensor({{0.4523, 0.8131, 0.9829}, {0.0782, 0.7395, 0.0787}}), + torch::tensor({{0.2602, 0.5167, 0.6525}, {0.0401, 0.4609, 0.0404}})}, + {in_slice, torch::tensor({0.6667, 0.6667}, {torch::kFloat32})}, + {torch::tensor({{{{0.4993, 0.3835}, {0.3163, 0.2348}}, + {{0.4705, 0.4129}, {0.9314, 0.0631}}}, + {{{0.0030, 0.5656}, {0.1413, 0.1943}}, + {{0.1380, 0.1985}, {0.2746, 0.8109}}}}) + .contiguous(at::MemoryFormat::ChannelsLast), + torch::tensor({{{{0.2912, 0.2163}, {0.1748, 0.1266}}, + {{0.2722, 0.2349}, {0.6103, 0.0322}}}, + {{{0.0015, 0.3361}, {0.0740, 0.1034}}, + {{0.0722, 0.1058}, {0.1499, 0.5150}}}}) + .contiguous(at::MemoryFormat::ChannelsLast)}}; for (const auto& input_result : input_result_pairs) { test_hardswish(input_result.first, input_result.second); @@ -111,42 +188,24 @@ TEST(TestXNNPackOps, TestHardSwish) { TEST(TestXNNPackOps, TestGlobal) { // input, expected_result pair std::vector> input_result_pairs = { - { - torch::tensor({{ - {{0.0852, 0.7312, 0.9943, 0.7105}, - {0.0956, 0.9072, 0.3124, 0.9362}, - {0.5878, 0.8883, 0.5086, 0.9494}}, - {{0.1056, 0.4968, 0.7740, 0.7593}, - {0.8519, 0.3543, 0.8078, 0.5517}, - {0.1413, 0.4608, 0.1706, 0.0314}} - }}, {torch::kFloat32}), - torch::tensor({{ - {{0.6422}}, - {{0.4588}} - }}, {torch::kFloat32}) - }, - { - torch::tensor({{ - {{0.0280, 0.9073}, - {0.2103, 0.5298}}, - {{0.5335, 0.9901}, - {0.2902, 0.2955}} - }, - { - {{0.2363, 0.7024}, - {0.7903, 0.8260}}, - {{0.3802, 0.5959}, - {0.5749, 0.8855}} - }}, {torch::kFloat32}), - torch::tensor( - {{{{0.4188}}, - {{0.5273}}}, - {{{0.6388}}, - {{0.6091}}}}, - {torch::kFloat32} - ) - } - }; + {torch::tensor( + {{{{0.0852, 0.7312, 0.9943, 0.7105}, + {0.0956, 0.9072, 0.3124, 0.9362}, + {0.5878, 0.8883, 0.5086, 0.9494}}, + {{0.1056, 0.4968, 0.7740, 0.7593}, + {0.8519, 0.3543, 0.8078, 0.5517}, + {0.1413, 0.4608, 0.1706, 0.0314}}}}, + {torch::kFloat32}), + torch::tensor({{{{0.6422}}, {{0.4588}}}}, {torch::kFloat32})}, + {torch::tensor( + {{{{0.0280, 0.9073}, {0.2103, 0.5298}}, + {{0.5335, 0.9901}, {0.2902, 0.2955}}}, + {{{0.2363, 0.7024}, {0.7903, 0.8260}}, + {{0.3802, 0.5959}, {0.5749, 0.8855}}}}, + {torch::kFloat32}), + torch::tensor( + 
{{{{0.4188}}, {{0.5273}}}, {{{0.6388}}, {{0.6091}}}}, + {torch::kFloat32})}}; for (const auto& input_result : input_result_pairs) { test_global_average_pool(input_result.first, input_result.second); diff --git a/benchmarks/cpp/nvfuser/CMakeLists.txt b/benchmarks/cpp/nvfuser/CMakeLists.txt index 5ada0fc30d4ed..ad9053bb3a3aa 100644 --- a/benchmarks/cpp/nvfuser/CMakeLists.txt +++ b/benchmarks/cpp/nvfuser/CMakeLists.txt @@ -20,13 +20,16 @@ if(USE_CUDA) softmax_backward.cpp scale_bias_relu.cpp transpose.cpp + matmul.cpp timm.cpp utils.cpp main.cpp) target_link_libraries(nvfuser_bench PRIVATE torch_library benchmark) if(NOT MSVC) - target_compile_options(nvfuser_bench PRIVATE -Wno-unused-variable -Wno-deprecated-copy -Werror) + target_compile_options_if_supported(nvfuser_bench -Werror) + target_compile_options_if_supported(nvfuser_bench -Wno-unused-variable) + target_compile_options_if_supported(nvfuser_bench -Wno-deprecated-copy) endif() endif() diff --git a/benchmarks/cpp/nvfuser/batch_norm_channels_first.cpp b/benchmarks/cpp/nvfuser/batch_norm_channels_first.cpp index 723d222516df4..2f839f0c8332a 100644 --- a/benchmarks/cpp/nvfuser/batch_norm_channels_first.cpp +++ b/benchmarks/cpp/nvfuser/batch_norm_channels_first.cpp @@ -73,10 +73,6 @@ static void NvFuserScheduler_BatchNorm( DataType dtype) { TORCH_INTERNAL_ASSERT(dtype == DataType::Float || dtype == DataType::Half); - const bool kTraining = true; - const float kMomentum = 0.1; - const float kEps = 1e-5; - std::vector input_shape{ benchmark_state.range(0), benchmark_state.range(1), diff --git a/benchmarks/cpp/nvfuser/batch_norm_channels_first_backward.cpp b/benchmarks/cpp/nvfuser/batch_norm_channels_first_backward.cpp index af2b4d145fc8f..62a4e99e21ef6 100644 --- a/benchmarks/cpp/nvfuser/batch_norm_channels_first_backward.cpp +++ b/benchmarks/cpp/nvfuser/batch_norm_channels_first_backward.cpp @@ -25,7 +25,6 @@ static void setupBatchNorm_BWD(Fusion* fusion, DataType dtype) { FusionGuard fg(fusion); const bool kTraining = true; - const float kMomentum = 0.1; const float kEps = 1e-5; // setup fusion @@ -85,9 +84,6 @@ static void NvFuserScheduler_BatchNorm_BWD( DataType dtype) { TORCH_INTERNAL_ASSERT(dtype == DataType::Float || dtype == DataType::Half); - const bool kTraining = true; - const float kEps = 1e-5; - std::vector input_shape{ benchmark_state.range(0), benchmark_state.range(1), diff --git a/benchmarks/cpp/nvfuser/batch_norm_channels_last.cpp b/benchmarks/cpp/nvfuser/batch_norm_channels_last.cpp index 14fde631aec0b..7b8972a0aad07 100644 --- a/benchmarks/cpp/nvfuser/batch_norm_channels_last.cpp +++ b/benchmarks/cpp/nvfuser/batch_norm_channels_last.cpp @@ -74,10 +74,6 @@ static void NvFuserScheduler_BatchNorm_nhwc( DataType dtype) { TORCH_INTERNAL_ASSERT(dtype == DataType::Float || dtype == DataType::Half); - const bool kTraining = true; - const float kMomentum = 0.1; - const float kEps = 1e-5; - std::vector input_shape{ benchmark_state.range(0), benchmark_state.range(2), diff --git a/benchmarks/cpp/nvfuser/batch_norm_channels_last_backward.cpp b/benchmarks/cpp/nvfuser/batch_norm_channels_last_backward.cpp index 0660b75e39426..29bcfb3e81be7 100644 --- a/benchmarks/cpp/nvfuser/batch_norm_channels_last_backward.cpp +++ b/benchmarks/cpp/nvfuser/batch_norm_channels_last_backward.cpp @@ -25,7 +25,6 @@ static void setupBatchNorm_nhwc_BWD(Fusion* fusion, DataType dtype) { FusionGuard fg(fusion); const bool kTraining = true; - const float kMomentum = 0.1; const float kEps = 1e-5; // setup fusion @@ -86,9 +85,6 @@ static void 
NvFuserScheduler_BatchNorm_nhwc_BWD( DataType dtype) { TORCH_INTERNAL_ASSERT(dtype == DataType::Float || dtype == DataType::Half); - const bool kTraining = true; - const float kEps = 1e-5; - std::vector input_shape{ benchmark_state.range(0), benchmark_state.range(2), diff --git a/benchmarks/cpp/nvfuser/bert.cpp b/benchmarks/cpp/nvfuser/bert.cpp index 06bcece52c8f5..05f0f490abb2e 100644 --- a/benchmarks/cpp/nvfuser/bert.cpp +++ b/benchmarks/cpp/nvfuser/bert.cpp @@ -140,7 +140,7 @@ static void MagicScheduler_DivMaxSoftDropFwd( fe.compileFusion(&fusion); fe.setMeasureKernelTimeFlag(true); // Sync everything up before we start - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { CudaKernelTimer timer; cg_outputs = fe.runFusion({t0, t1}, norm_params->lparams); @@ -148,7 +148,7 @@ static void MagicScheduler_DivMaxSoftDropFwd( } // Sync everything up before we're finished, don't want to run ahead on the // cpu while benchmarking. - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); int64_t bytes = 0; for (auto tensor : std::vector({t0, t1})) { @@ -200,7 +200,7 @@ static void MagicScheduler_DivMaxSoftDropBwd( fe.compileFusion(&fusion); fe.setMeasureKernelTimeFlag(true); // Sync everything up before we start - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { CudaKernelTimer timer; cg_outputs = fe.runFusion({t0, t1, t2, t3}, norm_params->lparams); @@ -208,7 +208,7 @@ static void MagicScheduler_DivMaxSoftDropBwd( } // Sync everything up before we're finished, don't want to run ahead on the // cpu while benchmarking. - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); int64_t bytes = 0; // Some reason t1 isn't used, ignore it. @@ -316,7 +316,7 @@ static void MagicScheduler_BiasDropoutAddLayernormFwd( fe.setMeasureKernelTimeFlag(true); // Sync everything up before we start - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { CudaKernelTimer timer; cg_outputs = fe.runFusion(at_inputs, norm_params->lparams); @@ -324,7 +324,7 @@ static void MagicScheduler_BiasDropoutAddLayernormFwd( } // Sync everything up before we're finished, don't want to run ahead on the // cpu while benchmarking. - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); int64_t bytes = 0; for (auto inp : at_inputs) { @@ -426,7 +426,7 @@ static void MagicScheduler_BiasDropoutAddLayernormBwd1( fe.setMeasureKernelTimeFlag(true); // Sync everything up before we start - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { clearL2Cache(); cg_outputs = fe.runFusion(at_inputs, norm_params->lparams); @@ -434,7 +434,7 @@ static void MagicScheduler_BiasDropoutAddLayernormBwd1( } // Sync everything up before we're finished, don't want to run ahead on the // cpu while benchmarking. - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); int64_t bytes = 0; for (auto inp : at_inputs) { @@ -537,7 +537,7 @@ static void MagicScheduler_BiasDropoutAddLayernormBwd2( fe.setMeasureKernelTimeFlag(true); // Sync everything up before we start - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { CudaKernelTimer timer; cg_outputs = fe.runFusion(at_inputs, norm_params->lparams); @@ -545,7 +545,7 @@ static void MagicScheduler_BiasDropoutAddLayernormBwd2( } // Sync everything up before we're finished, don't want to run ahead on the // cpu while benchmarking. 
- cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); int64_t bytes = 0; for (auto inp : at_inputs) { @@ -628,7 +628,7 @@ static void MagicScheduler_BiasDropoutAddLayernormBwd3( fe.setMeasureKernelTimeFlag(true); // Sync everything up before we start - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { CudaKernelTimer timer; cg_outputs = fe.runFusion(at_inputs, norm_params->lparams); @@ -636,7 +636,7 @@ static void MagicScheduler_BiasDropoutAddLayernormBwd3( } // Sync everything up before we're finished, don't want to run ahead on the // cpu while benchmarking. - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); int64_t bytes = 0; for (auto inp : at_inputs) { diff --git a/benchmarks/cpp/nvfuser/broadcast.cpp b/benchmarks/cpp/nvfuser/broadcast.cpp index 05e8e052f4b26..04b6b18bd6b74 100644 --- a/benchmarks/cpp/nvfuser/broadcast.cpp +++ b/benchmarks/cpp/nvfuser/broadcast.cpp @@ -77,7 +77,7 @@ static void NvFuserScheduler_Broadcast( fusion_executor_cache->profile(false); executor_instance->setMeasureKernelTimeFlag(true); // Sync everything up before we start - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { clearL2Cache(); auto cg_outputs = fusion_executor_cache->runFusionWithInputs({t0, t1}); @@ -86,7 +86,7 @@ static void NvFuserScheduler_Broadcast( } // Sync everything up before we're finished, don't want to run ahead on the // cpu while benchmarking. - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); benchmark_state.SetBytesProcessed( int64_t(benchmark_state.iterations()) * @@ -112,14 +112,14 @@ static void Baseline_Broadcast( // Sync everything up before we start clearL2Cache(); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { CudaKernelTimer timer; auto output = t0.add(t1.unsqueeze(bcast_dim)); benchmark_state.SetIterationTime(timer.elapsed() / 1000.0); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); clearL2Cache(); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); } benchmark_state.SetBytesProcessed( diff --git a/benchmarks/cpp/nvfuser/gelu_backward.cpp b/benchmarks/cpp/nvfuser/gelu_backward.cpp index 6632ba58a2365..732ad7f0ea0fd 100644 --- a/benchmarks/cpp/nvfuser/gelu_backward.cpp +++ b/benchmarks/cpp/nvfuser/gelu_backward.cpp @@ -113,9 +113,6 @@ BENCHMARK(GeluBackward_AutoSchedule)->Unit(benchmark::kMicrosecond); //------------------------------------------------------------------------------ static void GeluBackward_Lower(benchmark::State& benchmark_state) { - constexpr int kHiddenFeatures = 512; - constexpr int kBatchSize = 64; - Fusion fusion; // setup fusion @@ -173,11 +170,11 @@ static void GeluBackward_RunFusion(benchmark::State& benchmark_state) { FusionExecutor executor; executor.compileFusion(&fusion); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { outputs = executor.runFusion(c10::ArrayRef(inputs), lparams); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); clearL2Cache(); } } @@ -204,7 +201,7 @@ static void GeluBackward_RunFusion_GpuOnly(benchmark::State& benchmark_state) { executor.setMeasureKernelTimeFlag(true); executor.compileFusion(&fusion); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { outputs = executor.runFusion(c10::ArrayRef(inputs), lparams); diff --git 
a/benchmarks/cpp/nvfuser/heuristic_lookup.cpp b/benchmarks/cpp/nvfuser/heuristic_lookup.cpp index 64b1ecfb756d4..3bd4ec0b1607d 100644 --- a/benchmarks/cpp/nvfuser/heuristic_lookup.cpp +++ b/benchmarks/cpp/nvfuser/heuristic_lookup.cpp @@ -99,12 +99,15 @@ static void LayerNormBackward_HeuristicLookup( auto runtime = getLayerBackwardNormRuntime( std::move(fusion_ptr), fec, aten_inputs, shape, norm_shape); + + KernelArgumentHolder args = KernelArgumentHolder::createKernelArgumentHolder(aten_inputs); + TORCH_INTERNAL_ASSERT( - runtime->getMaybeHeuristicsFor(aten_inputs).has_value()); + runtime->getMaybeHeuristicsFor(args).has_value()); for (auto _ : benchmark_state) { // Setup (not included in the measurement) - runtime->getMaybeHeuristicsFor(aten_inputs); + runtime->getMaybeHeuristicsFor(args); } } @@ -152,12 +155,15 @@ static void LayerNormForward_HeuristicLookup( auto runtime = getLayerForwardNormRuntime( std::move(fusion_ptr), fec, aten_inputs, shape, norm_shape); + + KernelArgumentHolder args = KernelArgumentHolder::createKernelArgumentHolder(aten_inputs); + TORCH_INTERNAL_ASSERT( - runtime->getMaybeHeuristicsFor(aten_inputs).has_value()); + runtime->getMaybeHeuristicsFor(args).has_value()); for (auto _ : benchmark_state) { // Setup (not included in the measurement) - runtime->getMaybeHeuristicsFor(aten_inputs); + runtime->getMaybeHeuristicsFor(args); } } diff --git a/benchmarks/cpp/nvfuser/instance_norm.cpp b/benchmarks/cpp/nvfuser/instance_norm.cpp index a7139c113a43b..05475f1144743 100644 --- a/benchmarks/cpp/nvfuser/instance_norm.cpp +++ b/benchmarks/cpp/nvfuser/instance_norm.cpp @@ -165,7 +165,7 @@ static void Baseline_InstanceNorm( auto ato_running_var = c10::optional(at_var); clearL2Cache(); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { CudaKernelTimer timer; @@ -182,9 +182,9 @@ static void Baseline_InstanceNorm( auto output = at::relu(norm); benchmark_state.SetIterationTime(timer.elapsed() / 1000.0); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); clearL2Cache(); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); } const size_t kChannels = benchmark_state.range(2); diff --git a/benchmarks/cpp/nvfuser/layer_norm.cpp b/benchmarks/cpp/nvfuser/layer_norm.cpp index d793a45caa3c0..d2cff09e5d2ed 100644 --- a/benchmarks/cpp/nvfuser/layer_norm.cpp +++ b/benchmarks/cpp/nvfuser/layer_norm.cpp @@ -22,7 +22,6 @@ static void setupLayerNorm(Fusion* fusion, DataType dtype) { FusionGuard fg(fusion); - const int kReductionAxis = 1; const float kEps = 1e-5; Double* eps_ptr = IrBuilder::create(kEps); @@ -61,7 +60,6 @@ static void NvFuserScheduler_LayerNorm( std::vector input_shape{ benchmark_state.range(0), benchmark_state.range(1)}; - const float kEps = 1e-5; // inputs at::manual_seed(0); @@ -105,14 +103,14 @@ static void Baseline_LayerNorm( at::Tensor bias = at::randn({input_shape[1]}, options); clearL2Cache(); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { CudaKernelTimer timer; auto output = at::layer_norm(input, norm_shape, weight, bias); benchmark_state.SetIterationTime(timer.elapsed() / 1000.0); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); clearL2Cache(); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); } benchmark_state.SetBytesProcessed( diff --git a/benchmarks/cpp/nvfuser/layer_norm_backward.cpp b/benchmarks/cpp/nvfuser/layer_norm_backward.cpp index 9e6ac1c207d1d..c431622e7b9f4 100644 --- 
a/benchmarks/cpp/nvfuser/layer_norm_backward.cpp +++ b/benchmarks/cpp/nvfuser/layer_norm_backward.cpp @@ -22,9 +22,6 @@ static void setupLayerNorm_BWD(Fusion* fusion, DataType dtype) { TORCH_INTERNAL_ASSERT(dtype == DataType::Float || dtype == DataType::Half); - const int kReductionAxis = 1; - Double* eps_ptr = IrBuilder::create(1e-5); - // setup fusion auto grad_out = makeContigTensor(2, dtype); auto input = makeContigTensor(2, dtype); @@ -136,7 +133,7 @@ static void Baseline_LayerNorm_BWD( std::array output_mask = {true, true, true}; clearL2Cache(); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { CudaKernelTimer timer; at::native_layer_norm_backward( @@ -144,9 +141,9 @@ static void Baseline_LayerNorm_BWD( auto output = at::layer_norm(input, norm_shape, weight, bias); benchmark_state.SetIterationTime(timer.elapsed() / 1000.0); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); clearL2Cache(); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); } benchmark_state.SetBytesProcessed( diff --git a/benchmarks/cpp/nvfuser/lstm_cell.cpp b/benchmarks/cpp/nvfuser/lstm_cell.cpp index 20ec7c8f47003..58fc057bd85fb 100644 --- a/benchmarks/cpp/nvfuser/lstm_cell.cpp +++ b/benchmarks/cpp/nvfuser/lstm_cell.cpp @@ -170,11 +170,11 @@ static void LstmCell_RunFusion( FusionExecutor executor; executor.compileFusion(&fusion); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { outputs = executor.runFusion(c10::ArrayRef(inputs), lparams); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); } } diff --git a/benchmarks/cpp/nvfuser/matmul.cpp b/benchmarks/cpp/nvfuser/matmul.cpp new file mode 100644 index 0000000000000..25fc6cfe23569 --- /dev/null +++ b/benchmarks/cpp/nvfuser/matmul.cpp @@ -0,0 +1,357 @@ +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include + +#include + +using namespace torch::jit::fuser::cuda; + +bool cudaArchGuardShouldSkip(int required_major, int required_minor) { + int capability_major = at::cuda::getCurrentDeviceProperties()->major; + int capability_minor = at::cuda::getCurrentDeviceProperties()->minor; + + if (capability_major < required_major || + (capability_major == required_major && + capability_minor < required_minor)) { + return true; + } + return false; +} + +bool hasRequiredSmemSize(size_t required_size) { + // Only checking device 0 + return at::cuda::getDeviceProperties(0)->sharedMemPerBlockOptin >= + required_size; +} + +#define NVFUSER_BENCHMARK_ARCH_SMEM_GUARD( \ + REQUIRED_MAJOR, REQUIRED_MINOR, SMEM_SIZE, STATE) \ + if (cudaArchGuardShouldSkip(REQUIRED_MAJOR, REQUIRED_MINOR) || \ + !hasRequiredSmemSize(SMEM_SIZE)) { \ + STATE.SkipWithError("Unsupported arch or not enough smem!"); \ + return; \ + } + +// util to track support matmul operand layout. +using MatmulLayout = MmaOptions::MmaInputLayout; + +static constexpr std::array kAllSupportedLayout = { + MatmulLayout::TT, + MatmulLayout::NT, + MatmulLayout::TN}; + +// Generic interface to get matmul op with the given layout. 
+TensorView* matmul(TensorView* a, TensorView* b, MatmulLayout layout) { + TORCH_CHECK( + a->nDims() == 2 && b->nDims() == 2, "only pure matmuls for these tests"); + TensorView *tv2 = nullptr, *tv0b = nullptr, *tv1b = nullptr; + switch (layout) { + case MatmulLayout::TT: + tv0b = broadcast(a, {false, false, true}); + tv1b = broadcast(b, {true, false, false}); + tv2 = fusedMultiplySum(tv0b, tv1b, {1}); + break; + case MatmulLayout::TN: + tv0b = broadcast(a, {false, true, false}); + tv1b = broadcast(b, {true, false, false}); + tv2 = fusedMultiplySum(tv0b, tv1b, {2}); + break; + case MatmulLayout::NT: + tv0b = broadcast(a, {false, false, true}); + tv1b = broadcast(b, {false, true, false}); + tv2 = fusedMultiplySum(tv0b, tv1b, {0}); + break; + default: + TORCH_CHECK(false, "unsupported data layout."); + } + return tv2; +} + +// Utility to generate matmul input tensors based on given layout +at::Tensor atMatmul(at::Tensor a, at::Tensor b, MatmulLayout layout) { + switch (layout) { + case MatmulLayout::TT: + return a.matmul(b); + case MatmulLayout::TN: + return a.matmul(b.t()); + case MatmulLayout::NT: + return a.t().matmul(b); + default: + TORCH_CHECK(false, "unsupported data layout."); + } + return at::Tensor(); +} + +// Utility to generate reference results based on given layout +std::pair fp16MatmulAtInput( + int M, + int N, + int K, + MatmulLayout layout) { + auto options = at::TensorOptions().dtype(at::kHalf).device(at::kCUDA, 0); + + switch (layout) { + case MatmulLayout::TT: + return std::make_pair( + at::randn({M, K}, options), at::randn({K, N}, options)); + case MatmulLayout::TN: + return std::make_pair( + at::randn({M, K}, options), at::randn({N, K}, options)); + case MatmulLayout::NT: + return std::make_pair( + at::randn({K, M}, options), at::randn({K, N}, options)); + default: + TORCH_CHECK(false, "unsupported data layout."); + } + return std::make_pair(at::Tensor(), at::Tensor()); +} + +// TODO: separate compute and schedule definition once the can schedule +// logic and pattern matching is ready. +void setupMatmul(Fusion* fusion, MatmulLayout layout, MatmulParam params) { + // Only hgemm on the initial setup + auto a = makeContigTensor(2, DataType::Half); + auto b = makeContigTensor(2, DataType::Half); + + auto c = matmul(a, b, layout); + + fusion->addInput(a); + fusion->addInput(b); + fusion->addOutput(c); + + scheduleMatmul(c, a, b, params); +} + +static void SingleMatmulBase( + benchmark::State& benchmark_state, + MatmulLayout layout, + MatmulParam params) { + std::vector input_mnk{ + benchmark_state.range(0), + benchmark_state.range(1), + benchmark_state.range(2)}; + + auto fusion_ptr = std::make_unique(); + auto fusion = fusion_ptr.get(); + FusionGuard fg(fusion); + + // Define fusion graph + setupMatmul(fusion, layout, params); + + // inputs + at::manual_seed(0); + + // Tensor inputs + auto inputs = fp16MatmulAtInput( + input_mnk.at(0), input_mnk.at(1), input_mnk.at(2), layout); + + KernelArgumentHolder args = KernelArgumentHolder::createKernelArgumentHolder( + {inputs.first, inputs.second}); + + // Always use 32b indexing mode for now. 
+ TORCH_INTERNAL_ASSERT(args.getIndexMode() == KernelIndexMode::INT32); + + // Compile kernel + FusionExecutor fe; + fe.compileFusion(fusion, args, LaunchParams()); + + // Warm up run + auto outputs = fe.runFusion({inputs.first, inputs.second}); + fe.setMeasureKernelTimeFlag(true); + + // Sync everything up before we start + for (auto _ : benchmark_state) { + clearL2Cache(); + auto outputs = fe.runFusion({inputs.first, inputs.second}); + benchmark_state.SetIterationTime(fe.kernelTimeMs() / 1000.0); + } + // Sync everything up before we're finished, don't want to run ahead on the + // cpu while benchmarking. + cudaDeviceSynchronize(); + + // TODO: FLOPS calculation +} + +static void EagerModeMatmul( + benchmark::State& benchmark_state, + MatmulLayout layout) { + std::vector input_mnk{ + benchmark_state.range(0), + benchmark_state.range(1), + benchmark_state.range(2)}; + + at::manual_seed(0); + + auto inputs = fp16MatmulAtInput( + input_mnk.at(0), input_mnk.at(1), input_mnk.at(2), layout); + + // warm up run + auto outputs = atMatmul(inputs.first, inputs.second, layout); + + for (auto _ : benchmark_state) { + clearL2Cache(); + CudaKernelTimer timer; + outputs = atMatmul(inputs.first, inputs.second, layout); + benchmark_state.SetIterationTime(timer.elapsed() / 1000.0); + } + // Sync everything up before we're finished, don't want to run ahead on the + // cpu while benchmarking. + cudaDeviceSynchronize(); +} + +// Actual benchmarking +// ----------------------------------------------------------------- + +size_t getSmemSize(GemmTile cta_tile, int stage_number) { + return ((cta_tile.m * cta_tile.k) + (cta_tile.n * cta_tile.k)) * + dataTypeSize(DataType::Half) * stage_number; +} + +// TODO: this part eventually will be automated by heuristics +MatmulParam getMatmulParams( + GemmTile cta_tile, + int stage_number, + MatmulLayout layout) { + MatMulTileOptions gemm_tile; + gemm_tile.cta_tile = cta_tile; + // TODO: pipe through split K + gemm_tile.warp_tile = GemmTile(64, 64, cta_tile.k); + gemm_tile.instruction_tile = GemmTile(16, 16, 16); + + // Collect mma swizzle info + auto mma_builder = + MmaBuilder(MmaOptions::MacroType::Ampere_16_16_16, gemm_tile) + .layout(layout); + + MatmulParam params(mma_builder); + params.tile_sizes = gemm_tile; + params.async_gmem_load_operands = true; + params.double_buffer_options.double_buffer_smem_write = true; + params.double_buffer_options.double_buffer_smem_read = true; + params.double_buffer_options.smem_double_buffer_stage = stage_number; + + return params; +} + +static void Nvfuser_Matmul_4warp3stage( + benchmark::State& benchmark_state, + MatmulLayout layout) { + auto cta_tile = GemmTile(128, 128, 32); + int number_of_stage = 3; + + auto params = getMatmulParams(cta_tile, number_of_stage, layout); + + NVFUSER_BENCHMARK_ARCH_SMEM_GUARD( + 8, 0, getSmemSize(cta_tile, number_of_stage), benchmark_state); + + // Run benchmark: + SingleMatmulBase(benchmark_state, layout, params); +} + +static void Nvfuser_Matmul_8warp3stage( + benchmark::State& benchmark_state, + MatmulLayout layout) { + auto cta_tile = GemmTile(256, 128, 32); + int number_of_stage = 3; + + auto params = getMatmulParams(cta_tile, number_of_stage, layout); + + NVFUSER_BENCHMARK_ARCH_SMEM_GUARD( + 8, 0, getSmemSize(cta_tile, number_of_stage), benchmark_state); + + // Run benchmark: + SingleMatmulBase(benchmark_state, layout, params); +} + +static void Nvfuser_Matmul_4warp4stage( + benchmark::State& benchmark_state, + MatmulLayout layout) { + auto cta_tile = GemmTile(128, 128, 32); + int 
number_of_stage = 4; + + auto params = getMatmulParams(cta_tile, number_of_stage, layout); + + NVFUSER_BENCHMARK_ARCH_SMEM_GUARD( + 8, 0, getSmemSize(cta_tile, number_of_stage), benchmark_state); + + // Run benchmark: + SingleMatmulBase(benchmark_state, layout, params); +} + +static void Nvfuser_Matmul_8warp4stage( + benchmark::State& benchmark_state, + MatmulLayout layout) { + auto cta_tile = GemmTile(256, 128, 32); + int number_of_stage = 4; + + auto params = getMatmulParams(cta_tile, number_of_stage, layout); + + NVFUSER_BENCHMARK_ARCH_SMEM_GUARD( + 8, 0, getSmemSize(cta_tile, number_of_stage), benchmark_state); + + // Run benchmark: + SingleMatmulBase(benchmark_state, layout, params); +} + +// ----------------------------- Benchmark Instantiation------- + +// Common utils: +#define NO_TILE_QUANTIZATION_ARGS \ + ArgsProduct( \ + {{2048}, {3456}, benchmark::CreateDenseRange(512, 4096, /*step=*/512)}) \ + ->Unit(benchmark::kMicrosecond) \ + ->UseManualTime(); + +#define ForAllLayouts(run) \ + run(TT, MatmulLayout::TT); \ + run(TN, MatmulLayout::TN); \ + run(NT, MatmulLayout::NT) + +// Instantiations: +#define Nvfuser_4warp3stage_test(layout_label, layout) \ + BENCHMARK_CAPTURE( \ + Nvfuser_Matmul_4warp3stage, \ + no_quant_nvfuser_4warp_##layout_label, \ + layout) \ + ->NO_TILE_QUANTIZATION_ARGS + +#define Nvfuser_8warp3stage_test(layout_label, layout) \ + BENCHMARK_CAPTURE( \ + Nvfuser_Matmul_8warp3stage, \ + no_quant_nvfuser_8warp_##layout_label, \ + layout) \ + ->NO_TILE_QUANTIZATION_ARGS + +#define Nvfuser_4warp4stage_test(layout_label, layout) \ + BENCHMARK_CAPTURE( \ + Nvfuser_Matmul_4warp4stage, \ + no_quant_nvfuser_4warp_##layout_label, \ + layout) \ + ->NO_TILE_QUANTIZATION_ARGS + +#define Nvfuser_8warp4stage_test(layout_label, layout) \ + BENCHMARK_CAPTURE( \ + Nvfuser_Matmul_8warp4stage, \ + no_quant_nvfuser_8warp_##layout_label, \ + layout) \ + ->NO_TILE_QUANTIZATION_ARGS + +#define Eagermode_test(layout_label, layout) \ + BENCHMARK_CAPTURE( \ + EagerModeMatmul, no_quant_eagermode_##layout_label, layout) \ + ->NO_TILE_QUANTIZATION_ARGS + +ForAllLayouts(Nvfuser_4warp3stage_test); +ForAllLayouts(Nvfuser_4warp4stage_test); +ForAllLayouts(Nvfuser_8warp3stage_test); +ForAllLayouts(Nvfuser_8warp4stage_test); +ForAllLayouts(Eagermode_test); diff --git a/benchmarks/cpp/nvfuser/reduction.cpp b/benchmarks/cpp/nvfuser/reduction.cpp index d6fc0ca327ae7..c4aaaf8a60475 100644 --- a/benchmarks/cpp/nvfuser/reduction.cpp +++ b/benchmarks/cpp/nvfuser/reduction.cpp @@ -73,7 +73,7 @@ static void NvFuserScheduler_Reduction( fusion_executor_cache->profile(false); executor_instance->setMeasureKernelTimeFlag(true); // Sync everything up before we start - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { clearL2Cache(); auto cg_outputs = fusion_executor_cache->runFusionWithInputs({aten_input}); @@ -82,7 +82,7 @@ static void NvFuserScheduler_Reduction( } // Sync everything up before we're finished, don't want to run ahead on the // cpu while benchmarking. 
- cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); benchmark_state.SetBytesProcessed( int64_t(benchmark_state.iterations()) * @@ -105,14 +105,14 @@ static void Baseline_Reduction( // Sync everything up before we start clearL2Cache(); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { CudaKernelTimer timer; auto output = aten_input.sum({reduction_dim}); benchmark_state.SetIterationTime(timer.elapsed() / 1000.0); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); clearL2Cache(); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); } benchmark_state.SetBytesProcessed( diff --git a/benchmarks/cpp/nvfuser/rms_norm.cpp b/benchmarks/cpp/nvfuser/rms_norm.cpp index 81fdf46cf8189..37911ea6b1fd2 100644 --- a/benchmarks/cpp/nvfuser/rms_norm.cpp +++ b/benchmarks/cpp/nvfuser/rms_norm.cpp @@ -24,7 +24,6 @@ static void setupRMSNorm(Fusion* fusion, DataType dtype) { FusionGuard fg(fusion); - const int kReductionAxis = 2; const float kEps = 1e-6; Double* eps_ptr = IrBuilder::create(kEps); @@ -61,7 +60,6 @@ static void NvFuserScheduler_RMSNorm( dtype == DataType::BFloat16); std::vector input_shape{8, benchmark_state.range(0), 1024}; - const float kEps = 1e-6; // inputs at::manual_seed(0); diff --git a/benchmarks/cpp/nvfuser/rms_norm_backward.cpp b/benchmarks/cpp/nvfuser/rms_norm_backward.cpp index b4c6ac413c758..987c3bf234fa2 100644 --- a/benchmarks/cpp/nvfuser/rms_norm_backward.cpp +++ b/benchmarks/cpp/nvfuser/rms_norm_backward.cpp @@ -24,9 +24,6 @@ static void setupRMSNorm_BWD(Fusion* fusion, DataType dtype) { dtype == DataType::Float || dtype == DataType::Half || dtype == DataType::BFloat16); - const int kReductionAxis = 2; - Double* eps_ptr = IrBuilder::create(1e-6); - // setup fusion auto grad_out = makeContigTensor(3, dtype); auto input = makeContigTensor(3, dtype); diff --git a/benchmarks/cpp/nvfuser/scale_bias_relu.cpp b/benchmarks/cpp/nvfuser/scale_bias_relu.cpp index 74dbb5324cbab..158d3668c2792 100644 --- a/benchmarks/cpp/nvfuser/scale_bias_relu.cpp +++ b/benchmarks/cpp/nvfuser/scale_bias_relu.cpp @@ -144,7 +144,7 @@ static void NvFuserScheduler_SBR( fusion_executor_cache->profile(false); executor_instance->setMeasureKernelTimeFlag(true); // Sync everything up before we start - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { clearL2Cache(); auto cg_outputs = fusion_executor_cache->runFusionWithInputs(aten_inputs); @@ -153,7 +153,7 @@ static void NvFuserScheduler_SBR( } // Sync everything up before we're finished, don't want to run ahead on the // cpu while benchmarking. 
- cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); const size_t size = input_shape[0] * input_shape[1] * input_shape[2] * input_shape[3]; @@ -182,7 +182,7 @@ static void Baseline_SBR(benchmark::State& benchmark_state, DataType dtype) { at::Tensor at_bias = at::zeros(bcast_shape, options); clearL2Cache(); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { CudaKernelTimer timer; @@ -191,9 +191,9 @@ static void Baseline_SBR(benchmark::State& benchmark_state, DataType dtype) { auto output = at::relu(bias); benchmark_state.SetIterationTime(timer.elapsed() / 1000.0); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); clearL2Cache(); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); } const size_t size = @@ -245,7 +245,7 @@ static void NvFuserScheduler_SBR_Norm( fusion_executor_cache->profile(false); executor_instance->setMeasureKernelTimeFlag(true); // Sync everything up before we start - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { clearL2Cache(); auto cg_outputs = fusion_executor_cache->runFusionWithInputs(aten_inputs); @@ -255,7 +255,7 @@ static void NvFuserScheduler_SBR_Norm( // Sync everything up before we're finished, don't want to run ahead on the // cpu while benchmarking. - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); const size_t size = input_shape[0] * input_shape[1] * input_shape[2] * input_shape[3]; @@ -286,7 +286,7 @@ static void Baseline_SBR_Norm( at::Tensor at_mean = at::zeros(bcast_shape, options); at::Tensor at_var = at::ones(bcast_shape, options); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { CudaKernelTimer timer; @@ -298,7 +298,7 @@ static void Baseline_SBR_Norm( auto output = at::relu(bias); benchmark_state.SetIterationTime(timer.elapsed() / 1000.0); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); } const size_t size = diff --git a/benchmarks/cpp/nvfuser/shape_inference.cpp b/benchmarks/cpp/nvfuser/shape_inference.cpp index 2e5e23ed7442e..fd628a163abce 100644 --- a/benchmarks/cpp/nvfuser/shape_inference.cpp +++ b/benchmarks/cpp/nvfuser/shape_inference.cpp @@ -100,8 +100,11 @@ void LayerNormBackward_ShapeInference_Base( auto runtime = getLayerBackwardNormRuntime( std::move(fusion_ptr), fec, aten_inputs, shape, norm_shape); + + KernelArgumentHolder args = KernelArgumentHolder::createKernelArgumentHolder(aten_inputs); + TORCH_INTERNAL_ASSERT( - runtime->getMaybeHeuristicsFor(aten_inputs).has_value()); + runtime->getMaybeHeuristicsFor(args).has_value()); fec->profile(true); fec->disableKernelLaunch(); @@ -172,8 +175,10 @@ void LayerNormForward_ShapeInferenceBase( auto runtime = getLayerForwardNormRuntime( std::move(fusion_ptr), fec, aten_inputs, shape, norm_shape); + KernelArgumentHolder args = KernelArgumentHolder::createKernelArgumentHolder(aten_inputs); + TORCH_INTERNAL_ASSERT( - runtime->getMaybeHeuristicsFor(aten_inputs).has_value()); + runtime->getMaybeHeuristicsFor(args).has_value()); fec->profile(true); fec->disableKernelLaunch(); diff --git a/benchmarks/cpp/nvfuser/softmax.cpp b/benchmarks/cpp/nvfuser/softmax.cpp index 439e426220f87..350ccb301638f 100644 --- a/benchmarks/cpp/nvfuser/softmax.cpp +++ b/benchmarks/cpp/nvfuser/softmax.cpp @@ -107,7 +107,7 @@ static void Softmax_WarpReduceReference(benchmark::State& benchmark_state) { } // Sync everything up before we're finished, don't want to run ahead on the 
// cpu while benchmarking. - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); benchmark_state.SetBytesProcessed( int64_t(benchmark_state.iterations()) * @@ -162,7 +162,7 @@ static void Softmax_WarpReduce(benchmark::State& benchmark_state) { } // Sync everything up before we're finished, don't want to run ahead on the // cpu while benchmarking. - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); benchmark_state.SetBytesProcessed( int64_t(benchmark_state.iterations()) * @@ -206,7 +206,7 @@ static void Baseline_Softmax( } // Sync everything up before we're finished, don't want to run ahead on the // cpu while benchmarking. - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); benchmark_state.SetBytesProcessed( int64_t(benchmark_state.iterations()) * diff --git a/benchmarks/cpp/nvfuser/softmax_backward.cpp b/benchmarks/cpp/nvfuser/softmax_backward.cpp index 8fb35083c6dc7..51696ede90cec 100644 --- a/benchmarks/cpp/nvfuser/softmax_backward.cpp +++ b/benchmarks/cpp/nvfuser/softmax_backward.cpp @@ -116,7 +116,7 @@ static void Baseline_Softmax_BWD( } // Sync everything up before we're finished, don't want to run ahead on the // cpu while benchmarking. - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); benchmark_state.SetBytesProcessed( int64_t(benchmark_state.iterations()) * @@ -177,13 +177,13 @@ NVFUSER_BENCHMARK_RUN(NvFuserScheduler_Softmax_BWD_Outer_fp32) NVFUSER_BENCHMARK_RUN(NvFuserScheduler_Softmax_BWD_Outer_fp32) // ->RangeMultiplier(2) - ->Ranges({{32768, 32 * 1024 * 1024}, {2, 16}}) + ->Ranges({{32768, 16 * 1024 * 1024}, {2, 16}}) ->Unit(benchmark::kMicrosecond) ->UseManualTime(); NVFUSER_BENCHMARK_RUN(NvFuserScheduler_Softmax_BWD_Outer_fp32) // ->RangeMultiplier(2) - ->Ranges({{2, 16}, {32768, 32 * 1024 * 1024}}) + ->Ranges({{2, 16}, {32768, 16 * 1024 * 1024}}) ->Unit(benchmark::kMicrosecond) ->UseManualTime(); @@ -201,13 +201,13 @@ NVFUSER_BENCHMARK_RUN(NvFuserScheduler_Softmax_BWD_Outer_fp16) NVFUSER_BENCHMARK_RUN(NvFuserScheduler_Softmax_BWD_Outer_fp16) // ->RangeMultiplier(2) - ->Ranges({{32768, 32 * 1024 * 1024}, {2, 16}}) + ->Ranges({{32768, 16 * 1024 * 1024}, {2, 16}}) ->Unit(benchmark::kMicrosecond) ->UseManualTime(); NVFUSER_BENCHMARK_RUN(NvFuserScheduler_Softmax_BWD_Outer_fp16) // ->RangeMultiplier(2) - ->Ranges({{2, 16}, {32768, 32 * 1024 * 1024}}) + ->Ranges({{2, 16}, {32768, 16 * 1024 * 1024}}) ->Unit(benchmark::kMicrosecond) ->UseManualTime(); @@ -225,13 +225,13 @@ NVFUSER_BENCHMARK_RUN(NvFuserScheduler_Softmax_BWD_Inner_fp32) NVFUSER_BENCHMARK_RUN(NvFuserScheduler_Softmax_BWD_Inner_fp32) // ->RangeMultiplier(2) - ->Ranges({{32768, 32 * 1024 * 1024}, {2, 16}}) + ->Ranges({{32768, 16 * 1024 * 1024}, {2, 16}}) ->Unit(benchmark::kMicrosecond) ->UseManualTime(); NVFUSER_BENCHMARK_RUN(NvFuserScheduler_Softmax_BWD_Inner_fp32) // ->RangeMultiplier(2) - ->Ranges({{2, 16}, {32768, 32 * 1024 * 1024}}) + ->Ranges({{2, 16}, {32768, 16 * 1024 * 1024}}) ->Unit(benchmark::kMicrosecond) ->UseManualTime(); @@ -249,13 +249,13 @@ NVFUSER_BENCHMARK_RUN(NvFuserScheduler_Softmax_BWD_Inner_fp16) NVFUSER_BENCHMARK_RUN(NvFuserScheduler_Softmax_BWD_Inner_fp16) // ->RangeMultiplier(2) - ->Ranges({{32768, 32 * 1024 * 1024}, {2, 16}}) + ->Ranges({{32768, 16 * 1024 * 1024}, {2, 16}}) ->Unit(benchmark::kMicrosecond) ->UseManualTime(); NVFUSER_BENCHMARK_RUN(NvFuserScheduler_Softmax_BWD_Inner_fp16) // ->RangeMultiplier(2) - ->Ranges({{2, 16}, {32768, 32 * 1024 * 1024}}) + ->Ranges({{2, 16}, {32768, 16 * 1024 * 1024}}) 
->Unit(benchmark::kMicrosecond) ->UseManualTime(); @@ -275,13 +275,13 @@ BENCHMARK(Baseline_Softmax_BWD_Outer_fp32) BENCHMARK(Baseline_Softmax_BWD_Outer_fp32) // ->RangeMultiplier(2) - ->Ranges({{32768, 32 * 1024 * 1024}, {2, 16}}) + ->Ranges({{32768, 16 * 1024 * 1024}, {2, 16}}) ->Unit(benchmark::kMicrosecond) ->UseManualTime(); BENCHMARK(Baseline_Softmax_BWD_Outer_fp32) // ->RangeMultiplier(2) - ->Ranges({{2, 16}, {32768, 32 * 1024 * 1024}}) + ->Ranges({{2, 16}, {32768, 16 * 1024 * 1024}}) ->Unit(benchmark::kMicrosecond) ->UseManualTime(); @@ -299,13 +299,13 @@ BENCHMARK(Baseline_Softmax_BWD_Outer_fp16) BENCHMARK(Baseline_Softmax_BWD_Outer_fp16) // ->RangeMultiplier(2) - ->Ranges({{32768, 32 * 1024 * 1024}, {2, 16}}) + ->Ranges({{32768, 16 * 1024 * 1024}, {2, 16}}) ->Unit(benchmark::kMicrosecond) ->UseManualTime(); BENCHMARK(Baseline_Softmax_BWD_Outer_fp16) // ->RangeMultiplier(2) - ->Ranges({{2, 16}, {32768, 32 * 1024 * 1024}}) + ->Ranges({{2, 16}, {32768, 16 * 1024 * 1024}}) ->Unit(benchmark::kMicrosecond) ->UseManualTime(); @@ -323,13 +323,13 @@ BENCHMARK(Baseline_Softmax_BWD_Inner_fp32) BENCHMARK(Baseline_Softmax_BWD_Inner_fp32) // ->RangeMultiplier(2) - ->Ranges({{32768, 32 * 1024 * 1024}, {2, 16}}) + ->Ranges({{32768, 16 * 1024 * 1024}, {2, 16}}) ->Unit(benchmark::kMicrosecond) ->UseManualTime(); BENCHMARK(Baseline_Softmax_BWD_Inner_fp32) // ->RangeMultiplier(2) - ->Ranges({{2, 16}, {32768, 32 * 1024 * 1024}}) + ->Ranges({{2, 16}, {32768, 16 * 1024 * 1024}}) ->Unit(benchmark::kMicrosecond) ->UseManualTime(); @@ -347,13 +347,13 @@ BENCHMARK(Baseline_Softmax_BWD_Inner_fp16) BENCHMARK(Baseline_Softmax_BWD_Inner_fp16) // ->RangeMultiplier(2) - ->Ranges({{32768, 32 * 1024 * 1024}, {2, 16}}) + ->Ranges({{32768, 16 * 1024 * 1024}, {2, 16}}) ->Unit(benchmark::kMicrosecond) ->UseManualTime(); BENCHMARK(Baseline_Softmax_BWD_Inner_fp16) // ->RangeMultiplier(2) - ->Ranges({{2, 16}, {32768, 32 * 1024 * 1024}}) + ->Ranges({{2, 16}, {32768, 16 * 1024 * 1024}}) ->Unit(benchmark::kMicrosecond) ->UseManualTime(); diff --git a/benchmarks/cpp/nvfuser/softmax_dropout.cpp b/benchmarks/cpp/nvfuser/softmax_dropout.cpp index 48950373731c1..383d1d4bb9f4d 100644 --- a/benchmarks/cpp/nvfuser/softmax_dropout.cpp +++ b/benchmarks/cpp/nvfuser/softmax_dropout.cpp @@ -127,7 +127,7 @@ static void Baseline_Softmax_Dropout( at::Tensor attention_scores = at::randn(input_shape, options); at::Tensor at_y = at::randn(input_shape, options); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { clearL2Cache(); @@ -144,7 +144,7 @@ static void Baseline_Softmax_Dropout( } // Sync everything up before we're finished, don't want to run ahead on the // cpu while benchmarking. 
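Editor's note: these `Ranges` edits trim the largest swept size from `32 * 1024 * 1024` to `16 * 1024 * 1024` elements. For context, a minimal, self-contained Google Benchmark registration in the same style is sketched below; `BM_Placeholder` and its timed body are placeholders, not the nvFuser softmax kernels, but the `RangeMultiplier`/`Ranges`/`UseManualTime` plumbing mirrors the registrations above (each `{lo, hi}` pair is swept by powers of the multiplier and the cartesian product becomes the argument sets):

```
#include <benchmark/benchmark.h>
#include <chrono>
#include <vector>

static void BM_Placeholder(benchmark::State& state) {
  const int64_t reduction = state.range(0); // first swept dimension
  const int64_t batches = state.range(1);   // second swept dimension
  std::vector<float> buf(static_cast<size_t>(reduction));
  for (auto _ : state) {
    auto start = std::chrono::high_resolution_clock::now();
    for (int64_t b = 0; b < batches; ++b) {
      benchmark::DoNotOptimize(buf.data()); // placeholder work
    }
    auto end = std::chrono::high_resolution_clock::now();
    // UseManualTime() requires reporting each iteration's time in seconds.
    state.SetIterationTime(std::chrono::duration<double>(end - start).count());
  }
}

BENCHMARK(BM_Placeholder)
    ->RangeMultiplier(2)
    // Same shape as the patched registrations: upper bound of 16 * 1024 * 1024.
    ->Ranges({{32768, 16 * 1024 * 1024}, {2, 16}})
    ->Unit(benchmark::kMicrosecond)
    ->UseManualTime();

BENCHMARK_MAIN();
```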
- cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); // 5 dtype: attention_scores + attention_mask + attention_scores_out + // attention_probs_out + output diff --git a/benchmarks/cpp/nvfuser/timm.cpp b/benchmarks/cpp/nvfuser/timm.cpp index 013b609be6020..4669ff0ecabf6 100644 --- a/benchmarks/cpp/nvfuser/timm.cpp +++ b/benchmarks/cpp/nvfuser/timm.cpp @@ -115,7 +115,7 @@ static void setup_vit_base_patch16_224_bcast5(Fusion* fusion, void* null) { auto t6 = set(t5); auto t7 = broadcast(t6, bcast_pattern0); auto t8 = add(t4, t7); - auto t9 = randlike(t8); + auto t9 = rand_like(t8); auto d34 = sub(IrBuilder::create(1.0), IrBuilder::create(0.0)); auto t10 = lt(t9, d34); @@ -139,7 +139,6 @@ static void setup_vit_base_patch16_224_bcast5(Fusion* fusion, void* null) { auto t20 = sum(t37, {2}); auto t24 = broadcast(t20, bcast_pattern1); auto d95 = castOp(DataType::Double, t2->axis(2)->extent()); - auto d96 = mul(IrBuilder::create(1.0), d95); auto d105 = reciprocal(d95); auto t25 = mul(t24, d105); auto t26 = add(t25, IrBuilder::create(1e-6)); @@ -289,7 +288,7 @@ static void setup_vit_base_patch16_224_norm_inner3(Fusion* fusion, void* null) { auto t10 = broadcast(t9, {false, false, false, true}); auto t11 = reciprocal(t10); auto t12 = mul(t8, t11); - auto t13 = randlike(t12); + auto t13 = rand_like(t12); auto d79 = sub(IrBuilder::create(1), IrBuilder::create(0)); auto t14 = lt(t13, d79); auto t15 = castOp(DataType::Float, t14); @@ -320,8 +319,6 @@ static void NvFuserScheduler_TIMM_vit_base_patch16_224_norm_inner3( at::manual_seed(0); auto fp16_options = at::TensorOptions().dtype(at::kHalf).device(at::kCUDA, 0); - auto fp32_options = - at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0); auto t0 = at::randn(input_shape, fp16_options); @@ -367,7 +364,7 @@ static void setup_vit_base_patch16_224_bcast_outer6( auto t9 = add(IrBuilder::create(1), t8); auto t10 = mul(IrBuilder::create(0.5), t9); auto t11 = mul(t6, t10); - auto t12 = randlike(t11); + auto t12 = rand_like(t11); auto d66 = sub(IrBuilder::create(1), IrBuilder::create(0)); auto t13 = lt(t12, d66); auto t14 = castOp(DataType::Float, t13); @@ -456,7 +453,7 @@ static void setup_vit_base_patch16_224_bcast_inner6( auto t9 = add(IrBuilder::create(1), t8); auto t10 = mul(IrBuilder::create(0.5), t9); auto t11 = mul(t6, t10); - auto t12 = randlike(t11); + auto t12 = rand_like(t11); auto d66 = sub(IrBuilder::create(1), IrBuilder::create(0)); auto t13 = lt(t12, d66); auto t14 = castOp(DataType::Float, t13); diff --git a/benchmarks/cpp/nvfuser/utils.cpp b/benchmarks/cpp/nvfuser/utils.cpp index 3915f7d652989..0a13c57d10e19 100644 --- a/benchmarks/cpp/nvfuser/utils.cpp +++ b/benchmarks/cpp/nvfuser/utils.cpp @@ -6,7 +6,7 @@ using namespace torch::jit::fuser::cuda; -std::string toString(ReductionParams rparams) { +std::string toString(const ReductionParams& rparams) { std::stringstream ss; ss << (rparams.fastest_dim ? "Red On Fastest Dim // " : "Red On Slow Dim // ") << (rparams.persistent_kernel ? 
"Persistent Kernel // " : "") @@ -65,7 +65,7 @@ std::string toString(ReductionParams rparams) { return ss.str(); } -std::string toString(PointwiseParams params) { +std::string toString(const PointwiseParams& params) { std::stringstream ss; if (params.break_point) { ss << "2D Schedule at " << params.break_point << "/"; @@ -89,6 +89,15 @@ std::string toString(PointwiseParams params) { return ss.str(); } +std::string toString(const TransposeParams& params) { + std::stringstream ss; + ss << "Tile size: (" << params.tile_size1 << "," << params.tile_size2 + << ")/"; + ss << "Vectorize size: (" << params.vectorize_factor1 << "," + << params.vectorize_factor2 << ")"; + return ss.str(); +} + std::string toString(const std::shared_ptr& params) { auto rparams = std::dynamic_pointer_cast(params); if (rparams) { @@ -98,6 +107,10 @@ std::string toString(const std::shared_ptr& params) { if (pparams) { return toString(*pparams); } + auto tparams = std::dynamic_pointer_cast(params); + if (tparams) { + return toString(*tparams); + } TORCH_INTERNAL_ASSERT( false, "Unknown heuristic parameter type. Did you just added a new heuristic parameter type but forget to update here?"); @@ -176,7 +189,7 @@ void runBenchmarkIterations( executor_instance->setMeasureKernelTimeFlag(true); // Sync everything up before we start - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { clearL2Cache(); auto cg_outputs = fusion_executor_cache->runFusionWithInputs(aten_inputs); @@ -185,7 +198,7 @@ void runBenchmarkIterations( } // Sync everything up before we're finished, don't want to run ahead on the // cpu while benchmarking. - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); } else { // Segmented // Sync everything up before we start @@ -193,7 +206,7 @@ void runBenchmarkIterations( // Compile/warmup auto cg_outputs = fusion_executor_cache->runFusionWithInputs(aten_inputs); } - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); CudaKernelTimer timer; for (auto _ : benchmark_state) { clearL2Cache(); @@ -203,7 +216,7 @@ void runBenchmarkIterations( } // Sync everything up before we're finished, don't want to run ahead on the // cpu while benchmarking. 
- cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); } } diff --git a/benchmarks/cpp/nvfuser/utils.h b/benchmarks/cpp/nvfuser/utils.h index e24fdfb127dab..7bfc4aefd45c2 100644 --- a/benchmarks/cpp/nvfuser/utils.h +++ b/benchmarks/cpp/nvfuser/utils.h @@ -36,8 +36,9 @@ TensorView* makeContigConcreteTensor( std::vector shape, DataType dtype = DataType::Float); -std::string toString(ReductionParams rparams); -std::string toString(PointwiseParams params); +std::string toString(const ReductionParams& rparams); +std::string toString(const PointwiseParams& params); +std::string toString(const TransposeParams& params); std::string toString(const std::shared_ptr& params); std::string toString(LaunchParams lparams); @@ -55,26 +56,27 @@ class CudaKernelTimer { public: CudaKernelTimer() { // Setup - cudaEventCreate(&start_event); - cudaEventCreate(&finish_event); - cudaEventRecord(start_event); + C10_CUDA_CHECK(cudaEventCreate(&start_event)); + C10_CUDA_CHECK(cudaEventCreate(&finish_event)); + C10_CUDA_CHECK(cudaEventRecord(start_event)); } ~CudaKernelTimer() { - cudaEventDestroy(start_event); - cudaEventDestroy(finish_event); + C10_CUDA_IGNORE_ERROR(cudaEventDestroy(start_event)); + C10_CUDA_IGNORE_ERROR(cudaEventDestroy(finish_event)); } void restart() { - cudaEventRecord(start_event); + C10_CUDA_CHECK(cudaEventRecord(start_event)); } float elapsed() { // Record - cudaEventRecord(finish_event); - cudaEventSynchronize(start_event); - cudaEventSynchronize(finish_event); - cudaEventElapsedTime(&kernel_time_ms_, start_event, finish_event); + C10_CUDA_CHECK(cudaEventRecord(finish_event)); + C10_CUDA_CHECK(cudaEventSynchronize(start_event)); + C10_CUDA_CHECK(cudaEventSynchronize(finish_event)); + C10_CUDA_CHECK( + cudaEventElapsedTime(&kernel_time_ms_, start_event, finish_event)); return kernel_time_ms_; } diff --git a/benchmarks/distributed/ddp/benchmark.py b/benchmarks/distributed/ddp/benchmark.py index 2c742d0fc9d8f..a905ad60f5309 100644 --- a/benchmarks/distributed/ddp/benchmark.py +++ b/benchmarks/distributed/ddp/benchmark.py @@ -87,7 +87,7 @@ def run_benchmark(benchmark, ranks, opts): measurements = [] if dist.get_rank() in set(ranks): if not opts: - opts = dict() + opts = {} measurements = benchmark_process_group(group, benchmark, **opts) dist.destroy_process_group(group) dist.barrier() diff --git a/benchmarks/operator_benchmark/pt/qactivation_test.py b/benchmarks/operator_benchmark/pt/qactivation_test.py index 5baf4cca3c3b4..f57ff8d1f16c3 100644 --- a/benchmarks/operator_benchmark/pt/qactivation_test.py +++ b/benchmarks/operator_benchmark/pt/qactivation_test.py @@ -1,5 +1,5 @@ import torch -import torch.nn.quantized as nnq +import torch.ao.nn.quantized.functional as qF import operator_benchmark as op_bench @@ -44,9 +44,9 @@ attrs=( ('relu', torch.nn.ReLU()), ('relu6', torch.ops.quantized.relu6), - ('functional.hardtanh', nnq.functional.hardtanh), - ('functional.hardsigmoid', nnq.functional.hardsigmoid), - ('functional.leaky_relu', nnq.functional.leaky_relu), + ('functional.hardtanh', qF.hardtanh), + ('functional.hardsigmoid', qF.hardsigmoid), + ('functional.leaky_relu', qF.leaky_relu), ('functional.sigmoid', torch.nn.functional.sigmoid), ('functional.tanh', torch.nn.functional.tanh), ), @@ -92,9 +92,9 @@ def forward(self, q_input): qactivation_scale_zero_point_ops = op_bench.op_list( attrs=( - ('functional.hardswish', nnq.functional.hardswish), - ('functional.elu', nnq.functional.elu), - ('functional.celu', nnq.functional.celu), + ('functional.hardswish', qF.hardswish), + 
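Editor's note: `CudaKernelTimer` now checks every event call and deliberately ignores failures in the destructor. The stripped-down sketch below walks the same cudaEvent sequence (create, record, synchronize, elapsed, destroy) with a local `checkCuda` helper standing in for `C10_CUDA_CHECK`; it is an illustration, not the benchmark harness itself:

```
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Local stand-in for C10_CUDA_CHECK: fail loudly on any runtime error.
static void checkCuda(cudaError_t err) {
  if (err != cudaSuccess) {
    std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    std::abort();
  }
}

int main() {
  cudaEvent_t start, finish;
  checkCuda(cudaEventCreate(&start));
  checkCuda(cudaEventCreate(&finish));

  checkCuda(cudaEventRecord(start));
  // ... the kernel launches being timed would go here ...
  checkCuda(cudaEventRecord(finish));

  // Both events must have completed before cudaEventElapsedTime is meaningful.
  checkCuda(cudaEventSynchronize(finish));
  float ms = 0.0f;
  checkCuda(cudaEventElapsedTime(&ms, start, finish));
  std::printf("kernel time: %f ms\n", ms);

  // Destruction failures are ignored in the patch (C10_CUDA_IGNORE_ERROR);
  // here we simply discard the return values.
  (void)cudaEventDestroy(start);
  (void)cudaEventDestroy(finish);
  return 0;
}
```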
('functional.elu', qF.elu), + ('functional.celu', qF.celu), ), attr_names=('op_name', 'op_func'), ) diff --git a/benchmarks/operator_benchmark/pt/qarithmetic_test.py b/benchmarks/operator_benchmark/pt/qarithmetic_test.py index b1103a8a25315..97766bdb4c194 100644 --- a/benchmarks/operator_benchmark/pt/qarithmetic_test.py +++ b/benchmarks/operator_benchmark/pt/qarithmetic_test.py @@ -29,7 +29,7 @@ class _QFunctionalBinaryArithmeticBenchmarkBase(op_bench.TorchBenchmarkBase): def setup(self, N, dtype, contig): - self.qfunctional = torch.nn.quantized.QFunctional() + self.qfunctional = torch.ao.nn.quantized.QFunctional() # TODO: Consider more diverse shapes f_input = (torch.rand(N, N) - 0.5) * 256 diff --git a/benchmarks/operator_benchmark/pt/qatembedding_ops_test.py b/benchmarks/operator_benchmark/pt/qatembedding_ops_test.py index 2dcfdd4de4399..97ce0357557e3 100644 --- a/benchmarks/operator_benchmark/pt/qatembedding_ops_test.py +++ b/benchmarks/operator_benchmark/pt/qatembedding_ops_test.py @@ -1,6 +1,6 @@ import operator_benchmark as op_bench import torch -import torch.nn.qat as nnqat +import torch.ao.nn.qat as nnqat import numpy from pt import configs from torch.ao.quantization import default_embedding_qat_qconfig diff --git a/benchmarks/operator_benchmark/pt/qcat_test.py b/benchmarks/operator_benchmark/pt/qcat_test.py index 32dd32e43adfe..2ff0b87a9d380 100644 --- a/benchmarks/operator_benchmark/pt/qcat_test.py +++ b/benchmarks/operator_benchmark/pt/qcat_test.py @@ -1,7 +1,7 @@ import operator_benchmark as op_bench import torch -import torch.nn.quantized as nnq +import torch.ao.nn.quantized as nnq from typing import List diff --git a/benchmarks/operator_benchmark/pt/qconv_test.py b/benchmarks/operator_benchmark/pt/qconv_test.py index 14e8e143a7ca8..c48759d330e78 100644 --- a/benchmarks/operator_benchmark/pt/qconv_test.py +++ b/benchmarks/operator_benchmark/pt/qconv_test.py @@ -1,7 +1,7 @@ import operator_benchmark as op_bench import torch -import torch.nn.quantized as nnq +import torch.ao.nn.quantized as nnq from pt import configs diff --git a/benchmarks/operator_benchmark/pt/qembeddingbag_test.py b/benchmarks/operator_benchmark/pt/qembeddingbag_test.py index 872f8c28fccd4..5a406631b5ed8 100644 --- a/benchmarks/operator_benchmark/pt/qembeddingbag_test.py +++ b/benchmarks/operator_benchmark/pt/qembeddingbag_test.py @@ -1,7 +1,7 @@ import operator_benchmark as op_bench import torch -import torch.nn.quantized as nnq +import torch.ao.nn.quantized as nnq import numpy from pt import configs diff --git a/benchmarks/operator_benchmark/pt/qlinear_test.py b/benchmarks/operator_benchmark/pt/qlinear_test.py index 6e4dd9d97eca5..c4f8f36c11d3b 100644 --- a/benchmarks/operator_benchmark/pt/qlinear_test.py +++ b/benchmarks/operator_benchmark/pt/qlinear_test.py @@ -2,8 +2,8 @@ import operator_benchmark as op_bench import torch -import torch.nn.quantized as nnq -import torch.nn.quantized.dynamic as nnqd +import torch.ao.nn.quantized as nnq +import torch.ao.nn.quantized.dynamic as nnqd from pt import configs diff --git a/benchmarks/operator_benchmark/pt/quantization_test.py b/benchmarks/operator_benchmark/pt/quantization_test.py index 8ffbdd20e4429..332ff52c21d6e 100644 --- a/benchmarks/operator_benchmark/pt/quantization_test.py +++ b/benchmarks/operator_benchmark/pt/quantization_test.py @@ -1,7 +1,7 @@ import operator_benchmark as op_bench import torch -import torch.nn.quantized as nnq +import torch.ao.nn.quantized as nnq import torch.ao.quantization as tq import torch.nn as nn diff --git 
a/benchmarks/static_runtime/test_static_runtime.cc b/benchmarks/static_runtime/test_static_runtime.cc index 72ee217401ab0..ee37ddeaf71ac 100644 --- a/benchmarks/static_runtime/test_static_runtime.cc +++ b/benchmarks/static_runtime/test_static_runtime.cc @@ -2164,7 +2164,12 @@ TEST(StaticRuntime, Permute) { c10::List dims_b{0, 2, 1}; std::vector args_b{b, dims_b}; + auto c = at::randn({3, 3, 3}); + c10::List dims_c{0, -1, 1}; + std::vector args_c{c, dims_c}; + testStaticRuntime(permute_script, args_a); + testStaticRuntime(permute_script, args_c); testStaticRuntime(permute_script, args_a, args_b); permute_script = R"JIT( @@ -2590,23 +2595,28 @@ TEST(StaticRuntime, JIT_Aten_Numel) { } TEST(StaticRuntime, JIT_Aten_List) { - const std::string script = R"IR( + const auto script_str = R"IR( graph(%a: str): - %1 : int = prim::Constant[value=0]() %ret: str[] = aten::list(%a) return (%ret) )IR"; - - auto graph = std::make_shared(); - std::unordered_map vmap; - vmap.reserve(0); - parseIR(script, graph.get(), vmap); - torch::jit::StaticModule smodule(graph); - - string a = "abcd"; + std::string a = "abcd"; std::vector args0{a}; + testStaticRuntime(script_str, args0); + + // Update the result of aten::list to ensure that a deep copy + // took place + const auto script_list = R"IR( + graph(%a : int[]): + %idx : int = prim::Constant[value=0]() + %value : int = prim::Constant[value=42]() + %res : int[] = aten::list(%a) + %updated : int[] = aten::_set_item(%res, %idx, %value) + return (%res, %a) + )IR"; - testStaticRuntime(script, args0); + std::vector args1{c10::List{1, 2, 3}}; + testStaticRuntime(script_list, args1); } TEST(StaticRuntime, JIT_Aten_Range_Length) { diff --git a/buckbuild.bzl b/buckbuild.bzl index 40f542e3f80df..ae1519ea8f5ee 100644 --- a/buckbuild.bzl +++ b/buckbuild.bzl @@ -125,8 +125,8 @@ THIRD_PARTY_LIBS = { "XNNPACK": ["//xplat/third-party/XNNPACK:XNNPACK", "//third_party:XNNPACK"], "clog": ["//xplat/third-party/clog:clog", "//third_party:clog"], "cpuinfo": ["//third-party/cpuinfo:cpuinfo", "//third_party:cpuinfo"], - "flatbuffers-api": ["//third-party/flatbuffers:flatbuffers-api", "//third_party:flatbuffers-api"], - "flatc": ["//third-party/flatbuffers:flatc", "//third_party:flatc"], + "flatbuffers-api": ["//third-party/flatbuffers/fbsource_namespace:flatbuffers-api", "//third_party:flatbuffers-api"], + "flatc": ["//third-party/flatbuffers/fbsource_namespace:flatc", "//third_party:flatc"], "fmt": ["//third-party/fmt:fmt", "//third_party:fmt"], "glog": ["//third-party/glog:glog", "//third_party:glog"], "gmock": ["//xplat/third-party/gmock:gtest", "//third_party:gmock"], @@ -739,7 +739,7 @@ def get_pt_operator_registry_dict( third_party("glog"), C10, ] + ([ROOT + ":torch_mobile_train"] if train else []) + - ([ROOT + ":torch_flatbuffer_all"] if enable_flatbuffer else []), + ([ROOT + ":flatbuffers_mobile"] if enable_flatbuffer else []), **kwargs ) @@ -1347,7 +1347,7 @@ def define_buck_targets( exported_preprocessor_flags = get_pt_preprocessor_flags(), visibility = ["PUBLIC"], exported_deps = [ - ":torch_flatbuffer_all", + ":flatbuffers_mobile", ":torch_mobile_core", ], ) @@ -1497,8 +1497,6 @@ def define_buck_targets( # "torch/csrc/jit/mobile/compatibility/runtime_compatibility.cpp", # "torch/csrc/jit/serialization/unpickler.cpp", "torch/csrc/jit/mobile/compatibility/model_compatibility.cpp", - "torch/csrc/jit/serialization/pickle.cpp", - "torch/csrc/jit/serialization/pickler.cpp", ], header_namespace = "", exported_headers = [ @@ -1635,7 +1633,6 @@ def define_buck_targets( compiler_flags 
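Editor's note: the added `Permute` case feeds `aten::permute` a negative dimension (`{0, -1, 1}`), which wraps from the end. As a reminder of those semantics, a tiny libtorch snippet is sketched below; it assumes a working libtorch build and is not part of the static-runtime test itself:

```
#include <torch/torch.h>
#include <iostream>

int main() {
  // Negative dims wrap: for a rank-3 tensor, -1 refers to dim 2,
  // so {0, -1, 1} is equivalent to {0, 2, 1}.
  auto c = torch::randn({3, 4, 5});
  auto p = c.permute({0, -1, 1});
  std::cout << p.sizes() << std::endl; // [3, 5, 4]
  return 0;
}
```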
= get_pt_compiler_flags() + ["-Wno-error"], exported_preprocessor_flags = get_pt_preprocessor_flags() + [ "-DUSE_KINETO", - "-DUSE_KINETO_UPDATED", # Need this otherwise USE_KINETO is undefed # for mobile "-DEDGE_PROFILER_USE_KINETO", @@ -1662,7 +1659,6 @@ def define_buck_targets( compiler_flags = get_pt_compiler_flags() + ["-Wno-error"], exported_preprocessor_flags = get_pt_preprocessor_flags() + [ "-DUSE_KINETO", - "-DUSE_KINETO_UPDATED", "-DEDGE_PROFILER_USE_KINETO", ], # @lint-ignore BUCKLINT link_whole @@ -1689,21 +1685,29 @@ def define_buck_targets( cmd = "$(exe {})".format(third_party("flatc")) + " --cpp --gen-mutable --scoped-enums -o ${OUT} ${SRCS}", default_outs = ["."], + visibility = [ + "{}:mobile_bytecode".format(ROOT), + ], ) + # Users of this target will need to add third_party("flatbuffers-api") as a + # dep. fb_xplat_cxx_library( name = "mobile_bytecode", header_namespace = "", exported_headers = { "torch/csrc/jit/serialization/mobile_bytecode_generated.h": ":mobile_bytecode_header[mobile_bytecode_generated.h]", }, - exported_deps = [ - third_party("flatbuffers-api"), + # Avoid leaking implementation details by only exposing this header to + # the internals of the loader/serializer layer. + visibility = [ + "{}:flatbuffer_loader".format(ROOT), + "{}:flatbuffer_serializer_mobile".format(ROOT), ], ) fb_xplat_cxx_library( - name = "flatbuffer_serializer", + name = "flatbuffers_serializer_mobile", srcs = ["torch/csrc/jit/serialization/flatbuffer_serializer.cpp"], exported_headers = [ "torch/csrc/jit/serialization/flatbuffer_serializer.h", @@ -1714,17 +1718,16 @@ def define_buck_targets( "-fexceptions", "-frtti", "-Wno-deprecated-declarations", - ], + ] + (["-DFB_XPLAT_BUILD"] if not IS_OSS else []), visibility = ["PUBLIC"], deps = [ + ":mobile_bytecode", ":torch_mobile_module", C10, + third_party("flatbuffers-api"), ], exported_deps = [ - ":flatbuffer_loader", - ":mobile_bytecode", ":torch_mobile_train", - third_party("flatbuffers-api"), ], ) @@ -1739,11 +1742,10 @@ def define_buck_targets( compiler_flags = get_pt_compiler_flags() + ["-Wno-error"], exported_preprocessor_flags = get_pt_preprocessor_flags() + [ "-DUSE_KINETO", - "-DUSE_KINETO_UPDATED", # Need this otherwise USE_KINETO is undefed # for mobile "-DEDGE_PROFILER_USE_KINETO", - ], + ] + (["-DFB_XPLAT_BUILD"] if not IS_OSS else []), extra_flags = { "fbandroid_compiler_flags": ["-frtti"], }, @@ -1758,16 +1760,18 @@ def define_buck_targets( "-Wl,--no-as-needed", ], visibility = ["PUBLIC"], - exported_deps = [ + deps = [ ":mobile_bytecode", - ":torch_mobile_deserialize", third_party("flatbuffers-api"), + ], + exported_deps = [ + ":torch_mobile_deserialize", C10, ], ) fb_xplat_cxx_library( - name = "flatbuffer_serializer_jit", + name = "flatbuffers_serializer_jit", srcs = ["torch/csrc/jit/serialization/flatbuffer_serializer_jit.cpp"], exported_headers = [ "torch/csrc/jit/serialization/flatbuffer_serializer_jit.h", @@ -1785,22 +1789,29 @@ def define_buck_targets( visibility = ["PUBLIC"], deps = [ ":flatbuffer_loader", - ":flatbuffer_serializer", - ":mobile_bytecode", + ":flatbuffers_serializer_mobile", ":torch_core", ":torch_mobile_module", - third_party("flatbuffers-api"), C10, ], ) fb_xplat_cxx_library( - name = "torch_flatbuffer_all", + name = "flatbuffers_jit", + visibility = ["PUBLIC"], + exported_deps = [ + ":flatbuffer_loader", + ":flatbuffers_serializer_mobile", + ":flatbuffers_serializer_jit", + ], + ) + + fb_xplat_cxx_library( + name = "flatbuffers_mobile", visibility = ["PUBLIC"], exported_deps = [ 
":flatbuffer_loader", - ":flatbuffer_serializer", - ":flatbuffer_serializer_jit", + ":flatbuffers_serializer_mobile", ], ) diff --git a/build.bzl b/build.bzl index ac9ceaa0559de..5715e34786d45 100644 --- a/build.bzl +++ b/build.bzl @@ -92,6 +92,7 @@ def define_targets(rules): ":LazyIr.h", ":LazyNonNativeIr.h", ":RegisterDispatchKey.cpp", + ":RegisterDispatchDefinitions.ini", ":native_functions.yaml", ":shape_inference.h", ":tags.yaml", diff --git a/build_variables.bzl b/build_variables.bzl index e4b4b82df5f60..f70d4280825af 100644 --- a/build_variables.bzl +++ b/build_variables.bzl @@ -26,6 +26,8 @@ libtorch_nvfuser_runtime_sources = [ "torch/csrc/jit/codegen/cuda/runtime/broadcast.cu", "torch/csrc/jit/codegen/cuda/runtime/fp16_support.cu", "torch/csrc/jit/codegen/cuda/runtime/fused_reduction.cu", + "torch/csrc/jit/codegen/cuda/runtime/fused_welford_helper.cu", + "torch/csrc/jit/codegen/cuda/runtime/fused_welford_impl.cu", "torch/csrc/jit/codegen/cuda/runtime/grid_broadcast.cu", "torch/csrc/jit/codegen/cuda/runtime/grid_reduction.cu", "torch/csrc/jit/codegen/cuda/runtime/grid_sync.cu", @@ -102,7 +104,6 @@ core_sources_common = [ "torch/csrc/jit/frontend/edit_distance.cpp", "torch/csrc/jit/mobile/compatibility/runtime_compatibility.cpp", "torch/csrc/jit/mobile/type_parser.cpp", - "torch/csrc/jit/operator_upgraders/upgraders_guard.cpp", "torch/csrc/jit/operator_upgraders/version_map.cpp", "torch/csrc/jit/runtime/instruction.cpp", "torch/csrc/jit/runtime/jit_exception.cpp", @@ -130,6 +131,7 @@ libtorch_profiler_sources = [ "torch/csrc/autograd/profiler_kineto.cpp", "torch/csrc/profiler/api.cpp", "torch/csrc/profiler/collection.cpp", + "torch/csrc/profiler/execution_graph_observer.cpp", "torch/csrc/profiler/kineto_shim.cpp", "torch/csrc/profiler/nvtx_observer.cpp", "torch/csrc/profiler/kineto_client_interface.cpp", @@ -289,6 +291,7 @@ core_sources_full_mobile_no_backend_interface = [ "torch/csrc/jit/passes/utils/subgraph_utils.cpp", "torch/csrc/jit/passes/utils/optimization_utils.cpp", "torch/csrc/jit/passes/utils/op_registry.cpp", + "torch/csrc/jit/passes/mkldnn_rewrite.cpp", "torch/csrc/jit/passes/xnnpack_rewrite.cpp", "torch/csrc/jit/passes/vulkan_rewrite.cpp", "torch/csrc/jit/passes/metal_rewrite.cpp", @@ -553,6 +556,8 @@ torch_mobile_core = [ # TODO: Remove this dependency "torch/csrc/jit/backends/backend_debug_info.cpp", "torch/csrc/jit/mobile/compatibility/model_compatibility.cpp", + # TODO: This line needs to be uncommented to build mobile in OSS with flatbuffers + # "torch/csrc/jit/mobile/flatbuffer_loader.cpp", "torch/csrc/jit/mobile/function.cpp", "torch/csrc/jit/mobile/import.cpp", "torch/csrc/jit/mobile/interpreter.cpp", @@ -644,7 +649,7 @@ libtorch_cuda_core_sources = [ "torch/csrc/autograd/functions/comm.cpp", "torch/csrc/jit/codegen/cuda/arith.cpp", "torch/csrc/jit/codegen/cuda/compute_at.cpp", - "torch/csrc/jit/codegen/cuda/inline_propagator.cpp", + "torch/csrc/jit/codegen/cuda/inlining.cpp", "torch/csrc/jit/codegen/cuda/compute_at_map.cpp", "torch/csrc/jit/codegen/cuda/codegen.cpp", "torch/csrc/jit/codegen/cuda/contiguity.cpp", @@ -678,6 +683,7 @@ libtorch_cuda_core_sources = [ "torch/csrc/jit/codegen/cuda/lower_alias_memory.cpp", "torch/csrc/jit/codegen/cuda/lower_allocation.cpp", "torch/csrc/jit/codegen/cuda/lower_double_buffer.cpp", + "torch/csrc/jit/codegen/cuda/lower_divisible_split.cpp", "torch/csrc/jit/codegen/cuda/lower_expr_sort.cpp", "torch/csrc/jit/codegen/cuda/lower_fused_reduction.cpp", "torch/csrc/jit/codegen/cuda/lower_fusion_simplifier.cpp", @@ -718,12 +724,14 
@@ libtorch_cuda_core_sources = [ "torch/csrc/jit/codegen/cuda/root_domain_map.cpp", "torch/csrc/jit/codegen/cuda/scheduler/pointwise.cpp", "torch/csrc/jit/codegen/cuda/scheduler/pointwise_utils.cpp", + "torch/csrc/jit/codegen/cuda/scheduler/transpose.cpp", "torch/csrc/jit/codegen/cuda/scheduler/normalization.cpp", "torch/csrc/jit/codegen/cuda/scheduler/reduction.cpp", "torch/csrc/jit/codegen/cuda/scheduler/matmul.cpp", "torch/csrc/jit/codegen/cuda/scheduler/reduction_utils.cpp", "torch/csrc/jit/codegen/cuda/scheduler/registry.cpp", "torch/csrc/jit/codegen/cuda/scheduler/utils.cpp", + "torch/csrc/jit/codegen/cuda/scheduler/vectorize_helper.cpp", "torch/csrc/jit/codegen/cuda/type_inference.cpp", "torch/csrc/jit/codegen/cuda/type_promotion.cpp", "torch/csrc/jit/codegen/cuda/fusion_segmenter.cpp", @@ -897,7 +905,9 @@ libtorch_python_core_sources = [ "torch/csrc/jit/passes/onnx/shape_type_inference.cpp", "torch/csrc/jit/passes/onnx/function_extraction.cpp", "torch/csrc/jit/passes/onnx/onnx_log.cpp", + "torch/csrc/jit/passes/onnx/naming.cpp", "torch/csrc/jit/python/pybind_utils.cpp", + "torch/csrc/jit/passes/onnx/pattern_conversion/autograd_function_process.cpp", "torch/csrc/jit/passes/onnx/pattern_conversion/common.cpp", "torch/csrc/jit/passes/onnx/pattern_conversion/pattern_encapsulation.cpp", "torch/csrc/jit/passes/onnx/pattern_conversion/pattern_conversion.cpp", @@ -918,7 +928,7 @@ libtorch_python_core_sources = [ "torch/csrc/monitor/python_init.cpp", "torch/csrc/multiprocessing/init.cpp", "torch/csrc/onnx/init.cpp", - "torch/csrc/profiler/execution_graph_observer.cpp", + "torch/csrc/profiler/python/init.cpp", "torch/csrc/serialization.cpp", "torch/csrc/tensor/python_tensor.cpp", "torch/csrc/utils/init.cpp", @@ -1050,7 +1060,7 @@ aten_cpu_source_non_codegen_list = [ "aten/src/ATen/core/op_registration/infer_schema.cpp", "aten/src/ATen/core/op_registration/op_registration.cpp", "aten/src/ATen/core/operator_name.cpp", - "aten/src/ATen/core/TorchDispatchModeTLS.cpp", + "aten/src/ATen/core/TorchDispatchUtils.cpp", "aten/src/ATen/core/register_symbols.cpp", "aten/src/ATen/core/class_type.cpp", "aten/src/ATen/core/type.cpp", @@ -1069,6 +1079,7 @@ aten_cpu_source_non_codegen_list = [ "aten/src/ATen/native/UpSample.cpp", "aten/src/ATen/native/mkldnn/BinaryOps.cpp", "aten/src/ATen/native/mkldnn/Conv.cpp", + "aten/src/ATen/native/mkldnn/ConvPrepack.cpp", "aten/src/ATen/native/mkldnn/Copy.cpp", "aten/src/ATen/native/mkldnn/Gelu.cpp", "aten/src/ATen/native/mkldnn/IDeepRegistration.cpp", @@ -1077,8 +1088,10 @@ aten_cpu_source_non_codegen_list = [ "aten/src/ATen/native/mkldnn/MKLDNNConversions.cpp", "aten/src/ATen/native/mkldnn/MkldnnTensorMath.cpp", "aten/src/ATen/native/mkldnn/Normalization.cpp", + "aten/src/ATen/native/mkldnn/OpContext.cpp", "aten/src/ATen/native/mkldnn/Pooling.cpp", "aten/src/ATen/native/mkldnn/Prelu.cpp", + "aten/src/ATen/native/mkldnn/RegisterMkldnnOpContextClass.cpp", "aten/src/ATen/native/mkldnn/Relu.cpp", "aten/src/ATen/native/mkldnn/SoftMax.cpp", "aten/src/ATen/native/mkldnn/TensorFactories.cpp", @@ -1096,9 +1109,6 @@ aten_cpu_source_non_codegen_list = [ "aten/src/ATen/Dispatch.cpp", "aten/src/ATen/SavedTensorHooks.cpp", "aten/src/ATen/vulkan/Context.cpp", - "aten/src/ATen/nnapi/nnapi_bind.cpp", - "aten/src/ATen/nnapi/nnapi_wrapper.cpp", - "aten/src/ATen/nnapi/nnapi_model_loader.cpp", "aten/src/ATen/native/prim_native_functions.cpp", "aten/src/ATen/native/verbose_wrapper.cpp", ] @@ -1397,7 +1407,6 @@ aten_native_source_non_codegen_list = [ # Files not in native, but depends on 
native symbols # "aten/src/ATen/TensorIndexing.cpp", "aten/src/ATen/TensorIterator.cpp", - "aten/src/ATen/nnapi/nnapi_register.cpp", ] # 1. Files in ATen/native with a few exceptions diff --git a/c10/core/DispatchKeySet.cpp b/c10/core/DispatchKeySet.cpp index 358703210112a..3cc564bc04ae2 100644 --- a/c10/core/DispatchKeySet.cpp +++ b/c10/core/DispatchKeySet.cpp @@ -50,7 +50,7 @@ constexpr DispatchKeySet math_dispatch_keyset = backend_dispatch_keyset | autograd_dispatch_keyset | // See Note [NestedTensor Not Included in Backend Keys] // The caveat to that note is that nested_tensor is a special case - // where we would like to support composite implict kernels but not + // where we would like to support composite implicit kernels but not // explicit kernels therefore we manually add the key to the // math_dispatch_keyset DispatchKeySet{DispatchKey::NestedTensor}; diff --git a/c10/core/SymInt.cpp b/c10/core/SymInt.cpp index 28e477481390a..944bc6722add0 100644 --- a/c10/core/SymInt.cpp +++ b/c10/core/SymInt.cpp @@ -4,7 +4,8 @@ namespace c10 { -std::array normalize_symints(SymInt a_, SymInt b_) { +#ifndef C10_MOBILE +static std::array normalize_symints(SymInt a_, SymInt b_) { SymIntNode a, b; if (a_.is_symbolic()) a = a_.toSymIntNodeImpl(); @@ -33,14 +34,38 @@ c10::SymInt SymInt::toSymInt(SymIntNode sin_sp) { auto ptr = static_cast( reinterpret_cast(static_cast(sin_sp.release()))); auto rep = (ptr & ~MASK) | IS_SYM; - return c10::SymInt(static_cast(rep)); + return c10::SymInt(UNCHECKED, static_cast(rep)); } +#else +// this code should never be executed on mobile due to inlining of `is_symbolic` +// which always returns `false` on mobile. +// However, if we decide to strip off `SymIntNode` completely from mobile builds +// We would need to stub these methods anyways +c10::SymInt SymInt::toSymInt(SymIntNode sin_sp) { + TORCH_INTERNAL_ASSERT(false, "SymInts aren't available on mobile"); +} +SymIntNode SymInt::toSymIntNodeImpl() const { + TORCH_INTERNAL_ASSERT(false, "SymInts aren't available on mobile"); +} +static std::array normalize_symints(SymInt a_, SymInt b_) { + TORCH_INTERNAL_ASSERT(false, "SymInts aren't available on mobile"); +} +#endif SymInt SymInt::operator+(SymInt sci) const { - TORCH_CHECK( - !this->is_symbolic() && !sci.is_symbolic(), - "Symbolic Add isn't supported yet"); - return SymInt(data_ + sci.data_); + if (!is_symbolic() && !sci.is_symbolic()) { + return SymInt(data_ + sci.data_); + } + auto res = normalize_symints(*this, sci); + return SymInt::toSymInt(res[0]->add(res[1])); +} + +SymInt SymInt::operator-(SymInt sci) const { + if (!is_symbolic() && !sci.is_symbolic()) { + return SymInt(data_ - sci.data_); + } + auto res = normalize_symints(*this, sci); + return SymInt::toSymInt(res[0]->sub(res[1])); } SymInt SymInt::operator*(SymInt sci) const { @@ -51,6 +76,22 @@ SymInt SymInt::operator*(SymInt sci) const { return SymInt::toSymInt(res[0]->mul(res[1])); } +SymInt SymInt::operator/(SymInt sci) const { + if (!is_symbolic() && !sci.is_symbolic()) { + return SymInt(data_ / sci.data_); + } + auto res = normalize_symints(*this, sci); + return SymInt::toSymInt(res[0]->floordiv(res[1])); +} + +SymInt SymInt::operator%(SymInt sci) const { + if (!is_symbolic() && !sci.is_symbolic()) { + return SymInt(data_ % sci.data_); + } + auto res = normalize_symints(*this, sci); + return SymInt::toSymInt(res[0]->mod(res[1])); +} + bool SymInt::operator==(SymInt sci) const { if (!is_symbolic() && !sci.is_symbolic()) { return data_ == sci.data_; @@ -64,22 +105,55 @@ bool SymInt::operator!=(SymInt 
sci) const { } bool SymInt::operator<(SymInt sci) const { - TORCH_CHECK( - !this->is_symbolic() && !sci.is_symbolic(), - "Symbolic lt isn't supported yet"); - return data_ < sci.data_; + if (!is_symbolic() && !sci.is_symbolic()) { + return data_ < sci.data_; + } + auto res = normalize_symints(*this, sci); + return res[0]->lt(res[1])->bool_(); +} + +bool SymInt::operator<=(SymInt sci) const { + if (!is_symbolic() && !sci.is_symbolic()) { + return data_ <= sci.data_; + } + auto res = normalize_symints(*this, sci); + return res[0]->le(res[1])->bool_(); +} + +bool SymInt::operator>(SymInt sci) const { + if (!is_symbolic() && !sci.is_symbolic()) { + return data_ > sci.data_; + } + auto res = normalize_symints(*this, sci); + return res[0]->gt(res[1])->bool_(); +} + +bool SymInt::operator>=(SymInt sci) const { + if (!is_symbolic() && !sci.is_symbolic()) { + return data_ >= sci.data_; + } + auto res = normalize_symints(*this, sci); + return res[0]->ge(res[1])->bool_(); } void SymInt::operator*=(SymInt sci) { - TORCH_CHECK( - !this->is_symbolic() && !sci.is_symbolic(), - "Symbolic mul_ isn't supported yet"); - data_ = data_ * sci.data_; + *this = *this * sci; } bool SymInt::operator<(int64_t sci) const { - TORCH_CHECK(!this->is_symbolic(), "Symbolic lt isn't supported yet"); - return data_ < sci; + return *this < c10::SymInt(sci); +} + +bool SymInt::operator<=(int64_t sci) const { + return *this <= c10::SymInt(sci); +} + +bool SymInt::operator>(int64_t sci) const { + return *this > c10::SymInt(sci); +} + +bool SymInt::operator>=(int64_t sci) const { + return *this >= c10::SymInt(sci); } bool SymInt::operator==(int64_t sci) const { @@ -91,8 +165,7 @@ bool SymInt::operator!=(int64_t sci) const { } SymInt SymInt::operator*(int64_t sci) const { - TORCH_CHECK(!this->is_symbolic(), "Symbolic mul isn't supported yet"); - return SymInt(data_ * sci); + return *this * c10::SymInt(sci); } } // namespace c10 diff --git a/c10/core/SymInt.h b/c10/core/SymInt.h index 331f10305dec0..015260bfaf309 100644 --- a/c10/core/SymInt.h +++ b/c10/core/SymInt.h @@ -27,11 +27,28 @@ namespace c10 { // SymInt will be extenteded to represent a union structure Union[int64_t, // SymIntNodeImpl*] which will be implemented as a single packed int64_t field // named data_. 
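Editor's note: all of the `SymInt` operators above share one shape: take the cheap concrete-integer path when neither operand is symbolic, otherwise normalize both sides to symbolic nodes and delegate to the virtual op. The toy `ToyInt`/`ToyNode` types below sketch that pattern only; the real `SymInt` packs the node pointer into the `int64_t` payload (as the comment above describes) rather than carrying a separate field:

```
#include <cstdint>
#include <iostream>
#include <memory>
#include <string>

// Toy symbolic node: just records an expression string.
struct ToyNode {
  std::string expr;
};

static std::shared_ptr<ToyNode> make_node(std::string expr) {
  return std::make_shared<ToyNode>(ToyNode{std::move(expr)});
}

// Toy "maybe symbolic" integer kept deliberately simple.
struct ToyInt {
  int64_t value;
  std::shared_ptr<ToyNode> node; // non-null => symbolic

  bool is_symbolic() const { return node != nullptr; }

  // Promote a concrete value so both operands speak "node".
  std::shared_ptr<ToyNode> as_node() const {
    return node ? node : make_node(std::to_string(value));
  }

  ToyInt operator+(const ToyInt& other) const {
    // Fast path: plain integer arithmetic when neither side is symbolic.
    if (!is_symbolic() && !other.is_symbolic()) {
      return ToyInt{value + other.value, nullptr};
    }
    // Slow path: normalize both operands and build a symbolic expression.
    return ToyInt{0, make_node("(" + as_node()->expr + " + " + other.as_node()->expr + ")")};
  }
};

int main() {
  ToyInt a{2, nullptr}, b{3, nullptr};
  ToyInt s{0, make_node("s")};
  std::cout << (a + b).value << "\n";      // 5 (concrete path)
  std::cout << (a + s).node->expr << "\n"; // (2 + s) (symbolic path)
  return 0;
}
```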
+ +#ifdef C10_MOBILE +#define SKIP_IS_SYMBOLIC_ON_MOBILE(_) \ + do { \ + } while (0) +#else +#define SKIP_IS_SYMBOLIC_ON_MOBILE(X) TORCH_CHECK(X) +#endif + class C10_API SymInt { + enum Unchecked { + UNCHECKED, + }; + public: - // TODO: this needs to only accept integers, not pointers - /*implicit*/ SymInt(int64_t d) : data_(d){}; - SymInt() = default; + /*implicit*/ SymInt(int64_t d) : data_(d) { + SKIP_IS_SYMBOLIC_ON_MOBILE(!is_symbolic()); + }; + SymInt() : data_(0) {} + + // unchecked c-tor accepting raw `data_` + SymInt(Unchecked, int64_t d) : data_(d) {} // TODO: these implementations are not optimal because they allocate a // temporary and then use the move constructor/assignment @@ -55,12 +72,14 @@ class C10_API SymInt { return *this; } SymInt& operator=(SymInt&& s) { + release_(); // release the current SymIntNode if any data_ = s.data_; if (s.is_symbolic()) s.data_ = 0; return *this; } +#ifndef C10_MOBILE SymIntNodeImpl* toSymIntNodeImplUnowned() const { uint64_t unextended_bits = static_cast(data_) & ~MASK; uint64_t sign_bit_mask = 1ULL << (62 - 1); @@ -70,35 +89,58 @@ class C10_API SymInt { reinterpret_cast(static_cast(extended_bits))); } - ~SymInt() { + void release_() { if (is_symbolic()) { SymIntNode::reclaim(toSymIntNodeImplUnowned()); // steal } } +#else + void release_() {} +#endif + + SymIntNode toSymIntNodeImpl() const; + static c10::SymInt toSymInt(SymIntNode sin); + + ~SymInt() { + release_(); + } int64_t expect_int() const { - TORCH_CHECK(!is_symbolic()); + SKIP_IS_SYMBOLIC_ON_MOBILE(!is_symbolic()); return data_; } - bool is_symbolic() const { + // N.B. It's important to keep this definition in the header + // as we expect if checks to be folded for mobile builds + // where `is_symbolic` is always false + C10_ALWAYS_INLINE bool is_symbolic() const { +#ifdef C10_MOBILE + return false; +#else return (MASK & static_cast(this->data_)) == IS_SYM; +#endif } SymInt operator+(SymInt sci) const; + SymInt operator-(SymInt sci) const; SymInt operator*(SymInt sci) const; + SymInt operator/(SymInt sci) const; + SymInt operator%(SymInt sci) const; bool operator==(SymInt sci) const; bool operator!=(SymInt p2) const; bool operator<(SymInt sci) const; + bool operator<=(SymInt sci) const; + bool operator>(SymInt sci) const; + bool operator>=(SymInt sci) const; void operator*=(SymInt sci); SymInt operator*(int64_t sci) const; bool operator<(int64_t sci) const; bool operator==(int64_t sci) const; bool operator!=(int64_t sci) const; - - SymIntNode toSymIntNodeImpl() const; - static c10::SymInt toSymInt(SymIntNode sin); + bool operator<=(int64_t sci) const; + bool operator>(int64_t sci) const; + bool operator>=(int64_t sci) const; int64_t as_int_unchecked() const { return data_; @@ -134,5 +176,7 @@ class C10_API SymInt { int64_t data_; }; +#undef SKIP_IS_SYMBOLIC_ON_MOBILE + C10_API std::ostream& operator<<(std::ostream& os, SymInt s); } // namespace c10 diff --git a/c10/core/SymIntArrayRef.h b/c10/core/SymIntArrayRef.h index bf2eb65c55366..6bfbc945ef91a 100644 --- a/c10/core/SymIntArrayRef.h +++ b/c10/core/SymIntArrayRef.h @@ -81,9 +81,9 @@ class SymIntArrayRef final { static SymIntArrayRef fromIntArrayRef(IntArrayRef array_ref) { for (size_t i = 0; i < array_ref.size(); ++i) { - TORCH_INTERNAL_ASSERT_DEBUG_ONLY( + TORCH_CHECK( SymInt::check_range(array_ref[i]), - "IntArrayRef contains int that cannot be representative as a SymInt", + "IntArrayRef contains an int that cannot be represented as a SymInt: ", array_ref[i]); } return SymIntArrayRef( diff --git a/c10/core/SymIntNodeImpl.h 
b/c10/core/SymIntNodeImpl.h index e5ffd2d5ef6a3..da4beaeae7dc7 100644 --- a/c10/core/SymIntNodeImpl.h +++ b/c10/core/SymIntNodeImpl.h @@ -33,7 +33,10 @@ class C10_API SymIntNodeImpl : public c10::intrusive_ptr_target { virtual SymIntNode mul(const SymIntNode& other) { TORCH_CHECK(false, "NYI"); }; - virtual SymIntNode div(const SymIntNode& other) { + virtual SymIntNode truediv(const SymIntNode& other) { + TORCH_CHECK(false, "FP division isn't support for SymInts"); + }; + virtual SymIntNode floordiv(const SymIntNode& other) { TORCH_CHECK(false, "NYI"); }; virtual SymIntNode mod(const SymIntNode& other) { diff --git a/c10/core/TensorImpl.cpp b/c10/core/TensorImpl.cpp index e2d8e9684e6f9..5d85e90138c7b 100644 --- a/c10/core/TensorImpl.cpp +++ b/c10/core/TensorImpl.cpp @@ -6,6 +6,7 @@ #include #include #include +#include #include #include @@ -181,7 +182,6 @@ TensorImpl::TensorImpl( if (!is_inference()) { version_counter_ = VariableVersion(/*version=*/0); } - // we would also like to check that non-cpu devices have an index, but some // Caffe2 operators create Storages with default devices. } @@ -209,16 +209,20 @@ void TensorImpl::HandleResize() { // If needed, we will free the data. the next mutable_data() call // will create the data storage. bool reset_tensor = false; + + TORCH_CHECK(!numel_.is_symbolic(), "CAFFE2 doesn't support SymInts"); + int concrete_numel = numel_.as_int_unchecked(); if (reserved_) { // If tensor is reserved then don't claim its memeory unless nbytes() // is smaller than new size - reset_tensor = - storage_.nbytes() < (storage_offset_ + numel_) * data_type_.itemsize(); + reset_tensor = storage_.nbytes() < + (storage_offset_ + concrete_numel) * data_type_.itemsize(); } else { reset_tensor = storage_.nbytes() < - (storage_offset_ + numel_) * data_type_.itemsize() || + (storage_offset_ + concrete_numel) * data_type_.itemsize() || !FLAGS_caffe2_keep_on_shrink || - storage_.nbytes() - (storage_offset_ + numel_) * data_type_.itemsize() > + storage_.nbytes() - + (storage_offset_ + concrete_numel) * data_type_.itemsize() > static_cast(FLAGS_caffe2_max_keep_on_shrink_memory); } @@ -419,6 +423,20 @@ c10::SymIntArrayRef TensorImpl::sym_sizes_custom() const { return sym_sizes_default(); } +c10::SymInt TensorImpl::sym_numel_custom() const { + if (C10_UNLIKELY(is_python_dispatch())) { + return load_pyobj_interpreter()->sym_numel(this); + } + return sym_numel_default(); +} + +c10::SymIntArrayRef TensorImpl::sym_strides_custom() const { + if (C10_UNLIKELY(is_python_dispatch())) { + return load_pyobj_interpreter()->sym_strides(this); + } + return sym_strides_default(); +} + c10::Device TensorImpl::device_custom() const { if (is_python_dispatch()) { return load_pyobj_interpreter()->device(this); @@ -526,17 +544,25 @@ template c10::intrusive_ptr TensorImpl::shallow_copy_and_detach_core( VariableVersion&& version_counter, bool allow_tensor_metadata_change) const { - if (key_set_.has(DispatchKey::Python) && + c10::intrusive_ptr r; + const auto& maybe_torch_dispatch_mode_state = + c10::impl::TorchDispatchModeTLS::get_state(); + // TODO: do we have to exclude after Python dispatch key set? + if (maybe_torch_dispatch_mode_state && !c10::impl::tls_is_dispatch_key_excluded(DispatchKey::Python)) { - auto r = pyobj_interpreter_.load(std::memory_order_acquire)->detach(this); - if (r) { - r->set_version_counter(std::forward(version_counter)); - r->set_allow_tensor_metadata_change(allow_tensor_metadata_change); - return r; - } - // otherwise just copy the TensorImpl and not the PyObject. 
Since - // the interpreter is dead no one can call us out on it + r = maybe_torch_dispatch_mode_state->pyinterpreter()->detach(this); + } else if ( + key_set_.has(DispatchKey::Python) && + !c10::impl::tls_is_dispatch_key_excluded(DispatchKey::Python)) { + r = pyobj_interpreter_.load(std::memory_order_acquire)->detach(this); } + if (r) { + r->set_version_counter(std::forward(version_counter)); + r->set_allow_tensor_metadata_change(allow_tensor_metadata_change); + return r; + } + // otherwise just copy the TensorImpl and not the PyObject. Since + // the interpreter is dead no one can call us out on it auto impl = c10::make_intrusive( // No need to populate Storage; copy_tensor_metadata will do it for us. key_set_, @@ -690,7 +716,7 @@ void TensorImpl::Extend(int64_t num, float growthPct) { sizes_and_strides_.size_at_unchecked(0).as_int_unchecked() * (1 + growthPct / 100)))); auto oldData = std::move(storage_.data_ptr()); - auto oldSize = numel_; + auto oldSize = numel_.as_int_unchecked(); Resize(newCapacity); auto* newData = raw_mutable_data(data_type_); if (data_type_.copy()) { @@ -726,7 +752,7 @@ void TensorImpl::ReserveSpace(int64_t outer_dim) { "Right now ReserveSpace is only supported for contiguous Tensor."); TORCH_CHECK( !has_symbolic_sizes_strides_, - "ReserveSpace() called on tensor with symbolic shape") + "ReserveSpace() called on tensor with symbolic shape"); TORCH_CHECK(storage_.unique(), "Can't call ReserveSpace on shared storage."); // TODO: eliminate newCapacity. @@ -758,7 +784,7 @@ void TensorImpl::Reshape(const std::vector& dims) { "Right now Reshape is only supported for contiguous Tensor."); TORCH_CHECK( !has_symbolic_sizes_strides_, - "Reshape() called on tensor with symbolic shape") + "Reshape() called on tensor with symbolic shape"); int64_t new_size = 1; for (auto d : dims) { @@ -766,7 +792,7 @@ void TensorImpl::Reshape(const std::vector& dims) { new_size *= d; } TORCH_CHECK( - new_size == numel_, + new_size == numel_.as_int_unchecked(), "New size and old size are not equal. You cannot use Reshape, " "but should use Resize." // TODO(jiayq): remove the following warning after pending diffs @@ -828,8 +854,11 @@ void TensorImpl::ShareExternalPointer( data_type != ScalarType::Undefined, "To share with a raw external pointer you need to pass in an " "initialized data_type(TypeMeta)."); + TORCH_CHECK( + !has_symbolic_sizes_strides_, + "ReserveSpace() called on tensor with symbolic shape"); if (!size_bytes) { - size_bytes = numel_ * data_type.itemsize(); + size_bytes = numel_.as_int_unchecked() * data_type.itemsize(); } if (storage_.unique()) { storage_.UniqueStorageShareExternalPointer(std::move(data_ptr), size_bytes); diff --git a/c10/core/TensorImpl.h b/c10/core/TensorImpl.h index a2ffa3123b083..490f92c4c02fd 100644 --- a/c10/core/TensorImpl.h +++ b/c10/core/TensorImpl.h @@ -564,6 +564,21 @@ struct C10_API TensorImpl : public c10::intrusive_ptr_target { virtual c10::SymIntArrayRef sym_sizes_custom() const; + c10::SymInt sym_numel() const { + if (C10_UNLIKELY( + sizes_strides_policy_ >= + static_cast(SizesStridesPolicy::CustomSizes))) { + return sym_numel_custom(); + } + return sym_numel_default(); + } + + inline c10::SymInt sym_numel_default() const { + return numel_; + } + + virtual c10::SymInt sym_numel_custom() const; + /** * Return a reference to the strides of this tensor. This reference remains * valid as long as the tensor is live and not restrided. 
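Editor's note: with `numel_` now stored as a `SymInt`, code paths that cannot handle symbolic shapes (the Caffe2 resize and reshape helpers above, and `raw_mutable_data` further below) first assert concreteness and then narrow to a plain integer via `expect_int()` or `as_int_unchecked()`. The following toy `MaybeSymbolic` type, which is not the real `SymInt` API, sketches that narrowing guard:

```
#include <cstdint>
#include <stdexcept>

// Toy version of the "expect a concrete value" guard.
struct MaybeSymbolic {
  int64_t data;
  bool symbolic;

  // Checked narrowing: refuse to hand out a symbolic value as an int.
  int64_t expect_int() const {
    if (symbolic) {
      throw std::runtime_error("expected a concrete integer, got a symbolic one");
    }
    return data;
  }

  // Unchecked narrowing: the caller has already proved concreteness.
  int64_t as_int_unchecked() const { return data; }
};

int main() {
  MaybeSymbolic n{12, false};
  int64_t concrete = n.expect_int(); // would throw if n were symbolic
  return concrete == 12 ? 0 : 1;
}
```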
@@ -577,6 +592,23 @@ struct C10_API TensorImpl : public c10::intrusive_ptr_target { return strides_default(); } + // TODO: make it non-virtual after a change to XLA + virtual c10::SymIntArrayRef sym_strides() const { + if (C10_UNLIKELY( + sizes_strides_policy_ >= + static_cast(SizesStridesPolicy::CustomStrides))) { + return sym_strides_custom(); + } + return sym_strides_default(); + } + inline c10::SymIntArrayRef sym_strides_default() const { + return c10::SymIntArrayRef( + reinterpret_cast(sizes_and_strides_.strides_data()), + sizes_and_strides_.size()); + } + + virtual c10::SymIntArrayRef sym_strides_custom() const; + /** * Return the size of a tensor at some dimension, wrapping the dimension if * necessary. @@ -746,9 +778,9 @@ struct C10_API TensorImpl : public c10::intrusive_ptr_target { inline int64_t numel_default() const { #ifdef DEBUG - TORCH_INTERNAL_ASSERT(compute_numel() == numel_); + TORCH_INTERNAL_ASSERT(compute_numel() == numel_.as_int_unchecked()); #endif - return numel_; + return numel_.as_int_unchecked(); } public: @@ -1493,7 +1525,8 @@ struct C10_API TensorImpl : public c10::intrusive_ptr_target { * ] for details. */ void set_allow_tensor_metadata_change(bool value) { - allow_tensor_metadata_change_ = value; + // TODO: at some point, we should kill this field completely. + allow_tensor_metadata_change_ = true; } /** @@ -1926,6 +1959,10 @@ struct C10_API TensorImpl : public c10::intrusive_ptr_target { * and a new storage will be created. */ inline void* raw_mutable_data(const caffe2::TypeMeta meta) { + auto concrete_numel = numel_.expect_int(); +#ifdef DEBUG + TORCH_INTERNAL_ASSERT(compute_numel() == concrete_numel); +#endif // For 0-size tensors it's fine to return any pointer (including nullptr) if (data_type_ == meta && storage_initialized()) { return static_cast( @@ -1940,9 +1977,9 @@ struct C10_API TensorImpl : public c10::intrusive_ptr_target { // We can reuse the existing buffer if the current data does not have // a special destructor and the new data doesn't have a special // constructor. - if (numel_ == 0 || + if (concrete_numel == 0 || (meta.placementNew() == nullptr && !had_special_dtor && - (storage_.nbytes() >= (numel_ * data_type_.itemsize())))) { + (storage_.nbytes() >= (concrete_numel * data_type_.itemsize())))) { TORCH_INTERNAL_ASSERT( storage_offset_ == 0); // because we just reallocated return storage_.data(); @@ -1959,18 +1996,18 @@ struct C10_API TensorImpl : public c10::intrusive_ptr_target { // For types that need placement new, we will call it, as well as // making sure that when the data is freed, it calls the right // destruction procedure. - auto size = numel_; auto dtor = data_type_.placementDelete(); - auto data_ptr = allocator->allocate(numel_ * data_type_.itemsize()); + auto data_ptr = + allocator->allocate(concrete_numel * data_type_.itemsize()); storage_.set_data_ptr_noswap(PlacementDeleteContext::makeDataPtr( - std::move(data_ptr), dtor, size, storage_.device())); - data_type_.placementNew()(storage_.data(), numel_); + std::move(data_ptr), dtor, concrete_numel, storage_.device())); + data_type_.placementNew()(storage_.data(), concrete_numel); } else { // For fundamental type, new and delete is easier. 
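Editor's note: for element types with a `placementNew`/`placementDelete` pair, `raw_mutable_data` allocates raw bytes, constructs the elements in place, and records a deleter that destroys them before the storage is freed. The sketch below illustrates that allocation shape with `std::string` elements; it is a generic C++ illustration under that assumption, not the TensorImpl code path:

```
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <new>
#include <string>

int main() {
  const std::size_t n = 4;
  // Allocate raw, uninitialized storage for n non-trivial elements.
  void* raw = std::malloc(n * sizeof(std::string));
  auto* items = static_cast<std::string*>(raw);

  // "placementNew": construct each element in the pre-allocated storage.
  for (std::size_t i = 0; i < n; ++i) {
    new (items + i) std::string("element");
  }

  // "placementDelete": destroy in place before releasing the raw bytes,
  // mirroring what the recorded deleter does when the storage goes away.
  for (std::size_t i = 0; i < n; ++i) {
    std::destroy_at(items + i);
  }
  std::free(raw);
  return 0;
}
```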
storage_.set_data_ptr_noswap( - allocator->allocate(numel_ * data_type_.itemsize())); + allocator->allocate(concrete_numel * data_type_.itemsize())); } - storage_.set_nbytes(numel_ * data_type_.itemsize()); + storage_.set_nbytes(concrete_numel * data_type_.itemsize()); TORCH_INTERNAL_ASSERT( storage_offset_ == 0); // because we just reallocated device_opt_ = storage_.device(); @@ -2045,7 +2082,7 @@ struct C10_API TensorImpl : public c10::intrusive_ptr_target { "empty_tensor_restride() called on tensor with symbolic shape") #ifdef DEBUG TORCH_INTERNAL_ASSERT( - compute_numel() == numel_, + compute_numel() == numel_.as_int_unchecked(), "If you are seeing this error, that means empty_tensor_restride was " "called before setting correct numel"); #endif @@ -2469,7 +2506,7 @@ struct C10_API TensorImpl : public c10::intrusive_ptr_target { // time, we will immediately set sizes to {0} and reset numel to 0. // (Can't do that in the default initializers, because there's no way to // spell "allocate a one-element array" for strides_). - int64_t numel_ = 1; + SymInt numel_ = c10::SymInt(1); // INVARIANT: When storage is non-null, this type meta must // agree with the type meta in storage diff --git a/c10/core/WrapDimMinimal.cpp b/c10/core/WrapDimMinimal.cpp index 2dc359fc5d4fd..920de4b9a38e6 100644 --- a/c10/core/WrapDimMinimal.cpp +++ b/c10/core/WrapDimMinimal.cpp @@ -7,10 +7,13 @@ int64_t maybe_wrap_dim_slow( int64_t dim, int64_t dim_post_expr, bool wrap_scalar) { - if (dim_post_expr <= 0) { + TORCH_CHECK_INDEX( + dim_post_expr >= 0, "Rank cannot be negative but got ", dim_post_expr); + + if (dim_post_expr == 0) { TORCH_CHECK_INDEX( wrap_scalar, - "dimension specified as ", + "Dimension specified as ", dim, " but tensor has no dimensions"); return c10::maybe_wrap_dim(dim, /*dim_post_expr=*/1, /*wrap_scalar=*/false); diff --git a/c10/core/impl/GPUTrace.cpp b/c10/core/impl/GPUTrace.cpp new file mode 100644 index 0000000000000..405ab2c9654a4 --- /dev/null +++ b/c10/core/impl/GPUTrace.cpp @@ -0,0 +1,22 @@ +#include + +#include +#include + +namespace c10 { +namespace impl { + +std::atomic GPUTrace::gpuTraceState{nullptr}; + +bool GPUTrace::haveState{false}; + +void GPUTrace::set_trace(const PyInterpreter* trace) { + static c10::once_flag flag; + c10::call_once(flag, [&]() { + gpuTraceState.store(trace, std::memory_order_release); + haveState = true; + }); +} + +} // namespace impl +} // namespace c10 diff --git a/c10/core/impl/GPUTrace.h b/c10/core/impl/GPUTrace.h new file mode 100644 index 0000000000000..377af88be034a --- /dev/null +++ b/c10/core/impl/GPUTrace.h @@ -0,0 +1,30 @@ +#pragma once + +#include + +namespace c10 { +namespace impl { + +struct C10_API GPUTrace { + // On the x86 architecture the atomic operations are lock-less. + static std::atomic gpuTraceState; + + // When PyTorch migrates to C++20, this should be changed to an atomic flag. + // Currently, the access to this variable is not synchronized, on the basis + // that it will only be flipped once and by the first interpreter that + // accesses it. + static bool haveState; + + // This function will only register the first interpreter that tries to invoke + // it. For all of the next ones it will be a no-op. 
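Editor's note: as the comment above says, `GPUTrace::set_trace` registers only the first interpreter that calls it; every later call is a no-op. A generic sketch of that first-registration-wins idiom using `std::call_once` follows (the c10 version uses `c10::call_once`; `FakeInterpreter`, `set_trace`, and `get_trace` here are illustrative stand-ins):

```
#include <atomic>
#include <cstdio>
#include <mutex>

// Stand-in for the PyInterpreter pointer being registered.
struct FakeInterpreter { const char* name; };

std::atomic<const FakeInterpreter*> g_trace{nullptr};

// First caller wins; every later call is a no-op.
void set_trace(const FakeInterpreter* interp) {
  static std::once_flag flag;
  std::call_once(flag, [interp]() {
    g_trace.store(interp, std::memory_order_release);
  });
}

const FakeInterpreter* get_trace() {
  return g_trace.load(std::memory_order_acquire);
}

int main() {
  FakeInterpreter a{"first"}, b{"second"};
  set_trace(&a);
  set_trace(&b); // ignored: registration already happened
  std::printf("registered: %s\n", get_trace()->name); // prints "first"
  return 0;
}
```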
+ static void set_trace(const PyInterpreter*); + + static const PyInterpreter* get_trace() { + if (!haveState) + return nullptr; + return gpuTraceState.load(std::memory_order_acquire); + } +}; + +} // namespace impl +} // namespace c10 diff --git a/c10/core/impl/PyInterpreter.cpp b/c10/core/impl/PyInterpreter.cpp index eec1d23e66da1..76a54663ff546 100644 --- a/c10/core/impl/PyInterpreter.cpp +++ b/c10/core/impl/PyInterpreter.cpp @@ -5,6 +5,23 @@ namespace c10 { namespace impl { +template +static void noop_trace_gpu_fn(const PyInterpreter*, Ts...) { + TORCH_INTERNAL_ASSERT( + 0, + "attempted to call a GPU trace function after corresponding interpreter died"); +} + +void GPUTraceFunctionWrapper::disarm() { + event_creation_fn_ = &noop_trace_gpu_fn; + event_deletion_fn_ = &noop_trace_gpu_fn; + event_record_fn_ = &noop_trace_gpu_fn; + event_wait_fn_ = &noop_trace_gpu_fn; + memory_allocation_fn_ = &noop_trace_gpu_fn; + memory_deallocation_fn_ = &noop_trace_gpu_fn; + stream_creation_fn_ = &noop_trace_gpu_fn; +} + static std::string noop_name_fn(const PyInterpreter*) { return ""; } @@ -76,6 +93,20 @@ static c10::Layout noop_layout_fn(const PyInterpreter*, const TensorImpl*) { "attempted to call `layout` on Tensor with nontrivial PyObject after corresponding interpreter died"); } +static c10::SymInt noop_sym_numel_fn(const PyInterpreter*, const TensorImpl*) { + TORCH_INTERNAL_ASSERT( + 0, + "attempted to call `sym_numel` on Tensor with nontrivial PyObject after corresponding interpreter died"); +} + +static c10::SymIntArrayRef noop_sym_strides_fn( + const PyInterpreter*, + const TensorImpl*) { + TORCH_INTERNAL_ASSERT( + 0, + "attempted to call `sym_strides` on Tensor with nontrivial PyObject after corresponding interpreter died"); +} + void PyInterpreter::disarm() noexcept { name_fn_ = &noop_name_fn; decref_fn_ = &noop_decref_fn; @@ -88,6 +119,15 @@ void PyInterpreter::disarm() noexcept { sizes_fn_ = &noop_sizes_fn; sym_sizes_fn_ = &noop_sym_sizes_fn; layout_fn_ = &noop_layout_fn; + sym_numel_fn_ = &noop_sym_numel_fn; + trace_gpu_functions.disarm(); + sym_strides_fn_ = &noop_sym_strides_fn; +} + +// Defined out-of-line because it needs access to the definition of TensorImpl. 
+__ubsan_ignore_function__ c10::intrusive_ptr PyInterpreter::detach( + const TensorImpl* self) const { + return (*detach_fn_)(this, self); } } // namespace impl diff --git a/c10/core/impl/PyInterpreter.h b/c10/core/impl/PyInterpreter.h index db3d9753b9dc6..3f125f6dc2be5 100644 --- a/c10/core/impl/PyInterpreter.h +++ b/c10/core/impl/PyInterpreter.h @@ -30,6 +30,46 @@ using Stack = std::vector; namespace c10 { namespace impl { +struct C10_API PyInterpreter; + +struct C10_API GPUTraceFunctionWrapper { + using event_creation_sig = void(const PyInterpreter*, uintptr_t event); + using event_deletion_sig = void(const PyInterpreter*, uintptr_t event); + using event_record_sig = + void(const PyInterpreter*, uintptr_t event, uintptr_t stream); + using event_wait_sig = + void(const PyInterpreter*, uintptr_t event, uintptr_t stream); + using memory_allocation_sig = void(const PyInterpreter*, uintptr_t pointer); + using memory_deallocation_sig = void(const PyInterpreter*, uintptr_t pointer); + using stream_creation_sig = void(const PyInterpreter*, uintptr_t stream); + + event_creation_sig* event_creation_fn_; + event_deletion_sig* event_deletion_fn_; + event_record_sig* event_record_fn_; + event_wait_sig* event_wait_fn_; + memory_allocation_sig* memory_allocation_fn_; + memory_deallocation_sig* memory_deallocation_fn_; + stream_creation_sig* stream_creation_fn_; + + GPUTraceFunctionWrapper( + event_creation_sig* event_creation_fn, + event_deletion_sig* event_deletion_fn, + event_record_sig* event_record_fn, + event_wait_sig* event_wait_fn, + memory_allocation_sig* memory_allocation_fn, + memory_deallocation_sig* memory_deallocation_fn, + stream_creation_sig* stream_creation_fn) + : event_creation_fn_(event_creation_fn), + event_deletion_fn_(event_deletion_fn), + event_record_fn_(event_record_fn), + event_wait_fn_(event_wait_fn), + memory_allocation_fn_(memory_allocation_fn), + memory_deallocation_fn_(memory_deallocation_fn), + stream_creation_fn_(stream_creation_fn) {} + + void disarm(); +}; + // Note [Python interpreter tag] // ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ // Traditionally, PyTorch is layered such that our Python library @@ -136,6 +176,9 @@ struct C10_API PyInterpreter { using sym_sizes_sig = c10::SymIntArrayRef(const PyInterpreter*, const TensorImpl*); using layout_sig = c10::Layout(const PyInterpreter*, const TensorImpl*); + using sym_numel_sig = c10::SymInt(const PyInterpreter*, const TensorImpl*); + using sym_strides_sig = + c10::SymIntArrayRef(const PyInterpreter*, const TensorImpl*); PyInterpreter( name_sig* name_fn, @@ -148,7 +191,10 @@ struct C10_API PyInterpreter { strides_sig* strides, sizes_sig* sizes, sym_sizes_sig* sym_sizes, - layout_sig* layout) + layout_sig* layout, + sym_numel_sig* sym_numel, + sym_strides_sig* sym_strides, + GPUTraceFunctionWrapper trace_gpu_functions) : name_fn_(name_fn), decref_fn_(decref_fn), detach_fn_(detach), @@ -159,7 +205,10 @@ struct C10_API PyInterpreter { strides_fn_(strides), sizes_fn_(sizes), sym_sizes_fn_(sym_sizes), - layout_fn_(layout) {} + layout_fn_(layout), + sym_numel_fn_(sym_numel), + trace_gpu_functions(trace_gpu_functions), + sym_strides_fn_(sym_strides) {} name_sig* name_fn_; decref_sig* decref_fn_; @@ -172,6 +221,9 @@ struct C10_API PyInterpreter { sizes_sig* sizes_fn_; sym_sizes_sig* sym_sizes_fn_; layout_sig* layout_fn_; + sym_numel_sig* sym_numel_fn_; + GPUTraceFunctionWrapper trace_gpu_functions; + sym_strides_sig* sym_strides_fn_; // UBSAN suppression fixes: "call to function // (anonymous 
namespace)::concrete_decref_fn(c10::impl::PyInterpreter const*, @@ -194,9 +246,7 @@ struct C10_API PyInterpreter { // detach, which will also arrange for the PyObject to get copied in this // situation __ubsan_ignore_function__ c10::intrusive_ptr detach( - const TensorImpl* self) const { - return (*detach_fn_)(this, self); - } + const TensorImpl* self) const; // Invoke the Python boxed fallback dispatch to go back into Python __ubsan_ignore_function__ void dispatch( @@ -236,6 +286,53 @@ struct C10_API PyInterpreter { return (*layout_fn_)(this, self); } + __ubsan_ignore_function__ c10::SymInt sym_numel( + const TensorImpl* self) const { + return (*sym_numel_fn_)(this, self); + } + + __ubsan_ignore_function__ void trace_gpu_event_creation( + uintptr_t event) const { + return (*trace_gpu_functions.event_creation_fn_)(this, event); + } + + __ubsan_ignore_function__ void trace_gpu_event_deletion( + uintptr_t event) const { + return (*trace_gpu_functions.event_deletion_fn_)(this, event); + } + + __ubsan_ignore_function__ void trace_gpu_event_record( + uintptr_t event, + uintptr_t stream) const { + return (*trace_gpu_functions.event_record_fn_)(this, event, stream); + } + + __ubsan_ignore_function__ void trace_gpu_event_wait( + uintptr_t event, + uintptr_t stream) const { + return (*trace_gpu_functions.event_wait_fn_)(this, event, stream); + } + + __ubsan_ignore_function__ void trace_gpu_memory_allocation( + uintptr_t ptr) const { + return (*trace_gpu_functions.memory_allocation_fn_)(this, ptr); + } + + __ubsan_ignore_function__ void trace_gpu_memory_deallocation( + uintptr_t ptr) const { + return (*trace_gpu_functions.memory_deallocation_fn_)(this, ptr); + } + + __ubsan_ignore_function__ void trace_gpu_stream_creation( + uintptr_t stream) const { + return (*trace_gpu_functions.stream_creation_fn_)(this, stream); + } + + __ubsan_ignore_function__ c10::SymIntArrayRef sym_strides( + const TensorImpl* self) const { + return (*sym_strides_fn_)(this, self); + } + // Disarm this PyInterpreter, making all of its methods noops. 
// Because the function pointers are raw pointers (not atomics), // a disarm() invocation that is concurrent with active destructors diff --git a/c10/core/impl/TorchDispatchModeTLS.cpp b/c10/core/impl/TorchDispatchModeTLS.cpp new file mode 100644 index 0000000000000..fbf9504f7b5af --- /dev/null +++ b/c10/core/impl/TorchDispatchModeTLS.cpp @@ -0,0 +1,38 @@ +#include +#include +#include +#include + +namespace c10 { +namespace impl { + +thread_local std::shared_ptr torchDispatchModeState; + +void TorchDispatchModeTLS::set_state(std::shared_ptr state) { + if (state) { + c10::impl::tls_set_dispatch_key_included(DispatchKey::Python, true); + c10::impl::tls_set_dispatch_key_included( + DispatchKey::PythonTLSSnapshot, true); + } else { + TorchDispatchModeTLS::reset_state(); + } + torchDispatchModeState = std::move(state); +} + +const std::shared_ptr& TorchDispatchModeTLS::get_state() { + return torchDispatchModeState; +} + +void TorchDispatchModeTLS::reset_state() { + torchDispatchModeState.reset(); + c10::impl::tls_set_dispatch_key_included(DispatchKey::Python, false); + c10::impl::tls_set_dispatch_key_included( + DispatchKey::PythonTLSSnapshot, false); +} + +bool dispatch_mode_enabled() { + return static_cast(c10::impl::TorchDispatchModeTLS::get_state()); +} + +} // namespace impl +} // namespace c10 diff --git a/c10/core/impl/TorchDispatchModeTLS.h b/c10/core/impl/TorchDispatchModeTLS.h new file mode 100644 index 0000000000000..81aa34b11c5fc --- /dev/null +++ b/c10/core/impl/TorchDispatchModeTLS.h @@ -0,0 +1,20 @@ +#pragma once + +#include +#include +#include +#include + +namespace c10 { +namespace impl { + +struct C10_API TorchDispatchModeTLS { + static void set_state(std::shared_ptr state); + static const std::shared_ptr& get_state(); + static void reset_state(); +}; + +C10_API bool dispatch_mode_enabled(); + +} // namespace impl +} // namespace c10 diff --git a/c10/cuda/CUDACachingAllocator.cpp b/c10/cuda/CUDACachingAllocator.cpp index d60f6960e9f91..c6f91f5a59900 100644 --- a/c10/cuda/CUDACachingAllocator.cpp +++ b/c10/cuda/CUDACachingAllocator.cpp @@ -1,6 +1,7 @@ #include +#include #include #include #include @@ -180,6 +181,8 @@ struct Block { int event_count; // number of outstanding CUDA events int gc_count; // counter for prioritizing older / less useful blocks for // garbage collection + std::unique_ptr history; + History* history_last; Block( int device, @@ -279,6 +282,18 @@ struct AllocParams { cudaError_t err; }; +int trimHistoryBefore(Block* block, void* point) { + int n = 0; + while (block->history && block->history->addr < point) { + block->history = std::move(block->history->next); + ++n; + } + if (!block->history) { + block->history_last = nullptr; + } + return n; +} + // Note: cudaEventCreate when concurrently invoked from multiple threads can be // very expensive (at least on certain device/driver combinations). 
Thus, we a) // serialize event creation at a per-device level, and b) pool the events to @@ -534,18 +549,30 @@ class DeviceCachingAllocator { // Maps a capturing stream to its assigned private pool, // in case we want multiple captures to share the same pool ska::flat_hash_map capture_to_pool_map; + std::atomic context_recorder_; public: DeviceCachingAllocator() : large_blocks(BlockComparator, /*is_small=*/false), small_blocks(BlockComparator, /*is_small=*/true) { stats.max_split_size = CachingAllocatorConfig::max_split_size(); + context_recorder_.store(nullptr); + } + + void setContextRecorder(CreateContextFn c) { + context_recorder_.store(c); } // All public methods (except the above) acquire the allocator mutex. // Thus, do not call a public method from another public method. - Block* malloc(int device, size_t size, cudaStream_t stream) { + Block* malloc(int device, size_t orig_size, cudaStream_t stream) { + // done outside the lock because we don't know what locks the recorder needs + // to have... + CreateContextFn context_recorder = context_recorder_.load(); + std::unique_ptr context = + context_recorder ? context_recorder() : nullptr; + std::unique_lock lock(mutex); if (C10_LIKELY(captures_underway == 0)) { @@ -562,7 +589,7 @@ class DeviceCachingAllocator { process_events(); } - size = round_size(size); + size_t size = round_size(orig_size); auto& pool = get_pool(size, stream); const size_t alloc_size = get_allocation_size(size); AllocParams params(device, size, stream, &pool, alloc_size, stats); @@ -637,7 +664,7 @@ class DeviceCachingAllocator { // possible "cached" memory to the driver. The only remaining "cached" // memory is split from a larger block that is partially in-use. TORCH_CHECK_WITH( - CUDAOutOfMemoryError, + OutOfMemoryError, false, "CUDA out of memory. Tried to allocate ", format_size(alloc_size), @@ -685,6 +712,10 @@ class DeviceCachingAllocator { bool inserted = pool.blocks.insert(remaining).second; TORCH_INTERNAL_ASSERT_DEBUG_ONLY(inserted); + if (context) { + trimHistoryBefore(remaining, (char*)block->ptr + size); + } + if (already_split) { // An already-split inactive block is being shrunk by size bytes. 
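[Editor's aside, not part of the patch: the context_recorder_ / History plumbing in the malloc path above is driven by the public setContextRecorder(), Context, CreateContextFn and History declarations added to CUDACachingAllocator.h later in this diff. A hedged sketch of how a memory-profiling tool would hook in; StepContext and the helper names are hypothetical.]

```
#include <c10/cuda/CUDACachingAllocator.h>
#include <memory>
#include <string>

namespace alloc = c10::cuda::CUDACachingAllocator;

// Watcher-defined payload; the allocator only stores it and frees it together
// with the History entry it is attached to.
struct StepContext : public alloc::Context {
  std::string tag{"train-step"};
};

// CreateContextFn: invoked for every malloc (outside the allocator lock) once
// a recorder is installed; the result ends up on the block's History chain and
// is exposed as BlockInfo::history in allocator snapshots.
std::unique_ptr<alloc::Context> make_step_context() {
  return std::make_unique<StepContext>();
}

void install_recorder_for_current_device() {
  // Applies to the allocator of the currently selected CUDA device.
  alloc::setContextRecorder(&make_step_context);
}
```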
update_stat_array( @@ -697,6 +728,7 @@ class DeviceCachingAllocator { update_stat(stats.inactive_split[stat_type], 1); }); } + } else if (already_split) { // An already-split block is becoming active for_each_selected_stat_type(params.stat_types, [&](size_t stat_type) { @@ -706,6 +738,17 @@ class DeviceCachingAllocator { } block->allocated = true; + if (context) { + trimHistoryBefore(block, (char*)block->ptr + size); + block->history = std::make_unique(History{ + block->ptr, + orig_size, + std::move(context), + std::move(block->history)}); + if (!block->history_last) { + block->history_last = block->history.get(); + } + } bool inserted = active_blocks.insert(block).second; TORCH_INTERNAL_ASSERT_DEBUG_ONLY(inserted); @@ -894,6 +937,7 @@ class DeviceCachingAllocator { SegmentInfo& segment_info = result.back(); segment_info.device = head_block->device; segment_info.address = reinterpret_cast(head_block->ptr); + segment_info.stream = head_block->stream; segment_info.is_large = (!head_block->pool->is_small); const Block* block = head_block; @@ -913,7 +957,7 @@ class DeviceCachingAllocator { if (block_info.active) { segment_info.active_size += block_info.size; } - + block_info.history = block->history.get(); block = block->next; } } @@ -1107,19 +1151,35 @@ class DeviceCachingAllocator { AT_ASSERT(dst->is_split() && src->is_split()); - if (dst->prev == src) { + if (dst->prev == src) { // [src dst] dst->ptr = src->ptr; dst->prev = src->prev; if (dst->prev) { dst->prev->next = dst; } - } else { + if (!dst->history) { + dst->history = std::move(src->history); + dst->history_last = src->history_last; + } else if (src->history) { + src->history_last->next = std::move(dst->history); + dst->history = std::move(src->history); + } + src->history_last = nullptr; + } else { // [dest src] dst->next = src->next; if (dst->next) { dst->next->prev = dst; } - } + if (!dst->history) { + dst->history = std::move(src->history); + dst->history_last = src->history_last; + } else if (src->history) { + dst->history_last->next = std::move(src->history); + dst->history_last = src->history_last; + } + src->history_last = nullptr; + } const size_t subsumed_size = src->size; dst->size += subsumed_size; auto erased = pool.blocks.erase(src); @@ -1345,7 +1405,14 @@ class DeviceCachingAllocator { std::numeric_limits::max()) return false; BlockPool& pool = *p.pool; - Block key = p.search_key; + + // because of std::unique_ptr, block cannot be trivially copied + Block key( + p.search_key.device, + p.search_key.stream, + p.search_key.size, + p.search_key.pool, + p.search_key.ptr); key.size = (key.size < CachingAllocatorConfig::max_split_size()) ? 
CachingAllocatorConfig::max_split_size() : key.size; @@ -1614,6 +1681,10 @@ class THCCachingAllocator { Block* block = device_allocator[device]->malloc(device, size, stream); add_allocated_block(block); *devPtr = (void*)block->ptr; + const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace(); + if (C10_UNLIKELY(interp)) { + interp->trace_gpu_memory_allocation(reinterpret_cast(*devPtr)); + } } void free(void* ptr) { @@ -1624,6 +1695,11 @@ class THCCachingAllocator { if (!block) { TORCH_CHECK(false, "invalid device pointer: ", ptr); } + const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace(); + if (C10_UNLIKELY(interp)) { + interp->trace_gpu_memory_deallocation( + reinterpret_cast(block->ptr)); + } device_allocator[block->device]->free(block); } @@ -1646,6 +1722,12 @@ class THCCachingAllocator { device_allocator[device]->setMemoryFraction(fraction); } + void setContextRecorder(CreateContextFn recorder) { + int device; + C10_CUDA_CHECK(cudaGetDevice(&device)); + device_allocator[device]->setContextRecorder(std::move(recorder)); + } + void emptyCache() { for (auto& da : device_allocator) da->emptyCache(); @@ -1703,6 +1785,10 @@ bool forceUncachedAllocator() { } static void uncached_delete(void* ptr) { + const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace(); + if (C10_UNLIKELY(interp)) { + interp->trace_gpu_memory_deallocation(reinterpret_cast(ptr)); + } C10_CUDA_CHECK(cudaFree(ptr)); } @@ -1713,7 +1799,7 @@ struct CudaCachingAllocator : public Allocator { DataPtr allocate(size_t size) const override { constexpr size_t one_exa_bytes = 1152921504606846976ULL; TORCH_CHECK_WITH( - CUDAOutOfMemoryError, + OutOfMemoryError, size < one_exa_bytes, "CUDA out of memory. Tried to allocate more than 1EB memory."); int device; @@ -1723,6 +1809,10 @@ struct CudaCachingAllocator : public Allocator { // Deliberately don't use cudaMallocMaybeCapturing here, to force an error // if someone tries to use forceUncachedAllocator while capturing. C10_CUDA_CHECK(cudaMalloc(&r, size)); + const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace(); + if (C10_UNLIKELY(interp)) { + interp->trace_gpu_memory_allocation(reinterpret_cast(r)); + } return {r, r, &uncached_delete, Device(DeviceType::CUDA, device)}; } if (size != 0) { @@ -1754,6 +1844,10 @@ void setMemoryFraction(double fraction, int device) { caching_allocator.setMemoryFraction(fraction, device); } +void setContextRecorder(CreateContextFn recorder) { + caching_allocator.setContextRecorder(std::move(recorder)); +} + void emptyCache(void) { caching_allocator.emptyCache(); } diff --git a/c10/cuda/CUDACachingAllocator.h b/c10/cuda/CUDACachingAllocator.h index 9b1a6ecf15903..0fd23f4e61d58 100644 --- a/c10/cuda/CUDACachingAllocator.h +++ b/c10/cuda/CUDACachingAllocator.h @@ -11,10 +11,6 @@ namespace c10 { -class C10_CUDA_API CUDAOutOfMemoryError : public c10::Error { - using Error::Error; -}; - // Caching allocator will execute every registered callback if it unable to find // block inside of already allocated area. 
class C10_CUDA_API FreeMemoryCallback { @@ -98,6 +94,20 @@ struct DeviceStats { int64_t max_split_size = 0; }; +struct Context { + virtual ~Context() {} +}; + +typedef std::unique_ptr (*CreateContextFn)(void); + +struct History { + void* addr; + size_t real_size; // unrounded, actually requested size + std::unique_ptr context; // per-watcher context + std::unique_ptr next; // when blocks are merged we keep records of + // what used to be in the block +}; + // Struct containing info of an allocation block (i.e. a fractional part of a // cudaMalloc).. struct BlockInfo { @@ -105,6 +115,8 @@ struct BlockInfo { int32_t gc_counter = 0; bool allocated = false; bool active = false; + History* history = + nullptr; // borrowed reference because it is owned by the allocator }; // Struct containing info of a memory segment (i.e. one contiguous cudaMalloc). @@ -114,6 +126,7 @@ struct SegmentInfo { int64_t total_size = 0; int64_t allocated_size = 0; int64_t active_size = 0; + cudaStream_t stream = 0; bool is_large = false; std::vector blocks; }; @@ -147,6 +160,8 @@ C10_CUDA_API void notifyCaptureDestroy(int device, MempoolId_t mempool_id); C10_CUDA_API std::mutex* getFreeMutex(); +C10_CUDA_API void setContextRecorder(CreateContextFn recorder); + C10_CUDA_API std::shared_ptr getIpcDevPtr(std::string handle); } // namespace CUDACachingAllocator diff --git a/c10/cuda/CUDAStream.cpp b/c10/cuda/CUDAStream.cpp index b7fc04b50a8c2..e80026cf81b85 100644 --- a/c10/cuda/CUDAStream.cpp +++ b/c10/cuda/CUDAStream.cpp @@ -1,3 +1,4 @@ +#include #include #include #include @@ -165,6 +166,14 @@ static void initDeviceStreamState(DeviceIndex device_index) { &lowpri_stream, kDefaultFlags, kLowPriority)); C10_CUDA_CHECK(cudaStreamCreateWithPriority( &hipri_stream, kDefaultFlags, kHighPriority)); + + const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace(); + if (C10_UNLIKELY(interp)) { + interp->trace_gpu_stream_creation( + reinterpret_cast(lowpri_stream)); + interp->trace_gpu_stream_creation( + reinterpret_cast(hipri_stream)); + } } low_priority_counters[device_index] = 0; diff --git a/c10/cuda/impl/CUDAGuardImpl.h b/c10/cuda/impl/CUDAGuardImpl.h index 583feeec26000..52a9de8ce1cb0 100644 --- a/c10/cuda/impl/CUDAGuardImpl.h +++ b/c10/cuda/impl/CUDAGuardImpl.h @@ -2,6 +2,7 @@ #include #include +#include #include #include @@ -100,6 +101,10 @@ struct CUDAGuardImpl final : public c10::impl::DeviceGuardImplInterface { } C10_CUDA_CHECK(cudaEventCreateWithFlags(cuda_event, cuda_flag)); + const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace(); + if (C10_UNLIKELY(interp)) { + interp->trace_gpu_event_creation(reinterpret_cast(cuda_event)); + } } void destroyEvent(void* event, const DeviceIndex device_index) @@ -110,6 +115,10 @@ struct CUDAGuardImpl final : public c10::impl::DeviceGuardImplInterface { int orig_device; C10_CUDA_CHECK_WARN(cudaGetDevice(&orig_device)); C10_CUDA_CHECK_WARN(cudaSetDevice(device_index)); + const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace(); + if (C10_UNLIKELY(interp)) { + interp->trace_gpu_event_deletion(reinterpret_cast(cuda_event)); + } C10_CUDA_CHECK_WARN(cudaEventDestroy(cuda_event)); C10_CUDA_CHECK_WARN(cudaSetDevice(orig_device)); } @@ -140,6 +149,12 @@ struct CUDAGuardImpl final : public c10::impl::DeviceGuardImplInterface { C10_CUDA_CHECK(cudaEventRecord(cuda_event, cuda_stream)); // Makes the void* point to the (possibly just allocated) CUDA event *event = cuda_event; + const c10::impl::PyInterpreter* interp = 
c10::impl::GPUTrace::get_trace(); + if (C10_UNLIKELY(interp)) { + interp->trace_gpu_event_record( + reinterpret_cast(cuda_event), + reinterpret_cast(cuda_stream.stream())); + } // Resets device setDevice(orig_device); @@ -156,6 +171,12 @@ struct CUDAGuardImpl final : public c10::impl::DeviceGuardImplInterface { cuda_stream, cuda_event, /*flags (must be zero)=*/0)); + const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace(); + if (C10_UNLIKELY(interp)) { + interp->trace_gpu_event_wait( + reinterpret_cast(cuda_event), + reinterpret_cast(cuda_stream.stream())); + } setDevice(orig_device); } diff --git a/c10/macros/Macros.h b/c10/macros/Macros.h index 84a5045d648c7..b97c16421028a 100644 --- a/c10/macros/Macros.h +++ b/c10/macros/Macros.h @@ -114,22 +114,17 @@ // - MSVC 19.14: https://godbolt.org/z/Dzd7gn (requires /std:c++latest) // - Clang 8.0.0: https://godbolt.org/z/3PYL4Z (always advertises support) // - gcc 8.3: https://godbolt.org/z/4tLMQS (always advertises support) -#define C10_NODISCARD -#if defined(__has_cpp_attribute) -#if __has_cpp_attribute(nodiscard) -#undef C10_NODISCARD +#if C10_HAS_CPP_ATTRIBUTE(nodiscard) #define C10_NODISCARD [[nodiscard]] -#endif // Workaround for llvm.org/PR23435, since clang 3.6 and below emit a spurious // error when __has_cpp_attribute is given a scoped attribute in C mode. -#elif __cplusplus && defined(__has_cpp_attribute) -#if __has_cpp_attribute(clang::warn_unused_result) +#elif __cplusplus && C10_HAS_CPP_ATTRIBUTE(clang::warn_unused_result) // TODO: It's possible this is still triggering // https://github.com/pytorch/pytorch/issues/13118 on Windows; if it is, better // fix it. -#undef C10_NODISCARD #define C10_NODISCARD [[clang::warn_unused_result]] -#endif +#else +#define C10_NODISCARD #endif // suppress an unused variable. @@ -243,8 +238,7 @@ using namespace c10::hip; #define C10_FALLTHROUGH #endif -#include -#include +#include #ifdef __HIPCC__ // Unlike CUDA, HIP requires a HIP header to be included for __host__ to work. @@ -332,7 +326,9 @@ constexpr uint32_t CUDA_THREADS_PER_BLOCK_FALLBACK = 256; // CUDA_KERNEL_ASSERT checks the assertion // even when NDEBUG is defined. This is useful for important assertions in CUDA // code that would otherwise be suppressed when building Release. -#if defined(__ANDROID__) || defined(__APPLE__) || defined(USE_ROCM) +#if defined(__ANDROID__) || defined(__APPLE__) || \ + (defined(USE_ROCM) && ROCM_VERSION < 40100) || \ + (defined(USE_ROCM) && defined(ROCM_DISABLE_GPU_ASSERTS)) // Those platforms do not support assert() #define CUDA_KERNEL_ASSERT(cond) #elif defined(_MSC_VER) @@ -361,22 +357,32 @@ extern SYCL_EXTERNAL void __assert_fail( const char* func); #else // __SYCL_DEVICE_ONLY__ #if (defined(__CUDA_ARCH__) && !(defined(__clang__) && defined(__CUDA__))) +// CUDA supports __assert_fail function which are common for both device +// and host side code. __host__ __device__ -#endif // __CUDA_ARCH__ +#endif + + // This forward declaration matching the declaration of __assert_fail + // exactly how it is in glibc in case parts of the program are compiled with + // different NDEBUG settings. Otherwise we might get 'ambiguous declaration' + // error. Note: On ROCm - this declaration serves for host side compilation. void __assert_fail( const char* assertion, const char* file, unsigned int line, - const char* function) throw() -// We match the declaration of __assert_fail exactly how it is in glibc in case -// parts of the program are compiled with different NDEBUG settings. 
Otherwise -// we might get 'ambiguous declaration' error. -#ifdef __GNUC__ - __attribute__((__noreturn__)) -#endif - ; -#endif + const char* function) throw() __attribute__((__noreturn__)); + +#if (defined(__HIP_ARCH__) || defined(__HIP__)) && \ + !defined(ROCM_DISABLE_GPU_ASSERTS) +// ROCm supports __assert_fail only as a device side function. +__device__ __attribute__((noinline)) __attribute__((weak)) void __assert_fail( + const char* assertion, + const char* file, + unsigned int line, + const char* function); +#endif // defined(__HIP_ARCH__) || defined(__HIP__) +#endif // __SYCL_DEVICE_ONLY__ } #endif // NDEBUG #define CUDA_KERNEL_ASSERT(cond) \ diff --git a/c10/test/core/SymInt_test.cpp b/c10/test/core/SymInt_test.cpp index 8892cce015daa..a57e7c706486d 100644 --- a/c10/test/core/SymInt_test.cpp +++ b/c10/test/core/SymInt_test.cpp @@ -4,7 +4,7 @@ #include using namespace c10; - +#ifndef C10_MOBILE void check(int64_t value) { EXPECT_TRUE(SymInt::check_range(value)); const auto i = SymInt(value); @@ -29,3 +29,4 @@ TEST(SymIntTest, AddNode) { TEST(SymIntTest, CheckRange) { EXPECT_FALSE(SymInt::check_range(INT64_MIN)); } +#endif diff --git a/c10/util/Exception.h b/c10/util/Exception.h index 327e4cbfabd11..a869038ea444f 100644 --- a/c10/util/Exception.h +++ b/c10/util/Exception.h @@ -235,6 +235,10 @@ class C10_API LinAlgError : public Error { using Error::Error; }; +class C10_API OutOfMemoryError : public Error { + using Error::Error; +}; + // A utility function to return an exception std::string by prepending its // exception type before its what() content C10_API std::string GetExceptionString(const std::exception& e); diff --git a/c10/util/IdWrapper.h b/c10/util/IdWrapper.h index a22a60cb9fc3d..59b5088c270f8 100644 --- a/c10/util/IdWrapper.h +++ b/c10/util/IdWrapper.h @@ -1,6 +1,7 @@ #pragma once #include +#include #include #include diff --git a/c10/util/SmallVector.cpp b/c10/util/SmallVector.cpp index f70c982c83150..d57f4d97b999e 100644 --- a/c10/util/SmallVector.cpp +++ b/c10/util/SmallVector.cpp @@ -17,6 +17,7 @@ #include #include #include +#include using namespace c10; // Check that no bytes are wasted and everything is well-aligned. 
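[Editor's illustration, not part of the patch: with CUDAOutOfMemoryError removed, both the caching and the uncached CUDA allocator now report allocation failures as the generic c10::OutOfMemoryError added to c10/util/Exception.h above, so C++ callers can special-case OOM without a CUDA-specific error type. A sketch under that assumption; alloc_with_fallback is a hypothetical helper.]

```
#include <c10/cuda/CUDACachingAllocator.h>
#include <c10/util/Exception.h>
#include <cstddef>

// Hypothetical helper: try a large device allocation on the current device;
// on OOM, return cached segments to the driver and retry once at half size.
void* alloc_with_fallback(std::size_t nbytes) {
  namespace alloc = c10::cuda::CUDACachingAllocator;
  try {
    return alloc::raw_alloc(nbytes);
  } catch (const c10::OutOfMemoryError&) {
    alloc::emptyCache(); // give cached blocks back before retrying
  }
  try {
    return alloc::raw_alloc(nbytes / 2);
  } catch (const c10::OutOfMemoryError&) {
    return nullptr; // caller decides how to degrade
  }
}
```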
diff --git a/c10/util/SmallVector.h b/c10/util/SmallVector.h index 1fcc4a1a8f43a..e4672d666a931 100644 --- a/c10/util/SmallVector.h +++ b/c10/util/SmallVector.h @@ -35,6 +35,7 @@ #include #include #include +#include #include #include diff --git a/c10/util/hash.h b/c10/util/hash.h index d4bb42da21c96..9d771e401ed46 100644 --- a/c10/util/hash.h +++ b/c10/util/hash.h @@ -304,6 +304,14 @@ struct hash> { } }; +template +struct hash> { + size_t operator()(const std::pair& pair) const { + std::tuple tuple = std::make_tuple(pair.first, pair.second); + return _hash_detail::simple_get_hash(tuple); + } +}; + template struct hash> { size_t operator()(c10::ArrayRef v) const { diff --git a/c10/util/logging_is_google_glog.h b/c10/util/logging_is_google_glog.h index b5860d8c0c9f4..e5470d22cecd3 100644 --- a/c10/util/logging_is_google_glog.h +++ b/c10/util/logging_is_google_glog.h @@ -50,13 +50,14 @@ INSTANTIATE_FOR_CONTAINER(set) #include // Additional macros on top of glog -#ifndef NDEBUG #define TORCH_CHECK_EQ(val1, val2) CHECK_EQ(val1, val2) #define TORCH_CHECK_NE(val1, val2) CHECK_NE(val1, val2) #define TORCH_CHECK_LE(val1, val2) CHECK_LE(val1, val2) #define TORCH_CHECK_LT(val1, val2) CHECK_LT(val1, val2) #define TORCH_CHECK_GE(val1, val2) CHECK_GE(val1, val2) #define TORCH_CHECK_GT(val1, val2) CHECK_GT(val1, val2) + +#ifndef NDEBUG #define TORCH_DCHECK_EQ(val1, val2) DCHECK_EQ(val1, val2) #define TORCH_DCHECK_NE(val1, val2) DCHECK_NE(val1, val2) #define TORCH_DCHECK_LE(val1, val2) DCHECK_LE(val1, val2) @@ -65,24 +66,6 @@ INSTANTIATE_FOR_CONTAINER(set) #define TORCH_DCHECK_GT(val1, val2) DCHECK_GT(val1, val2) #else // !NDEBUG // These versions generate no code in optimized mode. -#define TORCH_CHECK_EQ(val1, val2) \ - while (false) \ - CHECK_EQ(val1, val2) -#define TORCH_CHECK_NE(val1, val2) \ - while (false) \ - CHECK_NE(val1, val2) -#define TORCH_CHECK_LE(val1, val2) \ - while (false) \ - CHECK_LE(val1, val2) -#define TORCH_CHECK_LT(val1, val2) \ - while (false) \ - CHECK_LT(val1, val2) -#define TORCH_CHECK_GE(val1, val2) \ - while (false) \ - CHECK_GE(val1, val2) -#define TORCH_CHECK_GT(val1, val2) \ - while (false) \ - CHECK_GT(val1, val2) #define TORCH_DCHECK_EQ(val1, val2) \ while (false) \ DCHECK_EQ(val1, val2) diff --git a/c10/util/strides.h b/c10/util/strides.h index 40315a625c61f..8a7f7f6301f67 100644 --- a/c10/util/strides.h +++ b/c10/util/strides.h @@ -9,16 +9,12 @@ static inline DimVector contiguous_strides(const IntArrayRef sizes) { using Int = IntArrayRef::value_type; const Int dims = static_cast(sizes.size()); - DimVector strides; + // With this intialisation we get the case dim == 0 or 1 right + DimVector strides(dims, 1); - if (dims > 0) { - strides.assign(dims, 0); - // Start by populating the last dimension: its strides is always 1. - strides[dims - 1] = 1; - for (auto i = dims - 2; i >= 0; --i) { - // Strides can't be 0 even if sizes are 0. - strides[i] = strides[i + 1] * std::max(sizes[i + 1], Int{1}); - } + for (auto i = dims - 2; i >= 0; --i) { + // Strides can't be 0 even if sizes are 0. 
+ strides[i] = strides[i + 1] * std::max(sizes[i + 1], Int{1}); } return strides; diff --git a/c10/util/variant.h b/c10/util/variant.h index 564dd3b55d018..53001afea28c2 100644 --- a/c10/util/variant.h +++ b/c10/util/variant.h @@ -2253,7 +2253,6 @@ class impl : public copy_assignment> { public: C10_MPARK_INHERITING_CTOR(impl, super) - impl& operator=(const impl& other) = default; template inline void assign(Arg&& arg) { diff --git a/caffe2/CMakeLists.txt b/caffe2/CMakeLists.txt index 65cdd576d9c28..a904898040123 100644 --- a/caffe2/CMakeLists.txt +++ b/caffe2/CMakeLists.txt @@ -63,7 +63,7 @@ if(INTERN_BUILD_ATEN_OPS) set(CMAKE_POSITION_INDEPENDENT_CODE ${__caffe2_CMAKE_POSITION_INDEPENDENT_CODE}) # Generate the headers wrapped by our operator - file(GLOB_RECURSE all_python "${PROJECT_SOURCE_DIR}/torchgen/*.py") + file(GLOB_RECURSE torchgen_python "${PROJECT_SOURCE_DIR}/torchgen/*.py") add_custom_command(OUTPUT ${CMAKE_CURRENT_BINARY_DIR}/contrib/aten/aten_op.h COMMAND "${PYTHON_EXECUTABLE}" ${CMAKE_CURRENT_SOURCE_DIR}/contrib/aten/gen_op.py @@ -72,7 +72,7 @@ if(INTERN_BUILD_ATEN_OPS) --yaml_dir=${CMAKE_CURRENT_BINARY_DIR}/../aten/src/ATen --install_dir=${CMAKE_CURRENT_BINARY_DIR}/contrib/aten DEPENDS - ${all_python} + ${torchgen_python} ${CMAKE_BINARY_DIR}/aten/src/ATen/Declarations.yaml ${CMAKE_CURRENT_SOURCE_DIR}/contrib/aten/gen_op.py ${CMAKE_CURRENT_SOURCE_DIR}/contrib/aten/aten_op_template.h) @@ -425,6 +425,9 @@ if(NOT INTERN_BUILD_MOBILE OR NOT BUILD_CAFFE2_MOBILE) list(APPEND GEN_PER_OPERATOR_FLAG "--per_operator_headers") endif() + file(GLOB_RECURSE autograd_python "${TOOLS_PATH}/autograd/*.py") + file(GLOB_RECURSE autograd_yaml "${TOOLS_PATH}/autograd/*.yaml") + file(GLOB_RECURSE autograd_templates "${TOOLS_PATH}/autograd/templates/*") add_custom_command( OUTPUT ${TORCH_GENERATED_CODE} @@ -438,48 +441,20 @@ if(NOT INTERN_BUILD_MOBILE OR NOT BUILD_CAFFE2_MOBILE) --gen_lazy_ts_backend ${GEN_PER_OPERATOR_FLAG} DEPENDS - "${TORCH_ROOT}/aten/src/ATen/native/native_functions.yaml" - "${TORCH_ROOT}/aten/src/ATen/native/tags.yaml" - "${TORCH_ROOT}/aten/src/ATen/native/ts_native_functions.yaml" - "${TORCH_ROOT}/torch/csrc/lazy/core/shape_inference.h" - "${TORCH_ROOT}/torch/csrc/lazy/ts_backend/ts_native_functions.cpp" - "${TORCH_ROOT}/aten/src/ATen/templates/DispatchKeyNativeFunctions.h" - "${TORCH_ROOT}/aten/src/ATen/templates/DispatchKeyNativeFunctions.cpp" - "${TORCH_ROOT}/aten/src/ATen/templates/LazyIr.h" - "${TORCH_ROOT}/aten/src/ATen/templates/LazyNonNativeIr.h" - "${TORCH_ROOT}/aten/src/ATen/templates/RegisterDispatchKey.cpp" - "${TOOLS_PATH}/autograd/templates/VariableType.h" - "${TOOLS_PATH}/autograd/templates/VariableType.cpp" - "${TOOLS_PATH}/autograd/templates/ADInplaceOrViewType.cpp" - "${TOOLS_PATH}/autograd/templates/TraceType.cpp" - "${TOOLS_PATH}/autograd/templates/Functions.h" - "${TOOLS_PATH}/autograd/templates/Functions.cpp" - "${TOOLS_PATH}/autograd/templates/python_functions.h" - "${TOOLS_PATH}/autograd/templates/python_functions.cpp" - "${TOOLS_PATH}/autograd/templates/python_variable_methods.cpp" - "${TOOLS_PATH}/autograd/templates/python_torch_functions.cpp" - "${TOOLS_PATH}/autograd/templates/python_nn_functions.cpp" - "${TOOLS_PATH}/autograd/templates/python_fft_functions.cpp" - "${TOOLS_PATH}/autograd/templates/python_linalg_functions.cpp" - "${TOOLS_PATH}/autograd/templates/python_sparse_functions.cpp" - "${TOOLS_PATH}/autograd/templates/python_special_functions.cpp" - "${TOOLS_PATH}/autograd/templates/python_return_types.cpp" - 
"${TOOLS_PATH}/autograd/templates/python_enum_tag.cpp" - "${TOOLS_PATH}/autograd/templates/variable_factories.h" - "${TOOLS_PATH}/autograd/templates/annotated_fn_args.py.in" - "${TOOLS_PATH}/autograd/deprecated.yaml" - "${TOOLS_PATH}/autograd/derivatives.yaml" - "${TOOLS_PATH}/autograd/gen_autograd_functions.py" - "${TOOLS_PATH}/autograd/gen_autograd.py" - "${TOOLS_PATH}/autograd/gen_python_functions.py" - "${TOOLS_PATH}/autograd/gen_variable_factories.py" - "${TOOLS_PATH}/autograd/gen_variable_type.py" - "${TOOLS_PATH}/autograd/gen_inplace_or_view_type.py" - "${TOOLS_PATH}/autograd/load_derivatives.py" - "${TORCH_ROOT}/torchgen/gen_backend_stubs.py" - "${TORCH_ROOT}/torchgen/gen_lazy_tensor.py" - "${TORCH_ROOT}/torchgen/api/lazy.py" - "${TORCH_ROOT}/torchgen/dest/lazy_ir.py" + "${TORCH_ROOT}/aten/src/ATen/native/native_functions.yaml" + "${TORCH_ROOT}/aten/src/ATen/native/tags.yaml" + "${TORCH_ROOT}/aten/src/ATen/native/ts_native_functions.yaml" + "${TORCH_ROOT}/torch/csrc/lazy/core/shape_inference.h" + "${TORCH_ROOT}/torch/csrc/lazy/ts_backend/ts_native_functions.cpp" + "${TORCH_ROOT}/aten/src/ATen/templates/DispatchKeyNativeFunctions.h" + "${TORCH_ROOT}/aten/src/ATen/templates/DispatchKeyNativeFunctions.cpp" + "${TORCH_ROOT}/aten/src/ATen/templates/LazyIr.h" + "${TORCH_ROOT}/aten/src/ATen/templates/LazyNonNativeIr.h" + "${TORCH_ROOT}/aten/src/ATen/templates/RegisterDispatchKey.cpp" + ${autograd_python} + ${autograd_yaml} + ${autograd_templates} + ${torchgen_python} WORKING_DIRECTORY "${TORCH_ROOT}") @@ -553,7 +528,6 @@ if(NOT INTERN_BUILD_MOBILE OR NOT BUILD_CAFFE2_MOBILE) ${TORCH_SRC_DIR}/csrc/jit/backends/coreml/objc/PTMCoreMLExecutor.mm ${TORCH_SRC_DIR}/csrc/jit/backends/coreml/objc/PTMCoreMLCompiler.mm ${TORCH_SRC_DIR}/csrc/jit/backends/coreml/objc/PTMCoreMLFeatureProvider.mm - ${TORCH_SRC_DIR}/csrc/jit/backends/coreml/observer/PTMCoreMLObserver.mm ) set_source_files_properties(${TORCH_SRC_DIR}/csrc/jit/backends/coreml/objc/PTMCoreMLBackend.mm PROPERTIES COMPILE_FLAGS "-fno-objc-arc") include_directories(${TORCH_ROOT}/third_party/nlohmann/single_include) @@ -918,11 +892,6 @@ if(HAVE_SOVERSION) VERSION ${TORCH_VERSION} SOVERSION ${TORCH_SOVERSION}) endif() -if(USE_UCC) - target_link_libraries(torch_cpu PRIVATE __caffe2_ucc) - target_compile_definitions(torch_cpu PRIVATE USE_UCC) -endif() - if(USE_ROCM) filter_list(__caffe2_hip_srcs_cpp Caffe2_HIP_SRCS "\\.(cu|hip)$") set_source_files_properties(${__caffe2_hip_srcs_cpp} PROPERTIES HIP_SOURCE_PROPERTY_FORMAT 1) @@ -1070,23 +1039,36 @@ endif() # Codegen selected_mobile_ops.h for template selective build if(BUILD_LITE_INTERPRETER AND SELECTED_OP_LIST) message("running gen_selected_mobile_ops_header for: '${SELECTED_OP_LIST}'") + file(GLOB lite_interpreter_python "${TOOLS_PATH}/lite_interpreter/*.py") if(${TRACING_BASED}) + file(GLOB code_analyzer_python "${TOOLS_PATH}/code_analyzer/*.py") add_custom_command( OUTPUT ${CMAKE_BINARY_DIR}/aten/src/ATen/selected_mobile_ops.h COMMAND - "${PYTHON_EXECUTABLE}" - -m tools.code_analyzer.gen_oplist - --model_file_list_path "${SELECTED_OP_LIST}" - --output_dir "${CMAKE_BINARY_DIR}/aten/src/ATen" + "${PYTHON_EXECUTABLE}" + -m tools.code_analyzer.gen_oplist + --model_file_list_path "${SELECTED_OP_LIST}" + --output_dir "${CMAKE_BINARY_DIR}/aten/src/ATen" + DEPENDS + ${torchgen_python} + ${lite_interpreter_python} + ${code_analyzer_python} + "${SELECTED_OP_LIST}" + "${TORCH_ROOT}/aten/src/ATen/native/native_functions.yaml" WORKING_DIRECTORY "${TORCH_ROOT}") else() add_custom_command( OUTPUT 
${CMAKE_BINARY_DIR}/aten/src/ATen/selected_mobile_ops.h COMMAND - "${PYTHON_EXECUTABLE}" - -m tools.lite_interpreter.gen_selected_mobile_ops_header - --yaml_file_path "${SELECTED_OP_LIST}" - --output_file_path "${CMAKE_BINARY_DIR}/aten/src/ATen" + "${PYTHON_EXECUTABLE}" + -m tools.lite_interpreter.gen_selected_mobile_ops_header + --yaml_file_path "${SELECTED_OP_LIST}" + --output_file_path "${CMAKE_BINARY_DIR}/aten/src/ATen" + DEPENDS + ${torchgen_python} + ${lite_interpreter_python} + "${SELECTED_OP_LIST}" + "${TORCH_ROOT}/aten/src/ATen/native/native_functions.yaml" WORKING_DIRECTORY "${TORCH_ROOT}") endif() diff --git a/caffe2/core/tensor.h b/caffe2/core/tensor.h index 4c5be742d0cf7..f7b3d90fe63a7 100644 --- a/caffe2/core/tensor.h +++ b/caffe2/core/tensor.h @@ -439,6 +439,14 @@ class TORCH_API Tensor final { return impl_->sym_sizes(); } + inline c10::SymInt sym_numel() const { + return impl_->sym_numel(); + } + + inline c10::SymIntArrayRef sym_strides() const { + return impl_->sym_strides(); + } + inline int64_t size_from_dim(int k) const { return size_from_dim_(k, impl_->sizes()); } diff --git a/caffe2/quantization/server/dnnlowp.h b/caffe2/quantization/server/dnnlowp.h index 2f68d156af108..c71ac8dbef6e1 100644 --- a/caffe2/quantization/server/dnnlowp.h +++ b/caffe2/quantization/server/dnnlowp.h @@ -6,7 +6,9 @@ #include #include +#ifdef __x86_64__ #include +#endif #include diff --git a/caffe2/quantization/server/fully_connected_fake_lowp_op.h b/caffe2/quantization/server/fully_connected_fake_lowp_op.h index 6cbfc900e9613..cee1c26498fb9 100644 --- a/caffe2/quantization/server/fully_connected_fake_lowp_op.h +++ b/caffe2/quantization/server/fully_connected_fake_lowp_op.h @@ -16,7 +16,9 @@ #pragma once +#ifdef __x86_64__ #include +#endif #include "caffe2/core/context.h" #include "caffe2/core/operator.h" #include "caffe2/utils/conversions.h" diff --git a/caffe2/serialize/inline_container.cc b/caffe2/serialize/inline_container.cc index 9847bc132264d..9d3cc332ae96e 100644 --- a/caffe2/serialize/inline_container.cc +++ b/caffe2/serialize/inline_container.cc @@ -142,7 +142,13 @@ void PyTorchStreamReader::init() { std::tie(version_ptr, version_size) = getRecord("version"); } std::string version(static_cast(version_ptr.get()), version_size); - version_ = caffe2::stoull(version); + try { + version_ = caffe2::stoull(version); + } catch (const std::invalid_argument &e) { + CAFFE_THROW("Couldn't parse the version ", + version, + " as Long Long."); + } // NOLINTNEXTLINE(clang-diagnostic-sign-compare) if (version_ < kMinSupportedFileFormatVersion) { CAFFE_THROW( diff --git a/caffe2/serialize/inline_container.h b/caffe2/serialize/inline_container.h index 139174fa3d61e..621ffbe9a41ab 100644 --- a/caffe2/serialize/inline_container.h +++ b/caffe2/serialize/inline_container.h @@ -166,11 +166,7 @@ class TORCH_API PyTorchStreamWriter final { std::function writer_func_; // This number will be updated when the model has operators // that have valid upgraders. -#if ENABLE_UPGRADERS uint64_t version_ = kMinProducedFileFormatVersion; -#else - uint64_t version_ = kProducedFileFormatVersion; -#endif bool finalized_ = false; bool err_seen_ = false; friend size_t ostream_write_func( diff --git a/caffe2/serialize/versions.h b/caffe2/serialize/versions.h index 78a91c64fe84f..6e2c27adc8fae 100644 --- a/caffe2/serialize/versions.h +++ b/caffe2/serialize/versions.h @@ -4,18 +4,9 @@ namespace caffe2 { namespace serialize { -// Flag that controls if we want to enable upgraders -// in the server side. 
When this flag is set to False, -// it will switch to old dynamic versioning approach -#define ENABLE_UPGRADERS true - constexpr uint64_t kMinSupportedFileFormatVersion = 0x1L; -#if ENABLE_UPGRADERS constexpr uint64_t kMaxSupportedFileFormatVersion = 0xAL; -#else -constexpr uint64_t kMaxSupportedFileFormatVersion = 0x6L; -#endif // Versions (i.e. why was the version number bumped?) @@ -57,7 +48,6 @@ constexpr uint64_t kMaxSupportedFileFormatVersion = 0x6L; // when given bool or integer fill values. // 6. Write version string to `./data/version` instead of `version`. -#if ENABLE_UPGRADERS // [12/15/2021] // kProducedFileFormatVersion is set to 7 from 3 due to a different // interpretation of what file format version is. @@ -84,9 +74,6 @@ constexpr uint64_t kMaxSupportedFileFormatVersion = 0x6L; // and aten::gelu.out to support the new approximate kwarg. // (see: https://github.com/pytorch/pytorch/pull/61439) constexpr uint64_t kProducedFileFormatVersion = 0xAL; -#else -constexpr uint64_t kProducedFileFormatVersion = 0x3L; -#endif // Absolute minimum version we will write packages. This // means that every package from now on will always be diff --git a/caffe2/sgd/learning_rate_op.cc b/caffe2/sgd/learning_rate_op.cc index e8172ab65efea..7e6545c5adebd 100644 --- a/caffe2/sgd/learning_rate_op.cc +++ b/caffe2/sgd/learning_rate_op.cc @@ -134,6 +134,12 @@ Example usage: .Arg( "cosine_lr_shrink", "defaults to 0.99, part of CompositeCosineLRPolicy") + .Arg( + "num_iter_1", + "(int, default 0) number of iterations over which to warmup for slope policy") + .Arg( + "num_iter_2", + "(int, default 0) number of iterations over which to gradually gate for slope policy") .Input(0, "input", "description needed") .Output(0, "output", "description needed") .DeviceInferenceFunction([](const OperatorDef& def) { @@ -185,5 +191,7 @@ C10_EXPORT_CAFFE2_OP_TO_C10_CPU( "int? cosine_period = 50, " "float? cosine_t_mult = 1.0, " "float? cosine_lr_shrink = 0.99, " - "float? decay = 1.0) -> Tensor output", + "float? decay = 1.0, " + "int? num_iter_1 = 0, " + "int? num_iter_2 = 0) -> Tensor output", LearningRateOpFloatCPU); diff --git a/caffe2/utils/threadpool/ThreadPool.cc b/caffe2/utils/threadpool/ThreadPool.cc index 3f0a2adc233c5..cbccf0749bef1 100644 --- a/caffe2/utils/threadpool/ThreadPool.cc +++ b/caffe2/utils/threadpool/ThreadPool.cc @@ -100,6 +100,17 @@ size_t getDefaultNumThreads() { // Always give precedence to explicit setting. numThreads = FLAGS_pthreadpool_size; } + + /* + * For llvm-tsan, holding limit for the number of locks for a single thread + * is 64. pthreadpool's worst case is the number of threads in a pool. So we + * want to limit the threadpool size to 64 when running with tsan. However, + * sometimes it is tricky to detect if we are running under tsan, for now + * capping the default threadcount to the tsan limit unconditionally. 
+ */ + int tsanThreadLimit = 64; + numThreads = std::min(numThreads, tsanThreadLimit); + return numThreads; } diff --git a/cmake/Dependencies.cmake b/cmake/Dependencies.cmake index c67746d903dc1..0e96653967da6 100644 --- a/cmake/Dependencies.cmake +++ b/cmake/Dependencies.cmake @@ -823,12 +823,8 @@ if(USE_FBGEMM) set_property(TARGET fbgemm PROPERTY POSITION_INDEPENDENT_CODE ON) if("${CMAKE_CXX_COMPILER_ID}" MATCHES "Clang" AND CMAKE_CXX_COMPILER_VERSION VERSION_GREATER 13.0.0) # See https://github.com/pytorch/pytorch/issues/74352 - target_compile_options(asmjit PRIVATE -Wno-deprecated-copy) - if(("${CMAKE_CXX_COMPILER_ID}" STREQUAL "AppleClang" AND CMAKE_CXX_COMPILER_VERSION VERSION_GREATER_EQUAL 13.1.6) - OR("${CMAKE_CXX_COMPILER_ID}" STREQUAL "Clang" AND CMAKE_CXX_COMPILER_VERSION VERSION_GREATER_EQUAL 13.0.0)) - # -Wno-unused-but-set-variable doesn't exist in Apple clang version 13.0.0 (clang-1300.0.29.30) - target_compile_options(asmjit PRIVATE -Wno-unused-but-set-variable) - endif() + target_compile_options_if_supported(asmjit -Wno-deprecated-copy) + target_compile_options_if_supported(asmjit -Wno-unused-but-set-variable) endif() endif() @@ -1443,6 +1439,11 @@ if(USE_GLOO) get_target_property(_include_dirs uv_a INCLUDE_DIRECTORIES) set_target_properties(uv_a PROPERTIES INTERFACE_INCLUDE_DIRECTORIES "${_include_dirs}") endif() + if(USE_NCCL AND NOT USE_SYSTEM_NCCL) + # Tell Gloo build system to use bundled NCCL, see + # https://github.com/facebookincubator/gloo/blob/950c0e23819779a9e0c70b861db4c52b31d1d1b2/cmake/Dependencies.cmake#L123 + set(NCCL_EXTERNAL ON) + endif() # gloo uses cuda_add_library torch_update_find_cuda_flags() add_subdirectory(${CMAKE_CURRENT_LIST_DIR}/../third_party/gloo) diff --git a/cmake/External/nccl.cmake b/cmake/External/nccl.cmake index 84c79c243b43a..2d3821840c179 100644 --- a/cmake/External/nccl.cmake +++ b/cmake/External/nccl.cmake @@ -15,36 +15,52 @@ if(NOT __NCCL_INCLUDED) # this second replacement is needed when there are multiple archs string(REPLACE ";-gencode" " -gencode" NVCC_GENCODE "${NVCC_GENCODE}") + if("${CMAKE_GENERATOR}" MATCHES "Make") + # Recursive make with jobserver for parallelism + set(MAKE_COMMAND "$(MAKE)") + else() + if(DEFINED ENV{MAX_JOBS}) + set(MAX_JOBS "$ENV{MAX_JOBS}") + else() + include(ProcessorCount) + ProcessorCount(NUM_HARDWARE_THREADS) + # Assume 2 hardware threads per cpu core + math(EXPR MAX_JOBS "${NUM_HARDWARE_THREADS} / 2") + endif() + + # Parallel build with CPU load limit to avoid oversubscription + set(MAKE_COMMAND "make" "-j${MAX_JOBS}" "-l${MAX_JOBS}") + endif() + set(__NCCL_BUILD_DIR "${CMAKE_CURRENT_BINARY_DIR}/nccl") ExternalProject_Add(nccl_external SOURCE_DIR ${PROJECT_SOURCE_DIR}/third_party/nccl/nccl BUILD_IN_SOURCE 1 CONFIGURE_COMMAND "" BUILD_COMMAND - env - # TODO: remove these flags when - # https://github.com/pytorch/pytorch/issues/13362 is fixed - "CCACHE_DISABLE=1" - "SCCACHE_DISABLE=1" - make + ${MAKE_COMMAND} "CXX=${CMAKE_CXX_COMPILER}" "CUDA_HOME=${CUDA_TOOLKIT_ROOT_DIR}" "NVCC=${CUDA_NVCC_EXECUTABLE}" "NVCC_GENCODE=${NVCC_GENCODE}" "BUILDDIR=${__NCCL_BUILD_DIR}" "VERBOSE=0" - "-j" - $ENV{MAX_JOBS} - BUILD_BYPRODUCTS "${__NCCL_BUILD_DIR}/lib/libnccl_static.a" + BUILD_BYPRODUCTS "${__NCCL_BUILD_DIR}/lib/libnccl_static.a" INSTALL_COMMAND "" ) # Detect objcopy version execute_process(COMMAND "${CMAKE_OBJCOPY}" "--version" OUTPUT_VARIABLE OBJCOPY_VERSION_STR) - string(REGEX REPLACE "GNU objcopy version ([0-9])\\.([0-9]+).*" "\\1" OBJCOPY_VERSION_MAJOR ${OBJCOPY_VERSION_STR}) - string(REGEX REPLACE 
"GNU objcopy version ([0-9])\\.([0-9]+).*" "\\2" OBJCOPY_VERSION_MINOR ${OBJCOPY_VERSION_STR}) + string(REGEX REPLACE "GNU objcopy .+ ([0-9])\\.([0-9]+).*" "\\1" OBJCOPY_VERSION_MAJOR ${OBJCOPY_VERSION_STR}) + string(REGEX REPLACE "GNU objcopy .+ ([0-9])\\.([0-9]+).*" "\\2" OBJCOPY_VERSION_MINOR ${OBJCOPY_VERSION_STR}) - if((${OBJCOPY_VERSION_MAJOR} GREATER 2) OR ((${OBJCOPY_VERSION_MAJOR} EQUAL 2) AND (${OBJCOPY_VERSION_MINOR} GREATER 27))) + # TODO: Replace me with SKIP_NCCL_SLIMMING option (and investigate why it does not work on newer compilers) + if("$ENV{BUILD_ENVIRONMENT}" MATCHES ".*-libtorch-cxx11-abi$") + # See https://github.com/pytorch/pytorch/issues/83887 + message(WARNING "Skip NCCL library slimming for cxx11-abi builds") + set(__NCCL_LIBRARY_DEP nccl_external) + set(NCCL_LIBRARIES ${__NCCL_BUILD_DIR}/lib/libnccl_static.a) + elseif((${OBJCOPY_VERSION_MAJOR} GREATER 2) OR ((${OBJCOPY_VERSION_MAJOR} EQUAL 2) AND (${OBJCOPY_VERSION_MINOR} GREATER 27))) message(WARNING "Enabling NCCL library slimming") add_custom_command( OUTPUT "${__NCCL_BUILD_DIR}/lib/libnccl_slim_static.a" @@ -53,7 +69,9 @@ if(NOT __NCCL_INCLUDED) COMMAND cd objects COMMAND "${CMAKE_AR}" x "${__NCCL_BUILD_DIR}/lib/libnccl_static.a" COMMAND for obj in all_gather_* all_reduce_* broadcast_* reduce_*.o$ do "${CMAKE_OBJCOPY}" --remove-relocations .nvFatBinSegment --remove-section __nv_relfatbin $$obj$ done - COMMAND "${CMAKE_AR}" cr "${__NCCL_BUILD_DIR}/lib/libnccl_slim_static.a" "*.o" + COMMAND "${CMAKE_AR}" cr "${__NCCL_BUILD_DIR}/lib/libnccl_slim_static.a" "*.o" + COMMAND "${CMAKE_AR}" xN 1 "${__NCCL_BUILD_DIR}/lib/libnccl_static.a" net.o + COMMAND "${CMAKE_AR}" q "${__NCCL_BUILD_DIR}/lib/libnccl_slim_static.a" net.o COMMAND cd - COMMAND "${CMAKE_COMMAND}" -E remove_directory "${__NCCL_BUILD_DIR}/objects" WORKING_DIRECTORY "${__NCCL_BUILD_DIR}" diff --git a/cmake/External/ucc.cmake b/cmake/External/ucc.cmake index 359ea67b1a745..70cdf4b3af2d5 100644 --- a/cmake/External/ucc.cmake +++ b/cmake/External/ucc.cmake @@ -2,19 +2,14 @@ if(NOT __UCC_INCLUDED) set(__UCC_INCLUDED TRUE) if(USE_SYSTEM_UCC) - set(UCX_HOME $ENV{UCX_HOME} CACHE PATH "UCX install directory") - set(UCC_HOME $ENV{UCC_HOME} CACHE PATH "UCC install directory") - - add_library(__caffe2_ucc INTERFACE) - - target_include_directories(__caffe2_ucc INTERFACE ${UCX_HOME}/include/) - target_include_directories(__caffe2_ucc INTERFACE ${UCC_HOME}/include/) - - target_link_libraries(__caffe2_ucc INTERFACE ${UCX_HOME}/lib/libucp.so) - target_link_libraries(__caffe2_ucc INTERFACE ${UCX_HOME}/lib/libucs.so) - target_link_libraries(__caffe2_ucc INTERFACE ${UCC_HOME}/lib/libucc.so) + find_package(UCC REQUIRED) + find_package(UCX REQUIRED) + if(UCC_FOUND AND UCX_FOUND) + add_library(__caffe2_ucc INTERFACE) + target_link_libraries(__caffe2_ucc INTERFACE ucx::ucs ucx::ucp ucc::ucc) + target_include_directories(__caffe2_ucc INTERFACE ${UCC_INCLUDE_DIRS}) + endif() else() message(FATAL_ERROR "USE_SYSTEM_UCC=OFF is not supported yet when using UCC") endif() - endif() diff --git a/cmake/public/LoadHIP.cmake b/cmake/public/LoadHIP.cmake index 87bb57da1543f..89a61b6242856 100644 --- a/cmake/public/LoadHIP.cmake +++ b/cmake/public/LoadHIP.cmake @@ -143,6 +143,9 @@ message("Building PyTorch for GPU arch: ${PYTORCH_ROCM_ARCH}") # Add HIP to the CMAKE Module Path set(CMAKE_MODULE_PATH ${HIP_PATH}/cmake ${CMAKE_MODULE_PATH}) +#Disable kernel assert due to performance regression +set(ROCM_ENABLE_KERNEL_ASSERTS FALSE CACHE BOOL "Kernel asserts are disabled by default for ROCm") + 
macro(find_package_and_print_version PACKAGE_NAME) find_package("${PACKAGE_NAME}" ${ARGN}) message("${PACKAGE_NAME} VERSION: ${${PACKAGE_NAME}_VERSION}") @@ -283,8 +286,18 @@ if(HIP_FOUND) find_package_and_print_version(hipcub REQUIRED) find_package_and_print_version(rocthrust REQUIRED) - # Disable Asserts In Code (Can't use asserts on HIP stack.) - add_definitions(-DNDEBUG) + if(ROCM_VERSION_DEV VERSION_GREATER_EQUAL "4.1.0") + if(ROCM_ENABLE_KERNEL_ASSERTS) + message("ROCm version >= 4.1; enabling asserts") + else() + add_definitions(-DROCM_DISABLE_GPU_ASSERTS) + message("ROCm version >= 4.1; kernel asserts are disabled") + endif() + else() + # Disable Asserts In Code (Can't use asserts on HIP stack.) + add_definitions(-DNDEBUG) + message("ROCm version < 4.1; disablng asserts") + endif() if(HIP_COMPILER STREQUAL clang) set(hip_library_name amdhip64) diff --git a/cmake/public/utils.cmake b/cmake/public/utils.cmake index 0daa6b7f6a3ef..b0c4cc6f08b56 100644 --- a/cmake/public/utils.cmake +++ b/cmake/public/utils.cmake @@ -451,7 +451,6 @@ function(torch_compile_options libname) -Wno-unused-parameter -Wno-unused-function -Wno-unused-result - -Wno-unused-local-typedefs -Wno-missing-field-initializers -Wno-write-strings -Wno-unknown-pragmas @@ -570,3 +569,26 @@ function(torch_update_find_cuda_flags) " CUDA_NVCC_FLAGS_MINSIZEREL = ${FLAGS_MINSIZEREL}") endif() endfunction() + +############################################################################## +# CHeck if given flag is supported and append it to provided outputvar +# Also define HAS_UPPER_CASE_FLAG_NAME variable +# Usage: +# append_cxx_flag_if_supported("-Werror" CMAKE_CXX_FLAGS) +function(append_cxx_flag_if_supported flag outputvar) + string(TOUPPER "HAS${flag}" _FLAG_NAME) + string(REGEX REPLACE "[=-]" "_" _FLAG_NAME "${_FLAG_NAME}") + check_cxx_compiler_flag("${flag}" ${_FLAG_NAME}) + if(${_FLAG_NAME}) + string(APPEND ${outputvar} " ${flag}") + set(${outputvar} "${${outputvar}}" PARENT_SCOPE) + endif() +endfunction() + +function(target_compile_options_if_supported target flag) + set(_compile_options "") + append_cxx_flag_if_supported("${flag}" _compile_options) + if(NOT "${_compile_options}" STREQUAL "") + target_compile_options(${target} PRIVATE ${flag}) + endif() +endfunction() diff --git a/defs_gpu.bzl b/defs_gpu.bzl index 3d6cae8830893..bfc3db8618629 100644 --- a/defs_gpu.bzl +++ b/defs_gpu.bzl @@ -71,9 +71,7 @@ ATEN_NATIVE_CUDA_H_PATTERN = [ ] # T66678203: Clang CUDA rollout -ATEN_CUDA_CLANG_CU_PATTERN = [ - "aten/src/ATen/native/cuda/DistributionBernoulli.cu", -] +ATEN_CUDA_CLANG_CU_PATTERN = [] ### Cuda Files def get_aten_cuda_headers(): diff --git a/docker.Makefile b/docker.Makefile index a1772529d926d..0768f6ecf6ed8 100644 --- a/docker.Makefile +++ b/docker.Makefile @@ -1,6 +1,6 @@ -DOCKER_REGISTRY = docker.io -DOCKER_ORG = $(shell docker info 2>/dev/null | sed '/Username:/!d;s/.* //') -DOCKER_IMAGE = pytorch +DOCKER_REGISTRY ?= docker.io +DOCKER_ORG ?= $(shell docker info 2>/dev/null | sed '/Username:/!d;s/.* //') +DOCKER_IMAGE ?= pytorch DOCKER_FULL_NAME = $(DOCKER_REGISTRY)/$(DOCKER_ORG)/$(DOCKER_IMAGE) ifeq ("$(DOCKER_ORG)","") @@ -8,7 +8,7 @@ $(warning WARNING: No docker user found using results from whoami) DOCKER_ORG = $(shell whoami) endif -CUDA_VERSION = 11.3 +CUDA_VERSION = 11.3.1 CUDNN_VERSION = 8 BASE_RUNTIME = ubuntu:18.04 BASE_DEVEL = nvidia/cuda:$(CUDA_VERSION)-cudnn$(CUDNN_VERSION)-devel-ubuntu18.04 @@ -16,13 +16,13 @@ BASE_DEVEL = nvidia/cuda:$(CUDA_VERSION)-cudnn$(CUDNN_VERSION)-de # The conda channel to 
use to install cudatoolkit CUDA_CHANNEL = nvidia # The conda channel to use to install pytorch / torchvision -INSTALL_CHANNEL = pytorch +INSTALL_CHANNEL ?= pytorch -PYTHON_VERSION = 3.8 -PYTORCH_VERSION = $(shell git describe --tags --always) +PYTHON_VERSION ?= 3.8 +PYTORCH_VERSION ?= $(shell git describe --tags --always) # Can be either official / dev -BUILD_TYPE = dev -BUILD_PROGRESS = auto +BUILD_TYPE ?= dev +BUILD_PROGRESS ?= auto BUILD_ARGS = --build-arg BASE_IMAGE=$(BASE_IMAGE) \ --build-arg PYTHON_VERSION=$(PYTHON_VERSION) \ --build-arg CUDA_VERSION=$(CUDA_VERSION) \ @@ -30,10 +30,32 @@ BUILD_ARGS = --build-arg BASE_IMAGE=$(BASE_IMAGE) \ --build-arg PYTORCH_VERSION=$(PYTORCH_VERSION) \ --build-arg INSTALL_CHANNEL=$(INSTALL_CHANNEL) EXTRA_DOCKER_BUILD_FLAGS ?= + +BUILD ?= build +# Intentionally left blank +PLATFORMS_FLAG ?= +PUSH_FLAG ?= +USE_BUILDX ?= +BUILD_PLATFORMS ?= +WITH_PUSH ?= false +# Setup buildx flags +ifneq ("$(USE_BUILDX)","") +BUILD = buildx build +ifneq ("$(BUILD_PLATFORMS)","") +PLATFORMS_FLAG = --platform="$(BUILD_PLATFORMS)" +endif +# Only set platforms flags if using buildx +ifeq ("$(WITH_PUSH)","true") +PUSH_FLAG = --push +endif +endif + DOCKER_BUILD = DOCKER_BUILDKIT=1 \ - docker build \ + docker $(BUILD) \ --progress=$(BUILD_PROGRESS) \ $(EXTRA_DOCKER_BUILD_FLAGS) \ + $(PLATFORMS_FLAG) \ + $(PUSH_FLAG) \ --target $(BUILD_TYPE) \ -t $(DOCKER_FULL_NAME):$(DOCKER_TAG) \ $(BUILD_ARGS) . @@ -48,7 +70,7 @@ devel-image: DOCKER_TAG := $(PYTORCH_VERSION)-devel devel-image: $(DOCKER_BUILD) -.PHONY: devel-image +.PHONY: devel-push devel-push: BASE_IMAGE := $(BASE_DEVEL) devel-push: DOCKER_TAG := $(PYTORCH_VERSION)-devel devel-push: @@ -59,9 +81,8 @@ runtime-image: BASE_IMAGE := $(BASE_RUNTIME) runtime-image: DOCKER_TAG := $(PYTORCH_VERSION)-runtime runtime-image: $(DOCKER_BUILD) - docker tag $(DOCKER_FULL_NAME):$(DOCKER_TAG) $(DOCKER_FULL_NAME):latest -.PHONY: runtime-image +.PHONY: runtime-push runtime-push: BASE_IMAGE := $(BASE_RUNTIME) runtime-push: DOCKER_TAG := $(PYTORCH_VERSION)-runtime runtime-push: diff --git a/docs/requirements.txt b/docs/requirements.txt index 9a967dd54e0ff..14c93adc22e90 100644 --- a/docs/requirements.txt +++ b/docs/requirements.txt @@ -1,9 +1,12 @@ sphinx==5.0.0 -e git+https://github.com/pytorch/pytorch_sphinx_theme.git#egg=pytorch_sphinx_theme -sphinxcontrib.katex -matplotlib -tensorboard +# TODO: sphinxcontrib.katex 0.9.0 adds a local KaTeX server to speed up pre-rendering +# but it doesn't seem to work and hangs around idly. The initial thought is probably +# something related to Docker setup. We can investigate this later +sphinxcontrib.katex==0.8.6 +matplotlib==3.5.3 +tensorboard==2.10.0 # required to build torch.distributed.elastic.rendezvous.etcd* docs -python-etcd>=0.4.5 -sphinx_copybutton -sphinx-panels +python-etcd==0.4.5 +sphinx-copybutton==0.5.0 +sphinx-panels==0.4.1 diff --git a/docs/source/amp.rst b/docs/source/amp.rst index 0785849c579e2..3c0c77d4bc4f1 100644 --- a/docs/source/amp.rst +++ b/docs/source/amp.rst @@ -26,7 +26,7 @@ However, :class:`torch.autocast` and :class:`torch.cuda.amp.GradScaler` are modu As shown in the CPU example section of :class:`torch.autocast`, "automatic mixed precision training/inference" on CPU with datatype of ``torch.bfloat16`` only uses :class:`torch.autocast`. -For CUDA and CPU, APIs are also provided seperately: +For CUDA and CPU, APIs are also provided separately: * ``torch.autocast("cuda", args...)`` is equivalent to ``torch.cuda.amp.autocast(args...)``. 
* ``torch.autocast("cpu", args...)`` is equivalent to ``torch.cpu.amp.autocast(args...)``. For CPU, only lower precision floating point datatype of ``torch.bfloat16`` is supported for now. diff --git a/docs/source/backends.rst b/docs/source/backends.rst index 152e0144a416d..ffbc8a99081a5 100644 --- a/docs/source/backends.rst +++ b/docs/source/backends.rst @@ -11,6 +11,7 @@ These backends include: - ``torch.backends.cuda`` - ``torch.backends.cudnn`` +- ``torch.backends.mps`` - ``torch.backends.mkl`` - ``torch.backends.mkldnn`` - ``torch.backends.openmp`` diff --git a/docs/source/community/governance.rst b/docs/source/community/governance.rst index 0a7c224256073..cbb8576c89a4d 100644 --- a/docs/source/community/governance.rst +++ b/docs/source/community/governance.rst @@ -60,18 +60,15 @@ design docs, any disputes and dispute resolutions) so that contributors and other interested parties understand the future direction of the project and can participate in discussion. -Within `pytorch/pytorch `__, -maintainer groups are defined in the -`CODEOWNERS `__ -file in the GitHub repository. For other modules that correspond -to repositories, membership is recorded on GitHub as access -level to the repo (i.e. “write” permission). Module maintainers -are given privileges to administrate the repository (except for -`pytorch/pytorch `__ where -they are responsible for a folder). +Responsibilities of the maintainer includes: +* Triaging high priority issues of the module +* Triaging and reviewing and landing high priority pull requests of the module +* Supporting public documentation related to the module +* Running public developer meetings Core Maintainers ---------------- + The core maintainers are expected to have a deep understanding of the PyTorch code base and design philosophies. Their responsibilities include: @@ -130,14 +127,12 @@ The Principles The Process for Nomination ~~~~~~~~~~~~~~~~~~~~~~~~~~ -* We will have a nomination form, where anyone in the community can - nominate a person to a Module maintainer position -* Every 3 months, the core maintainers go through the nominations, - do light filtering around spam or desk-rejection, and draw up a - list of potential nominees. -* The core maintainers ask the specific module maintainers for more - information on the nominee. The information should include the following - items: +* Each module has its own process. Please contact module maintainers for more information. + However, if there is no process identified, you can file a request to the core maintainers + by submitting [this form](https://forms.gle/xNeu1byGMZVHcA2q7). Core maintainers are + meeting every three months. 
diff --git a/docs/source/community/governance.rst b/docs/source/community/governance.rst
index 0a7c224256073..cbb8576c89a4d 100644
--- a/docs/source/community/governance.rst
+++ b/docs/source/community/governance.rst
@@ -60,18 +60,15 @@ design docs, any disputes and dispute resolutions) so that contributors
 and other interested parties understand the future direction of the project and can participate in discussion.

-Within `pytorch/pytorch `__,
-maintainer groups are defined in the
-`CODEOWNERS `__
-file in the GitHub repository. For other modules that correspond
-to repositories, membership is recorded on GitHub as access
-level to the repo (i.e. “write” permission). Module maintainers
-are given privileges to administrate the repository (except for
-`pytorch/pytorch `__ where
-they are responsible for a folder).
+Responsibilities of the maintainer includes:
+* Triaging high priority issues of the module
+* Triaging and reviewing and landing high priority pull requests of the module
+* Supporting public documentation related to the module
+* Running public developer meetings

 Core Maintainers
 ----------------
+
 The core maintainers are expected to have a deep understanding of the PyTorch code base and design philosophies. Their responsibilities include:
@@ -130,14 +127,12 @@ The Principles
 The Process for Nomination
 ~~~~~~~~~~~~~~~~~~~~~~~~~~

-* We will have a nomination form, where anyone in the community can
-  nominate a person to a Module maintainer position
-* Every 3 months, the core maintainers go through the nominations,
-  do light filtering around spam or desk-rejection, and draw up a
-  list of potential nominees.
-* The core maintainers ask the specific module maintainers for more
-  information on the nominee. The information should include the following
-  items:
+* Each module has its own process. Please contact module maintainers for more information.
+  However, if there is no process identified, you can file a request to the core maintainers
+  by submitting [this form](https://forms.gle/xNeu1byGMZVHcA2q7). Core maintainers are
+  meeting every three months.
+* If you are submitting a request to the core maintainers, the information in your request
+  must include the following items:

 * The nominees depth and breadth of code, review and design contributions on the module
diff --git a/docs/source/community/persons_of_interest.rst b/docs/source/community/persons_of_interest.rst
index f6e19db5e8255..cbe5cb1462128 100644
--- a/docs/source/community/persons_of_interest.rst
+++ b/docs/source/community/persons_of_interest.rst
@@ -128,6 +128,14 @@ NVIDIA / CUDA
 - Piotr Bialecki (`ptrblck `__)
 - (emeritus) Xiaoqiang Zheng (`zheng-xq `__)

+NVFuser
+~~~~~~~
+
+- Christian Sarofeen (`csarofeen `__)
+- Alex Jann (`jjsjann123 `__)
+- Piotr Bialecki (`ptrblck `__)
+- Natalia Gimelshein (`ngimel `__)
+
 Intel / MKLDNN
 ~~~~~~~~~~~~~~

@@ -182,10 +190,11 @@ C10 utils and operator dispatch
 - Dmytro Dzhulgakov (`dzhulgakov `__)
 - (emeritus) Sebastian Messmer (`smessmer `__)

-PyTorch -> ONNX
-~~~~~~~~~~~~~~~
+ONNX exporter
+~~~~~~~~~~~~~
 - Bowen Bao (`BowenBao `__)
-- Gary Miguel (`garymm `__)
+- Aaron Bockover (`abock `__)
+- (emeritus) Gary Miguel (`garymm `__)
 - (emeritus) Lara Haidar (`lara-hdr `__)
 - (emeritus) Lu Fang (`houseroad `__)
 - (emeritus) Negin Raoof (`neginraoof `__)
@@ -220,6 +229,7 @@ Apple M1/MPS
 - Alban Desmaison (`alband `__)
 - Nikita Shulga (`malfet `__)
 - Kulin Seth (`kulinseth `__)
+- Ramin Azarmehr (`razarmehr `__)

 PowerPC
 ~~~~~~~
diff --git a/docs/source/conf.py b/docs/source/conf.py
index e8b683cd445cd..098cc3ff61ef9 100644
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -134,8 +134,6 @@
     "unregister_custom_op_symbolic",
     # torch.ao.quantization
     "default_eval_fn",
-    # torch.ao.quantization.backend_config
-    "validate_backend_config_dict",
     # torch.backends
     "disable_global_flags",
     "flags_frozen",
@@ -189,7 +187,10 @@
     "DeserializationStorageContext",
     "DeviceObjType",
     "DictType",
+    "DispatchKey",
+    "DispatchKeySet",
     "EnumType",
+    "ExcludeDispatchKeyGuard",
     "ExecutionPlan",
     "FileCheck",
     "FloatType",
@@ -316,7 +317,7 @@
     "DDPCommHookType",
     # torch.jit.mobile
     "LiteScriptModule",
-    # torch.nn.quantized.modules
+    # torch.ao.nn.quantized.modules
     "DeQuantize",
     "Quantize",
     # torch.utils.backcompat
@@ -492,6 +493,51 @@ def is_not_internal(modname):
         for o in output:
             f.write(o)

+
+def process_docstring(app, what_, name, obj, options, lines):
+    """
+    Custom process to transform docstring lines Remove "Ignore" blocks
+
+    Args:
+        app (sphinx.application.Sphinx): the Sphinx application object
+
+        what (str):
+            the type of the object which the docstring belongs to (one of
+            "module", "class", "exception", "function", "method", "attribute")
+
+        name (str): the fully qualified name of the object
+
+        obj: the object itself
+
+        options: the options given to the directive: an object with
+            attributes inherited_members, undoc_members, show_inheritance
+            and noindex that are true if the flag option of same name was
+            given to the auto directive
+
+        lines (List[str]): the lines of the docstring, see above
+
+    References:
+        https://www.sphinx-doc.org/en/1.5.1/_modules/sphinx/ext/autodoc.html
+        https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html
+    """
+    import re
+    remove_directives = [
+        # Remove all xdoctest directives
+        re.compile(r'\s*>>>\s*#\s*x?doctest:\s*.*'),
+        re.compile(r'\s*>>>\s*#\s*x?doc:\s*.*'),
+    ]
+    filtered_lines = [
+        line for line in lines
+        if not any(pat.match(line) for pat in remove_directives)
+    ]
+    # Modify the lines inplace
+    lines[:] = filtered_lines
+
+    # make sure there is a blank line at the end
+    if lines and lines[-1].strip():
+        lines.append('')
+
+
 # Called automatically by Sphinx, making this `conf.py` an "extension".
 def setup(app):
     # NOTE: in Sphinx 1.8+ `html_css_files` is an official configuration value
@@ -508,6 +554,7 @@ def setup(app):
         add_css(css_file)

     app.connect("build-finished", coverage_post_process)
+    app.connect('autodoc-process-docstring', process_docstring)

 # From PyTorch 1.5, we now use autogenerated files to document classes and
 # functions. This breaks older references since
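Illustrative aside (not part of the diff above): a standalone sketch of the filtering that the new `process_docstring` hook in `conf.py` performs, applied to a made-up docstring.

```
import re

# Same patterns as the hook above: drop xdoctest directive lines.
remove_directives = [
    re.compile(r'\s*>>>\s*#\s*x?doctest:\s*.*'),
    re.compile(r'\s*>>>\s*#\s*x?doc:\s*.*'),
]

docstring_lines = [
    "Example:",
    "    >>> # xdoctest: +SKIP",
    "    >>> torch.ones(2)",
    "    tensor([1., 1.])",
]

filtered = [
    line for line in docstring_lines
    if not any(pat.match(line) for pat in remove_directives)
]
print(filtered)
# ['Example:', '    >>> torch.ones(2)', '    tensor([1., 1.])']
```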
diff --git a/docs/source/cuda.rst b/docs/source/cuda.rst
index 361b60ed546c8..02c3b407aa218 100644
--- a/docs/source/cuda.rst
+++ b/docs/source/cuda.rst
@@ -33,6 +33,7 @@ torch.cuda
     stream
     synchronize
     utilization
+    OutOfMemoryError

 Random Number Generator
 -------------------------
diff --git a/docs/source/elastic/timer.rst b/docs/source/elastic/timer.rst
index e9d4228ee7a6a..f64597c4ce2bf 100644
--- a/docs/source/elastic/timer.rst
+++ b/docs/source/elastic/timer.rst
@@ -18,10 +18,21 @@ Below are the timer server and client pairs that are provided by torchelastic.
    in pairs since there is a messaging protocol between the server and client.

+Below is a pair of timer server and client that is implemented based on
+a ``multiprocess.Queue``.
+
 .. autoclass:: LocalTimerServer

 .. autoclass:: LocalTimerClient

+Below is another pair of timer server and client that is implemented
+based on a named pipe.
+
+.. autoclass:: FileTimerServer
+
+.. autoclass:: FileTimerClient
+
+
 Writing a custom timer server/client
 --------------------------------------
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 4e069c9279a20..f688cbe0134fd 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -91,6 +91,7 @@ Features described in this documentation are classified by release status:
    quantization
    rpc
    torch.random
+   masked
    nested
    sparse
    storage
diff --git a/docs/source/masked.rst b/docs/source/masked.rst
new file mode 100644
index 0000000000000..e70f7b04c1ceb
--- /dev/null
+++ b/docs/source/masked.rst
@@ -0,0 +1,11 @@
+torch.masked
+============
+
+.. automodule:: torch.masked
+.. automodule:: torch.masked.maskedtensor
+
+Introduction
+++++++++++++
+
+WIP. For more information, you can go to github.com/pytorch/maskedtensor for the source code
+or http://pytorch.org/maskedtensor for a number of tutorials
diff --git a/docs/source/notes/cuda.rst b/docs/source/notes/cuda.rst
index c678844edcfaa..ed2d22a657d75 100644
--- a/docs/source/notes/cuda.rst
+++ b/docs/source/notes/cuda.rst
@@ -355,7 +355,7 @@ Use of a caching allocator can interfere with memory checking tools such as
 The behavior of caching allocator can be controlled via environment variable ``PYTORCH_CUDA_ALLOC_CONF``.
-The format is ``PYTORCH_CUDA_ALLOC_CONF=