[DO NOT MERGE] Tracking Devel #1208
base: master
Conversation
minor comments for myself to clean up the merge PR
@@ -75,6 +75,7 @@ _(aten, _expm1) \
_(aten, _fft_with_size) \
_(aten, _fill) \
_(aten, _floor) \
_(aten, _indexCopy) \
_(aten, _fused_dropout) \
Note: don't need this... but it doesn't matter once we cherry-pick pytorch#63937
@@ -557,7 +557,6 @@ std::tuple<Tensor, Tensor, Tensor> _batch_norm_impl_index_backward(
}

// backward in inference mode is not supported in cudnn, fallback to native
// TODO: verify the same thing in miopen
errr. upstream should have removed this comment by now
scale_bias_relu.cpp
utils.cpp
main.cpp)
add_executable(nvfuser_bench
indentation
test/test_jit.py (outdated)
@@ -10817,6 +10817,89 @@ def addmm_grad_test(b, x, w):
self.assertEqual(w.grad, w_ref.grad)
self.assertEqual(b.grad, b_ref.grad)

def test_layer_norm_grad(self):
remove this test case.
tools/clang_tidy.py (outdated)
@@ -0,0 +1,372 @@
#!/usr/bin/env python3
Oops, remove this file.
@@ -4,6 +4,7 @@
#include <torch/csrc/jit/ir/alias_analysis.h>
#include <torch/csrc/jit/ir/ir.h>
#include <torch/csrc/jit/ir/node_hashing.h>
#include <torch/csrc/jit/jit_log.h>
We could revert this. 😛
torch/jit/_script.py (outdated)
@@ -1301,6 +1301,10 @@ def forward(self, a) -> MyModule:
obj = obj.__original_fn
_rcb = _jit_internal.createResolutionCallbackFromClosure(obj)

# some functions are explicitly marked as not supported in script mode
oops, this should go to upstream!
// In PyTorch, reduction of a size-0 tensor is effectively creating a tensor
// filled with the init value.
auto maybe_full =
    maybeFullInsteadOfReduction(uint_axes, init, tv, keep_dim, dtype);
PyTorch, why are you so strange?
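For context on the diff comment above: in PyTorch, reducing a size-0 dimension with a reduction that has an identity (e.g. sum or prod) yields a tensor filled with that init value. A quick check in plain PyTorch, independent of the nvfuser code:

```python
import torch

x = torch.empty(0, 3)            # dimension 0 has size 0
print(torch.sum(x, dim=0))       # tensor([0., 0., 0.])  -- filled with sum's init value, 0
print(torch.prod(x, dim=0))      # tensor([1., 1., 1.])  -- filled with prod's init value, 1
```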
* Refactoring of lower_alias_memory
Upstream bump 1109
removing label workflow
* Look at all the loops rather than just the consumer IDs as there can be loops not mapped to the consumer.
* Add print debug for nvfuser. * Refine dump exprs code: 1) rename the option name; 2) move duplicated logic into the function dumpExprsIfEnabled. Co-authored-by: Feiwen Zhu <mzhu@nvidia.com>
Add computeWith to interleave gmem accesses and computations based on problem size (#2191)
* Add Float IR node class. Represents a 32-bit floating-point scalar value; not supported in PyTorch, so it can't be used as an input to fusions.
* Refactor scalar IR nodes (Int, Double and Bool). Everything uses the template class Scalar.
* "Vectorize" sequential welford computations Lift the predicated count division outside of the innermost loop if that loop is exactly mapped with vectorized IDs and not a reduction domain. Targeted to address outer-reduction grid welford tuning
Disables index_select / gather python tests since upstream backs out autodiff support on these ops (pytorch#95565). We'll re-enable them when we re-merge the autodiff support with opt-in via an environment variable.
Implements fundamental logic for multi-device support. Co-authored-by: shmsong <shisong@umich.edu> Co-authored-by: Sergey Lebedev <sergeyle@nvidia.com> Co-authored-by: snordmann <snordmann@nvidia.com> Co-authored-by: Xiang Gao <qasdfgtyuiop@gmail.com> Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com>
Added a new python API fd.ops.add_output(tensor, stride_order), where stride_order means that output axis i is the stride_order[i]-th fastest dimension. For example, to specify an output in channels-last format, we call fd.ops.add_output(tensor_view, [0, 3, 1, 2]), so that an output with shape [N, C, H, W] will have stride [H*W*C, 1, W*C, C]. Implementation details: it's currently done in a naive way. Since nvfuser doesn't support a user-specified stride order yet, we fake it by adding a permute op on outputs inside the generated kernel, ensuring the output is stored in the correct memory layout; after the kernel has executed, we permute the corresponding output to undo the permutation inside the kernel, which gives us the semantically correct output in the desired memory layout.
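As a rough illustration of the stride_order semantics described above (a sketch only; strides_from_stride_order is a made-up helper, not part of nvfuser):

```python
# Hypothetical helper for illustration: derive output strides from a shape and
# a stride_order list. A larger stride_order value marks a faster-varying axis
# (smaller stride), which reproduces the channels-last example above.
def strides_from_stride_order(shape, stride_order):
    strides = [1] * len(shape)
    for i, order_i in enumerate(stride_order):
        for j, order_j in enumerate(stride_order):
            if order_j > order_i:          # axis j varies faster than axis i
                strides[i] *= shape[j]
    return strides

N, C, H, W = 8, 3, 16, 16
# stride_order [0, 3, 1, 2] -> channels-last layout for an [N, C, H, W] output
assert strides_from_stride_order([N, C, H, W], [0, 3, 1, 2]) == [H * W * C, 1, W * C, C]
```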
Ampere tests running on pre-Ampere devices trigger a CI failure.
Fixing and improving indexing type handling
Fixes #2564 Co-authored-by: Jacob Hinkle <jhinkle@nvidia.com>
Fixes python handling of expanded broadcast dimensions, e.g. torch.randint(0, 1, (5, 5), device="cuda").bool().unsqueeze(-1).expand(5, 5, 5). Changes the contiguity representation in python: computeContiguity returns an array whose length is the tensor rank; each element can be True, False, or None, where None indicates that the corresponding dimension is broadcast.
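A rough sketch of that True/False/None convention (plain PyTorch; contiguity_flags is a made-up helper, not the actual computeContiguity implementation):

```python
import torch

def contiguity_flags(t: torch.Tensor):
    """Illustration only: True/False for ordinary dimensions, None for
    expanded broadcast (stride-0) dimensions, walking from the innermost dim."""
    flags = []
    expected_stride = 1
    for size, stride in reversed(list(zip(t.shape, t.stride()))):
        if stride == 0 and size != 1:
            flags.append(None)                   # expanded broadcast dimension
            continue
        flags.append(stride == expected_stride)  # contiguous w.r.t. inner dims?
        expected_stride = stride * size
    return list(reversed(flags))

t = torch.randint(0, 1, (5, 5)).bool().unsqueeze(-1).expand(5, 5, 5)
print(contiguity_flags(t))  # [True, True, None] -- last dim is an expanded broadcast
```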
* Clean up compile-time and run-time index options
Fixes index_select on empty/scalar indices. The issues were found in the python API. Our stack should support empty tensors (numel()==0), so the check on that was removed. A scalar tensor should be used in place of a real Scalar, via variable_name[0]; a quick patch was added for that. Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com>
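For context, "empty indices" here means an index tensor with numel() == 0; in plain PyTorch (independent of the nvfuser python API) this is a valid input:

```python
import torch

x = torch.randn(4, 3)
empty_idx = torch.tensor([], dtype=torch.long)  # numel() == 0
out = torch.index_select(x, 0, empty_idx)
print(out.shape)                                # torch.Size([0, 3]) -- empty result, not an error
```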
* Include non-view rfactor IDs in CA map rfactor ID sets
* fix tests for multicluster fusion
…s. (#2576) Recomputation for each persistent use should be done after the accumulation is done. Currently, recomputation and replaceVal can be done redundantly. For example, on A100, that happens with NvFuserScheduler_BatchNorm_fp32/64/32/256.