[DO NOT MERGE] Tracking Devel #1208

Status: Open. Wants to merge 1,311 commits into base: master.
Conversation

csarofeen (Owner)
No description provided.

@csarofeen mentioned this pull request Oct 23, 2021
@jjsjann123 (Collaborator) left a comment:

minor comments for myself to clean up the merge PR

@@ -75,6 +75,7 @@ _(aten, _expm1) \
_(aten, _fft_with_size) \
_(aten, _fill) \
_(aten, _floor) \
_(aten, _indexCopy) \
_(aten, _fused_dropout) \
Collaborator:
Note: don't need this... but it doesn't matter once we cherry-pick pytorch#63937

@@ -557,7 +557,6 @@ std::tuple<Tensor, Tensor, Tensor> _batch_norm_impl_index_backward(
}

// backward in inference mode is not supported in cudnn, fallback to native
// TODO: verify the same thing in miopen
Collaborator:
errr. upstream should have removed this comment by now

scale_bias_relu.cpp
utils.cpp
main.cpp)
add_executable(nvfuser_bench
Collaborator:
indentation

test/test_jit.py Outdated
@@ -10817,6 +10817,89 @@ def addmm_grad_test(b, x, w):
self.assertEqual(w.grad, w_ref.grad)
self.assertEqual(b.grad, b_ref.grad)

def test_layer_norm_grad(self):
Collaborator:
remove this test case.

@@ -0,0 +1,372 @@
#!/usr/bin/env python3
Collaborator:
Oops, remove this file.

@@ -4,6 +4,7 @@
#include <torch/csrc/jit/ir/alias_analysis.h>
#include <torch/csrc/jit/ir/ir.h>
#include <torch/csrc/jit/ir/node_hashing.h>
#include <torch/csrc/jit/jit_log.h>
Collaborator:
We could revert this. 😛

@@ -1301,6 +1301,10 @@ def forward(self, a) -> MyModule:
obj = obj.__original_fn
_rcb = _jit_internal.createResolutionCallbackFromClosure(obj)

# some functions are explicitly marked as not supported in script mode
Collaborator:
oops, this should go to upstream!

@csarofeen requested a review from mruberry as a code owner May 4, 2022 20:30
@csarofeen removed the request for review from mruberry August 26, 2022 18:20
// In PyTorch, reduction of a size-0 tensor is effectively creating a tensor
// filled with the init value.
auto maybe_full =
maybeFullInsteadOfReduction(uint_axes, init, tv, keep_dim, dtype);
Owner (author):
PyTorch, why are you so strange?
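
For reference, the PyTorch behavior the diff and comment above refer to can be reproduced with plain torch calls; this is a minimal illustration, independent of the nvfuser lowering in this PR.

import torch

# Reducing across a size-0 dimension yields a tensor filled with the
# reduction's init value (0 for sum, 1 for prod).
x = torch.empty(0, 3)
print(torch.sum(x, dim=0))   # tensor([0., 0., 0.])
print(torch.prod(x, dim=0))  # tensor([1., 1., 1.])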

jjsjann123 and others added 25 commits November 9, 2022 09:25
* Refactoring of lower_alias_memory
* Look at all the loops rather than just the consumer IDs as there can be
loops not mapped to the consumer.
* Add print debugging for nvfuser

* Refine the dump-exprs code: 1) rename the option, 2) move duplicated logic into the function dumpExprsIfEnabled

Co-authored-by: Feiwen Zhu <mzhu@nvidia.com>
Add computeWith to interleave gmem accesses and computations
* Add Float IR node class

Represents a 32-bit floating-point scalar value. Not supported in
PyTorch, so it can't be used as a fusion input.
* Refactor scalar IR nodes (Int, Double and Bool)

Everything now uses the template class Scalar
* "Vectorize" sequential welford computations

Lift the predicated count division outside of the innermost loop if that
loop is exactly mapped with vectorized IDs and not a reduction domain.
Targeted to address outer-reduction grid welford tuning
liqiangxl and others added 30 commits February 28, 2023 17:27
Disables the index_select / gather Python tests, since upstream backed out autodiff support for these ops.
pytorch#95565

We'll re-enable them once we re-merge the autodiff support with opt-in via an environment variable.
* Fix #2531

Changed ReplayTransformations to take its bool parameters explicitly via
set methods, to avoid accidentally passing those bool arguments in the
wrong order. More verbose, but safer (see the sketch below).
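
A minimal, hypothetical Python sketch of the pattern this commit describes; ReplayTransformations itself is a C++ class in nvfuser, and the class and option names below are invented purely for illustration.

# Hypothetical illustration of "explicit set methods instead of positional bools".
class ReplayOptions:
    def __init__(self) -> None:
        # Defaults; each option is changed only through a named setter.
        self._error_on_failure = True
        self._replay_swizzle = False

    def set_error_on_failure(self, value: bool) -> "ReplayOptions":
        self._error_on_failure = value
        return self  # return self so calls can be chained

    def set_replay_swizzle(self, value: bool) -> "ReplayOptions":
        self._replay_swizzle = value
        return self

# Error-prone: which positional bool is which?
#   replay = ReplayTransformations(target_domain, id_map, False, True)
# Safer: every option is named at the call site.
opts = ReplayOptions().set_error_on_failure(False).set_replay_swizzle(True)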
Implements fundamental logic for multi-device support

Co-authored-by: shmsong <shisong@umich.edu>
Co-authored-by: Sergey Lebedev <sergeyle@nvidia.com>
Co-authored-by: snordmann <snordmann@nvidia.com>
Co-authored-by: Xiang Gao <qasdfgtyuiop@gmail.com>
Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com>
Added a new Python API, fd.ops.add_output(tensor, stride_order), where stride_order means that output axis i is the stride_order[i]th fastest dimension.

E.g., to specify a channel-last output, use fd.ops.add_output(tensor_view, [0, 3, 1, 2]); a given output with shape [N, C, H, W] will then have strides [H*W*C, 1, W*C, C] (a quick stride check follows below).

Implementation details:
It's currently done in a naive way. Since nvfuser doesn't support a user-specified stride order yet, we fake it by:

    adding a permute op on outputs inside the generated kernel, to ensure the output is stored in the correct memory layout;
    after the kernel has executed, permuting the corresponding output to undo the in-kernel permutation, which gives the semantically correct output in the desired memory layout.
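
A quick plain-PyTorch check of the stride pattern quoted above; this only verifies the channels-last stride layout and does not exercise fd.ops.add_output itself.

import torch

# A channels-last [N, C, H, W] tensor has strides [H*W*C, 1, W*C, C].
N, C, H, W = 2, 8, 4, 4
x = torch.empty(N, C, H, W).to(memory_format=torch.channels_last)
assert x.stride() == (H * W * C, 1, W * C, C)
print(x.stride())  # (128, 1, 32, 8)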
Ampere tests running on pre-Ampere devices trigger a CI failure.
Fixing and improving indexing type handling
Fixes #2564

Co-authored-by: Jacob Hinkle <jhinkle@nvidia.com>
Fixes Python handling of expanded broadcast dimensions,
e.g. torch.randint(0, 1, (5, 5), device="cuda").bool().unsqueeze(-1).expand(5, 5, 5)

Changes the contiguity representation in Python.
computeContiguity returns an array whose length is the tensor rank; elements of the array can be True, False, or None, where None indicates that the given dimension is a broadcast (see the illustration below).
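
For reference, the commit's example input built with plain PyTorch (constructed on CPU here for illustration; the original used device="cuda"). The expanded trailing dimension has stride 0, which is the broadcast case a None contiguity entry describes.

import torch

x = torch.randint(0, 1, (5, 5)).bool().unsqueeze(-1).expand(5, 5, 5)
print(x.shape)     # torch.Size([5, 5, 5])
print(x.stride())  # (5, 1, 0) -- the last dimension is an expanded broadcast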
* Clean up compile-time and run-time index options
Fixes index_select on empty/scalar indices; issues found in the Python API (a plain-PyTorch illustration of the empty-index case follows below).

Our stack should support empty tensors (numel() == 0), so the check on that was removed.
A scalar tensor should be used in place of a real Scalar via variable_name[0]; a quick patch was added for that.
---------

Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com>
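
The empty-index case referenced above, shown with plain PyTorch (this does not go through nvfuser): selecting with a zero-element index tensor is valid and yields an empty result.

import torch

src = torch.arange(15, dtype=torch.float32).reshape(5, 3)
empty_idx = torch.empty(0, dtype=torch.long)
out = torch.index_select(src, dim=0, index=empty_idx)
print(out.shape)  # torch.Size([0, 3])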
* Include non-view rfactor IDs in CA map rfactor ID sets
* Fix tests for multicluster fusion
…s. (#2576)

Recomputation for each persistent use should be done after the
accumulation is done.

Currently, recomputation and replaceVal can be done redundantly. For
example, on A100, that happens with NvFuserScheduler_BatchNorm_fp32/64/32/256.