Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Full indexing rework #260

Merged
merged 68 commits into from
Aug 12, 2020
Merged

Full indexing rework #260

merged 68 commits into from
Aug 12, 2020

Conversation

csarofeen
Copy link
Owner

@csarofeen csarofeen commented Aug 1, 2020

Doing a lot of indexing rework.

  • Loop construction can be wrong when a TV doesn't have a dimension that its ComputeAt TV has.
  • Loop construction can be wrong when a ComputeAt TV merged an axis that another TV doesn't have that we're indexing into.
  • Unroll predicate can be wrong if consumers don't have the same broadcast patterns when they're in the same unroll loop. (Uncertain this is still true, may be fixed).
  • smem and lmem tensors are indexed wrong if their access isn't exactly the same when they're consumer vs when they're producer.
  • Collapsing needs a slight modification, I believe a merge is contiguous if all merges below are contiguous. Only checking merges are simply on ordered root domains can hit this issue:
TV0 = makeTensor(nDims=3)
TV0->merge(0, 2)
TV0->merge(0, 1)

Right now we'd say this is contiguous, but linear indexing on this is not the same as:

TV0 = makeTensor(nDims=3)
TV0->merge(0, 1)
TV0->merge(0, 1)

The latter should be contiguous merges, the former not.

  • Cleanup unused functions
  • Cleanup interface to tensorview/tensor domain around rfactor
  • Predicate modification when we have rfactor domain.
  • Fix disabled tests (python and cpp)

@jjsjann123
Copy link
Collaborator

I'm merging #277. which temporarily hides all the problem this PR is supposed to fix. We need to revert/undo the changes when testing the correctness of this PR.

Sorry for the inconvenience.


at::Tensor cg_output = at::empty({x, y, z}, options);
void testGPU_FusionComplexBCast() {
{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we split this into separate test functions?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, would you please push a commit for it to this pr?

//
// The reason we need both TensorView and TensorDomain is that we need to have a
// record of both what is being computed and how it is being computed. For
// Example we may have the operation: TV3[I, J, K] = TV2[I, J, K] + TV1[I, J, K]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: example

return rfactor_domain_;
};
// If rfactor domain exists in domain() return it, otherwise return root
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add empty line

@@ -342,8 +342,18 @@ void IRPrinter::handle(const kir::NamedScalar* i) {
os << i->name();
}

void IRPrinter::handle(const kir::IterDomain*) {
TORCH_INTERNAL_ASSERT(false, "Unreachable");
void IRPrinter::handle(const kir::IterDomain* id) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is this needed? I hit this a few times while working on the refactoring, but every time it indicated that something else needed to be updated

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debugging matching between the IterDomain of kir and fusion was pretty challenging when I couldn't print what they were. No remaining code relies on it, but I rely on it for debugging. I had to update Type.cpp

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should now be able to fill the rest out marked as Unreachable

}

std::vector<Expr*> UnsortedExprs::getFrom(std::vector<Val*> outputs) {
if (outputs.empty())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{ ..}

// the life of this context guard.
class TVDomainGuard {
public:
TensorView* tv_;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. if public, no need for _ suffix
  2. = nullptr (same for prev_domain)

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be private, thanks.

@@ -25,18 +29,21 @@ std::vector<kir::Bool*> PredicateCompute::computePredicates(
return {};
}

std::vector<kir::Bool*> preds(root.size(), new kir::Bool(true));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just wanted to pre-initialize before the loop below.

const std::vector<kir::ForLoop*>& loops,
kir::Bool* thread_pred) {
if (loops.empty())
return new kir::Bool(true);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{ .. }


void openLoop(kir::ForLoop*);

std::unordered_map<IterDomain*, kir::Bool*> predicates;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

private: to separate private methods from private state

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They're ordered, that may be the best you get from me.

@@ -64,6 +64,16 @@ static const char* val_type2string(ValType t) {
return "Scalar";
case ValType::NamedScalar:
return "NamedScalar";
case ValType::KirIterDomain:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this really needed?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, for ir_iostream, an IR that isn't printable isn't debug-able.

Comment on lines 217 to 219
// Simply grabs all exprs needed to produce provided outputs, only in dependency
// order, not computeAtOrder like ExprSort in fusion.h
class UnsortedExprs : public IterVisitor {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ExprSort doesn't do anything for computeAt anymore as the logic is moved to lower_loops.cpp. Perhaps, we don't need this class and can just use ExprSort.

@@ -98,7 +100,6 @@ class TORCH_CUDA_API UnrollPass : public OptOutDispatch {
static std::vector<Expr*> runPass(
Fusion* fusion,
const std::vector<Expr*>& exprs,
const std::unordered_set<Expr*>& init_exprs,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to remove a recent fix on predicates for initializing local buffers. See #246 and #255.

Copy link
Owner Author

@csarofeen csarofeen Aug 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can re-enable that fix without carrying init_exprs. We can detect initialization expressions by the consumer pattern and the current loop nest.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you open a new issue and I can do a follow up PR? I believe we still have correctness now.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. The previous failing test still works fine, but I'm not sure whether it's intended behavior. I re-opened the issue (#64).

Copy link
Collaborator

@naoyam naoyam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @csarofeen mentioned, predicate generation for initializing reduction buffers seems broken again. The original fix consists of #246 and #255.

@naoyam
Copy link
Collaborator

naoyam commented Aug 12, 2020

I did a quick review of all changes and left two comments. One is regarding the predicate generation for reduction buffers. Another is the newly added class, UnsortedExprs, which I think is just equivalent to ExprSort. I'll do a more thorough review later.

Copy link
Collaborator

@naoyam naoyam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me for now. #64 may be an issue again.

@csarofeen csarofeen merged commit 2dec1a5 into 20_7_6_devel Aug 12, 2020
@csarofeen csarofeen deleted the contiguity_v3 branch June 9, 2021 13:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants