
WIP: IterDomain Graphs #32

Closed
wants to merge 239 commits into from

Conversation


@csarofeen csarofeen commented Mar 19, 2023

Builds out iter domain graphs as infrastructure. I kept these concepts separate from the compute_at_map, as that may need to be reimplemented later based on this new concept/infrastructure.

This new concept of IterDomainGraphs will eventually replace all our indexing and parallelization logic. This infrastructure is meant to make it easier to work with iter domain graphs for processes like accurate broadcast resolution/promotion. IdGraph or IterDomainGraphs could also directly replace BestEffortReplay and similar mappings across producer-consumer pairs.

I added a couple of interesting tests that I want to make work but that still don't.

@csarofeen (Collaborator Author)

Helps if I include the right files.

@jacobhinkle (Collaborator) left a comment:

Started reviewing. I haven't gone through all of buildLoopPromotionMap yet.

csrc/disjoint_set.h (outdated, resolved)
csrc/id_graphs.h (outdated)
Comment on lines 35 to 37
// (1) The disjoint set of the provided Iter Domain if it exists,
// otherwise a null shared ptr
// (2) If the disjoint set of the provided Iter Domain exists
Collaborator:

How does this differ from returning a std::optional<IdGroup>?

Collaborator Author:

Returning an optional probably makes a lot of sense, but I think in most instances I just want to assert there's actually an id set. So probably I'll just put in an alias.

Collaborator Author:

I just stole the pattern from https://cplusplus.com/reference/unordered_set/unordered_set/emplace/ for development; no reason not to switch to optional.

Collaborator:

Ah OK, makes sense. I think std::optional was introduced in C++17, but std::unordered_set has been around since C++11, so they had to use the explicit pair<..., bool>. From side discussion it looks like we might be OK with using C++17 going forward on NVFuser. Either way, not a big deal of course.

csrc/id_graphs.h (outdated)
//! all IterDomains in the disjoint set to that PType.
void validateAndPropagatePType() const;

void buildLoopPromotionMap(const std::vector<Expr*>& exprs);
Collaborator:

Basic question, but I see "promotion" mentioned many times here. In this context what does it mean to promote an ID?

@csarofeen (Collaborator Author) commented Mar 21, 2023:

Promotion is the concept that an iteration domain a TensorView has in its root->domain might not really be what's required for the generated kernel to index into that TensorView. For example:

  • The producer has a broadcast merged with an iteration domain
  • The consumer has a (mapped to the producer) iteration domain merged with an iteration domain
  • Based on other transformations, the producer might have to "promote" its broadcast domain to the iteration domain in its consumer
  • If there's a producer of that producer, then we still might need the "broadcast promotion", but there isn't a broadcast that maps in that producer's producer

Collaborator Author:

The "end goal" of promotion in this pass (still very WIP) is that each leaf iter domain of a tensor view might be "promoted" to a larger iteration domain representative of the for loops. That larger iter domain still needs connections we can traverse to index into the tensor view's buffer.

Collaborator:

Revisiting this after some thought and going through some of the machinery. Just to try and reiterate the example above:

t0[(b0*i1), i2] // producer
t1[(i3 * i4)] = f(t0)  // consumer, through expression f

i3 is defined by an expression in the variables i1, i2.

We may need to alter b0 to match i3 if i4 matches i1.

This might cascade, since (b0*i1) might match another merged broadcast in its producer.

End goal of promotion: map each leaf IterDomain in each TensorView (importantly including broadcast domains and transforms thereof) to an Iteration IterDomain that is written as a transform of that TV's root domain, so that we can index into it.

@naoyam (Collaborator) left a comment:

Took a quick look, but that's not enough to give any meaningful feedback. I'd need to spend a significant amount of time (i.e., a few days). Should I wait? Is it ready?

csrc/id_graphs.h (outdated, resolved)
@csarofeen (Collaborator Author)

@naoyam I think it's worth trying to read through. I wouldn't worry about really nailing down interfaces, but rather look at the building and relationships of an iter domain graph, and how we operate on a collection of iter domain graphs. The infrastructure is here, but buildLoopPromotionMap is the real challenge. That's the function where I'm trying to resolve all the broadcasts in a way that lets us write the index pass directly on the iter domain graphs, or really (hopefully) on a single iter domain graph.

@csarofeen (Collaborator Author)

i.e. buildLoopPromotionMap is something you could initially ignore.

@naoyam naoyam closed this Mar 22, 2023
@naoyam naoyam reopened this Mar 22, 2023

naoyam commented Mar 22, 2023

(I'm sorry I accidentally clicked the close button)


naoyam commented Mar 22, 2023

i.e. buildLoopPromotionMap is something you could initially ignore.

Good to know. That's where I intended to focus, as I think it's one of the main tasks in this PR. I'll wait for further updates to the function and look through the rest.

@csarofeen (Collaborator Author)

@naoyam you could start looking at buildLoopPromotionMap with all the debug spew I have. Interesting cases in my opinion are:
FusionInlineBroadcastIndexing0_CUDA
FusionIndexing19_CUDA
FusionIndexSplitMerge_CUDA

I still need to make a backward replay before we try to perform indexing. I don't want to index all the tensor views one by one; instead I intend to build a graph that we can naively traverse all at once to get consumer indices.

@naoyam naoyam closed this Mar 28, 2023
@naoyam naoyam reopened this Mar 28, 2023

naoyam commented Mar 28, 2023

(Sorry again accidentally clicked the close button)

FusionInlineBroadcastIndexing0_CUDA
FusionIndexing19_CUDA
FusionIndexSplitMerge_CUDA

All these tests resulted in a failure. Is it expected?

naoyam commented Mar 28, 2023

It seems expected, as there's TORCH_INTERNAL_ASSERT(false) at the end of buildLoopPromotionMap.

@naoyam (Collaborator) left a comment:

Comments on IdGraph

csrc/id_graphs.cpp (outdated, resolved; 3 review threads)
@csarofeen (Collaborator Author)

(Sorry again accidentally clicked the close button)

FusionInlineBroadcastIndexing0_CUDA
FusionIndexing19_CUDA
FusionIndexSplitMerge_CUDA

All these tests resulted in a failure. Is it expected?

Yes, indexing is not hooked up, so they're not yet supported. I'm throwing an error where I leave off in the analysis.

csrc/id_graphs.cpp (outdated, resolved)
@naoyam (Collaborator) left a comment:

Comments so far on buildLoopPromotionMap

csrc/id_graphs.cpp (outdated, resolved; 2 review threads)
Comment on lines 2246 to 2256
VectorOfUniqueEntries<IterDomain*> all_producer_ca_deps;
{
auto ca_dep_vals = DependencyCheck::getAllValsBetween(
{producer_root.begin(), producer_root.end()},
{producer_domain.begin(),
producer_domain.begin() + producer->getComputeAtPosition()});
auto ca_deps_filter = ir_utils::filterByType<IterDomain>(ca_dep_vals);

all_producer_ca_deps.insert(
ca_deps_filter.begin(), ca_deps_filter.end());
}
Collaborator:

nit: This pattern appears quite often. We should create a utility function.

Collaborator Author:

Probably. Something that goes directly from...

DependencyCheck::getAllValsBetween(
            {producer_root.begin(), producer_root.end()},
            {producer_domain.begin(),
             producer_domain.begin() + producer->getComputeAtPosition()})

To:

      VectorOfUniqueEntries<IterDomain*> all_producer_ca_deps;

?

}
}

const IdGraph& IterDomainGraphs::idGraph(IdMappingMode mode) const {
Collaborator:

Can this assert if the returned graph is already constructed? For example, the LOOP map is not available before lowering, so it would be nice if we could assert it's accidentally queried.

Collaborator Author:

Sure. Do you think the check in replay as is is enough to do this effectively, or do you think we should have some flag in IterDomainGraphs marking when each mode is initialized?

Collaborator:

The latter seems to make more sense to me.

csrc/id_graphs.cpp (outdated, resolved)

buildPermissiveMap(tv_exprs);

// Only build loop map during lowering
Collaborator:

We may also sometimes want to build only an exact and/or permissive map. The scheduler is one example where we sometimes use only an exact map. Maybe a permissive map is also used anyway?

Collaborator Author:

I hope it's not too expensive, but I'd be happy to have finer granularity in the building of the graphs. Being able to specify which are needed, or generating the maps lazily, could also be cool. Leaving to future work for now.

csrc/id_graphs.cpp (outdated, resolved; 2 review threads)
idGraph(IdMappingMode::LOOP).disjointIdSets().disjointSets()) {
if (group->size() == 1) {
p2c_ca_terminal_loop_ids.pushBack(group->front());
id_consumer_terminal_loop_ids.pushBack(group->front());
Collaborator:

nit: insert continue

Collaborator Author:

Yeah, I need to clean this up.

// T4 = T1[i0, b1] + T3[i0, i1]
// T6 = T2[i0, b1] + T5[i0, i2]
//
// The almost exact map will map T1's and T2's b1 together, but they're being
Collaborator:

Can you elaborate a little more why they are mapped? Are these domains merged?

Collaborator Author:

Yes, thank you. This assumes merge(i0, b1), merge(i0, i1), and merge(i0, i2).

@csarofeen (Collaborator Author) commented Apr 1, 2023:

Then merge(i0, b1) of T1 and merge(i0, b1) of T2 are almost exact mapped together from id graph propagation.

Collaborator:

Let me rewrite the expressions as below. I think this is more accurate.

T1[i0, b1] = T0[i0]
T2[i0, b2] = T0[i0] // Not T2[i0, b1]

T4 = T1[i0, b1] + T3[i0, i1]
T6 = T2[i0, b2] + T5[i0, i2]

Then merge(0, 1) with all tensors except for T0.

The almost-exact map would map i0 and i0*b1 and i0*b2. Does it also map b1 and b2?

@csarofeen (Collaborator Author) commented May 20, 2023:

I'm not sure if it maps b1 and b2 together. It should be benign if so.

csrc/id_graphs.cpp (outdated, resolved; 8 review threads)

naoyam commented May 21, 2024

!build


naoyam commented Jun 4, 2024

Closing this PR. Most of the code was already merged into main, and the remaining code is currently not used; it will be revisited if necessary.

The next work is indexing, which is tracked in #2238.

@naoyam naoyam closed this Jun 4, 2024
@naoyam naoyam mentioned this pull request Sep 26, 2024
naoyam added a commit that referenced this pull request Sep 27, 2024
Renamed test_gpu_indexing.cpp to test_indexing_advanced.cpp. Changed the
tests to exercise both the legacy and new indexers.

Added several tests originally developed for IdModel (#32). Some of them
are disabled as they are not yet supported.
4 participants