Conversation

@naoyam naoyam commented Mar 16, 2021

Fixes #757

@naoyam naoyam requested a review from csarofeen March 16, 2021 23:43
@naoyam naoyam mentioned this pull request Mar 17, 2021
@naoyam naoyam force-pushed the fix-tv-parallelization branch from a415093 to 0ddf3e5 on March 17, 2021 15:47

naoyam commented Mar 17, 2021

It turned out this is actually not the right thing to do. The parallel map contains mappings created with forward-bcast-mismatch enabled, so, for example, it would create a mapping between I2*I1 of t4 and I1 of t2:

t0 = makeSymbolicTensor(1); // t0: [I1]
t1 = makeSymbolicTensor(2); // t1: [I2, I1]
t2 = t0 + 1; // t2: [I1]
t3 = broadcast(t2, {true, false}); // t3: [B1, I1]
t4 = t1 + t3; // t4: [I2, I1]
t4->merge(0, 1); // t4: [I2*I1]
t2->computeAt(t4, -1);
t4->axis(0)->parallelize(TIDx);

This PR would then do t2->axis(0)->parallelize(TIDx), since that axis is mapped to the t4 axis. However, this is a problem: it would mean both I2*I1 == blockDim.x and I1 == blockDim.x.

@naoyam naoyam force-pushed the fix-tv-parallelization branch from 0ddf3e5 to 9f9f902 on March 17, 2021 23:25

naoyam commented Mar 17, 2021

Ended up adding a ParallelTypeBitmap to kir::BroadcastOp. See 9f9f902. This is the only way I can think of to properly determine the parallelism of a kir::BroadcastOp.

More broadly, I think a bigger problem is that we don't have a straightforward way to know the parallelism of kir::TensorView and kir::IterDomain. The only source of truth is the ComputeAt parallel map, which exists only at lowering time. Referring to kir::IterDomain::parallelType() is not robust, as it may not match the real parallel type. We use it for codegen of kir::ReductionOp, where it is fortunately safe since there should be no reduction in the CA axes.
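
For illustration, here is a minimal, self-contained sketch of the idea. ParallelTypeBitmap and BroadcastOp are the real class names in the codebase, but the interface below (the enum values, the std::bitset-backed bitmap, and the parallelTypes() accessor) is an assumption made for this sketch, not the actual kir definitions:

#include <bitset>
#include <cstddef>

// Stand-ins for nvfuser's ParallelType / ParallelTypeBitmap (assumed shapes).
enum class ParallelType : size_t { TIDx, TIDy, TIDz, BIDx, BIDy, BIDz, COUNT };

class ParallelTypeBitmap {
 public:
  void set(ParallelType pt, bool value = true) {
    bits_.set(static_cast<size_t>(pt), value);
  }
  bool get(ParallelType pt) const {
    return bits_.test(static_cast<size_t>(pt));
  }

 private:
  std::bitset<static_cast<size_t>(ParallelType::COUNT)> bits_;
};

// Sketch of a broadcast node that records which parallel types it broadcasts
// across, so codegen does not have to reconstruct this from the ComputeAt
// parallel map, which is gone by codegen time.
class BroadcastOp {
 public:
  explicit BroadcastOp(ParallelTypeBitmap parallel_types)
      : parallel_types_(parallel_types) {}

  const ParallelTypeBitmap& parallelTypes() const { return parallel_types_; }

 private:
  ParallelTypeBitmap parallel_types_;
};

In this scheme the bitmap would be populated during lowering, while the ComputeAt parallel map still exists, and codegen would then read parallelTypes() to decide which block/grid broadcasts to emit.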

@naoyam naoyam requested a review from csarofeen March 17, 2021 23:43
@csarofeen

I2*I1 == blockDim.x and I1 == blockDim.x actually seems fine to me in this case.

blockDim.x should be set to the maximum size. I think the issue is that we substitute values in the IR for blockDim.x where we shouldn't. The original issue stems from the fact that a tensor size gets substituted for this value, and thread bindings can be ambiguous in the presence of broadcast. I'm not certain we can resolve all cases.

@csarofeen

I thought that was how we started down this path: #622

@csarofeen csarofeen left a comment

Clearing approval, as it looks like we should discuss/think about this more. Broadcast parallelism is definitely a complex topic.


naoyam commented Mar 18, 2021

The substitution is done here:

https://github.com/csarofeen/pytorch/blob/20_12_3_devel/torch/csrc/jit/codegen/cuda/kernel_ir.cpp#L100-L109

If we allow I2*I1 == blockDim.x and I1 == blockDim.x, the substitution is no longer valid.
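
To make the concern concrete, here is a hypothetical, simplified sketch of the kind of substitution being discussed; the names and members below are stand-ins, and the kernel_ir.cpp lines linked above are the authoritative version:

#include <string>

// Hypothetical stand-in for a lowered iteration domain.
enum class ParallelType { Serial, TIDx };

struct Extent {
  std::string symbol;  // e.g. "I1", "I2*I1", or "blockDim.x"
};

struct IterDomain {
  Extent raw_extent;
  ParallelType parallel_type = ParallelType::Serial;

  // The extent as written in the fusion, e.g. I1 or I2*I1.
  const Extent& rawExtent() const { return raw_extent; }

  // The substituted extent: a thread-parallelized domain reports blockDim.x
  // instead of its raw extent. This is only sound if every TIDx-bound extent
  // really equals blockDim.x, which no longer holds once both I2*I1 and I1
  // are bound to TIDx as in the example above.
  Extent extent() const {
    if (parallel_type == ParallelType::TIDx) {
      return Extent{"blockDim.x"};
    }
    return raw_extent;
  }
};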

I see two options:

  1. Parallelize every kir::IterDomain whose parallel type can be inferred from the computeAt map, and stop the substitution. (ab83e3e)
  2. Annotate kir::BroadcastOp (9f9f902)

I'd say the second option is more conservative: it only solves this particular problem with BroadcastOp. The first option may be preferable as it conveys the parallelism information to KIR, but it may have side effects such as the substitution problem.

The first option seems better to me from a long-term perspective.

@csarofeen

Can you try option 1 and run the reduction benchmark suite before and after to see if you find any serious perf regressions? We use a lot of tensor size information in our kernels anyway; I think removing the substitution shouldn't be significant, though I could definitely be wrong.

naoyam added 7 commits March 19, 2021 09:45
…onships"

This reverts commit ab83e3e6367ab186498b2d0ab81ca09dcb52f434.
There is no easy way to know which parallel types are used for kir::TensorView after lowering, as the ComputeAt parallel map is not maintained. Adds that information to kir::BroadcastOp as it is needed for codegen.
@naoyam naoyam force-pushed the fix-tv-parallelization branch from 30d5976 to c0fa0c9 on March 19, 2021 17:06

naoyam commented Mar 19, 2021

@csarofeen Done. I changed kir::IterDomain::extent() to simply return extent_, which is the same as rawExtent(). All tests are working fine.
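
A simplified sketch of what that change amounts to, assuming hypothetical member names (the commit has the real code): extent() no longer performs the parallel-dimension substitution and simply returns the stored extent, matching rawExtent().

// Stand-in for kir::Val; the member names below are assumptions.
struct Val {};

class IterDomain {
 public:
  explicit IterDomain(const Val* extent) : extent_(extent) {}

  // Previously extent() could return the launch dimension (e.g. blockDim.x)
  // for a thread-parallelized domain; now it just returns the stored extent.
  const Val* extent() const { return extent_; }
  const Val* rawExtent() const { return extent_; }

 private:
  const Val* extent_;
};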

@naoyam naoyam requested a review from csarofeen March 19, 2021 17:08
@csarofeen csarofeen merged commit 4df7a6a into 20_12_3_devel Mar 19, 2021
@csarofeen csarofeen deleted the fix-tv-parallelization branch June 9, 2021 13:51
Successfully merging this pull request may close these issues: Missing parallel broadcast.