
Conversation

shmsong commented Sep 13, 2021

All the changes are for performance only and are not supposed to change any of the compile/runtime semantics.

Some effort was spent on making sure each commit is a self-contained optimization, so commit-by-commit review might be easier.

Update:
I have removed the commits that only made a significant difference in a profiled environment (nsys) but showed no improvement in the Google benchmark (within 5 us altogether). They could be added back if needed.

Benchmark result:

@c9759b44e8aec7cd45990b22130589dbffcf26e8

-----------------------------------------------------------------------------------------------------------------
Benchmark                                                                       Time             CPU   Iterations
-----------------------------------------------------------------------------------------------------------------
ShapeInferenceBenchmark_FusedGraph0_1Segment                                 20.0 us         20.0 us        32964
ShapeInferenceBenchmark_FusedGraph0_1Segment_NoShapeInferenceBaseline        9.77 us         9.77 us        74790
ShapeInferenceBenchmark_FusedGraph1_2Segments                                42.9 us         42.9 us        16149
ShapeInferenceBenchmark_FusedGraph1_2Segments_NoShapeInferenceBaseline       24.0 us         24.0 us        28487

Comparison reference is available at #1098.

naoyam (Collaborator) commented Sep 13, 2021

I left comments. The only major comment is whether we could use ParallelDimensionMap at execution time as well.

I'd be surprised if some of these changes really had a performance impact. For example, 27dff5e#r707737832. Eliminating the call to count() should be more efficient, but if the container is empty, I'd expect count() to return almost immediately, so the difference should be pretty small. Then again, if we've reached a point where these small differences actually matter, that's itself a significant improvement.
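
For reference, a hypothetical sketch of the kind of change in question (made-up names, not the actual code from 27dff5e): skip the lookup entirely when the container is known to be empty.

// Hypothetical sketch only; not the code touched by this PR.
#include <string>
#include <unordered_map>

bool hasExtent(
    const std::unordered_map<std::string, int>& extent_map,
    const std::string& name) {
  // Early-out before hashing the key. count() on an empty map should also
  // return quickly, so any measurable win is expected to be small.
  if (extent_map.empty()) {
    return false;
  }
  return extent_map.count(name) > 0;
}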

naoyam (Collaborator) left a comment

Thanks for updating the PR. LGTM.

shmsong requested a review from csarofeen on September 13, 2021 at 23:59
//! This entry type is a simplified version of ParallelIterExtentMap.
//!
//! For launch parameter binding we only need the most concrete iterdomain
//! in each disjoint set stored in CaParallelMap. This entry stores the

Owner:
@naoyam are we certain this doesn't have to be the index map?

Collaborator:

I actually think the parallel map is fine here since concrete domains are selected.

Collaborator:

This actually makes me think again about ParallelDimensionMap as well. For example:

  auto tv0 = makeSymbolicTensor(1);
  fusion.addInput(tv0);
  auto tv1 = makeSymbolicTensor(2);
  fusion.addInput(tv1);
  auto tv2 = broadcast(tv0, {true, false});
  auto tv3 = add(tv2, tv1);
  fusion.addOutput(tv3);

  tv3->merge(0, 1);
  tv0->computeAt(tv3, -1);
  tv1->computeAt(tv3, -1);

  tv3->axis(0)->parallelize(ParallelType::BIDx);

Inputs:
  T0_g[ iS0{i1} ], float
  T1_g[ iS9{( i4 * i7 )} ], float
Outputs:
  T3_g[ iblockIdx.x7{( i4 * i1 )} ] produce_pos( 1), float

%kernel_math {
T2_l[ iS8{( 1 * i1 )} ] ca_pos( 1 ) = broadcast( T0_g[ iS0{i1} ] )
T3_g[ iblockIdx.x7{( i4 * i1 )} ] produce_pos( 1)
   = T2_l[ iS8{( 1 * i1 )} ] ca_pos( 1 )
   + T1_g[ iS9{( i4 * i7 )} ];
}

With this fusion, the parallel dimension map looks like:

blockIdx.x: gridDim.x, non-exact
blockIdx.y: unused
blockIdx.z: unused
threadIdx.x: unused
threadIdx.y: unused
threadIdx.z: unused

Note that BIDx is marked as non-exact, even though it is actually exact. This is because ParallelDimensionMap uses the index map, so the merged axis of T3 is not mapped with the axis of T0, yet both of them are mapped with each other in the parallel map and are parallelized by BIDx.
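
For illustration only, a simplified, hypothetical sketch of what "exact" means here (not the actual ParallelDimensionMap implementation): a parallel type is exact only when every IterDomain parallelized with it has the same extent.

// Simplified, hypothetical sketch of the exactness notion; not the actual
// ParallelDimensionMap code.
#include <string>
#include <vector>

struct ParallelDimInfo {
  // Symbolic extents of all IterDomains parallelized with this parallel type,
  // e.g. {"i1", "i4 * i1"} for BIDx in the fusion above.
  std::vector<std::string> extents;

  // Exact means every mapped domain has the same extent, so the launch
  // dimension can be used directly without a guard.
  bool exact() const {
    for (const auto& e : extents) {
      if (e != extents.front()) {
        return false; // e.g. "i1" vs "i4 * i1" marks BIDx as non-exact
      }
    }
    return true;
  }
};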

Owner:

I thought this should be non-exact since iS8{( 1 * i1 )} != iblockIdx.x7{( i4 * i1 )}.

Owner:

I thought that was the behavior we're interested in.

naoyam (Collaborator) commented Sep 16, 2021:

I think you meant iS0{i1} != iblockIdx.x7{( i4 * i1 )}. iS8{( 1 * i1 )} and iblockIdx.x7{( i4 * i1 )} are mapped in the index map. The difference comes from the forwarding, so iS0{i1} and iblockIdx.x7{( i4 * i1 )} are mapped in the parallel map but not in the index map.

Since i1 != i4 * i1, they appear to make BIDx non-exact. However, with the computeAt, indexing T0 is always done with i4 * i1. In this particular case, while iS0{i1} is marked as parallelized with BIDx, its indexing is blockIdx.x % i1, so it never goes beyond i1. So, in that sense, BIDx is still exact; in other words, we don't need a predicate like blockIdx.x < i1.
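
As a rough sketch (hypothetical code, not actual generated output), the indexing being described looks like:

// Rough, hypothetical sketch of the indexing discussed above; not actual
// nvFuser codegen output.
__global__ void sketch(
    const float* T0, const float* T1, float* T3, long long i1) {
  // T0 is read through a modulo of the merged BIDx index, so even though
  // gridDim.x == i4 * i1, the access never goes beyond i1.
  float t2 = T0[blockIdx.x % i1];
  // Without the modulo, a non-exact BIDx would need a guard such as
  //   if (blockIdx.x < i1) { ... }
  T3[blockIdx.x] = t2 + T1[blockIdx.x];
}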

This seems related to the issue we discussed before about the indexing change with and without computeAt.

I feel this is getting too intricate. I'm spending too much time thinking about which map to use. It would be great if we could come up with a single map, but I'm not sure that's reasonably possible.

Owner:

iS8{( 1 * i1 )} != iblockIdx.x7{( i4 * i1 )} should now be true in the index map. It's true that they used to map, but I explicitly changed that because they shouldn't.

This makes sense though. I forgot we don't map to the right of the compute-at point in the parallel map. So really this comes down to which parallel type is being used to index into the tensor. In that instance, then yes, the parallel map is enough to get the unique entries of parallelization bound into the problem.

shmsong merged commit 9caea22 into 20_12_3_devel on Sep 20, 2021
shmsong deleted the dynamic_shape_misc branch on September 20, 2021 at 18:14