
Support parallel split K mode for profiling #277

Merged: 5 commits, Jan 27, 2022

Conversation

@Peter9606 (Contributor) commented Jun 11, 2021

I'm trying to add support for parallel split-k profiling, and this patch contains my changes.
Unfortunately, it only works for a small portion of problem sizes, namely those where m equals n. Also, to make it work at all, I had to hard-code the number of elements computed per operation in the epilogue to 1, which is obviously not correct. I hope someone can correct it.

A command line sample if anyone wants to try it:
./cutlass_profiler --split_k_slices=2 --m=242 --n=242 --k=300 --split_k_mode=parallel
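
For context, the sketch below shows roughly what --split_k_mode=parallel exercises at the device level, modeled on cutlass::gemm::device::GemmSplitKParallel (cf. CUTLASS example 06_splitK_gemm). The element types, layouts, leading dimensions, and the run_parallel_split_k wrapper are illustrative assumptions, not code from this PR:

    // Sketch: device-level parallel split-k GEMM (SIMT fp32, column-major).
    #include "cutlass/gemm/device/gemm_splitk_parallel.h"
    #include "cutlass/util/device_memory.h"

    using Gemm = cutlass::gemm::device::GemmSplitKParallel<
        float, cutlass::layout::ColumnMajor,   // A
        float, cutlass::layout::ColumnMajor,   // B
        float, cutlass::layout::ColumnMajor,   // C and D
        float>;                                // accumulator

    cutlass::Status run_parallel_split_k(
        float const *A, float const *B, float const *C, float *D,
        int m, int n, int k, int split_k_slices) {

      Gemm::Arguments args(
          cutlass::gemm::GemmCoord(m, n, k),
          {A, m}, {B, k}, {C, m}, {D, m},  // TensorRefs: pointer + leading dim
          {1.0f, 0.0f},                    // epilogue: alpha, beta
          split_k_slices);                 // K is partitioned into this many slices

      // Each slice writes partial accumulators into workspace; a separate
      // reduction kernel then sums the partials and applies the epilogue.
      cutlass::device_memory::allocation<uint8_t> workspace(
          Gemm::get_workspace_size(args));

      Gemm gemm_op;
      cutlass::Status status = gemm_op.initialize(args, workspace.get());
      if (status != cutlass::Status::kSuccess) {
        return status;
      }
      return gemm_op();  // launches the partitioned GEMM and the reduction
    }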

Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
@hwu36 (Collaborator) commented Jun 11, 2021

@manishucsd and @kerrmudgeon, would you please help @Peter9606 here? Maybe point him to the conv parallel split-k code?

@manishucsd (Contributor) commented Jun 14, 2021

Hi @Peter9606, I have reviewed your code. It looks like you have followed the changes made in convolutions to support parallel reduction.

A few points to consider:

  1. Check that everything for the reduction kernel is initialized properly.

  2. Also check the dispatch point from the profiler to the actual kernel. Use these printfs.

  3. Make sure the reduction kernel you are trying to use is instantiated and compiled into the CUTLASS library.

  • Reduction kernels are manually instantiated in reduction_device.cu.
  • It currently contains only the largest-alignment kernels. If you need smaller alignments, you will need to instantiate them in addition rather than overwrite the existing ones.
  • Also, to use a smaller alignment you will need to add something similar to GemmPreferenceKey, which handles alignment. Note that ReductionFunctionalKey gives the list of all kernels that match the functional requirement; there could be many alignments that functionally satisfy the problem size, and the largest possible alignment is picked from the functionally equivalent kernel set.
  • I recommend making GEMM parallel split-k work for the largest alignment first. Make it work for the F16 align8 kernels first.

  4. I also found something missing in the GEMM parallel reduction change. See this part ("// initialize conv2d underlying operation to handle parallel reduction").

  • For F16 output and F32 accumulation, GEMM + parallel reduction changes the GEMM kernel you need to call. Instead of calling a GEMM with F16 output, the GEMM now writes its output in F32, and the reduction kernel writes it as F16 (F32->F16); see the sketch after this list.
  • Try to make your changes work for F16 accumulation and F16 output at the largest possible alignment first, and then go from there.
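
For illustration, here is a sketch of the kind of instantiation reduction_device.cu holds, showing the F32->F16 flow from point 4. The exact tile shape and type aliases are assumptions following the 128-bit-alignment pattern, not code from this PR:

    // Sketch: split-k reduction that sums F32 partials and writes F16 output.
    #include "cutlass/numeric_types.h"
    #include "cutlass/matrix_shape.h"
    #include "cutlass/epilogue/thread/linear_combination.h"
    #include "cutlass/reduction/thread/reduction_operators.h"
    #include "cutlass/reduction/kernel/reduce_split_k.h"

    using ElementOutput      = cutlass::half_t;  // final D tensor (F16)
    using ElementAccumulator = float;            // per-slice partials in workspace (F32)

    // 128-bit stores of F16 give 8 elements per access. This kCount is the
    // "number of elements computed per operation" that the patch hard-coded to 1.
    using EpilogueOutputOp = cutlass::epilogue::thread::LinearCombination<
        ElementOutput,
        128 / cutlass::sizeof_bits<ElementOutput>::value,  // kCount = 8
        ElementAccumulator,
        ElementAccumulator>;

    // Element-wise sum across the split-k partial tiles.
    using ReductionOp = cutlass::reduction::thread::ReduceAdd<
        ElementAccumulator,
        typename EpilogueOutputOp::ElementAccumulator,
        EpilogueOutputOp::kCount>;

    // Reduces the partials, then runs the epilogue (alpha/beta scaling and the
    // F32->F16 conversion) on the reduced result.
    using ReductionKernel = cutlass::reduction::kernel::ReduceSplitK<
        cutlass::MatrixShape<4, 32 * EpilogueOutputOp::kCount>,
        EpilogueOutputOp,
        ReductionOp>;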

Thanks!

@Peter9606 (Contributor, Author)

> For F16 output and F32 accumulation, GEMM + parallel reduction changes the GEMM kernel you need to call. Instead of calling a GEMM with F16 output, the GEMM now writes its output in F32, and the reduction kernel writes it as F16 (F32->F16).

I finally get it, thank you very much!

  1. find gemm kernel by preference key
  2. switch m and n for reduction kernel

Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
@Peter9606 (Contributor, Author)

@manishucsd Parallel split-k reduction profiling now works for SIMT fp32, but only through the newly added reduction kernel with smaller alignment. It is still not clear why the larger-alignment reduction kernel cannot be selected.

@Peter9606 (Contributor, Author)

It seems this version runs successfully for fp16 with alignment 1/2/4.

@mnicely added this to the CUTLASS 2.9 milestone on Dec 1, 2021
@github-actions

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

@hwu36 (Collaborator) commented Jan 27, 2022

Now it can support:

  • fp32 gemm output, fp16 reduction output
  • fp32 gemm output, fp32 reduction output

It still requires 128-bit alignment in the reduction. In this PR, I removed the small-alignment reduction code, which requires some extra logic to find the correct reduction configuration. To support it, we need to put alignment into the FunctionalKey or PreferenceKey of the reduction operation and use problem_size.m to decide the correct reduction kernel to use. We welcome the community to extend this PR to support small-alignment reduction; a sketch of the idea follows.
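
For anyone picking this up, here is a hypothetical sketch of that extension. ReductionPreferenceKey and select_reduction are illustrative names, not CUTLASS's actual API:

    // Hypothetical sketch only: these names do not exist in CUTLASS today.
    // Idea: after ReductionFunctionalKey narrows candidates to functionally
    // matching kernels, a preference key carrying alignment picks the largest
    // vector width that evenly divides the contiguous extent (problem_size.m).
    #include <utility>
    #include <vector>

    struct ReductionPreferenceKey {
      int alignment;  // vector width, in elements, required by the kernel
    };

    template <typename Operation>
    Operation const *select_reduction(
        std::vector<std::pair<ReductionPreferenceKey, Operation const *>> const
            &functional_matches,
        int contiguous_extent) {  // e.g. problem_size.m for column-major output

      Operation const *best = nullptr;
      int best_alignment = 0;
      for (auto const &entry : functional_matches) {
        int align = entry.first.alignment;
        // Prefer the largest alignment that evenly divides the extent.
        if (align > best_alignment && contiguous_extent % align == 0) {
          best_alignment = align;
          best = entry.second;
        }
      }
      return best;
    }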

@hwu36 merged commit 1e4703c into NVIDIA:master on Jan 27, 2022
@Peter9606 deleted the parallel_profiling_support branch on Jan 28, 2022