Support size_per_head=112 #660

dskhudia · 2023-06-09T04:59:31Z

Support for size_per_head=112 is missing for GPT style models and this PR adds it.

fix multi-gpu build

dwyatte · 2023-06-23T21:13:19Z

Looks like this is needed for https://www.mosaicml.com/blog/mpt-30b, would be great to get it supported in FasterTransformer

dskhudia · 2023-06-26T18:32:36Z

@dwyatte : Correct. That's why I added this PR.

dwyatte · 2023-06-26T18:49:47Z

src/fastertransformer/utils/CMakeLists.txt

@@ -57,7 +57,7 @@ add_library(mpi_utils STATIC mpi_utils.cc)
 set_property(TARGET mpi_utils PROPERTY POSITION_INDEPENDENT_CODE  ON)
 set_property(TARGET mpi_utils PROPERTY CUDA_RESOLVE_DEVICE_SYMBOLS  ON)
 if (BUILD_MULTI_GPU)
-    target_link_libraries(mpi_utils PUBLIC -lmpi logger)
+    target_link_libraries(mpi_utils PUBLIC -lmpi -lmpi_cxx logger)


(I'm not reviewing, but am interested in this functionality)

@dskhudia can you say more about this change and whether it's needed? I'm building this via https://github.com/triton-inference-server/fastertransformer_backend and mpi_cxx is not there for linking (so may require some additional changes across the FasterTransformer ecosystem). If this is not actually needed, omitting it may be easier to get this approved/merged (edit: I am able to run this via my fork(s) without mpi_cxx)

This was needed when I was building FT as described in the GPT doc. I created a separate PR for this as well. Willing to remove it from this PR. It accidentally got added in this.

#616

tuttlebr · 2023-07-03T16:26:40Z

I realize FT/FTB are not well supported anymore but this appears to have broken some dependencies when building in Docker. Anyone else experiencing that?

dwyatte · 2023-07-05T16:59:18Z

@tuttlebr I just opened #705 which I'm hoping should fix the Docker build dependencies. Any idea if that's the source of your issue?

/usr/bin/ld: cannot find -lmpi_cxx
--
  | 2023-07-05 10:43:21 MDT | collect2: error: ld returned 1 exit status
  | 2023-07-05 10:43:21 MDT | make[2]: *** [_deps/repo-ft-build/src/fastertransformer/triton_backend/CMakeFiles/TransformerTritonBackend.dir/build.make:101: lib/libTransformerTritonBackend.so] Error 1
  | 2023-07-05 10:43:21 MDT | make[1]: *** [CMakeFiles/Makefile2:7382: _deps/repo-ft-build/src/fastertransformer/triton_backend/CMakeFiles/TransformerTritonBackend.dir/all] Error 2
  | 2023-07-05 10:43:21 MDT | make[1]: *** Waiting for unfinished jobs....

* Update beam_search_topk_kernels.cu fix: fix bug of beam search * fix: change int of some kernels to int64_t to prevent overflow * fix: gpt tensor shapes inconsistency (NVIDIA#505) Signed-off-by: AkiyamaYummy <842720660@qq.com> * Update gpt_guide.md (NVIDIA#529) * fix: fix bug of gpt buffer and gpt gemm overflow * Update T5DecodingWeight.cc fix: fix loading bug of t5 * [Enhancement]add pytorch backend support for gptneox (NVIDIA#550) * add pytorch backend support for gptneox Signed-off-by: AkiyamaYummy <842720660@qq.com> * fix early stopping invalid * 1) Some unused parameters and logic have been removed. 2) Revisions that would affect pipeline parallelism have been reverted. 3) The code has been made capable of direct validation on TabbyML/NeoX-1.3B. Signed-off-by: AkiyamaYummy <842720660@qq.com> * Change the names of classes, removing 'parallel' from their names Signed-off-by: AkiyamaYummy <842720660@qq.com> * Format the code. Signed-off-by: AkiyamaYummy <842720660@qq.com> * Only print results when rank is 0. Signed-off-by: AkiyamaYummy <842720660@qq.com> * Add dist.init_process_group(). Signed-off-by: AkiyamaYummy <842720660@qq.com> * update docs Signed-off-by: AkiyamaYummy <842720660@qq.com> --------- Signed-off-by: AkiyamaYummy <842720660@qq.com> * Update cublasMMWrapper.cc Fix the CUBLAS_VERSION checking of cublasMMWrapper * Update cublasMMWrapper.cc * fix overflow in softmax_kernel when process long seqlen and big batch_size (NVIDIA#524) * Update unfused_attention_kernels.cu fix bug of softmax kernel * [Enhancement]create huggingface_gptneox_convert.py (NVIDIA#569) * create huggingface_gptneox_convert.py Signed-off-by: AkiyamaYummy <842720660@qq.com> * adjust HF's multi bin files Signed-off-by: AkiyamaYummy <842720660@qq.com> * update gptneox_guide.md Signed-off-by: AkiyamaYummy <842720660@qq.com> --------- Signed-off-by: AkiyamaYummy <842720660@qq.com> * perf(bloom): improve performance of huggingface_bloom_convert.py, decrease the time cost and the mem using (NVIDIA#568) Co-authored-by: r.yang <r.yang@tianrang-inc.com> * Fix/gpt early stop (NVIDIA#584) * fix: fix bug of early stopping of gpt * [bugfix] Fix 2-shot All Reduce correctness issue (indexing bug). (NVIDIA#672) FasterTransformer 2-shot all reduce is implemented as a reduce-scatter + all-gather. There is an indexing bug in the all-gather step. Prior to this change, 2-shot all reduce was only producing correct results on device 0. Now, all devices have the correct results. * fix: swap tensor bug (NVIDIA#683) * Support size_per_head=112 (NVIDIA#660) * fix multi-gpu build * add support for size_per_head=112 for gpt decoder * remove mpi_cxx from multi-gpu build for now (NVIDIA#705) --------- Signed-off-by: AkiyamaYummy <842720660@qq.com> Co-authored-by: byshiue <bhsueh@nvidia.com> Co-authored-by: _yummy_ <842720660@qq.com> Co-authored-by: Ying Sheng <sqy1415@gmail.com> Co-authored-by: zhangxin81 <115389973+zhangxin81@users.noreply.github.com> Co-authored-by: 杨睿 <595403043@qq.com> Co-authored-by: r.yang <r.yang@tianrang-inc.com> Co-authored-by: Rahul Kindi <rkindi@users.noreply.github.com> Co-authored-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> Co-authored-by: Daya Khudia <37562707+dskhudia@users.noreply.github.com> Co-authored-by: Dean Wyatte <2512762+dwyatte@users.noreply.github.com>

* fix multi-gpu build * add support for size_per_head=112 for gpt decoder

* Merge with main (#1) * Update beam_search_topk_kernels.cu fix: fix bug of beam search * fix: change int of some kernels to int64_t to prevent overflow * fix: gpt tensor shapes inconsistency (NVIDIA#505) Signed-off-by: AkiyamaYummy <842720660@qq.com> * Update gpt_guide.md (NVIDIA#529) * fix: fix bug of gpt buffer and gpt gemm overflow * Update T5DecodingWeight.cc fix: fix loading bug of t5 * [Enhancement]add pytorch backend support for gptneox (NVIDIA#550) * add pytorch backend support for gptneox Signed-off-by: AkiyamaYummy <842720660@qq.com> * fix early stopping invalid * 1) Some unused parameters and logic have been removed. 2) Revisions that would affect pipeline parallelism have been reverted. 3) The code has been made capable of direct validation on TabbyML/NeoX-1.3B. Signed-off-by: AkiyamaYummy <842720660@qq.com> * Change the names of classes, removing 'parallel' from their names Signed-off-by: AkiyamaYummy <842720660@qq.com> * Format the code. Signed-off-by: AkiyamaYummy <842720660@qq.com> * Only print results when rank is 0. Signed-off-by: AkiyamaYummy <842720660@qq.com> * Add dist.init_process_group(). Signed-off-by: AkiyamaYummy <842720660@qq.com> * update docs Signed-off-by: AkiyamaYummy <842720660@qq.com> --------- Signed-off-by: AkiyamaYummy <842720660@qq.com> * Update cublasMMWrapper.cc Fix the CUBLAS_VERSION checking of cublasMMWrapper * Update cublasMMWrapper.cc * fix overflow in softmax_kernel when process long seqlen and big batch_size (NVIDIA#524) * Update unfused_attention_kernels.cu fix bug of softmax kernel * [Enhancement]create huggingface_gptneox_convert.py (NVIDIA#569) * create huggingface_gptneox_convert.py Signed-off-by: AkiyamaYummy <842720660@qq.com> * adjust HF's multi bin files Signed-off-by: AkiyamaYummy <842720660@qq.com> * update gptneox_guide.md Signed-off-by: AkiyamaYummy <842720660@qq.com> --------- Signed-off-by: AkiyamaYummy <842720660@qq.com> * perf(bloom): improve performance of huggingface_bloom_convert.py, decrease the time cost and the mem using (NVIDIA#568) Co-authored-by: r.yang <r.yang@tianrang-inc.com> * Fix/gpt early stop (NVIDIA#584) * fix: fix bug of early stopping of gpt * [bugfix] Fix 2-shot All Reduce correctness issue (indexing bug). (NVIDIA#672) FasterTransformer 2-shot all reduce is implemented as a reduce-scatter + all-gather. There is an indexing bug in the all-gather step. Prior to this change, 2-shot all reduce was only producing correct results on device 0. Now, all devices have the correct results. * fix: swap tensor bug (NVIDIA#683) * Support size_per_head=112 (NVIDIA#660) * fix multi-gpu build * add support for size_per_head=112 for gpt decoder * remove mpi_cxx from multi-gpu build for now (NVIDIA#705) --------- Signed-off-by: AkiyamaYummy <842720660@qq.com> Co-authored-by: byshiue <bhsueh@nvidia.com> Co-authored-by: _yummy_ <842720660@qq.com> Co-authored-by: Ying Sheng <sqy1415@gmail.com> Co-authored-by: zhangxin81 <115389973+zhangxin81@users.noreply.github.com> Co-authored-by: 杨睿 <595403043@qq.com> Co-authored-by: r.yang <r.yang@tianrang-inc.com> Co-authored-by: Rahul Kindi <rkindi@users.noreply.github.com> Co-authored-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> Co-authored-by: Daya Khudia <37562707+dskhudia@users.noreply.github.com> Co-authored-by: Dean Wyatte <2512762+dwyatte@users.noreply.github.com> * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit --------- Signed-off-by: AkiyamaYummy <842720660@qq.com> Co-authored-by: Asim Shankar <asim.shankar@snowflake.com> Co-authored-by: byshiue <bhsueh@nvidia.com> Co-authored-by: _yummy_ <842720660@qq.com> Co-authored-by: Ying Sheng <sqy1415@gmail.com> Co-authored-by: zhangxin81 <115389973+zhangxin81@users.noreply.github.com> Co-authored-by: 杨睿 <595403043@qq.com> Co-authored-by: r.yang <r.yang@tianrang-inc.com> Co-authored-by: Rahul Kindi <rkindi@users.noreply.github.com> Co-authored-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> Co-authored-by: Daya Khudia <37562707+dskhudia@users.noreply.github.com> Co-authored-by: Dean Wyatte <2512762+dwyatte@users.noreply.github.com>

dskhudia and others added 3 commits May 17, 2023 23:48

fix multi-gpu build

eef4cf6

Merge pull request #1 from dskhudia/multi_gpu_build_fix

29a376a

fix multi-gpu build

add support for size_per_head=112 for gpt decoder

7a71b18

dwyatte reviewed Jun 26, 2023

View reviewed changes

byshiue merged commit 7777ff1 into NVIDIA:main Jun 29, 2023

dwyatte mentioned this pull request Jul 5, 2023

remove mpi_cxx from multi-gpu build #705

Merged

azahed98 pushed a commit to togethercomputer/FasterTransformer that referenced this pull request Jul 20, 2023

Support size_per_head=112 (NVIDIA#660)

81b1284

* fix multi-gpu build * add support for size_per_head=112 for gpt decoder

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support size_per_head=112 #660

Support size_per_head=112 #660

dskhudia commented Jun 9, 2023

dwyatte commented Jun 23, 2023

dskhudia commented Jun 26, 2023

dwyatte Jun 26, 2023 •

edited

Loading

dskhudia Jun 26, 2023

tuttlebr commented Jul 3, 2023

dwyatte commented Jul 5, 2023 •

edited

Loading

Support size_per_head=112 #660

Support size_per_head=112 #660

Conversation

dskhudia commented Jun 9, 2023

dwyatte commented Jun 23, 2023

dskhudia commented Jun 26, 2023

dwyatte Jun 26, 2023 • edited Loading

Choose a reason for hiding this comment

dskhudia Jun 26, 2023

Choose a reason for hiding this comment

tuttlebr commented Jul 3, 2023

dwyatte commented Jul 5, 2023 • edited Loading

dwyatte Jun 26, 2023 •

edited

Loading

dwyatte commented Jul 5, 2023 •

edited

Loading