Introduces new operator to get the lemma logits for factored vocabulary models for GPU inference #776
base: master
Conversation
…erence when processing the lemmas
…since the allocator manages its own memory pool, the memory won't actually be released by a cudaFree. Additionally, two kernels may receive the same pointer, but they cannot execute concurrently, because a single thread does not launch concurrent kernels. Since there is one allocator per thread, no two kernels can ever race on the same pointer (I think). I have not seen any issues after removing this sync.
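For illustration, here is a minimal CUDA sketch (not code from this PR) of the ordering guarantee being relied on: kernels launched by a single host thread onto the same stream execute in launch order, so a pooled buffer reused across back-to-back launches cannot race even without an intervening device sync.

```cpp
#include <cuda_runtime.h>

__global__ void kernelA(float* buf, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if(i < n) buf[i] = 1.0f;             // writes the pooled buffer
}

__global__ void kernelB(const float* buf, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if(i < n) out[i] = buf[i] * 2.0f;    // reads the same pointer; ordered after kernelA
}

int main() {
  const int n = 1024;
  float *buf, *out;
  cudaMalloc(&buf, n * sizeof(float)); // stands in for a block from the allocator's pool
  cudaMalloc(&out, n * sizeof(float));
  // Same host thread, same (default) stream: kernelB cannot start before kernelA
  // completes, so no cudaDeviceSynchronize() is needed between the two launches.
  kernelA<<<4, 256>>>(buf, n);
  kernelB<<<4, 256>>>(buf, out, n);
  cudaDeviceSynchronize();             // only needed before the host consumes results
  cudaFree(buf);
  cudaFree(out);
  return 0;
}
```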
…expose more parallelism when adding into the lemmas
FYI: I am currently requesting internally to remove the notices in each file and for NVIDIA to be added to the license file. I will take care of the licensing once I get confirmation.
```cpp
auto factorMaxima = max(logits_[g]->loss(), -1);
auto factorMasks = constant(getFactorMasks(g, shortlist ? shortlist->indices() : std::vector<WordIndex>()));
sel = sel + factorMaxima * factorMasks; // those lemmas that don't have a factor get multiplied with 0
if(numGroups > 1 && graph()->isInference() && graph()->getBackend()->getDeviceId().type == DeviceType::gpu) {
```
This fork is something I wasn't sure how to remove. It would be better if it were under the expression operator, but moving it down makes the operator interface a bit ugly and introduces some code duplication. Feedback on this in particular would be greatly appreciated.
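To make the trade-off concrete, here is a hedged sketch of what pushing the fork under an expression operator could look like; `addFactorMaxima` and `fusedFactorMaxima` are hypothetical names, not APIs from this PR or from Marian. The call site stays uniform, but the dispatch and a duplicated math path move inside the operator:

```cpp
// Hypothetical sketch only; names below are illustrative, not real Marian APIs.
Expr addFactorMaxima(Expr sel, Expr factorLogits, Expr factorMasks) {
  auto graph = sel->graph();
  bool fusedGpuPath = graph->isInference()
      && graph->getBackend()->getDeviceId().type == DeviceType::gpu;
  if(fusedGpuPath)
    return fusedFactorMaxima(sel, factorLogits, factorMasks); // assumed fused GPU op
  // Generic fallback duplicates the expression-level math from the call site:
  auto factorMaxima = max(factorLogits, -1);
  return sel + factorMaxima * factorMasks;
}
```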
Description
This PR adds a new GPU inference operator that computes the lemma logits for a factored vocabulary. It demonstrates a significant speedup in GPU inference over PR #772.
Here are some perf numbers relative to PR #772
Times from a proxy model with one stream, as measured on a Titan V.
Times from a proxy model with two streams, as measured on a Titan V.
List of changes:
Added dependency: cub (NVIDIA's CUB library of CUDA primitives)
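As a rough illustration of where cub plausibly helps (an assumed usage pattern, not necessarily the PR's actual kernel), the per-row max over factor logits that feeds factorMaxima maps naturally onto cub::BlockReduce:

```cpp
#include <cub/cub.cuh>
#include <cfloat>

// One thread block per row; each thread folds a strided slice of the row,
// then the block cooperatively reduces the partial maxima.
template <int BLOCK_SIZE>
__global__ void rowMax(const float* logits, float* maxima, int cols) {
  typedef cub::BlockReduce<float, BLOCK_SIZE> BlockReduce;
  __shared__ typename BlockReduce::TempStorage temp;

  const float* row = logits + (size_t)blockIdx.x * cols;
  float local = -FLT_MAX;
  for(int c = threadIdx.x; c < cols; c += BLOCK_SIZE)
    local = fmaxf(local, row[c]);      // thread-local partial max

  float rowMaximum = BlockReduce(temp).Reduce(local, cub::Max());
  if(threadIdx.x == 0)
    maxima[blockIdx.x] = rowMaximum;   // the factorMaxima entry for this row
}
```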
How to test
I ran the regression tests and they all passed. I also tested manually on a proxy model; the outputs after this change exactly match the outputs from master.
CMake command: cmake .. -DCOMPILE_CPU=on -DCOMPILE_CUDA=on -DUSE_SENTENCEPIECE=on -DUSE_STATIC_LIBS=off -DCOMPILE_SERVER=off -DUSE_FBGEMM=on -DCOMPILE_CUDA_SM35=off -DCOMPILE_CUDA_SM50=off -DCOMPILE_CUDA_SM60=off -DCOMPILE_CUDA_SM70=on -DCOMPILE_CUDA_SM75=off -DCOMPILE_TESTS=on
Ubuntu - 18.04.3 LTS
nvcc - 10.1.243
gcc - 7.5.0
Checklist