Introduces new operator to get the lemma logits for factored vocabulary models for GPU inference #776

Open · wants to merge 18 commits into master
Conversation

rhenry-nv (Contributor)

Description

This PR adds a new GPU inference operator that computes the lemma logits for a factored vocabulary. It yields a significant speedup in GPU inference over PR #772.
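For context, the per-group update this operator fuses is essentially the one visible in the snippet discussed later in this thread: each factor group's maximum logit is added into the logits of exactly those lemmas that carry a factor from that group. Below is a minimal single-position C++ reference sketch; names such as `addFactorMaxima` and `lemmaHasFactorGroup` are illustrative, not Marian's API.

```cpp
// Reference computation for the fused operator, for one batch position:
// each factor group's maximum logit is added into the logits of exactly
// those lemmas that carry a factor from that group (others add 0).
#include <algorithm>
#include <cstddef>
#include <vector>

void addFactorMaxima(std::vector<float>& lemmaLogits,                              // [lemma]
                     const std::vector<std::vector<float>>& factorLogits,          // [group][factor]
                     const std::vector<std::vector<float>>& lemmaHasFactorGroup) { // [group][lemma], 0/1
  for(std::size_t g = 1; g < factorLogits.size(); ++g) { // group 0 is the lemma group itself
    float groupMax = *std::max_element(factorLogits[g].begin(), factorLogits[g].end());
    for(std::size_t l = 0; l < lemmaLogits.size(); ++l)
      lemmaLogits[l] += groupMax * lemmaHasFactorGroup[g][l];
  }
}
```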

Here are some performance numbers relative to PR #772.

Times from a proxy model with one stream, as measured on a Titan V:

| Batch | Time from #772 (s) | Current time (s) | Speedup |
|------:|-------------------:|-----------------:|--------:|
|     1 | 162.169            | 105.638          | 1.535   |
|     2 | 109.386            | 73.2865          | 1.493   |
|     4 | 66.8378            | 45.1652          | 1.480   |
|     8 | 39.2434            | 27.1344          | 1.446   |
|    16 | 23.2354            | 16.1808          | 1.436   |
|    32 | 14.3573            | 10.1867          | 1.409   |
|    64 | 9.01939            | 6.45151          | 1.398   |
|   128 | 6.04882            | 4.39468          | 1.376   |
|   256 | 4.28963            | 3.2189           | 1.333   |

Times from a proxy model with two streams, as measured on a Titan V:

| Batch | Time from #772 (s) | Current time (s) | Speedup |
|------:|-------------------:|-----------------:|--------:|
|     1 | 110.365            | 94.137           | 1.172   |
|     2 | 74.248             | 61.736           | 1.203   |
|     4 | 45.4441            | 37.2677          | 1.219   |
|     8 | 26.7974            | 21.81            | 1.229   |
|    16 | 15.6832            | 12.6072          | 1.244   |
|    32 | 9.79104            | 7.70231          | 1.271   |
|    64 | 6.22622            | 4.86047          | 1.281   |
|   128 | 4.22329            | 3.3429           | 1.263   |
|   256 | 3.2422             | 2.52556          | 1.284   |

List of changes:

  • Adds a new operator used only for GPU inference.
  • Adds a cache-by-name function to the expression graph to store a tensor encoding whether each lemma has a given factor group (a rough sketch of the idea follows this list). There may be a better way to do this, but it wasn't obvious to me. This tensor needs to reside permanently on the GPU, since the copy time between host and device on every batch would otherwise outweigh the benefit of this optimization.
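A rough sketch of the cache-by-name idea, under the assumption that the graph keeps a name-to-tensor map that outlives individual forward passes; `cacheByName` and the stub types below are illustrative, not this PR's actual interface:

```cpp
// Sketch: the graph owns named tensors that persist on the device across
// forward passes, so the lemma/factor-group mask is uploaded once rather
// than on every batch.
#include <functional>
#include <map>
#include <memory>
#include <string>

struct Tensor { /* device buffer handle */ };

class ExpressionGraph {
  std::map<std::string, std::shared_ptr<Tensor>> namedCache_;
public:
  // Returns the cached tensor if present; otherwise builds it once and keeps
  // it alive for the lifetime of the graph (one-time H2D upload in build()).
  std::shared_ptr<Tensor> cacheByName(const std::string& name,
                                      const std::function<std::shared_ptr<Tensor>()>& build) {
    auto it = namedCache_.find(name);
    if(it == namedCache_.end())
      it = namedCache_.emplace(name, build()).first;
    return it->second;
  }
};
```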

Added dependencies: cub

How to test

I ran the regression tests and they all passed. I also tested manually on a proxy model: the outputs after this change exactly match the outputs from master.

CMake command:

```bash
cmake .. -DCOMPILE_CPU=on -DCOMPILE_CUDA=on -DUSE_SENTENCEPIECE=on -DUSE_STATIC_LIBS=off \
  -DCOMPILE_SERVER=off -DUSE_FBGEMM=on -DCOMPILE_CUDA_SM35=off -DCOMPILE_CUDA_SM50=off \
  -DCOMPILE_CUDA_SM60=off -DCOMPILE_CUDA_SM70=on -DCOMPILE_CUDA_SM75=off -DCOMPILE_TESTS=on
```

Environment:

  • Ubuntu 18.04.3 LTS
  • nvcc 10.1.243
  • gcc 7.5.0

Checklist

  • I have tested the code manually
  • I have run regression tests
  • I have read and followed CONTRIBUTING.md
  • I have updated CHANGELOG.md

…since the allocator has a memory pool that it manages, so the memory won't get released by a cudaFree. Additionally, two kernels may get the same pointer, but they cannot execute concurrently, since a single thread does not launch concurrent kernels. Because there is one allocator per thread, no two kernels can ever race on the same pointer (I think). I have not seen any issues after removing this sync.
…expose more parallelism when adding into the lemmas
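For what it's worth, the no-sync reasoning above rests on standard CUDA stream ordering: kernels launched by one thread onto one stream execute in launch order, so a buffer recycled by a per-thread allocator between two launches cannot be touched by both kernels at once. A minimal illustrative sketch (the kernels are hypothetical):

```cpp
// CUDA stream-ordering sketch: within one stream, kernels run in launch
// order, so a buffer recycled by a per-thread allocator between these two
// launches cannot race.
#include <cuda_runtime.h>

__global__ void producer(float* buf, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if(i < n) buf[i] = 1.0f;
}

__global__ void consumer(const float* buf, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if(i < n) out[i] = 2.0f * buf[i]; // ordered after producer on the same stream
}

void launchOrdered(float* buf, float* out, int n, cudaStream_t stream) {
  int threads = 256, blocks = (n + threads - 1) / threads;
  producer<<<blocks, threads, 0, stream>>>(buf, n);
  // No cudaStreamSynchronize needed here: stream order guarantees that
  // consumer does not start until producer has completed.
  consumer<<<blocks, threads, 0, stream>>>(buf, out, n);
}
```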
rhenry-nv (Contributor, Author)

FYI: I am currently requesting internally to remove the notices in each file and for NVIDIA to be added to the license file. I will take care of the licensing once I get confirmation.

```cpp
auto factorMaxima = max(logits_[g]->loss(), -1);
auto factorMasks = constant(getFactorMasks(g, shortlist ? shortlist->indices() : std::vector<WordIndex>()));
sel = sel + factorMaxima * factorMasks; // those lemmas that don't have a factor get multiplied with 0
if(numGroups > 1 && graph()->isInference() && graph()->getBackend()->getDeviceId().type == DeviceType::gpu) {
```
rhenry-nv (Contributor, Author) commented on this snippet:
This fork is something I wasn't sure how to remove. It would be better if it were under the expression operator, but moving it down makes the operator interface a bit ugly and introduces some code duplication. Feedback on this in particular would be greatly appreciated.
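For discussion, one possible shape for pushing the fork below the operator interface. Everything here is a sketch with stand-in names (`combineFactorLogits`, `gpuFusedFactorMaxima`, the stub types), not this PR's code, and it also illustrates the duplication concern: the generic expression path still has to exist alongside the fused one.

```cpp
// Hypothetical shape for hiding the device/inference fork under one entry
// point. Expr, DeviceType, and gpuFusedFactorMaxima are stand-ins.
enum class DeviceType { cpu, gpu };
struct Expr { /* graph node handle */ };

Expr add(Expr a, Expr b);                                      // existing graph ops (assumed)
Expr mul(Expr a, Expr b);
Expr gpuFusedFactorMaxima(Expr sel, Expr maxima, Expr masks);  // the new fused op

Expr combineFactorLogits(Expr sel, Expr factorMaxima, Expr factorMasks,
                         bool isInference, DeviceType device) {
  // Callers no longer branch on backend; the dispatch lives in one place.
  if(isInference && device == DeviceType::gpu)
    return gpuFusedFactorMaxima(sel, factorMaxima, factorMasks);
  return add(sel, mul(factorMaxima, factorMasks));             // generic path
}
```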

rhenry-nv changed the title from "Factor maxes op" to "Introduces new operator to get the lemma logits for factored vocabulary models for GPU inference" on Dec 15, 2020.