Introduces new operator to get the lemma logits for factored vocabulary models for GPU inference #776
base: master
Conversation
…erence when processing the lemmas
…since the allocator manages its own memory pool, the memory won't actually be released by a cudaFree. Additionally, two kernels may receive the same pointer, but they cannot execute concurrently, because a single thread does not launch concurrent kernels. Since there is one allocator per thread, no two kernels can ever race on the same pointer (I think). I have not seen any issues after removing this sync.
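For illustration, here is a minimal CUDA sketch (not code from this PR) of the ordering guarantee being relied on: kernels launched by a single host thread onto the same stream execute in launch order, so a pooled buffer reused across back-to-back launches cannot race even without an intervening device sync.

```cpp
#include <cuda_runtime.h>

__global__ void kernelA(float* buf, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if(i < n) buf[i] = 1.0f;             // writes the pooled buffer
}

__global__ void kernelB(const float* buf, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if(i < n) out[i] = buf[i] * 2.0f;    // reads the same pointer; ordered after kernelA
}

int main() {
  const int n = 1024;
  float *buf, *out;
  cudaMalloc(&buf, n * sizeof(float)); // stands in for a block from the allocator's pool
  cudaMalloc(&out, n * sizeof(float));
  // Same host thread, same (default) stream: kernelB cannot start before kernelA
  // completes, so no cudaDeviceSynchronize() is needed between the two launches.
  kernelA<<<4, 256>>>(buf, n);
  kernelB<<<4, 256>>>(buf, out, n);
  cudaDeviceSynchronize();             // only needed before the host consumes results
  cudaFree(buf);
  cudaFree(out);
  return 0;
}
```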
…expose more parallelism when adding into the lemmas
FYI: I am currently requesting internally to remove the notices in each file and for NVIDIA to be added to the license file. I will take care of the licensing once I get confirmation.
```cpp
auto factorMaxima = max(logits_[g]->loss(), -1);
auto factorMasks = constant(getFactorMasks(g, shortlist ? shortlist->indices() : std::vector<WordIndex>()));
sel = sel + factorMaxima * factorMasks; // those lemmas that don't have a factor get multiplied with 0
if(numGroups > 1 && graph()->isInference() && graph()->getBackend()->getDeviceId().type == DeviceType::gpu) {
```
This fork is something I wasn't sure how to remove. It would be better if it were under the expression operator, but moving it down makes the operator interface a bit ugly and introduces some code duplication. Feedback on this in particular would be greatly appreciated.
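To make the trade-off concrete, here is a hedged sketch of what pushing the fork under an expression operator could look like; `addFactorMaxima` and `fusedFactorMaxima` are hypothetical names, not APIs from this PR or from Marian. The call site stays uniform, but the dispatch and a duplicated math path move inside the operator:

```cpp
// Hypothetical sketch only; names below are illustrative, not real Marian APIs.
Expr addFactorMaxima(Expr sel, Expr factorLogits, Expr factorMasks) {
  auto graph = sel->graph();
  bool fusedGpuPath = graph->isInference()
      && graph->getBackend()->getDeviceId().type == DeviceType::gpu;
  if(fusedGpuPath)
    return fusedFactorMaxima(sel, factorLogits, factorMasks); // assumed fused GPU op
  // Generic fallback duplicates the expression-level math from the call site:
  auto factorMaxima = max(factorLogits, -1);
  return sel + factorMaxima * factorMasks;
}
```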
Description
This PR adds a new GPU inference operator that computes the lemma logits for a factored vocabulary. It demonstrates a significant speedup in GPU inference over PR #772.
Here are some perf numbers relative to PR #772
Times from a proxy model with one stream, as measured on a Titan V.
Times from a proxy model with two streams, as measured on a Titan V.
List of changes:
Added dependency: cub (NVIDIA's CUB library of CUDA primitives)
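As a rough illustration of where cub plausibly helps (an assumed usage pattern, not necessarily the PR's actual kernel), the per-row max over factor logits that feeds factorMaxima maps naturally onto cub::BlockReduce:

```cpp
#include <cub/cub.cuh>
#include <cfloat>

// One thread block per row; each thread folds a strided slice of the row,
// then the block cooperatively reduces the partial maxima.
template <int BLOCK_SIZE>
__global__ void rowMax(const float* logits, float* maxima, int cols) {
  typedef cub::BlockReduce<float, BLOCK_SIZE> BlockReduce;
  __shared__ typename BlockReduce::TempStorage temp;

  const float* row = logits + (size_t)blockIdx.x * cols;
  float local = -FLT_MAX;
  for(int c = threadIdx.x; c < cols; c += BLOCK_SIZE)
    local = fmaxf(local, row[c]);      // thread-local partial max

  float rowMaximum = BlockReduce(temp).Reduce(local, cub::Max());
  if(threadIdx.x == 0)
    maxima[blockIdx.x] = rowMaximum;   // the factorMaxima entry for this row
}
```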
How to test
I ran the regression tests and they all passed. I also tested manually on a proxy model; the outputs after this change exactly match the outputs from master.
CMake command: cmake .. -DCOMPILE_CPU=on -DCOMPILE_CUDA=on -DUSE_SENTENCEPIECE=on -DUSE_STATIC_LIBS=off -DCOMPILE_SERVER=off -DUSE_FBGEMM=on -DCOMPILE_CUDA_SM35=off -DCOMPILE_CUDA_SM50=off -DCOMPILE_CUDA_SM60=off -DCOMPILE_CUDA_SM70=on -DCOMPILE_CUDA_SM75=off -DCOMPILE_TESTS=on
Ubuntu - 18.04.3 LTS
nvcc - 10.1.243
gcc - 7.5.0
Checklist