forked from marian-nmt/marian-dev
[RFC] TensorCore 8bit implementations #1
Open
XapaJIaMnu wants to merge 283 commits into rhenry-nv:master from XapaJIaMnu:8bitgpu
Conversation
…arian-nmt#749) * Add Triton Marian backend running in AzureML Inference Environment
…ith it causes a misaligned read when the bias is not a multiple of 8 (this occurs during slicing). This case is handled specially. The bug will be fixed in a subsequent release of CUDA.
…l. Fixes bug in AddFactorMaxes
* This PR adds training of embedding spaces with better separation based on https://arxiv.org/abs/2007.01852
* We can now train with in-batch negative examples or a handful of hand-constructed negative examples provided in a TSV file.
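The in-batch-negatives idea above can be sketched as follows. This is an illustrative CPU example, not Marian's implementation: for a batch of paired source/target embeddings, each row's matching column is the positive and the other columns in the same batch act as negatives, scored with a row-wise softmax cross-entropy.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

using Vec = std::vector<float>;

float dot(const Vec& a, const Vec& b) {
    float s = 0.f;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Mean negative log-likelihood of the diagonal under a row-wise softmax
// over the batch similarity matrix sim[i][j] = src[i] . tgt[j].
// Off-diagonal entries are the "free" in-batch negatives.
float inBatchNegativeLoss(const std::vector<Vec>& src, const std::vector<Vec>& tgt) {
    size_t B = src.size();
    float loss = 0.f;
    for (size_t i = 0; i < B; ++i) {
        float denom = 0.f;
        for (size_t j = 0; j < B; ++j) denom += std::exp(dot(src[i], tgt[j]));
        loss += -std::log(std::exp(dot(src[i], tgt[i])) / denom);
    }
    return loss / B;
}
```

Minimizing this loss pulls matching pairs together while pushing each source away from every other target in the batch, which is what "better separation" of the embedding space refers to.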
…nmt#759) * fix problem if the optimization step is set to 0 * set the first error residual to 0
- Updates SentencePiece to the newest version (removes the dependency on protobuf)
- Enables SentencePiece compilation by default since there is no dependency on protobuf anymore.
…search This PR changes the stopping criterion for mini-batch-fit binary search to better maximize batch size.
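The mini-batch-fit search mentioned above can be illustrated with a generic binary search over batch sizes. This is a hypothetical sketch of the idea only; Marian's actual stopping criterion and memory model differ, and `fits` stands in for whatever predicate decides that a batch still fits in the workspace.

```cpp
#include <cassert>
#include <functional>

// Find the largest batch size in [lo, hi] for which fits() still holds,
// assuming fits() is monotone (once a size overflows, larger sizes do too).
// lo is assumed to always fit.
int largestFittingBatch(int lo, int hi, const std::function<bool(int)>& fits) {
    int best = lo;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (fits(mid)) {       // mid fits: remember it, try larger batches
            best = mid;
            lo = mid + 1;
        } else {               // mid overflows: shrink the search range
            hi = mid - 1;
        }
    }
    return best;
}
```

Tightening where this search stops determines how close the chosen batch size gets to the true memory limit, which is what "better maximize batch size" means here.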
This updates the SentencePiece version in Marian to a much more recent revision. Due to that there is no dependency on Protobuf anymore.
* attempt to enable cutlass tensorcore with fp16: compilation OK
* FP16 support for NodeOp (partial implementation)
* switch to reinterpret_cast
* add cutlass FP16 support from Nick
* add cutlass FP16 support for DotNodeOp
* set quant NodeOp type
* more NodeOp changes for FP16 support
* remove debugging info
* some comments and aborts
This is done so that we can compile with newer CUDA versions. It also adds some extra templates that are not used in this branch of Marian, but it does not impede translation or compilation.
Hey @rhenry-nv ,
Thank you for your recent improvements to Marian. Do you have any particular development goals with regard to Marian?
I am asking because I have been working on an 8-bit GPU version, with and without tensor cores, using CUTLASS, and I would like to avoid duplicated effort where possible. In particular, the GPU code for the 8-bit GEMM is located here: https://github.com/XapaJIaMnu/marian-dev/blob/8bitgpu/src/tensors/gpu/integer_tools.cu
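For readers unfamiliar with 8-bit GEMM, the following CPU reference shows the arithmetic such a kernel performs: symmetric per-tensor quantization to int8, int32 accumulation, then rescaling back to float. The CUTLASS kernels in `integer_tools.cu` run this computation on tensor cores; the names and the particular quantization scheme below are illustrative assumptions, not that file's actual API.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Symmetric per-tensor scale: map [-absmax, absmax] onto [-127, 127].
float quantScale(const std::vector<float>& v) {
    float amax = 0.f;
    for (float x : v) amax = std::max(amax, std::fabs(x));
    return amax > 0.f ? amax / 127.f : 1.f;
}

std::vector<int8_t> quantize(const std::vector<float>& v, float scale) {
    std::vector<int8_t> q(v.size());
    for (size_t i = 0; i < v.size(); ++i)
        q[i] = static_cast<int8_t>(
            std::lround(std::clamp(v[i] / scale, -127.f, 127.f)));
    return q;
}

// C = dequantize(Aq * Bq) for row-major A (MxK) and B (KxN):
// accumulate in int32 (as int8 tensor-core GEMMs do), then rescale.
std::vector<float> gemmInt8(const std::vector<int8_t>& A, const std::vector<int8_t>& B,
                            int M, int K, int N, float scaleA, float scaleB) {
    std::vector<float> C(M * N, 0.f);
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            int32_t acc = 0;
            for (int k = 0; k < K; ++k)
                acc += int32_t(A[m * K + k]) * int32_t(B[k * N + n]);
            C[m * N + n] = acc * scaleA * scaleB;
        }
    return C;
}
```

The CUTLASS-vs-cuBLAS question below largely comes down to this epilogue step: CUTLASS lets the rescale (and bias/activation) be fused into the kernel, while cuBLAS exposes a fixed set of epilogues.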
Do you have an opinion on the use of CUTLASS vs. cuBLAS for tensor-core operations? Any particular comments on the GPU code?
Part of this pull request is pending review against the marian-dev master branch, and the rest will hopefully go in incrementally.
Cheers,
Nick