
[RFC] TensorCore 8bit implementations #1

Open · wants to merge 283 commits into base: master
Conversation

@XapaJIaMnu commented Oct 20, 2020

Hey @rhenry-nv ,

Thank you for your recent improvements to Marian. Do you have any particular development goals with regard to Marian?

I am asking since I have been working on an 8-bit GPU version, with and without tensor cores, using CUTLASS, and I would like to avoid duplicated effort where possible. In particular, the GPU code for the 8-bit GEMM is located here: https://github.com/XapaJIaMnu/marian-dev/blob/8bitgpu/src/tensors/gpu/integer_tools.cu
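For readers unfamiliar with what an 8-bit GEMM computes, here is a minimal CPU reference sketch of the technique: quantize float inputs to int8 with per-tensor scales, accumulate products in int32 (as the int8 tensor-core path does), and dequantize the result. This is an illustrative assumption, not the actual kernel in the linked `integer_tools.cu`; the function name `quantizedGemm` and the max-abs scaling scheme are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// CPU reference for an 8-bit quantized GEMM: C = A * B, where A (m x k) and
// B (k x n) are quantized to int8 with per-tensor scales, products are
// accumulated in int32, and the result is dequantized back to float.
// Assumes A and B are not all-zero (so the scales are nonzero).
std::vector<float> quantizedGemm(const std::vector<float>& A,
                                 const std::vector<float>& B,
                                 int m, int k, int n) {
  auto maxAbs = [](const std::vector<float>& v) {
    float mx = 0.f;
    for (float x : v) mx = std::max(mx, std::fabs(x));
    return mx;
  };
  // Per-tensor scales mapping the max absolute value onto the int8 range.
  float scaleA = maxAbs(A) / 127.f;
  float scaleB = maxAbs(B) / 127.f;

  auto quantize = [](const std::vector<float>& v, float s) {
    std::vector<int8_t> q(v.size());
    for (size_t i = 0; i < v.size(); ++i)
      q[i] = static_cast<int8_t>(std::lround(v[i] / s));
    return q;
  };
  std::vector<int8_t> qA = quantize(A, scaleA), qB = quantize(B, scaleB);

  std::vector<float> C(m * n, 0.f);
  for (int i = 0; i < m; ++i)
    for (int j = 0; j < n; ++j) {
      int32_t acc = 0;  // int32 accumulator, as in the int8 tensor-core path
      for (int p = 0; p < k; ++p)
        acc += static_cast<int32_t>(qA[i * k + p]) * qB[p * n + j];
      C[i * n + j] = acc * scaleA * scaleB;  // dequantize
    }
  return C;
}
```

The quantization error introduced by the int8 rounding is what the tensor-core kernels trade away for throughput; a library like CUTLASS runs the same int8-multiply/int32-accumulate contraction on tensor cores instead of this scalar loop.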

Do you have an opinion on the use of CUTLASS vs. cuBLAS for tensor-core operations? Any particular comments on the GPU code?

Part of this pull request is pending review against marian-dev master, and the rest will hopefully go in incrementally.

Cheers,

Nick

delong-coder and others added 30 commits November 4, 2020 14:29
…arian-nmt#749)

* Add Triton Marian backend running in AzureML Inference Environment
…ith it causes a misaligned read when the bias is not a multiple of 8 (this occurs during slicing). This case is handled specially. The bug will be fixed in a subsequent release of CUDA.
* This PR adds training of embedding spaces with better separation based on https://arxiv.org/abs/2007.01852
* We can now train with in-batch negative examples, or with a handful of hand-constructed negative examples provided in a TSV file.
…nmt#759)

* fix problem if the optimization step is set to 0
* set the first error residual to 0
- Updates SentencePiece to the newest version (removes the dependency on Protobuf)
- Enables SentencePiece compilation by default, since there is no dependency on Protobuf anymore.
…search

This PR changes the stopping criterion for mini-batch-fit binary search to better maximize batch size.
This updates the SentencePiece version in Marian to a much more recent revision. As a result, there is no longer a dependency on Protobuf.
* attempt to enable cutlass tensorcore with fp16: compilation OK

* FP16 support for NodeOp (partial implementation)

* switch to reinterpret_cast

* add cutlass FP16 support from Nick

* add cutlass FP16 support for DotNodeOp

* set quant NodeOp type

* more NodeOp changes for FP16 support

* remove debugging info

* some comments and aborts
This is done so that we can compile with newer CUDA versions. It also adds some extra templates that are not used in this branch of Marian, but they do not impede translation or compilation.
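One of the commits above changes the stopping criterion of the mini-batch-fit binary search. The underlying idea can be sketched as a standard monotone binary search: find the largest batch size whose memory footprint still fits. This is a hedged sketch under stated assumptions; `largestFittingBatch` and the `fits()` probe are hypothetical names, and Marian's actual criterion in that PR is more involved than this.

```cpp
#include <cassert>

// Sketch of a mini-batch-fit style search: find the largest batch size in
// [1, maxBatch] for which fits(batch) is true, assuming fits() is monotone
// (once a batch no longer fits in memory, no larger batch fits either).
// fits() stands in for an actual memory probe, e.g. a trial allocation.
template <class Fits>
int largestFittingBatch(int maxBatch, Fits fits) {
  int lo = 1, hi = maxBatch, best = 0;
  while (lo <= hi) {
    int mid = lo + (hi - lo) / 2;
    if (fits(mid)) {
      best = mid;     // mid fits: remember it and try a larger batch
      lo = mid + 1;
    } else {
      hi = mid - 1;   // mid is too big: shrink the search range
    }
  }
  return best;
}
```

The choice of stopping criterion matters because each `fits()` probe is expensive (it exercises the allocator), so the search should converge in O(log maxBatch) probes while still landing on the true maximum rather than a conservative underestimate.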

10 participants