EmbeddingBag op and layer #2352
Conversation
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here to confirm.
@googlebot I signed it!
Can you add the Python compositional implementation, as in https://github.com/tensorflow/addons/blob/master/tensorflow_addons/seq2seq/beam_search_decoder.py#L237?
Here are some other examples of custom ops with alternative Python implementations (now Python-only on master): #1114 (comment)
Sure, I'll add that soon while I'm fixing everything else up.
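For reference, a minimal sketch of what such a pure-Python compositional fallback might look like; the name, signature, and combiner argument here are illustrative assumptions, not the PR's final API:

```python
import tensorflow as tf

def _embedding_bag_py(indices, params, weights=None, combiner="sum"):
    """Illustrative pure-Python EmbeddingBag built from existing TF ops."""
    gathered = tf.gather(params, indices)                    # [batch, bag, dim]
    if weights is not None:
        gathered = gathered * tf.expand_dims(weights, -1)    # per-index weights
    if combiner == "sum":
        return tf.reduce_sum(gathered, axis=1)
    return tf.reduce_mean(gathered, axis=1)                  # "mean" combiner
```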
Hi @Rocketknight1, thanks for the contribution. This is a huge PR. I have a suggestion: I can help you with the C++ CPU implementation (with threading and vectorization) and the Python interface, which should reduce back-and-forth during review. If that works for you, I'll work on your branch and comment out the GPU build (I'll modify the style and inputs/outputs), or I can file another PR for the CPU ops. Thank you!
The CUDA/C++ ops expect a weight tensor. When no weights are passed, the Python wrapper instantiates dummy weights with tf.ones_like(). Is this acceptable?
It is acceptable. The TensorFlow op registration mini-language does not support optional inputs. The workarounds are:
- Manipulate some inputs in Python (as you do), or
- Pass the input as a list of tensors; you can then check the length of the list to identify whether the tensors are present.
A sketch of the first workaround follows below.
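A minimal sketch of the first workaround, where `_embedding_bag_op` is a hypothetical stand-in for the compiled custom op (which always expects weights):

```python
import tensorflow as tf

def _embedding_bag_op(indices, params, weights):
    # Hypothetical stand-in for the compiled op, which always takes weights.
    return tf.reduce_sum(
        tf.gather(params, indices) * tf.expand_dims(weights, -1), axis=1)

def embedding_bag(indices, params, weights=None):
    if weights is None:
        # The registration language has no optional inputs, so the wrapper
        # synthesizes neutral weights before calling the op.
        weights = tf.ones(tf.shape(indices), dtype=params.dtype)
    return _embedding_bag_op(indices, params, weights)
```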
Hi @Rocketknight1, thanks for your PR. We recently switched from Torch to TF, and EmbeddingBag was missing for us too! In our use case we often work with lists of non-constant length, called ragged tensors in TF, which use a data format similar to a sparse CSR matrix: https://www.tensorflow.org/guide/ragged_tensor
For now we replace EmbeddingBag by converting our ragged tensor to a SparseTensor (the coordinates being the row and the item's position within that row, and the values being the indices we want to gather); we also use a second SparseTensor which holds the weights instead of the indices. Another workaround we found is to create a sparse indicator tensor: the x coordinate is the row of your batch, the y coordinate the index, and the value the weight. You can then compute your embedding + sum as a sparse_dense_matmul between the embedding matrix and your sparse indicator. The sparse_dense_matmul itself is very fast; the cost is mostly in creating the sparse indicator. I'm not sure how that option behaves in terms of memory, since the internals are handled by TF. I did a few tests here: the performance depends a lot on the sparsity of the indices and the number of items to compute in the ragged case. I'll try to compile your branch to compare the performance of the sparse matmul against your EmbeddingBag! A few months ago we benchmarked PyTorch against TF, and EmbeddingBag was slightly better than matmul in some cases. I'd be happy to help with benchmarks if you need some in this PR!
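A minimal sketch of the indicator workaround described above, assuming each index appears at most once per row; the shapes and values are invented for illustration:

```python
import tensorflow as tf

vocab_size, embed_dim = 100, 8
embeddings = tf.random.normal([vocab_size, embed_dim])

# Ragged batch: rows of non-constant length, plus per-index weights.
indices = tf.ragged.constant([[1, 5, 7], [2, 4]], dtype=tf.int64)
weights = tf.ragged.constant([[1.0, 1.0, 0.5], [2.0, 1.0]])

# Sparse indicator: entry (row, index) holds that index's weight.
# Assumes each index occurs at most once per row.
sparse_idx = tf.stack([indices.value_rowids(), indices.flat_values], axis=1)
indicator = tf.sparse.reorder(tf.sparse.SparseTensor(
    indices=sparse_idx,
    values=weights.flat_values,
    dense_shape=[indices.nrows(), vocab_size]))

# Weighted embedding-bag "sum" as a single sparse-dense matmul.
pooled = tf.sparse.sparse_dense_matmul(indicator, embeddings)  # [2, 8]
```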
@tanguycdls Thanks. This seems interesting to explore.
Hi @tanguycdls, your workaround with sparse multiplications is really interesting! I'm also curious to know how it compares in terms of memory usage to the CUDA EmbeddingBag. Please note that my CPU implementations are not very optimized right now (as @WindQAQ pointed out), but the CUDA implementation should be at least reasonably performant.
Also yes @WindQAQ, if you want to improve the CPU implementations, feel free to make those changes. I'm aware that they are currently 'reference' implementations rather than high-performance ones. Thank you!
@tanguycdls It could also be interesting to benchmark against TF-nightly, as some of the sparse optimizations in the compiler stack are right on the development edge. See https://llvm.discourse.group/t/mlir-support-for-sparse-tensors/
/cc @aartbik in case he is interested in the sparse code path.
All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are OK with their commits being contributed to this project; please have them confirm by leaving a comment here.
Hi @bhack, can you approve running the workflows? It's awkward for me to run all the tests locally, and it's really helpful if I can check quickly via CI!
I have run into a problem: the code added by @WindQAQ to convert the parameter gradients to an IndexedSlices object does not seem to work in graph mode. I don't think this is a problem with the code itself; it looks good to me! However, the IndexedSlices tensor is returned as a dense tensor when I wrap _embedding_bag with tf.function(). I googled around and couldn't find a cause, but there's a suggestive issue here; it's possible this is caused by a different TF bug that got monkey-patched: tensorflow/tensorflow#36236. Either way, I've edited the tests, and hopefully it should all work now.
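For context, here is a toy sketch of the general mechanism under discussion: a custom gradient that returns the parameter gradient as IndexedSlices so only the gathered rows carry updates. This is an illustration under assumed names, not the PR's actual gradient code:

```python
import tensorflow as tf

@tf.custom_gradient
def gather_sum(params, indices):
    # Toy stand-in for the embedding-bag forward pass: gather rows, sum bags.
    out = tf.reduce_sum(tf.gather(params, indices), axis=1)

    def grad(upstream):
        # Each index in row b receives upstream[b]; returning IndexedSlices
        # keeps the params gradient sparse. Indices get no gradient.
        bag_size = tf.shape(indices)[1]
        values = tf.repeat(upstream, repeats=bag_size, axis=0)  # [batch*bag, dim]
        params_grad = tf.IndexedSlices(
            values, tf.reshape(indices, [-1]),
            tf.shape(params, out_type=tf.int64))
        return params_grad, None

    return out, grad

params = tf.random.normal([10, 4])
indices = tf.constant([[1, 2], [3, 3]])
with tf.GradientTape() as tape:
    tape.watch(params)
    loss = tf.reduce_sum(gather_sum(params, indices))
grad = tape.gradient(loss, params)
print(type(grad))  # IndexedSlices when run eagerly; under tf.function the
                   # gradient may come back densified, as described above.
```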
```python
sorted_unique_indices = tf.sort(unique_indices)
return [
    None,
    tf.IndexedSlices(
```
Maybe just returning value_grads is a better choice, according to https://github.com/foxik/tensorflow/blob/b0e08f68d84476b941e8a821d653ab08129c92ce/tensorflow/python/eager/function.py#L1241.
Good idea, I made the change!
@bhack I believe this PR should have fixed it! Can you approve running the tests?
You have a lint error.
Fixed! I think.
We did it!
Can you add yourself as a code owner in https://github.com/tensorflow/addons/blob/master/.github/CODEOWNERS?
@bhack Ah, sorry, just saw that. I've added myself!
@bhack Can we run the tests? I think this is ready to merge now!
@bhack I made the requested change (adding myself to CODEOWNERS), so I think this is ready to merge!
@fsx950223 Anything else?
Description
Brief Description of the PR:
This is a PR for the EmbeddingBag op. Please don't merge it yet! Although it works, testing is incomplete and the file structure needs to be cleaned up. I'm opening it now just to get some initial feedback. I'll keep working on several of these issues (particularly 1, 3, 4, and 6 below), but I'll need feedback on 2 and 5, plus any other feedback you have for the rest!
Fixes #2201
Type of change
New layer and associated C++/CUDA op
Comments
There are a few issues that need to be resolved before I'd feel comfortable with this being merged. In no particular order, they are:
1. The CUDA/C++ code is split with the forward and backward passes in separate files, which is not how other TensorFlow or Addons ops do it. This is just a style thing; I'll merge them soon.
2. There are really two different entrypoints for users here: the function/op (analogous to tf.gather) and the layer (analogous to tf.keras.layers.Embedding). Like Embedding, the layer instantiates its own embeddings tensor and expects to be passed only indices and weights, whereas the function needs to be passed embeddings as well. Following PyTorch's naming conventions, I called the op embeddingbag and the layer EmbeddingBag, but this is almost certainly not what you want. What is the right way to name these two? Should I make the function/op a stateless Layer rather than just a function? (See the sketch after this list.)
3. No support for float16/bfloat16 yet.
4. Because context->AllocateTemp continuously segfaulted for me when I was compiling in the custom-op repo, I used AllocateOutput to make some dummy outputs and then used them as temp arrays. Compiling in tensorflow_addons itself seems much more stable, but I still need to go back and switch that properly to AllocateTemp.
5. The CUDA/C++ ops expect a weight tensor. When no weights are passed, the Python wrapper instantiates dummy weights with tf.ones_like(). Is this acceptable?
6. More tests! I don't have any gradient tests at all yet, and I should probably add additional tests with weird shapes.
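To illustrate issue 2, a rough sketch of the two entrypoints; the names, signatures, and compositional pooling are placeholders, not the PR's final API:

```python
import tensorflow as tf

def embedding_bag(indices, params, weights=None):
    # Function entrypoint (analogous to tf.gather): the caller supplies the
    # embedding matrix. Same compositional pooling sketched earlier in the thread.
    w = tf.ones_like(indices, dtype=params.dtype) if weights is None else weights
    return tf.reduce_sum(
        tf.gather(params, indices) * tf.expand_dims(w, -1), axis=1)

class EmbeddingBag(tf.keras.layers.Layer):
    # Layer entrypoint (analogous to tf.keras.layers.Embedding): owns its
    # embedding table, so callers pass only indices (and optional weights).
    def __init__(self, input_dim, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.embeddings = self.add_weight(
            name="embeddings", shape=[input_dim, output_dim])

    def call(self, indices, weights=None):
        return embedding_bag(indices, self.embeddings, weights)
```

A call like EmbeddingBag(1000, 64)(indices) would then mirror how tf.keras.layers.Embedding and PyTorch's nn.EmbeddingBag are used, while embedding_bag(indices, params) stays available for callers who manage their own tables.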