[Large Tensor] Fixed Embedding op #17599
Conversation
@mxnet-label-bot add [pr-awaiting-review]
@connorgoggins, can you paste the output of a case where the sparseEmbedding params are used? Are both used in the same example use case presented here?
@access2rohit great question. Here is an example of when sparseEmbedding params are used without my fix, and the resulting error (same error as with Embedding):
With my fix, the call above passes without any issues - output below:
src/operator/tensor/indexing_op.h (Outdated)
@@ -66,7 +66,7 @@ enum QuantizedEmbeddingOpResource {kTempSpace};
 struct SparseEmbeddingParam: public dmlc::Parameter<SparseEmbeddingParam> {
-  int input_dim;
+  index_t input_dim;
   int output_dim;
Would there be a case where output_dim also exceeds 2^32?
@apeforest excellent point, there would be. For example, the following call throws an error:
>>> mx.nd.Embedding(data=mx.nd.random_normal(shape=(1,)), weight=mx.nd.random_normal(shape=(1,2**32)), input_dim=1, output_dim=2**32)
mxnet.base.MXNetError: MXNetError: Invalid Parameter format for output_dim expect int but value='4294967296', in operator Embedding(name="", output_dim="4294967296", input_dim="1")
With my latest update (changing the dtype of output_dim to index_t), here is the result:
>>> mx.nd.Embedding(data=mx.nd.random_normal(shape=(1,)), weight=mx.nd.random_normal(shape=(1,2**32)), input_dim=1, output_dim=2**32)
[[ 1.6323917 -0.33354783 -1.7378405 ... -0.36648417 0.6363522
2.367109 ]]
<NDArray 1x4294967296 @cpu(0)>
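The parse failure above is a range problem: output_dim was declared as a 32-bit int, and 4294967296 (2^32) exceeds its maximum of 2^31 - 1, so the parameter parser rejects the value at parse time rather than silently truncating it. A small stand-alone sketch (plain Python; wrap_int32 is a hypothetical helper emulating two's-complement storage, not MXNet code) shows what cramming such a value into 32 bits would do:

```python
def wrap_int32(x):
    """Emulate storing x in a signed 32-bit int (two's complement)."""
    x &= 0xFFFFFFFF                       # keep only the low 32 bits
    return x - 2**32 if x >= 2**31 else x

INT32_MAX = 2**31 - 1

print(2**32 > INT32_MAX)  # True: 2^32 cannot be represented in 32 bits
print(wrap_int32(2**32))  # 0: the dimension would silently collapse
```

Rejecting the value outright is the safer failure mode; widening the field to a 64-bit index type removes the limit entirely.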
LGTM. Have you verified that your nightly test can run successfully on your local machine?
@apeforest thanks! The individual nightly tests I wrote for each op run successfully on my local machine. Running the full nightly test suite for each op now.
@apeforest update: the full suite of nightly tests passed on r5dn.24xl instances for every one of my PRs with additional tests added.
LGTM. Thanks for the good work.
@connorgoggins I retriggered the CI tests.
* Switched from int to index_t for input_dim
* Implemented fix for output_dim
* Added nightly test for Embedding
* Set const value for output dim
* More standardization via const param
Description
The Embedding op was previously breaking on large tensor (dimension >= 2^32) data. With the following input:
the following error was thrown:
To fix this issue, I modified indexing_op.h to switch from storing input_dim as an int to storing it as an index_t.
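Roughly, this change buys the following headroom (a back-of-the-envelope check in Python; index_t is assumed here to be a 64-bit signed type under MXNet's large-tensor build):

```python
INT32_MAX = 2**31 - 1   # ceiling of the old 32-bit `int` field
INT64_MAX = 2**63 - 1   # ceiling of a 64-bit index type

dim = 2**32             # the dimension that previously failed
print(dim > INT32_MAX)  # True: overflows the old field
print(dim <= INT64_MAX) # True: fits comfortably after the change
```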
After implementing my fix and rebuilding, the previous input command displayed the correct output:

Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
Comments
Tested on r5dn.24xl-ubuntu 16.04 and p2.16xl-ubuntu 16.04 with
Results
The key difference between CPU and GPU tests was the instance type (r5dn.24xl for CPU, p2.16xl for GPU). All relevant build flags remain the same, and both were tested using CPU context.
Single operator test - Embedding op (GPU)
Single operator test - Embedding op (CPU)
Full OpPerf test (GPU)
Full OpPerf test (CPU)
@apeforest @access2rohit @ChaiBapchya