[Large Tensor] Fixed Embedding op #17599
Conversation
@mxnet-label-bot add [pr-awaiting-review]
@connorgoggins, can you paste the output of a case where the sparseEmbedding params are used? Are both used in the same example use case presented here?
@access2rohit great question. Here is an example of when sparseEmbedding params are used without my fix, and the resulting error (same error as with Embedding):
With my fix, the call above passes without any issues - output below:
src/operator/tensor/indexing_op.h (Outdated)
@@ -66,7 +66,7 @@ enum QuantizedEmbeddingOpResource {kTempSpace};
 struct SparseEmbeddingParam: public dmlc::Parameter<SparseEmbeddingParam> {
-  int input_dim;
+  index_t input_dim;
   int output_dim;
Would there be a case where output_dim also exceeds 2^32?
@apeforest excellent point, there would be. For example, the following call throws an error:
>>> mx.nd.Embedding(data=mx.nd.random_normal(shape=(1,)), weight=mx.nd.random_normal(shape=(1,2**32)), input_dim=1, output_dim=2**32)
mxnet.base.MXNetError: MXNetError: Invalid Parameter format for output_dim expect int but value='4294967296', in operator Embedding(name="", output_dim="4294967296", input_dim="1")
With my latest update (changing the dtype of output_dim to index_t), here is the result:
>>> mx.nd.Embedding(data=mx.nd.random_normal(shape=(1,)), weight=mx.nd.random_normal(shape=(1,2**32)), input_dim=1, output_dim=2**32)
[[ 1.6323917 -0.33354783 -1.7378405 ... -0.36648417 0.6363522
2.367109 ]]
<NDArray 1x4294967296 @cpu(0)>
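The parse failure above is a range problem: output_dim was declared as a 32-bit int, and 4294967296 (2^32) exceeds its maximum of 2^31 - 1, so the parameter parser rejects the value at parse time rather than silently truncating it. A small stand-alone sketch (plain Python; wrap_int32 is a hypothetical helper emulating two's-complement storage, not MXNet code) shows what cramming such a value into 32 bits would do:

```python
def wrap_int32(x):
    """Emulate storing x in a signed 32-bit int (two's complement)."""
    x &= 0xFFFFFFFF                       # keep only the low 32 bits
    return x - 2**32 if x >= 2**31 else x

INT32_MAX = 2**31 - 1

print(2**32 > INT32_MAX)  # True: 2^32 cannot be represented in 32 bits
print(wrap_int32(2**32))  # 0: the dimension would silently collapse
```

Rejecting the value outright is the safer failure mode; widening the field to a 64-bit index type removes the limit entirely.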
LGTM. Have you verified that your nightly test can run successfully on your local machine?
@apeforest thanks! The individual nightly tests I wrote for each op run successfully on my local machine. Running the full nightly test suite for each op now.
@apeforest update: the full suite of nightly tests passed on r5dn.24xl instances for every one of my PRs with additional tests added.
LGTM. Thanks for the good work.
@connorgoggins I retriggered the CI tests.
* Switched from int to index_t for input_dim
* Implemented fix for output_dim
* Added nightly test for Embedding
* Set const value for output dim
* More standardization via const param
Description
The Embedding op was previously breaking on large tensor (dimension >= 2^32) data. With the following input:
the following error was thrown:
To fix this issue, I modified indexing_op.h to switch from storing input_dim as an int to storing it as an index_t.
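Roughly, this change buys the following headroom (a back-of-the-envelope check in Python; index_t is assumed here to be a 64-bit signed type under MXNet's large-tensor build):

```python
INT32_MAX = 2**31 - 1   # ceiling of the old 32-bit `int` field
INT64_MAX = 2**63 - 1   # ceiling of a 64-bit index type

dim = 2**32             # the dimension that previously failed
print(dim > INT32_MAX)  # True: overflows the old field
print(dim <= INT64_MAX) # True: fits comfortably after the change
```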
After implementing my fix and rebuilding, the previous input command displayed the correct output:

Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
Comments
Tested on r5dn.24xl-ubuntu 16.04 and p2.16xl-ubuntu 16.04 with
Results
The key difference between CPU and GPU tests was the instance type (r5dn.24xl for CPU, p2.16xl for GPU). All relevant build flags remain the same, and both were tested using CPU context.
Single operator test - Embedding op (GPU)
Single operator test - Embedding op (CPU)
Full OpPerf test (GPU)
Full OpPerf test (CPU)
@apeforest @access2rohit @ChaiBapchya