
Use tf.shape for graph mode support #6

Merged: 2 commits into NVIDIA-Merlin:main on Feb 23, 2023

Conversation

edknv (Contributor) commented on Feb 14, 2023

Why

With custom layers/models in graph mode, TensorFlow seems to have trouble inferring shapes because .shape is used in call() instead of tf.shape(), and I'm running into errors such as:

[1,0]<stdout>:E                               TypeError: Exception encountered when calling layer "distributed_embedding" (type DistributedEmbedding).
[1,0]<stdout>:E
[1,0]<stdout>:E                               in user code:
[1,0]<stdout>:E
[1,0]<stdout>:E                                   File "/usr/local/lib/python3.8/dist-packages/distributed_embeddings/python/layers/dist_model_parallel.py", line 503, in call  *
[1,0]<stdout>:E                                       outputs = self._call_base(inputs)
[1,0]<stdout>:E                                   File "/usr/local/lib/python3.8/dist-packages/distributed_embeddings/python/layers/dist_model_parallel.py", line 279, in _call_base  *
[1,0]<stdout>:E                                       global_splits.append(sum(local_splits[-1]))
[1,0]<stdout>:E
[1,0]<stdout>:E                                   TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
[1,0]<stdout>:E
[1,0]<stdout>:E
[1,0]<stdout>:E                               Call arguments received by layer "distributed_embedding" (type DistributedEmbedding):
[1,0]<stdout>:E                                 • inputs=['tf.Tensor(shape=(None, None), dtype=int64)', 'tf.Tensor(shape=(None, None), dtype=int64)']

We've run into this error while trying to integrate distributed-embeddings into Merlin Models (NVIDIA-Merlin/models#974).
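For illustration, here is a minimal sketch that reproduces the same failure mode outside the library (the function and shapes are hypothetical, not taken from dist_model_parallel.py):

import tensorflow as tf

# Under tf.function with unknown dimensions, x.shape is the static shape
# (None, None), so Python arithmetic over its entries fails at trace time.
@tf.function(input_signature=[tf.TensorSpec(shape=(None, None), dtype=tf.int64)])
def total_size(x):
  return sum(x.shape)  # 0 + None -> TypeError: 'int' and 'NoneType'

total_size(tf.zeros((2, 3), dtype=tf.int64))  # raises TypeError during tracing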

From the TensorFlow docs:

tf.shape and Tensor.shape should be identical in eager mode. Within tf.function or within a compat.v1 context, not all dimensions may be known until execution time. Hence, when defining custom layers and models for graph mode, prefer the dynamic tf.shape(x) over the static x.shape.

This PR proposes replacing .shape with tf.shape(). According to the docs quoted above, tf.shape() should behave the same as .shape in eager mode while also working correctly in graph mode. With the changes in this PR, I've been able to run the custom model without errors.
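As a sketch of the fixed pattern (same hypothetical function as above): tf.shape(x) returns a dynamic shape tensor that is resolved at execution time, so the reduction stays inside the graph instead of running on unknown static dimensions.

import tensorflow as tf

@tf.function(input_signature=[tf.TensorSpec(shape=(None, None), dtype=tf.int64)])
def total_size(x):
  # tf.shape(x) is a rank-1 int32 tensor computed at execution time,
  # so unknown static dimensions are no longer a problem during tracing.
  return tf.reduce_sum(tf.shape(x))

print(total_size(tf.zeros((2, 3), dtype=tf.int64)))  # tf.Tensor(5, shape=(), dtype=int32)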

Testing

Manually ran

python distributed_embeddings/python/layers/dist_model_parallel_test.py
python tests/dist_model_parallel_test.py

This PR also includes a variable name change that makes the tests in tests/dist_model_parallel_test.py pass.

edknv (Contributor, Author) commented on Feb 14, 2023

@FDecaYed @skyw It looks like I don't have access to assign reviewers. Can you please review this?

FDecaYed (Collaborator) commented:
@edknv Thanks for the contribution. We've seen this before and I agree it needs to be fixed.
There is one issue: from our testing, tf.shape() is not always strictly better than tensor.shape, at least in the current TF version, because it can introduce a blocking copy between host and device.
One way to handle the change is something like

# to get shape[0]
dim0 = tf.shape(tensor)[0] if tensor.shape[0] is None else tensor.shape[0]
# to get shape
tshape = tf.shape(tensor) if None in tensor.shape else tensor.shape
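One way to package that pattern is a small helper that prefers static dimensions and falls back to tf.shape() only for the unknown ones (the helper name is illustrative, not part of this PR, and it assumes the tensor's rank is known):

import tensorflow as tf

def mixed_shape(tensor):
  # Per-dimension shape: static ints where known, tf.shape() entries otherwise.
  # Using static dimensions where available avoids the blocking host/device
  # copy that tf.shape() can introduce.
  static = tensor.shape.as_list()  # requires a known rank
  dynamic = tf.shape(tensor)
  return [dynamic[i] if dim is None else dim for i, dim in enumerate(static)]

# e.g. mixed_shape(tensor)[0] gives dim0 from the snippet above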

FDecaYed merged commit 6c401c3 into NVIDIA-Merlin:main on Feb 23, 2023