[Bug Report] TTNN typecast operation fails when the tensor is on host #16279

Open
sdjordjevicTT opened this issue Dec 23, 2024 · 8 comments
Labels
bug Something isn't working op_cat: copy

Comments

@sdjordjevicTT
Contributor

Describe the bug
TTNN typecast operation fails with the following message when the tensor is present on the host:

2024-12-23 14:17:39,026 - ERROR - ERROR: test=/__w/tt-mlir/tt-mlir/build/test/ttmlir/Silicon/TTNN/embedding/Output/simple_embedding.mlir.tmp.ttnn experienced an error with exception=TT_THROW @ /__w/tt-mlir/tt-mlir/third_party/tt-metal/src/tt-metal/ttnn/cpp/ttnn/tensor/tensor.hpp:329: tt::exception
info:
Cannot get the device from a tensor with host storage

To Reproduce
Steps to reproduce the behavior:

  1. Run the following simple TTNN test:
import ttnn
import torch

device = ttnn.open_device(device_id=0)

torch_tensor = torch.rand(32, 32, dtype=torch.float32)
ttnn_tensor_cpu = ttnn.from_torch(torch_tensor, layout=ttnn.ROW_MAJOR_LAYOUT)

ttnn_tensor = ttnn.typecast(ttnn_tensor_cpu, dtype=ttnn.uint32)

ttnn.close_device(device)

The test should fail with the following exception:

Always | FATAL    | Cannot get the device from a tensor with host storage
Traceback (most recent call last):
  File "/localdev/sdjordjevic/src/tt-metal/python_test.py", line 9, in <module>
    ttnn_tensor = ttnn.typecast(ttnn_tensor_cpu, dtype=ttnn.uint32)
  File "/localdev/sdjordjevic/src/tt-metal/ttnn/ttnn/decorators.py", line 329, in __call__
    return self.function(*function_args, **function_kwargs)
RuntimeError: TT_THROW @ /localdev/sdjordjevic/src/tt-metal/ttnn/cpp/ttnn/tensor/tensor.hpp:329: tt::exception
info:
Cannot get the device from a tensor with host storage
backtrace:

Expected behavior
The typecast op should not fail. When the tensor is on the host, typecast should perform the conversion on the host under the hood.

Screenshots
/

Please complete the following environment information:

  • OS: both Ubuntu 20 and Ubuntu 22
  • Version of software: latest main

Additional context
Instead of using the typecast op, we can use to_dtype to convert on the host, but I believe that the semantics of typecast should cover this case as well.
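
Until that is supported, the workaround can be wrapped in a small helper. This is only a minimal sketch, not how TTNN implements anything: `typecast_any_storage` is a hypothetical name, the `ttnn` module is passed in explicitly so the helper can be exercised without hardware, and it assumes that `ttnn.is_tensor_storage_on_device` is available in your build and that `ttnn.to_dtype` and `ttnn.typecast` both take `(tensor, dtype)`.

```python
def typecast_any_storage(ttnn, tensor, dtype):
    """Cast `tensor` to `dtype`, picking the op based on where the tensor lives.

    Hypothetical helper. `ttnn` is the imported ttnn module, passed in
    explicitly. Assumes ttnn.is_tensor_storage_on_device exists in your build.
    """
    if ttnn.is_tensor_storage_on_device(tensor):
        # Device tensor: the regular typecast op applies.
        return ttnn.typecast(tensor, dtype)
    # Host tensor: typecast currently throws, so convert on the host instead.
    return ttnn.to_dtype(tensor, dtype)
```

With the reproduction above, the failing call would become `typecast_any_storage(ttnn, ttnn_tensor_cpu, ttnn.uint32)`.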

@sdjordjevicTT
Contributor Author

Hi @sjameelTT, we are currently facing this issue with our MLIR-based compiler. @jnie-TT mentioned that you spoke with him and have plans to improve the typecast operation to support casting on the host. I've created this issue so we can track our progress on this matter.

@sjameelTT
Contributor

Any reason why we can't do this on device?

@jnie-TT
Contributor

jnie-TT commented Jan 2, 2025

@sjameelTT Because typecast on device only works with tilized data, and tilize on device only works with specific data types, we sometimes need to fall back to host. This issue is related: #16270
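
The constraint can be sketched as a small decision function. Everything here is illustrative: `choose_cast_path` is a hypothetical name, the returned strings are labels invented for this sketch, and `tilizable_dtypes` stands in for whichever data types device tilize supports, which this thread does not enumerate.

```python
def choose_cast_path(is_tilized, src_dtype, tilizable_dtypes):
    """Decide where a typecast can run under the constraints described above.

    Hypothetical sketch. `tilizable_dtypes` is a caller-supplied set of the
    data types that tilize on device supports; it is an assumption, not a
    list taken from TTNN.
    """
    if is_tilized:
        # Already tilized: typecast on device works directly.
        return "device"
    if src_dtype in tilizable_dtypes:
        # Row-major data whose dtype device tilize supports: tilize, then cast.
        return "tilize_then_device"
    # Row-major data that cannot be tilized on device: fall back to host.
    return "host"
```

The last branch is the case the compiler hits today, which is why a host path for typecast is being requested.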

@sjameelTT
Contributor

I see, I remember now. We definitely want to support typecast on row major and tilize for all data types before adding typecast on host.

@sdjordjevicTT
Contributor Author

Great, thanks for letting us know. Let's keep this issue open to track the progress.

@sdjordjevicTT
Contributor Author

Hi, @sjameelTT. Are there any updates regarding Typecast op enhancements?

@ntarafdar
Contributor

hey @sdjordjevicTT sorry I think this was improperly assigned. @ayerofieiev-tt and @TT-BrianLiu are in charge of tensor creation and storage (if I'm not mistaken). Can you guys triage this?

@sdjordjevicTT
Contributor Author

Thanks @ntarafdar for bringing @ayerofieiev-tt and @TT-BrianLiu into the conversation.
