Add test that exports to MLIR a small sharded Llama model #220
Conversation
Could you include a link to the export failure in a gist? We should be able to know why the export is failing. There is a good chance we are successfully generating IR and just failing during the conversion from […]
        torch.testing.assert_close(
            sharded_prefill_result, expected_prefill_result, atol=1e-3, rtol=1e-2
        )

    def make_decode_args(self, model: PagedLlamaModelV1) -> Dict[str, Any]:
        seq_lens = torch.randint(
Be specific in random testing where possible. If we want to check variable sequence length we should pick a set of numbers instead of randomizing.
We are setting the random seed at the test setup stage, so it should be consistent between runs. Isn't the torch random generator going to be stable between releases?
I changed them to fixed values.
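For context, a minimal sketch of what fixed, hand-picked sequence lengths can look like in a test fixture; the class and attribute names below are hypothetical and not taken from this repository:

```python
import unittest

import torch


class ShardedLlamaTest(unittest.TestCase):
    """Hypothetical fixture; the point is the hand-picked lengths."""

    def setUp(self):
        torch.manual_seed(12345)
        self.batch_size = 3
        # Fixed, unequal lengths instead of torch.randint, so the test
        # deterministically exercises ragged batches and padding.
        self.prefill_seq_lens = torch.tensor([14, 9, 3], dtype=torch.int64)
```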
-            high=vocabulary_size,
-            size=[batch_size, 1],
+            high=self.vocabulary_size,
+            size=[self.batch_size, 1],
             dtype=torch.int32,
         )
         decode_seq_lens = torch.randint(
Same as above. We would expect the decode sequence length to be one more than the prefill sequence length. We should maintain that behavior.
Regarding the sequence length, I think that is the case, but the cache is being regenerated anyway. In this test it does not really matter whether we get plausible numbers, just that the numbers are close. I can make it behave more like the "real" world.
I made them prefill_seq_lens + 1.
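A minimal sketch of that relationship; the helper name is hypothetical, and only the prefill_seq_lens + 1 arithmetic reflects the change described here:

```python
import torch


def make_decode_seq_lens(prefill_seq_lens: torch.Tensor) -> torch.Tensor:
    # Decode extends each sequence by exactly one token, so the decode lengths
    # are derived from the prefill lengths instead of being re-randomized.
    return prefill_seq_lens + 1


prefill_seq_lens = torch.tensor([14, 9, 3], dtype=torch.int64)
decode_seq_lens = make_decode_seq_lens(prefill_seq_lens)  # tensor([15, 10,  4])
```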
Given we have a sharded attention block test, let's keep both around and rename the original one to sharded_attention_test.py. It would be better to increase test coverage rather than replace it. We can always delete it in the future.
I actually had the name wrong in my previous PR. This test checks the whole model, not just the attention. I refactored a lot to avoid replicating code in the export test.
The test calls just […]
@rsuderman is this your PR llvm/torch-mlir#3738 that may fix the type conversion issue?
I added a link with the full error in the description.
I tried with the changes from llvm/torch-mlir#3738, but they do not solve the type conversion issue.
The decode step requires exporting in non-strict torch mode due to pytorch/pytorch#135061. This export required extending the registration functionality of our custom tensor types by providing `flatten_with_keys_fn`. This also required bumping the PyTorch version to >=2.4 for the other export tests. The export to MLIR fails with
TypeError: Unsupported torch type conversion for !torch.vtensor<[3,1,7],complex<f32>>
Needs further debugging.
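For readers unfamiliar with the two mechanisms mentioned above, here is a minimal, self-contained sketch. The ShardedTensorLike container and Wrapper module are invented for illustration and are not the types from this repository; only the torch.utils._pytree.register_pytree_node arguments (including `flatten_with_keys_fn`) and torch.export.export(..., strict=False) are real PyTorch (>=2.4) APIs:

```python
import torch
from torch.utils._pytree import GetAttrKey, register_pytree_node


class ShardedTensorLike:
    """Hypothetical container holding one tensor per shard."""

    def __init__(self, shards):
        self.shards = list(shards)


register_pytree_node(
    ShardedTensorLike,
    # flatten_fn: return (children, static_context)
    lambda t: (t.shards, None),
    # unflatten_fn: rebuild the container from (children, static_context)
    lambda values, context: ShardedTensorLike(values),
    serialized_type_name="ShardedTensorLike",
    # flatten_with_keys_fn: like flatten_fn, but pairs each child with a key
    # so non-strict export can name leaves in its input specs and errors.
    flatten_with_keys_fn=lambda t: (
        [(GetAttrKey(f"shards[{i}]"), s) for i, s in enumerate(t.shards)],
        None,
    ),
)


class Wrapper(torch.nn.Module):
    def forward(self, x: ShardedTensorLike):
        return sum(s.sum() for s in x.shards)


example = ShardedTensorLike([torch.randn(2, 3), torch.randn(2, 3)])
# strict=False selects the non-strict (no TorchDynamo) export path, which is
# what the description above refers to for the decode step.
exported = torch.export.export(Wrapper(), (example,), strict=False)
```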
Add test that exports to MLIR a small sharded Llama model
The decode step requires exporting in non-strict torch mode due to
pytorch/pytorch#135061
This export required extending the registration functionality of our custom tensor types by providing flatten_with_keys_fn. This also required bumping the PyTorch version to >=2.4 for the other export tests.
The export to MLIR fails with
TypeError: Unsupported torch type conversion for !torch.vtensor<[3,1,7],complex<f32>>
Needs further debugging.
Detailed error here.