Enabling Snowflake Arctic on Gaudi 3 #1719

Open · wants to merge 20 commits into main
Conversation

@pi314ever (Contributor) commented Jan 24, 2025

What does this PR do?

This PR enables snowflake-arctic-instruct on a single Gaudi 3 node. A single Gaudi 2 node with 8 cards does not have enough memory to load the whole model, so only Gaudi 3 was validated. Graph mode is not enabled yet, also due to memory issues.

This depends on synchronizing the Habana DeepSpeed fork to include deepspeedai/DeepSpeed#6856, which can be found in my branch here: https://github.com/pi314ever/DeepSpeed/tree/arctic-enabling-1.19.

Validated configurations:

  • Gaudi 3 x8 lazy mode
  • Gaudi 3 x8 lazy mode with KV caching

Before submitting

  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@imangohari1 (Contributor)

@pi314ever
Hi Daniel,
I am trying to get this running on 8x Gaudi 3 and it keeps crashing with the error RuntimeError: Common dimension sizes of matmul inputs should be the same. Got 896 and 7168.
Could you please:

  • share the commands you have tested this with?
  • try the command below and see what you get?
  • rebase your branch on OH main?

Thanks

Tests

python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py --model_name_or_path /data/igohari/codes/applications.hpc.workloads.pytorch.hpu-models/benchmarking/perfmaker/.cache/huggingface/hub/models--Snowflake--snowflake-arctic-instruct  --batch_size 1 --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --sdp_on_bf16
python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py --model_name_or_path /data/igohari/codes/applications.hpc.workloads.pytorch.hpu-models/benchmarking/perfmaker/.cache/huggingface/hub/models--Snowflake--snowflake-arctic-instruct  --batch_size 1 --use_kv_cache --max_new_tokens 100 --sdp_on_bf16

Both of the above, as of here, crash with:

[rank5]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/generation/utils.py", line 2449, in _sample
[rank5]:     outputs = self(
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank5]:     return self._call_impl(*args, **kwargs)
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1847, in _call_impl
[rank5]:     return inner()
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1793, in inner
[rank5]:     result = forward_call(*args, **kwargs)
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/models/snowflake/modeling_arctic.py", line 1525, in forward
[rank5]:     outputs = self.model(
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank5]:     return self._call_impl(*args, **kwargs)
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1847, in _call_impl
[rank5]:     return inner()
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1793, in inner
[rank5]:     result = forward_call(*args, **kwargs)
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/models/snowflake/modeling_arctic.py", line 1179, in forward
[rank5]:     layer_outputs = decoder_layer(
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank5]:     return self._call_impl(*args, **kwargs)
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1847, in _call_impl
[rank5]:     return inner()
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1793, in inner
[rank5]:     result = forward_call(*args, **kwargs)
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/models/snowflake/modeling_arctic.py", line 851, in forward
[rank5]:     hidden_states = self.residual_mlp(hidden_states)
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank5]:     return self._call_impl(*args, **kwargs)
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1847, in _call_impl
[rank5]:     return inner()
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1793, in inner
[rank5]:     result = forward_call(*args, **kwargs)
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/models/snowflake/modeling_arctic.py", line 685, in forward
[rank5]:     current_hidden_states = self.w2(current_hidden_states)
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank5]:     return self._call_impl(*args, **kwargs)
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1847, in _call_impl
[rank5]:     return inner()
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1793, in inner
[rank5]:     result = forward_call(*args, **kwargs)
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/layers.py", line 142, in forward
[rank5]:     output = torch.matmul(input, self.weight.transpose(-1, -2))
[rank5]: RuntimeError: Common dimension sizes of matmul inputs should be the same. Got 896 and 7168
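
For context, a minimal sketch of how this class of mismatch can arise (an assumption on my part, not a confirmed diagnosis): the failing call is DeepSpeed's output = torch.matmul(input, self.weight.transpose(-1, -2)), and since 7168 / 8 = 896 with world_size 8, the shapes are consistent with one matmul operand having been sharded for tensor parallelism while the other still carries the full dimension.

```python
# Minimal sketch (assumption, not a confirmed diagnosis): with world_size = 8, 7168 / 8 = 896,
# so the mismatch looks like one operand was sharded across ranks while the other was not.
import torch

world_size = 8
full_dim = 7168                      # dimension reported in the error
shard_dim = full_dim // world_size   # 896, the other dimension in the error
out_features = 1024                  # placeholder output size, for illustration only

x = torch.randn(1, 1, shard_dim)                # activation carrying only its 896-wide shard
weight = torch.randn(out_features, full_dim)    # unsharded weight expecting 7168 input features

try:
    # Same class of failure as in the traceback above; the exact error text differs by backend.
    torch.matmul(x, weight.transpose(-1, -2))
except RuntimeError as e:
    print(e)
```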

pi314ever and others added 18 commits on January 30, 2025, each signed off by Daniel Huang <daniel1.huang@intel.com>.
@pi314ever (Contributor, Author) commented Jan 31, 2025

@imangohari1 this depends on a patched version of deepspeed located here: https://github.com/pi314ever/DeepSpeed/tree/arctic-enabling-1.19

Test command:

python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py --model_name_or_path Snowflake/snowflake-arctic-instruct --bf16 --use_kv_cache --max_new_tokens 128 --batch_size 1

Note

Graph mode is not enabled yet due to memory issues during graph compilation.

Specifically, the steps to reproduce are:

  1. Install the pip dependencies: pip install -r requirements.txt
  2. Install the custom DeepSpeed: pip install git+https://github.com/pi314ever/DeepSpeed@arctic-enabling-1.19 (a quick install check is sketched after this list)
  3. Run the test command from above.
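
As a quick sanity check between steps 2 and 3 (a convenience sketch, not part of the PR), one can confirm which DeepSpeed build ended up installed:

```python
# Convenience check (not part of the PR): confirm which DeepSpeed build is installed,
# to make sure the patched arctic-enabling-1.19 fork is the one being picked up.
import deepspeed

print(deepspeed.__version__)  # version string of the installed build
print(deepspeed.__file__)     # install location of the package
```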

Expected performance results on Gaudi 3:

| Batch size | Max new tokens | Throughput (tokens/s) |
|-----------:|---------------:|----------------------:|
|          1 |            256 |                 0.466 |
|          1 |            512 |                 0.545 |
|          1 |           1024 |                 0.484 |
|          2 |            128 |                 1.019 |
|          2 |            256 |                 1.055 |
|          2 |            512 |                 1.006 |
|          4 |            128 |                 2.034 |
|          4 |            256 |                 1.856 |
|          8 |            128 |                 3.575 |
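
For reference, these figures are presumably computed as generated tokens per second of wall time; a minimal sketch of that computation, under that assumption (the exact measurement in run_generation.py may differ):

```python
# Minimal sketch (assumption: throughput = batch_size * max_new_tokens / wall time, which
# matches the "Throughput (including tokenization)" figure quoted later in this thread).
import time

def measure_throughput(generate_batch, batch_size: int, max_new_tokens: int) -> float:
    # generate_batch is a placeholder for the tokenize-plus-generate call of run_generation.py
    start = time.perf_counter()
    generate_batch()
    elapsed = time.perf_counter() - start
    return batch_size * max_new_tokens / elapsed

# Example: batch_size=1 with max_new_tokens=256 at 0.466 tokens/s implies roughly 549 s of wall time.
```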

@imangohari1 (Contributor)

> @imangohari1 this depends on a patched version of deepspeed located here: https://github.com/pi314ever/DeepSpeed/tree/arctic-enabling-1.19
> […]
> 2. Install custom DeepSpeed: `pip install https://github.com/pi314ever/DeepSpeed@arctic-enabling-1.19`
> […]

@pi314ever
Thanks for the details.
I wasn't able to install the forked DeepSpeed with the shared command, but I cloned your repo and installed it locally.

A few follow-ups:

  • For bs=1, max_new_tokens=256 on 8x Gaudi 3, I am seeing Throughput (including tokenization) = 0.8442290508361003 tokens/second, which is different from what you shared. Any thoughts?
  • The --sdp_on_bf16 argument leads to OOM. Any thoughts?

@pi314ever (Contributor, Author)

I updated the command for installing the custom DeepSpeed; I had forgotten the git+ prefix.

I have noticed the performance varying quite a bit, but I'm not entirely sure of the reason. I suspect node configuration or firmware version, but it is hard to tell.

I tested --sdp_on_bf16 with bs=1 and output length 256 and did not run into OOM issues. I am not too familiar with how this flag affects computation/memory, but I do know that using graph mode will cause OOM during graph compilation. Are you using graph mode for that run?

@pi314ever (Contributor, Author) commented Jan 31, 2025

@imangohari1 Running it again with --sdp_on_bf16, I get a throughput of 0.926 tokens/s. It seems the performance varies a lot.

Correction: this was for batch size 2 with output length 128, not batch size 1 with output length 256. The result is in the ballpark of my table above.

@imangohari1 (Contributor)

@pi314ever
This PR is fine, but it needs to wait until the DeepSpeed fork is rebased.
The fact that this can only run on 8x Gaudi 3 makes its scope somewhat limited.
The performance variation is concerning as well.

@regisss WDYT?

@regisss (Collaborator) commented Feb 4, 2025

@pi314ever Do you know if this PR will be compatible with the version of DeepSpeed that will be released with Synapse 1.20?

@pi314ever (Contributor, Author)

@regisss This PR should be compatible with Synapse 1.20.

@imangohari1 (Contributor)

> @regisss This PR should be compatible with Synapse 1.20.

@pi314ever FYI @regisss
I am looking at the current DeepSpeed release for 1.20 and the changes for this PR are NOT included there. We need to make sure the changes are included before merging this.

@imangohari1 (Contributor)

@regisss
@libinta confirmed that the deepspeed-fork won't be rebased on top of the public repo in the 1.20 release, so it is likely that the changes needed for this PR won't be included in 1.20.
I also tried this PR with the current internal deepspeed-fork RC for 1.20 and it crashes.

We need to push this PR out to 1.21 or later.
Thanks.
