Enabling Snowflake Arctic on Gaudi 3 #1719

Open · wants to merge 20 commits into main
Conversation

@pi314ever (Contributor) commented Jan 24, 2025

What does this PR do?

This PR enables snowflake-arctic-instruct on a single Gaudi 3 node. A single Gaudi 2 node with 8 cards does not have enough memory to load the whole model, so only Gaudi 3 was validated. Graph mode is not enabled yet, also due to memory issues.

This depends on synchronizing the Habana DeepSpeed fork to include deepspeedai/DeepSpeed#6856, which can be found in my branch here: https://github.com/pi314ever/DeepSpeed/tree/arctic-enabling-1.19.

Validated configurations:

  • Gaudi 3 x8 lazy mode
  • Gaudi 3 x8 lazy mode with KV caching

Before submitting

  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@imangohari1 (Contributor)

@pi314ever
Hi Daniel,
I am trying to get this running on 8x Gaudi 3 and it keeps crashing with the error RuntimeError: Common dimension sizes of matmul inputs should be the same. Got 896 and 7168.
Could you please:

  • share the commands you have tested this with?
  • try the command below and see what you get?
  • rebase your branch on OH main?

Thanks

Tests

python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py --model_name_or_path /data/igohari/codes/applications.hpc.workloads.pytorch.hpu-models/benchmarking/perfmaker/.cache/huggingface/hub/models--Snowflake--snowflake-arctic-instruct  --batch_size 1 --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --sdp_on_bf16
python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py --model_name_or_path /data/igohari/codes/applications.hpc.workloads.pytorch.hpu-models/benchmarking/perfmaker/.cache/huggingface/hub/models--Snowflake--snowflake-arctic-instruct  --batch_size 1 --use_kv_cache --max_new_tokens 100 --sdp_on_bf16

Both of the above, as of here, crash with:

[rank5]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/generation/utils.py", line 2449, in _sample
[rank5]:     outputs = self(
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank5]:     return self._call_impl(*args, **kwargs)
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1847, in _call_impl
[rank5]:     return inner()
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1793, in inner
[rank5]:     result = forward_call(*args, **kwargs)
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/models/snowflake/modeling_arctic.py", line 1525, in forward
[rank5]:     outputs = self.model(
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank5]:     return self._call_impl(*args, **kwargs)
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1847, in _call_impl
[rank5]:     return inner()
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1793, in inner
[rank5]:     result = forward_call(*args, **kwargs)
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/models/snowflake/modeling_arctic.py", line 1179, in forward
[rank5]:     layer_outputs = decoder_layer(
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank5]:     return self._call_impl(*args, **kwargs)
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1847, in _call_impl
[rank5]:     return inner()
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1793, in inner
[rank5]:     result = forward_call(*args, **kwargs)
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/models/snowflake/modeling_arctic.py", line 851, in forward
[rank5]:     hidden_states = self.residual_mlp(hidden_states)
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank5]:     return self._call_impl(*args, **kwargs)
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1847, in _call_impl
[rank5]:     return inner()
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1793, in inner
[rank5]:     result = forward_call(*args, **kwargs)
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/models/snowflake/modeling_arctic.py", line 685, in forward
[rank5]:     current_hidden_states = self.w2(current_hidden_states)
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank5]:     return self._call_impl(*args, **kwargs)
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1847, in _call_impl
[rank5]:     return inner()
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1793, in inner
[rank5]:     result = forward_call(*args, **kwargs)
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/layers.py", line 142, in forward
[rank5]:     output = torch.matmul(input, self.weight.transpose(-1, -2))
[rank5]: RuntimeError: Common dimension sizes of matmul inputs should be the same. Got 896 and 7168
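
For context, a minimal sketch of how this class of mismatch can arise (an assumption on my part, not a confirmed diagnosis): the failing call is DeepSpeed's output = torch.matmul(input, self.weight.transpose(-1, -2)), and since 7168 / 8 = 896 with world_size 8, the shapes are consistent with one matmul operand having been sharded for tensor parallelism while the other still carries the full dimension.

```python
# Minimal sketch (assumption, not a confirmed diagnosis): with world_size = 8, 7168 / 8 = 896,
# so the mismatch looks like one operand was sharded across ranks while the other was not.
import torch

world_size = 8
full_dim = 7168                      # dimension reported in the error
shard_dim = full_dim // world_size   # 896, the other dimension in the error
out_features = 1024                  # placeholder output size, for illustration only

x = torch.randn(1, 1, shard_dim)                # activation carrying only its 896-wide shard
weight = torch.randn(out_features, full_dim)    # unsharded weight expecting 7168 input features

try:
    # Same class of failure as in the traceback above; the exact error text differs by backend.
    torch.matmul(x, weight.transpose(-1, -2))
except RuntimeError as e:
    print(e)
```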

pi314ever and others added 18 commits on January 30, 2025, each signed off by Daniel Huang <daniel1.huang@intel.com>.
@pi314ever (Contributor, Author) commented Jan 31, 2025

@imangohari1 this depends on a patched version of deepspeed located here: https://github.com/pi314ever/DeepSpeed/tree/arctic-enabling-1.19

Test command:

python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py --model_name_or_path Snowflake/snowflake-arctic-instruct --bf16 --use_kv_cache --max_new_tokens 128 --batch_size 1

Note

Graph mode is not enabled yet due to memory issues during graph compilation.

Specifically, the steps to reproduce are:

  1. Install the pip dependencies: pip install -r requirements.txt
  2. Install the custom DeepSpeed: pip install git+https://github.com/pi314ever/DeepSpeed@arctic-enabling-1.19 (a quick install check is sketched after this list)
  3. Run the test command from above.
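
As a quick sanity check between steps 2 and 3 (a convenience sketch, not part of the PR), one can confirm which DeepSpeed build ended up installed:

```python
# Convenience check (not part of the PR): confirm which DeepSpeed build is installed,
# to make sure the patched arctic-enabling-1.19 fork is the one being picked up.
import deepspeed

print(deepspeed.__version__)  # version string of the installed build
print(deepspeed.__file__)     # install location of the package
```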

Expected performance results on Gaudi 3:

| Batch size | Max new tokens | Throughput (tokens/s) |
|-----------:|---------------:|----------------------:|
|          1 |            256 |                 0.466 |
|          1 |            512 |                 0.545 |
|          1 |           1024 |                 0.484 |
|          2 |            128 |                 1.019 |
|          2 |            256 |                 1.055 |
|          2 |            512 |                 1.006 |
|          4 |            128 |                 2.034 |
|          4 |            256 |                 1.856 |
|          8 |            128 |                 3.575 |
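
For reference, these figures are presumably computed as generated tokens per second of wall time; a minimal sketch of that computation, under that assumption (the exact measurement in run_generation.py may differ):

```python
# Minimal sketch (assumption: throughput = batch_size * max_new_tokens / wall time, which
# matches the "Throughput (including tokenization)" figure quoted later in this thread).
import time

def measure_throughput(generate_batch, batch_size: int, max_new_tokens: int) -> float:
    # generate_batch is a placeholder for the tokenize-plus-generate call of run_generation.py
    start = time.perf_counter()
    generate_batch()
    elapsed = time.perf_counter() - start
    return batch_size * max_new_tokens / elapsed

# Example: batch_size=1 with max_new_tokens=256 at 0.466 tokens/s implies roughly 549 s of wall time.
```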

@imangohari1 (Contributor)

> @imangohari1 this depends on a patched version of deepspeed located here: https://github.com/pi314ever/DeepSpeed/tree/arctic-enabling-1.19
> […]
> 2. Install custom DeepSpeed: `pip install https://github.com/pi314ever/DeepSpeed@arctic-enabling-1.19`
> […]

@pi314ever
Thanks for the details.
I wasn't able to install the forked DeepSpeed with the shared command, but I cloned your repo and installed it locally.

A few follow-ups:

  • For bs=1, max_new_tokens=256 on 8x Gaudi 3, I am seeing Throughput (including tokenization) = 0.8442290508361003 tokens/second, which is different from what you shared. Any thoughts?
  • The --sdp_on_bf16 argument leads to OOM. Any thoughts?

@pi314ever (Contributor, Author)

I updated the command for installing the custom DeepSpeed; I had forgotten the git+ prefix.

I have noticed the performance varying quite a bit, but I'm not entirely sure of the reason. I suspect node configuration or firmware version, but it is hard to tell.

I tested --sdp_on_bf16 with bs=1 and output length 256 and did not run into OOM issues. I am not too familiar with how this flag affects computation/memory, but I do know that using graph mode will cause OOM during graph compilation. Are you using graph mode for that run?

@pi314ever (Contributor, Author) commented Jan 31, 2025

@imangohari1 Running it again with --sdp_on_bf16, I get a throughput of 0.926 tokens/s. It seems the performance varies a lot.

Correction: this was for batch size 2 with output length 128, not batch size 1 with output length 256. The result is in the ballpark of my table above.

@imangohari1 (Contributor)

@pi314ever
This PR is fine, but it needs to wait until the DeepSpeed fork is rebased.
The fact that this can only run on 8x Gaudi 3 makes its scope somewhat limited.
The performance variation is concerning as well.

@regisss WDYT?

@regisss (Collaborator) commented Feb 4, 2025

@pi314ever Do you know if this PR will be compatible with the version of DeepSpeed that will be released with Synapse 1.20?

@pi314ever (Contributor, Author)

@regisss This PR should be compatible with Synapse 1.20.

@imangohari1 (Contributor)

> @regisss This PR should be compatible with Synapse 1.20.

@pi314ever FYI @regisss
I am looking at the current DeepSpeed release for 1.20 and the changes for this PR are NOT included there. We need to make sure the changes are included before merging this.

@imangohari1 (Contributor)

@regisss
@libinta confirmed that the deepspeed-fork won't be rebased on top of the public repo in the 1.20 release, so it is likely that the changes needed for this PR won't be included in 1.20.
I also tried this PR with the current internal deepspeed-fork RC for 1.20 and it crashes.

We need to push this PR out to 1.21 or later.
Thanks.
