
[Feature] Enable Intel XPU support #839

Merged: 34 commits merged into huggingface:main on Oct 31, 2023
Conversation

abhilash1910
Contributor

Thanks for creating this library. As part of our effort to streamline Hugging Face support on Intel devices, this PR is an important step. With support already enabled in peft, accelerate, and transformers, this would be a good addition so that TRL runs out of the box on our devices.
cc @younesbelkada @pacman100
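
For context on what "XPU support" means at the code level, here is a minimal, hypothetical sketch of backend-agnostic device selection (this is not the actual PR diff; the helper name is illustrative, and `torch.xpu` is only exposed when Intel XPU support for PyTorch is installed):

```python
import torch


def get_best_device() -> torch.device:
    """Pick an available accelerator, falling back to CPU.

    Illustrative helper only: `torch.xpu` exists only on builds with
    Intel XPU support (e.g. via intel_extension_for_pytorch), so we
    probe for it with hasattr() before calling is_available().
    """
    if torch.cuda.is_available():
        return torch.device("cuda")
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return torch.device("xpu")
    return torch.device("cpu")
```

Probing with `hasattr` keeps the same code path working on CUDA, XPU, and CPU-only installs, which is the spirit of the changes discussed below.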

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

Contributor

@younesbelkada left a comment


Thanks a lot for your great efforts enabling TRL for Intel XPU devices!
My question is about backward compatibility with this PR: it seems you have removed optimize_cuda_cache, which makes the library not backward compatible for users who pass that argument. Would it be possible to add a new attribute in PPOConfig, optimize_device_cache, that is initialized with the same value as optimize_cuda_cache, and raise a warning telling users that optimize_cuda_cache is going to be deprecated in the future?
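
The suggested deprecation pattern could look roughly like this (a minimal sketch assuming a dataclass-style PPOConfig; everything except the two attribute names is illustrative and not taken from the actual PR):

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class PPOConfig:
    """Sketch of a backward-compatible rename of optimize_cuda_cache."""

    # New, device-agnostic name.
    optimize_device_cache: bool = False
    # Old name kept as an alias; None means "not set by the user".
    optimize_cuda_cache: Optional[bool] = None

    def __post_init__(self):
        if self.optimize_cuda_cache is not None:
            warnings.warn(
                "`optimize_cuda_cache` is deprecated; use "
                "`optimize_device_cache` instead.",
                DeprecationWarning,
            )
            # Initialize the new attribute from the old one so existing
            # callers keep their behavior.
            self.optimize_device_cache = self.optimize_cuda_cache
```

Old code passing `PPOConfig(optimize_cuda_cache=True)` keeps working but sees a `DeprecationWarning`, while new code uses `optimize_device_cache` directly.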

@abhilash1910
Contributor Author

Thanks @younesbelkada for the suggestion. Could you help re-trigger the CI and re-review? Thanks

@abhilash1910
Contributor Author

@younesbelkada could you help review and re-trigger the CI? I see the previous CI failure was arising due to mps. I ran the tests locally, and they seem to pass.

Contributor

@younesbelkada left a comment


Thanks again! I would like @lvwerra to have a second look if possible

@younesbelkada requested a review from lvwerra on October 10, 2023, 08:09
Member

@lvwerra left a comment


Generally looks good to me. Have you had a chance to run the sentiment script with XPU to make sure everything works? Would you mind sharing the learning curves so we can compare to GPU runs?

trl/trainer/ppo_config.py (review thread resolved)
@abhilash1910
Contributor Author

abhilash1910 commented Oct 11, 2023

Generally looks good to me. Have you had a chance to run the sentiment script with XPU to make sure everything works? Would you mind sharing the learning curves so we can compare to GPU runs?

Yes, we are running all the example scripts on our devices. I can share the loss curves for different tasks here. Some of the performance data is internal, but we can share the functionality aspects.
@lvwerra @younesbelkada could you help trigger the CI? Thanks

@lvwerra
Member

lvwerra commented Oct 11, 2023

Could you, for example, run python examples/scripts/sentiment_tuning.py and post the reward/KL curves? That's what we usually use for regression checks.

@abhilash1910
Contributor Author

@lvwerra @younesbelkada I tried dpo.py as a training sanity check on our GPUs; the wandb log: https://wandb.ai/abhilash-majumder/huggingface/runs/chj3jbgf?workspace=user-abhilash-majumder
I am in the process of running the other experiments and notebooks as well, but as an initial functionality sanity check, everything seems OK on our devices. Thanks

@younesbelkada
Contributor

Thanks very much for running the experiments @abhilash1910! Looking at the logs, it seems the wandb run is private :/ Can you make the logs public so that we can have a look at them? 🙏 Thanks!

@abhilash1910
Contributor Author

Thanks very much for running the experiments @abhilash1910! Looking at the logs, it seems the wandb run is private :/ Can you make the logs public so that we can have a look at them? 🙏 Thanks!

Oh, apologies, it is public now. Could you let me know if you are able to access it? Thanks @younesbelkada

@abhilash1910
Contributor Author

@younesbelkada could you please help re-trigger the CI? Thanks.

@abhilash1910
Contributor Author

@younesbelkada @lvwerra if the changes make sense and the internal slow tests pass, can this be merged? Also, is there a Slack channel where I can follow up further regarding the benchmarking performance stats from our devices (if required)? Thanks

@lvwerra
Member

lvwerra commented Oct 13, 2023

Hi @abhilash1910, the PR is in good shape to be merged IMO. If you could run python examples/scripts/sentiment_tuning.py that would be great, since we have good baselines for comparison and it should only take ~1h to see that it works. DPO is a bit newer, and we are less sure what regressions would look like there.

@abhilash1910
Contributor Author

Hi @lvwerra, there seems to be no "sentiment_tuning.py" script, but there is a "ppo.py" script, which I tried out. Just for testing purposes: https://wandb.ai/abhilash-majumder/trl/runs/k0kletcb/overview?workspace=user-abhilash-majumder (let me know if you can access it). It contains the stdout (prints), as I was making sure all results were correct. I am in the process of taking a longer run. Thanks!

@lvwerra
Member

lvwerra commented Oct 17, 2023

Oh yes, that's it - we recently renamed it. Looks like your W&B logs are private though :)

@abhilash1910
Contributor Author

Oh, I thought it was public; it is now. Could you also re-trigger the CI? Thanks

@abhilash1910
Contributor Author

@lvwerra @younesbelkada could you help re-trigger the CI and re-review? If any further modules need to be supported from our side, that will come in a separate PR. Thanks

@lvwerra
Member

lvwerra commented Oct 25, 2023

Could you merge main into your branch? We deactivated the flaky tests so now it should run through.

@lvwerra
Member

lvwerra commented Oct 25, 2023

Looks good, if you can share the longer run, I think we can merge :)

@abhilash1910
Contributor Author

abhilash1910 commented Oct 26, 2023

@lvwerra completed the ppo script with 194 iterations; I am sharing the logs here:
ppo_train.log
I did not use wandb for this run, but I wanted a full run to check the timeline.
Please let me know if this can be included in this version's release. Thanks for the support.

@lvwerra
Member

lvwerra commented Oct 26, 2023

Hi @abhilash1910, did you still save the training logs? All we want to check is that PPO also converges at the same rate as on the GPU. E.g. the mean reward vs. iteration plot would be helpful. Thank you!

@abhilash1910
Contributor Author

abhilash1910 commented Oct 26, 2023

@lvwerra Unfortunately I did not save the entire wandb log for the previous run, so I took another snapshot over 6 iterations for the PPO mean reward and the returns:
[image: PPO mean reward and returns curves]
It seems to be converging. The policy KL curve is something I am looking at:
[image: policy KL curve]
It seems to be going as it should. Let me know if this looks OK? Thanks

Member

@lvwerra left a comment


Awesome thanks!

@lvwerra merged commit ec9e766 into huggingface:main on Oct 31, 2023
8 checks passed
@abhilash1910
Contributor Author

Thanks @lvwerra , @younesbelkada for your continued support.

lapp0 pushed a commit to lapp0/trl that referenced this pull request May 10, 2024
* enable xpu support

* fix bug

* review commits

* fix style

* add xpu decorator

* refactor review commit

* fix test

* review commit

* fix test

* Update benchmark.yml (huggingface#856)

* Standardise example scripts (huggingface#842)

* Standardise example scripts

* fix plotting script

* Rename run_xxx to xxx

* Fix doc

---------

Co-authored-by: Costa Huang <costa.huang@outlook.com>

* Fix version check in import_utils.py (huggingface#853)

* dont use get_peft_model if model is already peft (huggingface#857)

* merge conflict

* add xpu decorator

* resolve

* resolves

* upstream

* refactor and precommit

* fix new tests

* add device mapping for xpu

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: Costa Huang <costa.huang@outlook.com>
Co-authored-by: Adam Pauls <adpauls@gmail.com>
Co-authored-by: abhishek thakur <1183441+abhishekkrthakur@users.noreply.github.com>