Support Llama3 8b/70b #256

Merged (2 commits merged into main, Apr 20, 2024)

Conversation

@wanchaol (Contributor):

This PR adds support for Llama3 8b/70b. Mainly it:

  • adds a tiktoken-based tokenizer (tiktokenizer) and instructions to download the tokenizer model; a rough sketch of the tiktoken wiring follows below
  • adds options to the llama model to support Llama3
  • adds Llama3 8b/70b configs
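
For readers new to the tiktoken flow, here is a minimal sketch of how a downloaded Llama3 tokenizer.model could be wrapped with tiktoken. It is an illustration only, not torchtitan's tiktokenizer: the function name, the simplified split pattern, and the reduced special-token set are assumptions; the path matches the 8B config quoted later in this thread.

```python
# Illustrative sketch only -- not torchtitan's tiktokenizer. The pat_str and the
# special-token subset below are placeholders/assumptions.
import tiktoken
from tiktoken.load import load_tiktoken_bpe


def build_llama3_encoding(model_path: str) -> tiktoken.Encoding:
    # Llama3's tokenizer.model is a tiktoken BPE rank file (bytes -> merge rank).
    mergeable_ranks = load_tiktoken_bpe(model_path)
    num_base_tokens = len(mergeable_ranks)
    # Special tokens take ids after the base vocabulary (only a subset shown here).
    special_tokens = {
        "<|begin_of_text|>": num_base_tokens,
        "<|end_of_text|>": num_base_tokens + 1,
    }
    return tiktoken.Encoding(
        name="llama3_sketch",
        pat_str=r"\S+|\s+",  # placeholder split regex, not the real Llama3 pattern
        mergeable_ranks=mergeable_ranks,
        special_tokens=special_tokens,
    )


# Example usage (path taken from the 8B config discussed below):
# enc = build_llama3_encoding("./torchtitan/datasets/tokenizer/original/tokenizer.model")
# ids = enc.encode("hello llama3")
# text = enc.decode(ids)
```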

facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Apr 19, 2024.
wanchaol force-pushed the llama3 branch 3 times, most recently from 7404baf to 1d037e3, on April 19, 2024 at 04:12.
Resolved review threads: README.md (two), requirements.txt.
@lessw2020 (Contributor) left a comment:

Awesome, thanks for adding this! I left a couple of minor grammar nits and one nit on naming, but overall it looks fantastic.

@gnadathur (Contributor) commented on Apr 19, 2024:

Thank you for adding llama3 support so fast.
Nit: to keep testing current with llama3, it would be good to move debug_model.toml to use llama3.

@tianyu-l (Contributor) left a comment:

Looks good! Can't wait to try running them. Had some inline comments.

Resolved review threads: README.md, torchtitan/datasets/tokenizer/base_tokenizer.py, torchtitan/datasets/tokenizer/tiktokenizer.py, torchtitan/models/__init__.py, train_configs/llama3_8b.toml.

Inline comment on train_configs/llama3_8b.toml:
[model]
name = "llama3"
flavor = "8B"
tokenizer_path = "./torchtitan/datasets/tokenizer/original/tokenizer.model"

Putting llama2 tokenizer.model under torchtitan/datasets/tokenizer/ and llama3 tokenizer.model under /torchtitan/datasets/tokenizer/original/ is confusing, especially because they share the same file name. Can we organize them better?
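
Purely to illustrate the suggestion above, one hypothetical layout that avoids the name collision could look like this (the llama2/ and llama3/ folder names are invented here, not something this PR ships):

```
torchtitan/datasets/tokenizer/
├── llama2/tokenizer.model   # sentencepiece model used by the Llama2 configs
└── llama3/tokenizer.model   # tiktoken BPE file used by the Llama3 configs
```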

Inline comment on requirements.txt:
datasets
tomli >= 1.1.0 ; python_version < "3.11"
tensorboard
sentencepiece
tiktoken==0.4.0
Contributor:

I was asked to pip install blobfile the first time tiktoken ran.

@wanchaol (author):

Is that still the case for tiktoken versions > 0.5.2, or do we still have to install blobfile?

Contributor:

That's a good question... I haven't tried.

On mast it requires an additional package, chardet, which was not required on a local devgpu.

@wanchaol (author):

hmmm I guess I'll try to remove the version pin for now
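
As an aside on the blobfile question, a quick local check like the sketch below makes the optional dependency visible before settling on a version pin. The path is assumed to match the config quoted earlier, and whether load_tiktoken_bpe pulls in blobfile depends on the installed tiktoken version.

```python
# Sketch only: load the Llama3 tiktoken BPE file and surface any missing
# optional dependency (e.g. blobfile) as an explicit message. The path is an
# assumption; point it at wherever tokenizer.model was downloaded.
from tiktoken.load import load_tiktoken_bpe

try:
    ranks = load_tiktoken_bpe(
        "./torchtitan/datasets/tokenizer/original/tokenizer.model"
    )
    print(f"loaded {len(ranks)} merge ranks")
except ImportError as err:
    # Some tiktoken versions import file helpers such as blobfile lazily when
    # reading local files, so a missing package only shows up at call time.
    print(f"missing optional dependency: {err}")
```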

A follow-up commit on the branch repeats the PR description in its message and adds: "Have to remove the integration test first; will add it back later once we figure out the details."
@wanchaol (author) commented on Apr 20, 2024:

(Quoting @gnadathur) "Thank you for adding llama3 support so fast. Nit: to keep testing current with llama3, it would be good to move debug_model.toml to use llama3."

@gnadathur I had to stop running debug_model.toml in the integration tests because of tokenizer.model, in order to support llama3. I'll try to add it back to the tests next week once we have clarified how to fetch it dynamically.

wanchaol merged commit 35470ca into main on Apr 20, 2024; 4 checks passed.
wanchaol deleted the llama3 branch on April 20, 2024 at 05:26.
tianyu-l pushed a commit to tianyu-l/torchtitan_intern24 that referenced this pull request Aug 16, 2024
philippguevorguian pushed a commit to YerevaNN/YNNtitan that referenced this pull request Aug 17, 2024