DAPT with NeMo FW #10689
base: main
Conversation
We avoid uploading large images in the PR directly, because they increase the git size and live in history forever. To use an image, please first upload it to the release assets: https://github.com/NVIDIA/NeMo/releases/tag/r2.0.0rc1 and use its path.
@yaoyu-33 These were already committed to the current repo. I did not add them; I just renamed the folder.
I see, that's fine then.
],
"source": [
"# download llam2-70b model weights and tokenizer \n",
"# ! huggingface-cli download meta-llama/Llama-2-70b --local-dir ./models/weight/llama2-hf/\n",
Uncomment this line to actually download
done
"# download llam2-70b model weights and tokenizer \n", | ||
"# ! huggingface-cli download meta-llama/Llama-2-70b --local-dir ./models/weight/llama2-hf/\n", | ||
"\n", | ||
"#Cope original tokenizer to a different folder\n", |
typo: "Copy"
fixed the typo
}
],
"source": [
"# download llam2-70b model weights and tokenizer \n",
Users will need to log in via `huggingface-cli login` first since it is a restricted repo.
Added a code block for `huggingface-cli login`, with instructions in the comments and in the text block right before it.
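For reference only, a minimal sketch of what such a login + download cell could look like using the `huggingface_hub` Python API (this is not the notebook's exact cell; the repo id and local path just mirror the commented CLI command quoted above):

```python
# Sketch: assumes huggingface_hub is installed and the user has been granted
# access to the gated meta-llama repository.
from huggingface_hub import login, snapshot_download

login()  # prompts for an HF access token; equivalent to `huggingface-cli login`

snapshot_download(
    repo_id="meta-llama/Llama-2-70b",
    local_dir="./models/weight/llama2-hf/",
)
```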
"# ! huggingface-cli download meta-llama/Llama-2-70b --local-dir ./models/weight/llama2-hf/\n", | ||
"\n", | ||
"#Cope original tokenizer to a different folder\n", | ||
"! cp ./models/weight/llama2-hf/tokenizer.model ./models/tokenizer/llama2/original_tokenizer\n", |
This throws a `cp: No such file or directory` error; the target directory needs to be created first.
Added a code block to create all the target directories in Step 2.
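As an illustration (a sketch, not the notebook's exact cell), the directory-creation step could look like this; the paths simply mirror the ones in the quoted cell above:

```python
# Sketch: create the target directories up front so the copy below cannot fail
# with "No such file or directory".
import os
import shutil

for d in [
    "./models/weight/llama2-hf",
    "./models/tokenizer/llama2/original_tokenizer",
]:
    os.makedirs(d, exist_ok=True)

# Copy the original Llama-2 tokenizer into its own folder.
shutil.copy(
    "./models/weight/llama2-hf/tokenizer.model",
    "./models/tokenizer/llama2/original_tokenizer/",
)
```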
"tokenizer = train_tokenizer(data_root, batch_size, vocab_size, tokenizer, keys)\n", | ||
"\n", | ||
"#Save and print paths\n", | ||
"tokenizer.save_pretrained(save_root + \"custom_tokenizer_init_\" + str(vocab_size) + \".json\")" |
The save directory is named 'custom_tokenizer_init_x.json'; suggest changing the directory name so it does not end with .json.
Changed the directory name to custom_tokenizer_init_x_json.
tutorials/llm/llama/domain-adaptive-pretraining/code/extend_tokenizer_utils.py
Do we need to push this?
domain-adaptive tokenization, domain adaptive continued pretraining, model alignment with domain-specific instructions, and domain adapted retrieval models. Specifically, LLama 2 foundation models are continually pre-trained with 20B plus tokens on domain-specific chip design data, including code, documents, etc., and then fine-tuned with instruction datasets from design data as well as external sources. Evaluations on the resultant domain-adapted ChipNeMo model demonstrate that
domain-adaptive pretraining of language models, can lead to superior performance in domain related downstream tasks compared to their base Llama 2 counterparts, without degradations in generic capabilities.

Here, we share a tutorial with best practices on custom tokenization + DAPT (domain-adaptive pre-training) for a ChipNeMo-like code generation use case.
Here, we share a tutorial with best practices on custom tokenization and DAPT (Domain-Adaptive Pre-Training) for a ChipNeMo-like code generation use case.
fixed this
* `./code/data` should contain curated data from chip domain after processing with NeMo Curator. Playbook for DAPT data curation can be found [here](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/dapt-curation)

* `./code/general_data` should contain open-source general purpose data that llama-2 was trained on. This data will help idenitfy token/vocabulary differences between general purpose and domain-specific datasets. Data can be downloaded from [Wikepedia](https://huggingface.co/datasets/legacy-datasets/wikipedia), [CommonCrawl](https://data.commoncrawl.org/) etc. and curated with [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/single_node_tutorial)
Is this still true? Since we are not releasing the processed dataset?
The user will need to curate their own dataset using the data curation tutorial in NeMo Curator that we point to here. I edited the text to clarify this.
* In this tutorial, we will leverage chip domain/hardware datasets from open-source GitHub repositories, wiki URLs, and academic papers.

* `./code/data` should contain curated data from chip domain after processing with NeMo Curator. Playbook for DAPT data curation can be found [here](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/dapt-curation)
"./code/data
should contain" --> "./code/data
contains"
Should sounds like we are asking them to download stuff to that location. Since these assets are part of the github, just tell them it contains.
The data is not part of the GitHub repo since it is too large to check in. The user will need to curate their own dataset using the data curation tutorial in NeMo Curator that we point to here. I edited the text to clarify this. Please let me know if it's still unclear.
* `./code/general_data` should contain open-source general purpose data that llama-2 was trained on. This data will help idenitfy token/vocabulary differences between general purpose and domain-specific datasets. Data can be downloaded from [Wikepedia](https://huggingface.co/datasets/legacy-datasets/wikipedia), [CommonCrawl](https://data.commoncrawl.org/) etc. and curated with [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/single_node_tutorial)

* `./code/custom_tokenization.ipynb` walks through the custom tokenization workflow required for DAPT (Domain Adaptive Pre-training)
We have already defined DAPT above; there is no need to expand it as "(Domain Adaptive Pre-training)" here.
sounds good. removed it.
Can we add prerequisites and instructions to launch this notebook?
I have added them to the README. Please let me know if anything additional is needed.
with open(new_model_path, 'wb') as f:
    f.write(m.SerializeToString())

if split > 1:
The rest of the function from here on expands the embedding and output layers by appending to the original embedding/output and then saves them to disk. However: 1. the tokens we obtain here are not filtered/sorted by frequency yet; 2. there is a similar function in extend_tokenizer_utils.py:extend_tokenizer_high_freq_tokens that saves the extended embedding for the high-frequency tokens again and writes to the same file. Should this part be deleted?
@jvamaraju can you please address this?
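For context, the general pattern under discussion is roughly the following (a hedged sketch of embedding/output extension in general; the function name and initialization scheme are illustrative, not taken from this PR):

```python
# Sketch only: append rows for newly added tokens to the embedding and output
# matrices, initializing them from the mean of the existing rows.
import torch

def extend_embeddings(word_embedding, output_layer, num_new_tokens):
    new_emb = word_embedding.mean(dim=0, keepdim=True).repeat(num_new_tokens, 1)
    new_out = output_layer.mean(dim=0, keepdim=True).repeat(num_new_tokens, 1)
    word_embedding = torch.cat([word_embedding, new_emb], dim=0)
    output_layer = torch.cat([output_layer, new_out], dim=0)
    return word_embedding, output_layer
```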
tutorials/llm/llama/domain-adaptive-pretraining/code/tokenization_helper.py
import numpy as np


def binary_search(arr, low, high, bar=0.98):
Several questions about the use of binary search here:
- The input to binary_search is a list of descending-sorted frequencies (e.g. [[4, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), but low is the min frequency (1) and high is the max frequency (4), while the binary search operates on indices. Why are the min and max frequencies passed in as low and high? For the frequency list above, the output of this binary_search function is 3, which doesn't look correct to me according to "top-K tokens in a way that their cumulative frequency accounts for approximately 98%".
- What about using bisect in Python? Could we first compute the cumulative sum and then use bisect to perform the binary search directly? The binary_search implementation is recursive and also computes arr[0:mid].sum() at every recursive step, which could be costly.
Please switch to Python's bisect.
@jvamaraju can you please look into this and switch to bisect if possible?
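To make the suggestion concrete, a minimal sketch of the cumulative-sum + bisect approach (this assumes the frequencies are already sorted in descending order, as in the example above; the function name is illustrative):

```python
import bisect

import numpy as np

def top_k_for_coverage(freqs, bar=0.98):
    """Return the smallest K such that the top-K frequencies cover ~bar of the total."""
    cumsum = np.cumsum(np.asarray(freqs)).tolist()
    target = bar * cumsum[-1]
    # bisect_left finds the first prefix whose cumulative sum reaches the target.
    return bisect.bisect_left(cumsum, target) + 1

# For the example list [4, 3, 3, ..., 1] this returns the count of leading
# tokens whose frequencies sum to at least 98% of the total frequency.
```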
topics = []
p_ths = []
for key in freq_dict:
    print(key)
The key/topics are printed twice, here and on line 53; consider removing one?
removed this print statement.
# state_dict['args'].vocab_size = state_dict['args'].padded_vocab_size - 768
# print("vocab_size: ", state_dict['args'].vocab_size)
# print("padded_vocab_size: ", state_dict['args'].padded_vocab_size)
torch.save(state_dict, f"{save_path}/consolidated.0{i}.pth")
The parent directory needs to be created first.
Added a check to ensure the directory exists before trying to save.
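For illustration, the fix could look roughly like this (a sketch; the wrapper function name is mine, not the repo's):

```python
import os

import torch

def save_checkpoint_shard(state_dict, save_path, i):
    # torch.save does not create missing parent directories, so ensure the
    # target directory exists before writing the shard.
    os.makedirs(save_path, exist_ok=True)
    torch.save(state_dict, f"{save_path}/consolidated.0{i}.pth")
```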
@@ -0,0 +1,16 @@
# ChipNeMo - Custom tokenization + Domain Adaptive Pre-training on Llama 2 70b with NeMo Framework
Should we
- clear the notebook output?
- clear the unnecessary comments in the code?
I have cleaned up the unnecessary notebook output, but we have decided to keep some of the outputs that are informative and useful for the reader in understanding what to expect when they run those code blocks.
Also cleared the unnecessary comments and print statements in the code.
p_ths.append(p_th)

tokens = {}
i = 0
Why a separate index i instead of enumerate?
I think this was mostly for having explicit control over the index variable for readability or debugging purposes.
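For comparison, a small illustrative sketch of the two styles (the dict contents and thresholds here are made up, not from the repo):

```python
p_ths = [0.98, 0.95, 0.99]
freq_dict = {"code": [4, 3, 1], "docs": [5, 2, 1], "specs": [3, 3, 2]}

# Manual counter, as in the current code:
i = 0
for key in freq_dict:
    p_th = p_ths[i]
    i += 1

# enumerate folds the counter into the loop header:
for i, key in enumerate(freq_dict):
    p_th = p_ths[i]
```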
freqs.append(term[-1])
ids.append(term[0])
freqs_np = np.array(freqs)
th = binary_search(freqs_np, freqs_np.min(), freqs_np.max(), bar=p_ths[i])
freqs_np.min() and freqs_np.max() are the wrong values here; they should be indices, like 0 and len(freqs_np). But again, please also switch to bisect.
L = []
for key in tokens:
    L = L + tokens[key]
# print(len(L))
clean up all debug comments
done
state_dict['tok_embeddings.weight'] = batch_dict['word_embeddings']
print("embedding shape: ", state_dict['tok_embeddings.weight'].shape)
print("output shape: ", state_dict['output.weight'].shape)
# state_dict['args'].padded_vocab_size = state_dict['model']['language_model']['embedding']['word_embeddings']['weight'].shape[0] * 8
clean
cleaned
print("error") | ||
|
||
L = [] | ||
for key in tokens: |
Please change this to a list comprehension?
@yaoyu-33 are you recommending changing the for loop to a single line using a list comprehension?
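If so, the loop and its single-expression equivalent would look roughly like this (illustrative values for `tokens`, not the repo's data):

```python
tokens = {"code": [10, 11], "docs": [12], "specs": [13, 14]}

# Current loop:
L = []
for key in tokens:
    L = L + tokens[key]

# List-comprehension equivalent:
L = [tok for key in tokens for tok in tokens[key]]
```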
if th > 0:
    tokens[topic] = ids[0:th]
else:
    print("error")
raise Error?
Good point. Added raise ValueError instead
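A minimal sketch of that change (the wrapper function name and error message are illustrative, not the repo's exact code):

```python
def assign_top_tokens(tokens, topic, ids, th):
    """Keep the top-`th` token ids for `topic`, failing loudly on a bad threshold."""
    if th > 0:
        tokens[topic] = ids[0:th]
    else:
        # Raise instead of print("error") so an invalid threshold is not silently ignored.
        raise ValueError(f"Invalid frequency threshold {th} for topic {topic!r}")
    return tokens
```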
word_embedding = word_embedding.bfloat16()
output_layer = output_layer.bfloat16()

KK, K = word_embedding.shape
Fix the naming heuristic; avoid using KK, K, T, R. Use meaningful names, e.g. vocab_size, dim = ...
Done.
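For example (illustrative shapes; only the unpacking line reflects the suggested rename):

```python
import torch

word_embedding = torch.zeros(32000, 8192, dtype=torch.bfloat16)  # illustrative shape
output_layer = torch.zeros(32000, 8192, dtype=torch.bfloat16)

# Descriptive names instead of KK, K:
vocab_size, hidden_dim = word_embedding.shape
```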
Signed-off-by: Janaki <jvamaraju@nvidia.com>
Signed-off-by: sugandhas <sugandhas@nvidia.com>
This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.
Finalizing reviews and comments this week
What does this PR do?
Tutorial for DAPT with NeMo Framework
Collection: [Note which collection this PR will affect]
Changelog
Usage
# Add a code snippet demonstrating how to use this
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines contain specific people who can review PRs to various areas.
Additional Information