
DAPT with NeMo FW #10689

Open · wants to merge 24 commits into base: main
Conversation

jvamaraju

What does this PR do ?

Tutorial for DAPT with NeMo Framework

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line-by-line info of high-level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • [Y] Make sure you read and followed Contributor guidelines
  • [N] Did you write any new necessary tests?
  • [N] Did you add or update any necessary documentation?
  • [N] Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc.)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • [Y] New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines contain specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

Collaborator:

We avoid uploading large images directly in the PR, because they increase the git repository size and live in history forever. To use an image, please first upload it to the release assets: https://github.com/NVIDIA/NeMo/releases/tag/r2.0.0rc1 and use its path.

Author:

@yaoyu-33 This was already committed to the current repo. I did not add them. I just renamed the folder.

Collaborator:

I see, that's fine then.

],
"source": [
"# download llam2-70b model weights and tokenizer \n",
"# ! huggingface-cli download meta-llama/Llama-2-70b --local-dir ./models/weight/llama2-hf/\n",
Collaborator:

Uncomment this line to actually download


done

"# download llam2-70b model weights and tokenizer \n",
"# ! huggingface-cli download meta-llama/Llama-2-70b --local-dir ./models/weight/llama2-hf/\n",
"\n",
"#Cope original tokenizer to a different folder\n",
Collaborator:

typo: "Copy"


fixed the typo

}
],
"source": [
"# download llam2-70b model weights and tokenizer \n",
Collaborator:

Users will need to log in via `huggingface-cli login` first, since it is a restricted repo.


Added a code block for `huggingface-cli login`, with instructions in the comments and in the text block right before it.
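
For reference, a minimal sketch of the login-then-download flow using the huggingface_hub Python API (the notebook itself uses the `huggingface-cli` shell commands; the target path below matches the notebook, and access to the gated meta-llama repo is assumed):

```python
# Sketch: authenticate, then download the gated Llama-2-70b weights and tokenizer.
from huggingface_hub import login, snapshot_download

login()  # prompts for a Hugging Face access token; equivalent to `huggingface-cli login`

snapshot_download(
    repo_id="meta-llama/Llama-2-70b",
    local_dir="./models/weight/llama2-hf/",  # same target directory as in the notebook
)
```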

"# ! huggingface-cli download meta-llama/Llama-2-70b --local-dir ./models/weight/llama2-hf/\n",
"\n",
"#Cope original tokenizer to a different folder\n",
"! cp ./models/weight/llama2-hf/tokenizer.model ./models/tokenizer/llama2/original_tokenizer\n",
Collaborator:

This throws a `cp: No such file or directory` error; the target dir needs to be created first.


Added a code block to create all the target directories in Step 2.
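
For reference, a minimal sketch of such a setup cell (the directory paths follow the ones used in the notebook; the full list in the tutorial may differ):

```python
# Sketch: create the target directories up front so later cp/save steps don't fail.
import os

target_dirs = [
    "./models/weight/llama2-hf",
    "./models/tokenizer/llama2/original_tokenizer",
]
for d in target_dirs:
    os.makedirs(d, exist_ok=True)  # no-op if the directory already exists
```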

"tokenizer = train_tokenizer(data_root, batch_size, vocab_size, tokenizer, keys)\n",
"\n",
"#Save and print paths\n",
"tokenizer.save_pretrained(save_root + \"custom_tokenizer_init_\" + str(vocab_size) + \".json\")"
Collaborator:

The save directory is named 'custom_tokenizer_init_x.json'; suggest changing the directory name so it does not end with .json.


Changed the directory name to custom_tokenizer_init_x_json.

Collaborator:

Do we need to push this?

tutorials/llm/llama/domain-adaptive-pretraining/README.md
domain-adaptive tokenization, domain-adaptive continued pretraining, model alignment with domain-specific instructions, and domain-adapted retrieval models. Specifically, Llama 2 foundation models are continually pre-trained with 20B plus tokens on domain-specific chip design data, including code, documents, etc., and then fine-tuned with instruction datasets from design data as well as external sources. Evaluations on the resultant domain-adapted ChipNeMo model demonstrate that
domain-adaptive pretraining of language models can lead to superior performance in domain-related downstream tasks compared to their base Llama 2 counterparts, without degradation in generic capabilities.

Here, we share a tutorial with best practices on custom tokenization + DAPT (domain-adaptive pre-training) for a ChipNeMo-like code generation use case.
Collaborator:

Here, we share a tutorial with best practices on custom tokenization and DAPT (Domain-Adaptive Pre-Training) for a ChipNeMo-like code generation use case.


fixed this


* `./code/data` should contain curated data from chip domain after processing with NeMo Curator. Playbook for DAPT data curation can be found [here](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/dapt-curation)

* `./code/general_data` should contain open-source general purpose data that llama-2 was trained on. This data will help identify token/vocabulary differences between general purpose and domain-specific datasets. Data can be downloaded from [Wikipedia](https://huggingface.co/datasets/legacy-datasets/wikipedia), [CommonCrawl](https://data.commoncrawl.org/) etc. and curated with [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/single_node_tutorial)
Collaborator:

Is this still true? Since we are not releasing the processed dataset?


The user will need to curate their own dataset using the data curation tutorial in the NeMo Curator that we point to here. I edited the text to clarify this.


* In this tutorial, we will leverage chip domain/hardware datasets from open-source GitHub repositories, wiki URLs, and academic papers.

* `./code/data` should contain curated data from chip domain after processing with NeMo Curator. Playbook for DAPT data curation can be found [here](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/dapt-curation)
Collaborator:

"./code/data should contain" --> "./code/data contains"

"Should" sounds like we are asking them to download stuff to that location. Since these assets are part of the GitHub repo, just tell them it contains them.


The data is not part of the GitHub repo, since it is too large to check in. The user will need to curate their own dataset using the data curation tutorial in NeMo Curator that we point to here. I edited the text to clarify this. Please let me know if it's still unclear.


* `./code/general_data` should contain open-source general purpose data that llama-2 was trained on. This data will help identify token/vocabulary differences between general purpose and domain-specific datasets. Data can be downloaded from [Wikipedia](https://huggingface.co/datasets/legacy-datasets/wikipedia), [CommonCrawl](https://data.commoncrawl.org/) etc. and curated with [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/single_node_tutorial)

* `./code/custom_tokenization.ipynb` walks through the custom tokenization workflow required for DAPT (Domain Adaptive Pre-training)
Collaborator:

We have already defined DAPT above; no need to expand it as "(Domain Adaptive Pre-training)" here.


sounds good. removed it.


I have added them to the Readme. Please let me know if it needs anything additional.

with open(new_model_path, 'wb') as f:
    f.write(m.SerializeToString())

if split > 1:
Collaborator:

The rest of the function from here on expands the embedding & output layers by appending to the original embedding/output, and then saves to disk. However: 1. the tokens we obtain here are not yet filtered/sorted by frequency; 2. there is a similar function, extend_tokenizer_utils.py:extend_tokenizer_high_freq_tokens, that saves the extended embedding for high-frequency tokens again and writes to the same file. Should this part be deleted?


@jvamaraju can you pls address this?

import numpy as np


def binary_search(arr, low, high, bar=0.98):
Collaborator:

Several questions about the use of binary search here:

  1. The input to binary_search is a descending-sorted list of frequencies (e.g. [[4, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), but low is the min frequency (1) and high is the max frequency (4). A binary search operates on indices, so why are the min and max frequencies passed in as low and high? For the frequency list above, the output of this binary_search function is 3, which doesn't look correct to me according to "top-K tokens in a way that their cumulative frequency accounts for approximately 98%".
  2. What about using bisect in Python? Could we first compute the cumulative sum and then use bisect to perform the binary search directly? The current binary_search implementation is recursive and also computes arr[0:mid].sum() at every recursive step, which could be costly (see the sketch at the end of this thread).

Collaborator:

please switch to python bisect


@jvamaraju can you please look into this and switch to bisect if possible?
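
For reference, a minimal sketch of the suggested cumulative-sum + bisect approach (the 0.98 threshold and the descending-sorted frequency array follow the discussion above; the function and variable names here are illustrative, not the ones in the PR):

```python
# Sketch: find the smallest top-K cutoff whose cumulative frequency reaches ~98%
# of the total, given frequencies sorted in descending order.
import bisect
import numpy as np

def top_k_cutoff(freqs_desc, bar=0.98):
    """Return K such that freqs_desc[:K] covers at least `bar` of the total frequency."""
    cumsum = np.cumsum(freqs_desc)
    return bisect.bisect_left(cumsum, bar * cumsum[-1]) + 1

freqs = np.array([4, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2] + [1] * 21)
k = top_k_cutoff(freqs)         # a count of tokens, not a frequency value
high_freq_ids = list(range(k))  # e.g. ids[:k] on the caller's side
```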

topics = []
p_ths = []
for key in freq_dict:
print(key)
@HuiyingLi (Collaborator) Oct 3, 2024:

The key/topics are printed twice, here and at line 53; consider removing one of them?


removed this print statement.

# state_dict['args'].vocab_size = state_dict['args'].padded_vocab_size - 768
# print("vocab_size: ", state_dict['args'].vocab_size)
# print("padded_vocab_size: ", state_dict['args'].padded_vocab_size)
torch.save(state_dict, f"{save_path}/consolidated.0{i}.pth")
Collaborator:

The parent dir needs to be created first.


Added a check to ensure that the directory exists before trying to save.
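
A minimal sketch of such a check, based on the save call shown above (the path and state_dict below are toy stand-ins, not the PR's actual values):

```python
# Sketch: make sure the save directory exists before writing each checkpoint shard.
import os
import torch

save_path = "./models/weight/llama2-extended"            # illustrative path
state_dict = {"tok_embeddings.weight": torch.zeros(1)}   # toy stand-in
i = 0

os.makedirs(save_path, exist_ok=True)  # avoids "No such file or directory" on first run
torch.save(state_dict, f"{save_path}/consolidated.0{i}.pth")
```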

@@ -0,0 +1,16 @@
# ChipNeMo - Custom tokenization + Domain Adaptive Pre-training on Llama 2 70b with NeMo Framework
Collaborator:

Should we

  1. clear the notebook output?
  2. clear the unnecessary comments in the code?

@sugsharma Oct 17, 2024:

I have cleaned the unnecessary notebook output, but we have decided to keep some of the outputs that are informative and useful for the reader in understanding what to expect when they run those code blocks.

Also cleared the unnecessary comments and print statements in the code.

p_ths.append(p_th)

tokens = {}
i = 0
Collaborator:

Why a separate index i instead of enumerate?


I think this was mostly for having explicit control over the index variable for readability or debugging purposes.
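
If the loop were switched to enumerate, a rough sketch might look like the following (the freq_dict contents and p_ths values here are toy stand-ins; the actual loop body in the PR differs):

```python
# Sketch: let enumerate track the per-topic index instead of a manual counter.
import numpy as np

# Toy stand-ins for the notebook's freq_dict of (token id, frequency) pairs and p_ths.
freq_dict = {"verilog": [(101, 5), (102, 3), (103, 1)], "docs": [(201, 4), (202, 2)]}
p_ths = [0.98, 0.98]

tokens = {}
for i, topic in enumerate(freq_dict):  # replaces `i = 0` plus a manual increment
    ids = [term[0] for term in freq_dict[topic]]
    freqs_np = np.array([term[-1] for term in freq_dict[topic]])
    th = int(np.searchsorted(np.cumsum(freqs_np), p_ths[i] * freqs_np.sum())) + 1
    tokens[topic] = ids[0:th]
```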

freqs.append(term[-1])
ids.append(term[0])
freqs_np = np.array(freqs)
th = binary_search(freqs_np, freqs_np.min(), freqs_np.max(), bar=p_ths[i])
Collaborator:

freqs_np.min() and freqs_np.max(): these two values are wrong. They should be indices, like 0 and len(freqs_np). But again, please also switch to bisect.

L = []
for key in tokens:
L = L + tokens[key]
# print(len(L))
Collaborator:

clean up all debug comments


done

state_dict['tok_embeddings.weight'] = batch_dict['word_embeddings']
print("embedding shape: ", state_dict['tok_embeddings.weight'].shape)
print("output shape: ", state_dict['output.weight'].shape)
# state_dict['args'].padded_vocab_size = state_dict['model']['language_model']['embedding']['word_embeddings']['weight'].shape[0] * 8
Collaborator:

clean


cleaned

print("error")

L = []
for key in tokens:
Collaborator:

Please change to a list comprehension?


@yaoyu-33 are you recommending changing the for loop to a single line using list comprehension?
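
For reference, the accumulator loop above could be collapsed like this (a sketch; the `tokens` contents are toy stand-ins for the per-topic token-id lists in the surrounding code):

```python
# Sketch: flatten the per-topic token-id lists without a manual accumulator loop.
from itertools import chain

tokens = {"verilog": [101, 102], "docs": [201]}  # toy stand-in

L = [tok for key in tokens for tok in tokens[key]]
# or, equivalently:
L = list(chain.from_iterable(tokens.values()))
```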

if th > 0:
    tokens[topic] = ids[0:th]
else:
    print("error")
Collaborator:

raise Error?


Good point. Added raise ValueError instead.
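
A minimal sketch of the change (the error message text and the toy values are illustrative):

```python
# Sketch: fail loudly instead of printing "error" and continuing.
tokens, topic, ids, th = {}, "verilog", [101, 102, 103], 2  # toy stand-ins

if th > 0:
    tokens[topic] = ids[0:th]
else:
    raise ValueError(f"No frequency cutoff found for topic '{topic}'")
```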

word_embedding = word_embedding.bfloat16()
output_layer = output_layer.bfloat16()

KK, K = word_embedding.shape
Collaborator:

Fix the variable naming: avoid names like KK, K, T, R. Use meaningful names, e.g. vocab_size, dim = ..


Done.
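
A small sketch of the suggested renaming (the tensor here is a toy stand-in for the word_embedding shown above):

```python
# Sketch: unpack shapes into descriptive names instead of KK, K.
import torch

word_embedding = torch.zeros(32000, 8192, dtype=torch.bfloat16)  # toy stand-in
vocab_size, dim = word_embedding.shape  # was: KK, K = word_embedding.shape
```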

github-actions bot (Contributor) commented on Nov 1, 2024:

This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.

github-actions bot added the stale label on Nov 1, 2024.
@jvamaraju (Author):

Finalizing reviews and comments this week

github-actions bot removed the stale label on Nov 5, 2024.