
DAPT with NeMo FW #10689

Open · wants to merge 24 commits into base: main
Conversation

jvamaraju

What does this PR do ?

Tutorial for DAPT with NeMo Framework

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line-by-line info of high-level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • [Y] Make sure you read and followed Contributor guidelines
  • [N] Did you write any new necessary tests?
  • [N] Did you add or update any necessary documentation?
  • [N] Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc.)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • [Y] New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines contain specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

Collaborator:

We avoid uploading large images directly in the PR, because they increase the git repository size and live in history forever. To use an image, please first upload it to the release assets: https://github.com/NVIDIA/NeMo/releases/tag/r2.0.0rc1 and use its path.

Author:

@yaoyu-33 This was already committed to the current repo. I did not add them. I just renamed the folder.

Collaborator:

I see, that's fine then.

],
"source": [
"# download llam2-70b model weights and tokenizer \n",
"# ! huggingface-cli download meta-llama/Llama-2-70b --local-dir ./models/weight/llama2-hf/\n",
Collaborator:

Uncomment this line to actually download


done

"# download llam2-70b model weights and tokenizer \n",
"# ! huggingface-cli download meta-llama/Llama-2-70b --local-dir ./models/weight/llama2-hf/\n",
"\n",
"#Cope original tokenizer to a different folder\n",
Collaborator:

typo: "Copy"


fixed the typo

}
],
"source": [
"# download llam2-70b model weights and tokenizer \n",
Collaborator:

Users will need to log in via `huggingface-cli login` first, since it is a restricted repo.


Added a code block for `huggingface-cli login`, with instructions in the comments and in the text block right before it.
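
For reference, a minimal sketch of the login-then-download flow using the huggingface_hub Python API (the notebook itself uses the `huggingface-cli` shell commands; the target path below matches the notebook, and access to the gated meta-llama repo is assumed):

```python
# Sketch: authenticate, then download the gated Llama-2-70b weights and tokenizer.
from huggingface_hub import login, snapshot_download

login()  # prompts for a Hugging Face access token; equivalent to `huggingface-cli login`

snapshot_download(
    repo_id="meta-llama/Llama-2-70b",
    local_dir="./models/weight/llama2-hf/",  # same target directory as in the notebook
)
```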

"# ! huggingface-cli download meta-llama/Llama-2-70b --local-dir ./models/weight/llama2-hf/\n",
"\n",
"#Cope original tokenizer to a different folder\n",
"! cp ./models/weight/llama2-hf/tokenizer.model ./models/tokenizer/llama2/original_tokenizer\n",
Collaborator:

This throws a `cp: No such file or directory` error; the target dir needs to be created first.


Added a code block to create all the target directories in Step 2.
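
For reference, a minimal sketch of such a setup cell (the directory paths follow the ones used in the notebook; the full list in the tutorial may differ):

```python
# Sketch: create the target directories up front so later cp/save steps don't fail.
import os

target_dirs = [
    "./models/weight/llama2-hf",
    "./models/tokenizer/llama2/original_tokenizer",
]
for d in target_dirs:
    os.makedirs(d, exist_ok=True)  # no-op if the directory already exists
```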

"tokenizer = train_tokenizer(data_root, batch_size, vocab_size, tokenizer, keys)\n",
"\n",
"#Save and print paths\n",
"tokenizer.save_pretrained(save_root + \"custom_tokenizer_init_\" + str(vocab_size) + \".json\")"
Collaborator:

The save directory is named 'custom_tokenizer_init_x.json'; suggest changing the directory name so it does not end with .json.


Changed the directory name to custom_tokenizer_init_x_json.

Collaborator:

Do we need to push this?

tutorials/llm/llama/domain-adaptive-pretraining/README.md
domain-adaptive tokenization, domain-adaptive continued pretraining, model alignment with domain-specific instructions, and domain-adapted retrieval models. Specifically, Llama 2 foundation models are continually pre-trained with 20B plus tokens on domain-specific chip design data, including code, documents, etc., and then fine-tuned with instruction datasets from design data as well as external sources. Evaluations on the resultant domain-adapted ChipNeMo model demonstrate that
domain-adaptive pretraining of language models can lead to superior performance in domain-related downstream tasks compared to their base Llama 2 counterparts, without degradation in generic capabilities.

Here, we share a tutorial with best practices on custom tokenization + DAPT (domain-adaptive pre-training) for a ChipNeMo-like code generation use case.
Collaborator:

Here, we share a tutorial with best practices on custom tokenization and DAPT (Domain-Adaptive Pre-Training) for a ChipNeMo-like code generation use case.


fixed this


* `./code/data` should contain curated data from chip domain after processing with NeMo Curator. Playbook for DAPT data curation can be found [here](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/dapt-curation)

* `./code/general_data` should contain open-source general purpose data that llama-2 was trained on. This data will help identify token/vocabulary differences between general purpose and domain-specific datasets. Data can be downloaded from [Wikipedia](https://huggingface.co/datasets/legacy-datasets/wikipedia), [CommonCrawl](https://data.commoncrawl.org/) etc. and curated with [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/single_node_tutorial)
Collaborator:

Is this still true? Since we are not releasing the processed dataset?


The user will need to curate their own dataset using the data curation tutorial in the NeMo Curator that we point to here. I edited the text to clarify this.


* In this tutorial, we will leverage chip domain/hardware datasets from open-source GitHub repositories, wiki URLs, and academic papers.

* `./code/data` should contain curated data from chip domain after processing with NeMo Curator. Playbook for DAPT data curation can be found [here](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/dapt-curation)
Collaborator:

"./code/data should contain" --> "./code/data contains"

"Should" sounds like we are asking them to download stuff to that location. Since these assets are part of the GitHub repo, just tell them it contains them.


The data is not part of the GitHub repo, since it is too large to check in. The user will need to curate their own dataset using the data curation tutorial in NeMo Curator that we point to here. I edited the text to clarify this. Please let me know if it's still unclear.


* `./code/general_data` should contain open-source general purpose data that llama-2 was trained on. This data will help identify token/vocabulary differences between general purpose and domain-specific datasets. Data can be downloaded from [Wikipedia](https://huggingface.co/datasets/legacy-datasets/wikipedia), [CommonCrawl](https://data.commoncrawl.org/) etc. and curated with [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/single_node_tutorial)

* `./code/custom_tokenization.ipynb` walks through the custom tokenization workflow required for DAPT (Domain Adaptive Pre-training)
Collaborator:

We have already defined DAPT above; no need to expand it as "(Domain Adaptive Pre-training)" here.


sounds good. removed it.


I have added them to the Readme. Please let me know if it needs anything additional.

with open(new_model_path, 'wb') as f:
    f.write(m.SerializeToString())

if split > 1:
Collaborator:

The rest of the function from here on expands the embedding & output layers by appending to the original embedding/output, and then saves to disk. However: 1. the tokens we obtain here are not yet filtered/sorted by frequency; 2. there is a similar function, extend_tokenizer_utils.py:extend_tokenizer_high_freq_tokens, that saves the extended embedding for high-frequency tokens again and writes to the same file. Should this part be deleted?


@jvamaraju can you pls address this?

import numpy as np


def binary_search(arr, low, high, bar=0.98):
Collaborator:

Several questions about the use of binary search here:

  1. The input to binary_search is a descending-sorted list of frequencies (e.g. [[4, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), but low is the min frequency (1) and high is the max frequency (4). A binary search operates on indices, so why are the min and max frequencies passed in as low and high? For the frequency list above, the output of this binary_search function is 3, which doesn't look correct to me according to "top-K tokens in a way that their cumulative frequency accounts for approximately 98%".
  2. What about using bisect in Python? Could we first compute the cumulative sum and then use bisect to perform the binary search directly? The current binary_search implementation is recursive and also computes arr[0:mid].sum() at every recursive step, which could be costly (see the sketch at the end of this thread).

Collaborator:

please switch to python bisect


@jvamaraju can you please look into this and switch to bisect if possible?
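
For reference, a minimal sketch of the suggested cumulative-sum + bisect approach (the 0.98 threshold and the descending-sorted frequency array follow the discussion above; the function and variable names here are illustrative, not the ones in the PR):

```python
# Sketch: find the smallest top-K cutoff whose cumulative frequency reaches ~98%
# of the total, given frequencies sorted in descending order.
import bisect
import numpy as np

def top_k_cutoff(freqs_desc, bar=0.98):
    """Return K such that freqs_desc[:K] covers at least `bar` of the total frequency."""
    cumsum = np.cumsum(freqs_desc)
    return bisect.bisect_left(cumsum, bar * cumsum[-1]) + 1

freqs = np.array([4, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2] + [1] * 21)
k = top_k_cutoff(freqs)         # a count of tokens, not a frequency value
high_freq_ids = list(range(k))  # e.g. ids[:k] on the caller's side
```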

topics = []
p_ths = []
for key in freq_dict:
print(key)
@HuiyingLi (Collaborator) Oct 3, 2024:

The key/topics are printed twice, here and at line 53; consider removing one of them?


removed this print statement.

# state_dict['args'].vocab_size = state_dict['args'].padded_vocab_size - 768
# print("vocab_size: ", state_dict['args'].vocab_size)
# print("padded_vocab_size: ", state_dict['args'].padded_vocab_size)
torch.save(state_dict, f"{save_path}/consolidated.0{i}.pth")
Collaborator:

The parent dir needs to be created first.


Added a check to ensure that the directory exists before trying to save.
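
A minimal sketch of such a check, based on the save call shown above (the path and state_dict below are toy stand-ins, not the PR's actual values):

```python
# Sketch: make sure the save directory exists before writing each checkpoint shard.
import os
import torch

save_path = "./models/weight/llama2-extended"            # illustrative path
state_dict = {"tok_embeddings.weight": torch.zeros(1)}   # toy stand-in
i = 0

os.makedirs(save_path, exist_ok=True)  # avoids "No such file or directory" on first run
torch.save(state_dict, f"{save_path}/consolidated.0{i}.pth")
```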

@@ -0,0 +1,16 @@
# ChipNeMo - Custom tokenization + Domain Adaptive Pre-training on Llama 2 70b with NeMo Framework
Collaborator:

Should we

  1. clear the notebook output?
  2. clear the unnecessary comments in the code?

@sugsharma Oct 17, 2024:

I have cleaned the unnecessary notebook output, but we have decided to keep some of the outputs that are informative and useful for the reader in understanding what to expect when they run those code blocks.

Also cleared the unnecessary comments and print statements in the code.

p_ths.append(p_th)

tokens = {}
i = 0
Collaborator:

Why a separate index i instead of enumerate?


I think this was mostly for having explicit control over the index variable for readability or debugging purposes.
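
If the loop were switched to enumerate, a rough sketch might look like the following (the freq_dict contents and p_ths values here are toy stand-ins; the actual loop body in the PR differs):

```python
# Sketch: let enumerate track the per-topic index instead of a manual counter.
import numpy as np

# Toy stand-ins for the notebook's freq_dict of (token id, frequency) pairs and p_ths.
freq_dict = {"verilog": [(101, 5), (102, 3), (103, 1)], "docs": [(201, 4), (202, 2)]}
p_ths = [0.98, 0.98]

tokens = {}
for i, topic in enumerate(freq_dict):  # replaces `i = 0` plus a manual increment
    ids = [term[0] for term in freq_dict[topic]]
    freqs_np = np.array([term[-1] for term in freq_dict[topic]])
    th = int(np.searchsorted(np.cumsum(freqs_np), p_ths[i] * freqs_np.sum())) + 1
    tokens[topic] = ids[0:th]
```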

freqs.append(term[-1])
ids.append(term[0])
freqs_np = np.array(freqs)
th = binary_search(freqs_np, freqs_np.min(), freqs_np.max(), bar=p_ths[i])
Collaborator:

freqs_np.min() and freqs_np.max(): these two values are wrong. They should be indices, like 0 and len(freqs_np). But again, please also switch to bisect.

L = []
for key in tokens:
L = L + tokens[key]
# print(len(L))
Collaborator:

clean up all debug comments


done

state_dict['tok_embeddings.weight'] = batch_dict['word_embeddings']
print("embedding shape: ", state_dict['tok_embeddings.weight'].shape)
print("output shape: ", state_dict['output.weight'].shape)
# state_dict['args'].padded_vocab_size = state_dict['model']['language_model']['embedding']['word_embeddings']['weight'].shape[0] * 8
Collaborator:

clean


cleaned

print("error")

L = []
for key in tokens:
Collaborator:

Please change to a list comprehension?


@yaoyu-33 are you recommending changing the for loop to a single line using list comprehension?
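
For reference, the accumulator loop above could be collapsed like this (a sketch; the `tokens` contents are toy stand-ins for the per-topic token-id lists in the surrounding code):

```python
# Sketch: flatten the per-topic token-id lists without a manual accumulator loop.
from itertools import chain

tokens = {"verilog": [101, 102], "docs": [201]}  # toy stand-in

L = [tok for key in tokens for tok in tokens[key]]
# or, equivalently:
L = list(chain.from_iterable(tokens.values()))
```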

if th > 0:
    tokens[topic] = ids[0:th]
else:
    print("error")
Collaborator:

raise Error?


Good point. Added raise ValueError instead.
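
A minimal sketch of the change (the error message text and the toy values are illustrative):

```python
# Sketch: fail loudly instead of printing "error" and continuing.
tokens, topic, ids, th = {}, "verilog", [101, 102, 103], 2  # toy stand-ins

if th > 0:
    tokens[topic] = ids[0:th]
else:
    raise ValueError(f"No frequency cutoff found for topic '{topic}'")
```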

word_embedding = word_embedding.bfloat16()
output_layer = output_layer.bfloat16()

KK, K = word_embedding.shape
Collaborator:

Fix the variable naming: avoid names like KK, K, T, R. Use meaningful names, e.g. vocab_size, dim = ..


Done.
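
A small sketch of the suggested renaming (the tensor here is a toy stand-in for the word_embedding shown above):

```python
# Sketch: unpack shapes into descriptive names instead of KK, K.
import torch

word_embedding = torch.zeros(32000, 8192, dtype=torch.bfloat16)  # toy stand-in
vocab_size, dim = word_embedding.shape  # was: KK, K = word_embedding.shape
```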

github-actions bot (Contributor) commented on Nov 1, 2024:

This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.

github-actions bot added the stale label on Nov 1, 2024.
@jvamaraju (Author):

Finalizing reviews and comments this week

github-actions bot removed the stale label on Nov 5, 2024.