Commit
Updated DAPT readme
Signed-off-by: Janaki <jvamaraju@nvidia.com>
jvamaraju committed Oct 2, 2024
1 parent bd23c35 commit aab9c96
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion tutorials/llm/llama/domain-adaptive-pretraining/README.md
@@ -11,6 +11,6 @@ Here, we share a tutorial with best practices on custom tokenization + DAPT (dom

* `./code/data` should contain curated data from the chip domain after processing with NeMo Curator. The playbook for DAPT data curation can be found [here](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/dapt-curation)

- * `./code/general_data should contain open-source general-purpose data that the llama-2 was trained on. This data will help identify token/vocabulary differences between general-purpose and domain-specific datasets. This data can be downloaded from [Wikipedia](https://huggingface.co/datasets/legacy-datasets/wikipedia), [CommonCrawl](https://data.commoncrawl.org/) etc., and curated with [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/single_node_tutorial)
+ * `./code/general_data` should contain open-source general-purpose data that llama-2 was trained on. This data will help identify token/vocabulary differences between general-purpose and domain-specific datasets. Data can be downloaded from [Wikipedia](https://huggingface.co/datasets/legacy-datasets/wikipedia), [CommonCrawl](https://data.commoncrawl.org/) etc., and curated with [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/single_node_tutorial)

* `./code/custom_tokenization.ipynb` walks through the custom tokenization workflow required for DAPT (Domain Adaptive Pre-training)
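
As an aside (not part of the commit or the tutorial code), the following is a minimal sketch of the comparison the `general_data` bullet above motivates: tokenize samples from the domain corpus and the general-purpose corpus with the base Llama-2 tokenizer and look for vocabulary that only the domain data exercises. The directory layout, plain-text file format, and the `meta-llama/Llama-2-7b-hf` tokenizer ID are illustrative assumptions, not taken from the tutorial.

```python
from collections import Counter
from pathlib import Path

from transformers import AutoTokenizer

# Assumption: the base Llama-2 tokenizer (gated on the Hugging Face Hub,
# so access must be requested and an auth token configured).
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def token_counts(data_dir: str, limit: int = 1000) -> Counter:
    """Count token-ID occurrences across up to `limit` plain-text files."""
    counts: Counter = Counter()
    for path in sorted(Path(data_dir).glob("**/*.txt"))[:limit]:
        ids = tokenizer(path.read_text(), add_special_tokens=False)["input_ids"]
        counts.update(ids)
    return counts

# Hypothetical layout: curated domain data and downloaded general-purpose data
# stored as plain-text files under the directories the README describes.
domain_counts = token_counts("./code/data")
general_counts = token_counts("./code/general_data")

# Token IDs that the domain sample uses but the general-purpose sample never
# does hint at vocabulary a custom tokenizer could cover more efficiently.
domain_only = {tok for tok in domain_counts if general_counts[tok] == 0}
print(f"{len(domain_only)} token IDs appear only in the domain sample")
```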
