Name
classifiers
k8s
nemo_run
slurm
README.md
blend_and_shuffle.py
classifier_filtering.py
download_arxiv.py
download_common_crawl.py
download_wikipedia.py
exact_deduplication.py
find_pii_and_deidentify.py
fuzzy_deduplication.py
identify_languages_and_fix_unicode.py
raw_download_common_crawl.py
semdedup_example.py
task_decontamination.py
translation_example.py

README.md

NeMo Curator Python API examples

This directory contains Python scripts that demonstrate how to use various NeMo Curator classes and functions. They are intended to give an overview of the many ways text data can be curated. The scripts are:

| Python Script | Description |
| --- | --- |
| blend_and_shuffle.py | Combine multiple datasets into one, sampling a different amount from each, then randomly permute the result. |
| classifier_filtering.py | Train a fastText classifier, then use it to separate high-quality from low-quality data. |
| download_arxiv.py | Download arXiv tar files and extract them. |
| download_common_crawl.py | Download Common Crawl WARC snapshots and extract them. |
| download_wikipedia.py | Download the latest Wikipedia dumps and extract them. |
| exact_deduplication.py | Use the `ExactDuplicates` class to perform exact deduplication on text data. |
| find_pii_and_deidentify.py | Use the `PiiModifier` and `Modify` classes to remove personally identifiable information from text data. |
| fuzzy_deduplication.py | Use the `FuzzyDuplicatesConfig` and `FuzzyDuplicates` classes to perform fuzzy deduplication on text data. |
| identify_languages_and_fix_unicode.py | Use `FastTextLangId` to filter data by language, then fix its Unicode. |
| raw_download_common_crawl.py | Download the raw compressed WARC files from Common Crawl without extracting them. |
| semdedup_example.py | Use the `SemDedup` class to perform semantic deduplication on text data. |
| task_decontamination.py | Remove segments of downstream evaluation tasks from a dataset. |
| translation_example.py | Create and use an `IndicTranslation` model for language translation. |
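
To make the idea behind exact deduplication concrete, here is a minimal, library-free sketch. It is not NeMo Curator's `ExactDuplicates` implementation or API. The function name `exact_deduplicate`, the `text` field, and the choice of MD5 are assumptions made for illustration only: each document's text is hashed, and only the first document seen for each hash is kept.

```python
import hashlib


def exact_deduplicate(docs, text_field="text"):
    """Keep the first document seen for each distinct text hash.

    Hypothetical helper for illustration; not part of NeMo Curator.
    """
    seen = set()
    kept = []
    for doc in docs:
        # Identical text always produces an identical digest.
        digest = hashlib.md5(doc[text_field].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


docs = [
    {"id": 0, "text": "hello world"},
    {"id": 1, "text": "hello world"},  # exact duplicate of id 0
    {"id": 2, "text": "goodbye"},
]
unique = exact_deduplicate(docs)
print([d["id"] for d in unique])  # [0, 2]
```

For the actual class usage and its arguments, refer to exact_deduplication.py itself.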

Before running any of these scripts, we strongly recommend running `python <script name>.py --help` to check which arguments are required or relevant.
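
As a rough illustration of what `--help` prints, the sketch below builds a command-line parser with Python's standard `argparse` module. The script name `example_script.py` and the flags `--input-data-dir` and `--output-dir` are hypothetical and are not taken from any script in this directory. The real scripts may define their arguments differently, so check each script's own `--help` output.

```python
import argparse


def build_parser():
    # Hypothetical parser for illustration only; the real example
    # scripts define their own arguments.
    parser = argparse.ArgumentParser(
        prog="example_script.py",
        description="Hypothetical curation script (illustration only).",
    )
    parser.add_argument(
        "--input-data-dir",
        required=True,
        help="Directory containing the input files.",
    )
    parser.add_argument(
        "--output-dir",
        default="./output",
        help="Directory where results are written.",
    )
    return parser


parser = build_parser()
# format_help() returns the same text that --help would print.
help_text = parser.format_help()
print(help_text)
```

The help text lists every flag with its description and marks required ones in the usage line, which makes a missing argument easy to spot before starting a long job.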

The `classifiers`, `k8s`, `nemo_run`, and `slurm` subdirectories contain further examples of NeMo Curator's capabilities.
