This repository is heavily inspired by BigScience's ROOTS project and EleutherAI's The Pile.
The overall pipeline is as follows:
```mermaid
flowchart LR
    A(Defining <br/>Datasources) --> B(Defining Filters <br/>per Datasource)
    B --> C(Defining Cleaners <br/>per Datasource)
```
In this library, we define *filtering* as removing data instances from the dataset based on some criteria, and *cleaning* as modifying data instances in some way while keeping them.
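To make the distinction concrete, here is a minimal sketch of the two kinds of operations. The function names and signatures below are hypothetical illustrations, not squeakily's actual API: a filter maps a document to a keep/drop decision, while a cleaner maps a document to a modified document.

```python
# Hypothetical illustrations of the two operations (not squeakily's API).

def keeps_reasonable_length(text: str) -> bool:
    """Filter: decide whether to keep a document at all."""
    return 10 < len(text) < 100_000

def strip_outer_whitespace(text: str) -> str:
    """Cleaner: return a modified version of the document."""
    return text.strip()
```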
Install squeakily with pip:

```sh
pip install squeakily
```
First, we need to define a datasource. squeakily accepts any `Dataset` object from the HuggingFace Datasets library. For example, we can use the wikitext dataset:
```python
from datasets import load_dataset

ds = load_dataset("wikitext", "wikitext-103-v1", split="train[:1%]")
```
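Optionally, you can peek at the loaded split with the standard `datasets` API to see what the raw text looks like before any filtering or cleaning:

```python
print(ds)                    # number of rows and column names
print(ds[1]["text"][:200])   # first 200 characters of one document
```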
We then wrap the `Dataset` object in a dictionary that gives the datasource a name and lists the columns to process, the filters to apply, and the cleaners to run. For example:
```python
from squeakily.filter import check_char_repetition, check_flagged_words
from squeakily.clean import remove_empty_lines, normalize_whitespace

datasources = [
    {
        "dataset": ds,
        "name": "wikitext",
        "columns": ["text"],
        "filters": [check_char_repetition, check_flagged_words],
        "cleaners": [remove_empty_lines, normalize_whitespace],
    },
    # ...
]
```
> **Warning**: The order of the filter and cleaning functions matters: filters and cleaners are applied in the order they are defined.
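As a toy illustration of why order matters (plain string operations, not squeakily's cleaners), two cleaning steps can produce different results depending on which runs first:

```python
doc = "Hello World"
lowercase = str.lower                               # toy cleaner 1
shorten = lambda text: text.replace("Hello", "Hi")  # toy cleaner 2

print(shorten(lowercase(doc)))  # "hello world" -- the replacement never fires
print(lowercase(shorten(doc)))  # "hi world"
```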
> **Important**: As of now, only the first of the given column names is used. This is because the squeakily library is designed to work with language datasets, which usually have a single column of text. Future versions will support multiple columns.
Finally, we can apply the filters and cleaners to the datasources using a `Pipeline` object:
```python
from squeakily.core import Pipeline

pipeline = Pipeline(datasources)
pipeline.run()
```
```
[11/16/22 04:32:57] INFO     Running datasource: wikitext                    core.py:41
                    INFO     Running filter: check_char_repetition on text   core.py:54
                    INFO     Running filter: check_flagged_words on text     core.py:54
                    INFO     Running cleaner: remove_empty_lines on text     core.py:57
[11/16/22 04:32:59] INFO     Running cleaner: normalize_whitespace on text   core.py:57
```
> **Note**: If you want to export the processed datasource to a desired path, you can specify an export path and the output type (csv or json) in the `export_to_path` function.

```python
export_path = "/path/to/desired/path"
output_types = ["csv", "json"]  # Optional, default is "csv"
json_indication = "records"    # Optional, default is "records"
pipeline.export_to_path(export_path, output_types[1], json_indication=json_indication)
```
> **Note**: If you want to run cleaners first, you can pass `cleaning_first=True` to the `run` function.

```python
pipeline.run(cleaning_first=True)
```
If you need to run a filter or cleaner at the dataset level rather than the example level, you can pass `global_filters` or `global_cleaners` to the `Pipeline.run` function. For example:
```python
from squeakily.filter import minhash_dedup

pipeline.run(global_filters=[minhash_dedup])
```
> **Note**: If you use global filters or cleaners, all datasets must have a common column name in order to properly concatenate them.
> **Note**: You can also specify that a given dataset should be skipped by setting the `skip_global` parameter to `True` when defining the datasource.

```python
datasources = [
    {
        "dataset": ds,
        "columns": ["text"],
        "filters": [check_char_repetition, check_flagged_words],
        "cleaners": [remove_empty_lines, normalize_whitespace],
        "skip_global": True,
    },
    # ...
]
```
Additionally, you can run the pipeline in dry-run mode by passing `dry_run=True` to the `run` function. This makes no modifications to the datasets' documents, but adds extra columns to the datasets with the results of the filters and cleaners. For example, if you ran the pipeline with the `check_char_repetition` filter, you would get a new column called `check_char_repetition` with a float value between 0 and 1 indicating the proportion of characters that are repeated in the document.
```python
pipeline = Pipeline(datasources)
pipeline.run(dry_run=True)
pipeline.datasources[0]["dataset"].features
```
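Because dry-run mode only annotates the datasets, you can act on those annotations yourself. As a sketch, assuming the `check_char_repetition` column holds the repetition ratio described above and using an arbitrary threshold of 0.2, you could keep only low-repetition documents with the standard `datasets` API:

```python
annotated = pipeline.datasources[0]["dataset"]

# 0.2 is an arbitrary example threshold, not a squeakily default.
low_repetition = annotated.filter(
    lambda example: example["check_char_repetition"] < 0.2
)
print(low_repetition.num_rows)
```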