Library to perform basic text preprocessing on a corpus of documents.
Explore the docs »
Execute in a Unix terminal:
chmod 775 clean-artifacts.sh
./clean-artifacts.sh -a IN_PATH -b OUT_PATH
Processed files are written to OUT_PATH.
Step 1/5: Force Unix text files, mainly to normalize newline characters (convert CRLF to LF).
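As a rough illustration of what this step does (the script itself is shell; this Python sketch is not its exact logic), the conversion amounts to rewriting CRLF and bare CR as LF:

```python
from pathlib import Path

def to_unix_newlines(path: str) -> None:
    """Rewrite a file in place, converting CRLF/CR line endings to LF."""
    data = Path(path).read_bytes()
    data = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    Path(path).write_bytes(data)
```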
Step 2/5: Remove common artifacts. This step uses a GitHub repo developed by Aitor to fix common encoding errors.
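The exact fixes live in that external repo, but as a hypothetical sketch, this kind of repair typically boils down to a substitution table of well-known mojibake sequences (the entries below are standard UTF-8-decoded-as-Latin-1 artifacts, not necessarily the ones the repo handles):

```python
# Hypothetical substitution table; the real fixes come from the external repo.
MOJIBAKE_FIXES = {
    "â€™": "\u2019",  # right single quotation mark
    "â€œ": "\u201c",  # left double quotation mark
    "Ã©": "é",
    "Ã¡": "á",
}

def fix_mojibake(text: str) -> str:
    """Apply each known bad -> good substitution to the text."""
    for bad, good in MOJIBAKE_FIXES.items():
        text = text.replace(bad, good)
    return text
```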
Step 3/5: Remove common HTML errors. This step substitutes HTML entities that are often missed when converting HTML to plain text.
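For reference, the same effect can be reproduced with Python's standard library (the script may use plain sed substitutions instead; this is only a sketch):

```python
import html

# html.unescape resolves both named (&amp;) and numeric (&#8211;) entities.
print(html.unescape("Fish &amp; chips &#8211; &pound;5"))
# -> Fish & chips – £5
```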
Step 4/5: Quick substitution of common error patterns, i.e. other common patterns that may cause errors when working with plain text. This step also replaces all whitespace with '\n' or ' ' and forces NFKC Unicode normalization.
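A minimal Python sketch of the whitespace and Unicode normalization described above (assumed behavior, mirroring the description rather than the script's code):

```python
import re
import unicodedata

def normalize_text(text: str) -> str:
    # Force NFKC Unicode normalization.
    text = unicodedata.normalize("NFKC", text)
    # Replace any run of whitespace other than '\n' with a single space,
    # so every whitespace character ends up as either '\n' or ' '.
    return re.sub(r"[^\S\n]+", " ", text)
```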
Step 5/5: Check for lines starting with a lowercase letter. The flagged files must be reviewed manually to determine whether those newlines are parsing/conversion mistakes.
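A simple way to surface those lines (an illustrative sketch, not the script's implementation):

```python
from pathlib import Path

def lines_starting_lowercase(path: str):
    """Yield (line number, line) pairs whose first character is lowercase."""
    text = Path(path).read_text(encoding="utf-8")
    for lineno, line in enumerate(text.splitlines(), 1):
        if line and line[0].islower():
            yield lineno, line
```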
This repo also provides a script to detect near-duplicate documents in the corpus. Its usage is:
python find_duplicates.py --datapath CORPUS_PATH
It prints the list of duplicated files to the terminal.
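The algorithm behind find_duplicates.py is not documented here; as a hypothetical sketch, near-duplicate detection over a small corpus can be done by comparing character-shingle sets with Jaccard similarity (the .txt glob and the 0.8 threshold are assumptions):

```python
import itertools
from pathlib import Path

def shingles(text: str, n: int = 5) -> set:
    """Set of overlapping n-character substrings of the text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def find_near_duplicates(datapath: str, threshold: float = 0.8) -> None:
    docs = {p: shingles(p.read_text(encoding="utf-8"))
            for p in Path(datapath).glob("*.txt")}
    # Compare every pair of documents; print pairs above the threshold.
    for (pa, sa), (pb, sb) in itertools.combinations(docs.items(), 2):
        union = len(sa | sb)
        if union and len(sa & sb) / union >= threshold:
            print(pa, pb)
```

The pairwise comparison here is quadratic in the number of documents; for large corpora one would typically switch to MinHash/LSH instead.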