Corpus Creation Workflow

drupchen edited this page Dec 15, 2019 · 2 revisions

1. First Run

1.1. Tokenize with Default Profile

pybo tok ~/path/to/input/folder/

Command’s actions:

- Creates a “pybo” folder in Documents containing two profiles, main and custom, together with the config file and the pickled trie.
- Tokenizes all the files and writes the output to the folder ~/path/to/input/folder_pos/

1.2. Manually correct segmentation and POS tagging

To be done in-place in the folder “~/path/to/input/folder_pos/”

Todo:

- Check the segmentation and adjust it by correcting the tokens.
- Check the POS tags and adjust them where necessary.
- Add information to the tokens when it is deemed necessary.

The format to follow is:

<form>/<pos>/<lemma>/<sense>/<frequency>

For example, to add a new lemma and a new frequency, replace <current-form>/<current-POS> in the file with <current-form>/<current-POS>/<new-lemma>//<new-frequency>. To add only a new sense, use <current-form>/<current-POS>//<new-sense>.

Note: trailing slashes can be omitted, but not the slashes that precede the last field being added.
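The field rules above can be sketched in a few lines of Python. This is only an illustration of the <form>/<pos>/<lemma>/<sense>/<frequency> layout, not pybo's own parser: the names `Token`, `parse_token` and `format_token` are hypothetical, and the sketch assumes that a form never contains a slash.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """One annotated token; fields that were omitted are empty strings."""
    form: str
    pos: str = ""
    lemma: str = ""
    sense: str = ""
    frequency: str = ""


def parse_token(annotated: str) -> Token:
    """Split an annotated token into its five fields (hypothetical helper)."""
    parts = annotated.split("/")
    if len(parts) > len(Token._fields):
        raise ValueError(f"too many fields in {annotated!r}")
    # Pad with empty strings: trailing fields and their slashes may be omitted.
    parts += [""] * (len(Token._fields) - len(parts))
    return Token(*parts)


def format_token(token: Token) -> str:
    """Serialize a token, dropping trailing empty fields but keeping inner ones."""
    parts = list(token)
    while len(parts) > 1 and parts[-1] == "":
        parts.pop()
    return "/".join(parts)
```

With placeholder values, `parse_token("word/NOUN//meaning")` yields an empty lemma and the sense `meaning`, and `format_token` writes it back with the empty lemma slot preserved.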

1.3. Create and Update a New Profile

1.3.1 Create a New Profile

Rename ~/Documents/pybo/main/ to ~/Documents/pybo/<NEW_PROFILE>/. (This way, the default “main” profile will be recreated the next time pybo tok is run without a given profile.)
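The rename can also be scripted. The following is a minimal sketch using only the Python standard library; `create_profile` is a hypothetical helper, not a pybo command, and it assumes the folder layout described above.

```python
from pathlib import Path


def create_profile(pybo_dir: Path, new_name: str) -> Path:
    """Rename <pybo_dir>/main to <pybo_dir>/<new_name> and return the new path."""
    source = pybo_dir / "main"
    target = pybo_dir / new_name
    if not source.is_dir():
        raise FileNotFoundError(f"no 'main' profile in {pybo_dir}")
    if target.exists():
        # Refuse to overwrite an existing profile.
        raise FileExistsError(f"profile already exists: {target}")
    source.rename(target)
    return target
```

For example, `create_profile(Path.home() / "Documents" / "pybo", "my_profile")`, where `my_profile` stands for <NEW_PROFILE>.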

1.3.2 Extract New Words and Entries

pybo profile-update ~/path/to/input/folder_pos ~/Documents/pybo/<NEW_PROFILE>/

Command’s actions: compares all the entries found in the manually corrected files (~/path/to/input/folder_pos/) with the content of the current profile (~/Documents/pybo/<NEW_PROFILE>/) and identifies all the new words, as well as all the known words for which some information (POS, lemma, etc.) is new. The list of new entries is written to ~/Documents/workflow/batch1_pos_words.tsv.
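The comparison described above can be pictured as a set difference over whole entries. The sketch below is an assumption about the general idea, not pybo's implementation: `find_new_entries` and `to_tsv` are hypothetical names, entries are modelled as (form, pos, lemma, sense, frequency) tuples, and the TSV column order is illustrative only.

```python
import csv
import io


def find_new_entries(corrected, profile):
    """Return entries of `corrected` missing from `profile`, in first-seen order.

    Whole tuples are compared, so a known form with a new POS or lemma
    counts as a new entry.
    """
    known = set(profile)
    seen = set()
    new_entries = []
    for entry in corrected:
        if entry not in known and entry not in seen:
            seen.add(entry)
            new_entries.append(entry)
    return new_entries


def to_tsv(entries):
    """Serialize entries as tab-separated lines, one entry per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(entries)
    return buffer.getvalue()
```

Because the comparison is on full tuples, an entry such as `("w1", "VERB", "", "", "")` is reported as new even when `("w1", "NOUN", "", "", "")` is already in the profile.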

1.3.3 Extract New Adjustment Rules

pybo rdr ~/path/to/input/folder_pos/

Command’s actions:

- Runs RDRPOSTagger on all the manually corrected files in the folder.
- Converts the RDR rules into adjustment rules and writes them to ~/path/to/input/folder_pos_rules.tsv

1.3.4 Update the New Profile

- Review the new word entries: add, modify or delete entries.
- Move batch1_pos_words.tsv to ~/Documents/pybo/<NEW_PROFILE>/words/
- Review the rules: add, modify or delete rules.
- Move batch1_pos_rules.tsv to ~/Documents/pybo/<NEW_PROFILE>/adjustment/

1.3.5 Adjust New Profile

a) Generate the Report

pybo profile-report ~/Documents/pybo/<NEW_PROFILE>/

Command’s action: Creates ~/Documents/pybo/<NEW_PROFILE>/<NEW_PROFILE>_report.tsv

The report contains all the entries found in the whole profile. It also presents all the duplicate entries, giving the file names and line numbers where the duplicates are located.
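To make the duplicate listing concrete, here is a minimal sketch of how duplicates could be located by file name and line number. It is not the code behind pybo profile-report: `find_duplicates` is a hypothetical function, and treating each non-empty line as one entry is an assumption about the profile files.

```python
from collections import defaultdict


def find_duplicates(files):
    """Map each repeated entry to the (file name, line number) pairs where it occurs.

    `files` maps a file name to its list of lines; line numbers start at 1.
    """
    locations = defaultdict(list)
    for name, lines in files.items():
        for line_no, line in enumerate(lines, start=1):
            entry = line.strip()
            if entry:  # skip blank lines
                locations[entry].append((name, line_no))
    # Keep only the entries that occur more than once.
    return {entry: locs for entry, locs in locations.items() if len(locs) > 1}
```

Each value in the result gives the locations to visit when removing duplicates in step b).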

b) Adjust the Profile

Remove duplicates and improve existing entries, add new ones if needed and delete unneeded ones.

Rerun a) and b) until the profile is clean.

2. Subsequent Runs

2.1. Tokenize with Updated Profile

pybo tok -r -p ~/Documents/pybo/<NEW_PROFILE>/ ~/path/to/next/input/folder/

Command’s actions:

- Rebuilds the compiled trie to take the new entries into account (-r switch), using the given path to the new profile (-p ~/Documents/pybo/<NEW_PROFILE>/).
- Tokenizes all the files and writes the output to the folder ~/path/to/next/input/folder_pos/

2.2. Manually correct segmentation and POS tagging

See 1.2.

2.3. Update New Profile

2.3.1 Extract New Words and Entries

pybo profile-update ~/path/to/next/input/folder_pos ~/Documents/pybo/<NEW_PROFILE>/

See 1.3.2.

2.3.2 Extract New Adjustment Rules

pybo rdr ~/path/to/next/input/folder_pos/

See 1.3.3.

2.3.3 Update the New Profile

See 1.3.4.

2.4. Adjust New Profile

See 1.3.5.