Corpus Creation Workflow
pybo tok ~/path/to/input/folder/
Command’s actions:
creates a “pybo” folder in Documents containing two profiles, “main” and “custom”, together with the config file and the pickled trie.
tokenizes all the files and writes the output in the folder ~/path/to/input/folder_pos/
To be done in-place in the folder “~/path/to/input/folder_pos/”
Todo:
check the segmentation and adjust it by correcting the tokens
check the POS tags and adjust where necessary
add information to the tokens when it is deemed necessary. The format to follow is as follows:
<form>/<pos>/<lemma>/<sense>/<frequency>
For example, to add a new lemma and a new frequency, replace <current-form>/<current-POS> in the file with <current-form>/<current-POS>/<new-lemma>//<new-frequency>
To add only a new sense: <current-form>/<current-POS>//<new-sense>
Note: trailing slashes can be omitted, but not the slashes that precede the last field being added.
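The slash-separated format above can be illustrated with a small Python sketch. It is not part of pybo: the names `Token`, `parse_token` and `format_token` are hypothetical, and the sketch assumes that forms never contain a slash and that omitted fields are simply empty.

```python
from typing import NamedTuple

# Number of fields in <form>/<pos>/<lemma>/<sense>/<frequency>
FIELD_COUNT = 5


class Token(NamedTuple):
    """One annotated token; fields that were not given stay empty."""
    form: str
    pos: str = ""
    lemma: str = ""
    sense: str = ""
    frequency: str = ""


def parse_token(annotated: str) -> Token:
    """Split an annotated token and pad the omitted trailing fields."""
    parts = annotated.split("/")
    if len(parts) > FIELD_COUNT:
        raise ValueError(f"too many fields in {annotated!r}")
    parts += [""] * (FIELD_COUNT - len(parts))
    return Token(*parts)


def format_token(token: Token) -> str:
    """Write a token back, dropping only the empty trailing fields."""
    fields = list(token)
    while len(fields) > 1 and fields[-1] == "":
        fields.pop()
    return "/".join(fields)
```

With these helpers, `parse_token("word/NOUN//some-sense")` yields a token with an empty lemma and an empty frequency, and `format_token` keeps the inner empty slot so that the sense stays in fourth position.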
Rename ~/Documents/pybo/main/ to ~/Documents/pybo/<NEW_PROFILE>/. (This way, the default “main” profile will be recreated the next time pybo tok is run without a given profile.)
pybo profile-update ~/path/to/input/folder_pos ~/Documents/pybo/<NEW_PROFILE>/
Command’s actions: compares all the entries found in the manually corrected files (~/path/to/input/folder_pos/) with the content of the current profile (~/Documents/pybo/<NEW_PROFILE>/) and identifies all the new words, as well as all the words for which information such as POS or lemma is new. The list of new entries is written to ~/Documents/workflow/batch1_pos_words.tsv.
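The comparison described here can be pictured with the following Python sketch. It does not reproduce pybo's implementation: `parse` and `find_new_entries` are hypothetical names, and the sketch assumes that corrected tokens use the slash-separated format and that profile entries are available as 5-field tuples.

```python
FIELD_COUNT = 5  # form, pos, lemma, sense, frequency


def parse(annotated):
    """Turn 'form/pos/...' into a 5-field tuple, padding omitted fields."""
    parts = annotated.split("/")
    if len(parts) > FIELD_COUNT:
        raise ValueError(f"too many fields in {annotated!r}")
    return tuple(parts + [""] * (FIELD_COUNT - len(parts)))


def find_new_entries(corrected_tokens, profile_entries):
    """Return (new_words, new_info) relative to the profile.

    new_words: entries whose form is absent from the profile.
    new_info:  entries whose form is known but whose other fields differ.
    """
    known = set(profile_entries)
    known_forms = {entry[0] for entry in known}
    new_words, new_info = [], []
    seen = set()
    for annotated in corrected_tokens:
        entry = parse(annotated)
        if entry in known or entry in seen:
            continue  # already in the profile, or already reported
        seen.add(entry)
        if entry[0] in known_forms:
            new_info.append(entry)
        else:
            new_words.append(entry)
    return new_words, new_info
```

Keeping the two lists apart mirrors the distinction made above between new words and known words that carry new information.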
pybo rdr ~/path/to/input/folder_pos/
Command’s actions:
First runs RDRPOSTagger on all the manually corrected files in the folder
Converts the RDR rules into adjustment rules and writes them to ~/path/to/input/folder_pos_rules.tsv
Review the new word entries: add, modify or delete the entries
Move batch1_pos_words.tsv to ~/Documents/pybo/<NEW_PROFILE>/words/
Review the rules: add, modify or delete rules
Move batch1_pos_rules.tsv to ~/Documents/pybo/<NEW_PROFILE>/adjustment/
pybo profile-report ~/Documents/pybo/<NEW_PROFILE>/
Command’s action: Creates ~/Documents/pybo/<NEW_PROFILE>/<NEW_PROFILE>_report.tsv
The report contains all the entries found in the whole profile. It also presents all the duplicate entries, giving the file names and line numbers where the duplicates are located.
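To make the duplicate listing concrete, here is a minimal Python sketch of how duplicates could be located by file name and line number. It is an illustration only, not pybo's report code: `find_duplicates` and `load_profile` are hypothetical names, and the sketch assumes that profile data lives in .tsv files and that two entries are duplicates when their lines are identical.

```python
from collections import defaultdict
from pathlib import Path


def find_duplicates(files):
    """Map each duplicated entry to the (file name, line number) pairs
    where it occurs. `files` maps a file name to its list of lines."""
    locations = defaultdict(list)
    for name, lines in files.items():
        for lineno, line in enumerate(lines, start=1):
            entry = line.rstrip("\n")
            if not entry.strip():
                continue  # ignore blank lines
            locations[entry].append((name, lineno))
    return {entry: locs for entry, locs in locations.items() if len(locs) > 1}


def load_profile(profile_dir):
    """Read every .tsv file under the profile folder (assumed layout)."""
    root = Path(profile_dir)
    return {
        str(path.relative_to(root)): path.read_text(encoding="utf-8").splitlines()
        for path in sorted(root.rglob("*.tsv"))
    }
```

A call such as `find_duplicates(load_profile(profile_dir))` would then give, for each repeated line, the places to inspect when cleaning the profile.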
Remove duplicates and improve existing entries, add new ones if needed and delete unneeded ones.
Rerun a) and b) until the profile is clean.
pybo tok -r -p ~/Documents/pybo/<NEW_PROFILE>/ ~/path/to/next/input/folder/
Command’s actions:
rebuilds the compiled trie to take the new entries into account (-r switch), using the given path to the new profile (-p ~/Documents/pybo/<NEW_PROFILE>/)
tokenizes all the files and writes the output in the folder ~/path/to/next/input/folder_pos/
See 1.2.
pybo profile-update ~/path/to/next/input/folder_pos ~/Documents/pybo/<NEW_PROFILE>/
See 1.3.2.
pybo rdr ~/path/to/next/input/folder_pos/
See 1.3.3
See 1.3.4
See 1.3.5