Repair ligatures in NCI #117
Labels: corpora (About adding or updating a corpus), enhancement (New feature or request), NCI (Processing the New Corpus of Ireland)
In the NCI vert file, ligatures sometimes seem to be replaced by `?` and split off as a separate token. For example: [...]

A list of candidate words is easily collected by scanning for words containing `fi` or `fl` (and other letter combinations that can form ligatures).

Thanks to @orla-niloinsigh-adapt for reporting the issue.
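The candidate scan described above could look roughly like the following. This is a minimal sketch, not the project's code: the function name `candidate_words` and the ligature inventory in `LIGATURE_SEQUENCES` are assumptions for illustration, and the input is assumed to be a word list from a source that is not affected by the ligature problem.

```python
import re

# Letter sequences that commonly have ligature glyphs.
# Assumed inventory for illustration; extend as needed.
LIGATURE_SEQUENCES = ("ffi", "ffl", "ff", "fi", "fl")

_PATTERN = re.compile("|".join(LIGATURE_SEQUENCES))


def candidate_words(words):
    """Return the sorted, de-duplicated words containing a ligature-forming sequence."""
    return sorted({w for w in words if _PATTERN.search(w.lower())})
```

For instance, `candidate_words(["fios", "agus", "oifig", "flaith", "fios"])` returns `["fios", "flaith", "oifig"]`.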
TODO:
- [...] `?` --> decide whether to ignore the issue, fix it manually or automate the repair
- [...] `fios` vs. `flos` (second word made up). This will help to decide whether a simple context-free replacement strategy is sufficient.
- [...] `?` and decision which tokens to amalgamate)
- [...] `?` is not a ligature (some cases may be ambiguous; `?` may also be in use for other characters not supported by the NCI; most of these should show up as cases with no good substitute)

Related: Issue #124
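One way an automated, context-free repair could work is sketched below. Everything here is an assumption for illustration rather than a decided approach: the names `merge_options` and `repair`, the `LIGATURES` inventory, the use of a plain token list instead of the actual vert format, and a lowercase `vocabulary` set built from a clean word list. A `?` is only merged when exactly one known word fits; ambiguous cases (such as `fios` vs. `flos`) and cases with no good substitute are left untouched and reported.

```python
# Assumed ligature inventory; hypothetical, adjust to the corpus.
LIGATURES = ("ffi", "ffl", "ff", "fi", "fl")


def merge_options(prev, nxt, vocabulary):
    """List (word, use_prev, use_next) merges around a '?' that give known words."""
    options = []
    for lig in LIGATURES:
        shapes = [
            (prev + lig + nxt, True, True),   # ligature inside a word
            (lig + nxt, False, True),         # ligature starts a word
            (prev + lig, True, False),        # ligature ends a word
        ]
        for word, use_prev, use_next in shapes:
            # Skip shapes that need a neighbouring token that does not exist.
            if (use_prev and not prev) or (use_next and not nxt):
                continue
            if word.lower() in vocabulary:
                options.append((word, use_prev, use_next))
    return options


def repair(tokens, vocabulary):
    """Return (repaired_tokens, unresolved).

    unresolved holds (index, candidate_words) for every '?' that was left
    in place: several candidates means ambiguous, none means no substitute.
    """
    out, unresolved = [], []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok != "?":
            out.append(tok)
            i += 1
            continue
        prev = out[-1] if out else ""
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        options = merge_options(prev, nxt, vocabulary)
        if len(options) == 1:
            word, use_prev, use_next = options[0]
            if use_prev:
                out.pop()  # the previous token becomes part of the merged word
            out.append(word)
            i += 2 if use_next else 1
        else:
            unresolved.append((i, sorted(w for w, _, _ in options)))
            out.append(tok)
            i += 1
    return out, unresolved
```

With the vocabulary `{"an", "oifig"}`, `repair(["an", "oi", "?", "g"], ...)` yields `(["an", "oifig"], [])`. With `{"fios", "flos"}`, `repair(["?", "os"], ...)` leaves the tokens unchanged and reports `[(0, ["fios", "flos"])]`. How often the second kind of result occurs on real data would indicate whether a context-free strategy is enough.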