Repair ligatures in NCI #117

Open
7 tasks
jowagner opened this issue Mar 14, 2023 · 0 comments
Labels: corpora (About adding or updating a corpus), enhancement (New feature or request), NCI (Processing the New Corpus of Ireland)

jowagner commented Mar 14, 2023

In the NCI vert file, ligatures sometimes seem to have been replaced by ?, which is then split off as a separate token. For example:

$ fgrep " caithfidh " nci-plain.txt | wc -l
3072
$ fgrep " caith ? dh " nci-plain.txt | wc -l
11
$ fgrep " ráfla " nci-plain.txt | wc -l
189
$ fgrep " rá ? a " nci-plain.txt | wc -l
3

A list of candidate words is easily collected by scanning for words containing fi or fl (and other letter combinations that can form ligatures).
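A minimal sketch of this scan, assuming the plain-text export holds one whitespace-tokenised sentence per line. The function names and the list of letter sequences are illustrative assumptions, not part of the existing tooling; the list covers the sequences with a Unicode ligature character (U+FB00 to U+FB04).

```python
from collections import Counter

# Assumption: letter sequences that have a Unicode ligature character.
LIGATURE_SEQUENCES = ("ffi", "ffl", "ff", "fi", "fl")


def collect_candidates(lines):
    """Count tokens that contain at least one ligature-forming sequence."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            if any(seq in token for seq in LIGATURE_SEQUENCES):
                counts[token] += 1
    return counts


def damaged_pattern(word, seq):
    """Token sequence expected if the first `seq` in `word` became a lone '?'."""
    left, _, right = word.partition(seq)
    return " ".join(part for part in (left, "?", right) if part)


def count_damaged_lines(lines, word, seq):
    """Count lines containing the damaged form of `word` as whole tokens.

    Roughly mirrors `fgrep " caith ? dh " | wc -l`, except that matches at
    the very start or end of a line are also counted.
    """
    pattern = f" {damaged_pattern(word, seq)} "
    return sum(1 for line in lines if pattern in f" {' '.join(line.split())} ")


if __name__ == "__main__":
    # Toy input for illustration only, not taken from the NCI.
    sample = ["x caithfidh y", "x caith ? dh y", "x rá ? a y", "x ráfla y"]
    for word, freq in collect_candidates(sample).items():
        for seq in LIGATURE_SEQUENCES:
            if seq in word:
                print(word, freq, seq, count_damaged_lines(sample, word, seq))
```

Comparing the count of intact forms with the count of damaged forms per candidate gives a first estimate of how widespread the problem is.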

Thanks to @orla-niloinsigh-adapt for reporting the issue.

TODO:

  • check how frequent this issue is: find candidate words and count how often they appear with ?, then decide whether to ignore the issue, fix it manually or automate the repair
  • check for conflicting candidate replacements, i.e. pairs of words that differ only in which ligature they could use, e.g. fios vs. flos (the second word is made up); this will help to decide whether a simple context-free replacement strategy is sufficient
  • collect candidates and check strategies
  • if necessary, build and evaluate a classifier or sequence tagger to choose the repair operation (replacement of ? and decision which tokens to amalgamate)
  • support cases where the ligature appears at the start or end of a word, where multiple ligatures are in a single word or where multiple ligatures appear nearby but are not part of the same word
  • support cases where the ? is not a ligature (some cases may be ambiguous; ? may also be in use for other characters not supported by the NCI; most of these should show up as cases with no good substitute)
  • decide whether to use a ligature character or individual letters in the replacement: since the model may encounter ligatures in test data or applications, it should know them; however, data augmentation that randomly replaces letter combinations with ligatures during training may be a more effective way to achieve this robustness
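The conflict check and the simple context-free strategy from the list above could be sketched as follows. This is only an illustration under stated assumptions: the vocabulary is a set of intact word forms collected elsewhere, the function names are hypothetical, replacements use individual letters rather than ligature characters, and only a single word-internal ? per word is handled. Ligatures at the start or end of a word and several ligatures in one word are left untouched.

```python
from collections import defaultdict

# Assumption: letter sequences that have a Unicode ligature character.
LIGATURE_SEQUENCES = ("ffi", "ffl", "ff", "fi", "fl")


def find_conflicts(vocabulary):
    """Group words that become indistinguishable once a ligature is lost.

    Keys are (left, right) pairs, i.e. what remains around the lost
    sequence; only keys shared by more than one word are returned.
    """
    by_damaged = defaultdict(set)
    for word in vocabulary:
        for seq in LIGATURE_SEQUENCES:
            start = word.find(seq)
            while start != -1:
                by_damaged[(word[:start], word[start + len(seq):])].add(word)
                start = word.find(seq, start + 1)
    return {k: sorted(v) for k, v in by_damaged.items() if len(v) > 1}


def repair_candidates(left, right, vocabulary):
    """All vocabulary words of the form left + sequence + right."""
    return sorted(
        left + seq + right
        for seq in LIGATURE_SEQUENCES
        if left + seq + right in vocabulary
    )


def repair_tokens(tokens, vocabulary):
    """Merge 'left ? right' into one word if exactly one candidate exists.

    Ambiguous cases and cases without a candidate are left unchanged, so a
    genuine question mark is not touched unless it happens to fit a word.
    """
    repaired = []
    i = 0
    while i < len(tokens):
        if i + 2 < len(tokens) and tokens[i + 1] == "?":
            candidates = repair_candidates(tokens[i], tokens[i + 2], vocabulary)
            if len(candidates) == 1:
                repaired.append(candidates[0])
                i += 3
                continue
        repaired.append(tokens[i])
        i += 1
    return repaired
```

If `find_conflicts` returns few or no entries for the real vocabulary, the context-free strategy is probably sufficient; otherwise the conflicting pairs are the cases a classifier or sequence tagger would need to resolve.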

Related: Issue #124

jowagner added the enhancement and corpora labels on Mar 14, 2023
jowagner added the NCI label on May 26, 2023
jowagner mentioned this issue on Jul 17, 2024