Cleaner to normalize unicode glyphs? #50
We have one or two OCR fixes that we automatically apply, so I'm definitely interested in a general solution if you head in that direction. Years ago I assumed there must be... something: some word list or cleanup tool that some academic or open-source person had created, but I came up with absolutely nothing. I sort of concluded that the reason there was nothing was that the best OCR tools already have this built in, but I only half believe that. Here's the function I made; "one or two fixes" turned out to be right:
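A minimal sketch of the shape such a function might take; the specific substitutions below are hypothetical placeholders, not CAP's actual fixes:

```python
# Hypothetical OCR substitutions -- placeholders only, not the real fixes.
OCR_FIXES = {
    "1ll. App.": "Ill. App.",  # imagined example: "1" misread for "I"
    "F. 8upp.": "F. Supp.",    # imagined example: "8" misread for "S"
}

def fix_ocr_errors(text):
    """Apply a small, fixed set of known OCR substitutions."""
    for wrong, right in OCR_FIXES.items():
        text = text.replace(wrong, right)
    return text
```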
This does feel out of scope for eyecite though, no?
One citation-specific angle here, which might mean this wants to move upstream from CAP to eyecite eventually, is that off-the-shelf OCR software seems to be particularly typo-prone in citation strings relative to the rest of the text, because it doesn't have a language model trained on legal citations to predict what a character is supposed to be. So it seems much more likely to get reporter strings wrong than other strings -- others I've noticed just flipping through cases are ... Probably best to just let this simmer in CAP and we'll see if our collection of edge cases adds up to anything coherent.
Separately, I think a punctuation-normalizing filter probably does want to be in eyecite, since the algorithm depends on matching ascii punctuation like quotes and dashes.
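A minimal sketch of what that filter could look like; the mapping here is a starting point, not a complete list:

```python
# Map common "fancy" punctuation to ascii equivalents.
# Illustrative only; a real filter would want a fuller table.
PUNCT_MAP = str.maketrans({
    "\u2018": "'",  # left single quote
    "\u2019": "'",  # right single quote
    "\u00b4": "'",  # acute accent used as an apostrophe
    "\u201c": '"',  # left double quote
    "\u201d": '"',  # right double quote
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
})

def normalize_punctuation(text: str) -> str:
    """Replace curly quotes and long dashes with ascii lookalikes."""
    return text.translate(PUNCT_MAP)
```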
That's interesting. I bet turning all umlauts into u's would be a net benefit. I guess I could also see some of these common citation OCR misses (...)
Be careful when normalizing unicode: it gets rid of things that can be very important, such as § (Sec.) and §§ (Secs.).
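If a transliteration pass like Unidecode is in the mix, one way to honor that warning is to shield the legally meaningful characters first. A minimal sketch, assuming § is the main character to protect (§§ is just two of them); `safe_transliterate` is a made-up name:

```python
from unidecode import unidecode

# Characters that carry legal meaning and must survive transliteration;
# a bare transliteration pass would flatten them to ascii approximations.
PRESERVE = ("\u00a7",)  # section sign

def safe_transliterate(text: str) -> str:
    """Transliterate to ascii while keeping protected characters intact."""
    # Swap protected characters for ascii sentinels, which unidecode
    # passes through unchanged, then restore them afterwards.
    sentinels = {ch: f"\x00{i}\x00" for i, ch in enumerate(PRESERVE)}
    for ch, sentinel in sentinels.items():
        text = text.replace(ch, sentinel)
    text = unidecode(text)
    for ch, sentinel in sentinels.items():
        text = text.replace(sentinel, ch)
    return text
```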
Do you all have any insight about cleaning text for non-ascii characters? We have two parts of this in play for CAP:

1. Quotes and dashes (and maybe others?) can come in as curly quotes or mdashes or whatever. Some set of replacements should probably be made on our text, like `‘ ’ ´ “ ” –` -> `' ' ' " " -`; don't know if there's a good complete list. This one probably applies to most text.
2. OCR'd cites can come in with accents and umlauts and such, so for OCR'd English text we probably want to replace `é` and `ü` and so on with English-language ascii lookalikes. This might be less generally applicable.

I'm thinking of throwing everything through https://pypi.org/project/Unidecode/ , which I think will do both of those things:
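A minimal sketch of that combined pass, with the outputs I'd expect shown in a comment:

```python
from unidecode import unidecode

# One call handles both cases: curly quotes and dashes become ascii
# punctuation, and accented letters become ascii lookalikes.
print(unidecode("“Schröder” – café"))
# expected: "Schroder" - cafe
```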
I haven't measured performance yet though; might be overkill. Any other suggestions? And does some form of this want to make it into the built-in eyecite cleaners? That part doesn't matter for CAP's purposes, just curious if it'd be helpful.