
Cleaner to normalize unicode glyphs? #50

Open
jcushman opened this issue Mar 26, 2021 · 5 comments

@jcushman
Contributor

jcushman commented Mar 26, 2021

Do you all have any insight about cleaning text for non-ascii characters? We have two parts of this in play for CAP:

  • Quotes and dashes (and maybe others?) can come in as curly quotes or em dashes or whatever. Some set of replacements should probably be made on our text, like ‘ ’ ´ “ ” – -> ' ' ' " " -; I don't know if there's a good complete list. This one probably applies to most text.

  • OCR'd cites can come in with accents and umlauts and such, so for OCR'd English text we probably want to replace é and ü and so on with English-language ASCII lookalikes. This might be less generally applicable.

I'm thinking of throwing everything through https://pypi.org/project/Unidecode/ , which I think will do both of those things:

>>> from unidecode import unidecode
>>> print(unidecode('‘’´“”–éü'))
'''""-eu

I haven't measured performance yet though; might be overkill. Any other suggestions? And does some form of this want to make it into the built-in eyecite cleaners? That part doesn't matter for CAP's purposes, just curious if it'd be helpful.
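For context, here's a minimal sketch of what that would look like wrapped as a cleaner function, plus a quick-and-dirty timing check; the sample string and iteration count are arbitrary placeholders, just to get a ballpark:

import timeit

from unidecode import unidecode  # pip install Unidecode

def normalize_to_ascii(text: str) -> str:
    """Transliterate curly quotes, dashes, accents, etc. to ASCII lookalikes."""
    return unidecode(text)

# Arbitrary sample text; not representative of real case lengths.
sample = "Smith v. Jones, 123 P.2d 456 – “quoted” text, café, naïve " * 100
print(timeit.timeit(lambda: normalize_to_ascii(sample), number=1_000))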

@mlissner
Member

We have one or two OCR fixes that we automatically apply, so I'm definitely interested in a general solution if you head in that direction. Years ago I assumed there must be something out there, some word list or cleanup tool that an academic or open-source project had created, but I came up with absolutely nothing. I sort of concluded there was nothing because the best OCR tools already have this built in, but I only half believe that.

Here's the function I made ("one or two fixes" was about right):

def cleanup_ocr_text(txt: str) -> str:
    """Do some basic cleanup to make OCR text better.

    Err on the side of safety. Don't make fixes that could cause other issues.

    :param txt: The txt output from the OCR engine.
    :return: Txt output, cleaned up.
    """
    simple_replacements = (
        ("Fi|ed", "Filed"),
        (" Il ", " II "),
    )
    for old, new in simple_replacements:
        txt = txt.replace(old, new)
    return txt

This does feel out of scope for eyecite though, no?

@jcushman
Contributor Author

One citation-specific angle here, which might mean this eventually wants to move upstream from CAP to eyecite, is that off-the-shelf OCR software seems to be particularly typo-prone in citation strings relative to the rest of the text, because it doesn't have a language model trained on legal citations to predict what a character is supposed to be. So it seems much more likely to get reporter strings wrong than other strings; ones I've noticed just flipping through cases are R2d -> P.2d, Yt. -> Vt., la. -> Ia., Pae. -> Pac., and 5.Ct. -> S.Ct. I also believe I've seen speckles on the page turn into umlauts, accents, and colons within citations, though I don't have examples handy.
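If it does end up somewhere, one shape it could take is a replacement table in the style of the cleanup_ocr_text function above. This is only a sketch built from the examples in this comment, and plain substring replacement is probably too blunt without word-boundary checks (a bare "la." -> "Ia." swap would also hit words like "formula."):

# Sketch only: reporter-string OCR fixes drawn from the examples above.
# These would need word-boundary/context checks before real use.
REPORTER_OCR_FIXES = (
    ("R2d", "P.2d"),
    ("Yt.", "Vt."),
    ("la.", "Ia."),
    ("Pae.", "Pac."),
    ("5.Ct.", "S.Ct."),
)

def fix_reporter_ocr(txt: str) -> str:
    for bad, good in REPORTER_OCR_FIXES:
        txt = txt.replace(bad, good)
    return txt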

Probably best to just let this simmer in CAP and we'll see if our collection of edge cases adds up to anything coherent.

@jcushman
Contributor Author

Separately, I think a punctuation-normalizing filter probably does want to be in eyecite, since the algorithm depends on matching ASCII punctuation like quotes and dashes.
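Something like this is what I have in mind; the mapping is just the characters from my first comment plus an em dash, definitely not a complete list, and whether it should be a built-in cleaner or a user-supplied step is the open question:

# Sketch of a punctuation-normalizing step; the mapping below is illustrative,
# not an exhaustive list of lookalike quote/dash code points.
PUNCT_MAP = str.maketrans({
    "\u2018": "'",  # ‘ left single quotation mark
    "\u2019": "'",  # ’ right single quotation mark
    "\u00b4": "'",  # ´ acute accent
    "\u201c": '"',  # “ left double quotation mark
    "\u201d": '"',  # ” right double quotation mark
    "\u2013": "-",  # – en dash
    "\u2014": "-",  # — em dash
})

def normalize_punctuation(text: str) -> str:
    return text.translate(PUNCT_MAP)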

@mlissner
Member

That's interesting. I bet turning all umlauts into u's would be a net benefit. I could also see some of these common citation OCR misses (R2d, for example) showing up in reporters_db somehow. Seems messy, though.
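For what it's worth, the umlaut/accent part on its own doesn't need a dependency; a standard-library sketch using NFKD decomposition (this only strips combining marks, so it won't touch curly quotes or dashes):

import unicodedata

def strip_accents(text: str) -> str:
    # Decompose characters (e.g. ü -> u + combining diaeresis), then drop
    # the combining marks, leaving the base ASCII letter behind.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("Société Générale, naïve, ü"))  # Societe Generale, naive, u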

@devlux76

Be careful when normalizing Unicode. It gets rid of things that can be very important, such as § (Sec.) and §§ (Secs.). So before doing that, it might be good to parse the text for legal glyphs and convert them to their English equivalents.
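A sketch of that ordering, with just the two substitutions mentioned above (the double form has to be handled before the single one):

# Sketch: swap legal glyphs for their English equivalents *before* any broader
# Unicode normalization so they aren't lost. Only the two glyphs mentioned
# in this comment are included here.
LEGAL_GLYPHS = (
    ("§§", "Secs."),  # must run before the single § replacement
    ("§", "Sec."),
)

def expand_legal_glyphs(text: str) -> str:
    for glyph, replacement in LEGAL_GLYPHS:
        text = text.replace(glyph, replacement)
    return text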
