Add clean_ligatures to core cleaners #1326

Merged: 2 commits, Sep 7, 2023

@walsha2 walsha2 commented Sep 7, 2023

Background

Ligatures sometimes show up in extracted text when they should not. Common examples are the Latin f-related ligatures, which are very subtle to spot by eye (see the example below) but can wreak havoc later.

"ﬀ": "ff",
"ﬁ": "fi",
"ﬂ": "fl",
"ﬃ": "ffi",
"ﬄ": "ffl",
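
Because these characters are hard to spot by eye, a quick scan by code point can help. The following is a minimal sketch using only the Python standard library; `find_ligatures` is a hypothetical helper for illustration and is not part of unstructured.

```python
import unicodedata


def find_ligatures(text):
    """Return (index, character, Unicode name) for each character whose
    Unicode name contains the word LIGATURE."""
    hits = []
    for i, ch in enumerate(text):
        # unicodedata.name() returns the default "" for unnamed code points.
        name = unicodedata.name(ch, "")
        if "LIGATURE" in name:
            hits.append((i, ch, name))
    return hits


# "\ufb01" is the single-character fi ligature.
print(find_ligatures("Bene\ufb01ts"))
print(find_ligatures("Benefits"))  # plain ASCII, so nothing is reported
```

Note that this name-based check also flags ligatures from other scripts, and it misses characters such as æ whose Unicode name does not contain the word LIGATURE.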

Several libraries already do something like this. Most recently, pdfplumber added this capability as part of its text extraction process; see jsvine/pdfplumber#598.

Rather than introduce a breaking change to the PDF text processing in unstructured, it is best to add this as another cleaner and let users opt in. To that end, this PR adds the clean_ligatures function, with accompanying tests.
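
To illustrate the idea of an opt-in cleaner, here is a minimal sketch of a mapping-based replacement. It is not the code from this PR: `replace_ligatures` and `F_LIGATURES` are hypothetical names, and the mapping covers only the five f-ligatures from the Background list.

```python
# Hypothetical mapping for illustration: the five Latin f-ligatures only.
F_LIGATURES = {
    "\ufb00": "ff",   # ff ligature
    "\ufb01": "fi",   # fi ligature
    "\ufb02": "fl",   # fl ligature
    "\ufb03": "ffi",  # ffi ligature
    "\ufb04": "ffl",  # ffl ligature
}


def replace_ligatures(text, mapping=F_LIGATURES):
    """Replace each ligature character in text with its plain-letter expansion."""
    for ligature, expansion in mapping.items():
        text = text.replace(ligature, expansion)
    return text


print(replace_ligatures("Bene\ufb01ts"))  # the fi ligature becomes "fi"
print(replace_ligatures("o\ufb03ce"))     # the ffi ligature becomes "ffi"
```

Because the cleaner is a plain string-to-string function, callers decide when to apply it, which is what keeps the change non-breaking.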

Example

Here is an example PDF that causes the issue. For example: "Beneﬁts" (containing the single-character ﬁ ligature, U+FB01), which should be "Benefits".

example.pdf

curl -X 'POST' \
    'https://api.unstructured.io/general/v0/general' \
    -H 'accept: application/json' \
    -H 'Content-Type: multipart/form-data' \
    -H "unstructured-api-key: ${UNSTRUCTURED_API_KEY}" \
    -F 'files=@example.pdf' \
    -s | jq -C .
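
Once a response like the one above is in hand, the ligatures can be cleaned as a post-processing step. The sketch below assumes the response is a JSON list of element objects that each carry a "text" field; the sample payload is made up for illustration, and `clean_elements` is a hypothetical helper, not part of unstructured.

```python
import json

# Hypothetical mapping for illustration: the five Latin f-ligatures only.
F_LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def clean_elements(elements):
    """Return copies of the element dicts with ligatures in "text" expanded."""
    cleaned = []
    for element in elements:
        element = dict(element)  # shallow copy so the input is left untouched
        if "text" in element:
            text = element["text"]
            for ligature, expansion in F_LIGATURES.items():
                text = text.replace(ligature, expansion)
            element["text"] = text
        cleaned.append(element)
    return cleaned


# Made-up sample payload standing in for an API response body.
sample = json.loads('[{"type": "Title", "text": "Bene\\ufb01ts"}]')
print(clean_elements(sample))
```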

Notes

An initial list of mappings was added with the most common ligatures. There is some subjectivity to this, but this should be a relatively safe starting set. It can always be expanded as needed.
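
As a rough illustration of that subjectivity, the sketch below extends a small mapping with extra entries and contrasts it with Unicode NFKC normalization from the standard library. The `BASE` and `EXTRA` mappings and the `apply_mapping` helper are hypothetical and chosen only for this example.

```python
import unicodedata

BASE = {"\ufb01": "fi", "\ufb02": "fl"}     # subset of the f-ligatures
EXTRA = {"\u00e6": "ae", "\u0153": "oe"}    # hypothetical, more debatable additions


def apply_mapping(text, mapping):
    """Replace every key of mapping found in text with its value."""
    for source, target in mapping.items():
        text = text.replace(source, target)
    return text


# Expanding the set is just a dict merge.
extended = {**BASE, **EXTRA}
print(apply_mapping("\u00e6ther \ufb01eld", extended))

# NFKC also expands the fi ligature, but it rewrites other compatibility
# characters too (superscript two becomes "2") and leaves ae untouched.
print(unicodedata.normalize("NFKC", "x\u00b2 \ufb01eld \u00e6ther"))
```

An explicit mapping changes only the characters it lists, which is one reason it can be a safer default than blanket normalization.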

@cragwolfe

@walsha2 , please add a bullet near the top of CHANGELOG.md. this is a great cleaning addition, thanks!
i'm tempted for this to be included by default in many partition cases, tbh, but that can follow in a later PR.

@walsha2

walsha2 commented Sep 7, 2023

add a bullet near the top of CHANGELOG.md

Done.

i'm tempted for this to be included by default in many partition cases

I thought about it as well, but I just started messing with unstructured this week. Will defer that decision about partition incorporation to someone more in the know! 😄


First PR to get feet wet with repo. Awesome package. Looking forward to making more contributions in the future!

@walsha2

walsha2 commented Sep 7, 2023

@cragwolfe I can't really make sense of the CI actions. Not sure why CI / lint (3.8) is failing. It does not look to be related to this PR at all.

make: black: No such file or directory

https://github.com/Unstructured-IO/unstructured/actions/runs/6105531406/job/16569692294?pr=1326

@cragwolfe

yes, github CI was having a rough time for some unknown reason -- frustrating! it seems to be fine now.
i just merged in changes to main, let's see how it goes 🤞

@cragwolfe cragwolfe enabled auto-merge (squash) September 7, 2023 20:54
@cragwolfe cragwolfe merged commit e4e25c9 into Unstructured-IO:main Sep 7, 2023
36 checks passed