Add clean_ligatures to core cleaners #1326
Merged
Background
Ligatures can sometimes show up during text extraction when they should not. Very common examples are the Latin `f`-related ligatures, which can be very subtle to spot by eye (see the example below) but can wreak havoc later.

Several libraries already do something like this. Most recently, `pdfplumber` added this sort of capability as part of its text extraction process; see jsvine/pdfplumber#598.

Instead of incorporating any sort of breaking change to the PDF text processing in `unstructured`, it is best to add this as another cleaner and allow users to opt in. In turn, the `clean_ligatures` method has been added in this PR, with accompanying tests.

Example
Here is an example PDF that causes the issue. Its extracted text contains `Beneﬁts` (with the single-character `ﬁ` ligature), which should be `Benefits`.

example.pdf
Notes
An initial list of mappings was added with the most common ligatures. There is some subjectivity to this, but this should be a relatively safe starting set. It can always be expanded as needed.
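
To make the mapping approach concrete, here is a minimal sketch of a mapping-based ligature cleaner. It is not the code merged in this PR: the names `LIGATURE_MAP` and `replace_ligatures` are hypothetical, and the mapping shown is only an assumed subset covering the Latin `f`-related ligatures plus `ﬆ`.

```python
# Hypothetical, illustrative subset of ligature mappings (not the PR's actual list).
LIGATURE_MAP = {
    "\ufb00": "ff",   # ﬀ
    "\ufb01": "fi",   # ﬁ
    "\ufb02": "fl",   # ﬂ
    "\ufb03": "ffi",  # ﬃ
    "\ufb04": "ffl",  # ﬄ
    "\ufb06": "st",   # ﬆ
}

# str.maketrans accepts a dict of single characters mapped to replacement strings.
_TRANSLATION_TABLE = str.maketrans(LIGATURE_MAP)


def replace_ligatures(text: str) -> str:
    """Replace each mapped ligature character with its plain-letter expansion."""
    return text.translate(_TRANSLATION_TABLE)


# The word from the example PDF, plus two other f-ligature cases.
cleaned = replace_ligatures("Bene\ufb01ts of e\ufb03cient work\ufb02ows")
print(cleaned)
```

Because the cleaner is a separate, opt-in step, text that contains none of the mapped characters passes through unchanged, and expanding coverage only means adding entries to the mapping.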