extract_text() returns a unicode character \ufb03 LATIN SMALL LIGATURE FFI instead of the letters ffi when it comes across the word Office, #598

colemanr03 · 2022-02-08T20:52:08Z

I am extracting information from over 1000 large PDF files. I am looking for rooms in a sketch with lots of pictures.

I love pdfplumber, it is so much better than the other methods I found. I do have one small problem.

When I find a room that is labeled 'Office', the extract_text routine returns \ufb03 LATIN SMALL LIGATURE FFI instead of the letters ffi:

Oﬃce/2nd Floor instead of Office/2nd Floor.

They look the same, but the first one has \ufb03 LATIN SMALL LIGATURE FFI instead of the letters ffi.

Have you seen anything like this? Do you have a workaround?

Thank you

samkit-jain · 2022-02-09T11:11:08Z

Hi @colemanr03 Appreciate your interest in the library. Could you please provide more details like the version of pdfplumber you are using, the PDF (redacting any sensitive information) that is causing the issue and a minimum reproducible code example?

jsvine · 2022-07-20T22:21:46Z

Just chiming in to echo @samkit-jain: Are you able to share the PDF, @colemanr03? It'd help resolve this issue, as I haven't come across another PDF that demonstrates this particular situation.

jeffkile · 2023-03-03T00:08:56Z

I'm running into the same issue. I can reproduce it consistently with this PDF: https://www.ck12info.org/wp-content/uploads/2008/12/CK12_Earth_Science_rev.pdf

"Fiction" and "Scientific" on page 9 (first page of chapter 1) end up as ﬁction and scientiﬁc (it may not be obvious at first glance but the "fi"s are a single character in these words)

I'm using version 0.8.0

jsvine · 2023-03-09T15:48:54Z

Hi @jeffkile, and thanks. For reference's sake, here's what I think you're pointing to:

import pdfplumber

pdf = pdfplumber.open("CK12_Earth_Science_rev.pdf")
page = pdf.pages[8]  # pages are 0-indexed, so this is page 9
print(page.extract_text())

... produces:

Chapter 1
What is Earth Science?
1.1 Nature of Science
Lesson Objectives
• Explain the importance of asking questions.
• State the steps of the scientiﬁc method.
• Describe the three major types of scientiﬁc models.
• Use appropriate safety precautions inside and outside the science laboratory.
Introduction
Think of your favorite science ﬁction movie. What is it about? Maybe it’s about spaceships
going to distant planets, or people being cloned in laboratories, or undersea civilizations, or
robots that walk among us. These entertaining imaginings are make-believe fantasies, that’s
why they’re called science “ﬁction.” They are not real. But why are they called “science”
ﬁction?
The answer is that science uses a disciplined process to answer questions. In science, “disci-
plined” does not mean well-behaved. It means following orderly steps in order to come up
with the best answers. Science involves observing, wondering, categorizing, communicating,
calculating, analyzing, and much more. In order to convert creativity into reality, we need
science. In order to travel beyond where anyone has gone before, we need science. In order
to understand the world, make sense of it, and conserve it, we need science. In order to
conﬁrm our best guesses about the universe and the things in it, we need science. Science
ﬁction stories extend and expand on all the ideas of science and technology in creative ways.
1
www.ck12.org

... and that the word we would recognize as fiction appears as ﬁction (i.e., ﬁ-ligature followed by ction).

I'm inferring (correct me if I'm mistaken), that you'd prefer .extract_text(...) to convert all ﬁ ligatures to the two-character representation, fi.

That makes sense to me and, although it would introduce a breaking change for some users, it probably fits the most common use-case. Still, I'd want some way for users to preserve ligatures in the extracted text, perhaps through a preserve_ligatures=True parameter.

I'd be curious for your thoughts @samkit-jain!

samkit-jain · 2023-03-20T17:38:49Z

@jsvine As a user, yes, I too would prefer to have it read fi instead of ﬁ. We can add a new parameter, but wouldn't that be too much for just one ligature? Or are you planning to add support for many?

prakhs123 · 2023-03-26T12:33:07Z

This was bothering me as well, so I am using
text.replace('ﬁ', 'fi')
in my code

jeffkile · 2023-04-05T00:18:37Z

Yeah I'm doing something similar with
text.replace("ﬀ", "ff").replace("ﬁ", "fi").replace("ﬂ", "fl").replace("ﬃ", "ffi").replace("ﬄ", "ffl")

The problem is these are just the ones I've found so far, there may be other ones that I haven't found yet which users could see in production, so a generalized fix would really be preferred.
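One way to avoid hand-listing each ligature, as a minimal sketch rather than what pdfplumber itself does: build the mapping from Unicode's own compatibility decompositions for the Latin ligatures in the Alphabetic Presentation Forms block (U+FB00 through U+FB06). The names `LIGATURE_TABLE` and `expand_ligatures` are made up for this example. Restricting the table to that range is a deliberate choice; running NFKC/NFKD over the whole string would also rewrite unrelated characters such as superscripts and full-width forms.

```python
import unicodedata

# Latin ligatures in the Alphabetic Presentation Forms block:
# U+FB00 (ff) through U+FB06 (st).
LIGATURE_CODEPOINTS = range(0xFB00, 0xFB07)

# Map each ligature to its compatibility decomposition, e.g. U+FB03 -> "ffi".
LIGATURE_TABLE = {
    cp: unicodedata.normalize("NFKD", chr(cp)) for cp in LIGATURE_CODEPOINTS
}


def expand_ligatures(text: str) -> str:
    """Replace Latin presentation-form ligatures with their plain letters."""
    return text.translate(LIGATURE_TABLE)


print(expand_ligatures("O\ufb03ce/2nd Floor"))  # Office/2nd Floor
```

Characters outside that range, including letters such as ß, pass through unchanged.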

colemanr03 · 2023-04-05T05:13:35Z

Sorry that I did not supply the PDF file. I see someone else did. I found a similar workaround with a text replace. I did not realize there were others I needed to look for. I can't believe that anyone would want their English text converted to ligatures.

jsvine · 2023-04-13T13:02:34Z

With v0.9.0, pdfplumber's text-extraction methods now expand the most common Latin-alphabet ligatures into their constituent characters. (It does not do so for ligatures that are considered to be their own letter, such as German's ß.) Let me know if/how it works for you! Thanks to commenters here for the suggestions and examples.
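For anyone who wants the old behaviour back, a hedged sketch: as far as I recall from the v0.9.0 changelog, the text-extraction methods accept an `expand_ligatures` boolean that defaults to `True`. Treat that parameter name as an assumption and check it against the version you have installed. The helper `extract_page_text` is hypothetical and only wraps the call.

```python
def extract_page_text(path: str, page_index: int, expand: bool = True) -> str:
    """Extract one page's text, optionally keeping ligatures as single characters."""
    import pdfplumber  # third-party; imported lazily so the helper can be defined without it

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_index]
        # Assumption: `expand_ligatures` is the keyword added in v0.9.0.
        return page.extract_text(expand_ligatures=expand)


# Example call (needs the PDF from earlier in this thread on disk):
# text = extract_page_text("CK12_Earth_Science_rev.pdf", 8, expand=False)
```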

Tom-Hudson · 2023-05-09T12:48:05Z

@jsvine I am using 0.9.0 and I am still getting occasions where fi are being substituted with strange chars:

I am just using extract_text method. Any ideas?

jsvine · 2023-05-09T12:50:36Z

@Tom-Hudson Thank you for flagging. Can you share the PDF? (Hard to diagnose the issue without it. If it's a document you don't want to share publicly, you can email it to me — my address is in my GitHub profile.)

Tom-Hudson · 2023-05-09T13:05:44Z

@jsvine I have just emailed you the PDF now. Thanks so much for looking into this!

Tom-Hudson · 2023-05-10T09:02:36Z

Posting here to benefit anyone who encounters a similar issue and wrongly blames pdfplumber like me 😞

@jsvine was super helpful and pointed out that the PDF I was generating had null unicode characters to represent some ligatures.

We generate PDFs with Chromium from HTML and after reading this issue on StackOverflow: https://stackoverflow.com/questions/39504775/disabling-font-ligatures-css-letter-combining, I added CSS to disable ligatures in Chrome which solved my problem!

body {
  font-variant-ligatures: none;
  font-feature-settings: "liga" 0;
}

Here is the before and after which you can see in the PDF:

Before

After

After some testing, the actual cause was a combination of Chromium using ligatures and our usage of a webfont - I didn't dig too deep but it feels like there is something broken with ligatures and webfonts when printing to a PDF. If I re-enable ligatures with the default font, the ligatures are present but pdfplumber sorts them out because they use the correct unicode characters.

Thanks again @jsvine!
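A rough diagnostic for this situation, assuming "null unicode characters" means the extracted text contains NUL or similar placeholder code points where the ligature glyphs were: scan the output of `extract_text()` for characters that carry no usable text. If any turn up, the problem is more likely in how the PDF was generated than in pdfplumber. The function name `suspicious_chars` and the choice of categories to flag are illustrative only.

```python
import unicodedata


def suspicious_chars(text: str) -> list[tuple[int, str, str]]:
    """Return (index, code point, reason) for characters that carry no usable text."""
    found = []
    for i, ch in enumerate(text):
        if ch == "\x00":
            reason = "NUL"
        elif ch == "\ufffd":
            reason = "replacement character"
        elif unicodedata.category(ch) in ("Co", "Cn"):
            reason = "private-use or unassigned"
        else:
            continue
        found.append((i, f"U+{ord(ch):04X}", reason))
    return found


# "con" + NUL + "rm" stands in for "confirm" with a broken fi ligature.
print(suspicious_chars("con\x00rm"))
```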

# Background

[Ligatures](https://en.wikipedia.org/wiki/Ligature_(writing)#Ligatures_in_Unicode_(Latin_alphabets)) can sometimes show up during the text extraction process when they should not. Very common examples of this are with the Latin `f` related ligatures which can be **very subtle** to spot by eye (see example below), but can wreak havoc later.

```python
"ﬀ": "ff",
"ﬁ": "fi",
"ﬂ": "fl",
"ﬃ": "ffi",
"ﬄ": "ffl",
```

Several libraries already do something like this. Most recently, `pdfplumber` added this sort of capability as part of the text extraction process, see jsvine/pdfplumber#598

Instead of incorporating any sort of breaking change to the PDF text processing in `unstructured`, it is best to add this as another cleaner and allow users to opt in. In turn, the `clean_ligatures` method has been added in this PR - with accompanying tests.

# Example

Here is an example PDF that causes the issue. For example: `Beneﬁts`, which should be `Benefits`.

[example.pdf](https://github.com/Unstructured-IO/unstructured/files/12544344/example.pdf)

```bash
curl -X 'POST' \
  'https://api.unstructured.io/general/v0/general' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -H 'unstructured-api-key: ${UNSTRUCTURED_API_KEY}' \
  -F 'files=@example.pdf' \
  -s | jq -C .
```

# Notes

An initial list of mappings was added with the most common ligatures. There is some subjectivity to this, but this should be a relatively safe starting set. Can always be expanded as needed.

samkit-jain added the awaiting-code-or-pdf Issues and PRs awaiting code and/or a PDF from issue/PR-author label Feb 9, 2022

jsvine added a commit that referenced this issue Apr 13, 2023

By default, expand ligatures into their letters

86e935d

Addresses issue #598

jsvine mentioned this issue Apr 13, 2023

v0.9.0 #862

Merged

jsvine closed this as completed Apr 13, 2023

walsha2 mentioned this issue Sep 7, 2023

Add clean_ligatures to core cleaners Unstructured-IO/unstructured#1326

Merged
