Add normalize_unicode=False/True parameter to text extraction methods #905
Comments
Hi @jsvine, is there a workaround for this in the meantime? Can I manually apply a normalize function to all text in the PDF? |
Hi @agusluques, and thanks for checking. There have not been any updates on this, but there may still be a solution for certain use-cases. What's your particular use-case? |
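@jsvine thanks for the answer. Basically, I am trying to do some split by ; (U+003B) but the PDF seems to have a different ; (U+037E). I am doing some manual replacement but it will be great to have this at the moment of reading the PDF so I don't have any point of risk in case I forget to include the cleaning logic |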
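One possible interim workaround for the semicolon case is to normalize each page's text immediately after extraction, so the cleaning step cannot be forgotten later. This is only a sketch: `normalize_text` and `extract_normalized_pages` are hypothetical helper names, not part of pdfplumber, and the choice of NFC as the default form is an assumption (any of the four forms maps U+037E to U+003B).

```python
import unicodedata


def normalize_text(text, form="NFC"):
    """Apply Unicode normalization to extracted text (hypothetical helper).

    An empty string is returned when the extractor yields nothing.
    """
    return unicodedata.normalize(form, text or "")


def extract_normalized_pages(path, form="NFC"):
    """Extract text page by page and normalize it right away (hypothetical helper)."""
    import pdfplumber  # imported lazily so normalize_text works without it

    with pdfplumber.open(path) as pdf:
        return [normalize_text(page.extract_text(), form) for page in pdf.pages]


# U+037E (Greek question mark) is canonically equivalent to U+003B (semicolon),
# so after normalization a plain split on ";" behaves as expected.
raw = "alpha\u037ebeta\u037egamma"
fields = normalize_text(raw).split(";")
```

Wrapping extraction and normalization in one function keeps the cleaning logic in a single place, which addresses the "point of risk" mentioned above.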
The definitive rules are defined in the Unicode spec
(https://unicode.org/reports/tr15/). It needs careful reading ("Taken
step-by-step, the Unicode Normalization Algorithm is fairly complex"). It
specifically discusses the Greek question mark. There are different formal
approaches:
>>
The four Unicode Normalization Forms are summarized in *Table 1.*
Table 1. Normalization Forms
<https://unicode.org/reports/tr15/#Normalization_Forms_Table>
Form                            Description
Normalization Form D (NFD)      Canonical Decomposition
Normalization Form C (NFC)      Canonical Decomposition, followed by Canonical Composition
Normalization Form KD (NFKD)    Compatibility Decomposition
Normalization Form KC (NFKC)    Compatibility Decomposition, followed by Canonical Composition
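As a rough illustration of how the four forms in the table differ, the following sketch applies each of them to two sample characters using Python's standard `unicodedata` module. The helper name `normalized_forms` and the choice of samples (the angstrom sign and the "fi" ligature) are illustrative assumptions, not taken from the spec excerpt.

```python
import unicodedata

FORMS = ("NFD", "NFC", "NFKD", "NFKC")


def normalized_forms(text):
    """Return the result of each normalization form for the given text."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


# U+212B ANGSTROM SIGN has a canonical decomposition, so all four forms change it:
# the D forms give "A" + U+030A, the C forms give U+00C5.
angstrom = normalized_forms("\u212b")

# U+FB01 LATIN SMALL LIGATURE FI has only a compatibility decomposition,
# so NFD and NFC leave it alone while NFKD and NFKC expand it to "fi".
ligature = normalized_forms("\ufb01")
```

The ligature case shows why the compatibility forms (NFKD, NFKC) are more invasive than the canonical ones.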
=====
10 Respecting Canonical Equivalence
<https://unicode.org/reports/tr15/#Canonical_Equivalence>
This section describes the relationship of normalization to respecting (or
preserving) canonical equivalence. A process (or function) *respects* canonical
equivalence when canonical-equivalent inputs always produce
canonical-equivalent outputs. For a function that transforms one string
into another, this may also be called *preserving* canonical equivalence.
There are a number of important aspects to this concept:
1. The outputs are *not* required to be identical, only canonically
equivalent.
2. *Not* all processes are required to respect canonical equivalence.
For example:
- A function that collects a set of the General_Category values
present in a string will and should produce a different value
for <*angstrom sign, semicolon*> than for <*A, combining ring above,
greek question mark*>, even though they are canonically equivalent.
- A function that does a binary comparison of strings will also find
these two sequences different.
3. Higher-level processes that transform or compare strings, or that
perform other higher-level functions, must respect canonical equivalence or
problems will result.
<<<
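The two sequences named in the quoted passage can be checked directly. This is a minimal sketch using Python's `unicodedata`; the variable names and the `categories` helper are illustrative assumptions.

```python
import unicodedata

seq_a = "\u212b;"         # angstrom sign, semicolon
seq_b = "A\u030a\u037e"   # A, combining ring above, Greek question mark

# A binary comparison sees two different strings.
binary_equal = seq_a == seq_b

# Comparing after normalization respects canonical equivalence.
canonical_equal = (
    unicodedata.normalize("NFC", seq_a) == unicodedata.normalize("NFC", seq_b)
)


def categories(text):
    """Collect the set of General_Category values present in a string."""
    return {unicodedata.category(ch) for ch in text}


# The category sets differ: seq_b contains a nonspacing mark (Mn), seq_a does not.
cats_a = categories(seq_a)
cats_b = categories(seq_b)
```

Here `binary_equal` is false while `canonical_equal` is true, which matches the distinction the spec draws between low-level and higher-level processes.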
It's important we adhere precisely to Unicode terminology and philosophy.
For me (a crystallographer) it's the equivalence between Aring and Angstrom
(which are frequently misused). Note that Aring is further complicated and
may have to be normalised: 0041 (A) + 030A (combining ring) => 00C5 (Aring).
The problems frequently arise when authors pick symbols from menus without
realising what character results.
There are a lot of further illiteracies which probably can't be dealt with,
e.g. em-dash for minus.
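The composition step and the em-dash limitation can both be seen in a short sketch. The `DASH_TO_MINUS` table is a hypothetical, domain-specific cleanup and an assumption on my part, not something Unicode normalization or pdfplumber provides; mapping every dash to a hyphen-minus is lossy and would only suit text known to be numeric.

```python
import unicodedata

# 0041 (A) + 030A (combining ring) composes to 00C5 (Aring) under NFC.
composed = unicodedata.normalize("NFC", "A\u030a")

# An em dash (U+2014) has no decomposition, so even NFKC leaves it untouched.
expression = "5 \u2014 3"
after_nfkc = unicodedata.normalize("NFKC", expression)

# Hypothetical explicit cleanup: treat en dash, em dash and the minus sign
# as a hyphen-minus. This is a judgment call, not a normalization rule.
DASH_TO_MINUS = str.maketrans({"\u2013": "-", "\u2014": "-", "\u2212": "-"})
cleaned = after_nfkc.translate(DASH_TO_MINUS)
```

Misused symbols of this kind need an explicit, use-case-specific mapping because the characters are not equivalent in Unicode terms.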
…On Tue, Jul 16, 2024 at 2:23 PM Agus Luques ***@***.***> wrote:
@jsvine <https://github.com/jsvine> thanks for the answer. Basically, I
am trying to do some split by ; (U+003B) but the PDF seems to have a
different ; (U+037E). I am doing some manual replacement but it will be
great to have this at the moment of reading the PDF so I don't have any
point of risk in case I forget to include the cleaning logic
--
Peter Murray-Rust
Founder ContentMine.org
and
Reader Emeritus in Molecular Informatics
Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
|
Allows user to pre-normalize Unicode characters. h/t @petermr + @agusluques in #905
Feature now added in 03a477f. Give it a whirl and let me know if it suits your needs / meets your expectations? |
Per @petermr's suggestion in #904 (comment), I think it's a good idea to add such a parameter/option, using unicodedata.normalize(...), in a similar vein to the expand_ligatures parameter added in v0.9.0. I'll look into this.
Some useful reference links, as a note-to-self: