Missing spaces in extract_text() method #1328
Comments
Is there any workaround for this so far?
Hi. The problem seems to be a mismatch between the retrieved font-size value and the actual width of the space. If the actual values are rounded up, I think it will work. Is this helpful? Try rounding up at line 1992 in 8dd9fcb.
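To illustrate the idea in the comment above, here is a minimal sketch of how a strict size comparison can drop a space that a tolerant (ratio-based) comparison keeps. This is not pypdf's actual code: the function names, the chunk layout, and the `min_ratio` value of 0.9 are all assumptions chosen for illustration.

```python
# Hypothetical sketch, not pypdf's implementation: decide whether the
# horizontal gap between two text chunks should become a space.

def gap_is_space_strict(gap: float, space_width: float) -> bool:
    """Exact comparison: a gap slightly narrower than the expected width is rejected."""
    return gap >= space_width


def gap_is_space_ratio(gap: float, space_width: float, min_ratio: float = 0.9) -> bool:
    """Ratio comparison: accept gaps close to the expected width (min_ratio is an assumed tolerance)."""
    if space_width <= 0:
        return False
    return gap / space_width >= min_ratio


def join_chunks(chunks, space_width, is_space):
    """Join (text, x_start, x_end) chunks, given left to right, inserting spaces where is_space agrees."""
    parts = []
    prev_end = None
    for text, x_start, x_end in chunks:
        if prev_end is not None and is_space(x_start - prev_end, space_width):
            parts.append(" ")
        parts.append(text)
        prev_end = x_end
    return "".join(parts)


# The measured gap (about 2.70) is slightly smaller than the width derived
# from the font size (2.78), so the strict test loses the space.
chunks = [("Missing", 0.0, 40.0), ("spaces", 42.7, 75.0)]
strict_result = join_chunks(chunks, 2.78, gap_is_space_strict)
ratio_result = join_chunks(chunks, 2.78, gap_is_space_ratio)
```

With these made-up numbers, the strict comparison joins the two words and the ratio comparison separates them. A real extractor would have to tune the tolerance against many documents.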
Sorry, I have an addition.
* BUG: Missing spaces in extract_text() method (#1328)
* Revert "BUG: Missing spaces in extract_text() method (#1328)"
  This reverts commit 5400f5a.
* BUG: Missing spaces in extract_text() method (#1328)
* BUG: Missing spaces in extract_text() method (#1328)
  add test
* Revert "BUG: Missing spaces in extract_text() method (#1328)"
  This reverts commit 5400f5a.
  BUG: Missing spaces in extract_text() method (#1328)
  BUG: Missing spaces in extract_text() method (#1328)
  add test
* BUG: Missing spaces in extract_text() method (#1328)
  Convert font size comparison to ratio
* Correction to new file URL.
  Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>
* BUG: Missing spaces in extract_text() method (#1328)
  calculation efficiency
* BUG: Missing spaces in extract_text() method (#1328)
  Simplify the assertion process

---------

Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>
## Version 5.0.1, 2024-09-29

### New Features (ENH)
- Add `full` parameter to PdfWriter constructor (py-pdf#2865)

### Bug Fixes (BUG)
- Update pyproject.toml with minimum Python version of 3.8 (py-pdf#2859)
- Cope with unbalanced delimiters in dictionary object (py-pdf#2878)
- Cope with encoding with too many differences (py-pdf#2873)
- Missing spaces in extract_text() method (py-pdf#1328) (py-pdf#2868)
- Tolerate truncated files and no warning when jumping startxref (py-pdf#2855)

### Robustness (ROB)
- Repair PDF with invalid Root object (py-pdf#2880)
- Continue parsing dictionary object when error is detected (py-pdf#2872)
- Merge documents with invalid pages in named destinations (py-pdf#2857)
- Tolerate comments(%) in arrays (py-pdf#2856)

### Documentation (DOC), Testing (TST), Code Style (STY), Developer Experience (DEV), Maintenance (MAINT)
- (py-pdf#2844), (py-pdf#2862), (py-pdf#2863), (py-pdf#2847), (py-pdf#2860), (py-pdf#2867), (py-pdf#2874), (py-pdf#2879)

[Full Changelog](py-pdf/pypdf@5.0.0...5.0.1)
## Version 5.0.1, 2024-09-29

### New Features (ENH)
- Add `full` parameter to PdfWriter constructor (#2865)

### Bug Fixes (BUG)
- Update pyproject.toml with minimum Python version of 3.8 (#2859)
- Cope with unbalanced delimiters in dictionary object (#2878)
- Cope with encoding with too many differences (#2873)
- Missing spaces in extract_text() method (#1328) (#2868)
- Tolerate truncated files and no warning when jumping startxref (#2855)

### Robustness (ROB)
- Repair PDF with invalid Root object (#2880)
- Continue parsing dictionary object when error is detected (#2872)
- Merge documents with invalid pages in named destinations (#2857)
- Tolerate comments in arrays (#2856)

### Developer Experience (DEV)
- Use latest Python version for benchmarking (#2879)

### Maintenance (MAINT)
- Add tests to source distributions (#2874)
- Refactor _update_field_annotation (#2862)

[Full Changelog](5.0.0...5.0.1)
Missing spaces in extract_text() method.
See the attached PDFs.
The text is extracted fine, but it comes out with no spaces in almost all fields.
Environment
$ python -c "import pypdf;print(pypdf.__version__)"
pypdf==3.14.0
Code + PDF
PDF: 0004.pdf
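The snippet from the original report is not preserved on this page. A minimal sketch of how text is typically extracted with pypdf, assuming the attachment has been saved locally as `0004.pdf`, might look like the following. The helper name `extract_page_texts` is hypothetical; `PdfReader`, `pages`, and `extract_text()` are the pypdf calls discussed in this issue.

```python
def extract_page_texts(pdf_path):
    """Return the extracted text of each page of the PDF at pdf_path."""
    # Imported lazily so the helper can be defined even where pypdf is not installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() returns the page's text as a single string.
    return [page.extract_text() for page in reader.pages]
```

Calling `extract_page_texts("0004.pdf")` and printing the first element should show whether words in the form fields run together.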
gives:
expected (copy-pasted with Google chrome):
0000.pdf
Yes, you may add it to the tests. It is public data from here: https://northdakota.hazconnect.com/ListIncidentPublic.aspx
P.S. Thank you for the great package!