Bug: assert self._pageseq != 0 #48

Closed
chrisgrieser opened this issue Nov 23, 2021 · 3 comments · Fixed by #49

@chrisgrieser

Hi, thanks again for pdfannots! I recently ran into a small issue where an unsupported annotation type completely shut down the annotation extraction. While it's understandable that not every fancy annotation type can be extracted, pdfannots shouldn't abort completely, but rather simply skip the annotation.

It took me a while to find the problematic annotation, which was made harder by the fact that it isn't visible in normal PDF readers and is probably an artifact of a poor-quality OCR scan.

The error was: WARNING: Unsupported annotation subtype: /'Popup'

@0xabu
Owner

0xabu commented Nov 24, 2021

pdfannots shouldn't completely abort, but rather simply skip the annotation.

That's exactly the intention -- it should be just a warning, and it shouldn't prevent extraction of any other annotations. From your description it sounds like pdfannots failed to produce any output until you removed the problematic annotation. Is that correct? Are you able to share the affected PDF?
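
For readers following along, the intended warn-and-skip behaviour can be sketched roughly as below. This is only an illustration, not pdfannots' actual code: the function name `collect_annotations`, the `SUPPORTED_SUBTYPES` set, and the dict-shaped annotations are all assumptions made for the example.

```python
import logging

logger = logging.getLogger("skip_sketch")

# Hypothetical set of subtypes the extractor understands (an assumption).
SUPPORTED_SUBTYPES = {"Highlight", "Underline", "Text"}


def collect_annotations(raw_annots):
    """Keep supported annotations; warn about and skip everything else."""
    kept = []
    for raw in raw_annots:
        subtype = raw.get("Subtype")
        if subtype not in SUPPORTED_SUBTYPES:
            # A warning only: processing continues with the next annotation.
            logger.warning("Unsupported annotation subtype: %s", subtype)
            continue
        kept.append(raw)
    return kept
```

With this shape, an unsupported subtype such as a popup produces a log line and is dropped, while the remaining annotations are still returned.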

@chrisgrieser
Author

Sure, here are the warning messages, the traceback, and the PDF:

extract.pdf

WARNING: Unsupported annotation subtype: /'Popup'
WARNING: Unsupported annotation subtype: /'Popup'
Traceback (most recent call last):
  File "/opt/homebrew/bin/pdfannots", line 8, in <module>
    sys.exit(main())
  File "/opt/homebrew/lib/python3.9/site-packages/pdfannots/cli.py", line 141, in main
    doc = process_file(
  File "/opt/homebrew/lib/python3.9/site-packages/pdfannots/__init__.py", line 472, in process_file
    page.annots.sort()
  File "/opt/homebrew/lib/python3.9/site-packages/pdfannots/types.py", line 226, in __lt__
    return self.pos < other.pos
  File "/opt/homebrew/lib/python3.9/site-packages/pdfannots/types.py", line 182, in __lt__
    assert self._pageseq != 0
AssertionError

@0xabu 0xabu changed the title Bug: Do not abort when unknown annotation type is encountered Bug: assert self._pageseq != 0 Nov 24, 2021
@0xabu
Owner

0xabu commented Nov 24, 2021

OK, so the warning messages are a red herring here; the real issue is the assertion failure. It looks like one of the (supported) annotations wasn't encountered during the text traversal. I'll take a closer look over the weekend. Thanks for providing the sample!
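
A minimal sketch of this failure mode, for illustration only: the `Pos` class below is a simplified stand-in, not the real class from pdfannots' `types.py`. It assumes, based on the traceback, that a sequence number of 0 means "never seen during text traversal" and that comparison asserts it is non-zero. The helper `sort_fails` is hypothetical.

```python
class Pos:
    """Simplified stand-in for a position within a page (hypothetical)."""

    def __init__(self):
        # Assumption: 0 means the position was never visited during traversal.
        self._pageseq = 0

    def visit(self, seq):
        """Record the order in which traversal reached this position."""
        self._pageseq = seq

    def __lt__(self, other):
        # Mirrors the assertion in the traceback; the check on `other`
        # is added here so the sketch fails regardless of operand order.
        assert self._pageseq != 0
        assert other._pageseq != 0
        return self._pageseq < other._pageseq


def sort_fails(positions):
    """Return True if sorting trips the assertion, False otherwise."""
    try:
        sorted(positions)
    except AssertionError:
        return True
    return False


visited_a, visited_b, unvisited = Pos(), Pos(), Pos()
visited_a.visit(1)
visited_b.visit(2)
```

Sorting `[visited_a, visited_b]` works, while adding `unvisited` to the list makes the sort raise `AssertionError`, which matches the shape of the crash in `page.annots.sort()`.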

@0xabu 0xabu added the bug label Nov 24, 2021
@0xabu 0xabu self-assigned this Nov 24, 2021
0xabu added a commit that referenced this issue Nov 27, 2021
…mponents

issue #48 demonstrates a PDF where all text is chars within a figure, and there
are no lines/boxes
@0xabu 0xabu closed this as completed in #49 Nov 28, 2021