Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Functionality of b_ #2726

Closed
j-t-1 opened this issue Jun 27, 2024 · 10 comments · Fixed by #2792
Closed

Functionality of b_ #2726

j-t-1 opened this issue Jun 27, 2024 · 10 comments · Fixed by #2792
Labels
needs-discussion The PR/issue needs more discussion before we can continue

Comments

@j-t-1
Copy link
Contributor

j-t-1 commented Jun 27, 2024

B_CACHE: Dict[Union[str, bytes], bytes] = {}


def b_(s: Union[str, bytes]) -> bytes:
    if isinstance(s, bytes):
        return s
    bc = B_CACHE
    if s in bc:
        return bc[s]
    try:
        r = s.encode("latin-1")
        if len(s) < 2:
            bc[s] = r
        return r
    except Exception:
        r = s.encode("utf-8")
        if len(s) < 2:
            bc[s] = r
        return r

It is non-obvious why we are only caching strings of length zero (the empty string) and of length one.

There is a need to add comments to this function.

@pubpub-zz
Copy link
Collaborator

I'm not sure this is at will. you may found a way to improve performances...

@stefan6419846
Copy link
Collaborator

This has been added/changed in #124 by @mozbugbox.

@j-t-1
Copy link
Contributor Author

j-t-1 commented Jun 28, 2024

This has been added/changed in #124 by @mozbugbox.

The first of the commit messages is "Speedup b_() for short string for Python3", which suggests the current functionality is deliberate.

The code tries latin-1 and if there is an exception uses utf-8. Given the rise of utf-8, I think we should just use utf-8.

@stefan6419846
Copy link
Collaborator

@j-t-1 I recommend you to further dig into the commit history if unsure about specific cases - the latin-1 workaround has been required for some files. Doing a "blame" shows you that the corresponding change is from #463 and there already has been a request for an example file about two months ago: #463 (comment)

@j-t-1
Copy link
Contributor Author

j-t-1 commented Jul 3, 2024

@pubpub-zz
Line 226 of test_generic.py:
# to test latin-1 aka stdencoding

Is this equivalent to this (and if so we could change it)?
# to test PDFDocEncoding (latin-1)

@pubpub-zz
Copy link
Collaborator

Aka =also known as
Your rewording is correct

@pubpub-zz
Copy link
Collaborator

@j-t-1
what should we do about his issue ?

@j-t-1
Copy link
Contributor Author

j-t-1 commented Jul 22, 2024

Each time through the function we have an if statement to decide if a string of length zero or one is in the cache. If it is not, we have another if statement deciding whether len(s) < 2. If we remove the cache we would remove at least one if statement every time b_ is invoked.

Keeping it as is, only make sense if:

  1. A lot of calls to b_ satisfy len(s) < 2
  2. s.encode("latin-1") is an expensive operation compared with an if statement and cache retrieval of a string of length 0 or 1.

To gauge 1. depends on particular PDF and use case because b_ is used in a few places.
How would we test 2.?

dis.dis("".encode)
TypeError: don't know how to disassemble builtin_function_or_method objects

This comment suggests we keep b_ as is.

@shartzog
Copy link
Contributor

With regard to 1: In my experience, automated reporting frameworks and 'document to PDF' conversion tools LOVE to use a 'render every character one at a time' paradigm for creating a 'justified' text layout, and many of them just behave that way by default to simplify their implementations. Long story short, I suspect the len(s) < 2 is hit far more often than you'd think during text extraction ops...

j-t-1 added a commit to j-t-1/pypdf that referenced this issue Jul 24, 2024
@j-t-1
Copy link
Contributor Author

j-t-1 commented Jul 24, 2024

@shartzog this is the information needed!

In a future PR we could possibly add your comment above as a comment into b_:

Automated reporting frameworks and 'document to PDF' conversion tools love to
use a 'render every character one at a time' paradigm for creating a 'justified'
text layout, and many of them just behave that way by default to simplify their
implementation.

Referencing you in the commit message.

@stefan6419846 stefan6419846 added the needs-discussion The PR/issue needs more discussion before we can continue label Jul 24, 2024
pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Aug 8, 2024
stefan6419846 pushed a commit that referenced this issue Aug 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-discussion The PR/issue needs more discussion before we can continue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants