Functionality of b_ #2726

j-t-1 · 2024-06-27T08:14:56Z

B_CACHE: Dict[Union[str, bytes], bytes] = {}


def b_(s: Union[str, bytes]) -> bytes:
    if isinstance(s, bytes):
        return s
    bc = B_CACHE
    if s in bc:
        return bc[s]
    try:
        r = s.encode("latin-1")
        if len(s) < 2:
            bc[s] = r
        return r
    except Exception:
        r = s.encode("utf-8")
        if len(s) < 2:
            bc[s] = r
        return r

It is non-obvious why we are only caching strings of length zero (the empty string) and of length one.

There is a need to add comments to this function.

The text was updated successfully, but these errors were encountered:

pubpub-zz · 2024-06-27T17:37:09Z

I'm not sure this is at will. you may found a way to improve performances...

stefan6419846 · 2024-06-28T07:24:57Z

This has been added/changed in #124 by @mozbugbox.

j-t-1 · 2024-06-28T12:12:41Z

This has been added/changed in #124 by @mozbugbox.

The first of the commit messages is "Speedup b_() for short string for Python3", which suggests the current functionality is deliberate.

The code tries latin-1 and if there is an exception uses utf-8. Given the rise of utf-8, I think we should just use utf-8.

stefan6419846 · 2024-06-28T12:18:03Z

@j-t-1 I recommend you to further dig into the commit history if unsure about specific cases - the latin-1 workaround has been required for some files. Doing a "blame" shows you that the corresponding change is from #463 and there already has been a request for an example file about two months ago: #463 (comment)

j-t-1 · 2024-07-03T10:38:03Z

@pubpub-zz
Line 226 of test_generic.py:
# to test latin-1 aka stdencoding

Is this equivalent to this (and if so we could change it)?
# to test PDFDocEncoding (latin-1)

pubpub-zz · 2024-07-03T14:29:34Z

Aka =also known as
Your rewording is correct

pubpub-zz · 2024-07-18T19:51:35Z

@j-t-1
what should we do about his issue ?

j-t-1 · 2024-07-22T09:25:21Z

Each time through the function we have an if statement to decide if a string of length zero or one is in the cache. If it is not, we have another if statement deciding whether len(s) < 2. If we remove the cache we would remove at least one if statement every time b_ is invoked.

Keeping it as is, only make sense if:

A lot of calls to b_ satisfy len(s) < 2
s.encode("latin-1") is an expensive operation compared with an if statement and cache retrieval of a string of length 0 or 1.

To gauge 1. depends on particular PDF and use case because b_ is used in a few places.
How would we test 2.?

dis.dis("".encode)
TypeError: don't know how to disassemble builtin_function_or_method objects

This comment suggests we keep b_ as is.

shartzog · 2024-07-23T02:07:49Z

With regard to 1: In my experience, automated reporting frameworks and 'document to PDF' conversion tools LOVE to use a 'render every character one at a time' paradigm for creating a 'justified' text layout, and many of them just behave that way by default to simplify their implementations. Long story short, I suspect the len(s) < 2 is hit far more often than you'd think during text extraction ops...

@shartzog

Comment courtesy of @shartzog, issue py-pdf#2726.

j-t-1 · 2024-07-24T08:04:30Z

@shartzog this is the information needed!

In a future PR we could possibly add your comment above as a comment into b_:

Automated reporting frameworks and 'document to PDF' conversion tools love to
use a 'render every character one at a time' paradigm for creating a 'justified'
text layout, and many of them just behave that way by default to simplify their
implementation.

Referencing you in the commit message.

closes py-pdf#2726 superseed proposal py-pdf#2791

Closes #2726. Closes #2791.

j-t-1 mentioned this issue Jun 28, 2024

MAINT: Amend byte cache #2730

Closed

j-t-1 added a commit to j-t-1/pypdf that referenced this issue Jul 24, 2024

STY: Comment and refactor b_

de1f554

Comment courtesy of @shartzog, issue py-pdf#2726.

j-t-1 mentioned this issue Jul 24, 2024

STY: Refactor b_ #2772

Merged

stefan6419846 added the needs-discussion The PR/issue needs more discussion before we can continue label Jul 24, 2024

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Aug 8, 2024

MNT: Remove b_ and str_

1f83d52

closes py-pdf#2726 superseed proposal py-pdf#2791

pubpub-zz mentioned this issue Aug 8, 2024

MAINT: Remove b_ and str_ #2792

Merged

stefan6419846 closed this as completed in #2792 Aug 12, 2024

stefan6419846 pushed a commit that referenced this issue Aug 12, 2024

MAINT: Remove b_ and str_ (#2792)

46c89dd

Closes #2726. Closes #2791.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Functionality of b_ #2726

Functionality of b_ #2726

j-t-1 commented Jun 27, 2024 •

edited

Loading

pubpub-zz commented Jun 27, 2024

stefan6419846 commented Jun 28, 2024

j-t-1 commented Jun 28, 2024

stefan6419846 commented Jun 28, 2024

j-t-1 commented Jul 3, 2024

pubpub-zz commented Jul 3, 2024

pubpub-zz commented Jul 18, 2024

j-t-1 commented Jul 22, 2024

shartzog commented Jul 23, 2024

j-t-1 commented Jul 24, 2024 •

edited

Loading

Functionality of b_ #2726

Functionality of b_ #2726

Comments

j-t-1 commented Jun 27, 2024 • edited Loading

pubpub-zz commented Jun 27, 2024

stefan6419846 commented Jun 28, 2024

j-t-1 commented Jun 28, 2024

stefan6419846 commented Jun 28, 2024

j-t-1 commented Jul 3, 2024

pubpub-zz commented Jul 3, 2024

pubpub-zz commented Jul 18, 2024

j-t-1 commented Jul 22, 2024

shartzog commented Jul 23, 2024

j-t-1 commented Jul 24, 2024 • edited Loading

j-t-1 commented Jun 27, 2024 •

edited

Loading

j-t-1 commented Jul 24, 2024 •

edited

Loading