Skip to content

Make b64decode with validate=True faster by compiling regex #11634

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

jaysonsantos
Copy link

Hi there, I was running some code that had to de code a lot of base64 encoded data and
I was using validate=True to also make sure it had valida data but it seemed to be way
slower than validate=False and I saw that it uses a regex without compiling it and
thought of sending a PR.
I didn't put any issue number because I didn't find any and the docs
say that for trivial changes there is no need to do it. Let me know if it would be needed.
And I also didn't create a test because Lib/test/test_base64.py already covers validate=True.

With the following benchmark the change goes from 15.174073122 to 8.906353553999999.

import timeit

import re
import binascii
from base64 import _bytes_from_decode_data

def original_base64(s, altchars=None, validate=False):
    s = _bytes_from_decode_data(s)
    if altchars is not None:
        altchars = _bytes_from_decode_data(altchars)
        assert len(altchars) == 2, repr(altchars)
        s = s.translate(bytes.maketrans(altchars, b'+/'))
    if validate and not re.match(b'^[A-Za-z0-9+/]*={0,2}$', s):
        raise binascii.Error('Non-base64 digit found')
    return binascii.a2b_base64(s)

VALID_BASE64_REGEX = re.compile(b'^[A-Za-z0-9+/]*={0,2}$')

def changed_base64(s, altchars=None, validate=False):
    s = _bytes_from_decode_data(s)
    if altchars is not None:
        altchars = _bytes_from_decode_data(altchars)
        assert len(altchars) == 2, repr(altchars)
        s = s.translate(bytes.maketrans(altchars, b'+/'))
    if validate and not VALID_BASE64_REGEX.match(s):
        raise binascii.Error('Non-base64 digit found')
    return binascii.a2b_base64(s)

for b64 in (original_base64, changed_base64):
    print(b64)
    print(timeit.timeit('assert b64(b"UHl0aG9u", validate=True) == b"Python"', globals={'b64': b64}, number=10_000_000))

This is done by compiling the regex that validates that the encoded data
only has valid base64 characters.
@the-knights-who-say-ni
Copy link

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA).

Unfortunately we couldn't find an account corresponding to your GitHub username on bugs.python.org (b.p.o) to verify you have signed the CLA (this might be simply due to a missing "GitHub Name" entry in your b.p.o account settings). This is necessary for legal reasons before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

You can check yourself to see if the CLA has been received.

Thanks again for your contribution, we look forward to reviewing it!

@asvetlov
Copy link
Contributor

IIRC the similar PR was rejected recently because regex compilation increased import time.

See https://bugs.python.org/issue35559

@jaysonsantos
Copy link
Author

Hey @asvetlov thanks for pointing out.
That could also be a lazy initialization, let me know if it is worth it to do otherwise I would close the pull request.

@tirkarthi
Copy link
Member

I agree with @asvetlov since I opened the issue and that causes the import time to increase. Please benchmark the import time and this would also need discussion in a separate bpo if you are proceeding further on this.

Thanks

@jaysonsantos
Copy link
Author

Check the new commit 30aa7d4, it should be able to avoid slow import time and the first call to b64decode would be slow but the other ones would be faster

@asvetlov
Copy link
Contributor

Please provide time numbers

@jaysonsantos
Copy link
Author

Hey there, here it goes:

sh-3.2$ git checkout master
Already on 'master'
Your branch is up to date with 'origin/master'.
sh-3.2$ ./python.exe -m perf timeit 'import base64; base64.b64decode("feed", validate=True)'
.....................
Mean +- std dev: 7.76 us +- 0.13 us
sh-3.2$ git checkout base64-decode-speed-improvement
Switched to branch 'base64-decode-speed-improvement'
Your branch is up to date with 'origin/base64-decode-speed-improvement'.
sh-3.2$ ./python.exe -m perf timeit 'import base64; base64.b64decode("feed", validate=True)'
.....................
Mean +- std dev: 5.22 us +- 0.12 us
sh-3.2$

@jaysonsantos
Copy link
Author

And here are the import times:

sh-3.2$ ./python.exe -m perf timeit 'import base64'
.....................
Mean +- std dev: 360 ns +- 10 ns
sh-3.2$ git checkout master
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
sh-3.2$ ./python.exe -m perf timeit 'import base64'
.....................
Mean +- std dev: 359 ns +- 11 ns

@csabella csabella requested a review from asvetlov June 5, 2019 23:25
@hauntsaninja
Copy link
Contributor

hauntsaninja commented Feb 9, 2023

Thank you for the PR! This is no longer relevant after #27272

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants