Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decoding data containing <NUL> values #114

Closed
bruchar1 opened this issue Nov 2, 2022 · 6 comments
Closed

Decoding data containing <NUL> values #114

bruchar1 opened this issue Nov 2, 2022 · 6 comments
Assignees
Labels

Comments

@bruchar1
Copy link

bruchar1 commented Nov 2, 2022

It seems that if I encode then decode data containing <NUL> characters, the data is correctly encoded and decoded, but base64_decode returns 0 instead of 1. Is that the expected behavior?

@aklomp aklomp self-assigned this Nov 2, 2022
@aklomp
Copy link
Owner

aklomp commented Nov 2, 2022

The return value of base64_decode should be independent of the data being processed (obviously). If the data is decoded successfully, then base64_decode should always return 1. If what you state is correct, then this is a pretty big bug.

What susprises me is that roundtrip encoding and decoding of 0x00 is part of the test suite. Edit: that part of the test suite does not include the null terminator in its test.

I also cannot reproduce this bug with the following command line:

echo -ne 'hello\0world' | bin/base64 | bin/base64 -d >/dev/null; echo $?

This prints 0 (success) as it should.

So either this is an obscure bug hiding in one of the optimized SIMD decoders, or something else is going on.

Which codec are you using? Does the test suite run correctly on your machine?

Can you post any code to reproduce the issue?

@aklomp aklomp added the bug label Nov 2, 2022
@aklomp
Copy link
Owner

aklomp commented Nov 2, 2022

I added a very quick and dirty test to do roundtrip encoding on data with \0x00 in it. All tests in CI pass.

@bruchar1
Copy link
Author

bruchar1 commented Nov 2, 2022

I figured it out. My allocated buffer for encoding was not big enough, and it caused the last padding = to be missing.

I guess the readme is missleading:

The buffer in out has been allocated by the caller and is at least 4/3 the size of the input.

The right buffer size is probably (n + 2) / 3 * 4 in integer arithmetic.

@bruchar1 bruchar1 closed this as completed Nov 2, 2022
@BurningEnlightenment
Copy link
Contributor

TBH I think it would make sense to add two macros or inline functions to do buffer size calculations. As an added bonus it'd be way more readable than to repeat the size calculation everywhere.

it caused the last padding = to be missing.

While we are at it, I think a flag to encode/decode without padding would make sense, too. See also RFC4648 section 3.2.

@aklomp
Copy link
Owner

aklomp commented Nov 3, 2022

Agreed that it makes sense for the library to expose a macro to do the buffer size calculation.

I've been hesitant to add one (or to publish the "ultimate buffer sizing formula" in the sense of the post above) because I'm not positive that all encoders will always stay within the bounds of such a buffer. The SIMD encoders write 16 bytes of output at a time.

Then again, those SIMD encoders always operate on a full, guaranteed 12 bytes of input, so their output should always fit into the output buffer if it's sized according to something like the formula above.

Regarding the suggestion of adding a flag to disable padding: the suggestion makes sense, but I'm wary of adding more behavioral flags at this point. I'd first like to do a comprehensive rewrite to sort out various pain points, and rethink the architecture of this library first.

@aklomp
Copy link
Owner

aklomp commented Nov 3, 2022

I created #116 to discuss the addition of size macros.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants