Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement FromBase64 for &[u8]. #15810

Merged
merged 1 commit into from
Jul 21, 2014
Merged

Conversation

SimonSapin
Copy link
Contributor

The algorithm was already based on bytes internally.

Also use byte literals instead of casting u8 to char for matching.

[breaking-change] The InvalidBase64Character variant of the FromBase64Error enum was renamed to InvalidBase64Byte, and contains a u8 instead of a char.

r? @alexcrichton

The algorithm was already based on bytes internally.

Also use byte literals instead of casting u8 to char for matching.
@sfackler
Copy link
Member

This impl used to exist and was removed when that module was overhauled. (Not that that means that it couldn't be added back.)

@alexcrichton
Copy link
Member

Yes this was an explicit decision because base64 encodes to a string and decodes to a stream of bytes, and the two halves provide a clear separation of what's on each side of each boundary.

Is the impetus for this that the utf8 validation is too expensive to perform ahead of time?

@SimonSapin
Copy link
Contributor Author

TL;DR: I think &[u8]::from_base64() still makes sense, and yes it is to avoid an unnecessary UTF-8 check.

I came across this in Servo when "fetching" URLs in the data: scheme. The result of percent-decoding (which applies to all schemes) is bytes. These URLs can be big (I’ve seen them used for multi-megabytes JPEG images). Although I haven’t measured it, UTF-8 validation is definitely wasteful. It’s unnecessary: the base64 decoder rejects non-alphabet bytes/characters anyway, and both variants of the base64 alphabet are subsets of ASCII.

Many network protocol and file formats designed before Unicode was considered universal use "bytes in an undetermined ASCII-compatible encoding" to represent text. This means that non-ASCII bytes are opaque (until the encoding is determined), but they’re not necessarily invalid and should be preserved.

While I wish UTF-8 was everywhere and I advocate cleanly separating bytes and Unicode to avoid Mojibake, I don’t think we should be dogmatic and say that anything text-like must be Unicode. The new b"foo" byte literals are based on the same idea.

I suppose we could have type AsciiCompatible<'a> = &'a [u8] to express this intent.

@SimonSapin
Copy link
Contributor Author

I realize that "streaming" base64 might also be interested. Does it sound like something you’d want in libserialize? Should it be based on Iterator, Reader, Writer, or something else?

@sfackler
Copy link
Member

I have prototypes of streaming implementations of base64 and hex that I was working on a while ago. I didn't push them upstream because they ended up with a ~2x perf hit compared to the current implementation, but that should be improve-able. I'll update them and open a PR.

@SimonSapin
Copy link
Contributor Author

@sfackler What kind of stream do you use?

@sfackler
Copy link
Member

It used Reader/Writer adapter types: Base64Writer encodes to base 64 and Base64Reader decodes base64.

@SimonSapin
Copy link
Contributor Author

@sfackler, why these two among the four possible combinations of writers/readers and encoding/decoding?

@sfackler
Copy link
Member

Those seemed like the more common cases, but there's no reason that all 4 shouldn't be implemented.

@sfackler
Copy link
Member

For reference, Apache Commons-Codec provides Base64OutputStream which can both encode and decode, though it defaults to encoding, and the reverse for Base64InputStream: http://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/binary/Base64OutputStream.html#Base64OutputStream(java.io.OutputStream, boolean)

bors added a commit that referenced this pull request Jul 20, 2014
The algorithm was already based on bytes internally.

Also use byte literals instead of casting u8 to char for matching.

[breaking-change] The `InvalidBase64Character` variant of the `FromBase64Error` enum was renamed to `InvalidBase64Byte`, and contains a `u8` instead of a `char`.

r? @alexcrichton
@bors bors closed this Jul 21, 2014
@bors bors merged commit 56218f5 into rust-lang:master Jul 21, 2014
@SimonSapin SimonSapin deleted the base64-from-bytes branch July 30, 2014 08:11
lnicola pushed a commit to lnicola/rust that referenced this pull request Jan 3, 2024
…eykril

fix: assists panic when trying to edit usage inside macro

When we try to make a syntax node mutable inside a macro to edit it, it seems like the edits aren't properly reflected and will cause a panic when trying to make another syntax node mutable.

This PR changes `bool_to_enum` and `promote_local_to_const` to use the original syntax range instead to edit the original file instead of the macro file. I'm not sure how to do it for `inline_call` with the example I mentioned in the issue, so I've left it out for now.

Fixes rust-lang#15807
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants