
Added hexadecimal encoding module #8287

Closed · wants to merge 6 commits

Conversation

@sfackler (Member) commented Aug 4, 2013

FromHex ignores whitespace and parses either upper or lower case hex
digits. ToHex outputs lower case hex digits with no whitespace. Unlike
ToBase64, ToHex doesn't allow you to configure the output format. I
don't feel that it's super useful in this case.
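The described behavior can be sketched in modern Rust for illustration (the PR itself targets 2013-era libextra, so these exact signatures and error types are assumptions, not the patch):

```rust
// Sketch of the semantics described above: ToHex emits lowercase digits
// with no whitespace; FromHex skips whitespace and accepts either case.
trait ToHex {
    fn to_hex(&self) -> String;
}

trait FromHex {
    fn from_hex(&self) -> Result<Vec<u8>, String>;
}

impl ToHex for [u8] {
    // Lowercase output, no configuration.
    fn to_hex(&self) -> String {
        self.iter().map(|b| format!("{:02x}", b)).collect()
    }
}

impl FromHex for str {
    fn from_hex(&self) -> Result<Vec<u8>, String> {
        let mut out = Vec::new();
        let mut pending: Option<u8> = None;
        for c in self.chars() {
            if c.is_whitespace() {
                continue; // whitespace is ignored
            }
            let nibble =
                c.to_digit(16).ok_or_else(|| format!("invalid hex digit: {}", c))? as u8;
            match pending.take() {
                Some(hi) => out.push(hi << 4 | nibble),
                None => pending = Some(nibble),
            }
        }
        if pending.is_some() {
            return Err("odd number of hex digits".to_string());
        }
        Ok(out)
    }
}

fn main() {
    assert_eq!(b"\x0f\xff".to_hex(), "0fff");
    assert_eq!("0F fF".from_hex().unwrap(), vec![0x0f, 0xff]);
}
```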


```
/// A trait for converting hexadecimal encoded values
pub trait FromHex {
    /// Converts the value of `self`, interpreted as base64 encoded data, into
```

Review comment (Member): base64 typo

@alexcrichton (Member) commented:

Overall I think this is fine to include, although we're starting to accumulate a fair number of encodings in libextra. I think Go's organization is nice, with an encoding super-package, so perhaps we want to start moving things into:

  • encoding/json
  • encoding/base64
  • encoding/hex

And then all future encodings can go in there as well. Another thing that would be nice is a unifying trait for encoding/decoding things to/from types. I'm not quite sure how that would look, but it'd be a shame if we had different traits for all the encodings. One possible route would be for formats like json/xml to implement `serialize::{Encoder, Decoder}`, and for codecs like base64/base32/hex to only convert between bytes and strings.
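One possible shape for such a bytes-to-strings codec trait, sketched in modern Rust (names here are hypothetical; nothing like this existed in libextra):

```rust
// Hypothetical unifying trait for textual byte codecs (base64/base32/hex),
// as opposed to structural serializers like json/xml.
trait ByteCodec {
    fn encode(data: &[u8]) -> String;
    fn decode(text: &str) -> Result<Vec<u8>, String>;
}

struct Hex;

impl ByteCodec for Hex {
    fn encode(data: &[u8]) -> String {
        data.iter().map(|b| format!("{:02x}", b)).collect()
    }

    fn decode(text: &str) -> Result<Vec<u8>, String> {
        if text.len() % 2 != 0 {
            return Err("odd-length hex string".to_string());
        }
        // Parse each two-character chunk as one byte.
        (0..text.len())
            .step_by(2)
            .map(|i| u8::from_str_radix(&text[i..i + 2], 16).map_err(|e| e.to_string()))
            .collect()
    }
}

fn main() {
    assert_eq!(Hex::encode(&[0xde, 0xad]), "dead");
    assert_eq!(Hex::decode("dead").unwrap(), vec![0xde, 0xad]);
}
```

The same `ByteCodec` shape would fit base64 and base32 implementations, while json/xml would stay on the separate serializer traits.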

Regardless, I think that we should get another opinion on whether this should be included in libextra (I'm in favor), and if there's any sort of reorganization of encodings it should happen on the mailing list/issue list and not here.

@sfackler (Member, Author) commented Aug 4, 2013

+1 on an encoding super-package. I think the root extra package needs to be reorganized in general.

I don't think there's a reasonable trait for base64, hex, etc. to share. What would the signature be?

@sfackler (Member, Author) commented Aug 4, 2013

I did notice some very weird performance numbers from the hex and base64 benchmarks. The `From{Base64,Hex}` benchmarks run at about 2-3x the speed of the `To{Base64,Hex}` benchmarks, even though the decoding logic is significantly more complex and both directions should be touching the same amount of memory.

```
test base64::test::from_base64 ... bench: 1000 ns/iter (+/- 349) = 204 MB/s
test base64::test::to_base64 ... bench: 2390 ns/iter (+/- 1130) = 63 MB/s
...
test hex::tests::bench_from_hex ... bench: 884 ns/iter (+/- 220) = 341 MB/s
test hex::tests::bench_to_hex ... bench: 2453 ns/iter (+/- 919) = 61 MB/s
```

@bluss (Member) commented Aug 5, 2013

Regarding the speed of `to_hex`, you can do something like the following:

```
// can't use bytes!() here
static HEXDIGITS: [u8, ..16] = [48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 97, 98, 99, 100, 101, 102];

fn to_hex_C(&self) -> ~str {
    let mut v = vec::with_capacity(self.len() * 2);
    for &byte in self.iter() {
        v.push(HEXDIGITS[byte >> 4]);
        v.push(HEXDIGITS[byte & 0xf]);
    }
    unsafe {
        str::raw::from_bytes_owned(v)
    }
}
```

I tested some different options and this was 3x faster than the original using `.push_char` (which is complex). Maybe the unsafe call to `str::raw::from_bytes_owned` is overkill, but it can be used since you know you output valid UTF-8.
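For reference, the same lookup-table approach translates to modern Rust roughly as follows (the 2013 snippet above predates Rust 1.0; this version uses today's names and a safe conversion in place of `from_bytes_owned`):

```rust
// Modern sketch of the lookup-table encoder: build the output in a byte
// vector, then convert to String at the end.
const HEX_DIGITS: &[u8; 16] = b"0123456789abcdef";

fn to_hex(bytes: &[u8]) -> String {
    let mut v = Vec::with_capacity(bytes.len() * 2);
    for &byte in bytes {
        v.push(HEX_DIGITS[(byte >> 4) as usize]);
        v.push(HEX_DIGITS[(byte & 0xf) as usize]);
    }
    // Every pushed byte is an ASCII hex digit, so this UTF-8 check always
    // succeeds; `String::from_utf8_unchecked` would skip it entirely, as
    // the old unsafe from_bytes_owned call did.
    String::from_utf8(v).unwrap()
}

fn main() {
    assert_eq!(to_hex(b"\x00\xffAB"), "00ff4142");
}
```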

@huonw (Member) commented Aug 5, 2013

I don't have a compiler to check, but why can't one use `bytes!`?

@sfackler (Member, Author) commented Aug 5, 2013

`bytes!` works fine; you just have to change the type to `&'static [u8]`. Pushing the change in a couple of minutes.
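For readers on modern Rust: a byte-string literal now fills the role the old `bytes!` macro played, and it already has the `&'static [u8]` flavor mentioned here (a sketch, not part of the patch):

```rust
// The old bytes!("0123456789abcdef") produced a static byte slice; in
// modern Rust a b"..." literal coerces to &'static [u8] directly.
static HEXDIGITS: &'static [u8] = b"0123456789abcdef";

fn main() {
    assert_eq!(HEXDIGITS.len(), 16);
    assert_eq!(HEXDIGITS[0], b'0');
    assert_eq!(HEXDIGITS[10], b'a');
}
```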

@alexcrichton (Member) commented:
I was thinking about this earlier today: currently you provide a `ToHex` implementation for both `&str` and `&[u8]`. In my opinion, these encodings are strictly bytes on one side and strings on the other, and it would be best not to intermingle the two. If someone wants to hex-encode a valid `str`, I think it's clearer to have them call `as_bytes` explicitly (which is essentially a no-op).
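The "essentially a no-op" claim holds in modern Rust too, where the explicit call costs nothing at runtime (illustrative sketch):

```rust
fn main() {
    let s = "hex me";
    // `as_bytes` reinterprets the str's existing UTF-8 buffer as &[u8];
    // no copy is made, so requiring the explicit call is free for callers.
    let b: &[u8] = s.as_bytes();
    assert_eq!(b.len(), s.len());
    assert_eq!(b[0], b'h');
}
```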

@sfackler (Member, Author) commented Aug 5, 2013

Sounds reasonable to me. I only had a `ToHex` implementation for `&str` since `ToBase64` had one; I'll yank them both. Do you think the `From{Hex,Base64}` implementations for `&[u8]` should go too? I'm leaning towards yes.

@alexcrichton (Member) commented:
@sfackler: yes

@omasanori (Contributor) commented:
@alexcrichton I'd prefer to distinguish serializers (JSON, ASN.1, etc.) from codecs (Base64, quoted-printable, etc.). The names `serialize::{Encoder, Decoder}` are not so good IMHO. (Would `Serialize(r)` and `Deserialize(r)` be better?)

@alexcrichton (Member) commented:
@omasanori, that's a good point! Let's move discussion to #8310

@sfackler, this looks good to me; I would be willing to r+ it, but again someone else should probably chime in on whether we should start shoving lots of encoding formats into libextra.

The overhead of `str::push_char` is high enough to cripple the performance
of these two functions. I've switched them to build the output in a
`~[u8]` and then convert to a string later. Since we know exactly the
bytes going into the vector, we can use the unsafe version to avoid the
`is_utf8` check.

I could have riced it further with `vec::raw::get`, but it only added
~10MB/s so I didn't think it was worth it. `ToHex` is still ~30% slower
than `FromHex`, which is puzzling.

Before:

```
test base64::test::from_base64 ... bench: 1000 ns/iter (+/- 349) = 204 MB/s
test base64::test::to_base64 ... bench: 2390 ns/iter (+/- 1130) = 63 MB/s
...
test hex::tests::bench_from_hex ... bench: 884 ns/iter (+/- 220) = 341 MB/s
test hex::tests::bench_to_hex ... bench: 2453 ns/iter (+/- 919) = 61 MB/s
```

After:

```
test base64::test::from_base64 ... bench: 1271 ns/iter (+/- 600) = 160 MB/s
test base64::test::to_base64 ... bench: 759 ns/iter (+/- 286) = 198 MB/s
...
test hex::tests::bench_from_hex ... bench: 875 ns/iter (+/- 377) = 345 MB/s
test hex::tests::bench_to_hex ... bench: 593 ns/iter (+/- 240) = 254 MB/s
```
String NULL terminators are going away soon, so we may as well get rid
of this now so it doesn't rot.
Encoding should really only be done between `[u8]` and `str`. The extra
convenience implementations don't really have a place, especially since
they're so trivial.

Also improved error messages in `FromBase64`.
@sfackler (Member, Author) commented Aug 7, 2013

@brson retry?

bors added a commit that referenced this pull request Aug 7, 2013
@bors bors closed this Aug 7, 2013