-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
support for escaping in Bytes and bytes macro #64
Conversation
Quick note here, I went to implement
So another way to go about this is to refactor |
I'm fine with this approach for now. |
} | ||
}; | ||
} | ||
res.extend(c.to_string().as_bytes()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be simplified?
res.extend(c.to_string().as_bytes()); | |
res.push(c as u8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah! No we can't... or at least not that way. I have not found a way to get to the bytes (plural) underlying a char
isn't a byte. It's actually even sized
to 4 bytes, but holds up to 4 bytes.
Here an example that hopefully showcases the issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fascinating... you're right. TIL!
TL;DR, I'm approving and merging because I'm perfectly fine with what you have here.
But, if you're curious, I did some digging around and found you actually can do this
let required = c.len_utf8();
let available = res.capacity() - res.len();
if available < required {
res.reserve(required);
}
let mut buffer = [0; 4];
c.encode_utf8(&mut buffer);
res.extend_from_slice(&buffer[..required]);
After digging through the Rust source for char
I saw that it does something very similar so I just had to benchmark it. Turns out, encode_utf8
is 20x faster than the intermediate string conversion.
Index: interpreter/benches/runtime.rs
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/interpreter/benches/runtime.rs b/interpreter/benches/runtime.rs
--- a/interpreter/benches/runtime.rs (revision b458dd53bdf2c169f58389e54201e9b38bec2659)
+++ b/interpreter/benches/runtime.rs (date 1720826951595)
@@ -66,5 +66,62 @@
group.finish();
}
-criterion_group!(benches, criterion_benchmark, map_macro_benchmark);
+pub fn map_char_copy(c: &mut Criterion) {
+ let mut group = c.benchmark_group("char to vec");
+
+ group.bench_function("char to string to empty vec", |b| {
+ let mut empty = Vec::default();
+ let c = 'a';
+ b.iter(|| empty.extend(c.to_string().into_bytes()))
+ });
+
+ group.bench_function("char to string to full vec", |b| {
+ let mut full = Vec::with_capacity(4);
+ full.extend(vec![0, 0, 0, 0]);
+ assert_eq!(full.len(), 4);
+ assert_eq!(full.capacity(), 4);
+
+ let c = 'a';
+ b.iter(|| full.extend(c.to_string().into_bytes()))
+ });
+
+ group.bench_function("char encode to empty vec", |b| {
+ let mut empty = Vec::default();
+ let c = 'a';
+
+ b.iter(|| {
+ let required = c.len_utf8();
+ let available = empty.capacity() - empty.len();
+ if available < required {
+ empty.reserve(required);
+ }
+ let mut buffer = [0; 4];
+ c.encode_utf8(&mut buffer);
+ empty.extend_from_slice(&buffer[..required]);
+ })
+ });
+
+ group.bench_function("char encode to full vec", |b| {
+ let mut full = Vec::with_capacity(4);
+ full.extend(vec![0, 0, 0, 0]);
+ assert_eq!(full.len(), 4);
+ assert_eq!(full.capacity(), 4);
+ let c = 'a';
+
+ b.iter(|| {
+ let required = c.len_utf8();
+ let available = full.capacity() - full.len();
+ if available < required {
+ full.reserve(required);
+ }
+ let mut buffer = [0; 4];
+ c.encode_utf8(&mut buffer);
+ full.extend_from_slice(&buffer[..required]);
+ })
+ });
+
+ group.finish();
+}
+
+criterion_group!(benches, map_char_copy);
criterion_main!(benches);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting! I can dig a little deeper in this. I did a quick run on compiler explorer and the result didn't look outrageous, but indeed… it looks stupid (including when I wrote it), I was surprised there was no obvious way to get the bytes.
I might just steal your bench and adapt it to the parse_byte
and alter it to encode_utf8
to the buffer directly when no escaping is there… Good stuff! Thanks for looking into this!
No description provided.