support for escaping in Bytes and bytes macro #64

alexsnaps · 2024-07-10T21:43:21Z

No description provided.

alexsnaps · 2024-07-11T18:23:34Z

Quick note here: I set out to implement \ x HEXDIGIT HEXDIGIT and \ [0-3] [0-7] [0-7] for escaping byte sequences in b"" BYTES_LIT, but per the spec's lexis, all the escaping from STRING_LIT should be supported:

BYTES_LIT      ::= [bB] STRING_LIT
ESCAPE         ::= \ [abfnrtv\?"'`]
                 | \ x HEXDIGIT HEXDIGIT
                 | \ u HEXDIGIT HEXDIGIT HEXDIGIT HEXDIGIT
                 | \ U HEXDIGIT HEXDIGIT HEXDIGIT HEXDIGIT HEXDIGIT HEXDIGIT HEXDIGIT HEXDIGIT
                 | \ [0-3] [0-7] [0-7]

So another way to go about this is to refactor parse_string into parse_bytes_literal(s: &str) -> Result<Vec<u8>, ParseError> instead, and have the "newer" parse_string call into it and then String::from_utf8 the parsed bytes. I think that'd be closer to the spec, though a bigger change. This change mostly allows us to represent arbitrary byte sequences and work with them, which wasn't possible before.
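For illustration, the refactor described above could look roughly like the sketch below. It is not the code from this PR: the ParseError struct, the err, read_digits and push_char helpers, and the error messages are all made up for the example, and the input is assumed to be the literal's body with the quotes already stripped.

```rust
// Hypothetical error type; the parser's real ParseError is not shown in this thread.
#[derive(Debug, PartialEq)]
pub struct ParseError(pub String);

fn err(msg: &str) -> ParseError {
    ParseError(msg.to_string())
}

/// Reads exactly `n` digits in the given radix and returns their combined value.
fn read_digits(chars: &mut std::str::Chars<'_>, n: usize, radix: u32) -> Result<u32, ParseError> {
    let mut value = 0u32;
    for _ in 0..n {
        let digit = chars
            .next()
            .and_then(|c| c.to_digit(radix))
            .ok_or_else(|| err("invalid digit in escape"))?;
        value = value * radix + digit;
    }
    Ok(value)
}

/// Appends the UTF-8 encoding of `c` to `out`.
fn push_char(out: &mut Vec<u8>, c: char) {
    out.extend_from_slice(c.encode_utf8(&mut [0u8; 4]).as_bytes());
}

/// Parses a literal body into raw bytes, handling every ESCAPE alternative.
pub fn parse_bytes_literal(s: &str) -> Result<Vec<u8>, ParseError> {
    let mut out = Vec::with_capacity(s.len());
    let mut chars = s.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            push_char(&mut out, c);
            continue;
        }
        match chars.next().ok_or_else(|| err("dangling backslash"))? {
            'a' => out.push(0x07),
            'b' => out.push(0x08),
            'f' => out.push(0x0c),
            'n' => out.push(b'\n'),
            'r' => out.push(b'\r'),
            't' => out.push(b'\t'),
            'v' => out.push(0x0b),
            // Punctuation escapes stand for themselves (all ASCII).
            e @ ('\\' | '?' | '"' | '\'' | '`') => out.push(e as u8),
            // \xHH: one raw byte.
            'x' => out.push(read_digits(&mut chars, 2, 16)? as u8),
            // \uHHHH and \UHHHHHHHH: a code point, stored as UTF-8.
            e @ ('u' | 'U') => {
                let n = if e == 'u' { 4 } else { 8 };
                let cp = read_digits(&mut chars, n, 16)?;
                let ch = char::from_u32(cp).ok_or_else(|| err("invalid code point"))?;
                push_char(&mut out, ch);
            }
            // \[0-3][0-7][0-7]: one raw byte in octal (at most 0o377 = 255).
            d @ '0'..='3' => {
                let hi = d.to_digit(8).unwrap();
                let lo = read_digits(&mut chars, 2, 8)?;
                out.push((hi * 64 + lo) as u8);
            }
            _ => return Err(err("unknown escape")),
        }
    }
    Ok(out)
}

/// The "newer" parse_string: parse to bytes, then require valid UTF-8.
pub fn parse_string(s: &str) -> Result<String, ParseError> {
    String::from_utf8(parse_bytes_literal(s)?).map_err(|_| err("invalid UTF-8"))
}
```

One thing worth checking against the spec before adopting this shape: with this sketch, a string literal such as "\xff" fails the UTF-8 check, so if \x and octal escapes are meant to denote code points in strings rather than raw bytes, parse_string would need to handle them separately.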

clarkmcc · 2024-07-12T14:32:37Z

So another way to go about this

I'm fine with this approach for now.

clarkmcc · 2024-07-12T15:37:09Z

parser/src/parse.rs

+                }
+            };
+        }
+        res.extend(c.to_string().as_bytes());


Could this be simplified?

Suggested change

res.extend(c.to_string().as_bytes());

res.push(c as u8);

Ah! No, we can't... or at least not that way. I have not found a way to get at the bytes (plural) underlying a char. A char isn't a byte: it is always sized to 4 bytes, and its UTF-8 encoding takes anywhere from 1 to 4 bytes.

Here's an example that hopefully showcases the issue.
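A minimal sketch of the problem (not the example originally linked; the two helper names are made up for illustration): casting a char to u8 keeps only the low 8 bits of the code point, while the UTF-8 encoding of the same char can be several bytes long.

```rust
/// Narrowing cast: keeps only the low 8 bits of the code point.
fn truncate_to_byte(c: char) -> u8 {
    c as u8
}

/// The UTF-8 encoding of `c`, between 1 and 4 bytes long.
fn utf8_bytes(c: char) -> Vec<u8> {
    c.to_string().into_bytes()
}
```

For an ASCII char such as 'a' both helpers agree, which is why res.push(c as u8) looks correct at first glance. For U+20AC (the euro sign) the cast yields the single byte 0xAC, whereas the encoding is the three bytes 0xE2 0x82 0xAC.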

Fascinating... you're right. TIL!

TL;DR, I'm approving and merging because I'm perfectly fine with what you have here.

But, if you're curious, I did some digging around and found you actually can do this

    let required = c.len_utf8();
    let available = res.capacity() - res.len();
    if available < required {
        res.reserve(required);
    }
    let mut buffer = [0; 4];
    c.encode_utf8(&mut buffer);
    res.extend_from_slice(&buffer[..required]);

After digging through the Rust source for char I saw that it does something very similar so I just had to benchmark it. Turns out, encode_utf8 is 20x faster than the intermediate string conversion.
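A more compact variant of the same idea, as a sketch only (push_char is a hypothetical helper name, not something in this PR): char::encode_utf8 returns the encoded sub-slice as a &mut str, so it can be appended directly, and Vec::extend_from_slice grows the vector on its own, which makes the explicit capacity check optional.

```rust
/// Hypothetical helper: appends the UTF-8 encoding of `c` to `res`
/// without allocating an intermediate String.
fn push_char(res: &mut Vec<u8>, c: char) {
    // Four bytes is the maximum length of any UTF-8 encoded char.
    let mut buffer = [0u8; 4];
    let encoded = c.encode_utf8(&mut buffer);
    res.extend_from_slice(encoded.as_bytes());
}
```

Whether dropping the manual reserve changes the numbers is something the benchmark would have to confirm; this variant has not been measured here.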

Index: interpreter/benches/runtime.rs
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/interpreter/benches/runtime.rs b/interpreter/benches/runtime.rs
--- a/interpreter/benches/runtime.rs (revision b458dd53bdf2c169f58389e54201e9b38bec2659)
+++ b/interpreter/benches/runtime.rs (date 1720826951595)
@@ -66,5 +66,62 @@
     group.finish();
 }
 
-criterion_group!(benches, criterion_benchmark, map_macro_benchmark);
+pub fn map_char_copy(c: &mut Criterion) {
+    let mut group = c.benchmark_group("char to vec");
+
+    group.bench_function("char to string to empty vec", |b| {
+        let mut empty = Vec::default();
+        let c = 'a';
+        b.iter(|| empty.extend(c.to_string().into_bytes()))
+    });
+
+    group.bench_function("char to string to full vec", |b| {
+        let mut full = Vec::with_capacity(4);
+        full.extend(vec![0, 0, 0, 0]);
+        assert_eq!(full.len(), 4);
+        assert_eq!(full.capacity(), 4);
+
+        let c = 'a';
+        b.iter(|| full.extend(c.to_string().into_bytes()))
+    });
+
+    group.bench_function("char encode to empty vec", |b| {
+        let mut empty = Vec::default();
+        let c = 'a';
+
+        b.iter(|| {
+            let required = c.len_utf8();
+            let available = empty.capacity() - empty.len();
+            if available < required {
+                empty.reserve(required);
+            }
+            let mut buffer = [0; 4];
+            c.encode_utf8(&mut buffer);
+            empty.extend_from_slice(&buffer[..required]);
+        })
+    });
+
+    group.bench_function("char encode to full vec", |b| {
+        let mut full = Vec::with_capacity(4);
+        full.extend(vec![0, 0, 0, 0]);
+        assert_eq!(full.len(), 4);
+        assert_eq!(full.capacity(), 4);
+        let c = 'a';
+
+        b.iter(|| {
+            let required = c.len_utf8();
+            let available = full.capacity() - full.len();
+            if available < required {
+                full.reserve(required);
+            }
+            let mut buffer = [0; 4];
+            c.encode_utf8(&mut buffer);
+            full.extend_from_slice(&buffer[..required]);
+        })
+    });
+
+    group.finish();
+}
+
+criterion_group!(benches, map_char_copy);
 criterion_main!(benches);

Interesting! I can dig a little deeper into this. I did a quick run on Compiler Explorer and the result didn't look outrageous, but indeed… it looks stupid (it did even when I wrote it); I was surprised there was no obvious way to get the bytes.
I might just steal your bench, adapt it to parse_byte, and change that to encode_utf8 into the buffer directly when there is no escaping… Good stuff! Thanks for looking into this!

alexsnaps added 2 commits July 10, 2024 17:42

Added escaping when parsing bytes

d89de0e

Add bytes()

b458dd5

alexsnaps mentioned this pull request Jul 10, 2024

Unresolved method/function calls panic #65

Closed

clarkmcc reviewed Jul 12, 2024

clarkmcc approved these changes Jul 12, 2024

clarkmcc merged commit 4a3607c into clarkmcc:master Jul 12, 2024
1 check passed

alexsnaps deleted the bytes branch July 13, 2024 01:08

alexsnaps mentioned this pull request Jul 13, 2024

Faster bytes parsing #69

Merged

2 tasks
