Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support for escaping in Bytes and bytes macro #64

Merged
merged 2 commits into from
Jul 12, 2024

Conversation

alexsnaps
Copy link
Contributor

No description provided.

@alexsnaps
Copy link
Contributor Author

Quick note here, I went to implement \ x HEXDIGIT HEXDIGIT and \ [0-3] [0-7] [0-7] for escaping byte sequences in b"" BYTES_LIT, but from the spec's lexis, all the escaping from STRING_LIT should be supported...

BYTES_LIT      ::= [bB] STRING_LIT
ESCAPE         ::= \ [abfnrtv\?"'`]
                 | \ x HEXDIGIT HEXDIGIT
                 | \ u HEXDIGIT HEXDIGIT HEXDIGIT HEXDIGIT
                 | \ U HEXDIGIT HEXDIGIT HEXDIGIT HEXDIGIT HEXDIGIT HEXDIGIT HEXDIGIT HEXDIGIT
                 | \ [0-3] [0-7] [0-7]

So another way to go about this is to refactor parse_string to parse_bytes_literal(s: &str) -> Result<Vec<u8>, ParseError> instead, and have the "newer" parse_string call into it to then String::from_utf8 the parsed bytes... I think that'd be closer to the spec, tho a bigger change. This change mostly allows us to now represent arbitrary byte sequence and work with them, which wasn't possible before.

@clarkmcc
Copy link
Owner

So another way to go about this

I'm fine with this approach for now.

}
};
}
res.extend(c.to_string().as_bytes());
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be simplified?

Suggested change
res.extend(c.to_string().as_bytes());
res.push(c as u8);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! No we can't... or at least not that way. I have not found a way to get to the bytes (plural) underlying a char isn't a byte. It's actually even sized to 4 bytes, but holds up to 4 bytes.

Here an example that hopefully showcases the issue.

Copy link
Owner

@clarkmcc clarkmcc Jul 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fascinating... you're right. TIL!

TL;DR, I'm approving and merging because I'm perfectly fine with what you have here.

But, if you're curious, I did some digging around and found you actually can do this

let required = c.len_utf8();
let available = res.capacity() - res.len();
if available < required {
    res.reserve(required);
}
let mut buffer = [0; 4];
c.encode_utf8(&mut buffer);
res.extend_from_slice(&buffer[..required]);

After digging through the Rust source for char I saw that it does something very similar so I just had to benchmark it. Turns out, encode_utf8 is 20x faster than the intermediate string conversion.
image

Index: interpreter/benches/runtime.rs
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/interpreter/benches/runtime.rs b/interpreter/benches/runtime.rs
--- a/interpreter/benches/runtime.rs	(revision b458dd53bdf2c169f58389e54201e9b38bec2659)
+++ b/interpreter/benches/runtime.rs	(date 1720826951595)
@@ -66,5 +66,62 @@
     group.finish();
 }
 
-criterion_group!(benches, criterion_benchmark, map_macro_benchmark);
+pub fn map_char_copy(c: &mut Criterion) {
+    let mut group = c.benchmark_group("char to vec");
+
+    group.bench_function("char to string to empty vec", |b| {
+        let mut empty = Vec::default();
+        let c = 'a';
+        b.iter(|| empty.extend(c.to_string().into_bytes()))
+    });
+
+    group.bench_function("char to string to full vec", |b| {
+        let mut full = Vec::with_capacity(4);
+        full.extend(vec![0, 0, 0, 0]);
+        assert_eq!(full.len(), 4);
+        assert_eq!(full.capacity(), 4);
+
+        let c = 'a';
+        b.iter(|| full.extend(c.to_string().into_bytes()))
+    });
+
+    group.bench_function("char encode to empty vec", |b| {
+        let mut empty = Vec::default();
+        let c = 'a';
+
+        b.iter(|| {
+            let required = c.len_utf8();
+            let available = empty.capacity() - empty.len();
+            if available < required {
+                empty.reserve(required);
+            }
+            let mut buffer = [0; 4];
+            c.encode_utf8(&mut buffer);
+            empty.extend_from_slice(&buffer[..required]);
+        })
+    });
+
+    group.bench_function("char encode to full vec", |b| {
+        let mut full = Vec::with_capacity(4);
+        full.extend(vec![0, 0, 0, 0]);
+        assert_eq!(full.len(), 4);
+        assert_eq!(full.capacity(), 4);
+        let c = 'a';
+
+        b.iter(|| {
+            let required = c.len_utf8();
+            let available = full.capacity() - full.len();
+            if available < required {
+                full.reserve(required);
+            }
+            let mut buffer = [0; 4];
+            c.encode_utf8(&mut buffer);
+            full.extend_from_slice(&buffer[..required]);
+        })
+    });
+
+    group.finish();
+}
+
+criterion_group!(benches, map_char_copy);
 criterion_main!(benches);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting! I can dig a little deeper in this. I did a quick run on compiler explorer and the result didn't look outrageous, but indeed… it looks stupid (including when I wrote it), I was surprised there was no obvious way to get the bytes.
I might just steal your bench and adapt it to the parse_byte and alter it to encode_utf8 to the buffer directly when no escaping is there… Good stuff! Thanks for looking into this!

@clarkmcc clarkmcc merged commit 4a3607c into clarkmcc:master Jul 12, 2024
1 check passed
@alexsnaps alexsnaps deleted the bytes branch July 13, 2024 01:08
@alexsnaps alexsnaps mentioned this pull request Jul 13, 2024
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants