Implement string parser #24

inikolaev · 2023-10-07T17:33:42Z

I've been looking into implementing a proper string parser that supports interpretation of escape sequences as mentioned in the CEL specification.

I based my implementation on snailquote, but had to change it so that it can support things from CEL specification.

So far it supports all escape sequences and raw strings. I plan to look into supporting triple quotes as well, but in a separate PR.

Please let me know what you think about this approach - maybe you had some different thoughts on how to do this differently, but generally this approach seem to work and could be scaled to other tokens such as byte strings and even numbers.

I'm very new to Rust and would appreciate your feedback. I will look into adding some tests, but need to learn how to do that.

parser/src/parse.rs

inikolaev · 2023-10-07T17:38:49Z

parser/src/parse.rs

+                    }
+                    Some((idx, c2)) => {
+                        let s = match c2 {
+                            'a' => String::from("\u{07}"),


Not very happy about these String::from - seems a bit unnecessary, but haven't figured out a better way. On the other hand I don't expect parsing to be executed too frequently, so it might be fine for now.

The original code was returning a character here, but now there are cases when we need to return multiple characters, so I changed it to a string, because we can use push_str to append pieces to result.

clarkmcc · 2023-10-07T17:51:30Z

Very nice! I'll dive into this and get back to you. Tests would be great, I've been adding them to the bottom of the files of the code they're testing. Checkout the functions.rs file for an example of how to annotate a test function.

clarkmcc · 2023-10-08T14:57:06Z

I haven't looked at the PR itself yet, but a couple things right off the bat to consider:

It looks like there are some regressions on existing unit tests, so we'd want to make sure your PR addresses each of those first.
At some point in the near future when I get the time for the next major CEL project, I'd like to get Add chumsky parser #18 merged. That doesn't mean that we can't merge this PR first, it just means that if we do, we'll need to implement this twice (if it's not already supported in Add chumsky parser #18 which something in the back of my brain is tickling me).

inikolaev · 2023-10-08T16:10:50Z

I haven't looked at the PR itself yet, but a couple things right off the bat to consider:

It looks like there are some regressions on existing unit tests, so we'd want to make sure your PR addresses each of those first.

At some point in the near future when I get the time for the next major CEL project, I'd like to get Add chumsky parser #18 merged. That doesn't mean that we can't merge this PR first, it just means that if we do, we'll need to implement this twice (if it's not already supported in Add chumsky parser #18 which something in the back of my brain is tickling me).

Yes, I have noticed some test failing, but haven't looked into them until now. The regular expression was too greedy, so I asked ChartGPT for help and now tests are passing :)

I'm currently adding some tests for the string parser.

No worries if we have to replace the implementation if new parser comes into the picture. I use this as an opportunity to familiarise myself with Rust, so happy to help.

inikolaev · 2024-01-08T19:04:29Z

@clarkmcc I've got some time to work on this now, would you mind approving the workflow to see how it looks now?

clarkmcc · 2024-01-08T19:31:12Z

Workflow approved! It's been a while since I looked at this PR but if I remember right there were a lot of allocations happening (String::from). It would be great if that could be optimized and I'm even wondering if using nom is the most efficient way to do this.

In any case, move in whatever direction you'd like to take this and we can look at performance after we've got something working.

inikolaev · 2024-01-11T15:26:39Z

@clarkmcc I have the removed the majority of String::from usages - only cases that handle unicode characters still use it, so I thing this should be good for now.

nom does look interesting indeed!

inikolaev · 2024-01-11T15:33:03Z

Actually checked the code once again and noticed that I was creating a string from a character when handling unicode - changed it to returned char instead.

clarkmcc · 2024-01-11T17:17:11Z

Thanks! I'm out on vacation now but will review once I return next week.

clarkmcc · 2024-01-15T15:26:35Z

Thanks for getting the tests working! Are you interested in adding parser benchmarks? I've been meaning to add them for a while and now seems like a good change. If not, no stress and I can get those added in myself.

inikolaev · 2024-01-15T17:38:14Z

Thanks for getting the tests working! Are you interested in adding parser benchmarks? I've been meaning to add them for a while and now seems like a good change. If not, no stress and I can get those added in myself.

I've added some simple benchmarks - do you think these are good enough for now or do you have anything else in mind?

clarkmcc · 2024-01-17T17:08:24Z

This is great for now. I'm part way through a review with a few feedback items. I started this weekend but have been sidetracked this week. I hope to have a review finished this weekend.

clarkmcc

Great work! A few things to do:

Run cargo fmt on the project
Fix the changes noted as "Required" in the comments

A wrote some additional benchmarks for the parser itself and checked the before and after. There's ~20% performance regression, but we're still in nanosecond range and we're more in line with the CEL spec so this seems acceptable to me.

addition                time:   [725.75 µs 726.55 µs 727.36 µs]
                        change: [+20.150% +20.433% +20.709%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

addition with string    time:   [742.33 µs 743.22 µs 744.17 µs]
                        change: [+20.840% +21.333% +21.969%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

map expression          time:   [768.33 µs 769.99 µs 771.86 µs]
                        change: [+17.316% +18.121% +18.868%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) low mild
  5 (5.00%) high mild

string #1               time:   [753.15 µs 754.42 µs 755.87 µs]
                        change: [+17.638% +18.735% +19.709%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  1 (1.00%) high severe

text                    time:   [734.43 µs 735.77 µs 737.38 µs]
                        change: [+19.534% +20.091% +20.620%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe

raw                     time:   [737.88 µs 739.25 µs 740.88 µs]
                        change: [+17.459% +18.305% +19.144%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  5 (5.00%) high severe

single unicode escape sequence
                        time:   [736.28 µs 737.41 µs 738.60 µs]
                        change: [+20.952% +21.364% +21.736%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

single hex escape sequence
                        time:   [734.55 µs 735.66 µs 736.88 µs]
                        change: [+20.123% +20.593% +21.042%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

single oct escape sequence
                        time:   [735.84 µs 737.05 µs 738.36 µs]
                        change: [+18.249% +18.864% +19.473%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild

parser/src/parse.rs

clarkmcc · 2024-01-20T19:47:02Z

parser/src/parse.rs

+                            parse_unicode_hex(2, &mut chars).map_err(|x| {
+                                ParseError::InvalidUnicode {
+                                    source: x,
+                                    index: idx,
+                                    string: String::from(s),
+                                }
+                            }).unwrap()


Must Change
These errors are being .unwrap()'d which means that if there's an error, it will panic the program rather than returning the error. Since this function already returns ParseError, let's make sure to return these errors to the caller. Adding the ? would probably work instead of unwrap.

parse_unicode_hex(2, &mut chars).map_err(|x| { ParseError::InvalidUnicode { source: x, index: idx, string: String::from(s), } })?

Optional Code Style Suggestions
As a code style suggestion, there's a lot of repetition parsing the different hex and octal codes. You might consider trying to simplify by (for example) handling all hex parsing in one match arm. This code doesn't compile as-is but you get the idea.

'x' | 'u' | 'U' => { let count = match c2 { 'x' => 2, 'u' => 4, 'U' => 8, }; res.push(parse_unicode_hex(count, chars).map_err(|source| { ParseError::InvalidUnicode { source, index: idx, string: String::from(s), } })?); }

Another very small stylistic suggestion: You also might consider lifting your res.push(...) outside the match. If every match arm results in a push (which it would once you implement the error result handling suggested earlier), then it might be cleaner to do this

res.push(match c2 { 'a' => '\u{07}', 'b' => '\u{08}', 'v' => '\u{0B}', 'f' => '\u{0C}',

then this

match c2 { 'a' => res.push('\u{07}'), 'b' => res.push('\u{08}'), 'v' => res.push('\u{0B}'), 'f' => res.push('\u{0C}'),

Regarding the unwrap - I wonder if match should return a Result instead? E.g.:

res.push(match c2 { 'a' => Ok('\u{07}'), 'b' => Ok('\u{08}'), 'v' => Ok('\u{0B}'), 'f' => Ok('\u{0C}'),

Then it's easy to handle outside match as well. Or do you think it's too expensive to allocate it?

Otherwise I think I have to do something like this when parsing hex/oct values:

'x' | 'u' | 'U' => { let length = match c2 { 'x' => 2, 'u' => 4, 'U' => 8, _ => unreachable!(), }; let result = parse_unicode_hex(length, &mut chars) .map_err(|x| ParseError::InvalidUnicode { source: x, index: idx, string: String::from(s), }); match result { Ok(value) => value, Err(e) => { return Err(e); } } }

Can't you use the ? operator?

parse_unicode_hex(length, &mut chars) .map_err(|x| ParseError::InvalidUnicode { source: x, index: idx, string: String::from(s), })?;

Ok, I figured this out. I was not quite familiar with what ? operator does - it behaves slightly differently from other languages I'm familiar with and the other issue I had was Value used after being used error on x variable, so I implemented Clone trait on the ParseUnicodeError.

How does that look to you?

Looks good! I'll do one final bass and then merge if you're happy with where things are at.

parser/src/parse.rs

inikolaev commented Oct 7, 2023

View reviewed changes

parser/src/parse.rs Show resolved Hide resolved

inikolaev commented Oct 7, 2023

View reviewed changes

inikolaev force-pushed the implement-string-parser branch from a959ff6 to 5d3baed Compare October 7, 2023 17:39

inikolaev added 2 commits January 8, 2024 21:01

Implement string parser

2941019

Fix regular expression to properly match strings

0acc410

inikolaev force-pushed the implement-string-parser branch from eb977d9 to ff39df7 Compare January 8, 2024 19:01

Refactor string parsing into several functions and added tests

71b5b01

inikolaev force-pushed the implement-string-parser branch from ff39df7 to 71b5b01 Compare January 8, 2024 21:11

Remove string allocation when handling unicode

eaa1f9b

inikolaev marked this pull request as ready for review January 15, 2024 15:17

Add simple benchmarks for string parsing

04c899c

Add simple benchmarks for string parsing

258cba2

clarkmcc requested changes Jan 20, 2024

View reviewed changes

inikolaev added 3 commits January 21, 2024 20:26

Improve unicode parsing

7866913

Move res.push() outside match

43a4389

Format the code

ce04dad

inikolaev commented Jan 21, 2024

View reviewed changes

parser/src/parse.rs Show resolved Hide resolved

Use try operator when parsing unicode characters

7885ce9

clarkmcc approved these changes Feb 4, 2024

View reviewed changes

clarkmcc merged commit e4b235f into clarkmcc:master Feb 4, 2024
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement string parser #24

Implement string parser #24

inikolaev commented Oct 7, 2023

inikolaev Oct 7, 2023

clarkmcc commented Oct 7, 2023

clarkmcc commented Oct 8, 2023

inikolaev commented Oct 8, 2023

inikolaev commented Jan 8, 2024

clarkmcc commented Jan 8, 2024

inikolaev commented Jan 11, 2024

inikolaev commented Jan 11, 2024

clarkmcc commented Jan 11, 2024

clarkmcc commented Jan 15, 2024

inikolaev commented Jan 15, 2024

clarkmcc commented Jan 17, 2024

clarkmcc left a comment

clarkmcc Jan 20, 2024

inikolaev Jan 21, 2024

clarkmcc Jan 27, 2024 •

edited

Loading

inikolaev Jan 30, 2024

clarkmcc Jan 30, 2024

Implement string parser #24

Implement string parser #24

Conversation

inikolaev commented Oct 7, 2023

inikolaev Oct 7, 2023

Choose a reason for hiding this comment

clarkmcc commented Oct 7, 2023

clarkmcc commented Oct 8, 2023

inikolaev commented Oct 8, 2023

inikolaev commented Jan 8, 2024

clarkmcc commented Jan 8, 2024

inikolaev commented Jan 11, 2024

inikolaev commented Jan 11, 2024

clarkmcc commented Jan 11, 2024

clarkmcc commented Jan 15, 2024

inikolaev commented Jan 15, 2024

clarkmcc commented Jan 17, 2024

clarkmcc left a comment

Choose a reason for hiding this comment

clarkmcc Jan 20, 2024

Choose a reason for hiding this comment

inikolaev Jan 21, 2024

Choose a reason for hiding this comment

clarkmcc Jan 27, 2024 • edited Loading

Choose a reason for hiding this comment

inikolaev Jan 30, 2024

Choose a reason for hiding this comment

clarkmcc Jan 30, 2024

Choose a reason for hiding this comment

clarkmcc Jan 27, 2024 •

edited

Loading