Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement string parser #24

Merged
merged 10 commits into from
Feb 4, 2024
Merged

Conversation

inikolaev
Copy link
Contributor

I've been looking into implementing a proper string parser that supports interpretation of escape sequences as mentioned in the CEL specification.

I based my implementation on snailquote, but had to change it so that it can support things from CEL specification.

So far it supports all escape sequences and raw strings. I plan to look into supporting triple quotes as well, but in a separate PR.

Please let me know what you think about this approach - maybe you had some different thoughts on how to do this differently, but generally this approach seem to work and could be scaled to other tokens such as byte strings and even numbers.

I'm very new to Rust and would appreciate your feedback. I will look into adding some tests, but need to learn how to do that.

}
Some((idx, c2)) => {
let s = match c2 {
'a' => String::from("\u{07}"),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not very happy about these String::from - seems a bit unnecessary, but haven't figured out a better way. On the other hand I don't expect parsing to be executed too frequently, so it might be fine for now.

The original code was returning a character here, but now there are cases when we need to return multiple characters, so I changed it to a string, because we can use push_str to append pieces to result.

@clarkmcc
Copy link
Owner

clarkmcc commented Oct 7, 2023

Very nice! I'll dive into this and get back to you. Tests would be great, I've been adding them to the bottom of the files of the code they're testing. Checkout the functions.rs file for an example of how to annotate a test function.

@clarkmcc
Copy link
Owner

clarkmcc commented Oct 8, 2023

I haven't looked at the PR itself yet, but a couple things right off the bat to consider:

  • It looks like there are some regressions on existing unit tests, so we'd want to make sure your PR addresses each of those first.
  • At some point in the near future when I get the time for the next major CEL project, I'd like to get Add chumsky parser #18 merged. That doesn't mean that we can't merge this PR first, it just means that if we do, we'll need to implement this twice (if it's not already supported in Add chumsky parser #18 which something in the back of my brain is tickling me).

@inikolaev
Copy link
Contributor Author

I haven't looked at the PR itself yet, but a couple things right off the bat to consider:

  • It looks like there are some regressions on existing unit tests, so we'd want to make sure your PR addresses each of those first.
  • At some point in the near future when I get the time for the next major CEL project, I'd like to get Add chumsky parser #18 merged. That doesn't mean that we can't merge this PR first, it just means that if we do, we'll need to implement this twice (if it's not already supported in Add chumsky parser #18 which something in the back of my brain is tickling me).

Yes, I have noticed some test failing, but haven't looked into them until now. The regular expression was too greedy, so I asked ChartGPT for help and now tests are passing :)

I'm currently adding some tests for the string parser.

No worries if we have to replace the implementation if new parser comes into the picture. I use this as an opportunity to familiarise myself with Rust, so happy to help.

@inikolaev
Copy link
Contributor Author

@clarkmcc I've got some time to work on this now, would you mind approving the workflow to see how it looks now?

@clarkmcc
Copy link
Owner

clarkmcc commented Jan 8, 2024

Workflow approved! It's been a while since I looked at this PR but if I remember right there were a lot of allocations happening (String::from). It would be great if that could be optimized and I'm even wondering if using nom is the most efficient way to do this.

In any case, move in whatever direction you'd like to take this and we can look at performance after we've got something working.

@inikolaev
Copy link
Contributor Author

@clarkmcc I have the removed the majority of String::from usages - only cases that handle unicode characters still use it, so I thing this should be good for now.

nom does look interesting indeed!

@inikolaev
Copy link
Contributor Author

Actually checked the code once again and noticed that I was creating a string from a character when handling unicode - changed it to returned char instead.

@clarkmcc
Copy link
Owner

Thanks! I'm out on vacation now but will review once I return next week.

@inikolaev inikolaev marked this pull request as ready for review January 15, 2024 15:17
@clarkmcc
Copy link
Owner

Thanks for getting the tests working! Are you interested in adding parser benchmarks? I've been meaning to add them for a while and now seems like a good change. If not, no stress and I can get those added in myself.

@inikolaev
Copy link
Contributor Author

Thanks for getting the tests working! Are you interested in adding parser benchmarks? I've been meaning to add them for a while and now seems like a good change. If not, no stress and I can get those added in myself.

I've added some simple benchmarks - do you think these are good enough for now or do you have anything else in mind?

@clarkmcc
Copy link
Owner

This is great for now. I'm part way through a review with a few feedback items. I started this weekend but have been sidetracked this week. I hope to have a review finished this weekend.

Copy link
Owner

@clarkmcc clarkmcc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! A few things to do:

  • Run cargo fmt on the project
  • Fix the changes noted as "Required" in the comments

A wrote some additional benchmarks for the parser itself and checked the before and after. There's ~20% performance regression, but we're still in nanosecond range and we're more in line with the CEL spec so this seems acceptable to me.

addition                time:   [725.75 µs 726.55 µs 727.36 µs]
                        change: [+20.150% +20.433% +20.709%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

addition with string    time:   [742.33 µs 743.22 µs 744.17 µs]
                        change: [+20.840% +21.333% +21.969%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

map expression          time:   [768.33 µs 769.99 µs 771.86 µs]
                        change: [+17.316% +18.121% +18.868%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) low mild
  5 (5.00%) high mild

string #1               time:   [753.15 µs 754.42 µs 755.87 µs]
                        change: [+17.638% +18.735% +19.709%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  1 (1.00%) high severe

text                    time:   [734.43 µs 735.77 µs 737.38 µs]
                        change: [+19.534% +20.091% +20.620%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe

raw                     time:   [737.88 µs 739.25 µs 740.88 µs]
                        change: [+17.459% +18.305% +19.144%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  5 (5.00%) high severe

single unicode escape sequence
                        time:   [736.28 µs 737.41 µs 738.60 µs]
                        change: [+20.952% +21.364% +21.736%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

single hex escape sequence
                        time:   [734.55 µs 735.66 µs 736.88 µs]
                        change: [+20.123% +20.593% +21.042%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

single oct escape sequence
                        time:   [735.84 µs 737.05 µs 738.36 µs]
                        change: [+18.249% +18.864% +19.473%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild

parser/src/parse.rs Outdated Show resolved Hide resolved
parser/src/parse.rs Outdated Show resolved Hide resolved
Comment on lines 206 to 212
parse_unicode_hex(2, &mut chars).map_err(|x| {
ParseError::InvalidUnicode {
source: x,
index: idx,
string: String::from(s),
}
}).unwrap()
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Must Change
These errors are being .unwrap()'d which means that if there's an error, it will panic the program rather than returning the error. Since this function already returns ParseError, let's make sure to return these errors to the caller. Adding the ? would probably work instead of unwrap.

parse_unicode_hex(2, &mut chars).map_err(|x| {
    ParseError::InvalidUnicode {
        source: x,
        index: idx,
        string: String::from(s),
    }
})?

Optional Code Style Suggestions
As a code style suggestion, there's a lot of repetition parsing the different hex and octal codes. You might consider trying to simplify by (for example) handling all hex parsing in one match arm. This code doesn't compile as-is but you get the idea.

'x' | 'u' | 'U' => {
    let count = match c2 {
        'x' => 2,
        'u' => 4,
        'U' => 8,
    };
    res.push(parse_unicode_hex(count, chars).map_err(|source| {
        ParseError::InvalidUnicode {
            source,
            index: idx,
            string: String::from(s),
        }
    })?);
}

Another very small stylistic suggestion: You also might consider lifting your res.push(...) outside the match. If every match arm results in a push (which it would once you implement the error result handling suggested earlier), then it might be cleaner to do this

res.push(match c2 {
    'a' => '\u{07}',
    'b' => '\u{08}',
    'v' => '\u{0B}',
    'f' => '\u{0C}',

then this

match c2 {
    'a' => res.push('\u{07}'),
    'b' => res.push('\u{08}'),
    'v' => res.push('\u{0B}'),
    'f' => res.push('\u{0C}'),

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the unwrap - I wonder if match should return a Result instead? E.g.:

res.push(match c2 {
    'a' => Ok('\u{07}'),
    'b' => Ok('\u{08}'),
    'v' => Ok('\u{0B}'),
    'f' => Ok('\u{0C}'),

Then it's easy to handle outside match as well. Or do you think it's too expensive to allocate it?

Otherwise I think I have to do something like this when parsing hex/oct values:

'x' | 'u' | 'U' => {
    let length = match c2 {
        'x' => 2,
        'u' => 4,
        'U' => 8,
        _ => unreachable!(),
    };

    let result = parse_unicode_hex(length, &mut chars)
        .map_err(|x| ParseError::InvalidUnicode {
            source: x,
            index: idx,
            string: String::from(s),
        });

    match result {
        Ok(value) => value,
        Err(e) => {
            return Err(e);
        }
    }
}

Copy link
Owner

@clarkmcc clarkmcc Jan 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't you use the ? operator?

parse_unicode_hex(length, &mut chars)
  .map_err(|x| ParseError::InvalidUnicode {
      source: x,
      index: idx,
      string: String::from(s),
  })?;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I figured this out. I was not quite familiar with what ? operator does - it behaves slightly differently from other languages I'm familiar with and the other issue I had was Value used after being used error on x variable, so I implemented Clone trait on the ParseUnicodeError.

How does that look to you?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I'll do one final bass and then merge if you're happy with where things are at.

@clarkmcc clarkmcc merged commit e4b235f into clarkmcc:master Feb 4, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants