-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement string parser #24
Conversation
parser/src/parse.rs
Outdated
} | ||
Some((idx, c2)) => { | ||
let s = match c2 { | ||
'a' => String::from("\u{07}"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not very happy about these String::from
- seems a bit unnecessary, but haven't figured out a better way. On the other hand I don't expect parsing to be executed too frequently, so it might be fine for now.
The original code was returning a character here, but now there are cases when we need to return multiple characters, so I changed it to a string, because we can use push_str
to append pieces to result.
a959ff6
to
5d3baed
Compare
Very nice! I'll dive into this and get back to you. Tests would be great, I've been adding them to the bottom of the files of the code they're testing. Checkout the functions.rs file for an example of how to annotate a test function. |
I haven't looked at the PR itself yet, but a couple things right off the bat to consider:
|
Yes, I have noticed some test failing, but haven't looked into them until now. The regular expression was too greedy, so I asked ChartGPT for help and now tests are passing :) I'm currently adding some tests for the string parser. No worries if we have to replace the implementation if new parser comes into the picture. I use this as an opportunity to familiarise myself with Rust, so happy to help. |
eb977d9
to
ff39df7
Compare
@clarkmcc I've got some time to work on this now, would you mind approving the workflow to see how it looks now? |
Workflow approved! It's been a while since I looked at this PR but if I remember right there were a lot of allocations happening (String::from). It would be great if that could be optimized and I'm even wondering if using nom is the most efficient way to do this. In any case, move in whatever direction you'd like to take this and we can look at performance after we've got something working. |
ff39df7
to
71b5b01
Compare
@clarkmcc I have the removed the majority of
|
Actually checked the code once again and noticed that I was creating a string from a character when handling unicode - changed it to returned |
Thanks! I'm out on vacation now but will review once I return next week. |
Thanks for getting the tests working! Are you interested in adding parser benchmarks? I've been meaning to add them for a while and now seems like a good change. If not, no stress and I can get those added in myself. |
I've added some simple benchmarks - do you think these are good enough for now or do you have anything else in mind? |
This is great for now. I'm part way through a review with a few feedback items. I started this weekend but have been sidetracked this week. I hope to have a review finished this weekend. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! A few things to do:
- Run
cargo fmt
on the project - Fix the changes noted as "Required" in the comments
A wrote some additional benchmarks for the parser itself and checked the before and after. There's ~20% performance regression, but we're still in nanosecond range and we're more in line with the CEL spec so this seems acceptable to me.
addition time: [725.75 µs 726.55 µs 727.36 µs]
change: [+20.150% +20.433% +20.709%] (p = 0.00 < 0.05)
Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) high mild
2 (2.00%) high severe
addition with string time: [742.33 µs 743.22 µs 744.17 µs]
change: [+20.840% +21.333% +21.969%] (p = 0.00 < 0.05)
Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
3 (3.00%) high mild
5 (5.00%) high severe
map expression time: [768.33 µs 769.99 µs 771.86 µs]
change: [+17.316% +18.121% +18.868%] (p = 0.00 < 0.05)
Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
4 (4.00%) low mild
5 (5.00%) high mild
string #1 time: [753.15 µs 754.42 µs 755.87 µs]
change: [+17.638% +18.735% +19.709%] (p = 0.00 < 0.05)
Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
1 (1.00%) low mild
5 (5.00%) high mild
1 (1.00%) high severe
text time: [734.43 µs 735.77 µs 737.38 µs]
change: [+19.534% +20.091% +20.620%] (p = 0.00 < 0.05)
Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
2 (2.00%) high mild
6 (6.00%) high severe
raw time: [737.88 µs 739.25 µs 740.88 µs]
change: [+17.459% +18.305% +19.144%] (p = 0.00 < 0.05)
Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
1 (1.00%) low mild
5 (5.00%) high mild
5 (5.00%) high severe
single unicode escape sequence
time: [736.28 µs 737.41 µs 738.60 µs]
change: [+20.952% +21.364% +21.736%] (p = 0.00 < 0.05)
Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild
single hex escape sequence
time: [734.55 µs 735.66 µs 736.88 µs]
change: [+20.123% +20.593% +21.042%] (p = 0.00 < 0.05)
Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
4 (4.00%) high mild
2 (2.00%) high severe
single oct escape sequence
time: [735.84 µs 737.05 µs 738.36 µs]
change: [+18.249% +18.864% +19.473%] (p = 0.00 < 0.05)
Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
5 (5.00%) high mild
parser/src/parse.rs
Outdated
parse_unicode_hex(2, &mut chars).map_err(|x| { | ||
ParseError::InvalidUnicode { | ||
source: x, | ||
index: idx, | ||
string: String::from(s), | ||
} | ||
}).unwrap() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Must Change
These errors are being .unwrap()
'd which means that if there's an error, it will panic the program rather than returning the error. Since this function already returns ParseError
, let's make sure to return these errors to the caller. Adding the ?
would probably work instead of unwrap.
parse_unicode_hex(2, &mut chars).map_err(|x| {
ParseError::InvalidUnicode {
source: x,
index: idx,
string: String::from(s),
}
})?
Optional Code Style Suggestions
As a code style suggestion, there's a lot of repetition parsing the different hex and octal codes. You might consider trying to simplify by (for example) handling all hex parsing in one match arm. This code doesn't compile as-is but you get the idea.
'x' | 'u' | 'U' => {
let count = match c2 {
'x' => 2,
'u' => 4,
'U' => 8,
};
res.push(parse_unicode_hex(count, chars).map_err(|source| {
ParseError::InvalidUnicode {
source,
index: idx,
string: String::from(s),
}
})?);
}
Another very small stylistic suggestion: You also might consider lifting your res.push(...)
outside the match. If every match arm results in a push (which it would once you implement the error result handling suggested earlier), then it might be cleaner to do this
res.push(match c2 {
'a' => '\u{07}',
'b' => '\u{08}',
'v' => '\u{0B}',
'f' => '\u{0C}',
then this
match c2 {
'a' => res.push('\u{07}'),
'b' => res.push('\u{08}'),
'v' => res.push('\u{0B}'),
'f' => res.push('\u{0C}'),
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the unwrap
- I wonder if match should return a Result
instead? E.g.:
res.push(match c2 {
'a' => Ok('\u{07}'),
'b' => Ok('\u{08}'),
'v' => Ok('\u{0B}'),
'f' => Ok('\u{0C}'),
Then it's easy to handle outside match
as well. Or do you think it's too expensive to allocate it?
Otherwise I think I have to do something like this when parsing hex/oct values:
'x' | 'u' | 'U' => {
let length = match c2 {
'x' => 2,
'u' => 4,
'U' => 8,
_ => unreachable!(),
};
let result = parse_unicode_hex(length, &mut chars)
.map_err(|x| ParseError::InvalidUnicode {
source: x,
index: idx,
string: String::from(s),
});
match result {
Ok(value) => value,
Err(e) => {
return Err(e);
}
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't you use the ?
operator?
parse_unicode_hex(length, &mut chars)
.map_err(|x| ParseError::InvalidUnicode {
source: x,
index: idx,
string: String::from(s),
})?;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I figured this out. I was not quite familiar with what ?
operator does - it behaves slightly differently from other languages I'm familiar with and the other issue I had was Value used after being used
error on x
variable, so I implemented Clone
trait on the ParseUnicodeError
.
How does that look to you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! I'll do one final bass and then merge if you're happy with where things are at.
I've been looking into implementing a proper string parser that supports interpretation of escape sequences as mentioned in the CEL specification.
I based my implementation on snailquote, but had to change it so that it can support things from CEL specification.
So far it supports all escape sequences and raw strings. I plan to look into supporting triple quotes as well, but in a separate PR.
Please let me know what you think about this approach - maybe you had some different thoughts on how to do this differently, but generally this approach seem to work and could be scaled to other tokens such as byte strings and even numbers.
I'm very new to Rust and would appreciate your feedback. I will look into adding some tests, but need to learn how to do that.