-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rewrite parser without nom #14
base: master
Are you sure you want to change the base?
Conversation
I was struggling with nom lack of good diagnostic messages, and I decided that it's easier to rewrite parser without nom. New parser in particular properly reports error location (line and column) which was a big pain of old parser. New parser successfully parses (although doesn't check for correctness) all protobuf `.proto` files: ``` cargo run --example test_against_protobuf_protos ~/devel/protobuf/ ```
fa980ea
to
1691c4b
Compare
Wow. This is a lot of work! I agree that not using nom might be better (performance is definitely not an issue) for support purpose. I am wondering if making a PR on your PR is better or not. Alternatively, I think, if you want, you should be owner of that repo because well, you've put lot of work in it and have probably rewritten most of it! Thanks again for all your work. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't finished reviewing but I figured out I'd better start discussing about some ideas first.
I think we can probably clean some things by reusing the Chars
iterator a little more. Something like this:
use std::str::Chars;
const LINE_START: u32 = 1;
#[derive(Clone)]
pub struct CharLineCols<'a> {
chars: Chars<'a>,
line: u32,
col: u32,
}
impl<'a> CharLineCols<'a> {
pub fn new(input: &'a str) -> Self {
CharLineCols {
chars: input.chars(),
line: LINE_START,
col: 0,
}
}
pub fn lexer_eof(&self) -> bool {
self.chars.as_str().is_empty()
}
pub fn parser_source<F>(&mut self, f: F) -> Option<&'a str>
where F : FnOnce(&mut CharLineCols) -> bool
{
let start = self.chars.as_str();
let mut clone = self.clone();
if f(&mut clone) {
*self = clone;
let end = self.chars.as_str();
let consumed = start.len() - end.len(); // same logic as `std::char::char_indices`
return Some(&start[..consumed]);
}
None
}
pub fn take_while<F>(&mut self, f: F) -> &'a str
where F : Fn(char) -> bool
{
let start = self.chars.as_str();
let mut peek = self.chars.clone();
while peek.next().map_or(false, |c| f(c)) {
self.next();
}
let end = self.chars.as_str();
let consumed = start.len() - end.len();
&start[..consumed]
}
pub fn next_if<P>(&mut self, p: P) -> Option<char>
where P : FnOnce(char) -> bool
{
let mut clone = self.chars.clone();
if let Some(c) = clone.next() {
if p(c) {
return self.next();
}
}
None
}
pub fn skip_if_lookahead_is_str(&mut self, s: &str) -> bool {
assert!(s.len() > 0);
if self.chars.as_str().starts_with(s) {
for _ in s.chars() {
self.next();
}
true
} else {
false
}
}
}
impl<'a> Iterator for CharLineCols<'a> {
type Item = char;
fn next(&mut self) -> Option<char> {
self.chars.next()
.map(|c| {
if c == '\n' {
self.col = 0;
self.line += 1;
} else {
self.col += 1;
}
c
})
}
}
src/parser.rs
Outdated
} | ||
|
||
/// Apply a parser and return a string which matched | ||
fn parser_source<F>(&mut self, f: F) -> Option<&'a str> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand it correctly, this functions tries some parser and returns the consumes characters, if any.
parser_source
doesn't mean much to me, what about try_parse
, try_read
, try_consume
... or even a take
similar to your next take_while
function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's called source
in jparsec library, so I chose this name. Couldn't find quickly what's the parser name in the original Haskell parsec library, but nevermind, I'll simply inline the function since it's used only in one place.
src/parser.rs
Outdated
|
||
/// No more chars | ||
fn lexer_eof(&self) -> bool { | ||
self.rem_chars().is_empty() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a big deal but I'd prefer a direct self.pos == self.input.len()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
src/parser.rs
Outdated
where F : FnOnce(&mut Parser) -> bool | ||
{ | ||
let pos = self.pos; | ||
match self.parser_opt(|parser| if f(parser) { Some(())} else { None }) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.parser_opt(|parser| if f(parser) { Some(())} else { None })
.map(|_| &self.input[pos..self.pos])
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
if rem.is_empty() { | ||
None | ||
} else { | ||
let c = rem.chars().next().unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we should have:
let (char_len, c) = rem.char_indices().next().unwrap(); // actually this might not be ok to unwrap here
self.pos += char_len;
// ...
} else {
self.col += 1; // we keep 1 here
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That doesn't work, because char_len
is zero here.
src/parser.rs
Outdated
} | ||
|
||
fn next_lexer_char(&mut self) -> ParserResult<char> { | ||
match self.next_lexer_char_opt() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.next_lexer_char_opt().ok_or(ParserError::UnexpectedEof)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now I know about ok_or
function. Thank you!
src/parser.rs
Outdated
} | ||
|
||
fn lookahead_lexer_char_is_in(&self, alphabet: &str) -> bool { | ||
match self.lookahead_lexer_char() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.lookahead_lexer_char().map_or(false, alphabet.contains(c))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inline replies to your comments. I'll submit updated PR a bit later.
I tried to make it more readable and decided to try to rewrite it as lexer+parser. It will take a little time.
If you intend to use it in quick-protobuf codegen, I'd appreciate if you make me an owner or give me push permissions so I could push simple changes without waiting for merge. However, I would still appreciate code review for larger changes like this one. However, if you are not going to use this parser in quick-protobuf, I should simply merge it into the rust-protobuf project. |
Updated with lexer+parser. |
Hello! |
@Geal I've seen nom-locate, but I couldn't easily understand how to obtain line and column number of error from it. But the biggest issue is that I found |
I've copied contents of the PR into protobuf-codegen-pure crate to unblock development of new features. So protobuf-codegen-pure doesn't use protobuf-parser crate now. Maybe I'll switch to using protobuf-parser back later. |
I was struggling with nom lack of good diagnostic messages, and I
decided that it's easier to rewrite parser without nom.
The new parser in particular properly reports error location (line and
column) which was a big pain of old parser (fixes #13).
New parser correctly accepts syntax which is specified in
syntax = ...
header.New parser successfully parses (although doesn't check for correctness)
all protobuf
.proto
files: