Rewrite parser without nom #14

stepancheg · 2018-04-26T18:54:04Z

I was struggling with nom's lack of good diagnostic messages, and I
decided that it's easier to rewrite the parser without nom.

In particular, the new parser properly reports the error location (line and
column), which was a big pain point of the old parser (fixes #13).

The new parser correctly accepts the syntax specified in the syntax = ... header.

The new parser successfully parses (although it doesn't check for correctness)
all protobuf .proto files:

cargo run --example test_against_protobuf_protos ~/devel/protobuf/

tafia · 2018-04-30T08:49:33Z

Wow. This is a lot of work!

I agree that not using nom might be better for support purposes (performance is definitely not an issue).
I do have several comments though, some aesthetic (e.g. function names are not consistent ...), some more practical (I think the parser should implement Iterator<Item=char>, as a line/column-augmented version of str::char_indices).

I am wondering if making a PR on your PR is better or not.

Alternatively, I think, if you want, you should be an owner of that repo because, well, you've put a lot of work into it and have probably rewritten most of it!

Thanks again for all your work.

tafia

I haven't finished reviewing, but I figured I'd better start discussing some ideas first.

I think we can probably clean some things by reusing the Chars iterator a little more. Something like this:

use std::str::Chars;

const LINE_START: u32 = 1;

#[derive(Clone)]
pub struct CharLineCols<'a> {
    chars: Chars<'a>,
    line: u32,
    col: u32,
}

impl<'a> CharLineCols<'a> {
    pub fn new(input: &'a str) -> Self {
        CharLineCols {
            chars: input.chars(),
            line: LINE_START,
            col: 0,
        }
    }

    pub fn lexer_eof(&self) -> bool {
        self.chars.as_str().is_empty()
    }

    pub fn parser_source<F>(&mut self, f: F) -> Option<&'a str>
        where F : FnOnce(&mut CharLineCols) -> bool
    {
        let start = self.chars.as_str();
        let mut clone = self.clone();
        if f(&mut clone) {
            *self = clone;
            let end = self.chars.as_str();
            let consumed = start.len() - end.len(); // same logic as `std::str::CharIndices`
            return Some(&start[..consumed]);
        }
        None
    }

    pub fn take_while<F>(&mut self, f: F) -> &'a str
        where F : Fn(char) -> bool
    {
        let start = self.chars.as_str();
        let mut peek = self.chars.clone();
        while peek.next().map_or(false, |c| f(c)) {
            self.next();
        }
        let end = self.chars.as_str();
        let consumed = start.len() - end.len();
        &start[..consumed]
    }

    pub fn next_if<P>(&mut self, p: P) -> Option<char>
        where P : FnOnce(char) -> bool
    {
        let mut clone = self.chars.clone();
        if let Some(c) = clone.next() {
            if p(c) {
                return self.next();
            }
        }
        None
    }

    pub fn skip_if_lookahead_is_str(&mut self, s: &str) -> bool {
        assert!(s.len() > 0);
        if self.chars.as_str().starts_with(s) {
            for _ in s.chars() {
                self.next();
            }
            true
        } else {
            false
        }
    }
}

impl<'a> Iterator for CharLineCols<'a> {
    type Item = char;
    fn next(&mut self) -> Option<char> {
        self.chars.next()
            .map(|c| {
                if c == '\n' {
                    self.col = 0;
                    self.line += 1;
                } else {
                    self.col += 1;
                }
                c
            })
    }
}
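
As a rough illustration of how such an iterator could feed error reporting, here is a minimal self-contained sketch. `LineCol` and `first_bad_char` are hypothetical names that do not appear in the PR; the position logic mirrors the `next` implementation above, restated so the snippet compiles on its own.

```rust
use std::str::Chars;

/// Hypothetical minimal tracker: wraps `Chars` and counts lines and columns.
struct LineCol<'a> {
    chars: Chars<'a>,
    line: u32, // 1-based line of the next character
    col: u32,  // number of characters already consumed on this line
}

impl<'a> LineCol<'a> {
    fn new(input: &'a str) -> Self {
        LineCol { chars: input.chars(), line: 1, col: 0 }
    }
}

impl<'a> Iterator for LineCol<'a> {
    type Item = char;
    fn next(&mut self) -> Option<char> {
        let c = self.chars.next()?;
        if c == '\n' {
            self.line += 1;
            self.col = 0;
        } else {
            self.col += 1;
        }
        Some(c)
    }
}

/// Returns (line, col) of the first character rejected by `ok`, if any.
/// For a non-newline character the column is 1-based, because `col` has
/// already been incremented for the rejected character itself.
fn first_bad_char<F: Fn(char) -> bool>(input: &str, ok: F) -> Option<(u32, u32)> {
    let mut it = LineCol::new(input);
    while let Some(c) = it.next() {
        if !ok(c) {
            return Some((it.line, it.col));
        }
    }
    None
}
```

With this sketch, a rejected `$` on the second line of `"ab\ncd$e"` would be reported at line 2, column 3.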

tafia · 2018-04-30T08:02:04Z

src/parser.rs

+    }
+
+    /// Apply a parser and return a string which matched
+    fn parser_source<F>(&mut self, f: F) -> Option<&'a str>


If I understand it correctly, this function tries some parser and returns the consumed characters, if any.

parser_source doesn't mean much to me, what about try_parse, try_read, try_consume ... or even a take similar to your next take_while function?

It's called source in jparsec library, so I chose this name. Couldn't find quickly what's the parser name in the original Haskell parsec library, but nevermind, I'll simply inline the function since it's used only in one place.

tafia · 2018-04-30T08:02:51Z

src/parser.rs

+
+    /// No more chars
+    fn lexer_eof(&self) -> bool {
+        self.rem_chars().is_empty()


Not a big deal but I'd prefer a direct self.pos == self.input.len()

tafia · 2018-04-30T08:09:00Z

src/parser.rs

+        where F : FnOnce(&mut Parser) -> bool
+    {
+        let pos = self.pos;
+        match self.parser_opt(|parser| if f(parser) { Some(())} else { None }) {


self.parser_opt(|parser| if f(parser) { Some(()) } else { None })
    .map(|_| &self.input[pos..self.pos])

?

tafia · 2018-04-30T08:18:01Z

src/parser.rs

+        if rem.is_empty() {
+            None
+        } else {
+            let c = rem.chars().next().unwrap();


I believe we should have:

let (char_len, c) = rem.char_indices().next().unwrap(); // actually this might not be ok to unwrap here
self.pos += char_len;
// ...
} else {
    self.col += 1; // we keep 1 here
}

That doesn't work, because char_len is zero here.
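
The reason is that `char_indices` yields the starting byte offset of each character, not its length, so the first item always has offset 0. Below is a minimal sketch of one way to advance by the character's width instead, using `char::len_utf8`. The `advance` and `first_offset` helpers are hypothetical names for illustration, not code from the PR.

```rust
/// Hypothetical helper: returns the next char and the new byte position.
fn advance(input: &str, pos: usize) -> Option<(char, usize)> {
    let c = input[pos..].chars().next()?;
    // `len_utf8` is the number of bytes this char occupies in a `str`.
    Some((c, pos + c.len_utf8()))
}

/// Byte offset that `char_indices` reports for the first char of `s`.
fn first_offset(s: &str) -> Option<usize> {
    s.char_indices().next().map(|(i, _)| i)
}
```

This keeps `pos` on a character boundary even for multi-byte input, which slicing such as `&self.input[pos..]` requires.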

tafia · 2018-04-30T08:26:34Z

src/parser.rs

+    }
+
+    fn next_lexer_char(&mut self) -> ParserResult<char> {
+        match self.next_lexer_char_opt() {


self.next_lexer_char_opt().ok_or(ParserError::UnexpectedEof)

Now I know about ok_or function. Thank you!
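
For readers who haven't met it, `Option::ok_or` turns `None` into the given error and `Some(v)` into `Ok(v)`. Here is a minimal sketch with a stand-in error type; the `ParserError` enum and both functions are simplified assumptions, not the PR's actual definitions.

```rust
#[derive(Debug, PartialEq)]
enum ParserError {
    UnexpectedEof,
}

/// Stand-in for `next_lexer_char_opt`: first char of the remaining input.
fn next_char_opt(rem: &str) -> Option<char> {
    rem.chars().next()
}

/// Same result as a `match` with `Some(c) => Ok(c)` and `None => Err(..)`.
fn next_char(rem: &str) -> Result<char, ParserError> {
    next_char_opt(rem).ok_or(ParserError::UnexpectedEof)
}
```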

tafia · 2018-04-30T08:36:05Z

src/parser.rs

+    }
+
+    fn lookahead_lexer_char_is_in(&self, alphabet: &str) -> bool {
+        match self.lookahead_lexer_char() {


self.lookahead_lexer_char().map_or(false, |c| alphabet.contains(c))
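
Note that `map_or` takes a closure as its second argument. Here is a minimal sketch of the lookahead check written that way; the free function `lookahead_is_in` is a hypothetical stand-in for the method and takes the lookahead char as a parameter.

```rust
/// Hypothetical stand-in: `lookahead` is the next char, if any.
fn lookahead_is_in(lookahead: Option<char>, alphabet: &str) -> bool {
    // `None` (end of input) maps to `false`; otherwise test membership.
    lookahead.map_or(false, |c| alphabet.contains(c))
}
```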

stepancheg

Inline replies to your comments. I'll submit updated PR a bit later.

stepancheg · 2018-05-01T01:59:09Z

I tried to make it more readable and decided to try to rewrite it as lexer+parser. It will take a little time.

> Alternatively, I think, if you want, you should be an owner of that repo because, well, you've put a lot of work into it and have probably rewritten most of it!

If you intend to use it in quick-protobuf codegen, I'd appreciate if you make me an owner or give me push permissions so I could push simple changes without waiting for merge. However, I would still appreciate code review for larger changes like this one.

However, if you are not going to use this parser in quick-protobuf, I should simply merge it into the rust-protobuf project.

stepancheg · 2018-05-02T04:42:26Z

Updated with lexer+parser.

Geal · 2018-05-04T21:26:27Z

Hello!
I guess it's a bit late to try and convince you to keep the nom parser, but did you see https://github.com/fflorent/nom_locate ?
It solves the issue of getting line and column info for any part of the input that's returned, and for errors too.

stepancheg · 2018-05-06T08:15:48Z

@Geal I've seen nom-locate, but I couldn't easily understand how to obtain line and column number of error from it.

But the biggest issue is that I found nom parser (with programmable macros) to be too hard to use: documentation is not perfect, and reading the sources is hard (you cannot cmd-click on tag! to understand what it does). I think a hand-written parser is easier to work with.

stepancheg · 2018-05-06T08:31:51Z

I've copied contents of the PR into protobuf-codegen-pure crate to unblock development of new features. So protobuf-codegen-pure doesn't use protobuf-parser crate now. Maybe I'll switch to using protobuf-parser back later.

stepancheg force-pushed the nonom branch 4 times, most recently from fa980ea to 1691c4b (April 27, 2018 01:41)

stepancheg force-pushed the nonom branch from 1691c4b to 96367f2 (May 2, 2018 04:42)