Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite parser without nom #14

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Conversation

stepancheg
Copy link
Contributor

@stepancheg stepancheg commented Apr 26, 2018

I was struggling with nom lack of good diagnostic messages, and I
decided that it's easier to rewrite parser without nom.

The new parser in particular properly reports error location (line and
column) which was a big pain of old parser (fixes #13).

New parser correctly accepts syntax which is specified in syntax = ... header.

New parser successfully parses (although doesn't check for correctness)
all protobuf .proto files:

cargo run --example test_against_protobuf_protos ~/devel/protobuf/

I was struggling with nom lack of good diagnostic messages, and I
decided that it's easier to rewrite parser without nom.

New parser in particular properly reports error location (line and
column) which was a big pain of old parser.

New parser successfully parses (although doesn't check for correctness)
all protobuf `.proto` files:

```
cargo run --example test_against_protobuf_protos ~/devel/protobuf/
```
@stepancheg stepancheg force-pushed the nonom branch 4 times, most recently from fa980ea to 1691c4b Compare April 27, 2018 01:41
@tafia
Copy link
Owner

tafia commented Apr 30, 2018

Wow. This is a lot of work!

I agree that not using nom might be better (performance is definitely not an issue) for support purpose.
I do have several comments though, some esthetics (e.g. function names not consistent ...), some more practical (I think parser should implement Iterator<Item=char> which is a line/column augmented version of std::str::char_indices).

I am wondering if making a PR on your PR is better or not.

Alternatively, I think, if you want, you should be owner of that repo because well, you've put lot of work in it and have probably rewritten most of it!

Thanks again for all your work.

Copy link
Owner

@tafia tafia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't finished reviewing but I figured out I'd better start discussing about some ideas first.

I think we can probably clean some things by reusing the Chars iterator a little more. Something like this:

use std::str::Chars;

const LINE_START: u32 = 1;

#[derive(Clone)]
pub struct CharLineCols<'a> {
    chars: Chars<'a>,
    line: u32,
    col: u32,
}

impl<'a> CharLineCols<'a> {
    pub fn new(input: &'a str) -> Self {
        CharLineCols {
            chars: input.chars(),
            line: LINE_START,
            col: 0,
        }
    }

    pub fn lexer_eof(&self) -> bool {
        self.chars.as_str().is_empty()
    }

    pub fn parser_source<F>(&mut self, f: F) -> Option<&'a str>
        where F : FnOnce(&mut CharLineCols) -> bool
    {
        let start = self.chars.as_str();
        let mut clone = self.clone();
        if f(&mut clone) {
            *self = clone;
            let end = self.chars.as_str();
            let consumed = start.len() - end.len(); // same logic as `std::char::char_indices`
            return Some(&start[..consumed]);
        }
        None
    }

    pub fn take_while<F>(&mut self, f: F) -> &'a str
        where F : Fn(char) -> bool
    {
        let start = self.chars.as_str();
        let mut peek = self.chars.clone();
        while peek.next().map_or(false, |c| f(c)) {
            self.next();
        }
        let end = self.chars.as_str();
        let consumed = start.len() - end.len();
        &start[..consumed]
    }

    pub fn next_if<P>(&mut self, p: P) -> Option<char>
        where P : FnOnce(char) -> bool
    {
        let mut clone = self.chars.clone();
        if let Some(c) = clone.next() {
            if p(c) {
                return self.next();
            }
        }
        None
    }

    pub fn skip_if_lookahead_is_str(&mut self, s: &str) -> bool {
        assert!(s.len() > 0);
        if self.chars.as_str().starts_with(s) {
            for _ in s.chars() {
                self.next();
            }
            true
        } else {
            false
        }
    }
}

impl<'a> Iterator for CharLineCols<'a> {
    type Item = char;
    fn next(&mut self) -> Option<char> {
        self.chars.next()
            .map(|c| {
                if c == '\n' {
                    self.col = 0;
                    self.line += 1;
                } else {
                    self.col += 1;
                }
                c
            })
    }
}

src/parser.rs Outdated
}

/// Apply a parser and return a string which matched
fn parser_source<F>(&mut self, f: F) -> Option<&'a str>
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand it correctly, this functions tries some parser and returns the consumes characters, if any.

parser_source doesn't mean much to me, what about try_parse, try_read, try_consume ... or even a take similar to your next take_while function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's called source in jparsec library, so I chose this name. Couldn't find quickly what's the parser name in the original Haskell parsec library, but nevermind, I'll simply inline the function since it's used only in one place.

src/parser.rs Outdated

/// No more chars
fn lexer_eof(&self) -> bool {
self.rem_chars().is_empty()
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a big deal but I'd prefer a direct self.pos == self.input.len()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

src/parser.rs Outdated
where F : FnOnce(&mut Parser) -> bool
{
let pos = self.pos;
match self.parser_opt(|parser| if f(parser) { Some(())} else { None }) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.parser_opt(|parser| if f(parser) { Some(())} else { None })
    .map(|_| &self.input[pos..self.pos])

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if rem.is_empty() {
None
} else {
let c = rem.chars().next().unwrap();
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we should have:

let (char_len, c) = rem.char_indices().next().unwrap(); // actually this might not be ok to unwrap here
self.pos += char_len;
// ...
} else {
    self.col += 1; // we keep 1 here
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't work, because char_len is zero here.

src/parser.rs Outdated
}

fn next_lexer_char(&mut self) -> ParserResult<char> {
match self.next_lexer_char_opt() {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.next_lexer_char_opt().ok_or(ParserError::UnexpectedEof)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I know about ok_or function. Thank you!

src/parser.rs Outdated
}

fn lookahead_lexer_char_is_in(&self, alphabet: &str) -> bool {
match self.lookahead_lexer_char() {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.lookahead_lexer_char().map_or(false, alphabet.contains(c))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

Copy link
Contributor Author

@stepancheg stepancheg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inline replies to your comments. I'll submit updated PR a bit later.

@stepancheg
Copy link
Contributor Author

I tried to make it more readable and decided to try to rewrite it as lexer+parser. It will take a little time.

Alternatively, I think, if you want, you should be owner of that repo because well, you've put lot of work in it and have probably rewritten most of it!

If you intend to use it in quick-protobuf codegen, I'd appreciate if you make me an owner or give me push permissions so I could push simple changes without waiting for merge. However, I would still appreciate code review for larger changes like this one.

However, if you are not going to use this parser in quick-protobuf, I should simply merge it into the rust-protobuf project.

@stepancheg
Copy link
Contributor Author

Updated with lexer+parser.

@Geal
Copy link

Geal commented May 4, 2018

Hello!
I guess it's a bit late to try and convince you to keep the nom parser, but did you see https://github.com/fflorent/nom_locate ?
It solves the issue of getting line and column info for any part of the input that's returned, and for errors too.

@stepancheg
Copy link
Contributor Author

@Geal I've seen nom-locate, but I couldn't easily understand how to obtain line and column number of error from it.

But the biggest issue is that I found nom parser (with programmable macros) to be too hard to use: documentation is not perfect, and reading the sources is hard (you cannot cmd-click on tag! to understand what it does). I think a hand-written parser is easier to work with.

@stepancheg
Copy link
Contributor Author

I've copied contents of the PR into protobuf-codegen-pure crate to unblock development of new features. So protobuf-codegen-pure doesn't use protobuf-parser crate now. Maybe I'll switch to using protobuf-parser back later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Include line:column in error
3 participants