
Workaround for lack of word boundaries #415

Open
ceronman opened this issue Aug 21, 2024 · 3 comments
Labels
question Further information is requested

Comments

@ceronman

Hi,

This is just a question, not a bug report. I'm using Logos to write a lexer for the C programming language. I'm following the book "Writing a C Compiler" by Nora Sandler.

Something unusual in the book is that the lexical grammar uses word boundaries. For example, identifiers and constants are defined as:

Identifier: [a-zA-Z_]\w*\b
Constant: [0-9]+\b

The reason for using word boundaries is that an input like 123foo should be a lexical error. I find this a bit unusual, because this kind of error is usually caught at the parser level, but here we're trying to catch it at the lexer level.
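(For readers unfamiliar with `\b` semantics, here is a minimal crate-free sketch of the check it implies; `boundary_after` is a hypothetical helper, not part of Logos or the book's code.)

```rust
// Sketch of the word-boundary check that `[0-9]+\b` implies:
// a digits match is only valid if the character right after the
// match is NOT a "word" character ([A-Za-z0-9_]).
fn boundary_after(input: &str, end: usize) -> bool {
    match input[end..].chars().next() {
        Some(c) => !(c.is_ascii_alphanumeric() || c == '_'),
        // End of input counts as a word boundary.
        None => true,
    }
}

fn main() {
    // "123foo": the digit run ends at index 3, but 'f' follows,
    // so `[0-9]+\b` fails and lexing should report an error.
    assert!(!boundary_after("123foo", 3));
    // "123 foo" and a trailing "123" are fine.
    assert!(boundary_after("123 foo", 3));
    assert!(boundary_after("123", 3));
}
```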

Using Logos I initially had something like this:

#[derive(Logos)]
enum TokenKind {
    // ...
    #[regex(r"[a-zA-Z_]\w*")]
    Identifier,

    #[regex(r"[0-9]+")]
    Constant,
    //..
}

But this of course has the problem that it would not catch the error mentioned above. I'm wondering what could be the best workaround for this use case. Currently, what I'm doing is the following:

#[derive(Logos)]
enum TokenKind {
    // ...
    #[regex(r"[a-zA-Z_]\w*")]
    Identifier,

    #[regex(r"[0-9]+")]
    Constant,
    
    #[regex(r"[0-9]+[a-zA-Z_]+", |_| None)]
    BadIdentifier,
    //..
}

This kinda works, but it looks a bit ugly to me. I'm using a callback that returns None to signal the error. I'm wondering if there is a better way to do that. I saw some examples using #[error], but it seems that doesn't work in newer versions of Logos.
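(The BadIdentifier workaround boils down to rejecting a digit run that is immediately followed by a word character. Here is a crate-free sketch of that logic in a hand-rolled mini-lexer; this is an illustration of the strategy, not Logos's API, and all names are made up.)

```rust
#[derive(Debug, PartialEq)]
enum Token {
    Identifier(String),
    Constant(String),
}

#[derive(Debug, PartialEq)]
struct LexError {
    at: usize,
}

// Hand-rolled mini-lexer: after consuming a run of digits, a following
// word character makes the whole run invalid, mirroring `[0-9]+\b`.
fn lex(input: &str) -> Result<Vec<Token>, LexError> {
    let bytes = input.as_bytes();
    let is_word = |b: u8| b.is_ascii_alphanumeric() || b == b'_';
    let mut out = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        let b = bytes[i];
        if b.is_ascii_whitespace() {
            i += 1;
            continue;
        }
        let start = i;
        if b.is_ascii_digit() {
            while i < bytes.len() && bytes[i].is_ascii_digit() {
                i += 1;
            }
            // The `\b` check: a word character right after the digits
            // turns the match into a lexical error.
            if i < bytes.len() && is_word(bytes[i]) {
                return Err(LexError { at: start });
            }
            out.push(Token::Constant(input[start..i].to_string()));
        } else if b.is_ascii_alphabetic() || b == b'_' {
            while i < bytes.len() && is_word(bytes[i]) {
                i += 1;
            }
            out.push(Token::Identifier(input[start..i].to_string()));
        } else {
            return Err(LexError { at: i });
        }
    }
    Ok(out)
}

fn main() {
    assert!(lex("123foo").is_err());
    assert_eq!(lex("foo 123").unwrap().len(), 2);
}
```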

Are there any better ideas?

Thanks in advance.

@jeertmans jeertmans added the question Further information is requested label Aug 21, 2024
@jeertmans
Collaborator

Hello @ceronman, thank you for this interesting question!

I think your approach is good, but let me suggest another way: define the whole grammar with tokens/regexes, and let the lexer return Err when it encounters something that does not match any valid token. I recommend taking a look at the JSON parser example, which shows how you can create complex logic when parsing elements. Ultimately, the parser will need some help to return a meaningful error if you want to indicate that you found an invalid identifier, and this usually requires knowing the current context (i.e., that you were expecting an identifier).

@martinstuder

@jeertmans I think @ceronman's question is how to make sure that an input like 123foo does not yield TokenKind::Constant followed by TokenKind::Identifier in the lexing phase, which is what Logos would do and is a problem I ran into recently myself. What I would like to achieve is for Logos to give a lexing error if a sequence of characters has not been fully consumed up to the next token delimiter (whatever defines a token delimiter, e.g. whitespace). The approach I am using at the moment is a callback that uses Lexer::remainder() to check that there are no remaining non-whitespace characters, and raises an error if there are. That brings me to another thing: if Lexer were peekable (unfortunately it isn't), you could peek at the next token in a callback and make sure that it is an expected delimiting token.
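(The remainder-based check described above can be sketched crate-free; in a real Logos callback you would receive a logos::Lexer and call lex.remainder() to get this string, but `constant_callback` here is a hypothetical stand-in, not actual Logos code.)

```rust
// After matching a constant, peek at the remaining input: if the very
// next character is a word character, the match should become an error.
fn constant_callback(remainder: &str) -> Result<(), ()> {
    match remainder.chars().next() {
        Some(c) if c.is_ascii_alphanumeric() || c == '_' => Err(()),
        _ => Ok(()),
    }
}

fn main() {
    assert_eq!(constant_callback("foo"), Err(())); // the "123foo" case
    assert_eq!(constant_callback(" foo"), Ok(())); // "123 foo" is fine
    assert_eq!(constant_callback(""), Ok(()));     // end of input is fine
}
```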

@jeertmans
Collaborator

Well @martinstuder, another solution is to have a lexer with tokens for every possible case you want to cover, and you can use callbacks if you want to return errors instead of tokens. In this example, you need to explicitly write a token that matches a wrong identifier. The issue is that you then need to handle special cases, like strings or code comments, which usually require very complex regexes if you want to rely only on tokens.
