This repository has been archived by the owner on Jun 3, 2021. It is now read-only.

Close #356 - Add whitespace token for preprocessor #437

Merged
25 commits merged on May 25, 2020

Conversation

hdamron17
Collaborator

When complete, this will close #356 and maybe #395.

So far, the whitespace token has been added and properly outputs when using -E, but I still need to consume the whitespace before it's used by the parser.

@jyn514
Owner

jyn514 commented May 15, 2020

I'd be interested if it helped with #361 as well.

@hdamron17
Collaborator Author

Observation: Need to add whitespace consideration back to preprocessor (e.g. #ifndef x fails because of the whitespace token in the middle) but ensure whitespace tokens have no newlines. Unit tests also fail a lot because the comparisons do not take whitespace into account.

- Don't set `seen_line_token` for whitespace
- Ignore whitespace in `#if` expressions (since the parser doesn't know what whitespace is)
- Run `cargo fmt`
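
A minimal sketch of the second bullet, assuming a token enum with a `Whitespace` variant as described in this PR. The `Token` variants and the `strip_whitespace` helper are hypothetical names for illustration, not rcc's actual definitions:

```rust
// Hypothetical token type; the real lexer has many more variants.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Ident(String),
    Literal(i64),
    Plus,
    Whitespace(String),
}

// Drop whitespace tokens before handing a token list to an expression
// parser that does not know what whitespace is (e.g. for `#if`).
fn strip_whitespace(tokens: Vec<Token>) -> Vec<Token> {
    tokens
        .into_iter()
        .filter(|t| !matches!(t, Token::Whitespace(_)))
        .collect()
}
```

With this shape, the tokens for `x + 1` (including the two whitespace tokens) reduce to the three tokens an expression parser expects.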
@jyn514
Owner

jyn514 commented May 20, 2020

Note this will not fix #395. The \n in the issue title refers to the actual characters \ and then n, not a newline.

@hdamron17 hdamron17 changed the title from "[WIP] Close #356 - Add whitespace token for preprocessor" to "Close #356 - Add whitespace token for preprocessor" May 23, 2020
@hdamron17
Collaborator Author

It's working now, but I should probably add some whitespace-dependent test cases for -E.

One possible issue that came up is that the parser iteration does not work if there is leading whitespace. However, it works in the final product so I don't think it matters. I changed the test case to have no leading whitespace in 8e0509e.

@hdamron17
Collaborator Author

Yeah it doesn't keep whitespace properly, mostly with preprocessor stuff since I was just trying to get existing tests to pass...

@jyn514
Owner

jyn514 commented May 23, 2020

> One possible issue that came up is that the parser iteration does not work if there is leading whitespace. However, it works in the final product so I don't think it matters. I changed the test case to have no leading whitespace in 8e0509e.

If I understand right, this means that it works correctly using check_semantics but that passing a whitespace token to Parser as first will give a spurious error? That seems fine as long as it's documented.

src/lex/cpp.rs Outdated
vec![x, Ok(Token::Whitespace(String::from(" ")))]
}
})
.flatten()
Owner

Why do you add these spaces here in the define? Does tokens_until_newline not return whitespace tokens?

Collaborator Author

tokens_until_newline (at the moment) does not include whitespace tokens. I did not want to change it since it is somewhat separate from the preprocessor. I can try adding whitespace tokens to it and see what happens now that it is in a stable state. I did notice that clang only puts one space when replacing preprocessor defines, regardless of the original spacing.
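
One way to get the clang-like result of a single space regardless of the original spacing is to collapse each run of whitespace tokens into one `" "` token. This is only a sketch under assumptions: the `Token` type and the `normalize_spacing` helper are hypothetical and are not taken from rcc:

```rust
// Hypothetical token type for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Ident(String),
    Whitespace(String),
}

// Collapse every run of whitespace tokens into a single one-space token.
fn normalize_spacing(tokens: Vec<Token>) -> Vec<Token> {
    let mut out: Vec<Token> = Vec::with_capacity(tokens.len());
    for tok in tokens {
        match tok {
            Token::Whitespace(_) => {
                // Only emit a space if the previous token was not whitespace.
                if !matches!(out.last(), Some(Token::Whitespace(_))) {
                    out.push(Token::Whitespace(" ".to_string()));
                }
            }
            other => out.push(other),
        }
    }
    out
}
```

Unlike appending a space after every token, this keeps adjacent tokens adjacent when the source had no whitespace between them.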

Owner

Hmm interesting ... I suppose since the behavior is correct we can try to go back and improve the spacing later.

Owner

Also, please add a comment to this effect either here or at tokens_until_newline.

Collaborator Author

I actually went back and did it the proper way by changing tokens_until_newline. Unfortunately I had to rework some stuff for boolean_expr because whitespace shows up in the replacement stage. I'll push as soon as I double-check the tests. Also, I think the rework will be much less of an eyesore.

Owner

@jyn514 jyn514 left a comment

Very minor nits, overall this looks great :) I would like to see tests for a/* */b and maybe a few other things though.

@hdamron17
Collaborator Author

Some more tests with preprocessor stuff would be nice, but other than that, I think it's complete.

@hdamron17
Collaborator Author

Also, newlines are not preserved for preprocessor macros at the moment, so I'll need to fix that.

@jyn514
Owner

jyn514 commented May 24, 2020

> Also, newlines are not preserved for preprocessor macros at the moment, so I'll need to fix that.

Do you mean that \\\n isn't preserved?

#define f(a) { \
	int b = a; \
	return b; \
}
f(1)

preprocesses to

 { int b = 1; return b; }

If so, that's fine, clang does the same. It will also be hard to do without major changes since deleting \\\n happens very early in the lexer: https://github.com/jyn514/rcc/blob/93b5e06/src/lex/mod.rs#L98
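
For context, the C standard's translation phase 2 deletes each backslash that is immediately followed by a newline, which is why the continuation lines vanish before macro expansion ever sees them. A standalone simplification of that step (the function name `splice_continuations` is made up here, and this is not how rcc's lexer is structured) could look like:

```rust
// Delete every backslash-newline pair, joining continued lines.
// Simplification: works on the whole source string at once and does not
// track how many newlines were removed (a real lexer needs that for
// accurate line numbers in diagnostics).
fn splice_continuations(src: &str) -> String {
    src.replace("\\\n", "")
}
```

Applied to the `#define f(a)` example above, the four physical lines of the macro become one logical line, so no newline is left to preserve in the expansion.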

fn preprocess_only() {
assert_same_exact("int \t\n\r main() {}", "int \t\n\r main() {}");
assert_same_exact("int/* */main() {}", "int main() {}");
assert_same_exact("int/*\n\n\n*/main() {}", "int\n\n\nmain() {}");
Owner

Hmm, this behavior looks a little confusing. Is there a reason you kept newlines inside of block comments?

Collaborator Author

This seems to be the behavior of clang. For example,

int main() {
 /*




 */
}

preprocesses to

# 1 "test.c"
# 1 "<built-in>" 1
# 1 "<built-in>" 3
# 363 "<built-in>" 3
# 1 "<command line>" 1
# 1 "<built-in>" 2
# 1 "test.c" 2
int main() {






}

Collaborator Author

I may need to look at the documentation though since

int main() {/*




*/}

mysteriously preprocesses to

# 1 "test.c"
# 1 "<built-in>" 1
# 1 "<built-in>" 3
# 363 "<built-in>" 3
# 1 "<command line>" 1
# 1 "<built-in>" 2
# 1 "test.c" 2
int main() { }

Owner

@jyn514 jyn514 May 25, 2020

Relevant part of the standard (5.1.1.2 Translation phases):

> The source file is decomposed into preprocessing tokens and sequences of white-space characters (including comments). A source file shall not end in a partial preprocessing token or in a partial comment. Each comment is replaced by one space character. New-line characters are retained. Whether each nonempty sequence of white-space characters other than new-line is retained or replaced by one space character is implementation-defined.

This seems to me to say that clang's behavior in your first example is correct, and the behavior in the second is a bug. For context, gcc always behaves like clang in the second example, i.e. it always deletes newlines in comments:

$ gcc -x c -E -P -
int main() {
 /*




 */
}

preprocesses to

int main() {
}

and

int main() {/*




*/}

preprocesses to int main() { }.

Owner

tcc behaves the same as gcc.
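
To make the two behaviors in this thread concrete, here is a rough sketch with a flag that switches between them. With `keep_newlines = true`, a comment that contains newlines is replaced by just those newlines, matching the `preprocess_only` test above; with `false`, every comment becomes one space, as gcc and tcc do. The function name and flag are hypothetical, and the sketch ignores string literals and `//` comments:

```rust
// Replace each `/* ... */` comment in `src`.
fn replace_block_comments(src: &str, keep_newlines: bool) -> String {
    let mut out = String::with_capacity(src.len());
    let mut rest = src;
    while let Some(start) = rest.find("/*") {
        out.push_str(&rest[..start]);
        let after_open = &rest[start + 2..];
        // An unterminated comment is an error in real C; this sketch
        // simply treats it as running to the end of the input.
        let (body, remainder) = match after_open.find("*/") {
            Some(end) => (&after_open[..end], &after_open[end + 2..]),
            None => (after_open, ""),
        };
        let newlines = body.matches('\n').count();
        if keep_newlines && newlines > 0 {
            out.push_str(&"\n".repeat(newlines));
        } else {
            out.push(' ');
        }
        rest = remainder;
    }
    out.push_str(rest);
    out
}
```

Keeping the newlines has the practical benefit that line numbers in the `-E` output still line up with the original file without extra line markers.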

@hdamron17
Collaborator Author

> > One possible issue that came up is that the parser iteration does not work if there is leading whitespace. However, it works in the final product so I don't think it matters. I changed the test case to have no leading whitespace in 8e0509e.
>
> If I understand right, this means that it works correctly using check_semantics but that passing a whitespace token to Parser as first will give a spurious error? That seems fine as long as it's documented.

Turns out it was an easy fix. I just changed the parser function to use next_non_whitespace.
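
A sketch of what a helper like next_non_whitespace could look like. Only the name comes from the comment above; the signature and the `Token` type are assumptions made for illustration:

```rust
// Hypothetical token type for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Ident(String),
    Whitespace(String),
}

// Advance the iterator past any whitespace tokens and return the first
// token that is not whitespace, or None if the input is exhausted.
fn next_non_whitespace<I>(tokens: &mut I) -> Option<Token>
where
    I: Iterator<Item = Token>,
{
    tokens.find(|t| !matches!(t, Token::Whitespace(_)))
}
```

Seeding a parser's first token through a helper of this shape means leading whitespace is skipped instead of being reported as an unexpected token.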

@hdamron17
Collaborator Author

> > Also, newlines are not preserved for preprocessor macros at the moment, so I'll need to fix that.
>
> Do you mean that \\\n isn't preserved?

No, I mean #defines do not keep newlines after.

int main() {
#define a
#define b
#define c
}

preprocesses to

int main() {
}

but should preprocess to

# 1 "test.c"
# 1 "<built-in>" 1
# 1 "<built-in>" 3
# 363 "<built-in>" 3
# 1 "<command line>" 1
# 1 "<built-in>" 2
# 1 "test.c" 2
int main() {



}
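
The line-preserving idea can be illustrated in isolation: each directive line contributes nothing to the output except its terminating newline. The sketch below is deliberately naive (it treats any line starting with `#` as a directive to blank out, does no macro expansion or conditional handling, and emits no line markers), and `blank_directives` is a made-up name:

```rust
// Replace every line that starts with `#` (after optional indentation)
// by an empty line, keeping the line count of the input unchanged.
fn blank_directives(src: &str) -> String {
    src.split_inclusive('\n')
        .map(|line| {
            if line.trim_start().starts_with('#') {
                // Keep only the newline, if the line had one.
                if line.ends_with('\n') { "\n" } else { "" }
            } else {
                line
            }
        })
        .collect()
}
```

For the three `#define` lines in the example, this leaves three empty lines between `int main() {` and `}`.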

@hdamron17
Collaborator Author

The newlines not being preserved after some macros is because tokens_until_newline does not actually stop at the newline but just consumes arbitrary whitespace until the line number increases. For example,

int main() {
#define a 1
           return a;
}

preprocesses to

int main() {
return 1;
}

@jyn514
Owner

jyn514 commented May 25, 2020

> tokens_until_newline does not actually stop at the newline but just consumes arbitrary whitespace until the line number increases

I don't see a way to fix this without fully separating the preprocessor from the lexer ... I guess I could make that function part of the lexer instead and stop at \n? That would require a fair bit of refactoring but not an unreasonable amount.

@jyn514
Owner

jyn514 commented May 25, 2020

In any case I think it can be fixed later, the majority of things are working now.

@jyn514
Owner

jyn514 commented May 25, 2020

r=me once @hdamron17 is happy with it

@hdamron17
Collaborator Author

I went ahead and changed tokens_until_newline and it helps with #defines. All that's missing now are #else and #endif I think so I'll try to get those working and make enough tests to be worthwhile.

Owner

@jyn514 jyn514 left a comment

Summarizing the changes to make sure I understand:

Code changes

  • Add a whitespace token
  • Fix up pretty printing, etc.
  • Return whitespace tokens from consume_whitespace, etc.
  • Add consume_whitespace_oneline, which only consumes whitespace on the current line. Previously, the lexer would follow whitespace as long as it saw it, even if it went across many lines. This fixes #394 ([ICE] the lexer and the preprocessor have trouble getting along).
  • consume_whitespace_oneline works by returning an error if the whitespace had a newline. This is an error since consume_whitespace_oneline is only called for preprocessor directives, which must always be on the same line. This is really clever, I don't know if I would have thought of it :)
  • Change tests to filter out whitespace by default; add some tests that don't ignore whitespace and make sure that bit works properly

Behavior

  • Keep all newlines within comments
  • Replace comments with a single space
  • Print whitespace when passed -E

Let me know if I missed anything :)
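
The consume_whitespace_oneline idea from the summary above can be sketched roughly as follows. Only the function name comes from the PR; the character-iterator signature, the `UnexpectedNewline` error type, and the choice to leave the newline unconsumed are assumptions for illustration:

```rust
use std::iter::Peekable;
use std::str::Chars;

// Hypothetical error type: the whitespace run reached a newline.
#[derive(Debug, PartialEq)]
struct UnexpectedNewline;

// Consume whitespace on the current line only. Hitting a newline is
// reported as an error, since a preprocessor directive must fit on one
// (logical) line. The newline itself is left in the iterator.
fn consume_whitespace_oneline(
    chars: &mut Peekable<Chars<'_>>,
) -> Result<String, UnexpectedNewline> {
    let mut seen = String::new();
    while let Some(&c) = chars.peek() {
        if c == '\n' {
            return Err(UnexpectedNewline);
        } else if c.is_whitespace() {
            seen.push(c);
            chars.next();
        } else {
            break;
        }
    }
    Ok(seen)
}
```

Returning the consumed whitespace in the `Ok` case lets the caller turn it into a whitespace token, which is what `-E` needs in order to reproduce the original spacing.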

@jyn514 jyn514 linked an issue May 25, 2020 that may be closed by this pull request
@hdamron17
Collaborator Author

r @jyn514

@hdamron17
Collaborator Author

Your summary sounds about right. Also the behaviour of consume_whitespace_oneline was just copied from your code so I guess you're the clever one. :)

@jyn514 jyn514 merged commit ae5ac81 into jyn514:master May 25, 2020
@hdamron17 hdamron17 deleted the preprocessor-whitespace branch May 25, 2020 22:04
Development

Successfully merging this pull request may close these issues.

[ICE] the lexer and the preprocessor have trouble getting along
Remember whitespace for -E
2 participants