Investigate if ECJ's parser can be simplified by incorporating overlooked last minute spec revisions #1045
Comments
Thanks for looking into this @srikanth-sankaran ! |
In an attempt to clarify what can/cannot be done I simplified the Java grammar to a minimum that still shows a remaining conflict: mini2.txt
In all the rest of the Java grammar, productions are chained in a way that anything on the RHS always has a reduced set of possibilities, ensuring that the right-most non-terminal will never branch back to the current LHS. In the reduced version As a result this expression is ambiguous:
The parser cannot decide which way to interpret (denoted by resulting AST):
IOW, this mini grammar (and transitively the Java grammar) cannot decide if the trailing I couldn't yet find documentation of all of those options of jikespg; does something there allow giving precedence to the "inner" rule? Semantically, multiplying a cast by some factor normally doesn't make sense, but when casting to a numerical boxing type this may still do something. Casting a lambda, however, will never produce a numerical type, as none of those are functional interfaces. I am not sure whether this can be turned into something useful at the syntax level ... |
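For readers without the mini grammar at hand, the same question can be observed in full Java. The sketch below assumes the ambiguous input has the shape "cast, lambda, multiplication" (the concrete expression is not reproduced in this thread); the class and method names are made up for illustration. Java resolves the choice by letting the lambda body extend as far as possible, so the trailing multiplication ends up inside the lambda:

```java
import java.util.function.Function;

class CastedLambdaBody {
    // Hypothetical helper: the cast applies to the whole lambda `x -> x * 2`.
    // The other reading, `((Function<Integer, Integer>) x -> x) * 2`, would not
    // type-check, because a functional interface is never a numeric type.
    static int demo() {
        Function<Integer, Integer> f = (Function<Integer, Integer>) x -> x * 2;
        return f.apply(21);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

An LALR(1) generator cannot use that type-level argument, which is why the choice has to be encoded in the shape of the grammar itself.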
The following change would fix the mini grammar, to be LALR(1):
--- mini.g
+++ mini.g
@@ -37,6 +37,7 @@
Expr -> Mul
Expr -> Lambda
+Expr -> CastL
Mul -> Unary
Mul ::= Mul MULTIPLY Unary
@@ -45,7 +46,7 @@
Unary -> Cast
Cast ::= LParen Identifier RParen Primary
-Cast ::= LParen Identifier RParen Lambda
+CastL ::= LParen Identifier RParen Lambda
Primary -> Identifier
Explanation: I pulled the casted-lambda rule out of the entire tree of productions applying some kind of arithmetic, assuming that you simply cannot "compose" a (casted) lambda with any other value. Would this illegally enable use of casted-lambda in a context where it's illegal? I think that could be compensated by either of
Indeed also the actual
PS: I could imagine that such a change would even be worth pushing upstream, i.e., to JLS. |
Java is funny. A casted lambda expression can actually be used as an operand in certain expressions. Need to check if we break anything regarding:
E.g.:
if (((Supplier<String>) () -> a) instanceof Supplier)
    System.out.print("yes");
if (((Supplier<String>) () -> a) != null)
    System.out.print("yes");
This is accepted by javac and prints "yesyes" :)
From experiments in this area I learned that Actually, ecj barfs on the above:
Let's hope that grammar simplification will avoid this unspeakable error message, and cause no further havoc. In fact I wonder if anyone would insist that ecj support checking a lambda for null, or performing instanceof checks on it. In both cases the lambda will be left to gc before it can ever be invoked. |
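To make the snippet above easy to try against both compilers, here is a self-contained version. The class name, the `check` method and its parameter are assumptions added for illustration; only the two `if` statements come from the comment.

```java
import java.util.function.Supplier;

class CastedLambdaOperands {
    // Hypothetical wrapper around the two statements from the comment.
    // A parenthesized, casted lambda is used as an operand of instanceof and of !=.
    static String check(String a) {
        StringBuilder out = new StringBuilder();
        if (((Supplier<String>) () -> a) instanceof Supplier)
            out.append("yes");
        if (((Supplier<String>) () -> a) != null)
            out.append("yes");
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(check("x")); // javac-compiled code prints "yesyes"
    }
}
```

Any grammar rewrite that moves the casted-lambda production has to keep accepting the parenthesized form in these operand positions.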
I have raised #1755 |
Thanks for the various experiments and the suggestions Stephan, I will study these. I am missing my Dragon book today, will have it in my hands tomorrow. |
Genie reminded me that we have some unfinished work still in gerrit:
Seeing mention of |
I don't know either, but if such options are not available I am not averse to building them at some point, on an experimental basis, to see where that would lead us. IIRC, YACC lets you declare precedence to settle a shift/reduce conflict between two rules (and where no precedence is specified it always prefers shift), and for reduce/reduce it always opts for the earlier listed rule.
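As a rough illustration of the YACC defaults recalled above, the following sketch models only the two fallback rules (prefer shift; otherwise reduce by the rule listed first). It ignores declared precedence entirely, and all names are hypothetical; it does not describe jikespg, which is the open question here.

```java
import java.util.List;

class YaccDefaults {
    // One action competing in a conflict: a shift, or a reduce by the rule
    // at the given position in the grammar file.
    static final class Action {
        final boolean shift;
        final int ruleIndex; // only meaningful for reduce actions

        private Action(boolean shift, int ruleIndex) {
            this.shift = shift;
            this.ruleIndex = ruleIndex;
        }

        static Action shift() { return new Action(true, -1); }
        static Action reduce(int ruleIndex) { return new Action(false, ruleIndex); }

        @Override public String toString() {
            return shift ? "shift" : "reduce " + ruleIndex;
        }
    }

    // Default resolution without precedence declarations:
    // a shift always wins; among reduces, the earliest listed rule wins.
    static Action resolve(List<Action> candidates) {
        if (candidates.isEmpty()) {
            throw new IllegalArgumentException("no candidate actions");
        }
        Action best = null;
        for (Action a : candidates) {
            if (a.shift) return a;
            if (best == null || a.ruleIndex < best.ruleIndex) best = a;
        }
        return best;
    }
}
```

For example, `resolve(List.of(Action.reduce(7), Action.shift()))` yields the shift, while two competing reduces by rules 7 and 3 yield the reduce by rule 3.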
Two big blockers for certain types of experiments in the area of parser/scanner are:
This method is a fantastic side-effect free mechanism to query the parser about its current state. It is axiomatic that an LALR parser will never shift on invalid input, so this is a great tool to steer the scan and the parse. Presently this method is used to disambiguate when It is possible to build this disambiguation into DiagnoseParser's internal token stream but that is some work.
In general building smarts falls in some buckets. (a) Feedback from the parser to the scanner: this allows consumed input to steer classification of unconsumed input. This is looking back in the rear view mirror. (b) Looking ahead to detect structure of what is to come to classify current token (perhaps inject some synthetics to steer the parser in a particular path) - this is Vanguard parser's forte. |
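Bucket (a) can be sketched as follows. This is a minimal illustration under stated assumptions, not ECJ's code: the token ids, the class name and the `willShift` predicate are hypothetical stand-ins for a side-effect-free "would the parser shift this token now?" query, and the word `yield` is chosen arbitrarily as an example of a contextual keyword.

```java
import java.util.function.IntPredicate;

class ContextualKeywordScanner {
    // Hypothetical token ids.
    static final int IDENTIFIER = 1;
    static final int KEYWORD_YIELD = 2;

    // willShift is an assumed query into the parser's current state.
    // Since an LALR parser never shifts on invalid input, a positive answer
    // means the keyword reading is viable at this point.
    static int classify(String text, IntPredicate willShift) {
        if (text.equals("yield") && willShift.test(KEYWORD_YIELD)) {
            return KEYWORD_YIELD;
        }
        return IDENTIFIER; // fall back to the plain identifier reading
    }
}
```

The scanner stays free of its own state flags here: everything it needs to know about context is asked of the parser at the moment of classification.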
I see that there are interesting tasks still ahead of us, but does any of what you mention here block the current grammar refactoring? |
I spent a little while on jikespg sources and tried if any options would have the desired effect, but to no avail.
and
And then my C-reading skills are not sufficient to figure out where such an option could be added on our own. Or perhaps it's my lack of understanding of what a parser generator actually does :) |
last minute spec revisions eclipse-jdt#1045 WIP regargding eclipse-jdt#1045
No, they don't block any work that eliminates conflicts in the first place by rewriting the grammar. I was jotting down the thoughts about what our options are for handling a conflict. |
@stephan-herrmann @mpalat @jarthana - I could use some feedback on the following question: I see that the handshake between the Parser and Scanner has been steadily growing. The comment above See for example, the state held by
Now there are two problems with this: one is that The comment in "// this.caseStartPosition > this.startPosition possible on recovery - bother only about correct ones." But if that is the approach we want to take - which may not be unreasonable - there is no need for so many state driven hacky handshakes.
The question I am asking is: Given that the state based handshake between the Parser and the Scanner won't work in DiagnoseParse and VanguardParse, why not ditch that and opt for a cleaner solution in disambiguating token by using See that (See also |
@srikanth-sankaran can we, pretty please, try to exercise some separation of concerns where it is still possible? This issue is about modifying the grammar to the end that hopefully less disambiguation / lookahead / scanner-parser communication is necessary. If you want to discuss how to improve the remaining kludges for handling conditional keywords and grammar ambiguities, please create a separate space for that. Maybe add a paragraph or two in the structure below https://github.com/eclipse-jdt/eclipse.jdt.core/wiki/ECJ ? Have you seen https://github.com/eclipse-jdt/eclipse.jdt.core/wiki/ECJ-Parse ? |
Point taken. I'll open a separate thread in a day or two. Thanks. |
Per observations from @stephan-herrmann here: #125 (comment)
we may have inadvertently overlooked last minute changes to Java 8 spec that may have allowed us to simplify the Parser/Scanner in ECJ.
See
https://bugs.eclipse.org/bugs/show_bug.cgi?id=561934#c4
https://bugs.eclipse.org/bugs/show_bug.cgi?id=561934#c10