Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate if ECJ's parser be simplified by incorporating overlooked last minute spec revisions #1045

Open
srikanth-sankaran opened this issue May 10, 2023 · 16 comments
Assignees

Comments

@srikanth-sankaran
Copy link
Contributor

Per observations from @stephan-herrmann here: #125 (comment)
we may have inadvertently overlooked last minute changes to Java 8 spec that may have allowed us to simplify the Parser/Scanner in ECJ.

See

https://bugs.eclipse.org/bugs/show_bug.cgi?id=561934#c4
https://bugs.eclipse.org/bugs/show_bug.cgi?id=561934#c10

@trancexpress
Copy link
Contributor

Thanks for looking into this @srikanth-sankaran !

@stephan-herrmann
Copy link
Contributor

In an attempt to clarify what can/cannot be done I simplified the Java grammar to a minimum that still shows a remaining conflict:

mini2.txt
Rename this file to mini.g and pass it to jikespg, it will show the dreaded

*** Shift/reduce conflict on "MULTIPLY" with rule 2

This grammar is not LALR(1).

In all the rest of the Java grammar, productions are chained in a way that anything on the RHS always has a reduced set of possibilities, ensuring that the right-most non-terminal will never branch back to the current LHS.

In the reduced version Cast ::= LParent Identifier RParent Lambda violates that rule, because Lambda may end with a generic Expr.

As a result this expression is ambiguous:

(Type) () -> a * b

The parser cannot decide which way to interpret (denoted by resulting AST):

  • Mul ( Cast ( Lambda (Identifier)), Identifier )
  • Cast ( Lambda ( Mul (Identifier, Identifier) ) )

IOW, this mini grammar (and transitively the Java grammar) cannot decide if the trailing *b binds to the inner expression a or to the entire expression so far (Type) () -> a.

I couldn't yet find documentation of all of those options of jikespg, if something there allows to define precedence to the "inner" rule?

Semantically, multiplying a cast with some factor normally doesn't make sense, but when casting to a numerical boxing type this may still do something.

Casting a lambda, however, will never produce a numerical type, as none of those are functional interfaces, not sure if this can be turned into something useful at the syntax level ...

@stephan-herrmann
Copy link
Contributor

stephan-herrmann commented Dec 17, 2023

The following change would fix the mini grammar, to be LALR(1):

--- mini.g
+++ mini.g
@@ -37,6 +37,7 @@
 
 Expr -> Mul
 Expr -> Lambda
+Expr -> CastL
 
 Mul -> Unary
 Mul ::= Mul MULTIPLY Unary
@@ -45,7 +46,7 @@
 Unary -> Cast
 
 Cast ::= LParen Identifier RParen Primary
-Cast ::= LParen Identifier RParen Lambda
+CastL ::= LParen Identifier RParen Lambda
 
 Primary -> Identifier

Explanation: I pulled the casted-lambda rule out of the entire tree of productions applying some kind of arithmetic, assuming that you simply cannot "compose" a (casted) lambda with any other value.

Would this illegally enable use of casted-lambda in a context where it's illegal? I think that could be compensated by either of

  • find a more suitable nonterminal to hook into
  • detect illegal use later during compilation

Indeed also the actual java.g can be disambiguated using this trick 🎉 😄 ✨

PS: I could imagine that such change would even be worth pushing upstream, i.e., to JLS.

@stephan-herrmann
Copy link
Contributor

stephan-herrmann commented Dec 17, 2023

Java is funny.

A casted lambda expression can actually be used as an operand in certain expressions.

Need to check if we break anything regarding:

  • instanceof
  • != and == comparisons
  • ternaries
  • assignment expressions

E.g.:

  	if (((Supplier<String>) () -> a) instanceof Supplier)
	    System.out.print("yes");
  	if (((Supplier<String>) () -> a) != null)
	    System.out.print("yes");

This is accepted by javac and prints "yesyes" :)

From experiments in this area I learned that instanceof normally binds to the inner expression, to bind it to the lambda parens are needed. This could mean that these rules will never directly branch into (casted) lambda but need ( Expression ) to get there, in which case the parens will disambiguate.

Actually, ecj barfs on the above:

1. ERROR in /tmp/X.java (at line 13)
        if (((Supplier<String>) () -> "a") != null)
                                ^^^^^^^^^
Missing code implementation in the compiler

Let's hope that grammar simplification will avoid this unspeakable error message, and causes no further havoc. In fact I wonder if anyone would insist that ecj supports checking a lambda for null, or performing instanceof checks. In both cases the lambda will be left to gc before it can ever be invoked.

@srikanth-sankaran
Copy link
Contributor Author

if (((Supplier<String>) () -> a) instanceof Supplier)
	    System.out.print("yes");
  	if (((Supplier<String>) () -> a) != null)
	    System.out.print("yes");

I have raised #1755

@srikanth-sankaran
Copy link
Contributor Author

Thanks for the various experiments and the suggestions Stephan, I will study these. I am missing my Dragon book today, will have it in my hands tomorrow.

@stephan-herrmann
Copy link
Contributor

Genie reminded me that we have some unfinished work still in gerrit:

Seeing mention of fallBackToSpringForward() and friends, I guess we want to wait for the grammar simplification, before polishing this code area. Perhaps some of the magic will no longer be needed at some point :)

@srikanth-sankaran
Copy link
Contributor Author

In an attempt to clarify what can/cannot be done I simplified the Java grammar to a minimum that still shows a remaining conflict:

mini2.txt Rename this file to mini.g and pass it to jikespg, it will show the dreaded

*** Shift/reduce conflict on "MULTIPLY" with rule 2

This grammar is not LALR(1).

In all the rest of the Java grammar, productions are chained in a way that anything on the RHS always has a reduced set of possibilities, ensuring that the right-most non-terminal will never branch back to the current LHS.

In the reduced version Cast ::= LParent Identifier RParent Lambda violates that rule, because Lambda may end with a generic Expr.

As a result this expression is ambiguous:

(Type) () -> a * b

The parser cannot decide which way to interpret (denoted by resulting AST):

  • Mul ( Cast ( Lambda (Identifier)), Identifier )
  • Cast ( Lambda ( Mul (Identifier, Identifier) ) )

IOW, this mini grammar (and transitively the Java grammar) cannot decide if the trailing *b binds to the inner expression a or to the entire expression so far (Type) () -> a.

I couldn't yet find documentation of all of those options of jikespg, if something there allows to define precedence to the "inner" rule?

I don't know either - but I am not averse to, (at some point) - if such an options are not available, building them on an experimental basis to see where that would lead us. IIRC, YACC allows you to say precedence between two rules for shift/reduce (and where no precedence is specified always prefers shift) and for reduce/reduce it always opt for the earlier listed rule.

Semantically, multiplying a cast with some factor normally doesn't make sense, but when casting to a numerical boxing type this may still do something.

Casting a lambda, however, will never produce a numerical type, as none of those are functional interfaces, not sure if this can be turned into something useful at the syntax level ...

@srikanth-sankaran
Copy link
Contributor Author

Two big blockers for certain types of experiments in the area of parser/scanner are:

  1. The DiagnoseParser with its too much voodoo in the works. The fact that there are multiple machines that are run in parallel with too much of the automaton's innards and tables being manipulated and more importantly, the fact that tokens are read ahead of time and buffered into a stream mean that we can never have a satisfactory implementation of org.eclipse.jdt.internal.compiler.parser.diagnose.DiagnoseParser.automatonWillShift(int) (that can be useful for the scanner)

This method is a fantastic side-effect free mechanism to query the parser of its current state - It is axiomatic that an LALR parser will never shift on invalid input - so this is a great tool to steer the scan and the parse.

Presently this method is used to disambiguate when when is a contextual keyword vs a plain old identifier. Ask the parser if it will shift the keyword when and if it answers yes, then the when in the input stream is a contextual keyword and otherwise otherwise.

It is possible to build this disambiguation into DiagnoseParser's internal token stream but that is some work.

  1. CompletionParser which works with incomplete evolving, broken code.

In general building smarts falls in some buckets.

(a) Feedback from the parser to the scanner: this allows consumed input to steer classification of unconsumed input. This is looking back in the rear view mirror. (b) Looking ahead to detect structure of what is to come to classify current token (perhaps inject some synthetics to steer the parser in a particular path) - this is Vanguard parser's forte.
and
(c) The now abandoned String Templates work had a brand new way of dealing conflicts in the grammar - by making some non-terminals symbols (with their own grammatical structure) appear as (pseudo) terminal symbols. Unfortunately before we could gain good hands on experience with that model, we had to rip it out.

@stephan-herrmann
Copy link
Contributor

Two big blockers for certain types of experiments in the area of parser/scanner are:

I see that there are interesting tasks still ahead of us, but does any of what you mention here block the current grammar refactoring?

@stephan-herrmann
Copy link
Contributor

I couldn't yet find documentation of all of those options of jikespg, if something there allows to define precedence to the "inner" rule?

I don't know either - but I am not averse to, (at some point) - if such an options are not available, building them on an experimental basis to see where that would lead us. IIRC, YACC allows you to say precedence between two rules for shift/reduce (and where no precedence is specified always prefers shift) and for reduce/reduce it always opt for the earlier listed rule.

I spent a little while on jikespg sources and tried if any options would have the desired effect, but to no avail.
The most promising option name was SHIFT-DEFAULT, but this flag seems only to enable an optimization, whereby for each terminal a default shift is determined, such that this default can be eliminated from the state automaton. I guessed this from these code comments:

/****************************************************************************/
/*                            COMPUTE_SHIFT_DEFAULT:                        */
/****************************************************************************/
/* This procedure updates the vector SHIFTDF, indexable by the terminals in */
/* the grammar. Its task is to assign to each element of SHIFTDF, the action*/
/* most frequently defined on the symbol in question.                       */
/****************************************************************************/

and

/*****************************************************************************/
/*                             COMPUTE_GOTO_DEFAULT:                         */
/*****************************************************************************/
/*   COMPUTE_GOTO_DEFAULT constructs the vector GOTODEF, which is indexed by */
/* the non-terminals in the grammar. Its task is to assign to each element   */
/* of the array the Action which is most frequently defined on the symbol in */
/* question, and remove all such actions from the state automaton.           */
/*****************************************************************************/

And then my C-reading skills are not sufficient to figure out where such an option could be added on our own. Or perhaps it's my lack of understanding of what a parser generator actually does :)

@srikanth-sankaran
Copy link
Contributor Author

Two big blockers for certain types of experiments in the area of parser/scanner are:

I see that there are interesting tasks still ahead of us, but does any of what you mention here block the current grammar refactoring?

No, they don't block any work that eliminates conflicts in the first place by rewriting the grammar. I was jotting down the thoughts about what our options are for handling a conflict.

@srikanth-sankaran
Copy link
Contributor Author

@stephan-herrmann @mpalat @jarthana - Can use some feedback on this following question:

I see that handshake between the Parser and Scanner has been steadily growing. The comment above org.eclipse.jdt.internal.compiler.parser.Scanner.VanguardParser.parse(Goal) clearly says: "Canonical LALR pushdown automaton identical to Parser.parse() minus side effects of any kind" but many consume* methods of the parser have started introducing side effects to clue in the scanner about where they are in the parse.

See for example, the state held by

org.eclipse.jdt.internal.compiler.parser.Scanner.caseStartPosition,
org.eclipse.jdt.internal.compiler.parser.Scanner.yieldColons
org.eclipse.jdt.internal.compiler.parser.Scanner.multiCaseLabelComma

Now there are two problems with this: one is that VanguardParser doesn't execute the consume* methods so the side effects will be missing. Secondly DiagnoseParser won't execute the consume* methods so the handshake between the Scanner and Parser will be absent in DiagnoseParse which means in the presence of syntax errors, the scanner will likely be clueless while tokenizing some advanced constructs that need cooperation from parser and scanner.

The comment in org.eclipse.jdt.internal.compiler.parser.Scanner.VanguardScanner.getNextToken() is interesting:

"// this.caseStartPosition > this.startPositionpossible on recovery - bother only about correct ones."

But if that is the approach we want to take - which may not be unreasonable - there is no need for so many state driven hacky handshakes.

org.eclipse.jdt.internal.compiler.parser.ConflictedParser.automatonWillShift(int) may help us address many of situations without using Vanguard parser : see #456 for one instance of what I mean

@srikanth-sankaran
Copy link
Contributor Author

@stephan-herrmann @mpalat @jarthana - Can use some feedback on this following question:

The question I am asking is: Given that the state based handshake between the Parser and the Scanner won't work in DiagnoseParse and VanguardParse, why not ditch that and opt for a cleaner solution in disambiguating token by using org.eclipse.jdt.internal.compiler.parser.ConflictedParser.automatonWillShift(int) where possible which approach would also not work with the DiagnoseParse (without heroic effort ... actually scratch that - without super-heroic effort) but which may render a VanguardParse unnecessary in that scenario.

See that org.eclipse.jdt.internal.compiler.parser.Scanner.disambiguatedRestrictedIdentifierWhen(int) does not run the VanguardParser to disambiguate when -

(See also org.eclipse.jdt.internal.compiler.parser.diagnose.DiagnoseParser.automatonWillShift(int) which is clueless how to answer the question - the comment in the earlier method org.eclipse.jdt.internal.compiler.parser.diagnose.DiagnoseParser.atConflictScenario(int)) is applicable here too)

@stephan-herrmann
Copy link
Contributor

@srikanth-sankaran can we, pretty please, try to exercise some separation of concerns where it is still possible?

This issue is about modifying the grammar to the end that hopefully less disambiguation / lookahead / scanner-parser communication is necessary.

If you want to discuss how to improve the remaining kludges for handling conditional keywords and grammar ambiguities, please create a separate space for that. Maybe add a paragraph or two in the structure below https://github.com/eclipse-jdt/eclipse.jdt.core/wiki/ECJ ? Have you seen https://github.com/eclipse-jdt/eclipse.jdt.core/wiki/ECJ-Parse ?

@srikanth-sankaran
Copy link
Contributor Author

@srikanth-sankaran can we, pretty please, try to exercise some separation of concerns where it is still possible?

Point taken. I'll open a separate thread in a day or two. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants