From 141fd0a9cdd4e4e7aa05ee9ac15ee0f4d29d0018 Mon Sep 17 00:00:00 2001 From: Nick Cameron Date: Wed, 28 May 2014 13:14:25 +1200 Subject: [PATCH] Disallow struct literals in ambiguous positions. Do not identify struct literals by searching for `:`. Instead define a sub-category of expressions which excludes struct literals and re-define `for`, `if`, and other expressions which take an expression followed by a block (or non-terminal which can be replaced by a block) to take this sub-category, instead of all expressions. --- 0000-struct-grammar.md | 100 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 100 insertions(+) create mode 100644 0000-struct-grammar.md diff --git a/0000-struct-grammar.md b/0000-struct-grammar.md new file mode 100644 index 00000000000..79103e3c109 --- /dev/null +++ b/0000-struct-grammar.md @@ -0,0 +1,100 @@ +- Start Date: +- RFC PR #: +- Rust Issue #: + +# Summary + +Do not identify struct literals by searching for `:`. Instead define a sub- +category of expressions which excludes struct literals and re-define `for`, +`if`, and other expressions which take an expression followed by a block (or +non-terminal which can be replaced by a block) to take this sub-category, +instead of all expressions. + +# Motivation + +Parsing by looking ahead is fragile - it could easily be broken if we allow `:` +to appear elsewhere in types (e.g., type ascription) or if we change struct +literals to not require the `:` (e.g., if we allow empty structs to be written +with braces, or if we allow struct literals to unify field names to local +variable names, as has been suggested in the past and which we currently do for +struct literal patterns). We should also be able to give better error messages +today if users make these mistakes. More worringly, we might come up with some +language feature in the future which is not predictable now and which breaks +with the current system. + +Hopefully, it is pretty rare to use struct literals in these positions, so there +should not be much fallout. Any problems can be easily fixed by assigning the +struct literal into a variable. However, this is a backwards incompatible +change, so it should block 1.0. + +# Detailed design + +Here is a simplified version of a subset of Rust's abstract syntax: + +``` +e ::= x + | e `.` f + | name `{` (x `:` e)+ `}` + | block + | `for` e `in` e block + | `if` e block (`else` block)? + | `|` pattern* `|` e + | ... +block ::= `{` (e;)* e? `}` +``` + +Parsing this grammar is ambiguous since `x` cannot be distinguished from `name`, +so `e block` in the for expression is ambiguous with the struct literal +expression. We currently solve this by using lookahead to find a `:` token in +the struct literal. + +I propose the following adjustment: + +``` +e ::= e' + | name `{` (x `:` e)+ `}` + | `|` pattern* `|` e + | ... +e' ::= x + | e `.` f + | block + | `for` e `in` e' block + | `if` e' block (`else` block)? + | `|` pattern* `|` e' + | ... +block ::= `{` (e;)* e? `}` +``` + +`e' is just e without struct literal expressions. We use e' instead of e +`wherever e is followed directly by block or any other non-terminal which may +`have block as its first terminal (after any possible expansions). + +For any expressions where a sub-expression is the final lexical element +(closures in the subset above, but also unary and binary operations), we require +two versions of the meta-expression - the normal one in `e` and a version with +`e'` for the final element in `e'`. + +Implementation would be simpler, we just add a flag to `parser::restriction` +called `RESTRICT_BLOCK` or something, which puts us into a mode which reflects +`e'`. We would drop in to this mode when parsing `e'` position expressions and +drop out of it for all but the last sub-expression of an expression. + +# Drawbacks + +It makes the formal grammar and parsing a little more complicated (although it +is simpler in terms of needing less lookahead and avoiding a special case). + +# Alternatives + +Don't do this. + +Allow all expressions but greedily parse non-terminals in these positions, e.g., +`for N {} {}` would be parsed as `for (N {}) {}`. This seems worse because I +believe it will be much rarer to have structs in these positions than to have an +identifier in the first position, followed by two blocks (i.e., parse as `(for N +{}) {}`). + +# Unresolved questions + +Do we need to expose this distinction anywhere outside of the parser? E.g., +macros?