Prefer element reference over method invocation #156

nbrahms · 2021-01-21T00:14:14Z

Ruby presents two ~~syntactic~~ sugarings that can not be distinguished
unambiguously from syntax alone:

First, array elements can be referenced using a bracketed argument after
any amount of white space, so:

x.[](0)

x[0]

x [0]

are all equivalent.

Second, methods may be invoked with omitted parends, so:

f(y)

f y

are equivalent.

The ambiguity can be seen when the function argument is a literal array:

f [0]

At this point, there is no syntactic information that can distinguish
between element reference and procedural invocation.

This can be seen by running this program in irb:

irb(main):001:0> x = [0, 1, 2]
=> [0, 1, 2]
irb(main):002:0> x.[](0)
=> 0
irb(main):003:0> x [0]
=> 0
irb(main):004:0> def y(z)
irb(main):005:1>   z
irb(main):006:1> end
=> :y
irb(main):007:0> y([0])
=> [0]
irb(main):008:0> y [0]
=> [0]

Previously, tree-sitter-ruby handled this ambiguity by presenting both
x [0] and y [0] as procedural invocation.

However, this causes a parse error as described in
#146, when parsing

d.find { |x| a(x) } [b] == c

Here I add an optional, lower-precedence interpretation of x [0] as an
element reference.

Due to the construction of the grammar in this project, this
unfortunately causes problems when attempting to parse constructs like:

fun [0] do
  something
end

as the parser will eagerly consume fun [0] as the left-hand-side of
a binary expression. To deal with this case, I explicitly add this
construct to the call production rule. Unfortunately I had to resort
to the GLR parser in order to resolve the ambiguity between these two
rules.

Finally, note that the tree obtained from the construct

z [0] == 0

is context-sensitive in Ruby. If z is an array type, it is interpreted
as binary ( reference ( identifier, integer ), integer. If z is
a method, it is interpreted as call ( identifier, binary ( array (integer), integer). Since tree-sitter assumes the parsed language is
context-free, there's no good way for us to resolve this ambiguity. This
commit prefers the second, method-invocation, interpretation, which
appears to be more common within the test corpus.

Fixes #146 and #155

nbrahms · 2021-01-21T01:54:09Z

whoops :D

nbrahms · 2021-01-21T02:13:51Z

@maxbrunsfeld I'm having trouble getting tree-sitter to attempt to resolve this:

argf [x] y

as a command_call node either first or in addition to resolving this as an element_reference after the change here.

I'd love your insight if you have some moments to check this out.

Here's what parse -D is kicking out so far:

maxbrunsfeld · 2021-01-21T17:34:23Z

I think that, for that to parse as a command call, you'd need a comma after the array literal:

argf [x], y

nbrahms · 2021-01-22T17:52:26Z

@maxbrunsfeld I was able to resolve the corpus test failures. Unfortunately, I had to fall back to GLR parsing.

I think the work here both fixes the linked issues and parses the test corpus properly.

However, I'm confident that the conflict here should be resolvable with an LR(1) parser. Would love your thoughts.

nbrahms · 2021-01-22T17:55:29Z

I think that, for that to parse as a command call, you'd need a comma after the array literal:
argf [x], y

Yup, it turns out the issue I actually needed to investigate was

argf [f] do
  ...
end

Ruby presents two ~syntactic~ sugarings that can not be distinguished unambiguously from syntax alone: First, array elements can be referenced using a bracketed argument after any amount of white space, so: x.[](0) x[0] x [0] are all equivalent. Second, methods may be invoked with omitted parends, so: f(y) f y are equivalent. The ambiguity can be seen when the function argument is a literal array: f [0] At this point, there is no syntactic information that can distinguish between element reference and procedural invocation. This can be seen by running this program in irb: irb(main):001:0> x = [0, 1, 2] => [0, 1, 2] irb(main):002:0> x.[](0) => 0 irb(main):003:0> x [0] => 0 irb(main):004:0> def y(z) irb(main):005:1> z irb(main):006:1> end => :y irb(main):007:0> y([0]) => [0] irb(main):008:0> y [0] => [0] Previously, tree-sitter-ruby handled this ambiguity by presenting both `x [0]` and `y [0]` as procedural invocation. However, this causes a parse error as described in tree-sitter#146, when parsing d.find { |x| a(x) } [b] == c Here I add an optional, lower-precedence interpretation of `x [0]` as an element reference. Due to the construction of the grammar in this project, this unfortunately causes problems when attempting to parse constructs like: fun [0] do something end as the parser will eagerly consume `fun [0]` as the left-hand-side of a binary expression. To deal with this case, I explicitly add this construct to the `call` production rule. Unfortunately I had to resort to the GLR parser in order to resolve the ambiguity between these two rules. Finally, note that the tree obtained from the construct z [0] == 0 is context-sensitive in Ruby. If `z` is an array type, it is interpreted as `binary ( reference ( identifier, integer ), integer`. If `z` is a method, it is interpreted as `call ( identifier, binary ( array (integer), integer)`. Since tree-sitter assumes the parsed language is context-free, there's no good way for us to resolve this ambiguity. This commit prefers the second, method-invocation, interpretation, which appears to be more common within the test corpus.

maxbrunsfeld · 2021-01-22T18:20:35Z

This one might take a bit longer for me to review because it's more complex. I will try to look at it next week.

nbrahms · 2021-01-28T18:16:25Z

Hi @maxbrunsfeld ... just checking in to see if you'd had an opportunity to look at this yet. 🙏

nbrahms · 2021-02-02T22:56:42Z

Hi @maxbrunsfeld ... Just checking in again to see if you (or possibly any other maintainers) had a chance to look over this change yet. Much thanks.

maxbrunsfeld

Thanks for the great explanation of your debugging process.

This is actually a pretty tricky change, surprisingly. There are some downsides to this change that I'd like to avoid if possible, but I admit that wasn't able to come up with something better on my first attempt.

The way that the different precedences interact is really tricky in this part of the grammar. I would like to add some more precise APIs for resolving conflicts like these.

I am going to spend a little bit more time investigating this one myself, to see if there's any way to fix the parse errors that you've found (and written good tests for) without adding conflicts, and without changing the structure of call.

Thanks all of the debugging and analysis that you've done so far. We'll fix this soon one way or the other.

maxbrunsfeld · 2021-02-04T22:24:38Z

grammar.js

@@ -535,8 +547,10 @@ module.exports = grammar({
        seq(receiver, arguments),
        seq(receiver, prec(PREC.CURLY_BLOCK, seq(arguments, block))),
        seq(receiver, prec(PREC.DO_BLOCK, seq(arguments, doBlock))),
+        prec(PREC.CURLY_BLOCK, seq(receiver, $.array, block)),
+        prec(PREC.DO_BLOCK, seq(receiver, $.array, doBlock)),


This will cause call nodes to have a new type of child (array), instead of just receiver, arguments and block/do_block. It seems a little bit counterintuitive to me, so I'd like to try and avoid this.

maxbrunsfeld · 2021-02-04T22:26:01Z

grammar.js

+  conflicts: $ => [
+    [$._lhs, $.call],
+    [$._lhs, $.call, $.command_call, $.command_call_with_block],
+  ],


I'm a little worried about how this conflict will behave, since it seems like it may come up in more cases than just an identifier followed by a [. It may be necessary, but I'd like to look into this some, to see if it can be avoided.

When an opening square bracket appears immediately after a callable expression like "a" or "a.b", we must decide between two possible interpretations of the bracket: 1. It could be part of an element reference, as in `a[0] = true`. 2. Or it could be an array literal, passed as an argumet, as in `puts [1, 2, 3]` If there is no preceding whitespace, the bracket should *always* be treated as part of an element reference. This matches MRI's behavior. If there *is* preceding whitespace, MRI makes its decision in a context-sensitive way, based on whether the preceding expression is a local variable or a method name. This parser is not context-sensitive, so we instead will interpret the bracket as part of an array literal whenever that is syntactically valid, and interpret it as part of element reference otherwise. The external scanner can use the validity of other expression tokens like `string` to infer whether an array literal would be valid.

maxbrunsfeld · 2021-02-04T22:58:53Z

@nbrahms I think I have found a solution that I prefer, because there are no conflicts, and the grammar can stay a little bit simpler. It uses some manual logic in the external scanner to match the [ token. Let me know what you think, and if you can come up with any other test cases that we should make sure to try. If you are cool with it, you could merge it into your branch, and then I'll merge this.

semgrep#1

nbrahms · 2021-02-05T00:13:04Z

Awesome thanks @maxbrunsfeld

Use external scanner logic to distinguish between arrays & subscripts

nbrahms · 2021-02-05T00:20:21Z

Yup, that implementation looks good to me. I've merged it into the source branch of this PR.

nbrahms · 2021-02-05T00:23:12Z

Just to double-check @maxbrunsfeld ... I think this change still suffers from your point in #156 (comment)... do we want to try to add aliasing to avoid this?

nbrahms · 2021-02-05T00:25:20Z

Just to double-check @maxbrunsfeld ... I think this change still suffers from your point in #156 (comment)... do we want to try to add aliasing to avoid this?

Ah, I just saw that you were able to undo the changes to the $.call node. Looks like we're good :)

maxbrunsfeld · 2021-02-05T00:30:38Z

Thanks!

nbrahms force-pushed the nathan/indexed-proc branch from 0313cef to 3ec2c53 Compare January 22, 2021 17:48

nbrahms force-pushed the nathan/indexed-proc branch from 3ec2c53 to 41720e3 Compare January 22, 2021 17:59

Merge remote-tracking branch 'origin/master' into nathan/indexed-proc

1c4b299

maxbrunsfeld reviewed Feb 4, 2021

View reviewed changes

Merge pull request #1 from tree-sitter/indexed-proc

02b5403

Use external scanner logic to distinguish between arrays & subscripts

maxbrunsfeld merged commit 1fa06a9 into tree-sitter:master Feb 5, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prefer element reference over method invocation #156

Prefer element reference over method invocation #156

nbrahms commented Jan 21, 2021 •

edited

Loading

nbrahms commented Jan 21, 2021

nbrahms commented Jan 21, 2021 •

edited

Loading

maxbrunsfeld commented Jan 21, 2021

nbrahms commented Jan 22, 2021

nbrahms commented Jan 22, 2021

maxbrunsfeld commented Jan 22, 2021

nbrahms commented Jan 28, 2021

nbrahms commented Feb 2, 2021

maxbrunsfeld left a comment

maxbrunsfeld Feb 4, 2021

maxbrunsfeld Feb 4, 2021

maxbrunsfeld commented Feb 4, 2021

nbrahms commented Feb 5, 2021

nbrahms commented Feb 5, 2021

nbrahms commented Feb 5, 2021

nbrahms commented Feb 5, 2021

maxbrunsfeld commented Feb 5, 2021

Prefer element reference over method invocation #156

Prefer element reference over method invocation #156

Conversation

nbrahms commented Jan 21, 2021 • edited Loading

nbrahms commented Jan 21, 2021

nbrahms commented Jan 21, 2021 • edited Loading

maxbrunsfeld commented Jan 21, 2021

nbrahms commented Jan 22, 2021

nbrahms commented Jan 22, 2021

maxbrunsfeld commented Jan 22, 2021

nbrahms commented Jan 28, 2021

nbrahms commented Feb 2, 2021

maxbrunsfeld left a comment

Choose a reason for hiding this comment

maxbrunsfeld Feb 4, 2021

Choose a reason for hiding this comment

maxbrunsfeld Feb 4, 2021

Choose a reason for hiding this comment

maxbrunsfeld commented Feb 4, 2021

nbrahms commented Feb 5, 2021

nbrahms commented Feb 5, 2021

nbrahms commented Feb 5, 2021

nbrahms commented Feb 5, 2021

maxbrunsfeld commented Feb 5, 2021

nbrahms commented Jan 21, 2021 •

edited

Loading

nbrahms commented Jan 21, 2021 •

edited

Loading