Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prefer element reference over method invocation #156

Merged
merged 4 commits into from
Feb 5, 2021

Conversation

nbrahms
Copy link
Contributor

@nbrahms nbrahms commented Jan 21, 2021

Ruby presents two syntactic sugarings that can not be distinguished
unambiguously from syntax alone:

First, array elements can be referenced using a bracketed argument after
any amount of white space, so:

x.[](0)

x[0]

x [0]

are all equivalent.

Second, methods may be invoked with omitted parends, so:

f(y)

f y

are equivalent.

The ambiguity can be seen when the function argument is a literal array:

f [0]

At this point, there is no syntactic information that can distinguish
between element reference and procedural invocation.

This can be seen by running this program in irb:

irb(main):001:0> x = [0, 1, 2]
=> [0, 1, 2]
irb(main):002:0> x.[](0)
=> 0
irb(main):003:0> x [0]
=> 0
irb(main):004:0> def y(z)
irb(main):005:1>   z
irb(main):006:1> end
=> :y
irb(main):007:0> y([0])
=> [0]
irb(main):008:0> y [0]
=> [0]

Previously, tree-sitter-ruby handled this ambiguity by presenting both
x [0] and y [0] as procedural invocation.

However, this causes a parse error as described in
#146, when parsing

d.find { |x| a(x) } [b] == c

Here I add an optional, lower-precedence interpretation of x [0] as an
element reference.

Due to the construction of the grammar in this project, this
unfortunately causes problems when attempting to parse constructs like:

fun [0] do
  something
end

as the parser will eagerly consume fun [0] as the left-hand-side of
a binary expression. To deal with this case, I explicitly add this
construct to the call production rule. Unfortunately I had to resort
to the GLR parser in order to resolve the ambiguity between these two
rules.

Finally, note that the tree obtained from the construct

z [0] == 0

is context-sensitive in Ruby. If z is an array type, it is interpreted
as binary ( reference ( identifier, integer ), integer. If z is
a method, it is interpreted as call ( identifier, binary ( array (integer), integer). Since tree-sitter assumes the parsed language is
context-free, there's no good way for us to resolve this ambiguity. This
commit prefers the second, method-invocation, interpretation, which
appears to be more common within the test corpus.

Fixes #146 and #155

@nbrahms
Copy link
Contributor Author

nbrahms commented Jan 21, 2021

whoops :D

@nbrahms
Copy link
Contributor Author

nbrahms commented Jan 21, 2021

@maxbrunsfeld I'm having trouble getting tree-sitter to attempt to resolve this:

argf [x] y

as a command_call node either first or in addition to resolving this as an element_reference after the change here.

I'd love your insight if you have some moments to check this out.

Here's what parse -D is kicking out so far:

image

@maxbrunsfeld
Copy link
Contributor

I think that, for that to parse as a command call, you'd need a comma after the array literal:

argf [x], y

@nbrahms nbrahms force-pushed the nathan/indexed-proc branch from 0313cef to 3ec2c53 Compare January 22, 2021 17:48
@nbrahms
Copy link
Contributor Author

nbrahms commented Jan 22, 2021

@maxbrunsfeld I was able to resolve the corpus test failures. Unfortunately, I had to fall back to GLR parsing.

I think the work here both fixes the linked issues and parses the test corpus properly.

However, I'm confident that the conflict here should be resolvable with an LR(1) parser. Would love your thoughts.

@nbrahms
Copy link
Contributor Author

nbrahms commented Jan 22, 2021

I think that, for that to parse as a command call, you'd need a comma after the array literal:

argf [x], y

Yup, it turns out the issue I actually needed to investigate was

argf [f] do
  ...
end

Ruby presents two ~syntactic~ sugarings that can not be distinguished
unambiguously from syntax alone:

First, array elements can be referenced using a bracketed argument after
any amount of white space, so:

    x.[](0)

    x[0]

    x [0]

are all equivalent.

Second, methods may be invoked with omitted parends, so:

    f(y)

    f y

are equivalent.

The ambiguity can be seen when the function argument is a literal array:

    f [0]

At this point, there is no syntactic information that can distinguish
between element reference and procedural invocation.

This can be seen by running this program in irb:

    irb(main):001:0> x = [0, 1, 2]
    => [0, 1, 2]
    irb(main):002:0> x.[](0)
    => 0
    irb(main):003:0> x [0]
    => 0
    irb(main):004:0> def y(z)
    irb(main):005:1>   z
    irb(main):006:1> end
    => :y
    irb(main):007:0> y([0])
    => [0]
    irb(main):008:0> y [0]
    => [0]

Previously, tree-sitter-ruby handled this ambiguity by presenting both
`x [0]` and `y [0]` as procedural invocation.

However, this causes a parse error as described in
tree-sitter#146, when parsing

    d.find { |x| a(x) } [b] == c

Here I add an optional, lower-precedence interpretation of `x [0]` as an
element reference.

Due to the construction of the grammar in this project, this
unfortunately causes problems when attempting to parse constructs like:

    fun [0] do
      something
    end

as the parser will eagerly consume `fun [0]` as the left-hand-side of
a binary expression. To deal with this case, I explicitly add this
construct to the `call` production rule. Unfortunately I had to resort
to the GLR parser in order to resolve the ambiguity between these two
rules.

Finally, note that the tree obtained from the construct

    z [0] == 0

is context-sensitive in Ruby. If `z` is an array type, it is interpreted
as `binary ( reference ( identifier, integer ), integer`. If `z` is
a method, it is interpreted as `call ( identifier, binary ( array
(integer), integer)`. Since tree-sitter assumes the parsed language is
context-free, there's no good way for us to resolve this ambiguity. This
commit prefers the second, method-invocation, interpretation, which
appears to be more common within the test corpus.
@nbrahms nbrahms force-pushed the nathan/indexed-proc branch from 3ec2c53 to 41720e3 Compare January 22, 2021 17:59
@maxbrunsfeld
Copy link
Contributor

This one might take a bit longer for me to review because it's more complex. I will try to look at it next week.

@nbrahms
Copy link
Contributor Author

nbrahms commented Jan 28, 2021

Hi @maxbrunsfeld ... just checking in to see if you'd had an opportunity to look at this yet. 🙏

@nbrahms
Copy link
Contributor Author

nbrahms commented Feb 2, 2021

Hi @maxbrunsfeld ... Just checking in again to see if you (or possibly any other maintainers) had a chance to look over this change yet. Much thanks.

Copy link
Contributor

@maxbrunsfeld maxbrunsfeld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great explanation of your debugging process.

This is actually a pretty tricky change, surprisingly. There are some downsides to this change that I'd like to avoid if possible, but I admit that wasn't able to come up with something better on my first attempt.

The way that the different precedences interact is really tricky in this part of the grammar. I would like to add some more precise APIs for resolving conflicts like these.

I am going to spend a little bit more time investigating this one myself, to see if there's any way to fix the parse errors that you've found (and written good tests for) without adding conflicts, and without changing the structure of call.

Thanks all of the debugging and analysis that you've done so far. We'll fix this soon one way or the other.

grammar.js Outdated
@@ -535,8 +547,10 @@ module.exports = grammar({
seq(receiver, arguments),
seq(receiver, prec(PREC.CURLY_BLOCK, seq(arguments, block))),
seq(receiver, prec(PREC.DO_BLOCK, seq(arguments, doBlock))),
prec(PREC.CURLY_BLOCK, seq(receiver, $.array, block)),
prec(PREC.DO_BLOCK, seq(receiver, $.array, doBlock)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will cause call nodes to have a new type of child (array), instead of just receiver, arguments and block/do_block. It seems a little bit counterintuitive to me, so I'd like to try and avoid this.

grammar.js Outdated
conflicts: $ => [
[$._lhs, $.call],
[$._lhs, $.call, $.command_call, $.command_call_with_block],
],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little worried about how this conflict will behave, since it seems like it may come up in more cases than just an identifier followed by a [. It may be necessary, but I'd like to look into this some, to see if it can be avoided.

When an opening square bracket appears immediately after a callable
expression like "a" or "a.b", we must decide between two possible
interpretations of the bracket:
1. It could be part of an element reference, as in
   `a[0] = true`.
2. Or it could be an array literal, passed as an argumet, as in
   `puts [1, 2, 3]`

If there is no preceding whitespace, the bracket should *always* be
treated as part of an element reference. This matches MRI's behavior.

If there *is* preceding whitespace, MRI makes its decision in a
context-sensitive way, based on whether the preceding expression
is a local variable or a method name.

This parser is not context-sensitive, so we instead will interpret
the bracket as part of an array literal whenever that is syntactically
valid, and interpret it as part of element reference otherwise. The
external scanner can use the validity of other expression tokens like
`string` to infer whether an array literal would be valid.
@maxbrunsfeld
Copy link
Contributor

@nbrahms I think I have found a solution that I prefer, because there are no conflicts, and the grammar can stay a little bit simpler. It uses some manual logic in the external scanner to match the [ token. Let me know what you think, and if you can come up with any other test cases that we should make sure to try. If you are cool with it, you could merge it into your branch, and then I'll merge this.

semgrep#1

@nbrahms
Copy link
Contributor Author

nbrahms commented Feb 5, 2021

Awesome thanks @maxbrunsfeld

Use external scanner logic to distinguish between arrays & subscripts
@nbrahms
Copy link
Contributor Author

nbrahms commented Feb 5, 2021

Yup, that implementation looks good to me. I've merged it into the source branch of this PR.

@nbrahms
Copy link
Contributor Author

nbrahms commented Feb 5, 2021

Just to double-check @maxbrunsfeld ... I think this change still suffers from your point in #156 (comment)... do we want to try to add aliasing to avoid this?

@nbrahms
Copy link
Contributor Author

nbrahms commented Feb 5, 2021

Just to double-check @maxbrunsfeld ... I think this change still suffers from your point in #156 (comment)... do we want to try to add aliasing to avoid this?

Ah, I just saw that you were able to undo the changes to the $.call node. Looks like we're good :)

@maxbrunsfeld maxbrunsfeld merged commit 1fa06a9 into tree-sitter:master Feb 5, 2021
@maxbrunsfeld
Copy link
Contributor

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parse error on loop iterator
2 participants