Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix compound binding regex when a binding is preceded with text with { or ] #2711

Closed

Conversation

TimvdLippe
Copy link
Contributor

As outlined by @david-saslawsky in his comment another edge-case for the regex is
[YES]/[NO] {{computeCompound('world', 'username ', 'Hello {0} ')}}

Therefore the regex is once again (sorry for the multiple PR's) changed to also cover this edge-case. Updated the original test-case to reflect this change.

image

@kevinpschaaf PTAL

@TimvdLippe
Copy link
Contributor Author

There has been a problem reported in GoogleWebComponents/google-chart#73 (comment) where the regex updated in #2705 matched the JSONArray, which it did not before. I have derived two possible actions:

  1. Do not do anything. While this sounds weird, the JSONArray does match the description in the docs: A binding annotation consists of a property name or subproperty name enclosed in curly brackets ({{}}) or square brackets ([[]]).. The "name" is a list of JSONObjects. However, in order to circumvent this issue, a new regex can be even more strict, as shown in option 3.
  2. Introduce a new escape character =. When a binding starts with =, it is not bound to, but used as literal. E.g. binding="=[[true]]" results in binding="[[true]]" as literal.
  3. Further restrict the allowed names to bind on. This solution is again more intelligent, but with every update it also lowers the maintainability and might cause collateral damage (like what happened with the JSON). It would allow all tests to pass, but creates more edge-cases. The regex I came up with:
// Regex with comments, compiles to:
// ((?!\{\{|\[\[).+?)?(\{\{|\[\[)((?:!)?(?:(?:[\w\s\-\_\.\:\$]+?)|(?:[\w\s\-\_\.\:\$]+\(.*?\))))(?:\]\]|\}\})
_bindingRegex: (function() {
  // Opening Binding options
  var ob = '\\{\\{|\\[\\[';
  // All allowed characters in a name of property or computed function
  var name = '[\\w\\s\\-\\_\\.\\:\\$]+';

  return new RegExp('((?!' + ob + ').+?)?'  // Everything that is not a {{ or [[
    + '(' + ob + ')'                        // The mode (two- or one-way)
    + '('                                   // Start of contents of the binding
      + '(?:\!)?'                           // Is the binding a negation
      + '(?:'                               // Can either be name or computed function
        + '(?:' + name + '?)'               // A normal name for a property
      + '|'                                 // Or
        + '(?:' + name                      // The name of the computed function
        + '\\(.*?\\)'                       // The arguments of the function within ()
        + ')'                               // End of computed function match
      + ')'                                 // End of name of property or computed function match
    + ')'                                   // End of the whole match returned as m[2]
    + '(?:\\]\\]|\\}\\})'                   // The binding closing

    ,'g')
  })(),

image

All options have pro's and con's. Option 1 requires users to put JSON as return value of a computed function or property, option 2 requires end-users to update their databinding, option 3 is significantly more difficult to maintain and understand. Personally, I don't know what the way to go is. I only wanted to show that some options are possible.

I hope this information is useful, please let me know what option would be desirable.

cc @ebidel @kevinpschaaf @azakus @sorvell

@ebidel
Copy link
Contributor

ebidel commented Nov 14, 2015

1 would be a breaking change, so I'm not sure we have much of an option other than to get this working again as it was.

@arthurevans
Copy link

For option #1, couldn't you also put a space between the brackets? I believe "[ [1, 2, 3], [3, 4, 5] ]" would passed through unscathed.

@TimvdLippe
Copy link
Contributor Author

@arthurevans Yes that is correct. This would not match on the regex, but parse as JSON.

@jlg7
Copy link

jlg7 commented Nov 15, 2015

Future complexity may be tamed by decoupling the one-way data binding anchors from valid JSON (choosing new anchors)

Here are some more cases but the intent here may be ambiguous between the use of existing features:

  • one-way data binding w/ literal
  • JSON string literals in attributes with intent of Array or Object deserialization:
[[0]]
[[0.01]]
[[true]]

@TimvdLippe
Copy link
Contributor Author

I have inserted another option to counterfeit JSON edge-cases as pointed out by @jongeho1

@david-saslawsky
Copy link

I think the problem with the option #3 is not that is is too complex but that it not tight enough to prevent all false positive matches.

The current parser is not that simple to maintain, it does not recognizes things it should:

[[localize(',')]]

[[alert('[[caution]]')]]

[[
foo (
   bar, 123
)
]] 

It also recognizes things it shouldn't - which creates a risk of collision with other syntax and makes typos harder to find:

[[compound(foo]]     // silently removed

So it seems to me that there are also maintenance and complexity costs with option #1

I mean, polymer is great as it is. You sure don't run into these problems everyday and there are workarounds in pretty much all cases - but still, trying to match an airtight syntax does not seem that expensive to me:

var NUMBER = '(:?' + '[0-9]+' + ')'; 
var STRING = '(:?' + '\'[^\']*\'' + ')';    // todo: string escaping and double quotes
var IDENTIFIER = '(:?' + '[\\w\\s\\-\\_\\.\\:\\$]+' + ')';

var ARGUMENT = '(:?' + IDENTIFIER + '|' + NUMBER + '|' +  STRING + ')';
var ARGUMENT_LIST = '(:?' + ARGUMENT + ',?' + ')*';
var COMPUTED_FUNCTION = '(:?' + IDENTIFIER + '\\(' + ARGUMENT_LIST + '\\)' + ')';
var BINDING = '(:?' + COMPUTED_FUNCTION + '|' + IDENTIFIER + ')';
var EXPRESSION = '\\[\\[' + BINDING + '\\]\\]' + '|' + '\\{\\{' + BINDING + '\\}\\}';

var bindingRegex = new RegExp(EXPRESSION, "g")

Improving and testing each individual sub-expression is still easy.

If this "one regular expression to rule them all" approach doesn't work (it runs but I haven't tested the speed or the compatibility), another option is to use the RE as a poor man's lexer that split the input string into a stream of tokens - as soon as something that looks like a binding is detected.

The trick would be to bail out and leave the HTML untouched in case of parser error. Ambiguities between json and polymer syntax are rare and wouldn't take long to resolve anyway.

@kevinpschaaf
Copy link
Member

@ebidel Just as a point of order, I want to point out that not disambiguating nested JSON array literals from one-way bindings has been a long-standing issue (#1734) which only temporarily worked for two releases (v1.2.0 and v1.2.1) as an unintentional side-effect of changes that came with compound binding, so I'm inclined to go a little more slowly w.r.t. reacting to that particular "breakage".

That said, I'm reviewing @david-saslawsky's expression for performance impact and if we can make it work with the existing bespoke string escaping.

@TimvdLippe
Copy link
Contributor Author

I really like the syntax of @david-saslawsky , but keep in mind this is a breaking change regarding naming of variables. (I think this limitation is the way to go to prevent updating this regex every week) The consequence would be that update must become a MAJOR update according to semver.

@david-saslawsky
Copy link

Sorry if this is not the right place to discuss this.

I've made more a complete version of the regex:

var NUMBER = '(:?' + '[-+]?[0-9]*\.?[0-9]+(:?[eE][-+]?[0-9]+)?' + ')'; 

var SQ_STRING = "\\'(?:\\\\.|[^'\\\\])*\\'";
var DQ_STRING = '\\"(?:\\\\.|[^"\\\\])*\\"';
var STRING = '(:?' + SQ_STRING + '|' + DQ_STRING + ')';

var IDENTIFIER = '(:?' + '[\\w\\s\\-\\_\\.\\:\\$]+' + ')';

var ARGUMENT = '(:?' + IDENTIFIER + '|' + NUMBER + '|' +  STRING + ')';
var ARGUMENT_LIST = '\\(' + '(:?' + ARGUMENT + '(:?,|\\))' + ')*';
var COMPUTED_FUNCTION = '(:?' + IDENTIFIER + ARGUMENT_LIST + ')';
var BINDING = '(:?' + COMPUTED_FUNCTION + '|' + IDENTIFIER + ')';
var EXPRESSION = '(\\[\\[|\\{\\{)' + '(' + BINDING + ')' + '(:?\\]\\]|\\}\\})';

var bindingRegex = new RegExp(EXPRESSION, "g")

Floats and string escaping work. Spaces and newlines are not supported but this shouldn't have much impact on the performance (I would expect that most of the time is spent looking for the [[ or {{).

This is almost airtight but I can't find a way to handle comma correctly in the argument list.

[[foo(123,'d\'oh','[[x]]')]]    // this is recognized
[[foo(123,'d\'oh','[[x]]',]]    // ... but so is this

In the long term, the "Regexp as a lexer" approach seems even more robust and flexible. Maybe something like:

var SYMBOL = "[,\\(\\)]";
var re = new RegExp('(' + NUMBER + ')|(' + STRING + ')|(' + SYMBOL + ')' + ...);
...
re.lastIndex = (index of the '[[') + 2;
while((match=re.exec(str)) !== null) {
    // parser state machine
    switch(parser.state) {
        ...
        case STATE_ARGUMENT_LIST:
            if(match[1])) { /* accept number */ }
            else if(match[2])) { /* accept string */ }
            else { /* error */ }
    }
}

This can parse things impossible to parse with a regexp:

foo(bar(),[['baz']])    // nested functions and JSON constants

Please let me know if I can be of any help.

@TimvdLippe
Copy link
Contributor Author

@david-saslawsky It might be good to create a PR with your fix so it can be reviewed more easily. I think I have the solution for the argument comma problem.

@kevinpschaaf
Copy link
Member

@david-saslawsky I modified your regex to pass the tests when integrated with Polymer and then ran some rudimentary perf tests against it. While much more correct than the previous, it highlighted a potential catastropic backtracking situation due to the repeating quantifiers for IDENTIFIER characters nested inside repeating ARGUMENT expressions when failing a partial match. I finally resolved the perf issue using an approximation for atomic groups using lookahead+immediate back-referencing, which made it a little more complex but still an overall improvement.

In my tests, it now parses faster than the regex on master and pretty close to the parsing speed in v1.1.5 before we added compound bindings, so I'm inclined to go with this solution, despite being slightly more strict about identifier characters allowed (I fail to see this as any sort of practical problem). I will put up a PR for this shortly.

It also has the happy side-effect of once-again excluding most nested JSON array literals (due to the stricter expression matching) and makes the google-chart repro work again.

@TimvdLippe
Copy link
Contributor Author

Closing in favor of #3017

@TimvdLippe TimvdLippe closed this Nov 18, 2015
@TimvdLippe TimvdLippe deleted the compound-binding-all-consuming branch November 18, 2015 23:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants