Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regular expressions in filters #70

Closed
glyn opened this issue Mar 12, 2021 · 42 comments · Fixed by #163
Closed

Regular expressions in filters #70

glyn opened this issue Mar 12, 2021 · 42 comments · Fixed by #163

Comments

@glyn
Copy link
Collaborator

glyn commented Mar 12, 2021

Issue 17 proposed regular expressions in filters, which is supported by several implementations of JSONPath. This raises two questions:

  1. What is the precise syntax of regular expressions?
  2. How should ReDOS attacks be mitigated?

One approach which addresses both these questions would be to adopt RE2 syntax as discussed in ReDOS attacks, although not all languages may yet have support for RE2.

@cabo
Copy link
Member

cabo commented Mar 13, 2021

So you are saying JSONpath should only support regular expression literals, not strings with regular expressions.

@danielaparker
Copy link

danielaparker commented Mar 13, 2021

So you are saying JSONpath should only support regular expression literals, not strings with regular expressions.

Unfortunately, no, I'm saying that at the time I wrote that sentence, I had forgotten that Perl/Python/Ruby et al support variables inside the "/" delimiters. I wish I could interpret it the way you suggested.

@cabo
Copy link
Member

cabo commented Mar 13, 2021

Do you mean as in (1) interpolating variables into the RE (#{name}) or (2) filling in variables with results of the RE scan (?P<name>...)?

So far, I haven't seen any proposal for variables in the JSONPath expression language that could go into either place.

@cabo
Copy link
Member

cabo commented Mar 13, 2021

RE2 syntax would have the advantage that nobody of us likes it :-)
But seriously, any RE syntax that is not already documented will need to be defined in the document.
That leaves the ECMAscript syntax of the week or W3C XSD syntax.
ECMAscript is a bit anemic but changes all the time.
W3C XSD REs are exactly what would be needed here, but are not very popular.

@danielaparker
Copy link

danielaparker commented Mar 13, 2021

@cabo wrote:

Do you mean as in (1) interpolating variables into the RE (#{name}) or (2) filling in variables with results of the RE scan (?P<name>...)?

So far, I haven't seen any proposal for variables in the JSONPath expression language that could go into either place.

Perhaps an expression e.g. @.pattern that evaluated to a string should be allowed within "/" delimiters. I haven't thought this through, I only have experience implementing it with character literals, and the JSONPath comparison only has character literals. I'll submit a pull request with JSONPath comparison with a match against an expression, to see what everybody else does.

@glyn
Copy link
Collaborator Author

glyn commented Mar 15, 2021

I would prefer to restrict the right hand side of a regular expression match (=~) to a literal regular expression. I might be persuaded that we should also allow a JSONPath expression which evaluates to a (single) literal regular expression. I really would like to avoid literal regular expressions that contain JSONPath expressions as that would mean not being able to reuse an existing regular expression syntax (such as RE2).

@danielaparker
Copy link

danielaparker commented Mar 15, 2021

I would prefer to restrict the right hand side of a regular expression match (=~) to a literal regular expression.

After giving this some thought, I agree.

Introducing "/" delimiters for regular expressions suggests static "compilation" of the regular expression, as it does in perl/ruby/python. If in JSONPath we're applying a regular expression in a search of a million items, we wouldn't want the regular expression to depend on the data, as that would defeat the purpose of introducing the "/" delimiters in the first place. We wouldn't want to force implementations to "recompile" the regular expression every time the current node changed.

Also, I don't think my speculation above about allowing @.child inside "/" delimiters makes sense, as the symbols allowed within "/" delimiters are reserved for regular expression symbols.

These observations are consistent with the Jayway Java implementation of the "~=" operator.

In terms of functional requirements, regular expressions are very useful for search and filtering, but such expressions are generally statically known.

I might be persuaded that we should also allow a JSONPath expression which evaluates to a (single) literal regular expression.

I don't fully understand this sentence. JSONPath expressions don't produce character literals, character literals are a static concept. JSONPath expressions can produce strings, but that's not the same thing. The key feature of character literals is that they don't depend on the data, and in particular on the current node @.

I really would like to avoid literal regular expressions that contain JSONPath expressions as that would mean not being able to reuse an existing regular expression syntax (such as RE2).

I don't think anybody would propose that!

@cabo
Copy link
Member

cabo commented Mar 15, 2021 via email

@glyn
Copy link
Collaborator Author

glyn commented Mar 15, 2021

I might be persuaded that we should also allow a JSONPath expression which evaluates to a (single) literal regular expression.

I don't fully understand this sentence. JSONPath expressions don't produce character literals, character literals are a static concept. JSONPath expressions can produce strings, but that's not the same thing. The key feature of character literals is that they don't depend on the data, and in particular on the current node @.

Pardon my confusion, but what I had in mind was to allow the likes of ... =~ $.a where $.a evaluates to a regular expression such as /x.*y/.

@danielaparker
Copy link

danielaparker commented Mar 15, 2021

@glyn wrote:

Pardon my confusion, but what I had in mind was to allow the likes of ... =~ $.a where $.a evaluates to a regular expression such as /x.*y/.

One minor point, I believe it should say that it evaluates to a string "x.*y", rather than character literals /x.*y/.

Otherwise, I believe that that's technically sound, and permits an efficient implementation, in the sense of "compile" regular expression once and apply many times. And it's somewhat more likely that there is a regular expression to be found at an absolute location in the JSON document, as opposed to in all current nodes :-) I still think it's an unlikely user requirement. If you do become persuaded to support this syntax, it probably won't be because I've tried to persuade you.

@cabo
Copy link
Member

cabo commented Mar 15, 2021 via email

@danielaparker
Copy link

danielaparker commented Mar 15, 2021

@cabo wrote:

The expression language could also contain something like RegExp.escape, and it is likely to be helpful to compose the regular expression out of constant and variable (from input item) parts. What’s out there in today’s implementations?

For implementations that rely on Javascript/Python/Perl/PHP et al for a filter predicate implementation, the capabilities of the regular expression evaluator are of course the capabilities of the scripting language, but I think the support for providing input for the evaluator are quite limited, perhaps limited to a string with pre-evaluated @.something terms.

As it turns out, no implementation supports @.something terms in regular expressions.

@gregsdennis
Copy link
Collaborator

...what I had in mind was to allow the likes of ... =~ $.a where $.a evaluates to a regular expression such as /x.*y/.

I could get behind this.

But I definitely wouldn't want to "construct" a RegEx from input data, as others have also been opposed to.

@cabo
Copy link
Member

cabo commented Mar 15, 2021 via email

@timbray
Copy link
Contributor

timbray commented Mar 16, 2021

On 2021-03-15, at 18:01, Daniel Parker @.***> wrote: Do you have some candidate syntax in mind that you would like to have investigated?
It would easy to come up with something, but I’d rather not invent something — if there is no current practice for this, we should not include it. Grüße, Carsten

Agree. Reminder: Charter says we're supposed to specify something "based on the common semantics and other aspects of existing implementations". So there's a judgment call to be made as to whether enough of them support a common regex that we can reasonably claim it constitutes "common semantics". My impression had been No, but it's months since I looked at the overlap chart.

@gregsdennis
Copy link
Collaborator

I can tell you neither of mine support RegEx (one is obsolete now). But I'm open to adding it to the current one.

@bettio
Copy link

bettio commented Mar 17, 2021

I would really wish support for regex, and honestly I think they are nearly a must.
However I'm really worried about all the implications behind them.

We are working hard for making sure that any future implementation is going to behave consistently with any other which follows our work.

I'm worried that all the complexity and differences behind regex is going to reintroduce back all the discrepancies we are trying hard to remove.

I think that If we wish any regex support we need to go for a regex syntax which is well specified and comonly implemented.

Honestly I'm not happy with RE2 that has been mentioned before, as far as I know it is available to most of languages as a C library through bindings (bindings might be really problematic).

For instance I wasn't happy with RE2 library while working with Erlang and Elixir, among all the reasons, using it would force my JSONPath implementation to depend on native code using a NIF while the whole library is 100% Elixir compiled into BEAM byte code.

IMHO we might investigate a Perl-like or POSIX-like regex support (that are quite popular). But again we should avoid getting stuck into all the regex complexities.

I have also some further points, questions and doubts:

  1. I think that regex should be supported only as a constant, therefore they must not be taken from a JSONPath evaluation result (e.g. disallow $[?(@.foo =~ @.regex)]).
    We might change idea on a future revision, but right now I'm worried that this might introduce further complexity, discrepancies and performance penalities (we can parse and compile the regex while parsing the JSONPath).

  2. are we going to treat regex as strings when using == comparison (e.g. $[?(@.foo == /[a-z]*/)] is true when foo is "[a-z]*") or are we going to use any special type, therefore == will never be true when comparing with a string?

  3. should we raise a parsing time error if a ill-formed regex is provided (e.g. /[a-z/)?

@cabo
Copy link
Member

cabo commented Mar 17, 2021

My 2 ¢:

Any RE syntax we decide to use will need to be thoroughly documented in JSONPath, except if it is one of the ECMAscript versions or W3C XSD syntax. (I do not really consider Posix a candidate, it is a bit anemic.)
Of course, we don't need capture groups, global substitute, and can probably live without back references etc. That makes W3C XSD syntax my favorite, but it is also less widely implemented than others. It's 2021, so strings and REs will be Unicode text.

Is it (1) string =~ RE, or (2) RE =~ string, or (3) both or even (4) both but with subtly different semantics (horror experience from the Ruby language)?

  • I think that regex should be supported only as a constant, therefore they must not be taken from a JSONPath evaluation result (e.g. disallow $[?(@.foo =~ @.regex)]).
    We might change idea on a future revision, but right now I'm worried that this might introduce further complexity, discrepancies and performance penalities (we can parse and compile the regex while parsing the JSONPath).

That is a sensible approach. Unless there is a large installed base that depends on computed REs, I agree. We could make sure that a later extension can be added seamlessly.

  • are we going to treat regex as strings when using == comparison (e.g. $[?(@.foo == /[a-z]*/)] is true when foo is "[a-z]*") or are we going to use any special type, therefore == will never be true when comparing with a string?

We have the luxury of being able to provide strong typing, so a RE literal should be a (syntax?) error except near a =~.

  • should we raise a parsing time error if a ill-formed regex is provided (e.g. /[a-z/)?

Yes. If we can check a query before applying it to input, we should do that with failing fast.

@bettio
Copy link

bettio commented Mar 18, 2021

I tried to query this example [1] using a JSONPath with a XSD regex

I tried using $.store.book[?(@.title =~ /[a-z-[s-u]]/)] (which is equivalent to $.store.book[?(@.title =~ /[a-rv-z]/)]):

Jayway: fails to understand it and returns a "bogus" result (but it doesn't fail with the "regular" regex)

Personal opinion about XSD regex: I feel like that no implementation is supporting them (I'm not aware of any), I feel like that every implementation supports whatever the language provides as builtin regex implementation (which is likely far from XSD regex syntax). Please, let me know if any implementation is supporting that syntax. If none supports that syntax, are we really sure to go for a syntax that nobody is using? Furthermore I didn't find any online tool to test regex with XSD style, conversely, PCRE style regex are easy to test (this is not really encouraging).

I also did few tests with some online test tools, following implementations just fail to understand any regex (I tried with "basic" syntax $.store.book[?(@.title =~ /[a-z]/)]):

  • jsonpath-plus (javascript, 0.18.1)
  • goessner: bogus result here, it replaces every title occurrence with -1

Personal opinion: previous results make me feel skeptic about the whole regex feature. Do we know which implementations are supporting regex?

[1]:

{
    "store": {
        "book": [
            {
                "category": "reference",
                "author": "Nigel Rees",
                "title": "a",
                "price": 8.95
            },
            {
                "category": "fiction",
                "author": "Evelyn Waugh",
                "title": "s",
                "price": 12.99
            },
            {
                "category": "fiction",
                "author": "Herman Melville",
                "title": "x",
                "isbn": "0-553-21311-3",
                "price": 8.99
            },
            {
                "category": "fiction",
                "author": "J. R. R. Tolkien",
                "title": "The Lord of the Rings",
                "isbn": "0-395-19395-8",
                "price": 22.99
            }
        ],
        "bicycle": {
            "color": "red",
            "price": 19.95
        }
    },
    "expensive": 10
}

@bettio
Copy link

bettio commented Mar 19, 2021

But there are online XPATH 3 evaluators. Have you tried testing regex with them?

As far as I know XPath/XQuery regex are different than XML Schema ones:

Note that the regular expression functions available in XQuery and XPath use a different regular expression flavor. This flavor is a superset of the XML Schema flavor described here. It adds some of the features that are available in many modern regex flavors, but not in the XML Schema flavor.

I feel like there is a lack modern tooling for testing XML Schema flavor regex (outside XML schemas), and this might be quite frustrating for end users.

By the way JSON Schema is borrowing syntax from ECMA 262.

@bettio
Copy link

bettio commented Mar 20, 2021

I did some further investigation into regex support (using cburgmer's tool and his implementations list).

I usued a quite simple regex [a-rv-z] (JSON input was: [{"a": "a"}, {"b": "s"}] and JSONPath was: $[?(@.a =~ /[a-rv-z]/)]).

They are supported by 11 implementations in 7 different languages [1], interestingly JavaScript implementations were not supporting it, and 3/4 of them were completely misbehaving [2]:

  • JavaScript brunerd "matches" everything
  • JavaScript Goessner and jsonpath-plus replace matches with -1

I'm worried that users might expect to use regex to reliably validate/filter out invalid inputs, but they get accepted anyway.

Instead $[?(@.a =~ "[a-rv-z]")] is not supported at all.

Also I tested a XSD-like syntax, $[?(@.a =~ /[a-z-[s-u]]/)] is supported by a further limited subset of implementations (5/11).

[1] List of implementations supporting $[?(@.a =~ /[a-rv-z]/)]:
C++ (1/1 implementation):

  • jsoncons

Go (2/6 implementations):

  • github.com-bhmj-jsonslice
  • github.com-vmware-labs-yaml-jsonpath

Java (2/2 implementations):

  • com.github.jsurfer
  • com.jayway.jsonpath

Objective-C (1/1 implementation):

  • SMJJSONPath

Ruby (1/1 implementation):

  • jsonpath

PHP (3/4 implementations):

  • galbar-jsonpath
  • remorhaz-jsonpath
  • softcreatr-jsonpath

dotNET (1/4 implementations):

  • Json.NET

[2] JavaScript misbehaving implementations:

  • Goessner: [{"a": -1}, {"a": -1, "b": "s"}]
  • brunerd: [{"a": "a"}, {"b": "s"}]
  • jsonpath-plus: [{"a": -1}, {"a": -1, "b": "s"}]

@cabo
Copy link
Member

cabo commented Mar 20, 2021

Interesting. While it may be a bit early to actually decide anything about regular expressions, this also seems to show some differences in expression evaluation. Can you do a similar check with == "a" and == "b"?

@bettio
Copy link

bettio commented Mar 23, 2021

@cabo: it's fine. Let me know what kind of output are you looking for, so I can summarize it in a helpful way.

@danielaparker
Copy link

danielaparker commented Mar 25, 2021

By the way JSON Schema is borrowing syntax from ECMA 262.

@bettio, what is your opinion about ECMA 262? All other things equal, and when it doesn't matter too much, I am strongly in favour of staying consistent with other standards, whether IETF recommendation or de facto standards such as JSON Schema.

Daniel

@goessner
Copy link
Collaborator

... this might be a way to go. Sharing a small, well defined subset of tc39 with JSON Schema.

@timbray
Copy link
Contributor

timbray commented Mar 25, 2021 via email

@gregsdennis
Copy link
Collaborator

gregsdennis commented Mar 25, 2021

Some historical context: JSON Schema specifies 262 because that was what was available when the feature was added. It maintains 262 because that's what implementations now support.

I don't think this is cause enough for us to adopt it here. We should look at which RegEx specification serves our needs and use that. If it also happens to be 262, that's fine.


On a side note of implementation support, neither of my implementations support RegEx, but I'm happy to add it. I expect other maintainers wouldn't complain about it if it became a requirement.

@bettio
Copy link

bettio commented Mar 25, 2021

@timbray:

Another reason a modest subset seems like
a good idea for the JSONPath RFC.

The modest subset is what actually really works in a consistent manner: $.key is supported by 100% implementations, and Goessner's article is enough for that (I believe that some "basic" features are not supported by every implementation just because they are incomplete implementations, so no RFC is going to fix an unfinished or unmaintained project).

I don't see any useful reason about documenting just the "modest subset" (and I don't think it is worth it), also I think that is not compatible with our WG charter:

I will quote Barry Leiba, that asked to change the original charter.

The original charter:

The WG will develop a standards-track JSONPath specification, with the primary goal of capturing the common semantics of existing implementations and, where there are differences, choosing semantics with the goal of causing the least disruption among JSONPath users.

Barry Leiba:

I fear that this text appears to say that we primarily want to develop a dumbed-down compromise mush, and I'm sure that's not what we really mean. The primary goal is, surely, to develop a specification for JSONPath that is technically sound, complete, and useful.

And the following has been proposed:

The WG will develop a standards-track JSONPath specification that is technically sound and complete, based on the common semantics and other aspects of existing implementations. Where there are differences, the working group will analyze those differences and make choices that rough consensus considers technically best, with an aim toward minimizing disruption among the different JSONPath implementations.

Honestly what I found hard was taking decisions about quite commonly implemented "extensions" and a number of corner cases, and users love extensions (such as regex) because they find them useful.

@timbray: however if you are willing to work on it, we might document a "JSONPath Core" subset, that can be safely used across implementations for portability reasons, and for transitional purposes, but I don't agree about just limiting to it.

Sorry for the OT, let's continue the regex topic.

Edit: I wrote a quick proposal that tries to match the 2 point of view I see here (and in other discussions): #78 (comment)

@bettio
Copy link

bettio commented Mar 26, 2021

Talking about regex: I think we should start again from requirements. I'll start with some of them:

  • Good enough for matching URLs, domain names, mail addresses, IP addresses, UUIDs and all the stuff that people keep asking on stackoverflow
  • Reasonable unicode support (it is 2021, honestly I rather not support them than supporting a limited version that is mostly good for Latin1).

@glyn
Copy link
Collaborator Author

glyn commented Mar 27, 2021

I think we all agree:

  • JSONPath must stipulate a regular expression syntax for use in filters, and
  • it's not the role of JSONPath to define a new regular expression standard.

Fortunately, there's a clean syntactic and semantic separation between JSONPath and regular expression (de facto or de jure) standards.

I suggest we then stipulate RE2 for its security properties and unicode support (with "SHOULD" language) and allow for other regular expression standards (with "MAY" language).

JSONPath implementations that share a regular expression standard will interoperate on JSONPaths involving regular expressions. JSONPath implementations which don't share a regular expression standard may or may not interoperate on particular JSONPaths, depending on the regular expressions used.

In terms of compliance testing, it may be sufficient to include a few simple regular expressions which will work "everywhere".

@gregsdennis
Copy link
Collaborator

I suggest we then stipulate RE2 for its security properties and unicode support

I see your intent here, but I have to say that I'm not a fan of pure wrappers (especially as I work in .Net), and it seems that all of the implementations for RE2 are just wrappers for the C++ implementation. There are a lot of benefits to developers when a library is written in the language. I don't think we should ignore that.

Let's make sure that this comparison is considered in this decision.

@glyn
Copy link
Collaborator Author

glyn commented Mar 28, 2021

Go has RE2 built in (https://golang.org/pkg/regexp/), but point taken. I agree that a wrapper is to be avoided, especially if it makes shipping a static binary impossible.

Which other regex (de facto or de jure) standards support unicode?

@cabo
Copy link
Member

cabo commented Mar 28, 2021

It's 2021, so all (notable) RE standards support Unicode.

I think, for once, recent drafts from json-schema.org got this right:
Let's define a common subset, and reference that to an existing written up standard such as ECMAScript.
(json-schema.org got the idea right, not necessarily the specific subset -- there is quite some fluff in that.)

If nobody else wants, I'll do a PR with an ABNF definition of such a subset.
We can then debate what to leave in/throw out/add.

@gregsdennis
Copy link
Collaborator

Example StackOverflow question that could be solved using RegEx.

@gregsdennis
Copy link
Collaborator

Another question on the usage of RegEx in filters.

@glyn
Copy link
Collaborator Author

glyn commented Dec 1, 2021

I find it interesting that JMESPath does not support regular expressions in filters. See, for example, this issue.

@cabo
Copy link
Member

cabo commented Dec 1, 2021

The discussion found in this issue pretty much is the reason why iregexp is the right approach.
(The alternative, escaping into a platform-specific function, is not available for the present standardization effort.)

@glyn
Copy link
Collaborator Author

glyn commented Dec 1, 2021

Is iregexp the right approach for JMESPath too? If so, I wonder if the JMESPath community would be interested in collaborating on iregexp?

@cabo
Copy link
Member

cabo commented Dec 1, 2021

We could ask! We can use all the help we can get.

@cabo
Copy link
Member

cabo commented Dec 1, 2021

Well, maybe I can get iregexp-02 out first...

@glyn
Copy link
Collaborator Author

glyn commented Dec 1, 2021

Sounds good!

@cabo
Copy link
Member

cabo commented Jan 17, 2022

112 output:

Action: Take this discussion to the list

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants