Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Like operator #160

Closed
wants to merge 8 commits into from
Closed

Conversation

sauliusg
Copy link
Contributor

@sauliusg sauliusg commented Jul 15, 2019

Offering implementation of the "LIKE", "NOT LIKE" and "UNLIKE" string matching operators in the Filter grammar, and briefly describing their semantics.

Alternatively, offering implementation of the "MATCH /regexp/" operator.

@sauliusg sauliusg added the PR/ready-for-review Add this flag if you are the author of the PR and you want it to be reviewed. Remove it when editing label Jul 15, 2019
Copy link
Member

@merkys merkys left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine to me!

If implemented, the "LIKE" operator MUST behave as the correspoding standard SQL operator. In particular,
The `x` string MUST be interpreted as a pattern where an underscore character ('_', ASCII DEC 95, HEX 5F)
matches any single character and a percent character ('%', ASCII DEC 37, HEX 25) matches an arbitrary
sequence of characters (including zero characters).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't then we also explain how to escape these characters, and how to escape the escape character? (should be \ IIRC

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. MySQL specification says that to search for \ in LIKE one would have to write \\\\ (instead of \\, which is common way to escape the escaping sequences in C and the like). Which will we choose?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @giovannipizzi & @merkys for spotting this deficiency; I'll look up the existing escape rules and suggest a version in the text.

I agree with @giovannipizzi that we can use single backslash to escape characters, but we need to think over all possible combinations. I'll propose a variant in a while.

@merkys merkys self-requested a review July 16, 2019 06:18
Copy link
Member

@merkys merkys left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel that clarification for the escape sequence is needed. The simplest solution would be to point to MySQL specification, nevertheless I'd like to see it written explicitly.

@giovannipizzi
Copy link
Contributor

They say that this is just because they first C-unescape the string, eg to accept \n as a newline. We don't, so we should just say \\ to escape a backslash.
Whoever converts the string to MySQL must then first C-escape whatever its obtained from the query string.

@giovannipizzi
Copy link
Contributor

Maybe it's still good to point this out to implementers though, as I'm sure this will be a common error (but anyway if you don't want SQL injection you should encode your string, is the user passes a quote in itts string....)

@rartino
Copy link
Contributor

rartino commented Jul 16, 2019

Is the hope to get this in for v0.10? Because, after having mulled over this a bit since the meeting last Friday, I think this is something we may want to discuss a bit more before merging.

(To me, the point with the several rushed PRs this week is to make sure the spec is in a sane state for a v0.10 release, i.e., internally consistent and preferably without having rough completely backwards-incompatible changes just around the corner--- not so much to get new features in last-minute. I'd be happy to see a release v0.11 with this feature in shortly after v0.10.)

Now, about the content of this PR:

OPTiMaDe filters are designed to be user friendly for our target audience of 'client' end users, right? Hence, even as someone who knows SQL quite well, I'm not super stoked on building in the SQL convention into OPTiMaDe:

  • This choice has a very non-backend-agnostic "feel". Won't this antagonize those without SQL backends? "So, to implement OPTiMaDe we have to translate wildcard expressions from the SQL convention to our backend?"

  • I expect most client users of OPTiMaDe to not know SQL and the conventions of its LIKE operator.

  • I expect those users to find SQL's % and _ unexpected for a wildcard mini-language. Just Google "wildcard" and disregard the hits about SQL; you'll find a wide array of descriptions of built-in functions across various software that very universally uses * for multi-character wildcards, and often ? for single character matches. Two notable examples includes:

    I suspect users are likely to have come across one of these.

  • Maybe worst: as coming out from the point @giovannipizzi brought up, I don't think there is a single escape syntax standard for SQL LIKE valid across SQL backends, so this is not a great standard anyway... And, we must make sure escaping in our LIKE does not interfere with string escaping in our string tokens. Reading the specification we cannot say that identifier LIKE "\%" is a search for a literal percent sign, because the right hand side is not a valid string token. A valid string token can only contain \ followed by a " or a \. So, for us the right hand side would have to be "\\%", and how do we then encode a literal backslash + a wildcard %?

So, my suggestion is for us to skip SQL LIKE entirely, and give the operator a different name, e.g., PATTERN with the conventions:

  • * for any number of arbitrary characters (including zero).
  • ? for exactly one arbitrary character.
  • ** for a literal *
  • *? for a literal ? (whereas ?? indicates two arbitrary characters, and ?* is the standard way to indicate an arbitrary character sequence of at least one character)
  • the use of \ remains as for any string token: \\ for literal \ and \" for literal ".

This seems somewhat user-friendly, and would be trivial to search-replace into a corresponding SQL LIKE operator for us with SQL backends. (But I'd be ok with any other sensible quoting scheme as well.)

@giovannipizzi
Copy link
Contributor

I see @rartino's point and I tend to agree that, if there is no agreement, it's better not to merge this in 0.10 yet.
And to admit, I also think * and ? are more intuitive (and should be easy to replace for SQL syntax in the backend).

A note on the choice of escaping: I think this is not yet "perfect":

  • ** as a literal *: would then *** mean [a] "a literal * followed by any sequence of characters" or [b] "any sequence of characters followed by a literal *"? Might seem a little detail, but for instance the string ab*cd would match the pattern ab*** in interpretation [a] but not in interpretation [b].
  • Similar concern for ???
  • For ?? there is also an additional concern: how then can we specify that we want exactly two characters?

I think the only solution is to have a third character for escaping. I see that using backslash is counterintuitive because one has to escape the string again (indeed to encode \\ one would have to write \\\\ e.g. in JSON) and multiple escaping is the hardest thing ever to get right ;-) )

@rartino
Copy link
Contributor

rartino commented Jul 17, 2019

@giovannipizzi are indeed correct about the quoting scheme not yet being perfect.

(For anyone confused by the discussion about ??, ???, these are based on the first version of my comment; shortly after posting it I noticed the problem with ?? myself and re-edited it to use \*?, thinking no one would notice, sorry. But even with those edits, *** remains broken.)

I have a different suggestion for quoting now: going forward, even with other functionality, I think it may be very useful if we can extend our string tokens to represent "literal characters" aside "normal characters" using a single backslash.

It would require a change in the grammar, so \<character> is allowed for any character in a string token, marking that character as "literal". (Right now it is forbidden by the grammar for <character> to be anything else but a " and \.)

We then need some carefully formulated text in the specification, where we say that for all regular uses of a string token, a literal character \<character> means the same thing as <character>. However, in other situations where some special meaning has been assigned to characters, a literal character refers to that character without that special meaning.

Then in a PATTERN, ? and * are assigned the special meaning as specified above. It then follows from the paragraph above that \? and \* are used to represent the literal ? and *. I think this would be elegant.

Note: this solution allows marking any character in an OPTiMaDe string as "literal" with the following exception: the characters " and \ are always literal, one cannot indicate non-literal versions of these characters. I think that is an OK trade-off. Practically this could mean limitations on what syntax we can allow for other mini-languages embedded in OPTiMaDe strings, e.g., if we want to adopt a standard language for time formatting that uses " and \ with their own special meanings, etc.

We might want to think forward to make sure this allows embedding a reasonable regex format. I think it does.

@sauliusg
Copy link
Contributor Author

@rartino wrote:

So, my suggestion is for us to skip SQL LIKE entirely, and give the operator a different name, e.g., PATTERN with the conventions:

  • for any number of arbitrary characters (including zero).
    ? for exactly one arbitrary character.
    ** for a literal *
    ? for a literal ? (whereas ?? indicates two arbitrary characters, and ? is the standard way to indicate an arbitrary character sequence of at least one character)
    the use of \ remains as for any string token: \ for literal \ and " for literal ".

I though about this in the beginning (Issue #87), and then I realised that this is a very bad idea. Essentially we put on our shoulders a burden of developing yet another wildcard language, with the explicit goal to be not similar to existing ones, and with no apparent gains. This is a large work; as @giovannipizzi has just pointed out it is very easy to get it wrong; it gives no clear benefit, and adds extra burden on every implementer (because now everyone will have to implement, test and support this notation), and on every OPTiMaDe user (since everyone will have to learn a new wildcard language). I think this is very inefficient and unnecessary, since we will hardly give in the new wildcard syntax any new essential features that one does not have in SQL patterns, globs or regexps.

And then, what is the reason to be not similar to SQL, or any other pattern/regexp standard? One might object to it if it were a MUST feature and one would have to re-implement it in an SQL-unfriendly backend; but since it is supposed to be an optional feature, those backends that find it difficult to implement can simply skip it, and offer their versions of 'MATCH REGEXP' instead. I think sticking to a known standard, if that standard is well documented, has clear benefits: a/ one can find working and tested implementations; b/ the documentation is out there c/ many people will already know the syntax d/ those who learn it anew will be able to use it also in other contexts (thus will have more incentives to learn it)

I therefore suggest the following strategy:

  • let's settle on several widespread standards with accessible implementations. I suggest picking SQL LIKE patterns, ERE regular expressions, and PCRE regular expressions. It covers a large area of useful cases.

  • All such matches will be optional. Thus, the implementer can choose to implement those matching operators that are easy to match to their respective backend, and skip all the rest. Nobody should be put at disadvantage by this.

  • let's have in mind possible extensions of the patterns. This is the reason I suggested (in LIKE operators #87) to have regexp type specified in the match expression (e.g. MATCH PCRE); this will allow adding new flavours of regexps would such extensions be needed in the future (I personally doubt that it will be ever needed, but you never know what new "revolutionary" DBs will be rolled out with incompatible and non-standard interfaces ;). One possible backend which we up to now did not discuss much is SparQL; I'm not sure if it uses its own match expressions or some standard ones. Still, it fits into the proposed scheme, either with 'MATCH ERE' if SparQL supports ERE, or with 'MATCH SPARQL' if SparQL has (will have) its own matching syntax, and we decide to support it in the future.

  • OPTiMaDe in itself is neutral as to the implementation of the matches, but we need to specify what the semantics of the matches will be in case they are implemented. Thus I think we can do it without extension of the existing Filter grammar.

In such setting, which is very simple, and is actually implied in this PR, I think we could make it even into v0.10 :).

Regards,
Saulius

@sauliusg
Copy link
Contributor Author

sauliusg commented Jul 17, 2019

@rartino wrote:

It would require a change in the grammar, so \<character> is allowed for any character in a string token, marking that character as "literal". (Right now it is forbidden by the grammar for <character> to be anything else but a " and \.)

It could be done, but at the moment I think it is totally unnecessary. Filter language itself does not define or handle regular expressions, just strings; strings can encode arbitrary character sequences, including backslash with any characters; we just need to escape the backslash as a double backslash. Thus, an ERE matching arbitrary number of stars after the a character would be a\**, and embedded into a Filter string it will become "a\\**". The Filter language quotes and backslashes are one encapsulation level higher than those of regexps. The situation is similar as with Java regular expressions.

@sauliusg
Copy link
Contributor Author

@rartino

* Reading the specification we cannot say that `identifier LIKE "\%"` is a search for a literal percent sign, because the right hand side is not a valid string token. A valid string token can only contain `\` followed by a `"` or a `\`. 

True

So, for us the right hand side would have to be "\\%",

True, this is a logical consequence of the current grammar. Java does the same with regexps in strings, awk does it the same way, and I suspect Python has no other option but to behave the same.

and how do we then encode a literal backslash + a wildcard %?

a LIKE "\\\\%" Not the nicest looking but usable (and in fact used in many occasions).

I still need to double-check how SQL escapes %.

@sauliusg
Copy link
Contributor Author

* This choice has a very non-backend-agnostic "feel". Won't this antagonize those without SQL backends?

Why should it? As said, if the people find it difficult to implement, they just skip it (the feature is supposed to be optional). Unless they feel bitter envious that SQL has better pattern matching than their beloved super-modern backend, there should be no problems :P. And then again, I suggest also introducing (again optional) match operator syntax that would be easy to implement in alternative, non-SQL backends.

@sauliusg
Copy link
Contributor Author

sauliusg commented Jul 17, 2019

  • I expect most client users of OPTiMaDe to not know SQL and the conventions of its LIKE operator.

Why? At least currently, most of the OPTiMaDe consrtium members do know about at least something SQL :)

  • I expect those users to find SQL's % and _ unexpected for a wildcard mini-language.

One can always learn :). One just search for "SQL LIKE", and in 5 mins one has the answer. Also, we will include a short description and a couple of examples into the OPTiMaDe spec (I'll update my PR with these features).

Just Google "wildcard" and disregard the hits about SQL;

Ehhh... why should we disregard SQL? Actually, DuckDuckGo on the "wildcard" search gives SQL on the first page, slightly below Wicktionary's description, and the top-most link in my current search (https://www.computerhope.com/jargon/w/wildcard.htm) mentions SQL convention ('%') as well, along with the asterisk convention.

you'll find a wide array of descriptions of built-in functions across various software that very universally uses * for multi-character wildcards, and often ? for single character matches. Two notable examples includes:

That is why I suggested (#87) also describing the MATCH GLOB operator, with shell glob semantics; I can include it into the PR; I just had impression that we decided to have only LIKE and no more advanced operators when we skyped in Friday.

Should we also include MATCH GLOB into the syntax?

  • Microsoft excel, access, etc. as well as all software that mimics the same functions (google sheets, libreoffice) (though, libreoffice apparently also supports regex syntax.)

Again, this would be an argument to include an optional MATCH ERE and MATCH PCRE operator.

What does MongoDB and/or CoachDB support, actually? These would be probably the most likely backends that some of us actually use.

@rartino
Copy link
Contributor

rartino commented Jul 17, 2019

I'll get to the individual points, but I see an important philosophical difference here, so lets start with sorting that out:

  • let's settle on several widespread standards with accessible implementations. I suggest picking SQL LIKE patterns, ERE regular expressions, and PCRE regular expressions. It covers a large area of useful cases.
  • All such matches will be optional. Thus, the implementer can choose to implement those matching operators that are easy to match to their respective backend, and skip all the rest. Nobody should be put at disadvantage by this.

When we agreed on the need for optional features in the filtering language, you were very concerned about the fragmentation in support for filters that this would lead to. My argument was that optional features actually helps avoid fragmentation, because instead of every provider inventing their own extension for feature X, we now say in the standard that "you don't have to support feature X, but if you do, this is how you do it." That way, all providers that support feature X do it the same way, and queries are at least universal across all databases that support that feature.

However, to get this effect, we need to be conservative with adding several optional features that are overlapping, i.e. exactly what you propose to do here. If we do, we get back to the same fragmentation we were trying to avoid. Now all databases will support only their own favorite brand of partial text matching operators, and partial text matching queries are not going to be universal even across databases that support partial text matching. Clients will constantly have to rewrite their queries to fit the text matching capabilities of each database. This, to me, is a too high cost for everyone to get their favorite syntax in.

So, I'm very much against supporting e.g. multiple regex flavors, and rather want us to standardize on a flavor that is translatable into the specific ones used by the backends. I realize I put more work on implementors here, but this is necessary to achieve the main goal of a universal api: to provide a common interface for access.

Edit: While the discussion in this comment is formulated about regexs, it very much applies to the suggestion to provide both an SQL LIKE operator and a GLOB operator (and perhaps also other pattern operators). This means end users of OPTiMaDe would have to learn all the pattern languages to use OPTiMaDe against multiple databases. This does not scale well.

@rartino
Copy link
Contributor

rartino commented Jul 17, 2019

On the specific points:

Essentially we put on our shoulders a burden of developing yet another wildcard language, with the explicit goal to be not similar to existing ones, and with no apparent gains. This is a large work; as @giovannipizzi has just pointed out it is very easy to get it wrong;

I'm with you if we are talking about full regex support, I don't think we want to invent a new flavor of regexes. But, if we talk about patterns with character wildcards, that is trivial apart from how to do escaping. And, unless I am mistaken, there is no SQL standard for that escaping. If we pick one, e.g. MySQL, we still have to do the same amount of work to make sure we embedd it in a way consistent with our string quoting mechanism. This isn't easier than starting from scratch; and for all intents and purposes, we are creating a new pattern language in the eyes of users, because the actual strings they need to input are different in, e.g., how many backslashes are needed.

In fact, what is likely to happen with the SQL approach is that users wondering how to represent a literal % will find a stack overflow question about, e.g., postgree SQL, try that syntax with whatever number of backslashes they use, and then be confused over the exact same string behaving differently in OPTiMaDe.

Filter language itself does not define or handle regular expressions, just strings; strings can encode arbitrary character sequences, including backslash with any characters; we just need to escape the backslash as a double backslash.

Right, but the point in much of the design we do is for the filter language to be user friendly. I know from heavy amount of experience that no one appreciates this "backslash hell" where several quoting mechanisms are stacked to yield unreadable expressions. If we can support single backslash qouting, is it not nice to do so?

Ehhh... why should we disregard SQL?

My point here was that it is only SQL that uses % and _. There are far more pattern mini-language using * and ?. Hence, for users not familiar with SQL LiKE, this is going to be a surprise.

At least currently, most of the OPTiMaDe consrtium members do know about at least something SQL :)

We are all database providers. Who are you designing OPTiMaDe for? I'm having the PhD students at our department in mind who do various projects that would benefit from easy aggregating of data across databases. I think few of them know SQL.

@sauliusg
Copy link
Contributor Author

I'm with you if we are talking about full regex support,

ACK. And so, one flavour of matches should include RE support; but we do not make it mandatory and do not want to exclude the use of simpler matching mechanisms, do we? The proposed solution, with 'LIKE', 'MATCH ERE', 'MATCH PCRE' would permit standard implementation of various levels of patterns.

I don't think we want to invent a new flavor of regexes.

I agree, we don't :)

But, if we talk about patterns with character wildcards, that is trivial apart from how to do escaping.

I don't find it trivial; it's not just regexp search/replace, since we will need to deal with escaped characters and character ranges within [] brackets. I feel that a client will have to do a full pattern parsing, and then map it into the backend's patterns/regexps if we want to have specific OPTiMaDe pattern syntax.

And, unless I am mistaken, there is no SQL standard for that escaping.

I think there is; I would be very much surprised if it did not.

I do not have the latest SQL standard at hand (costs 180$ with ISO :/), but open sources on the Web (https://modern-sql.com/standard, http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt) specify that the syntax is column LIKE 'pattern%' or column LIKE '30\%pattern% ESCAPE '\', where ESCAPE allows to specify any escape character one wishes. MySQL defaults ESCAPE to ''; other SQL description do not mention any default, so probably default backslash escape is MySQL specific extension. That does not cause difficulties if we specify backslash as an escape character for OPTIMADE since all SQL databases should support the standard ESCAPE clause.

If we pick one, e.g. MySQL, we still have to do the same amount of work to make sure we embedd it in a way consistent with our string quoting mechanism.

As said, SQL ESCAPE mechanism should allow to specify escape character for any SQL engine.

This isn't easier than starting from scratch; and for all intents and purposes,

I would say it is much easier than starting from scratch, we just take a (subset) of a standard implementation.

we are creating a new pattern language in the eyes of users,

Not if we take an existing standard(s) and reference it/them.

because the actual strings they need to input are different in, e.g., how many backslashes are needed.

We should not confuse defining the meaning of patterns, and defining the way how to encode patterns into strings. Currently the OPTiMaDe string definition allows encoding any character sequence (using escapes); so we have no problem here. If we think that too many backslashes in escaped strings are awkward to read (I agree with this), then we can think about how to improve OPTiMaDe string escaping; this is an issue independent from match-pattern semantics, IMHO.

In fact, what is likely to happen with the SQL approach is that users wondering how to represent a literal % will find a stack overflow question about, e.g., postgree SQL, try that syntax with whatever number of backslashes they use, and then be confused over the exact same string behaving differently in OPTiMaDe.

I don't find this argument convincing. If people take Postgress SQL specific (non-standard) rules and try to apply to something else, be it OPTiMaDe, MySQL, MogoDB or what not, of course they will get nonsense, and here you can not help them in any way – except maybe by writing decent OPTiMaDe documentation (that specified how OPTiMaDe patterns work) and advising them to use '-stackoverflow' in their Google searches :P.

@sauliusg
Copy link
Contributor Author

My point here was that it is only SQL that uses % and _. There are far more pattern mini-language using * and ?. Hence, for users not familiar with SQL LiKE, this is going to be a surprise.

SQL seems to be still the most widely used database query language. Its patterns, as I see now after some days of searching on the Web, are very well described and present in all searches. Disregarding this as "only SQL" does not seem wise to me. Nobody will be surprised at SQL pattern languages if they do a short search on the Web.

Yes, there are essentially two match syntaxes – SQL LIKE and PCRE. All the rest is variations and subsets of these two. SQL seems to be a fairly large chunk of DB implementations; probably SQL LIKE is the most frequently supported backend operator, if we take by number of installations, no?

Another is ERE/PCRE. Also fairly popular. That's why I suggest having possibility to use either or both of them in OPTiMaDe queries.

@merkys
Copy link
Member

merkys commented Jul 19, 2019

and advising them to use '-stackoverflow' in their Google searches :P.

This won't help. Web is full of other how-tos. If we refer to some other standard in our specification, then we have to follow it as closely as possible. It's the "SQL" in "SQLish" that may get the user searching the Web and trying first hits of SQL syntax how-tos.

@sauliusg
Copy link
Contributor Author

When we agreed on the need for optional features in the filtering language, you were very concerned about the fragmentation in support for filters that this would lead to. My argument was that optional features actually helps avoid fragmentation, because instead of every provider inventing their own extension for feature X, we now say in the standard that "you don't have to support feature X, but if you do, this is how you do it." That way, all providers that support feature X do it the same way, and queries are at least universal across all databases that support that feature.

I agree, and this argument convinced me that optional features are not bad. They will be even better if supplemented by machine-readable descriptions of "feature sets" (see #91, second suggestion).

However, to get this effect, we need to be conservative with adding several optional features that are overlapping,

I agree if the features are completely overlapping, but here we have different features with different trade-offs between search power and implementation complexity.

i.e. exactly what you propose to do here.

This is not true. I propose three well-defined optional features, so that, as you say, "if you do [them], this is how you do it.". Implementers then can choose which ones to implement.

If we do, we get back to the same fragmentation we were trying to avoid.

Not really.

Now all databases will support only their own favorite brand of partial text matching operators,

NO! If compliant to OPTiMaDe, they will support the ones described in the spec.

and partial text matching queries are not going to be universal even across databases that support partial text matching. Clients will constantly have to rewrite their queries to fit the text matching capabilities of each database.

With optional features you will have to adapt your queries to different implementation, no way around it as long as you introduce options. This could only be made somewhat easier if each OPTiMaDe endpoint advertises in a machine-readable form which optional features they support (the above-mentioned "feature sets:").

And then, you do not need to re-write constantly, there will be no constant changes once the specs are out, will there?

This, to me, is a too high cost for everyone to get their favorite syntax in.

Its not about favourite syntax, its about easier implementation. We can not require all implementers say "look, we have invented our own matching semantics, no go and implement it in your backend".

So, I'm very much against supporting e.g. multiple regex flavors, and rather want us to standardize on a flavor that is translatable into the specific ones used by the backends.

At the moment I think that "translatable into the specific ones used by the backends" is too difficult to be reasonably implementable.

I realize I put more work on implementors here, but this is necessary to achieve the main goal of a universal api: to provide a common interface for access.

I think the burden you suggest for implementers is unacceptably high. In that case, we are probably better of with just what is in the specs right now, STARTS/ENDS/CONTAINS. Regexp filtering can then be done at the client side if needed...

@sauliusg
Copy link
Contributor Author

and advising them to use '-stackoverflow' in their Google searches :P.

This won't help. Web is full of other how-tos. If we refer to some other standard in our specification, then we have to follow it as closely as possible. It's the "SQL" in "SQLish" that may get the user searching the Web and trying first hits of SQL syntax how-tos.

This would be a good reason to make LIKE behaviour as close to the SQL's as possible, this is my main point.

@rartino
Copy link
Contributor

rartino commented Jan 17, 2023

Alternatively we may select one of the other syntaxes, preferably stable and having wider support. For example PCRE2 which has implementations at least in Perl, Python and JavaScript.

The issue is not what is supported by programming languages, but across database backends, since it is your database that needs to be able to search efficiently for matches of the regex.

@merkys
Copy link
Member

merkys commented Jan 17, 2023

Alternatively we may select one of the other syntaxes, preferably stable and having wider support. For example PCRE2 which has implementations at least in Perl, Python and JavaScript.

The issue is not what is supported by programming languages, but across database backends, since it is your database that needs to be able to search efficiently for matches of the regex.

Right. Just checked MariaDB (backend for COD) and Mongo, both seem to support PCRE2 (although COD would have to update MariaDB to v10.5.1 to have PCRE2 instead of PCRE). It seems to me that JSON Schema draft is a subset of PCRE2.

@rartino
Copy link
Contributor

rartino commented Jan 17, 2023

Just checked MariaDB (backend for COD) and Mongo, both seem to support PCRE2 (although COD would have to update MariaDB to v10.5.1 to have PCRE2 instead of PCRE). It seems to me that JSON Schema draft is a subset of PCRE2.

Just the fact that one of the most active OPTIMADE participants would need to update their backend is perhaps itself an argument against, e.g., PCRE2.

Nevertheless, also consider, e.g.:

etc., etc.

(To be honest, I didn't explicitly check that none of the above doesn't happen to exactly be equal to PCRE2, just that they don't explicitly say so.)

@merkys
Copy link
Member

merkys commented Jan 27, 2023

Just the fact that one of the most active OPTIMADE participants would need to update their backend is perhaps itself an argument against, e.g., PCRE2.

Makes sense. However I have to check to make sure JSON Schema regex is really a subset of the old PCRE that our backend currently uses.

@sauliusg
Copy link
Contributor Author

This has been sitting for a while, perhaps long enough for others to solve part of the problem. We discovered in #436 that the JSON Schema standard now may have defined a reasonable "common" regex subset here:

https://datatracker.ietf.org/doc/html/draft-bhutton-json-schema-01#name-regular-expressions

After a quick look at the JSON Schema REs, I have the question similar to that one we had with the LIKE operator strings: how do we escape special characters? E.e. what is the regular expression according the the JSON schema variety to match "(", ")", "|" and similar characters?

Maybe I overlooked in the https://datatracker.ietf.org/doc/html/draft-bhutton-json-schema-01#name-keyword-independence document, but I could not found the escape methods there...

@vaitkus
Copy link
Contributor

vaitkus commented Jan 27, 2023

@sauliusg, according to the regular expression examples given in https://json-schema.org/understanding-json-schema/reference/regular_expressions.html, special characters can be escaped using a double backslash.

Copy link
Member

@ml-evs ml-evs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR @sauliusg, I definitely support the inclusion of such an operator once we've ironed the technicalities out. Sorry I didn't have a chance to look through this before the last meeting!

I can easily add this construct to the optimade-python-tools reference server if we want a live version to play around with (perhaps if COD has it too we can do some empirical tests on the regexp engines rather than just trusting the docs!)

My review below mostly focuses on wording/typos, though I do have some concerns about adding both LIKE and UNLIKE as separate optional features (based on my current reading of the spec).

Comment on lines +1292 to +1294
* `identfier LIKE x`

* `identfier UNLIKE x`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* `identfier LIKE x`
* `identfier UNLIKE x`
* `identifier LIKE pattern`: Is true if the property matches the provided `pattern`.
* `identifier UNLIKE x`: Is true if the property does not match the provided `pattern`.

Adds a bit of explanation matching the format for CONTAINS etc. above. I think pattern is clearer than x here, as x is used as a placeholder for the substrings above. Also fixes a typo in "identifier" - will also add a separate suggestion for this in isolation.

@@ -1287,8 +1287,22 @@ In addition to the standard equality and inequality operators, matching of parti

OPTIONAL features:

The following comparison operators are OPTIONAL:

* `identfier LIKE x`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* `identfier LIKE x`
* `identifier LIKE x`


* `identfier LIKE x`

* `identfier UNLIKE x`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* `identfier UNLIKE x`
* `identifier UNLIKE x`


* `identfier LIKE x`

* `identfier UNLIKE x`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a strong preference for UNLIKE rather than just NOT(identifier LIKE pattern)? Is UNLIKE is an SQL construct that I have not come across?

I guess the precedent for this is our inclusion of KNOWN and UNKNOWN -- I just don't think I like having two ways of doing different optional things, e.g., an implementation could choose to implement only LIKE and therefore has to implement ~LIKE but can still choose to ignore UNLIKE. At least KNOWN/UNKNOWN are both mandatory. I guess we could add to the spec that if LIKE is implemented then UNLIKE must also be implemented.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a strong preference for UNLIKE rather than just NOT(identifier LIKE pattern)? Is UNLIKE is an SQL construct that I have not come across?

UNLIKE is shorter, which is significant for URL-embedded querries...

I guess the precedent for this is our inclusion of KNOWN and UNKNOWN -- I just don't think I like having two ways of doing different optional things, e.g., an implementation could choose to implement only LIKE and therefore has to implement ~LIKE but can still choose to ignore UNLIKE. At least KNOWN/UNKNOWN are both mandatory. I guess we could add to the spec that if LIKE is implemented then UNLIKE must also be implemented.

I would say that if LIKE is implemented, UNLIKE also MUST be implemented; i.e. x UNLIKE "%s%" and NOT(x LIKE "%s%" MUST be synonyms – can we decide on that? If the LIKE is implemented, then implementing UNLIKE does not require any extra effort.

Comment on lines +1298 to +1301
If implemented, the "LIKE" operator MUST behave as the correspoding standard SQL operator. In particular,
The `x` string MUST be interpreted as a pattern where an underscore character ('_', ASCII DEC 95, HEX 5F)
matches any single character and a percent character ('%', ASCII DEC 37, HEX 25) matches an arbitrary
sequence of characters (including zero characters).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
If implemented, the "LIKE" operator MUST behave as the correspoding standard SQL operator. In particular,
The `x` string MUST be interpreted as a pattern where an underscore character ('_', ASCII DEC 95, HEX 5F)
matches any single character and a percent character ('%', ASCII DEC 37, HEX 25) matches an arbitrary
sequence of characters (including zero characters).
If implemented, the `LIKE` operator MUST behave as the corresponding standard SQL operator.
The `x` string MUST be interpreted as a string-matching pattern where an underscore character ('_', ASCII DEC 95, HEX 5F) matches any single character and a percent character ('%', ASCII DEC 37, HEX 25) matches an arbitrary sequence of characters (including zero characters).

Fix a few typos and enforces one line per sentence

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sauliusg makes a good point about the escapes. The text in the json schema standard points to the syntax standardized in ECMA-262, section 21.2.1 which says the syntax is "modeled" after Perl 5. However, the json schema standard then says - on SHOULD level - that the expressions should be "limit[ed] [...] to the following regular expression tokens", which doesn't actually include any description of the escape token.

This is likely an oversight. In the ECMA standard it is handled in the grammar where an Atom can be \ plus an AtomEscape, which is then further discussed in 21.2.2.8.1.

If we describe, as suggested, the RE syntax in the OPTIMADE independently, we can strengthen the requirements as we see it fit. This is a good opportunity to correct the oversight, if it really was.

optimade.md Outdated Show resolved Hide resolved
@rartino
Copy link
Contributor

rartino commented Jan 29, 2023

@sauliusg @vaitkus:

@sauliusg makes a good point about the escapes. The text in the json schema standard points to the syntax standardized in ECMA-262, section 21.2.1 which says the syntax is "modeled" after Perl 5. However, the json schema standard then says - on SHOULD level - that the expressions should be "limit[ed] [...] to the following regular expression tokens", which doesn't actually include any description of the escape token.

This is likely an oversight. In the ECMA standard it is handled in the grammar where an Atom can be \ plus an AtomEscape, which is then further discussed in 21.2.2.8.1.

If we end up quoting this subset of ECMA into our standard (as we discussed above that we may want to do), we probably want to clarify this.

Furthermore, @sauliusg did question whether the lazy wildcards was widely supported; so we may want to consider to also exclude those, making our subset even more stricter than the one described by json schema.

Co-authored-by: Matthew Evans <7916000+ml-evs@users.noreply.github.com>
@sauliusg
Copy link
Contributor Author

@sauliusg, according to the regular expression examples given in https://json-schema.org/understanding-json-schema/reference/regular_expressions.html, special characters can be escaped using a double backslash.

Thanks, @vaitkus for finding this document! AFAIR, this is the same "random page" that I tried to refer to during the last meeting, or at least it looks very similar :).

Under these circumstances I have no more objections to include a subset of these regexps, or the subset of regexps listed in https://datatracker.ietf.org/doc/html/draft-bhutton-json-schema-01#name-regular-expressions, in OPTIMADE.

The only point I'd like to stress rather strongly:

I would not reference any JSON document as a source of the description. I would rather explicitly list the permissible regexps in the OPTIMADE standards; we can (and should) reference the rationale, where these definitions come from – PCRE, JSON Schema, JSON Schema draft, etc., but the definitive list should be in the OPTIMADE standard.

All RE construct listed in the mentioned documents seem to be modeled after the PCRE, and PCRE is probably the most convenient form of REs, so I suggest we go for it.

That would leave us with one 'string MATCH "re"' or 'string MATCH /re/' operator, RECOMMENDED, to implement RE matches.

@rartino
Copy link
Contributor

rartino commented Jan 30, 2023

I would not reference any JSON document as a source of the description. I would rather explicitly list the permissible regexps in the OPTIMADE standards; we can (and should) reference the rationale, where these definitions come from – PCRE, JSON Schema, JSON Schema draft, etc., but the definitive list should be in the OPTIMADE standard.

Yes, also from prior discussion above I think we are generally in agreement on that the rules need to be stated in our standard.

However - this a technicality - it is relevant for us to say that what we end up with is not just based on/inspired by the PCRE regex subset in the JSON Schema draft, but is equal to, or a subset (whichever is true) of the JSON Schema draft standard. The reason is that this matters for the other place we want to invoke this regex standard, namely when specifying RE string validations in property definitions.

Question: what should a server do if a client uses REGEX that are valid PCRE2 but NOT constrained as specified? Since we've concluded that the ECMA standard is based on Perl 5 (i.e., PCRE2 if I'm not mistaken) we may as well say that supporting the subset is a MUST, but full support of PCRE2 is allowed, and on MAY level?

That would leave us with one 'string MATCH "re"' or 'string MATCH /re/' operator, RECOMMENDED, to implement RE matches.

Right, about that - the PR right now implements them as "/"-delimited. I was initially surprised by that decision, but realized I had no reason to oppose it. Does anyone see things differently?

@sauliusg
Copy link
Contributor Author

sauliusg commented Jan 30, 2023

I agree with most of your latest comments about level 2 support, so we may be converging on a design here. In particular we seem to agree that:

* level 2 and level 3 partial text matching should use segregated syntax.

* OPTiMaDe basic pattern support should work equivalently to standard SQL LIKE with ESCAPE `\` (but possibly with other wildcard characters.)

So, perhaps then all that is left is to decide is between these alternatives?:

* SQL LIKE-based basic patterns with wildcard characters `%` and `_` as it is in that standard.

* SQL-LIKE-like basic patterns with wildcard characters `*` and `?`.

* Skip basic pattern support, have only regex.

So, to make things converge – is there a consensus, or at least not a strong opposition, to the following OPTIONAL features:

  • The LIKE operator: string LIKE "%s_". Characters of the SQL standard are used (% for any sequence of characters, _ for any single character; alternatively – * for any sequence of characters, ? for any single character, which is essentially a shell glob); the UNLIKE operator MUST be a synonym of NOT ... LIKE (i.e. NOT (string LIKE "*s?") <=> string UNLIKE "*s?"). Escape character is backslash ('\'), which escapes itself ("\\" → a string with a single character '\');

  • The MATCH operator that uses a subset of PCRE (based on [1] and [2]); string MATCH /abc.*xyz[0-9]{3,7}/; I'll check if slashes '/' can be easily accommodated into out Filter language grammar – otherwise we use the standard Filter strings. Escape character is backslash ('\'), which escapes itself ("\" → a string with a single character '\').

Shall we go for it?

@sauliusg
Copy link
Contributor Author

That would leave us with one 'string MATCH "re"' or 'string MATCH /re/' operator, RECOMMENDED, to implement RE matches.

Right, about that - the PR right now implements them as "/"-delimited. I was initially surprised by that decision, but realized I had no reason to oppose it. Does anyone see things differently?

You are right, I start forgetting what is in the PR :).

The /re/ syntax should minimize the escape backslashes that are needed.

@merkys
Copy link
Member

merkys commented Jan 30, 2023

Question: what should a server do if a client uses REGEX that are valid PCRE2 but NOT constrained as specified? Since we've concluded that the ECMA standard is based on Perl 5 (i.e., PCRE2 if I'm not mistaken) we may as well say that supporting the subset is a MUST, but full support of PCRE2 is allowed, and on MAY level?

If we do so, OPTIMADE RE will have to stay a subset of PCRE2 at least until the next major release. I am not against it, just stating that all additions/changes will have to be PCRE2 compatible. In addition, PCRE2 standard has life of its own, so if we tie OPTIMADE RE to PCRE2, we have to tie to some specific version.

@rartino
Copy link
Contributor

rartino commented Jan 30, 2023

@sauliusg

I suggest to skip LIKE entirely and just add MATCH. For simple matching we already provide BEGINS WITH, ENDS WITH, and CONTAINS which are user friendly in their definitions (i.e., no need to remember any particular wildcard syntax). Those with more sophisticated needs for general wildcard matching can use MATCH. Lets try to keep down the implementation work for servers.

The MATCH operator that uses a subset of PCRE (based on [1] and [2]); string MATCH /abc.*xyz[0-9]{3,7}/; I'll check if slashes '/' can be easily accommodated into out Filter language grammar – otherwise we use the standard Filter strings. Escape character is backslash (''), which escapes itself ("" → a string with a single character '').

Doesn't the grammar change in the PR already do this with "/"?

@merkys
Copy link
Member

merkys commented Jan 30, 2023

  • The MATCH operator that uses a subset of PCRE (based on [1] and [2]); string MATCH /abc.*xyz[0-9]{3,7}/; I'll check if slashes '/' can be easily accommodated into out Filter language grammar – otherwise we use the standard Filter strings. Escape character is backslash ('\'), which escapes itself ("" → a string with a single character '\').

Using / in grammar might clash with math operator / if we want to introduce math operators in future. SQL seems to support math operators.

@rartino
Copy link
Contributor

rartino commented Jan 30, 2023

If we do so, OPTIMADE RE will have to stay a subset of PCRE2 at least until the next major release. I am not against it, just stating that all additions/changes will have to be PCRE2 compatible. In addition, PCRE2 standard has life of its own, so if we tie OPTIMADE RE to PCRE2, we have to tie to some specific version.

This is all true. And, arguably (but I suspect this will be argued...) our reference should be ECMA-262 v11 just like JSON Schema rather than a specific version of the PCRE2 library.

Nevertheless, I feels strange about forcing those with full PRCE2 implementations in their backends to put in the extra work to lock away useful features. I suspect people will be reluctant to do that in practice.

We could say (not saying it is a good idea) something along the lines of: "For non-compliant REGEXs servers MAY behave in any undefined way, including serving matches according to the PCRE2 standard (or any other standard). Non-compliant REGEXs SHOULD trigger a warning. Servers are free to document their behavior for non-compliant REGEXs."

Such a statement does work against general compatibility, but will make implementing MATCH for various underlying REGEX standards a lot easier. Also for those backends where one need to translate REGEXs it would be nice to only have to worry about what the translator is doing for compliant REGEXs.

@sauliusg
Copy link
Contributor Author

  • The MATCH operator that uses a subset of PCRE (based on [1] and [2]); string MATCH /abc.*xyz[0-9]{3,7}/; I'll check if slashes '/' can be easily accommodated into out Filter language grammar – otherwise we use the standard Filter strings. Escape character is backslash ('\'), which escapes itself ("" → a string with a single character '\').

Using / in grammar might clash with math operator / if we want to introduce math operators in future. SQL seems to support math operators.

I think the fact that the RE slash follows the MATCH keyword and REs never participate in arithmetic expressions should resolve the ambiguity...

@sauliusg
Copy link
Contributor Author

Nevertheless, I feels strange about forcing those with full PRCE2 implementations in their backends to put in the extra work to lock away useful features. I suspect people will be reluctant to do that in practice.

We could say (not saying it is a good idea) something along the lines of: "For non-compliant REGEXs servers MAY behave in any undefined way, including serving matches according to the PCRE2 standard (or any other standard). Non-compliant REGEXs SHOULD trigger a warning. Servers are free to document their behavior for non-compliant REGEXs."

I agree. Clients could use extensions at their own risk.

@sauliusg
Copy link
Contributor Author

The MATCH operator that uses a subset of PCRE (based on [1] and [2]); string MATCH /abc.*xyz[0-9]{3,7}/; I'll check if slashes '/' can be easily accommodated into out Filter language grammar – otherwise we use the standard Filter strings. Escape character is backslash (''), which escapes itself ("" → a string with a single character '').

Doesn't the grammar change in the PR already do this with "/"?

Yes it does; I forgot it (had to review the PR again ;)

@merkys
Copy link
Member

merkys commented Jan 30, 2023

Using / in grammar might clash with math operator / if we want to introduce math operators in future. SQL seems to support math operators.

I think the fact that the RE slash follows the MATCH keyword and REs never participate in arithmetic expressions should resolve the ambiguity...

Indeed. I suppose our LALR parsers will be able to handle that.

@sauliusg
Copy link
Contributor Author

I suggest to skip LIKE entirely and just add MATCH. For simple matching we already provide BEGINS WITH, ENDS WITH, and CONTAINS which are user friendly in their definitions (i.e., no need to remember any particular wildcard syntax). Those with more sophisticated needs for general wildcard matching can use MATCH. Lets try to keep down the implementation work for servers.

Agreed. So we are converging on:

  • The MATCH operator that uses a subset of PCRE (based on [1] and [2]); string MATCH /abc.*xyz[0-9]{3,7}/; I'll check if slashes '/' can be easily accommodated into out Filter language grammar – otherwise we use the standard Filter strings. Escape character is backslash ('\'), which escapes itself ("\\" → a string with a single character '\').

@rartino
Copy link
Contributor

rartino commented Jan 30, 2023

The /re/ syntax should minimize the escape backslashes that are needed.

This is an excellent point. This is why we have to use / as the delimiter. Otherwise users would be exposed to figuring out our own double-escaping rules for regexs inside strings. As someone who has occasionally been exposed to preparing regexs through >= 4 layers of escaping, extra escape layers is something that absolutely should be avoided.

@merkys
Copy link
Member

merkys commented Jun 5, 2023

To test for the possible grammar conflicts with putative / operator I made the following change (which I think reflects a possible use-case):

diff --git a/optimade.md b/optimade.md
index 030be71..a364197 100644
--- a/optimade.md
+++ b/optimade.md
@@ -2164,7 +2164,7 @@ Filter = [Spaces], Expression ;
 
 Constant = String | Number ;
 
-Value = String | Number | Property ;
+Value = String | Number | Property | Number, '/', Number ;
 (* Note: support for Property in Value is OPTIONAL *)
 
 ValueList = [ Operator ], Value, { Comma, [ Operator ], Value } ;

With it, every grammar test fails with the following output:

> Error: in tests/generated/Filter.g: line 110:
>     inherent ambiguity in production 'Value' starting with tokens "+",
>     "-", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", and "."
> 
> Value = String | Number | Property | Number '/' Number;
> 

Thus I assume / operator and current RE syntax cannot co-exist. Which is not a show-stopper, unless we foresee introduction of math operators in OPTIMADE.

@rartino
Copy link
Contributor

rartino commented Jun 5, 2023

Thus I assume / operator and current RE syntax cannot co-exist. Which is not a show-stopper, unless we foresee introduction of math operators in OPTIMADE

This strikes me as strange, and may be a consequence on the details of how /regex/ is introduced in the grammar, because it should be possible to formulate the grammar so that a division / can ONLY follow a <number> whereas <non-number>/ puts you in a state where the following <anything>/ completes the regexp, and thus cannot be ambiguous with a division.

edit Actually, looking at your error message, isn't that just because it is impossible to distinguish Number and Number '/' Number? What if you rewrite the rule as:

Value = String | Number, ['/', Number] | Property

@sauliusg
Copy link
Contributor Author

sauliusg commented Jun 5, 2023

Thus I assume / operator and current RE syntax cannot co-exist. Which is not a show-stopper, unless we foresee introduction of math operators in OPTIMADE

This strikes me as strange, and may be a consequence on the details of how /regex/ is introduced in the grammar, because it should be possible to formulate the grammar so that a division / can ONLY follow a <number> whereas <non-number>/ puts you in a state where the following <anything>/ completes the regexp, and thus cannot be ambiguous with a division.

edit Actually, looking at your error message, isn't that just because it is impossible to distinguish Number and Number '/' Number? What if you rewrite the rule as:

Value = String | Number, ['/', Number] | Property

This sounds strange to me as well, since I thing we tried things similar to REs, I guess. I wold like to look closer into the grammar and see if we can transform it in such a way that the language stays the same (with REs) and the ambiguity goes away.

@rartino
Copy link
Contributor

rartino commented Feb 12, 2024

@sauliusg Why did you close this? If we get #490 merged, this PR can be updated to just reference that format for the MATCH operator and we will be done?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants