Ambiguity in filter for value vs. value comparisons of type string vs. string #157

Closed
rartino opened this issue Jul 14, 2019 · 11 comments · Fixed by #229

Comments

@rartino
Contributor

rartino commented Jul 14, 2019

Recent changes introduced a timestamp type. Timestamps in the filtering language are right now represented as string tokens (parsed according to rfc3339). At the same time value vs. value comparisons are optionally allowed to be supported.

It seems to me that this presents an ambiguity in the interpretation of "value (string) vs. value (string)" comparisons. They can either be parsed as string comparisons, or as timestamp comparisons. E.g., does this filter match or not?:
"2019-07-14T22:06:43-01:00" < "2019-07-14T22:06:43Z"

One solution to this issue is to change the spec so that timestamps have their own token in the filter language. A proposal is to use bare rfc3339 strings, i.e.: 2019-07-14T22:06:43-01:00

Some relevant discussion from #144

@sauliusg wrote:

Let me share some thoughts on type matching and comparisons, to see whether I understand the problems the same way as you guys (sorry for the long post):

  1. Filter language defines 2 constant types on syntactic level: Strings (quoted) and Numbers.

1a. If necessary, Numbers can be further subdivided into Integers and Reals, to match the rest of the specification (JSON schema definitions of responses).

  2. In a filter Comparison expression, both values MUST have the same lexical type. I.e., x > 1.0 and x > 1 are only permitted if the property x is numeric (integer or float), whereas s = "abrakadabra" is only permitted if s is a string. Constants in comparisons, consequently, MUST also have identical types: comparison 1 = 1 is valid and true; comparison 1 != "1" is invalid (and not even false :).
  3. Comparing numeric values:

Lexical Real (like 1.12) numbers are not actually floating point since we do not describe any arithmetic on them; thus all fractional numbers are actually rationals. That is to say, comparison x > 1.12 is exactly the same as comparison x * 100 > 112, except that OPTiMaDe Filter language does not (currently) support arithmetic. Thus, comparing numbers is uniquely defined, even if one number is integer, and another is "computer real" (actually, always a rational approximation of a real number).

Backends MAY implement OPTiMaDe Filter Numbers and Response schema Integers/Floats as floating point numbers (including IEEE floats). Without arithmetic, number comparisons should not be problematic even if different back-ends use different floating point implementations. E.g. if x is a real-valued property, x = 0.5 should give the same result regardless of whether the back-end uses IEEE float, IEEE double or decimal floating-point values, if all implementations correctly convert the fraction 0.5 into their respective representation (and this should be equivalent to x * 10 = 5 assuming infinite-precision arithmetic). Most actual implementations already behave this way. Comparisons like 1.2E-100 < 1.2E+100 should also be possible.
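
The "literals are really rationals" argument can be checked with Python's standard `fractions` module. This is only an illustrative sketch, not anything the specification prescribes; the helper name `literal_to_rational` is made up for the example.

```python
from fractions import Fraction


def literal_to_rational(token: str) -> Fraction:
    """Interpret a filter number token (e.g. "1.12" or "1.2E-100") as an exact rational."""
    return Fraction(token)


# x > 1.12 is the same test as x * 100 > 112 when nothing is rounded.
x = literal_to_rational("1.13")
same_answer = (x > literal_to_rational("1.12")) == (x * 100 > 112)

# 0.5 is exactly representable in binary floating point, so the float value
# and the decimal literal denote the same rational number.
half_is_exact = Fraction(0.5) == literal_to_rational("0.5")

# Exponent notation compares exactly as well.
ordered = literal_to_rational("1.2E-100") < literal_to_rational("1.2E+100")
```

A back-end that stores floats would still have to convert its stored values for such a comparison; the sketch only shows that the literals themselves have an unambiguous ordering.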

It remains to be decided how an implementation MUST behave when a number cannot be handled exactly. Problem cases might be x > 1E-600 AND x < 2E-400 (both values are too small to be represented by a floating point back-end and may underflow to 0); or x > 0.12345678901 (on different implementations, a different number of digits might be used to represent x, and comparisons will differ even with the same data).

Regarding the underflow or overflow, we can:

  • either require that an implementation returns an error if underflow or overflow conditions occur (thus x > 1E-200 would produce an error response if 1E-200 cannot be represented on a backend, so this comparison will be computed on a host using double IEEE floats for x but not on a host using single precision IEEE floats);
  • or allow inexact comparisons (and potentially different results from different databases) if such conditions occur;
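
The first option (report an error on underflow or overflow) could look roughly like the sketch below for a back-end that stores IEEE doubles, which is what Python's `float` is. The function name `to_float_strict` and the error messages are assumptions made for this example.

```python
from decimal import Decimal


def to_float_strict(token: str) -> float:
    """Convert a number token to a double, raising if it underflows or overflows."""
    exact = Decimal(token)   # exact decimal value of the literal
    value = float(token)     # what a double-based back-end would store
    if value == 0.0 and exact != 0:
        raise ValueError(f"{token} underflows to zero on this back-end")
    if value in (float("inf"), float("-inf")):
        raise ValueError(f"{token} overflows on this back-end")
    return value


# Under the second option (inexact comparison) both bounds silently become 0.0,
# so "x > 1E-600 AND x < 2E-400" degenerates to "x > 0 AND x < 0".
both_underflow = float("1E-600") == 0.0 and float("2E-400") == 0.0
```

With this helper, `to_float_strict("1E-200")` succeeds on a double-based host, while `to_float_strict("1E-600")` raises instead of quietly comparing against zero.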

To cope with rounding errors on long fractions, such as in x > 0.12345678901, we can settle in the OPTiMaDe specification that at least 6 decimal digits MUST be taken into account by the implementation. This AFAIK corresponds to the IEEE single-precision float capabilities. Thus decimal fractions with no more than 6 digits will give the same results on all hosts.
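
To make the precision concern concrete, the sketch below rounds values to IEEE single precision with the standard `struct` module and compares the outcome with double precision. The values are chosen for illustration only and `as_float32` is a hypothetical helper name.

```python
import struct


def as_float32(value: float) -> float:
    """Round a Python float (IEEE double) to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]


stored = 1.000000001   # ten significant digits
bound = 1.0

# Same data, same filter "x > 1.0", different answers on different hosts.
matches_double = stored > bound                           # True
matches_single = as_float32(stored) > as_float32(bound)   # False: both round to 1.0

# With six significant digits, the two representations agree.
short_double = 1.00001 > 1.0
short_single = as_float32(1.00001) > as_float32(1.0)
```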

  4. String comparisons.

Comparisons like s > "Abc" depend on collation order. I would specify a default (Unicode) collation order, regardless of the language and locale; for higher Unicode characters (outside the ASCII range) a normalised representation MUST be compared. In this way comparisons should be unambiguous.
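
A locale-independent comparison on normalised strings could be sketched as follows. The comment above does not name a normalisation form or an exact collation, so NFC and plain code point order are assumptions here, and `collation_key` is a hypothetical helper.

```python
import unicodedata


def collation_key(s: str) -> str:
    """Normalise to NFC so that canonically equivalent strings compare equal."""
    return unicodedata.normalize("NFC", s)


composed = "\u00e9"      # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent

raw_equal = composed == decomposed                                      # False
normalised_equal = collation_key(composed) == collation_key(decomposed)  # True

# Python compares str values by code point, independent of locale.
upper_first = collation_key("Abc") < collation_key("abc")   # "A" (U+0041) < "a" (U+0061)
```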

  5. For all other properties, like dates, the values will lexically be represented as strings, and the back-end will have to make a conversion to whatever representation is needed. We can describe comparison semantics on the representation strings alone; for example date comparisons in last_modified > "2019-01-10" will follow the usual date comparison semantics.

Such an agreement will allow us to define comparison semantics for any types with internal structure, such as formulae.

This unfortunately breaks @rartino's desideratum that all string fields are compared using the same semantics, but I do not see any convenient way around it. One could introduce type definitions and typed constants, but this is definitely too complex for OPTiMaDe's purposes :).

@rartino replied

I had already somewhat reluctantly accepted that the timestamp property type is encoded in the filtering language as a string token. However, it is important to me that these different comparison semantics are due to properties being of different types. I.e., if you want special semantics for comparison of, e.g., smiles strings, there has to be a well-defined smiles property type that clearly defines these semantics.

But now that you point at it... Doesn't the introduction of "timestamp strings" resurface the problem with allowing value vs. value comparisons? How should my implementation handle the following?:
"2019-07-14T22:06:43-01:00" < "2019-07-14T22:06:43Z"
Does this filter match or not?

If the answer is "no" (at least that is my interpretation of the present spec), then does it really make sense to leave out the facility for value vs. value comparisons of timestamps?

Should we switch to simply defining a timestamp token as a bare rfc3339 timestamp, i.e. 2019-07-14T22:06:43-01:00?

@merkys
Member

merkys commented Jul 15, 2019

I see @rartino's point. I interpret the current specification as requiring strings to be interpreted as timestamps only in comparisons with a timestamp-typed property:

In a comparison with a timestamp property, a string token represents a
timestamp value that would result from parsing the string according to
RFC 3339 Internet Date/Time Format.

Therefore, the following comparison evaluates to true (ASCII - < Z):

"2019-07-14T22:06:43-01:00" < "2019-07-14T22:06:43Z"

@merkys
Member

merkys commented Jul 15, 2019

One solution to this issue is to change the spec so that timestamps have their own token in the filter language. A proposal is to use bare rfc3339 strings, i.e.: 2019-07-14T22:06:43-01:00

This sounds good. There might be problems, however, if at some point we decide to introduce both math operations (2019-07-15, which equals 1997) and dates without time (2019-07-15 referring to the day of writing this comment).

@sauliusg
Contributor

Therefore, the following comparison evaluates to true (ASCII - < Z):

"2019-07-14T22:06:43-01:00" < "2019-07-14T22:06:43Z"

This is exactly as I would interpret it :)

@sauliusg
Contributor

This sounds good. There might be problems, however, if at some point we decide to introduce both math operations (2019-07-15, which equals 1997) and dates without time (2019-07-15 referring to the day of writing this comment).

Indeed, this is a problem that using quotes solves once and for all.

@sauliusg
Contributor

I left two somewhat lengthy discussions on the points raised by @rartino in #144 (probably not the best place, but that is where the e-mail link directed me to...)

@sauliusg
Contributor

sauliusg commented Jul 15, 2019

@rartino

It seems to me that this presents an ambiguity in the interpretation of "value (string) vs. value (string)" comparisons. They can either be parsed as string comparisons, or as timestamp comparisons. E.g., does this filter match or not?:
"2019-07-14T22:06:43-01:00" < "2019-07-14T22:06:43Z"

As a short wrap-up: I would always compare value-with-value as strings, and only recognise literal string values as time-stamps when actually comparing them with a time-stamp property.

Actually, the ISO timestamps are designed in such a way that a full time-stamp lexicographic comparison gives correct chronological order. Thus surprises should be minimal :)

@merkys
Member

merkys commented Jul 15, 2019

As a short wrap-up: I would always compare value-with-value as strings, and only recognise literal string values as time-stamps when actually comparing them with a time-stamp property.

Couldn't agree more :)

Actually, the ISO timestamps are designed in such a way that a full time-stamp lexicographic comparison gives correct chronological order. Thus surprises should be minimal :)

You probably mean the UTC times, because timestamps with offsets (+HH:MM and -HH:MM) ruin this nice property both for ISO 8601 and RFC 3339. Of course we may simply forbid offsets and attain correct chronological order.

@sauliusg
Contributor

You probably mean the UTC times, because timestamps with offsets (+HH:MM and -HH:MM) ruin this nice property both for ISO 8601 and RFC 3339. Of course we may simply forbid offsets and attain correct chronological order.

Yes, that's true; the alphabetic comparison apparently gives chronological order only when strings are in the same time zone...

@rartino
Contributor Author

rartino commented Jul 16, 2019

the alphabetic comparison apparently gives chronological order only when strings are in the same time zone...

Right, that was the point of my example:

2019-07-14T22:06:43-01:00 = 1563145603000 unix time
2019-07-14T22:06:43Z      = 1563142003000 unix time

So, lexicographically the first one is smaller, time-wise the first one is larger.
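
The two readings of the example can be reproduced with the Python standard library. This is only a sketch: `parse_rfc3339` is a hypothetical helper that handles just the timestamp shapes used in this thread (the trailing "Z" is rewritten because older Python versions of `datetime.fromisoformat` do not accept it).

```python
from datetime import datetime


def parse_rfc3339(s: str) -> datetime:
    """Parse timestamps like 2019-07-14T22:06:43Z or 2019-07-14T22:06:43-01:00."""
    return datetime.fromisoformat(s.replace("Z", "+00:00"))


a = "2019-07-14T22:06:43-01:00"
b = "2019-07-14T22:06:43Z"

as_strings = a < b                                    # True: "-" sorts before "Z"
as_timestamps = parse_rfc3339(a) < parse_rfc3339(b)   # False: a is one hour later

seconds_a = int(parse_rfc3339(a).timestamp())   # 1563145603
seconds_b = int(parse_rfc3339(b).timestamp())   # 1563142003
```

The values in seconds match the millisecond figures quoted above, and the two interpretations of the same filter give opposite answers.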

My other point is: is it thus our intent to ONLY provide a facility for value vs. value comparisons for those types that have their own lexicographic tokens (i.e., presently strings and numbers), but not for other property types, e.g., timestamps? Do the good arguments put forward for allowing value vs. value comparisons not also apply to timestamp value vs. value comparisons?

What worries me is that we are essentially building in a "gotcha" in the filter language for client implementations:

  1. Implement a client that via a UI generates OPTiMaDe filters.
  2. User input may lead to value vs. value comparisons in the filter, but that is fine, since OPTiMaDe supports it.
  3. User input may lead to timestamp comparisons, which the implementation encodes for the user as rfc3339 strings, e.g., generating: last_changed > "2019-07-14T22:06:43Z" and for some other input, the other way around "2019-07-14T22:06:43-01:00" < last_changed.
  4. Everything works as intended up to the point where a user triggers 2 + 3 simultaneously; in which case the implementation erroneously generates "2019-07-14T22:06:43-01:00" < "2019-07-14T22:06:43Z", expecting this to be handled as a timestamp value vs. value comparison rather than a string value vs. value comparison.

@rartino
Contributor Author

rartino commented Nov 17, 2019

I did some further thinking about this.

How about we simply forbid all string vs. string value comparisons for v1.0, while keeping value vs. value comparisons of other types? Because, as noted above, it isn't clear what "2019-07-14T22:06:43-01:00" < "2019-07-14T22:06:43Z" should evaluate to.

Then, moving on after v1.0, here is where I'd like us to go:

  • The notation "something" isn't a string, it is a data token, which depending on context is interpreted as different data types, of which string is just one possibility.

  • If an expected data type of a data token can be inferred from the context, e.g., by comparison with a property of a specific type, it is interpreted as that type.

    • If compared to a string property "2019-07-14T22:06:43-01:00" is interpreted as a normal string, and string comparison semantics apply.
    • If compared to a timestamp property, "2019-07-14T22:06:43-01:00" is interpreted as a timestamp, and timestamp comparison semantics apply.
    • If compared to a smiles property, "O=C=O" is interpreted as a smiles string, and "smiles comparison semantics" apply.
    • etc.
  • However, instead of using a data token, a client can represent specific datatypes using a prefix notation:

    • [string]"2019-07-14T22:06:43-01:00" is a string.
    • [timestamp]"2019-07-14T22:06:43-01:00" is a timestamp.
    • [smiles]"O=C=O" is smiles data.
    • etc.

    This makes the difference between these two possibilities explicit:

    • [timestamp]"2019-07-14T22:06:43-01:00" < [timestamp]"2019-07-14T22:06:43Z"
    • [string]"2019-07-14T22:06:43-01:00" < [string]"2019-07-14T22:06:43Z"
  • data tokens CANNOT be value vs. value compared, because there is no context in which to interpret them as datatypes.

    • "2019-07-14T22:06:43-01:00" < "2019-07-14T22:06:43Z" gives the error: "Data tokens cannot be value by value compared: use prefix notation to indicate the data types."

This seems very expandable for embedding any kind of data into string-like tokens in the language in the future.

Edit: we can even, from day one, allow database-specific data types with their own semantics! E.g.:

[omdb_smiles]"O=C=O"

can be allowed from day one!, with comparison semantics defined by my database.
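
A rough sketch of how an implementation might evaluate value vs. value comparisons under this prefix proposal is given below. Everything in it is hypothetical: the token grammar (no escaped quotes), the `PARSERS` registry, and the decision to reject comparisons between different prefixes are assumptions for illustration, not part of any specification.

```python
import re
from datetime import datetime


def _timestamp(raw: str) -> datetime:
    # "Z" is rewritten because older Python versions do not accept it.
    return datetime.fromisoformat(raw.replace("Z", "+00:00"))


# Hypothetical registry: type prefix -> function converting the raw text.
PARSERS = {"string": str, "timestamp": _timestamp}

# Optional [type] prefix followed by a double-quoted body.
TOKEN = re.compile(r'^(?:\[(\w+)\])?"([^"]*)"$')


def parse_token(text: str):
    """Return (type_name or None, raw_text) for tokens like [timestamp]"..."."""
    m = TOKEN.match(text)
    if m is None:
        raise ValueError(f"not a data token: {text}")
    return m.group(1), m.group(2)


def less_than(lhs: str, rhs: str) -> bool:
    """Evaluate a value vs. value '<' comparison between two tokens."""
    (ltype, lraw), (rtype, rraw) = parse_token(lhs), parse_token(rhs)
    if ltype is None or rtype is None:
        raise ValueError("Data tokens cannot be value by value compared: "
                         "use prefix notation to indicate the data types.")
    if ltype != rtype:  # assumption: mixed prefixes are rejected
        raise ValueError(f"cannot compare {ltype} with {rtype}")
    convert = PARSERS[ltype]
    return convert(lraw) < convert(rraw)


ts = less_than('[timestamp]"2019-07-14T22:06:43-01:00"',
               '[timestamp]"2019-07-14T22:06:43Z"')    # False
st = less_than('[string]"2019-07-14T22:06:43-01:00"',
               '[string]"2019-07-14T22:06:43Z"')       # True
```

A database-specific type such as omdb_smiles would, in this sketch, simply be another entry in the registry with its own conversion and ordering.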

@merkys
Member

merkys commented Nov 18, 2019

Could the square bracket syntax be replaced with function-like syntax? I.e., instead of

[timestamp]"2019-07-14T22:06:43-01:00"

one would write

TIMESTAMP( "2019-07-14T22:06:43-01:00" )

This would be similar to the DATE() function in MySQL, which converts string/numeric representations of date into its internal data structure for representing timestamps.
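
For comparison, recognising the function-like form is about as simple as recognising the bracket prefix. The sketch below is purely illustrative: the grammar, the `parse_call` name, and the single supported function are assumptions, and escaped quotes are not handled.

```python
import re
from datetime import datetime

# NAME( "text" ) with optional whitespace inside the parentheses.
CALL = re.compile(r'^([A-Z_]+)\(\s*"([^"]*)"\s*\)$')


def parse_call(text: str) -> datetime:
    """Parse a typed constant written as TIMESTAMP( "..." )."""
    m = CALL.match(text)
    if m is None:
        raise ValueError(f"not a typed constant: {text}")
    name, raw = m.groups()
    if name == "TIMESTAMP":
        # "Z" is rewritten because older Python versions do not accept it.
        return datetime.fromisoformat(raw.replace("Z", "+00:00"))
    raise ValueError(f"unknown type function: {name}")


later = parse_call('TIMESTAMP( "2019-07-14T22:06:43-01:00" )')
earlier = parse_call('TIMESTAMP("2019-07-14T22:06:43Z")')
first_is_smaller = later < earlier   # False under timestamp semantics
```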
