Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Boolean value representation in queries #345

Closed
lauri-codes opened this issue Dec 17, 2020 · 12 comments · Fixed by #348
Closed

Boolean value representation in queries #345

lauri-codes opened this issue Dec 17, 2020 · 12 comments · Fixed by #348
Labels
misc/question Further information is requested status/has-concrete-suggestion This issue has one or more concrete suggestions spelled out that can be brought up for consensus. status/no-consensus There is no consensus on this issue, it needs to be further discussed and developed. topic/filtering-language Issue discussing changes and improvements to the query and filtering language topic/property-standardization The specification of the precise data representation of properties and entries type/proposal Proposal for addition/removal of features. May need broad discussion to reach consensus.

Comments

@lauri-codes
Copy link

Hi!

First of all thanks for all your hard work on this great project!

I'm just getting to know the different standards that Optimade provides, and I bumped into a problem with boolean values. The documentation (optimade.rst) at 2.1 says that there are the following basic types: string, integer, float, boolean, timestamp. In section 4.2. it is mentioned that booleans in responses should use the JSON standard. However, the standard is unclear on how boolean values should be represented in the queries. Section 6.1 discusses the different tokens, but also there is no mention of how boolean values should be presented. The grammar itself also does not seem to define a token for booleans.

So the question is whether I just somehow missed how booleans are to be defined, or if the official standard is missing? I guess none of the current entries are of boolean type, so this was never discussed. I would, however, like to apply the Optimade query language to a domain which also contains boolean values. My naive suggestion would be to go with the JSON style lowercase true/false values.

Thanks fo your time!

@CasperWA
Copy link
Member

Thank you for your interest and the issue!

The issue of boolean type usage in the filter language indeed never came up (as far as I remember), as you rightly suggest, due to a lack of filter fields needing boolean values.

Personally, I would follow your suggestion here and use the JSON style boolean values true/false.
The reason being that this reflects well how the other types are represented in the filter language. Furthermore, since strings are to be enclosed in double quotations ("), leaving these out for the true/false values, should ensure that the type can be determined in the grammar parser without any ambiguity.

@CasperWA CasperWA added misc/question Further information is requested status/has-concrete-suggestion This issue has one or more concrete suggestions spelled out that can be brought up for consensus. status/no-consensus There is no consensus on this issue, it needs to be further discussed and developed. topic/filtering-language Issue discussing changes and improvements to the query and filtering language topic/property-standardization The specification of the precise data representation of properties and entries type/proposal Proposal for addition/removal of features. May need broad discussion to reach consensus. labels Dec 17, 2020
@lauri-codes
Copy link
Author

Thanks for the prompt reply! Regarding the use of true/false as booleans: for the sake of consistency it might make sense to instead use TRUE/FALSE as all other keywords are uppercase..? This would also get rid of the (very unlikely) issue with someone using an entry named "true" or "false" which would confuse the parser.

@CasperWA
Copy link
Member

CasperWA commented Dec 17, 2020

Since true/false would be values and not operator keywords, I don't think this consistency argument holds due to this fact.
However, it's worth considering. I'll just ping @sauliusg, @merkys and @rartino for this discussion as well.

Concerning the issue of naming a field's value (e.g., id) true or false, this shouldn't be an issue due to the reason I mentioned above, where the value MUST be enclosed in double quotations (") since it's of type string (e.g., ?filter=id="true").

For the single-entry endpoint (´/structures/true`) it shouldn't be an issue either, since the filter parser never touches this part of the URL.

Whether a field's name could be true or false is a non-existent issue, since custom field names MUST be prepended by _{provider}_, where {provider} signifies a provider's dedicated name (e.g., mcloud for Materials Cloud).

@merkys
Copy link
Member

merkys commented Dec 17, 2020

Issue well spotted! I think all-caps TRUE and FALSE is better than true and false because the latter values might be at some point used for endpoint or field name. Of course we may state in the standard that these names are forbidden explicitly, but currently, AFAIK, we have no such limitations to the namespaces, and I prefer keeping them low.

For the single-entry endpoint (´/structures/true`) it shouldn't be an issue either, since the filter parser never touches this part of the URL.

By the way, endpoint names may appear in filters as explained in "Filtering on relationships".

true and false might overcomplicate the grammar, since they have to be detected as values rather than identifiers. On the other hand, capital letters are used for operators and special values. For instance, UNKNOWN could be seen as replacement of JSON's null.

Moreover, I would go for using IS TRUE and IS FALSE instead of comparisons against TRUE and FALSE with currently defined comparison operators. Rationale: I find it difficult to understand what > TRUE should mean 😄. Thus IS TRUE and IS FALSE would appear nicely in parallel to IS KNOWN and IS UNKNOWN.

@ml-evs
Copy link
Member

ml-evs commented Dec 17, 2020

This is definitely an oversight and should be fixed for v1.0.1, thanks @lauri-codes (also hello, hope you're good!)

Moreover, I would go for using IS TRUE and IS FALSE instead of comparisons against TRUE and FALSE with currently defined comparison operators. Rationale: I find it difficult to understand what > TRUE should mean smile. Thus IS TRUE and IS FALSE would appear nicely in parallel to IS KNOWN and IS UNKNOWN.

I like this suggestion, we need to make it clear that IS TRUE (or = true, if we go with that) can only be applied to properties and not to expressions. I think the analogy with IS KNOWN works well for this.

@lauri-codes
Copy link
Author

Hi @ml-evs, long time no see :)

If boolean values are always assumed to be scalar-like, then IS TRUE and IS FALSE would be nice.

What about queries that would target lists containing booleans? Would they be supported? E.g.

  • How to search if a list of bools contains a boolean value: pbc HAS TRUE?
  • How to search if a list of bools matches: 'pbc = TRUE, TRUE, TRUE`?

If boolean values can be used together with other operators besides equality/unequality, then I would prefer keeping the operators as they are (= TRUE instead of IS TRUE) and forbidding operators that don't make any sense (like > TRUE).

@rartino
Copy link
Contributor

rartino commented Dec 17, 2020

I am surprised we've missed this so far, thanks for pointing it out. The reason is of course that we so far did not yet need to standardize a boolean field. Formally the filter language supports comparison of boolean fields, but there is just no way to express a boolean constant, nor an operator to test for "truthiness" in a filter.

Since this is likely to require a grammar extension (though, depending on where we land, maybe not) and this will take some time to make it into the standard, I would suggest that @lauri-codes may try to find some form of workaround solution for now. If you have several "flags" you want to represent and query against, consider representing them as strings in a list like the standardized structure_features. E.g., filter=_exmpl_categorizations HAS "poisonous". Uglier workaround for booleans would be to represent them instead as integer 0/1 or "true"/"false" strings or similar.

We definitely should be careful with introducing lowercase keywords, e.g. true, false as they may conflict with other tokens we could want to introduce later (the primary restriction here being that we try to keep the filtering language LL-parseable for parser simplicity.)

My first thought was also towards 'IS TRUE', which then isn't a boolean constant token, but an operator to test for truthiness. However, I think it would be more consistent with the present language design to find another word than "IS" for truthiness operators. Not sure what it would be though. (Also - I see that @lauri-codes just posted the objection that it is not clear how this generalizes to testing boolean values in lists.)

However, it would actually be consistent with how we handled timestamps to simply say that the OPTIMADE representation of booleans in the filter language is via the strings "true" and "false". Given that we worked out that there aren't any ambiguities for timestamp comparisons, I don't see why there would be for booleans. Testing a="true" means a string comparison if the type of a is a string, but a boolean comparison if the type of a is a boolean.

@sauliusg
Copy link
Contributor

Issue well spotted! I think all-caps TRUE and FALSE is better than true and false because the latter values might be at some point used for endpoint or field name. Of course we may state in the standard that these names are forbidden explicitly, but currently, AFAIK, we have no such limitations to the namespaces, and I prefer keeping them low.

For the single-entry endpoint (´/structures/true`) it shouldn't be an issue either, since the filter parser never touches this part of the URL.

By the way, endpoint names may appear in filters as explained in "Filtering on relationships".

true and false might overcomplicate the grammar, since they have to be detected as values rather than identifiers. On the other hand, capital letters are used for operators and special values. For instance, UNKNOWN could be seen as replacement of JSON's null.

Moreover, I would go for using IS TRUE and IS FALSE instead of comparisons against TRUE and FALSE with currently defined comparison operators. Rationale: I find it difficult to understand what > TRUE should mean smile. Thus IS TRUE and IS FALSE would appear nicely in parallel to IS KNOWN and IS UNKNOWN.

I think TRUE and FALSE (a keyword-like in all caps) would be consistent with the current grammar design. Using names in quotes like something = "true" will immediately lead to confusion with comparison against the string value "true".

Whether something = TRUE or something IS TRUE I have no string feelings, IMHO both are fine. Both will require moderate conservative (backwards compatible) extension of the Filter grammar, which should not be a problem.

@rartino
Copy link
Contributor

rartino commented Dec 17, 2020

Whether something = TRUE or something IS TRUE I have no string feelings, IMHO both are fine. Both will require moderate conservative (backwards compatible) extension of the Filter grammar, which should not be a problem.

I support TRUE and FALSE as tokens for boolean constants and skipping a truthiness operator. As far as I can see, these constants can be implemented slightly less intrusively in the grammar than "IS TRUE", they neatly solve the issue pointed out by @lauri-codes about testing booleans in lists, and after thinking about it, it frankly seems odd to not have a way to specify a constant of one of our most basic data types.

@merkys
Copy link
Member

merkys commented Dec 17, 2020

I support TRUE and FALSE as tokens for boolean constants and skipping a truthiness operator. As far as I can see, these constants can be implemented slightly less intrusively in the grammar than "IS TRUE", they neatly solve the issue pointed out by @lauri-codes about testing booleans in lists, and after thinking about it, it frankly seems odd to not have a way to specify a constant of one of our most basic data types.

The more I think, the more I like the idea of TRUE and FALSE tokens. However, I feel we need to define how operators > and < work with boolean types. Is TRUE > FALSE?

@rartino
Copy link
Contributor

rartino commented Dec 18, 2020

The more I think, the more I like the idea of TRUE and FALSE tokens. However, I feel we need to define how operators > and < work with boolean types. Is TRUE > FALSE?

Surely we can say that the only valid comparison operator for booleans is equality?

Note: filter=a>b will be accepted grammatically even if the a and b are boolean type fields, because the grammar cannot type check fields. It will be the type checking "layer" in our implementations that will throw an error about this. (Technically we could specify the grammar to block filters with invalid comparisons when there are boolean constants on both sides, e.g. TRUE > FALSE but that isn't very useful in my opinion.)

@merkys
Copy link
Member

merkys commented Dec 18, 2020

Surely we can say that the only valid comparison operator for booleans is equality?

Either that, or specify that when compared, TRUE is cast to integer 1 and FALSE is cast to integer 0. At least Python works this way:

$ ipython3
In [1]: True > False
Out[1]: True

In [2]: False > True
Out[2]: False

In [3]: True + 0
Out[3]: 1

(Technically we could specify the grammar to block filters with invalid comparisons when there are boolean constants on both sides, e.g. TRUE > FALSE but that isn't very useful in my opinion.)

Yes, I think that doing so would require many additional grammar rules.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
misc/question Further information is requested status/has-concrete-suggestion This issue has one or more concrete suggestions spelled out that can be brought up for consensus. status/no-consensus There is no consensus on this issue, it needs to be further discussed and developed. topic/filtering-language Issue discussing changes and improvements to the query and filtering language topic/property-standardization The specification of the precise data representation of properties and entries type/proposal Proposal for addition/removal of features. May need broad discussion to reach consensus.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants