Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue-630: Changes to define the "CASEI" function for supporting case-insensitive string comparisons. #641

Merged
merged 16 commits into from
Jan 30, 2022

Conversation

pvretano
Copy link
Contributor

@pvretano pvretano commented Oct 25, 2021

Still in progress pending discussion in the SWG. Need to find out:

  1. Is the name CASEI OK?
  2. Do we need a similar function named ACCENTI for accent-insensitive matching? I have included it in the BNF and JSON encodings of CQL but I have not discussed it in the specification.
  3. If we do need ACCENTI, should that be in its own conformance class or parse of the same class as CASEI?

NOTE: In the processed of adding the CASEI function I noticed that the way we had defined the UPPER/LOWER functions in the JSON was not correct. UPPER/LOWER and CASEI are suppose to be built-in functions and so they should be encoded as a function would be but that was not the case. So, I fixed that.

@pvretano pvretano linked an issue Oct 25, 2021 that may be closed by this pull request
@pvretano pvretano requested a review from cportele October 25, 2021 03:28
Copy link
Member

@cportele cportele left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a few topics for discussion.

Regarding the questions:

  • The name CASEI works for me.
  • Good question whether we want to support ACCENTI, too. Probably yes. If we add it, I think it should be in its own conformance class. Some backends may not support accent collation.

cql2/standard/schema/cql2.bnf Outdated Show resolved Hide resolved
cql2/standard/clause_7_enhanced.adoc Outdated Show resolved Hide resolved
cql2/standard/schema/cql2.yml Outdated Show resolved Hide resolved
@cportele
Copy link
Member

cportele commented Oct 25, 2021

Meeting 2021-10-25: See the review comments for their resolution. Regarding the open questions:

  • We will use the name CASEI.
  • We will include ACCENTI in its own requirement/conformance class. (ACCENTI includes all kinds of diacritics.)
  • We also discussed including a combined operator, but decided to not do this in this version. Combining both requires CASEI(ACCENTI("Größer")).
  • @pvretano will update the PR based on the decision. After that it should be ready to merge.

@philvarner
Copy link
Contributor

I have a concern about the semantics of and complexity of implementing CASEI. My main concern is that the use of CASEI in an expression has a non-local effect on the evaluation of the expression, such that it's not actually a function. With any user-defined function, the function evaluates to a value. However, CASEI doesn't evaluate to a single value, but rather modifies the semantics of the operators in the expression it's used in.

One of the examples is:

CASEI(road_class) IN (CASEI('Οδος'),CASEI('Straße'))

but this has the same semantics as:

CASEI(road_class) IN ('Οδος','Straße')

since road_class=straße would evaluate "true" in either expression.

The presence of CASEI changes the semantics of the operators, rather than the value of the argument to the function. I feel like this is going to make it very difficult for implementers to get right, much more so than just having UPPER and LOWER functions. This gets complicated, especially when multiple functions are involved. For example, I'm not sure what the semantics of this expression would be:

FOO(BAR(CASEI(road_class))) IN ('Οδος','Straße')

The CASEI should do something, but it seems odd that FOO(BAR(CASEI(road_class))) and CASEI(FOO(BAR(road_class))) would have the same semantics.

@pvretano
Copy link
Contributor Author

pvretano commented Dec 3, 2021

@philvarner where did you get that CASEI() has a non-local effect? That is not the intent and CASEI() has no effect on the semantics of the operator. The intent of CASEI() is to convert its argument into a case insensitive form (using the appropriate case folding rules) that can be compared to another similarly converted value. So, CASEI('straße')<> 'Straße'.

@philvarner
Copy link
Contributor

Ah, so CASEI returns a new type that is this insensitive format, and then that can only be compared against other values of that same type. So something like CASEI('straße') == 'Straße' would yield a type-checking error because the left hand is a CASEI object and the right is a string.

@philvarner
Copy link
Contributor

One place in the spec that made me think it was a non-local effect is "Although the CASEI function is used in CQL2 to trigger case insensitive matching" -- the use of the word "trigger" here makes me think it's putting the expression into a different mode, rather than the CASEI function returning a (new) special type.

@pvretano
Copy link
Contributor Author

pvretano commented Dec 3, 2021

@philvarner CASEI('straße') == 'Straße' would evaluate to false. The CASEI() function returns a string but one to which case folding rules have been applied. So CASEI('straße') would generate the string 'STRASSE' and CASEI('Οδος') would generate the string 'ΟΔΟΣ'. Clearly I have more explanation to add to the text of CQL2! ... and also to clarify the bit that made you think that the scope of CASEI was beyond its argument. I will endeavour to clarify a bit more.

@philvarner
Copy link
Contributor

Yes, I think describing case folding would be good, since I've been at this a long time I've never used it, and instead just always done lowercasing, which is probably a fault of my focus primarily on English. Glad I know about it now!

The way I would describe it is that it returns a string typed representation of the input string that is guaranteed to be equal to any other case insensitive representation of that string. The Unicode specs I read say this is typically lowercase (in contrast to your example of it being uppercase 'STRASSE'), but I would stress to users that they should ONLY compare the results of two CASEI function expressions if they actually want to ensure a correct comparison, because the exact rules of conversion are opaque to them. So the only "durable" comparison is:

CASEI(prop1) == CASEI('Straße')

CASEI(prop1) == 'strasse' might work, but it's not guaranteed to work across implementations or between versions of the same implemenation.

@pvretano
Copy link
Contributor Author

pvretano commented Dec 6, 2021

06-DEC-2021: There is an open question about whether CASEI() and ACCENTI() should be in their own conformance classes and we need feedback on that. @cportele and @pvretano feel that English implementations probably only care about CASEI() while European implementors would want ACCENTI() too. So based on that, separate conformance classes would make sense. If anyone has any feeling about that, please add your comments here and the SWG can circle back in the a future meeting.

@philvarner
Copy link
Contributor

I'm usually an advocate finer-grained conformance classes, but I don't see a strong case here for separating them. English-only speakers may only see the use case for needing case-insensitive comparison, but the implementations they make will likely also have European language users and data who need the accent-insensitive comparison. I don't know of any databases that only have support for case and not accent. There's always the fall-back of simply implementing CASEI or ACCENTI as a custom function.

@m-mohr
Copy link
Contributor

m-mohr commented Dec 7, 2021

From a German POV, I find the differentiation really weird. ß is supported in CASEI, but ä, ö and ü are not, which is surprising as from a German POV they are basically equivalent. The naming is also not ideal. In German (might be biased), I usually think about the French accents like á and à etc and not about ä and ö if I speak about accents. We call these umlauts and I doubt people will really make the connection.

Overall, it feels like an implementation detail is shifted towards the user. I'd strongly step back from the English centric standpoint and think more global here. Aren't most environments nowadays supporting Unicode related functions? And if a back-end really can't support it and thinks he only delivers in English, then he can still simply not support it in CASEI and will not run into problems.

So my +1 to merging ACCENTI and CASEI.

@m-mohr
Copy link
Contributor

m-mohr commented Dec 7, 2021

I'm coming from a (non-developer) user point of view here. Why does a novice user need to understand the difference between CASEI and ACCENTI? As a normal user I just want to say that I don't care about casing and all the details should be handled by the implementation. If I let the users think about it, I make a complexity of the implementation (to support accents etc) be a complexity for the user... But maybe I'm oversimplifying things?

@rsmith013
Copy link

rsmith013 commented Dec 7, 2021

@m-mohr for a user, this would definitely be nicer.

@cportele
Copy link
Member

cportele commented Dec 7, 2021

But maybe I'm oversimplifying things?

I think so. Normalization of characters is orthogonal to case folding. If we merge the two, this would create issues for the cases where one wants one or the other, but not both.

Do databases usually offer a convenience function that combines both? Even if not, should we provide one?

@aznan2
Copy link

aznan2 commented Dec 7, 2021

In Swedish, ÅÄÖ are not just A and O with accents but their own separate letters. The words "Al" and "Ål" have completely different meaning, so it would make sense to me to want to want to search in a case insensitive way while still keeping the accents, so that I don't get a hit on "Ål" when searching for "al".

@m-mohr
Copy link
Contributor

m-mohr commented Dec 7, 2021

Wait, I was only talking about casing, not about allowing ä for a etc. I wasn't even aware that the latter is tackled by those functions and it's indeed completely different thing and should likely be a separate function, but casing is casing regardless of what I pass in (I think).

Maybe I misunderstood what the functions do.

I'd expect that with CASEI:

  • Süd is equal to SÜD, but is NOT equal to Sud (or Sued)

What I could imagine is that ACCENTI (or whatever it's then called) allows:

  • Süd is equal to Sud, but is NOT equal to SÜD (or SUED)

That seperates two different use cases clearly and makes it easy to use (which might not be easy to implement).
That's just my not very informed opinion on how I'd expect it to work from a German speaker POV. This might not be what others expect...

I don't have a strong preference whether it's one or two conformance classes.

@pvretano
Copy link
Contributor Author

pvretano commented Dec 7, 2021

@cportele et all, FYI, most databases that I surveyed before creating this PR handle case insensitivity separately from accent insensitivity and many databases have a convention where letters such as '_AI' and '_CI' are appended to the collation name to indicate whether the collation is case insensitive, accent insensitive or both. The default collation for MySQL, for example, is 'latin1_swedish_ci' so it is a case insensitive -- but not accent insensitive -- collation.

@aznan2
Copy link

aznan2 commented Dec 7, 2021

@m-mohr Sorry, my bad. I read "+1 to merging ACCENTI and CASEI" as merging the two functions, not the conformance classes.

@pvretano
Copy link
Contributor Author

pvretano commented Dec 7, 2021

I'm agnostic about whether CASEI and ACCENTI are in one conformance class or two. I can see both sides. On the one hand these are related functions and so existing in one conformance class makes sense. One the other hand, languages have a mix of case and accent requirements and that sorta points to separate conformance classes ... although it seems kinda short sighted to develop software for a specific language or set of languages ... no!? ;) I'm happy to go with the consensus.

@philvarner
Copy link
Contributor

To clarify, I think they must be two distinct functions, but I think it's okay to put them in the same conformance class, but I also don't have a problem with putting them in different conformance classes because of the uncertainty about whether users can or want to support both of them.

@philvarner
Copy link
Contributor

One other comment to add about what the purpose of these two functions is. Both are useful to operate across data that has not been normalized or has been normalized to values that are different than they should be.

CASEI is useful when I have different data that sets a field value to "PLANET", "Planet", or "planet" and I want to match either without having to enumerate all the variations.

ACCENTI is useful when either:
(1) accents (or, more generally, diacritics not available in ASCII) were dropped when indexing the data, where the value is actually "wrong" in many cases -- e.g., even in Spanish (the language with the smallest set of characters outside ASCII), papa is potato and papá is father, so they're not interchangeable. Some of my data could have either been rewritten to ASCII and dropped the á, so I want to match on either even though what I really want is only "papá", or
(2) I'm an English-only speaker who doesn't know how to type a á and just want to use ASCII because I know that my data doesn't have both of these words in it.

Comment on lines +10 to +13
[[collation-def]]
==== collation
a set of rules that indicate how to compare and sort character string data
the process of ordering units of textual information https://www.unicode.org/glossary/#collation[Glossary of Unicode Terms]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The term “collation” does not seem to be used in the remainder of the document. Should it be removed from the terms & definitions?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meeting 2022-01-17: Use the term in the description and keep the definition.

^|A |The server SHALL support a built-in function named `ACCENTI`.
^|B |The function SHALL accept one argument that can be a character string literal, the name of a property that evaluates to a string literal or a function that returns a string literal (see rules `characterLiteral`, `propertyName`, `function`).
^|C |The function SHALL return a character string literal.
^|D |The function SHALL implement https://www.w3.org/TR/charmod-norm/#unicodeNormalization[unicode normalization] described in the implementation guidelines of https://www.unicode.org/versions/Unicode14.0.0[The Unicode Standard, Version 14.0] (see https://www.unicode.org/versions/Unicode14.0.0/ch05.pdf[clause 5.6 Normalization]).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding the term “Unicode normalization” to the terms & definitions.

Unicode normalization
normalization
process of removing alternate representations of equivalent sequences from textual data, to convert the data into a form that can be binary-compared for equivalence

Source: https://www.unicode.org/glossary/#normalization

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meeting 2022-01-17: Agreed.

^|A |The server SHALL support a built-in function named `CASEI`.
^|B |The function SHALL accept one argument that can be a character string literal, the name of a property that evaluates to a string literal or a function that returns a string literal (see rules `characterLiteral`, `propertyName`, `function`).
^|C |The function SHALL return a character string literal.
^|D |The function SHALL implement the https://www.w3.org/TR/charmod-norm/#definitionCaseFolding[full case folding] algorithm defined in the implementation guidelines of https://www.unicode.org/versions/Unicode14.0.0[The Unicode Standard, Version 14.0] (see https://www.unicode.org/versions/Unicode14.0.0/ch05.pdf[clause 5.18 Case Mappings, sub-clause Caseless Matching]).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding the term “Unicode case folding” to the terms & definitions:

Unicode case folding
case folding
process of making two texts which differ only in case identical for comparison purposes

Note: Case folding is meant for the purpose of string matching.

Source: https://www.w3.org/TR/charmod-norm/#definitionCaseFolding

Consider adding and using the term “Unicode case-insensitive matching” as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meeting 2022-01-17: Agreed.

cportele and others added 2 commits January 17, 2022 16:30
…REQ_accenti-builtin-function.adoc


Meeting 2022-01-17: Agreed.

Co-authored-by: Heidi Vanparys <heidi.vanparys@gmail.com>
…REQ_casei-builtin-function.adoc


Meeting 2022-01-17: Agreed.

Co-authored-by: Heidi Vanparys <heidi.vanparys@gmail.com>
@cportele
Copy link
Member

Meeting 2022-01-17: Define both functions in one requirements/conformance class. In most cases, implementing the functions will simply require calling the function in the underlying datastore, but if someone wants/needs to implement the capability themselves, this is a complex implementation. If it turns out that it would be better to have separate conformance declarations, we can add two additional conformance classes, one for each function.

@pvretano will resolve the outstanding comments/edits and then merge the PR.

@pvretano pvretano merged commit 3580a4f into opengeospatial:master Jan 30, 2022
====

----
etat_vol = ACCENTI('débárquér')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pvretano I know you just merged this, but shouldn't this be

ACCENTI(etat_vol) = ACCENTI('débárquér')

as as it is for CASEI? same for the JSON example.

@cportele
Copy link
Member

@philvarner - I agree.

There are also a few other comments/edits that were discussed in this PR that are not yet included. I am working on a follow-up PR. I will make that change there, too.

cc @pvretano

@pvretano
Copy link
Contributor Author

@philvarner yes, it should be ACCENTI(etat_vol) = ACCENTI('débárquér'). @cportele do you want me to fix the example or will you do it in the followup PR?

@cportele
Copy link
Member

@pvretano - I have already fixed it in my edits (the property references in the JSON also needed an update).

cportele added a commit that referenced this pull request Jan 31, 2022
There were still some outstanding comments/edits from the discussion in PR #641:

* Add explanatory text (#641 (comment))
* Removed definition of "collation" - it was not used in the text and it is not necessary to introduce the term (#641 (comment))
* Add "unicode normalization" as a term (#641 (comment))
* Add "unicode case folding" as a term (#641 (comment))
* Correct examples (#641 (comment))
cportele added a commit that referenced this pull request Jan 31, 2022
follow-up edits from PR #641

Meeting 2022-01-31: Merge the PR. Any additional comment would be addressed in a new PR.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Case-insensitive string comparison
7 participants