Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

searching for a string should ingore case and diacritic accents #73

Open
AliciaMstt opened this issue Aug 6, 2021 · 5 comments
Open
Assignees
Labels
enhancement New feature or request

Comments

@AliciaMstt
Copy link
Member

Ignore case

Currently searching for a string is case sensitive, so that in the example below in the SPA searching for "chaya" yields no results:

image

Whilst "Chaya" does:

image

The desired behavior is that "chaya" "Chaya", "chAya" or any similar form should bring the same results than "Chaya".

Accents
We have a similar issue with the special character ´ and others, which in Spanish are used all the time. Unfortunately even if we know that the correct spelling is maíz, users commonly would look for maiz. Therefore we need that searching for a string to ignore accents of anykind, something like:

a == âàáäÂÀÂÄ 
e == êèéëÊÈÉË 
i == îïíÎÏ 
o == ôöóÔÖ 
u == ûùüÛÜÙ 
c == çÇ 
n == ñ

So that if maíz is present in the db, the query maiz and maíz should yield the same results.

@AliciaMstt AliciaMstt added the enhancement New feature or request label Aug 6, 2021
@asishallab
Copy link
Member

asishallab commented Aug 26, 2021

A problem with a general solution is that the like operator is not implemented in all storage types. MongoDb and Neo4J only support regular expressions, which, of course, could be used to implement that behavior in these storage techs. Additionally the ilike operator which is case in-sensitive is only implemented in Postgres. Furthermore, we could not use regular expressions in all storages, because regexs are not implemented in Presto / Trino and Amazon S3 tables.

Thus, we'd need to first identify

  • What could we use in Presto / Trino?
  • What could we use in Amazon S3?
  • What could we use in all relational DBs?
  • What could we use in MongoDb and Neo4J, probably regexs?

Then we implement a helper module that constructs the respective storage type
correct search and executes it in its respective data model layer. We could even
(thanks @coeit ) implement a new operator like zilike (Zendro - ilike) and implement
that for each storage separately.

The main question is to identify a common denominator of search function that can
be implemented in all storage techs, but Cassandra. This question can be answered using
something like the following hypothetical example table, inspired by Wikipedia's tech-tables:

Storage-Technology Regular Expression Operator regexp supports case-insensitive like Operator ilike Operator
Sequelize regexp ? true only postgres

@coeit
Copy link
Member

coeit commented Aug 27, 2021

Storage-Technology Regular Expression Operator regexp supports case-insensitive like Operator ilike Operator resource
Sequelize regexp (~, PG/MySQL only) iRegexp (~*, PG only) like ilike (PG only) https://sequelize.org/master/manual/model-querying-basics.html#operators
MongoDB $regex $options: 'i' can be implemented with regex can be implemented with regex https://docs.mongodb.com/manual/reference/operator/query-evaluation/
neo4j =~ (?i)<regexp> can be implemented with regex can be implemented with regex https://neo4j.com/docs/cypher-manual/current/clauses/where/#query-where-string
presto / trino regexp_like(<regexp>, <string>) regexp_like((?i)<regexp>, <string>) LIKE LOWER(<string>) LIKE OR can be implemented with regex https://trino.io/docs/current/functions.html
amazonS3 - - LIKE LOWER(<string>) LIKE https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-glacier-select-sql-reference-operators.html#s3-glacier-select-sql-reference-pattern

@coeit
Copy link
Member

coeit commented Aug 31, 2021

Since regular expressions have quite a different syntax depending on the storage type when it comes to setting flags (for Zendro we need case insensitivity), a consistent way to define those flags in Zendro has to be defined.

  • Sequelize: iRegexp
  • mongoDB: $options: 'i'
  • neo4j / presto / trino: (?i)

Solutions

iRegexp operator

In the spirit of Sequelize, define a iRegexp operator that implements the i flag depending on the storage type.

  • Do we need notiRegexp ?

Define a Zendro specific regexp syntax

We can use e,g, Java regular expressions syntax like neo4j/ presto and trino do. We would have to translate this for mongoDB and sequelize to their specific implementations

@asishallab
Copy link
Member

Response to comment about iRegexp

Implement the Zendro operator iRegexp and notiRegexp. It seems to be the best solution

@asishallab
Copy link
Member

In order to resolve this issue we should first implement a basic solution and use the ilike operator in all storages, Zendro currently supports. This will avoid having errors if one Zendro instance uses Amozon S3. In a later iteration we could think about implementing a new zilike operator that actually uses regular expressions to match e.g. "a" with all variations of the letter with accents and such, in Amazon S3 these searches would be done with its internal ilike and the _ wild-card. zilike would thus have different behavior in different storage types.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants