Fuzzy string matching indexes #3526
Comments
@cangfengzhs @critical27 What do you think about this issue?
@igrekun Thanks for bringing this up for discussion! It is recommended practice to file an issue first and get alignment between the two sides before you start working on it. That avoids wasted time and energy in case the solution you provide does not match the core team's design. Hope that makes sense to you. @critical27 @cangfengzhs, please share your insight on this issue.
Thanks for contacting us. So you are going to add a tri-gram index in Nebula itself instead of relying on ES, right?
These are indeed very useful features. I have some knowledge of NLP, but I don't fully grasp it, and I have the following questions:
I am looking forward to a native fuzzy string index. Maybe you can make a simple design first, and then we can discuss its possible problems and try to solve them. I am also very willing to participate in the development of this feature.
I also have a bolder idea. The key point of this feature is vector ANN search. We may be able to support a vector data type and expose it to users, so that we are only responsible for searching the vectors. After all, textCNN/ELMo/BERT will perform better than tri-grams.
@cangfengzhs I very much like the bold idea (: Fuzzy search is then done by treating each trigram as a word and searching for the closest match by cosine. Anything fancier should then be "bring your own vectors", so as not to clutter the core codebase. Personally I fancy the idea of generic ANN more than GIN / GiST since it is more general. Given the vector type, the graph and basic math, one could run ML algorithms without reaching for Spark. If we go with ANNs then I almost have a design to build upon, just let me know where to move further technical discussion if we decide to proceed on this!
Is your feature request related to a problem? Please describe.
Running a full-fledged Elasticsearch cluster to search short strings seems like overkill.
Describe the solution you'd like
Basic tri-gram or tf-idf / cosine distance for simple fuzzy string matching.
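To make the request concrete, here is a minimal sketch of trigram matching with cosine similarity in plain Python. It is an illustration only, not a proposed Nebula API: the function names (`trigrams`, `cosine_similarity`, `fuzzy_search`), the space padding scheme and the `threshold` default are all assumptions made for this example, and a real index would avoid the brute-force scan.

```python
from collections import Counter
from math import sqrt


def trigrams(text: str) -> Counter:
    """Count character trigrams of a lowercased, space-padded string."""
    # Padding (an assumption of this sketch) lets word boundaries form trigrams too.
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def fuzzy_search(query: str, candidates: list[str], threshold: float = 0.3):
    """Brute-force scan: return (score, candidate) pairs above threshold, best first."""
    q = trigrams(query)
    scored = [(cosine_similarity(q, trigrams(c)), c) for c in candidates]
    return sorted(((s, c) for s, c in scored if s >= threshold), reverse=True)
```

For example, `fuzzy_search("nebula", ["nebulla", "elastic", "nebula"])` ranks the exact match first, keeps the misspelling, and drops `"elastic"` because it shares no trigrams with the query.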
Describe alternatives you've considered
Elasticsearch.
Additional context
I will more than likely work on implementing those indexes, but Jamie Liu told me it's better to get alignment with the team by filing an issue first.
What are your thoughts on a simple but native text index for Nebula?
It can't replace Elasticsearch but should suffice for matching names and similar use cases.
The options under consideration are