Full text index does not work as expected #4837
Labels
- affects/none: this bug affects no released version
- priority/hi-pri: high priority
- process/done
- severity/major
- type/bug/functionality: a bug preventing the database from delivering a promised function
- type/bug: something is unexpected
How do we use ES for fulltext search?
In NebulaGraph, according to our documentation, you must first create a native index before you can use a full-text index. This is counterintuitive. The reason lies in how the full-text index works:
data in ES
The docId is the ID that uniquely identifies a document in Elasticsearch; its length is limited to 512 bytes. You may not understand why the docId and value look this way, or you may understand but still be confused. Below I explain why they are designed this way and what problems the design causes.
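As a rough sketch (the layout and names here are illustrative, not the actual encoding in the listener code), the docId embeds at most the first 256 bytes of the property value so that the whole id stays under ES's 512-byte cap:

```python
ES_DOCID_MAX = 512   # Elasticsearch's hard limit on _id length, in bytes
VALUE_PREFIX = 256   # bytes of the property value embedded in the docId

def make_doc_id(index_name: str, value: str) -> bytes:
    # Illustrative layout only. Note the value is truncated at the byte
    # level -- this is the source of the UTF-8 problem discussed below.
    doc_id = index_name.encode("utf-8") + b":" + value.encode("utf-8")[:VALUE_PREFIX]
    assert len(doc_id) <= ES_DOCID_MAX
    return doc_id
```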
write
We limit the maximum length of the indexed text to 256 bytes. For convenience, assume for the rest of this example that the limit is only 3, and read every "256" in the docId and value descriptions above as "3".
insert vertex t (name) values 1: ("abcd")
Let's take the statement above as an example. Assume a native index index_t_name(3) is created on tag t.name, together with the full-text index es_ft_t_name.
When NebulaGraph (storaged) writes this vertex, its corresponding raft listener also writes the data to ES. Because the maximum length limit is 3 (as assumed above), "abcd" is truncated to "abc" before being written to ES.
read
lookup on t where prefix(t.name, "ab")
Now we query with the statement above. When NebulaGraph (graphd) processes the expression prefix(t.name, "ab"), it recognizes it as a full-text index expression and sends a query to ES: find the data prefixed with "ab". ES returns "abc" (which we wrote earlier). graphd then rewrites prefix(t.name, "ab") into t.name == "abc".
lookup on t where t.name=="abc"
graphd then executes this rewritten statement. But there is no data with t.name == "abc", only t.name == "abcd", so nothing is found.
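The whole round trip can be modeled in a few lines (a toy sketch with Python dicts standing in for ES and storaged, using the simplified 3-byte limit from the example):

```python
MAX_LEN = 3                               # the simplified length limit

es_docs = {}
storage = {1: {"name": "abcd"}}           # the real vertex in storaged

# write: the raft listener truncates the value before indexing it in ES
for vid, props in storage.items():
    es_docs["t:" + props["name"][:MAX_LEN]] = props["name"][:MAX_LEN]

# read: graphd asks ES for values prefixed with "ab" -> ["abc"]
matches = [v for v in es_docs.values() if v.startswith("ab")]

# graphd rewrites prefix(t.name, "ab") into t.name == "abc", but storaged
# only holds "abcd", so the lookup returns nothing
results = [vid for vid, props in storage.items() if props["name"] in matches]
assert results == []                      # the vertex is never found
```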
problems
utf8
We truncate the string directly at the 256-byte limit. For Chinese or other multi-byte characters, this is very likely to split a UTF-8 character in the middle. Such data triggers an error when written to ES.
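The failure is easy to reproduce in isolation. Chinese characters are 3 bytes each in UTF-8, and 256 is not a multiple of 3, so a naive byte-level cut at 256 lands inside a character:

```python
text = "数据库" * 100                     # 3-byte UTF-8 characters, 900 bytes total
truncated = text.encode("utf-8")[:256]   # naive byte-level truncation

# the cut splits the 86th character in half, so the result is no longer
# valid UTF-8; ES rejects such a payload on write
try:
    truncated.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print("valid UTF-8 after truncation:", valid_utf8)
```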
ES could not retrieve the data as we expected
If the characters we want to match lie beyond the first 256 bytes, they are never written to ES, so ES cannot search for them.
If a string is longer than 256, it will never be found
As in the example above.
solutions
There are three possible solutions.
Remove the maximum length limit on value
This is the simplest fix, and I have already done it in #4836. It is no worse than before, but it introduces a new problem.
Back to our example: if value has no length limit, the example works normally. However, suppose we insert "abcd" and then "abce". If we then run the lookup statement, we only get "abce". The docIDs of "abcd" and "abce" are identical, so writing "abce" overwrites the "abcd" document in ES.
We cannot remove the length limit on the docID, because ES does not allow it to exceed 512 bytes.
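The collision can be sketched like this (an illustrative model, with 3 again standing in for the 256-byte prefix that the docId keeps even after the value limit is removed):

```python
VALUE_PREFIX = 3          # simplified stand-in for the 256-byte docId prefix

def doc_id(value: str) -> str:
    # the docId still embeds only the value prefix, even though the
    # stored value itself is no longer truncated
    return "t:" + value[:VALUE_PREFIX]

es = {}
es[doc_id("abcd")] = "abcd"
es[doc_id("abce")] = "abce"   # same docId "t:abc": overwrites "abcd"
assert list(es.values()) == ["abce"]
```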
refactor docID
Instead of using the first 256 bytes of value in the docID, write the vid (for a tag) or {src, dst, rank} (for an edge). Other logic stays unchanged for the time being, and a native index is still required. On the premise of correctness, though, this change is minimal.
The problem is that I cannot find the logic that deletes the corresponding data in ES when a tag/edge is deleted. This needs further confirmation.
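A hedged sketch of what such a docId scheme could look like (the field order and separators are hypothetical; the point is only that the id is derived from the vertex/edge identity, never from the value, so distinct rows cannot collide regardless of value length):

```python
def tag_doc_id(space: str, tag: str, vid: str) -> str:
    # identify the ES document by the vertex itself, not by its value
    return f"{space}:{tag}:{vid}"

def edge_doc_id(space: str, etype: str, src: str, dst: str, rank: int) -> str:
    # edges are identified by {src, dst, rank} as described above
    return f"{space}:{etype}:{src}:{dst}:{rank}"
```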
refactor fulltext index (best but hard)
We reconstruct the whole full-text indexing logic. The native index is no longer required: record the vid (or src, dst, rank) of the vertex (edge) the text belongs to directly in ES, and then fetch vertices (edges) by vid (or src, dst, rank) directly.
This may take a lot of time.
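A minimal sketch of the refactored document shape (field names are hypothetical): ES stores the vid alongside the full, untruncated text, so a query returns vids directly and no native-index round trip is needed.

```python
# hypothetical ES documents for the refactored full-text index
es_docs = [
    {"vid": "101", "value": "abcd"},
    {"vid": "102", "value": "abce"},
]

def fulltext_prefix(docs, prefix):
    # a prefix query now yields vids directly; graphd can fetch the
    # matching vertices from storaged by vid
    return [d["vid"] for d in docs if d["value"].startswith(prefix)]

assert fulltext_prefix(es_docs, "ab") == ["101", "102"]
```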