Full text index does not work as expected #4837
Labels
- affects/none: this bug affects no released version
- priority/hi-pri: high priority
- process/done
- severity/major
- type/bug/functionality: a bug preventing the database from delivering a promised function
- type/bug: something is unexpected
How do we use ES for fulltext search?
In NebulaGraph, according to our documentation, you must first create a native index before you can use a full-text index. This is counterintuitive. The reason lies in how the full-text index works:
data in ES
The docId is the ID that uniquely identifies a document in Elasticsearch; its length is limited to 512 bytes. You may not understand why the docId and value look this way, or you may understand but still be confused. Below I explain why they are designed this way and what problems the design causes.
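As a rough sketch (the layout and names here are illustrative, not the actual encoding in the listener code), the docId embeds at most the first 256 bytes of the property value so that the whole id stays under ES's 512-byte cap:

```python
ES_DOCID_MAX = 512   # Elasticsearch's hard limit on _id length, in bytes
VALUE_PREFIX = 256   # bytes of the property value embedded in the docId

def make_doc_id(index_name: str, value: str) -> bytes:
    # Illustrative layout only. Note the value is truncated at the byte
    # level -- this is the source of the UTF-8 problem discussed below.
    doc_id = index_name.encode("utf-8") + b":" + value.encode("utf-8")[:VALUE_PREFIX]
    assert len(doc_id) <= ES_DOCID_MAX
    return doc_id
```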
write
We limit the maximum length of the indexed text to 256 bytes. For convenience, assume for the rest of this example that the limit is only 3, and read every "256" in the docId and value descriptions above as "3".
insert vertex t (name) values 1: ("abcd")
Let's take the statement above as an example. Assume a native index index_t_name(3) is created on tag t.name, together with the full-text index es_ft_t_name.
When NebulaGraph (storaged) writes this vertex, its corresponding raft listener also writes the data to ES. Because the maximum length limit is 3 (as assumed above), "abcd" is truncated to "abc" before being written to ES.
read
lookup on t where prefix(t.name, "ab")
Now we query with the statement above. When NebulaGraph (graphd) processes the expression prefix(t.name, "ab"), it recognizes it as a full-text index expression and sends a query to ES: find the data prefixed with "ab". ES returns "abc" (which we wrote earlier). graphd then rewrites prefix(t.name, "ab") into t.name == "abc".
lookup on t where t.name=="abc"
graphd then executes this rewritten statement. But there is no data with t.name == "abc", only t.name == "abcd", so nothing is found.
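The whole round trip can be modeled in a few lines (a toy sketch with Python dicts standing in for ES and storaged, using the simplified 3-byte limit from the example):

```python
MAX_LEN = 3                               # the simplified length limit

es_docs = {}
storage = {1: {"name": "abcd"}}           # the real vertex in storaged

# write: the raft listener truncates the value before indexing it in ES
for vid, props in storage.items():
    es_docs["t:" + props["name"][:MAX_LEN]] = props["name"][:MAX_LEN]

# read: graphd asks ES for values prefixed with "ab" -> ["abc"]
matches = [v for v in es_docs.values() if v.startswith("ab")]

# graphd rewrites prefix(t.name, "ab") into t.name == "abc", but storaged
# only holds "abcd", so the lookup returns nothing
results = [vid for vid, props in storage.items() if props["name"] in matches]
assert results == []                      # the vertex is never found
```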
problems
utf8
We truncate the string directly at the 256-byte limit. For Chinese or other multi-byte characters, this is very likely to split a UTF-8 character in the middle. Such data triggers an error when written to ES.
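The failure is easy to reproduce in isolation. Chinese characters are 3 bytes each in UTF-8, and 256 is not a multiple of 3, so a naive byte-level cut at 256 lands inside a character:

```python
text = "数据库" * 100                     # 3-byte UTF-8 characters, 900 bytes total
truncated = text.encode("utf-8")[:256]   # naive byte-level truncation

# the cut splits the 86th character in half, so the result is no longer
# valid UTF-8; ES rejects such a payload on write
try:
    truncated.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print("valid UTF-8 after truncation:", valid_utf8)
```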
ES could not retrieve the data as we expected
If the characters we want to match lie beyond the first 256 bytes, they are never written to ES, so ES cannot search for them.
If a string is longer than 256, it will never be found
As in the example above.
solutions
There are three possible solutions.
Remove the maximum length limit on value
This is the simplest fix, and I have already done it in #4836. It is no worse than before, but it introduces a new problem.
Back to our example: if value has no length limit, the example works normally. However, suppose we insert "abcd" and then "abce". If we then run the lookup statement, we only get "abce". The docIDs of "abcd" and "abce" are identical, so writing "abce" overwrites the "abcd" document in ES.
We cannot remove the length limit on the docID, because ES does not allow it to exceed 512 bytes.
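The collision can be sketched like this (an illustrative model, with 3 again standing in for the 256-byte prefix that the docId keeps even after the value limit is removed):

```python
VALUE_PREFIX = 3          # simplified stand-in for the 256-byte docId prefix

def doc_id(value: str) -> str:
    # the docId still embeds only the value prefix, even though the
    # stored value itself is no longer truncated
    return "t:" + value[:VALUE_PREFIX]

es = {}
es[doc_id("abcd")] = "abcd"
es[doc_id("abce")] = "abce"   # same docId "t:abc": overwrites "abcd"
assert list(es.values()) == ["abce"]
```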
refactor docID
Instead of using the first 256 bytes of value in the docID, write the vid (for a tag) or {src, dst, rank} (for an edge). Other logic stays unchanged for the time being, and a native index is still required. On the premise of correctness, though, this change is minimal.
The problem is that I cannot find the logic that deletes the corresponding data in ES when a tag/edge is deleted. This needs further confirmation.
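A hedged sketch of what such a docId scheme could look like (the field order and separators are hypothetical; the point is only that the id is derived from the vertex/edge identity, never from the value, so distinct rows cannot collide regardless of value length):

```python
def tag_doc_id(space: str, tag: str, vid: str) -> str:
    # identify the ES document by the vertex itself, not by its value
    return f"{space}:{tag}:{vid}"

def edge_doc_id(space: str, etype: str, src: str, dst: str, rank: int) -> str:
    # edges are identified by {src, dst, rank} as described above
    return f"{space}:{etype}:{src}:{dst}:{rank}"
```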
refactor fulltext index (best but hard)
We reconstruct the whole full-text indexing logic. The native index is no longer required: record the vid (or src, dst, rank) of the vertex (edge) the text belongs to directly in ES, and then fetch vertices (edges) by vid (or src, dst, rank) directly.
This may take a lot of time.
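A minimal sketch of the refactored document shape (field names are hypothetical): ES stores the vid alongside the full, untruncated text, so a query returns vids directly and no native-index round trip is needed.

```python
# hypothetical ES documents for the refactored full-text index
es_docs = [
    {"vid": "101", "value": "abcd"},
    {"vid": "102", "value": "abce"},
]

def fulltext_prefix(docs, prefix):
    # a prefix query now yields vids directly; graphd can fetch the
    # matching vertices from storaged by vid
    return [d["vid"] for d in docs if d["value"].startswith(prefix)]

assert fulltext_prefix(es_docs, "ab") == ["101", "102"]
```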