Unexpected / undocumented behaviour of textContainsPrefix when query string contains multiple tokens #3942

mrckzgl · 2023-08-31T16:09:06Z

Consider a simple setup with a Lucene search backend containing a mixed index with a TEXT property (example in jruby, only dependency is lock_jar gem, complete minimal example also attached as zip):

LockJar.lock do
  jar 'org.janusgraph:janusgraph-berkeleyje:0.6.3'
  jar 'org.janusgraph:janusgraph-lucene:0.6.3'
  jar 'org.slf4j:slf4j-simple:1.7.36'
end
LockJar.load()

graph = org.janusgraph.core.JanusGraphFactory.build
  .set("storage.backend","berkeleyje")
  .set("storage.directory", "./db")
  .set("index.search.backend","lucene")
  .set("index.search.directory","./db/index")
  .set("query.force-index","true")
  .open()

mgmt = graph.open_management()
index_name = "simplesearch"

search_value = mgmt.make_property_key("search_value").dataType(Java::JavaLang::String.java_class).cardinality(Java::OrgJanusgraphCore::Cardinality::SINGLE).make()
index = mgmt.build_index(index_name, Java::OrgApacheTinkerpopGremlinStructure::Vertex.java_class)
index.add_key(search_value, Java::OrgJanusgraphCoreSchema::Mapping::TEXT.as_parameter)
index.build_mixed_index("search")
mgmt.commit()

watcher = org.janusgraph.graphdb.database.management::ManagementSystem.await_graph_index_status(graph, index_name)
watcher.status(org.janusgraph.core.schema::SchemaStatus::REGISTERED, org.janusgraph.core.schema::SchemaStatus::ENABLED)
watcher.call()

graph.tx().commit()

Next, two vertices are added to the graph, along with a helper function that searches on the indexed property and prints the number of results found:

Text = ::Java::OrgJanusgraphCoreAttribute::Text
VertexProperty = org.apache.tinkerpop.gremlin.structure.VertexProperty

graph.traversal().add_v().property("search_value", "This is a test with multiple words.").iterate()
graph.traversal().add_v().property("search_value", "singletoken").iterate()
graph.tx().commit()

def search(graph, string)
  puts("Search for '#{string}':" +
    graph.traversal().v().has("search_value", Text.textContainsPrefix(string)).count().next().to_s())  
end

Documentation of text search states:

When a string property is indexed as text, the string value is tokenized into a bag of tokens. The exact tokenization depends on the indexing backend and its configuration. JanusGraph’s default tokenization splits the string on non-alphanumeric characters and removes any tokens with less than 2 characters.

and further:

textContainsPrefix: is true if (at least) one word inside the text string begins with the query string

This definition holds when the query string contains only one token, and everything works as expected (output shown as comments):

search(graph, "singletoken") # Search for 'singletoken':1
search(graph, "singletoke") # Search for 'singletoke':1
search(graph, "single") # Search for 'single':1
search(graph, "token") # Search for 'token':0
search(graph, "singletokennotfound") # Search for 'singletokennotfound':0

So only prefixes of the property value "singletoken" are found.
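
For illustration, the documented single-token behaviour can be sketched in plain Ruby, without JanusGraph. This is only a rough model: `tokenize` and `contains_prefix?` are hypothetical helpers, and lowercasing the tokens is an assumption that the quoted documentation does not state.

```ruby
# Sketch only: approximates the documented default tokenization
# (split on non-alphanumeric characters, drop tokens shorter than 2 characters).
# Lowercasing is an assumption, not part of the quoted documentation.
def tokenize(text)
  text.downcase.split(/[^[:alnum:]]+/).reject { |token| token.length < 2 }
end

# Documented definition: true if at least one word in the text
# begins with the (single-token) query string.
def contains_prefix?(text, query)
  tokenize(text).any? { |token| token.start_with?(query.downcase) }
end

puts contains_prefix?("singletoken", "single") # true
puts contains_prefix?("singletoken", "token")  # false
```

Note that the definition only makes sense here because the query is compared as one string against each text token.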

The problem is that the definition in the documentation, namely that a result is found

if (at least) one word inside the text string begins with the query string

is not applicable when the query string contains multiple words/tokens. If that definition held, no result would ever be found for multi-token query strings, since a single word in the text string cannot begin with multiple words/tokens of the query string.

Hence my general question is: How does textContainsPrefix behave if the query string contains multiple tokens? In other words, what is the higher-level logic in this case?

Here are a few observations which seem to imply the following logic: for each token in the query string, the text string must contain at least one token that has the query token as a prefix:

search(graph, "This is a test with multiple words") # Search for 'This is a test with multiple words':1
search(graph, "test words with") # Search for 'test words with':1
search(graph, "This") # Search for 'This':1
search(graph, "Thi") # Search for 'Thi':1
search(graph, "test") # Search for 'test':1
search(graph, "tes") # Search for 'tes':1
search(graph, "test with") # Search for 'test with':1
search(graph, "test words without") # Search for 'test words without':0
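
The rule inferred from these observations can be written down as a small plain-Ruby sketch. It is a model of the assumed logic only, not the JanusGraph implementation; `tokenize` and `contains_all_prefixes?` are hypothetical helpers, and lowercasing is an assumption.

```ruby
# Sketch only: approximate default tokenization (lowercasing is an assumption).
def tokenize(text)
  text.downcase.split(/[^[:alnum:]]+/).reject { |token| token.length < 2 }
end

# Inferred rule: every query token must be a prefix of at least one text token.
def contains_all_prefixes?(text, query)
  text_tokens = tokenize(text)
  tokenize(query).all? do |query_token|
    text_tokens.any? { |text_token| text_token.start_with?(query_token) }
  end
end

sample = "This is a test with multiple words."
["test words with", "test with", "tes", "test words without"].each do |query|
  puts "#{query.inspect} => #{contains_all_prefixes?(sample, query)}"
end
# "test words with" => true
# "test with" => true
# "tes" => true
# "test words without" => false
```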

However, as soon as the query string contains multiple tokens and at least one of them is incomplete (a proper prefix), the results become unexpected:

search(graph, "test wit") # Search for 'test wit':0
search(graph, "Thi tes") # Search for 'Thi tes':0

Here, I would expect both cases to return a result.

I would even assume that this is a bug. However, since the documentation of textContainsPrefix is incomplete here, I cannot be sure. In any case, the current logic does not make sense to me.

We would be very happy to get some insight into this issue.

best

query_sample.zip

hadoopmarc · 2023-09-01T05:41:12Z

Nice write-up with a sound analysis! You would have to check in the JanusGraph code whether this behaviour is due to JanusGraph or to Lucene. In the former case I would count it as a bug, and you can file an issue. Maybe a JanusGraph committer recognizes the "begins with" functionality. I would expect this to be possible only with a wildcard search, where JanusGraph inserts the wildcard (apparently for the first token of the query only).

mrckzgl · 2023-09-01T09:38:21Z

Author

Thank you very much

Nice write-up with a sound analysis!

Thanks! :-)

I followed your advice and looked in the code. It has to be somewhere here: https://github.com/JanusGraph/janusgraph/blob/master/janusgraph-lucene/src/main/java/org/janusgraph/diskstorage/lucene/LuceneIndex.java I didn't get into the gory details, but I saw that the raw Lucene queries are logged at DEBUG level, and voilà:

[main] DEBUG org.janusgraph.diskstorage.lucene.LuceneIndex - Executed query [+(+search_value:single*)] in 1 ms
[...]
[main] DEBUG org.janusgraph.diskstorage.lucene.LuceneIndex - Executed query [+(+(+search_value:test +search_value:with))] in 1 ms
[...]
[main] DEBUG org.janusgraph.diskstorage.lucene.LuceneIndex - Executed query [+(+(+search_value:thi +search_value:tes))] in 0 ms

It is even "worse" than your assumption: for single-token query strings the wildcard is added, but as soon as there are multiple query tokens, no wildcard is added for any of them. However, the assessment "worse" applies only with respect to the general (desired) logic I assumed above.
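
To make the difference concrete, here is a plain-Ruby model of what the logged queries suggest. It is derived only from the three DEBUG lines above and the search results reported earlier, not from reading LuceneIndex.java; `tokenize` and `observed_lucene_match?` are hypothetical helpers.

```ruby
# Sketch only: approximate default tokenization. Lowercasing is assumed
# because the logged queries show lowercased terms ("thi", "tes").
def tokenize(text)
  text.downcase.split(/[^[:alnum:]]+/).reject { |token| token.length < 2 }
end

# Model of the logged queries:
#   one query token      -> wildcard (prefix) match, e.g. search_value:single*
#   several query tokens -> every token must equal a text token exactly
def observed_lucene_match?(text, query)
  text_tokens = tokenize(text)
  query_tokens = tokenize(query)
  if query_tokens.length == 1
    text_tokens.any? { |text_token| text_token.start_with?(query_tokens.first) }
  else
    query_tokens.all? { |query_token| text_tokens.include?(query_token) }
  end
end

sample = "This is a test with multiple words."
["Thi", "test with", "test wit", "Thi tes"].each do |query|
  puts "#{query.inspect} => #{observed_lucene_match?(sample, query)}"
end
# "Thi" => true
# "test with" => true
# "test wit" => false
# "Thi tes" => false
```

This model reproduces all counts reported in the first post, which is consistent with the wildcard being applied only in the single-token case.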

So, IMHO, the main question has to be answered before doing anything else: How is textContainsPrefix supposed to behave if the query string contains multiple tokens?

mad Sep 4, 2023
Collaborator

Looks like a bug

This is a possible fix:

Index: janusgraph-lucene/src/main/java/org/janusgraph/diskstorage/lucene/LuceneIndex.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/janusgraph-lucene/src/main/java/org/janusgraph/diskstorage/lucene/LuceneIndex.java b/janusgraph-lucene/src/main/java/org/janusgraph/diskstorage/lucene/LuceneIndex.java
--- a/janusgraph-lucene/src/main/java/org/janusgraph/diskstorage/lucene/LuceneIndex.java	(revision 1950f3b70fb6d8926801dd7afaac393a3a27a624)
+++ b/janusgraph-lucene/src/main/java/org/janusgraph/diskstorage/lucene/LuceneIndex.java	(date 1693814693211)
@@ -727,7 +727,11 @@
                 occur = BooleanClause.Occur.MUST;
             }
             for (final List<String> stems : terms) {
-                q.add(combineTerms(key, stems, TermQuery::new), occur);
+                if (janusgraphPredicate == Text.CONTAINS_PREFIX) {
+                    q.add(combineTerms(key, stems, PrefixQuery::new), occur);
+                } else {
+                    q.add(combineTerms(key, stems, TermQuery::new), occur);
+                }
             }
             params.addQuery(q.build());
         }
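
The idea of the patch, choosing the clause type per predicate instead of always using an exact term clause, can be modelled in plain Ruby as follows. This is a hedged sketch: `Clause`, `build_clauses`, and `clause_matches?` are hypothetical names, and the symbols `:prefix` and `:term` merely stand in for Lucene's PrefixQuery and TermQuery; no Lucene or JanusGraph API is called.

```ruby
# Hypothetical stand-in for a single required query clause.
Clause = Struct.new(:type, :field, :term)

# Mirrors the patched loop: prefix clauses for the prefix predicate,
# exact term clauses otherwise.
def build_clauses(field, query_tokens, predicate)
  type = predicate == :contains_prefix ? :prefix : :term
  query_tokens.map { |token| Clause.new(type, field, token) }
end

# Evaluates one clause against the already tokenized text.
def clause_matches?(clause, text_tokens)
  case clause.type
  when :prefix then text_tokens.any? { |text_token| text_token.start_with?(clause.term) }
  else text_tokens.include?(clause.term)
  end
end

text_tokens = %w[this is test with multiple words]
clauses = build_clauses("search_value", %w[thi tes], :contains_prefix)
puts clauses.all? { |clause| clause_matches?(clause, text_tokens) } # true
```

With all clauses required, the query "Thi tes" would match under this model, whereas exact term clauses for the same tokens would not.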

mrckzgl Sep 4, 2023
Author

Two things:

1. Does that mean textContainsPrefix is supposed to work as I inferred above?
2. Thanks, is it possible for someone to transform this discussion to a ticket, or should I open a new one?

mad Sep 14, 2023
Collaborator

1. The in-memory implementation org.janusgraph.core.attribute.Text#CONTAINS_PREFIX works as you described, but SolrIndex possibly works differently: return (key + ":" + escapeValue(value) + "*"); (without tokenization).
2. Yes, sure, you can open a new one.

mrckzgl Oct 20, 2023
Author

Regarding 2) Done. See: #4073
