Question: how can you escape double quotes in search queries? #185

Closed
JLHasson opened this issue Jan 10, 2024 · 6 comments
Labels
query Issues related to the tantivy query engine

Comments

@JLHasson

I have a query containing a double quote, e.g. a measurement in inches such as '36"'. Right now I'm using tantivy.Index.parse_query(query) to parse the query. I have two questions:

  1. What is the default behavior of the analyzer in this case for double quotes? (can probably answer myself with a bit more digging)
  2. I haven't been able to find a way to parse this without hitting SyntaxError, is this a bug?
---> 21     query = self._index.parse_query(q.query)  # Search all fields
     22     search_results = self._tantivy_results_to_docs(
     23         self._searcher.search(query, q.limit).hits
     24     )
     25     return [
     26         SearchResultFromSystem(
     27             result=SearchResult(query=q, result=self._tantivy_doc_to_dict(doc)),
   (...)
     32         for idx, (score, doc) in enumerate(search_results)
     33     ]

ValueError: Syntax Error: fawkes 36" blue vanity

The approaches I've tried, all of which result in the same error above:

- Escape the double quote: 36\"
- Double or triple escape the quote: 36\\\" or 36\\"
- Encode: '36"'.encode('utf-8')
- Unicode from here: "36\\u{FF02}"
- Raw python string: r'36"'
- Raw python string w/ escapes from above: r'36\"'...

What should I do to get Tantivy to interpret this not as a field search, but as a literal double quote?

@cjrh
Collaborator

cjrh commented Jan 10, 2024

Hi @JLHasson, thanks for the report!

Could you make a tiny self-contained example we can run to reproduce the error? A simple approach could be to adapt one of the unit tests. Please also report the version of tantivy you're using. I found problems with escaping quotes back in version 0.19.2, but I don't think I've seen those problems since. A simple reproducer we can just copy-paste and run would save time.
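
For reference, a quick way to print the installed version (standard library only, nothing tantivy-specific):

# Print the installed tantivy version without relying on poetry/pip output.
from importlib.metadata import version
print(version("tantivy"))  # e.g. 0.21.0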

@JLHasson
Author

Hey @cjrh ! Sure thing:

$ poetry show | grep tantivy
tantivy                   0.21.0       

Code:

import tantivy
query = '36"'
doc = {"title": 'some title with 36"'}
schema_builder = tantivy.SchemaBuilder()
schema_builder.add_text_field("title", stored=True)
_schema = schema_builder.build()
_index = tantivy.Index(_schema)
writer = _index.writer()
writer.add_document(tantivy.Document(**doc))
writer.commit()

_index.reload()
query = _index.parse_query(query)  # Search all fields
# The parse_query call above raises:
# ValueError: Syntax Error: 36"

@cjrh
Collaborator

cjrh commented Jan 11, 2024

Thank you for the reproducer, that helps.

It seems that the issue is the default tokenizer being used for the text field. If you change the tokenizer to raw or en_stem, the parse_query() call no longer fails. raw is not super useful for general text fields, but hopefully you can work with en_stem.

import tantivy

query = r'title:"36\""'
doc = {"title": r'some title with 36\"'}

schema_builder = tantivy.SchemaBuilder()
schema_builder.add_text_field("title", stored=True, tokenizer_name="en_stem")
_schema = schema_builder.build()
_index = tantivy.Index(_schema)
writer = _index.writer()
writer.add_document(tantivy.Document(**doc))
writer.commit()

_index.reload()
query = _index.parse_query(query)  # Search all fields
print(query)

searcher = _index.searcher()
results = searcher.search(query, 3).hits
print(results)

Note that the en_stem tokenizer will stem the tokens, which may not be what you want. This is the output produced by running the above:

$ python main.py 
Query(TermQuery(Term(field=0, type=Str, "36")))
[(0.28768211603164673, <tantivy.DocAddress object at 0x71eb645fdf70>)]

You can see that the escaped quote in the query is lost after parsing. The r"36\"" becomes just 36.

This seems like a bug in the default tokenizer.
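
For completeness, here is a minimal (untested) sketch of the raw variant mentioned above. With raw the entire field value is indexed as a single token, so the quote should survive, but only an exact match on the complete string is expected to hit:

import tantivy

# Sketch: same reproducer, but with the "raw" tokenizer. The whole field
# value becomes a single token, so the embedded quote should be preserved
# on both the index side and the query side.
query = r'title:"36\""'
doc = {"title": '36"'}  # the complete value is one token: 36"

schema_builder = tantivy.SchemaBuilder()
schema_builder.add_text_field("title", stored=True, tokenizer_name="raw")
schema = schema_builder.build()
index = tantivy.Index(schema)
writer = index.writer()
writer.add_document(tantivy.Document(**doc))
writer.commit()

index.reload()
parsed = index.parse_query(query)
print(parsed)  # expected: a term query whose term still contains the "
print(index.searcher().search(parsed, 1).hits)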

@cjrh
Collaborator

cjrh commented Jan 11, 2024

For your specific use-case, it may be necessary to make your own tokenizer.

@cjrh
Collaborator

cjrh commented Jan 11, 2024

Hmm, I was wrong about the default tokenizer having a bug; the same escaping works with the default tokenizer just as it did in my en_stem example above:

import tantivy
query = r'title:"36\""'
doc = {"title": r'some title with 36"'}

schema_builder = tantivy.SchemaBuilder()
schema_builder.add_text_field("title", stored=True, tokenizer_name="default")
schema = schema_builder.build()
index = tantivy.Index(schema)
writer = index.writer()
writer.add_document(tantivy.Document(**doc))
writer.commit()

index.reload()
query = index.parse_query(query)  # Search all fields
print(query)
print(index.searcher().search(query, 1).hits)

Output:

$ python main2.py 
Query(TermQuery(Term(field=0, type=Str, "36")))
[(0.28768211603164673, <tantivy.DocAddress object at 0x783eb87fdf50>)]

Again though, the extra " is lost after parsing because the default tokenizer's rules drop it.
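
As an application-side workaround (just a sketch, not part of tantivy; the helper name is made up), you can escape embedded double quotes and wrap the user's text in a phrase before calling parse_query. Note that this turns the whole input into a phrase query, which changes the matching semantics compared with loose term matching:

def quote_user_query(text: str, field: str = "title") -> str:
    # Hypothetical helper: escape embedded double quotes and wrap the
    # whole input in a phrase so the parser treats the quote as a
    # literal character rather than as a phrase delimiter.
    escaped = text.replace('"', '\\"')
    return f'{field}:"{escaped}"'

# Example: quote_user_query('36"') produces title:"36\"" (the form used above)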

@cjrh
Collaborator

cjrh commented Jan 11, 2024

I'm going to leave this issue open until I have a chance to add documentation about this specifically. Also, I really need to find some time to address #25 to make it easier to customize tokenization.

@cjrh cjrh added the query Issues related to the tantivy query engine label Jan 20, 2024
@cjrh cjrh closed this as completed in eba0d55 Jan 21, 2024