
Term Query is not tokenized (?) #296

Open
afbarbaro opened this issue Jun 6, 2024 · 9 comments
Labels
documentation Improvements or additions to documentation

Comments

@afbarbaro

I'm testing tantivy-py, which I'm finding pretty great. However, I bumped into what seems to be an issue with the Python package: it seems that term queries are not tokenized when using the searcher.search(query, ..) method, so I can't really use the en_stem tokenizer (since it's not exposed for me to tokenize the query, only the indexing of documents).

I'm testing tantivy-py with the Simple Wikipedia Example Set from Cohere, and here's what I see with a few sample queries:

  • Australia monarchy --> no good hits unless I change it to Australia monarch
  • Titanic sink --> no good hits unless I change it to Titan sink

Is this a "feature" or a "bug"? I don't mind tokenizing the query myself before calling the search method, but tokenizers are not exposed in the Python bindings.

Any suggestions?

@afbarbaro afbarbaro changed the title Query term is not tokenized (?) Term Query is not tokenized (?) Jun 6, 2024
@cjrh
Collaborator

cjrh commented Jun 6, 2024

Hi @afbarbaro 😄

This is the old problem of how stemmers don't always do what we think they do.

Here's a small program I made to test queries quickly:

from tantivy import SchemaBuilder, Index, Document

# In-memory index whose "body" field is indexed with the en_stem tokenizer.
schema = SchemaBuilder().add_text_field("body", stored=True, tokenizer_name="en_stem").build()
index = Index(schema=schema, path=None)

writer = index.writer(heap_size=15_000_000, num_threads=1)
for text in ("lovely freshness older", "Australian monarchy", "Titanic sinks"):
    doc = Document()
    doc.add_text("body", text)
    writer.add_document(doc)
writer.commit()
index.reload()

searcher = index.searcher()

def find(query_text: str):
    q = index.parse_query(query_text, ["body"])
    if hits := searcher.search(q).hits:
        for _, doc_address in hits:
            doc = searcher.doc(doc_address)
            print(f"{query_text} hit: {doc['body']}")
    else:
        print(f"{query_text} not found")

# Run with `python -i main.py`

The idea is to run it with python -i main.py, so that when the script ends it leaves you in an interpreter where you can test using the find() function.

Indeed, as you saw, monarchy does not stem to monarch:

$ python -i main.py
>>> find('monarch')
monarch not found
>>> find('monarchs')
monarchs not found
>>> find('monarchy')
monarchy hit: ['Australian monarchy']

However, you'll see that if we search for Australians (with an "s"), which was not the term indexed, we'll get a hit:

>>> find('Australians')
Australians hit: ['Australian monarchy']

This means that tantivy-py is indeed stemming the query text. I checked the code, and the query parser does make use of the tokenizers registered on the fields.

Stemming can be very surprising, for example:

>>> find('monarchies')
monarchies hit: ['Australian monarchy']

Did you expect monarchies to work? This happens because monarchy and monarchies share the stem monarchi, which you can verify on the demo page for the Snowball stemmer: https://snowballstem.org/demo.html

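You can also reproduce this check locally with the third-party snowballstemmer package (an assumption here: it implements the same Snowball English algorithm the demo page runs), which makes it easy to see what a given query term will actually become:

```python
# Requires: pip install snowballstemmer (third-party package implementing
# the Snowball algorithms, same family as the snowballstem.org demo).
import snowballstemmer

stemmer = snowballstemmer.stemmer("english")
for word in ("monarch", "monarchs", "monarchy", "monarchies"):
    print(f"{word} -> {stemmer.stemWord(word)}")
```

Both monarchy and monarchies come out as monarchi, which is why they match each other in the index but never match monarch.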

In my test I do get a hit using Titan so I'm not sure why that doesn't work for you:

>>> find('Titan')
Titan hit: ['Titanic sinks']
>>> find('Titans')
Titans hit: ['Titanic sinks']
>>> find('Titanics')
Titanics hit: ['Titanic sinks']
>>> find('titanics')
titanics hit: ['Titanic sinks']

Sometimes stemming can be very frustrating. For example, in the first document you would think older should stem to old but this doesn't happen:

>>> find('old')
old not found
>>> find('older')
older hit: ['lovely freshness older']

Even oldest doesn't stem back to older:

>>> find('oldest')
oldest not found

This is because the stem for older is just older.


@afbarbaro
Author

@cjrh thanks so much for the detailed explanation. I now see the difference between your example and what I was doing: you're using index.parse_query(query_text), while I was building the query like this:

from tantivy import Occur, Query

subqueries = [(Occur.Should, Query.term_query(index.schema, "body", term)) for term in terms]
query = Query.boolean_query(subqueries)

so I see that stemming is indeed applied to the query text when using index.parse_query, but not to the text given to Query.term_query. Does this make sense? And is this by design?

One of the reasons I was constructing the query myself is that in the Python version of index.parse_query you cannot pass the additional arguments available in the underlying Rust implementation, particularly the map controlling fuzzy search for a given field.

So I guess my question now is:

  • can I get the additional arguments for parse_query?
  • should term_query stem? If not, should there be a way to access the field's tokenizer so I can invoke it before passing the term to term_query?

Again, thank you for your help!

@afbarbaro
Author

Also, for this:

should term_query stem? If not, should there be a way to access the field's tokenizer so I can invoke it before passing the term to term_query?

There is a hack I can use, at least temporarily, to get the stemmed terms: run parsed = index.parse_query(query_text), convert parsed to a str, and pull the stemmed words out of each term query inside it. Pretty nasty, but it gets me unblocked.
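The extraction step of that hack can be sketched with a regular expression over the query's string form. This is a minimal sketch under an assumption: the sample string below only mimics what str(parsed) might look like, and the exact Debug-style format depends on the tantivy version in use.

```python
import re

# Illustrative (assumed) example of what
# str(index.parse_query("Australia monarchy", ["body"])) might return;
# the real Debug format is version-dependent.
parsed_repr = (
    'BooleanQuery { subqueries: ['
    '(Should, TermQuery(Term(field=0, type=Str, "australia"))), '
    '(Should, TermQuery(Term(field=0, type=Str, "monarchi")))] }'
)

# Pull the quoted, already-stemmed terms out of the string representation.
stemmed_terms = re.findall(r'"([^"]+)"', parsed_repr)
print(stemmed_terms)  # ['australia', 'monarchi']
```

The extracted terms could then be fed to Query.term_query one by one; fragile, since any change to the Debug format breaks the regex.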

@cjrh
Collaborator

cjrh commented Jun 7, 2024

so I see that the stemming is indeed applied to the query text when using index.parse_query but it is not applied to the text given to Query.term_query. Does this make sense? And is this by design?

Ah, I see. To answer your question: no, this is not by design. tantivy-py has been built by a fairly large number of volunteers and drive-by contributors over the years, so relatively little is specifically "designed". Earlier on we wanted to avoid adding all the fine-grained query classes (and other classes) and just have the parse_query function, but in the end this proved too limiting, so recently we started adding those classes.

I think more people will come across this now that we have the Query class, so we'll need some kind of answer for how to do stemming on the query here. If you were using the upstream tantivy in Rust, what would the approach be? As a general guiding principle, I'd like tantivy-py to be a recognisable wrapper of tantivy, so that familiarity with one confers familiarity with the other. Python in general has had great success with this approach of wrapping C libraries over the decades.

@cjrh
Collaborator

cjrh commented Jun 7, 2024

should term_query stem?

Maybe.

It wraps TermQuery, so I'd want to know how stemming is handled in tantivy first, before deciding on a path for us.

@cjrh
Collaborator

cjrh commented Jun 7, 2024

We do have the fuzzy map in parse_query, at least in main:

[screenshot of the parse_query signature on main]

@afbarbaro
Author

Thanks @cjrh. The reason I didn't see these additional args is that the type annotations in the Python bindings are outdated.

def parse_query(

Are these created by hand or by some process? Maybe fixing this is a simple PR I can contribute.

@cjrh
Collaborator

cjrh commented Jun 8, 2024

Yes please, that would be great if you could update those type annotations.

@cjrh
Collaborator

cjrh commented Jun 10, 2024

Are these created by hand or by some process? Maybe fixing this is a simple PR I can contribute.

Yes, created by hand. It would be a simple PR.

@cjrh cjrh added the documentation Improvements or additions to documentation label Jun 10, 2024