Accessing QueryParser #201

Closed
safwansamsudeen opened this issue Jan 30, 2024 · 9 comments · Fixed by #202
Labels
feature-parity Feature parity with upstream tantivy help wanted Extra attention is needed

Comments

@safwansamsudeen

For fuzzy search or boosting fields, I need to access QueryParser.

Is this possible with Tantivy Py? This article seems to think so, but it doesn't work (ImportError; I also checked and see that QueryParser isn't available at the top level anyway).

If not, how can I do fuzzy searching?

@cjrh cjrh added the help wanted Extra attention is needed label Jan 30, 2024
@cjrh
Collaborator

cjrh commented Jan 30, 2024

I agree, the article is odd. The code can't work with our current tantivy release:

from tantivy import Collector, Index, QueryParser, SchemaBuilder, Term

# Create a schema
schema_builder = SchemaBuilder()
title_field = schema_builder.add_text_field("title", stored=True)
body_field = schema_builder.add_text_field("body", stored=True)
schema = schema_builder.build()

# Create an index with the schema
index = Index(schema)

# Add documents to the index
with index.writer() as writer:
    writer.add_document({"title": "First document", "body": "This is the first document."})
    writer.add_document({"title": "Second document", "body": "This is the second document."})
    writer.commit()

# Create a query parser
query_parser = QueryParser(schema, ["title", "body"])

# Basic search
query = query_parser.parse_query("first")
collector = Collector.top_docs(10)
search_result = index.searcher().search(query, collector)

print("Basic search results:")
for doc in search_result.docs():
    print(doc)

# Fuzzy search
fuzzy_query = query_parser.parse_query("frst~1")  # Allows one edit distance
fuzzy_collector = Collector.top_docs(10)
fuzzy_search_result = index.searcher().search(fuzzy_query, fuzzy_collector)

print("Fuzzy search results:")
for doc in fuzzy_search_result.docs():
    print(doc)

# Filtered search
title_term = Term(title_field, "first")
body_term = Term(body_field, "first")
filter_query = schema.new_boolean_query().add_term(title_term).add_term(body_term)
filtered_collector = Collector.top_docs(10)
filtered_search_result = index.searcher().search(filter_query, filtered_collector)

print("Filtered search results:")
for doc in filtered_search_result.docs():
    print(doc)

Collector and QueryParser aren't exposed yet.
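
For readers unfamiliar with the `~1` suffix in the quoted snippet: it asks for terms within one edit of `frst`. The sketch below only illustrates what "edit distance 1" means, using plain Levenshtein distance in pure Python. It is not how tantivy implements fuzzy matching, the helper names `edit_distance` and `fuzzy_matches` are made up for this example, and it ignores transpositions, which a real engine may count as a single edit.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions that turn `a` into `b`."""
    # prev[j] is the distance between the prefix of `a` seen so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep on a match
            ))
        prev = curr
    return prev[-1]


def fuzzy_matches(query_term: str, indexed_terms: list[str], max_distance: int = 1) -> list[str]:
    """Return the indexed terms within `max_distance` edits of `query_term`."""
    return [t for t in indexed_terms if edit_distance(query_term, t) <= max_distance]


# Terms taken from the two example documents in the snippet above.
terms = ["first", "second", "document", "this", "is", "the"]
print(fuzzy_matches("frst", terms))
```

With these terms, only `first` is one edit away from `frst` (a single inserted `i`), which is why the fuzzy query in the article would be expected to hit the first document.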

@cjrh
Collaborator

cjrh commented Jan 30, 2024

Boosting is already requested in #50 (and it mentions fuzzy search also)
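
To make "boosting fields" concrete, here is a toy sketch in pure Python, assuming a simple term-count score rather than tantivy's real ranking. The function names and the `boosts` mapping are hypothetical and exist only for this illustration; the point is just that a field boost multiplies that field's contribution to the final score.

```python
import re


def field_score(text: str, term: str) -> int:
    """Toy relevance signal: how often `term` occurs in `text` (case-insensitive)."""
    return re.findall(r"\w+", text.lower()).count(term.lower())


def boosted_score(doc: dict[str, str], term: str, boosts: dict[str, float]) -> float:
    """Sum the per-field scores, each multiplied by its field boost (default 1.0)."""
    return sum(boosts.get(field, 1.0) * field_score(text, term)
               for field, text in doc.items())


def rank(docs: list[dict[str, str]], term: str, boosts: dict[str, float]) -> list[str]:
    """Return document titles ordered from highest to lowest boosted score."""
    ordered = sorted(docs, key=lambda d: boosted_score(d, term, boosts), reverse=True)
    return [d["title"] for d in ordered]


docs = [
    {"title": "First document", "body": "This is the first document."},
    {"title": "Second document", "body": "first come first served, first of all"},
]

print(rank(docs, "first", boosts={}))              # no boost: body matches dominate
print(rank(docs, "first", boosts={"title": 3.0}))  # a title match now counts three times
```

Without a boost the second document wins on raw body matches; with the title boosted, the document whose title contains the term moves to the top.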

@cjrh cjrh added the feature-parity Feature parity with upstream tantivy label Jan 30, 2024
@wallies
Collaborator

wallies commented Jan 30, 2024

Strange thing is that the article uses tantivy-py, whereas we maintain tantivy. Tantivy-py stopped at 0.11. No idea how that ever worked, looking back at that tag.

@safwansamsudeen
Author

safwansamsudeen commented Jan 31, 2024

So what are my options? I can't boost fields right now?

EDIT: from what I'm reading, it seems like it's just a matter of "plugging in". That sounds easy - is it? If you could provide some link on how to do that, I'll submit a PR. But IDK Rust, so I'd really appreciate it if you could do that in the next couple weeks.

@cjrh
Collaborator

cjrh commented Feb 1, 2024

@safwansamsudeen I understand your frustration. Typically in most open-source projects, including this one, these are your options:

  1. If you need this feature because your employer needs it, i.e., there is a company behind your request, your fastest option is to convince your employer to pay someone to add the feature. It'll cost a couple hundred dollars (very rough estimate) but it can get done quickly. It shouldn't be hard to find someone with Rust experience who will do the work. You don't have to work through us to find someone; you can look around yourself and hire someone on a short-term contract to do the work.

  2. You could try to just ask someone to do it, for example on Reddit or the Rust channel in LinkedIn. It's basically the same as option 1, but with no money for the contributor.

  3. You could try to do it yourself. This is easier when you know a bit more of the stack, and more difficult when you don't. This option is a little tricky because if you need a lot of help getting it done, the help needs to come from somewhere which will also consume someone else's time, just like in options 1 and 2. I guess it would depend on how much time is involved for the helping party.

  4. Wait for someone else to implement it. On a very active project, this can happen quickly. Unfortunately tantivy-py is not super active so new features wait until one of the maintainers or someone else has some free time. For example, I am doing my contributions over weekends, and I don't have many weekends free.

Fortunately, this feature is not very complex. It just needs someone to actually do the work :)

@cjrh
Collaborator

cjrh commented Feb 1, 2024

If you want to take a quick stab at trying an implementation yourself, time-box it to a couple hours, then I can have a look at your code. Maybe that is enough. You can look at the other classes and how they are currently wrapped in tantivy-py, and then just try to copy that for QueryParser and the boosting.

This is the sequence:

  1. Add the rust code to wrap QueryParser
  2. Add a python tests file with a test to create an instance of a QueryParser (from Python), and try to call methods on it
  3. Run the python tests with something like $ nox -s test-3.11 to run using Python 3.11, if that's the version of your virtualenv.

And then you basically keep repeating that cycle, fixing bugs, adding more features, and testing them in the python test.

@safwansamsudeen
Author

Hi @cjrh,

Thank you for your detailed and kind reply. I think I might have sounded a little angry - not at all, thank you for your generous work. We're all in the same boat, I realize that it's hard to work on OSS ;).

Yeah, I think I'll give it a stab.

BTW, how do I remove a document with Tantivy Py? Is there a way to directly remove a document? It seems that writer.delete_documents kinda performs a search and just deletes it. If so, that's alright - could you explain how to use it? The help message isn't enough.
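
On the `delete_documents` question: as far as I understand the binding, `writer.delete_documents(field_name, field_value)` is term-based, so it marks every document containing that term in that field as deleted, and the deletion becomes visible after the next `commit()`. The usual way to remove one specific document is therefore to give each document a unique, untokenized identifier field and delete by that. The sketch below models only those semantics in pure Python; `ToyWriter` is a hypothetical stand-in and not tantivy-py code, it uses exact field equality instead of term matching, and returning a count is a choice made for this illustration.

```python
class ToyWriter:
    """Hypothetical in-memory stand-in for an index writer (not tantivy-py)."""

    def __init__(self) -> None:
        self.committed: list[dict[str, str]] = []  # what searchers would see
        self._pending: list[dict[str, str]] = []   # uncommitted working copy

    def add_document(self, doc: dict[str, str]) -> None:
        self._pending.append(doc)

    def delete_documents(self, field_name: str, field_value: str) -> int:
        """Remove every pending document whose `field_name` equals `field_value`.

        Returns the number of documents removed (illustrative only).
        """
        before = len(self._pending)
        self._pending = [d for d in self._pending if d.get(field_name) != field_value]
        return before - len(self._pending)

    def commit(self) -> None:
        """Publish the working copy; deletions only become visible here."""
        self.committed = list(self._pending)


writer = ToyWriter()
writer.add_document({"doc_id": "a1", "title": "First document"})
writer.add_document({"doc_id": "b2", "title": "Second document"})
writer.commit()

removed = writer.delete_documents("doc_id", "a1")  # delete by unique identifier
visible_before_commit = len(writer.committed)      # still 2: not committed yet
writer.commit()
visible_after_commit = len(writer.committed)       # now 1
```

Deleting by a non-unique field such as a shared tag would remove every document carrying it, which is why a dedicated identifier field is the safer handle.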

@cjrh
Collaborator

cjrh commented Feb 1, 2024

@safwansamsudeen Fortune smiles upon you, @adamreichold jumped in to add boosts for you in #202. Would you be able to test out the PR to check if it works for what you need? You will need to check out the PR branch and build a wheel. Then you can take that Python wheel file, install it into your own virtualenv, and try out the new features.

For example:

(venv) ~/Documents/repos/tantivy-py  ±field-boost-fuzzy|✔︎ [venv://h/c/D/r/t/v:3.10.6] 
$ maturin build --release
📦 Including license file "/home/caleb/Documents/repos/tantivy-py/LICENSE"
🍹 Building a mixed python/rust project
🔗 Found pyo3 bindings
🐍 Found CPython 3.10 at /home/caleb/Documents/repos/tantivy-py/venv/bin/python3
📡 Using build options bindings from pyproject.toml
   Compiling tantivy v0.21.0 (/home/caleb/Documents/repos/tantivy-py)
    Finished release [optimized] target(s) in 40.04s
📦 Built wheel for CPython 3.10 to /home/caleb/Documents/repos/tantivy-py/target/wheels/tantivy-0.21.0-cp310-cp310-manylinux_2_34_x86_64.whl

Produces this wheel (Python 3.10):
tantivy-0.21.0-cp310-cp310-manylinux_2_34_x86_64.zip

@safwansamsudeen
Author

WOOT! That is brilliant! Thank you so much, @adamreichold and @cjrh.

Plus, I should probably learn Rust, interesting language.

I'll test it tomorrow and let you know.

@cjrh cjrh closed this as completed in #202 Feb 5, 2024
cjrh pushed a commit to cjrh/tantivy-py that referenced this issue Sep 3, 2024
…oss#201)
