If I query for "foo-bar", tantivy finds documents with just "foo", ignoring the "bar" part completely.
Here's complete code to reproduce it. It's almost literally the example from the docs.

    use tantivy::*;
    use tantivy::query::*;
    use tantivy::schema::*;
    use tantivy::collector::*;

    fn main() -> Result<()> {
        // ...
        // First we need to define a schema ...
        // `TEXT` means the field should be tokenized and indexed,
        // along with its term frequency and term positions.
        //
        // `STORED` means that the field will also be saved
        // in a compressed, row-oriented key-value store.
        // This store is useful to reconstruct the
        // documents that were selected during the search phase.
        let mut schema_builder = Schema::builder();
        let title = schema_builder.add_text_field("title", TEXT | STORED);
        let body = schema_builder.add_text_field("body", TEXT);
        let schema = schema_builder.build();

        // Indexing documents
        let index_path = "/tmp/testindex";
        let _ = std::fs::remove_dir_all(index_path);
        let _ = std::fs::create_dir_all(index_path);
        let index = Index::create_in_dir(index_path, schema.clone())?;

        // Here we use a buffer of 100MB that will be split
        // between indexing threads.
        let mut index_writer = index.writer(100_000_000)?;

        // Let's index a few documents!
        index_writer.add_document(doc!(
            title => "Web form (good result)",
            body => "x-www-form-urlencoded"));
        index_writer.add_document(doc!(
            title => "TV show (bad result)",
            body => "x-files"));
        index_writer.add_document(doc!(
            title => "Shows up if hyphens were spaces",
            body => "www"));

        // We need to call .commit() explicitly to force the
        // index_writer to finish processing the documents in the queue,
        // flush the current index to the disk, and advertise
        // the existence of new documents.
        index_writer.commit()?;

        // # Searching
        let reader = index.reader()?;
        let searcher = reader.searcher();
        let query_parser = QueryParser::for_index(&index, vec![title, body]);

        // QueryParser may fail if the query is not in the right
        // format. For user facing applications, this can be a problem.
        // A ticket has been opened regarding this problem.
        let query = query_parser.parse_query("x-www-form-urlencoded")?;

        // Perform search.
        // `top_docs` contains the 10 most relevant doc ids, sorted by decreasing scores...
        let top_docs: Vec<(Score, DocAddress)> =
            searcher.search(&query, &TopDocs::with_limit(10))?;

        for (_score, doc_address) in top_docs {
            // Retrieve the actual content of documents given its `doc_address`.
            let retrieved_doc = searcher.doc(doc_address)?;
            println!("{}", schema.to_json(&retrieved_doc));
        }

        Ok(())
    }

This affects lib.rs, where a search for lib.rs?x-www-form-urlencoded brings up all kinds of crates with "x" anywhere, rather than the specific one containing "x-www-form-urlencoded".
The problem here is in the query parser.
It is especially tricky because "-" has a special meaning in the query grammar (it excludes keywords).
I think it is ok to call it a bug, and we might want to address it.
Clients suffering from this can either roll their own query parser, or replace "-" with
whitespace before feeding the query to the query parser.
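
The second workaround could look roughly like the sketch below. It is only an illustration, not part of tantivy: `replace_inner_hyphens` is a hypothetical helper name, and it assumes that a "-" following a letter or digit is part of a word, while a "-" at the start of a word is the exclusion operator and should be left alone.

```rust
/// Hypothetical helper (not a tantivy API): turn hyphens that appear
/// inside a word into spaces, but keep a leading `-` so that the
/// query grammar's exclusion syntax (`foo -bar`) still works.
fn replace_inner_hyphens(query: &str) -> String {
    let mut out = String::with_capacity(query.len());
    // True while we are inside a word, including right after a
    // hyphen that was just replaced.
    let mut in_word = false;
    for c in query.chars() {
        if c == '-' && in_word {
            out.push(' ');
        } else {
            out.push(c);
            in_word = c.is_alphanumeric();
        }
    }
    out
}

fn main() {
    let cleaned = replace_inner_hyphens("x-www-form-urlencoded");
    // The cleaned string is what would be handed to the query parser,
    // e.g. `query_parser.parse_query(&cleaned)` in the reproduction above.
    println!("{}", cleaned);
}
```

Note that this changes what is being asked: the words become separate query terms, so in the reproduction above the document whose body is just "www" would presumably match as well. Wrapping the cleaned string in double quotes to request a phrase query may be closer to the intent, though that is untested here.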