Queries with hyphens search just for the part before the hyphen #609

Closed
kornelski opened this issue Jul 31, 2019 · 1 comment

@kornelski
Contributor

kornelski commented Jul 31, 2019

If I query for "foo-bar", tantivy finds documents with just "foo", ignoring the "bar" part completely.

Here's complete code to reproduce it. It's almost literally the example from the docs.

use tantivy::*;
use tantivy::query::*;
use tantivy::schema::*;
use tantivy::collector::*;

fn main() -> Result<()> {
// ...

// First we need to define a schema ...

// `TEXT` means the field should be tokenized and indexed,
// along with its term frequency and term positions.
//
// `STORED` means that the field will also be saved
// in a compressed, row-oriented key-value store.
// This store is useful to reconstruct the
// documents that were selected during the search phase.
let mut schema_builder = Schema::builder();
let title = schema_builder.add_text_field("title", TEXT | STORED);
let body = schema_builder.add_text_field("body", TEXT);
let schema = schema_builder.build();

// Indexing documents

let index_path = "/tmp/testindex";
let _ = std::fs::remove_dir_all(index_path);
let _ = std::fs::create_dir_all(index_path);
let index = Index::create_in_dir(index_path, schema.clone())?;

// Here we use a buffer of 100MB that will be split
// between indexing threads.
let mut index_writer = index.writer(100_000_000)?;

// Let's index a few documents.
index_writer.add_document(doc!(
    title => "Web form (good result)",
    body => "x-www-form-urlencoded"
));

index_writer.add_document(doc!(
    title => "TV show (bad result)",
    body => "x-files"
));

index_writer.add_document(doc!(
    title => "Shows up if hyphens were spaces",
    body => "www"
));

// We need to call .commit() explicitly to force the
// index_writer to finish processing the documents in the queue,
// flush the current index to the disk, and advertise
// the existence of new documents.
index_writer.commit()?;

// # Searching

let reader = index.reader()?;

let searcher = reader.searcher();

let query_parser = QueryParser::for_index(&index, vec![title, body]);

// QueryParser may fail if the query is not in the right
// format. For user facing applications, this can be a problem.
// A ticket has been opened regarding this problem.
let query = query_parser.parse_query("x-www-form-urlencoded")?;

// Perform search.
// `top_docs` contains the 10 most relevant doc ids, sorted by decreasing scores...
let top_docs: Vec<(Score, DocAddress)> =
    searcher.search(&query, &TopDocs::with_limit(10))?;

for (_score, doc_address) in top_docs {
    // Retrieve the actual content of documents given its `doc_address`.
    let retrieved_doc = searcher.doc(doc_address)?;
    println!("{}", schema.to_json(&retrieved_doc));
}
Ok(())
}

This affects lib.rs, where searching for lib.rs?x-www-form-urlencoded brings up all kinds of crates with "x" anywhere, rather than the specific one containing "x-www-form-urlencoded".

@fulmicoton
Collaborator

The standard tokenizer splits on hyphen.
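
To illustrate why the indexed body "x-files" ends up containing a standalone "x" term, here is a rough sketch of that splitting behaviour in plain Rust. It is an approximation written for this thread, not tantivy's actual tokenizer code; the function name `split_like_standard_tokenizer` is hypothetical, and the rule used (split on any non-alphanumeric character, then lowercase) is an assumption about what the default tokenizer does.

```rust
/// Hypothetical helper: approximates a tokenizer that splits on every
/// non-alphanumeric character (hyphens included) and lowercases tokens.
/// This is NOT tantivy's implementation, only an illustration.
fn split_like_standard_tokenizer(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        // Consecutive separators produce empty pieces; drop them.
        .filter(|piece| !piece.is_empty())
        .map(|piece| piece.to_lowercase())
        .collect()
}
```

Under this assumption, both "x-www-form-urlencoded" and "x-files" contribute the term "x" to the index, so a query that is reduced to just "x" matches both documents.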

The problem here is in the query parser.
It is especially tricky because "-" has a special meaning in the query grammar (it excludes keywords).
I think it is OK to call it a bug, and we might want to address it.

Clients suffering from this can either roll out their own query parser, or replace "-" with
whitespace before feeding the query to the query parser.
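
A minimal sketch of the second workaround, assuming the caller preprocesses the raw query string before handing it to `parse_query`. The helper name `replace_inner_hyphens` is hypothetical. As a variation on the suggestion above, it only rewrites hyphens that sit between two alphanumeric characters, so a leading "-" used for exclusion is left alone; whether that distinction matters depends on the application.

```rust
/// Hypothetical helper: replaces a hyphen with a space only when it sits
/// between two alphanumeric characters (e.g. "foo-bar" -> "foo bar").
/// A leading "-" such as in "-foo" is kept, since the query grammar
/// uses it to exclude keywords.
fn replace_inner_hyphens(raw_query: &str) -> String {
    let chars: Vec<char> = raw_query.chars().collect();
    let mut out = String::with_capacity(raw_query.len());
    for (i, &c) in chars.iter().enumerate() {
        let prev_is_word = i > 0 && chars[i - 1].is_alphanumeric();
        let next_is_word = chars
            .get(i + 1)
            .map_or(false, |next| next.is_alphanumeric());
        if c == '-' && prev_is_word && next_is_word {
            out.push(' ');
        } else {
            out.push(c);
        }
    }
    out
}
```

In the reproduction above, this would be used as `query_parser.parse_query(&replace_inner_hyphens("x-www-form-urlencoded"))?`. Note that the rewritten query is then treated as several separate terms, so the document whose body is just "www" can also match.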

fulmicoton added a commit that referenced this issue Aug 6, 2019