whitespace at end of search string yields results; lack of it yields no results #3

Open
kshahkshah opened this issue Apr 2, 2017 · 4 comments


@kshahkshah

Very simply, when I search for:

"My Search String" and there is a perfect match in the corpus provided, I get no results.

However, with a single trailing space, "My Search String ", I do get results, though the top result is similar to the exact match rather than the exact match itself. I'm going to dive into this further. Note that the actual string in question is 10 characters long, and I have not been able to create a reduced example.

Elements from the actual corpus of products being searched look like:

My Search String
My Search Attribute1 String
My Search Attribute2 String
My Search Attribute3 String

Again, when searching for "My Search String " I get "My Search Attribute1 String" as my top result.

Also, again, I tried building a reduced example of this behaviour along these contrived lines but failed to do so, so it'll merit additional investigation.

Though perhaps you can suggest ways I can debug?

@kshahkshah
Author

Also, I'm happy to provide the actual corpus privately.

@brianhempel
Owner

This line of code is probably the issue: https://github.com/brianhempel/fuzzy_tools/blob/master/lib/fuzzy_tools/tf_idf_index.rb#L53

The first round of pruning for search only matches on not-low-value tokens. It's not a great heuristic. Not sure how to replace it.

@kshahkshah
Author

Ahh, yes, I suspected it might be because the phrase is so common that it gets removed as useless, but that doesn't model a master X and variants of X well.

Let me see if I can tune that parameter. I wonder how other implementations accomplish this as well.

Thanks again for getting back so quickly.

@brianhempel
Owner

Yeah, it's kind of like "stop words" in other implementations except that there's no hard-coded list of stop words. Instead, the lowest 1/16th of the tokens are not used for finding the initial candidate documents.
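
To make that concrete, here is a rough, self-contained sketch of this kind of pruning. It is not the fuzzy_tools code: the helper names (`tokenize`, `idf_weights`, `low_value_tokens`, `candidates`) are made up for illustration, it tokenizes on whole words, and it assumes "lowest" means lowest by an IDF-style weight. It only shows how a query made up entirely of very common tokens can end up with no candidate documents, even when an exact match exists. It does not try to reproduce the trailing-whitespace effect, which depends on the library's real tokenizer.

```ruby
require "set"

# Hypothetical helper (not the fuzzy_tools tokenizer): lowercase word tokens.
def tokenize(str)
  str.downcase.split
end

# IDF-style weight per token: tokens found in every document score 0,
# rarer tokens score higher.
def idf_weights(docs)
  doc_count = docs.size.to_f
  doc_freq = Hash.new(0)
  docs.each { |doc| tokenize(doc).uniq.each { |t| doc_freq[t] += 1 } }
  doc_freq.transform_values { |df| Math.log(doc_count / df) }
end

# Assumption: the lowest-weighted fraction of distinct tokens (at least one)
# is treated as "low value" and ignored during the first pruning round.
def low_value_tokens(weights, fraction = 1.0 / 16)
  cutoff = [(weights.size * fraction).ceil, 1].max
  weights.sort_by { |_, w| w }.first(cutoff).map(&:first).to_set
end

# First-round pruning: keep only documents that share at least one
# non-low-value token with the query.
def candidates(query, docs, low_value)
  useful = tokenize(query).reject { |t| low_value.include?(t) }
  docs.select { |doc| (tokenize(doc) & useful).any? }
end

# A "master" product plus 15 variants, so "widget" appears in every document.
colors = %w[Blue Red Green Black White Pink Gray Teal Gold Navy Lime Plum Rust Sand Jade]
docs = ["Widget"] + colors.map { |c| "Widget #{c}" }

low_value = low_value_tokens(idf_weights(docs))

p candidates("Widget", docs, low_value)      # => []
p candidates("Widget Blue", docs, low_value) # => ["Widget Blue"]
```

In this toy corpus the only token in the query "Widget" is also the most common token, so it is dropped before candidate lookup and the exact match is never considered. Falling back to all query tokens when none survive the filter would be one possible workaround, though that is a suggestion, not something the library is known to do.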
