Skip to content
This repository was archived by the owner on May 18, 2023. It is now read-only.
This repository was archived by the owner on May 18, 2023. It is now read-only.

Similarities between documents and query may be >1 #3

@hrs

Description

@hrs

The README claims that similarities between documents and queries shouldn't be greater than 1. However:

table = tfidf.tfidf()
table.addDocument("foo", ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf", "hotel"])
table.addDocument("bar", ["alpha", "bravo", "charlie", "india", "juliet", "kilo"])
table.addDocument("baz", ["kilo", "lima", "mike", "november"])
print table.similarities (["alpha", "bravo", "charlie", "india"])

Yields [['foo', 0.5625], ['bar', 1.0416666666666665], ['baz', 0.0]]. Whoops!

This is happening because the query isn't being normalized. The ranking of results should still be correct, but it'd be better if we normalized it so we can make guarantees about the output.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions