Merge pull request #3615 from vespa-engine/vekterli/extend-glossary
Add glossary entries for (normalized) document frequency and estimated hit ratio
geirst authored Feb 5, 2025
2 parents c18e747 + 69aa611 commit cdf06b1
Showing 1 changed file with 38 additions and 0 deletions.
38 changes: 38 additions & 0 deletions en/glossary.html
@@ -156,6 +156,27 @@
Read more in <a href="documents.html">Documents</a>.
</p>
</li>
<li>
<p id="document-frequency-normalized"><strong>Document frequency (normalized)</strong></p>
<p>
The <em>document frequency</em> of a term is the number of documents in the corpus
that contain the term.
For ranking purposes this value is always normalized by the total number of documents
so that it is in the range [0, 1].
For example, if a term occurs in 600 out of 1000 documents, its normalized document
frequency will be \(600/1000 = 0.6\).
</p>
<p>
From an information retrieval perspective, the normalized document frequency gives a measure
of how common (or rare) a term is. Query terms that occur rarely (thus having a low document
frequency) are usually expected to be more <em>relevant</em> to the query, since they are
more specific. At the other end, very common terms (with high document frequency) are often
considered to be "stopwords" (such as "the" and "an"), and are expected to contribute
little to query relevance. This is directly related to
<a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Inverse_document_frequency">inverse document frequency</a>,
which is used by classic text ranking algorithms such as <a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf">tf-idf</a>
and <a href="reference/bm25.html">BM25</a>.
</p>
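<p>
As a rough illustration of the relationship described above, the following Python sketch
computes a normalized document frequency and a textbook inverse document frequency,
\(\log(N/n)\). The function names are hypothetical, and the formula is the classic
definition rather than the exact one used by any particular Vespa rank feature.
</p>

```python
import math


def normalized_document_frequency(docs_with_term: int, total_docs: int) -> float:
    """Fraction of documents that contain the term, in the range [0, 1]."""
    if total_docs <= 0:
        raise ValueError("total_docs must be positive")
    if not 0 <= docs_with_term <= total_docs:
        raise ValueError("docs_with_term must be between 0 and total_docs")
    return docs_with_term / total_docs


def inverse_document_frequency(normalized_df: float) -> float:
    """Textbook idf = log(N / n) = -log(n / N).

    Assumption: this is the classic definition, not a Vespa-specific formula.
    """
    if not 0 < normalized_df <= 1:
        raise ValueError("normalized_df must be in the range (0, 1]")
    return -math.log(normalized_df)


common = normalized_document_frequency(600, 1000)  # the example from the text
rare = normalized_document_frequency(1, 1000)

# A rare term gets a larger idf weight than a common one.
print(common, inverse_document_frequency(common))
print(rare, inverse_document_frequency(rare))
```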
</li>
<li>
<p id="document-summary"><strong>Document summary</strong></p>
<p>
@@ -207,6 +228,23 @@
and Vespa provides <a href="embedding.html">built-in support</a> for this.
</p>
</li>
<li>
<p id="estimated-hit-ratio"><strong>Estimated hit ratio</strong></p>
<p>
When Vespa plans how a query should be evaluated in the most efficient way
possible, one of the most important pieces of information is how many <em>hits</em>
different parts of the query will produce. The estimated hit ratio is a normalized
number in the range [0, 1] that states the proportion of documents that is expected
to match a given part of the query.
</p>
<p>
For example, a query with an <code>AND</code> operator over multiple terms benefits
from having the query planner place the term with the <em>lowest</em> estimated hit
ratio <em>first</em> in the AND's evaluation order. This is because that term is
the cheapest to evaluate (it has the fewest candidate documents to iterate over), and
a document can be rejected without evaluating the other terms if the first term
doesn't match it.
</p>
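<p>
The ordering idea can be sketched in a few lines of Python. This is a toy model, not
Vespa's query planner: the posting sets, the <code>evaluate_and</code> helper and the
term names are all hypothetical, and the hit ratio here is computed exactly from the
postings, whereas a real engine works from an estimate.
</p>

```python
def estimated_hit_ratio(posting: set[int], total_docs: int) -> float:
    """Proportion of documents expected to match a term, in the range [0, 1]."""
    return len(posting) / total_docs


def evaluate_and(postings: dict[str, set[int]], total_docs: int) -> tuple[list[str], list[int]]:
    """Evaluate an AND over terms, iterating over the most selective term first.

    Returns the chosen evaluation order and the matching document ids.
    """
    # Lowest estimated hit ratio first: fewest candidates to iterate over.
    order = sorted(postings, key=lambda t: estimated_hit_ratio(postings[t], total_docs))
    if not order:
        return [], []
    first, rest = order[0], order[1:]
    hits = [
        doc
        for doc in sorted(postings[first])
        # all() stops at the first term that does not match the candidate.
        if all(doc in postings[term] for term in rest)
    ]
    return order, hits


postings = {
    "the": set(range(1000)),          # hit ratio 1.0
    "query": set(range(0, 1000, 2)),  # hit ratio 0.5
    "vespa": {3, 42, 700},            # hit ratio 0.003
}
order, hits = evaluate_and(postings, total_docs=1000)
print(order)  # evaluation order, most selective term first
print(hits)   # documents matching all three terms
```

<p>
With this ordering only three candidate documents are checked against the other terms,
instead of the 1000 candidates that starting from "the" would require.
</p>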
</li>
<li>
<p id="federation"><strong>Federation</strong></p>
<p>
