Merge pull request #3615 from vespa-engine/vekterli/extend-glossary

Add glossary entries for (normalized) document frequency and estimated hit ratio
vespa-engine · Feb 5, 2025 · cdf06b1
2 parents c18e747 + 69aa611
Showing 1 changed file with 38 additions and 0 deletions.
diff --git a/en/glossary.html b/en/glossary.html
@@ -156,6 +156,27 @@
       Read more in <a href="documents.html">Documents</a>.
     </p>
   </li>
+  <li>
+    <p id="document-frequency-normalized"><strong>Document frequency (normalized)</strong></p>
+    <p>
+      The <em>document frequency</em> of a term captures how many documents in the corpus contain the term,
+      relative to the total number of documents.
+      For ranking purposes this value is always normalized so that it is in the range [0, 1].
+      For example, if a term occurs in 600 out of 1000 documents, its normalized document
+      frequency will be \(600/1000 = 0.6\).
+    </p>
+    <p>
+      From an information retrieval perspective, the normalized document frequency gives a measure
+      of how common (or rare) a term is. Query terms that occur rarely (and thus have a low document
+      frequency) are usually expected to be more <em>relevant</em> to the query, since they are
+      more specific. At the other end, very common terms (with a high document frequency) are often
+      considered to be "stopwords" (such as "the", "an", etc.) and are expected to contribute
+      little to query relevance. This is directly related to
+      <a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Inverse_document_frequency">inverse document frequency</a>,
+      which is used by classic text ranking algorithms such as <a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf">tf-idf</a>
+      and <a href="reference/bm25.html">BM25</a>.
+    </p>
+  </li>
   <li>
     <p id="document-summary"><strong>Document summary</strong></p>
     <p>
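Illustration for the "Document frequency (normalized)" entry in the hunk above (not part of the commit): a minimal Python sketch of the arithmetic the entry describes. The function names are hypothetical, and the `-log(df)` form is the textbook inverse document frequency, shown only as an assumption about how a ranker might turn rarity into weight. It is not Vespa's BM25 formula, which uses a smoothed variant.

```python
import math


def normalized_document_frequency(docs_with_term: int, total_docs: int) -> float:
    """Fraction of documents in the corpus that contain the term, in [0, 1]."""
    if total_docs <= 0:
        raise ValueError("total_docs must be positive")
    if not 0 <= docs_with_term <= total_docs:
        raise ValueError("docs_with_term must be between 0 and total_docs")
    return docs_with_term / total_docs


def textbook_idf(normalized_df: float) -> float:
    """Textbook idf, log(N / n), written in terms of the normalized frequency n / N.

    Illustrative assumption only: real rankers typically use smoothed variants.
    """
    if not 0 < normalized_df <= 1:
        raise ValueError("normalized_df must be in (0, 1]")
    return -math.log(normalized_df)


common = normalized_document_frequency(600, 1000)  # the example from the entry
rare = normalized_document_frequency(1, 1000)
```

Under these assumptions the rare term gets a larger weight than the common one, which matches the entry's point that rare query terms are expected to contribute more to relevance.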
@@ -207,6 +228,23 @@
       and Vespa provides <a href="embedding.html">built-in support</a> for this.
     </p>
   </li>
+  <li>
+    <p id="estimated-hit-ratio"><strong>Estimated hit ratio</strong></p>
+    <p>
+      When Vespa plans how a query should be evaluated in the most efficient way
+      possible, one of the most important pieces of information is how many <em>hits</em>
+      different parts of the query will produce. The estimated hit ratio is a normalized
+      number in the range [0, 1] that states the proportion of documents that is expected
+      to match a given part of the query.
+    </p>
+    <p>
+      For example, a query with an <code>AND</code> operator over multiple terms benefits
+      from having the query planner place the term with the <em>lowest</em> estimated hit
+      ratio <em>first</em> in the AND's evaluation order. That term is the cheapest to
+      evaluate (it has the fewest candidate documents to iterate over), and any document it
+      does not match cannot match the AND, so the other terms need not be checked for it.
+    </p>
+  </li>
   <li>
     <p id="federation"><strong>Federation</strong></p>
     <p>
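Illustration for the "Estimated hit ratio" entry in the hunk above (not part of the commit): a minimal Python sketch of ordering an AND by estimated hit ratio. The `Term` class, `plan_and_order`, and `evaluate_and` are hypothetical names, and posting lists are modelled as plain sets of document ids. This is a toy model of the idea in the entry, not how Vespa's query planner is implemented.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class Term:
    name: str
    estimated_hit_ratio: float      # expected fraction of documents matching, in [0, 1]
    matching_docs: FrozenSet[int]   # toy posting list: ids of documents that match


def plan_and_order(terms: List[Term]) -> List[Term]:
    """Order AND children so the term with the lowest estimated hit ratio comes first."""
    return sorted(terms, key=lambda t: t.estimated_hit_ratio)


def evaluate_and(ordered: List[Term]) -> Set[int]:
    """Iterate only over candidates of the first term and check the rest per candidate."""
    if not ordered:
        return set()
    first, *rest = ordered
    return {doc for doc in first.matching_docs
            if all(doc in t.matching_docs for t in rest)}


terms = [
    Term("the", 0.9, frozenset(range(0, 9))),
    Term("vespa", 0.2, frozenset({3, 7})),
    Term("search", 0.5, frozenset({1, 3, 5, 7, 8})),
]
ordered = plan_and_order(terms)
hits = evaluate_and(ordered)
```

With this ordering the loop visits only the two candidates of the rarest term, rather than the nine candidates of the most common one, and every document outside that candidate set is skipped without consulting the other terms.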