diff --git a/docs/experiments-msmarco-passage.md b/docs/experiments-msmarco-passage.md
index 8dde2a2ce0..7be1f6d9db 100644
--- a/docs/experiments-msmarco-passage.md
+++ b/docs/experiments-msmarco-passage.md
@@ -2,7 +2,21 @@
This page contains instructions for running BM25 baselines on the [MS MARCO *passage* ranking task](https://microsoft.github.io/msmarco/).
Note that there is a separate [MS MARCO *document* ranking task](experiments-msmarco-doc.md).
-We also have a [separate page](experiments-doc2query.md) describing document expansion experiments (Doc2query) for this task.
+We also have a [separate page](experiments-doc2query.md) describing document expansion experiments (doc2query) for this task.
+
+If you're a Waterloo undergraduate going through this guide as the [screening exercise](https://github.com/lintool/guide/blob/master/ura.md) for joining my research group, try to understand what you're actually doing, instead of simply [cargo culting](https://en.wikipedia.org/wiki/Cargo_cult_programming) (i.e., blindly copying and pasting commands into a shell).
+In particular, you'll want to pay attention to the "What's going on here?" sections.
+
+<details>
+<summary>What's going on here?</summary>
+
+As a really high-level summary: in the MS MARCO passage ranking task, you're given a bunch of passages to search and a bunch of queries.
+The system's task is to return the best passages for each query (i.e., passages that are relevant).
+
+Note that "the things you're searching" are called documents (in the generic sense), even though they're actually passages (extracted from web pages) in this case.
+You could be searching web pages, PDFs, Excel spreadsheets, and even podcasts.
+Information retrieval researchers refer to all of these as "documents".
+</details>

## Data Prep
@@ -22,6 +36,21 @@
tar xvfz collections/msmarco-passage/collectionandqueries.tar.gz -C collections/msmarco-passage

To confirm, `collectionandqueries.tar.gz` should have MD5 checksum of `31644046b18952c1386cd4564ba2ae69`.
+
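+If you want to check this programmatically, here's a minimal Python sketch (any `md5sum`-style command-line tool works just as well):
+
+```python
+import hashlib
+
+# Hash the tarball in chunks rather than reading the whole file into memory.
+md5 = hashlib.md5()
+with open('collections/msmarco-passage/collectionandqueries.tar.gz', 'rb') as f:
+    for chunk in iter(lambda: f.read(1 << 20), b''):
+        md5.update(chunk)
+
+print(md5.hexdigest())  # expect 31644046b18952c1386cd4564ba2ae69
+```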
+
+<details>
+<summary>What's going on here?</summary>
+
+If you peek inside the collection:
+
+```bash
+head collections/msmarco-passage/collection.tsv
+```
+
+You'll see that `collection.tsv` contains the passages that we're searching.
+Each line represents a passage:
+the first column contains a unique identifier for the passage (called the `docid`) and the second column contains the text of the passage itself.
+
+</details>
+

Next, we need to convert the MS MARCO tsv collection into Anserini's jsonl files (which have one json object per line):

```bash
@@ -55,6 +84,31 @@
python tools/scripts/msmarco/filter_queries.py \
```

The output queries file should contain 6980 lines.
+
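+Stepping back to the collection conversion above: the repo's conversion script takes care of additional details (like splitting the output into multiple files), but conceptually the tsv-to-jsonl step boils down to the sketch below (the output path is just for illustration, and the `id`/`contents` field names assume Anserini's standard jsonl document schema):
+
+```python
+import json
+
+# Turn each tab-separated (docid, passage) line into one JSON object per line.
+with open('collections/msmarco-passage/collection.tsv', encoding='utf-8') as tsv, \
+     open('collections/msmarco-passage/collection.jsonl', 'w', encoding='utf-8') as out:
+    for line in tsv:
+        docid, text = line.rstrip('\n').split('\t', 1)
+        out.write(json.dumps({'id': docid, 'contents': text}) + '\n')
+```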
+
+<details>
+<summary>What's going on here?</summary>
+
+Check out the contents of the queries file:
+
+```bash
+$ head collections/msmarco-passage/queries.dev.small.tsv
+1048585 what is paula deen's brother
+2 Androgen receptor define
+524332 treating tension headaches without medication
+1048642 what is paranoid sc
+524447 treatment of varicose veins in legs
+786674 what is prime rate in canada
+1048876 who plays young dr mallard on ncis
+1048917 what is operating system misconfiguration
+786786 what is priority pass
+524699 tricare service number
+```
+
+These are the queries we're going to feed to the search engine.
+The first column is a unique identifier for the query (called the `qid`) and the second column is the query itself.
+These queries are taken from Bing search logs, so they're "realistic" web queries in that they may be ambiguous, contain typos, etc.
+</details>
+

We can now perform a retrieval run using this smaller set of queries:

```bash
@@ -72,6 +126,50 @@
Retrieval speed will vary by machine:
On a modern desktop with an SSD, we can get ~0.07 s/query, so the run should finish in under ten minutes.
We can perform multi-threaded retrieval by changing the `-threads` argument.
+
+
+<details>
+<summary>What's going on here?</summary>
+
+Congratulations, you've performed your first retrieval run!
+
+You feed a search engine a bunch of queries, and the retrieval run is the output of the search engine.
+For each query, the search engine gives back a ranked list of results (i.e., a list of hits).
+
+Let's take a look:
+
+```bash
+$ head runs/run.msmarco-passage.dev.small.tsv
+1048585 7187158 1
+1048585 7187157 2
+1048585 7187163 3
+1048585 7546327 4
+1048585 7187160 5
+1048585 8227279 6
+1048585 7617404 7
+1048585 7187156 8
+1048585 2298838 9
+1048585 7187155 10
+```
+
+The first column is the `qid` (corresponding to the query).
+From above, we can see that `qid` 1048585 is the query "what is paula deen's brother".
+The second column is the `docid` of the retrieved result (i.e., the hit), and the third column is the rank position.
+That is, in a search interface, `docid` 7187158 would be shown in the top position, `docid` 7187157 would be shown in the second position, etc.
+
+You can grep through the collection to see what the actual passage is:
+
+```bash
+$ grep 7187158 collections/msmarco-passage/collection.tsv
+7187158 Paula Deen and her brother Earl W. Bubba Hiers are being sued by a former general manager at Uncle Bubba's… Paula Deen and her brother Earl W. Bubba Hiers are being sued by a former general manager at Uncle Bubba's†
+```
+
+In this case, the hit seems relevant.
+That is, it answers the query.
+So here, the search engine did well.
+
+Note that this particular passage is a bit dirty (garbage characters, dups, etc.)... but that's pretty much a fact of life when you're dealing with the web.
+
+</details>
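+Instead of eyeballing the run file and grepping the collection by hand, you can do the same join programmatically. Here's a rough sketch that prints the top-ranked passage for a given query, using only the files we've already downloaded and generated (loading the entire collection into memory takes a few GB of RAM):
+
+```python
+# Print the rank-1 passage for a qid by joining the run file with the queries and the collection.
+
+def load_tsv(path):
+    mapping = {}
+    with open(path, encoding='utf-8') as f:
+        for line in f:
+            key, value = line.rstrip('\n').split('\t', 1)
+            mapping[key] = value
+    return mapping
+
+queries = load_tsv('collections/msmarco-passage/queries.dev.small.tsv')
+passages = load_tsv('collections/msmarco-passage/collection.tsv')
+
+qid = '1048585'  # "what is paula deen's brother"
+with open('runs/run.msmarco-passage.dev.small.tsv', encoding='utf-8') as run:
+    for line in run:
+        run_qid, docid, rank = line.split()
+        if run_qid == qid and rank == '1':
+            print(f'query:   {queries[qid]}')
+            print(f'top hit: {docid}  {passages[docid]}')
+            break
+```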
+

Finally, we can evaluate the retrieved documents using the official MS MARCO evaluation script:

```bash
@@ -88,6 +186,48 @@
QueriesRanked: 6980
#####################
```
+
+
+<details>
+<summary>What's going on here?</summary>
+
+So how do we know if a search engine is any good?
+One method is manual examination, which is what we did above.
+That is, we actually looked at the results by hand.
+
+Obviously, this isn't scalable if we want to evaluate lots of queries...
+If only someone told us which documents were relevant to which queries...
+
+Well, someone has! (Specifically, human editors hired by Microsoft Bing in this case.)
+These are captured in what are known as relevance judgments.
+Take a look:
+
+```bash
+$ grep 1048585 collections/msmarco-passage/qrels.dev.tsv
+1048585 0 7187158 1
+```
+
+This says that `docid` 7187158 is relevant to `qid` 1048585, which confirms our intuition above.
+The file is in what is known as the qrels format.
+You can ignore the second column.
+The "1" in the fourth column says that the `docid` is relevant.
+In some cases (though not here), that column might say "0", i.e., that the `docid` is _not_ relevant.
+
+With relevance judgments (qrels), we can now automatically evaluate the search engine output (i.e., the run).
+The final ingredient we need is a metric (i.e., how to compute a score).
+
+Here, we're using a metric called MRR, or mean reciprocal rank.
+The idea is quite simple:
+We look at where the relevant `docid` appears.
+If it appears at rank 1, the system gets a score of one.
+If it appears at rank 2, the system gets a score of 1/2.
+If it appears at rank 3, the system gets a score of 1/3.
+And so on.
+MRR@10 means that we only go down to rank 10.
+If the relevant `docid` doesn't appear in the top 10, then the system gets a score of zero.
+
+That's the score for a single query.
+We take the average of the scores across all queries (6980 in this case), and we arrive at the score for the entire run.
+</details>
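+If you want to convince yourself that this is all the number means, here's a sketch that computes MRR@10 directly from the qrels file and the run file (treat it as an illustration, not a replacement for the official evaluation script):
+
+```python
+from collections import defaultdict
+
+# qid -> set of relevant docids, read from the qrels
+relevant = defaultdict(set)
+with open('collections/msmarco-passage/qrels.dev.tsv', encoding='utf-8') as f:
+    for line in f:
+        qid, _, docid, rel = line.split()
+        if int(rel) > 0:
+            relevant[qid].add(docid)
+
+# qid -> reciprocal rank of the best-ranked relevant docid in the top 10
+rr = defaultdict(float)
+qids = set()
+with open('runs/run.msmarco-passage.dev.small.tsv', encoding='utf-8') as f:
+    for line in f:
+        qid, docid, rank = line.split()
+        qids.add(qid)
+        if int(rank) <= 10 and docid in relevant[qid]:
+            rr[qid] = max(rr[qid], 1.0 / int(rank))
+
+# Queries with no relevant docid in the top 10 contribute zero to the average.
+print(f'MRR@10: {sum(rr.values()) / len(qids):.7f}')
+```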
+

You can find this entry on the [MS MARCO Passage Ranking Leaderboard](https://microsoft.github.io/msmarco/) as entry "BM25 (Lucene8, tuned)", so you've just reproduced (part of) a leaderboard submission!

We can also use the official TREC evaluation tool, `trec_eval`, to compute metrics other than MRR@10.
@@ -119,6 +259,18 @@
recall_1000 all 0.8573

Average precision and recall@1000 are the two metrics we care about the most.
+
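+In the same spirit as the MRR@10 sketch above, recall@1000 simply asks: of all the relevant passages for a query, what fraction shows up anywhere in the top 1000 hits? Roughly (defer to `trec_eval` for the official numbers):
+
+```python
+from collections import defaultdict
+
+relevant = defaultdict(set)   # qid -> relevant docids (from the qrels)
+retrieved = defaultdict(set)  # qid -> docids retrieved in the top 1000
+
+with open('collections/msmarco-passage/qrels.dev.tsv', encoding='utf-8') as f:
+    for line in f:
+        qid, _, docid, rel = line.split()
+        if int(rel) > 0:
+            relevant[qid].add(docid)
+
+with open('runs/run.msmarco-passage.dev.small.tsv', encoding='utf-8') as f:
+    for line in f:
+        qid, docid, rank = line.split()
+        if int(rank) <= 1000:
+            retrieved[qid].add(docid)
+
+# Average the per-query recall over every query in the run that has judgments.
+scores = []
+for qid, docs in retrieved.items():
+    if relevant[qid]:
+        scores.append(len(relevant[qid] & docs) / len(relevant[qid]))
+print(f'recall@1000: {sum(scores) / len(scores):.4f}')
+```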
+
+<details>
+<summary>What's going on here?</summary>
+
+Don't worry so much about the details here for now.
+The tl;dr is that there are different formats for run files and lots of different metrics you can compute.
+`trec_eval` is a standard tool used by information retrieval researchers.
+
+In fact, researchers have been trying to answer the question "how do we know if a search result is good and how do we measure it" for over half a century...
+and the question still has not been fully resolved.
+In short, it's complicated.
+</details>
+

## BM25 Tuning

Note that this figure differs slightly from the value reported in [Document Expansion by Query Prediction](https://arxiv.org/abs/1904.08375), which uses the Anserini (system-wide) default of `k1=0.9`, `b=0.4`.
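+As a reminder of what `k1` and `b` actually control, here's the classic BM25 scoring function in sketch form (Lucene's implementation differs in some small details, so don't expect this to reproduce Anserini's scores exactly):
+
+```python
+import math
+
+def bm25_term_score(tf, df, doc_len, avg_doc_len, num_docs, k1=0.9, b=0.4):
+    """Contribution of a single query term to a document's BM25 score.
+
+    tf: term frequency in the document; df: number of documents containing the term;
+    doc_len, avg_doc_len: document length and average document length (in tokens);
+    num_docs: number of documents in the collection.
+    """
+    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
+    # b controls how aggressively scores are normalized by document length;
+    # k1 controls how quickly repeated occurrences of a term stop adding score.
+    length_norm = 1 - b + b * (doc_len / avg_doc_len)
+    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)
+
+# A document's score for a query is the sum of this quantity over the query's terms.
+```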