From 9805e79c55cd2401252c777c082494a859cb225b Mon Sep 17 00:00:00 2001 From: Faraz <58580514+farazkh80@users.noreply.github.com> Date: Sun, 25 Dec 2022 21:29:04 -0500 Subject: [PATCH 1/5] reproduced results for pygaggle/docs/experiments-msmarco-passage-subset.md --- docs/experiments-msmarco-passage-subset.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/experiments-msmarco-passage-subset.md b/docs/experiments-msmarco-passage-subset.md index 38dc4e8..af92950 100644 --- a/docs/experiments-msmarco-passage-subset.md +++ b/docs/experiments-msmarco-passage-subset.md @@ -182,4 +182,5 @@ If you were able to replicate these results, please submit a PR adding to the re + Results replicated by [@lingwei-gu](https://github.com/lingwei-gu) on 2022-01-05 (commit [`d671f62`](https://github.com/castorini/pygaggle/commit/d671f62e4a269b5d79068f25267edd6078e568b5)) (Tesla T4 on Colab) + Results replicated by [@jx3yang](https://github.com/jx3yang) on 2022-05-10 (commit[`a326d49`](https://github.com/castorini/pygaggle/commit/a326d4983db6f84e4c519efa9e2dec91f776268e)) (Tesla T4 on Colab) + Results replicated by [@alvind1](https://github.com/alvind1) on 2022-05-12 (commit[`9d859a1`](https://github.com/castorini/pygaggle/commit/9d859a16d38e1c4281ac3c0588a4fa00e9e39e9a)) (Tesla T4 on Colab) -+ Results replicated by [@aivan6842](https://github.com/aivan6842) on 2022-08-09 (commit[`f54ae53`](https://github.com/castorini/pygaggle/commit/f54ae53d6183c1b66444fa5a0542301e0d1090f5)) (GeForce RTX 3070) \ No newline at end of file ++ Results replicated by [@aivan6842](https://github.com/aivan6842) on 2022-08-09 (commit[`f54ae53`](https://github.com/castorini/pygaggle/commit/f54ae53d6183c1b66444fa5a0542301e0d1090f5)) (GeForce RTX 3070) ++ + Results replicated by [@farazkh80](https://github.com/farazkh80) on 2022-12-25 (commit[`c1eb3bb`](https://github.com/castorini/pygaggle/commit/c1eb3bb963e119118807fe9b132f926b8aa14d7d)) (Tesla T4 on Colab) From 
8dfd94f66662efb409c807d61e9466891b3ff066 Mon Sep 17 00:00:00 2001 From: faraz Date: Thu, 29 Dec 2022 04:38:27 +0000 Subject: [PATCH 2/5] added pre and post re-rank example visualization --- docs/experiments-msmarco-passage-subset.md | 173 ++++++++++++++++++++- 1 file changed, 170 insertions(+), 3 deletions(-) diff --git a/docs/experiments-msmarco-passage-subset.md b/docs/experiments-msmarco-passage-subset.md index af92950..6ae2a8b 100644 --- a/docs/experiments-msmarco-passage-subset.md +++ b/docs/experiments-msmarco-passage-subset.md @@ -4,7 +4,7 @@ This page contains instructions for running various neural reranking baselines o Note that there is also a separate [MS MARCO *document* ranking task](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-doc.md). Prior to running this, we suggest looking at our first-stage [BM25 ranking instructions](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage.md). -We rerank the BM25 run files that contain ~1000 passages per query using both monoBERT and monoT5. +We rerank the BM25 run files that contain ~1000 passages per query using both monoBERT and monoT5. monoBERT and monoT5 are pointwise rerankers. This means that each document is scored independently using either BERT or T5 respectively. Since it can take many hours to run these models on all of the 6980 queries from the MS MARCO dev set, we will instead use a subset of 105 queries randomly sampled from the dev set. @@ -46,6 +46,24 @@ Next, we extract the contents into `data`. unzip data/msmarco_ans_small.zip -d data ``` +We should have these files in `data/msmarco_ans_small/` +``` +ls data/msmarco_ans_small -1 +qrels.dev.small.tsv +queries.dev.small.tsv +run.dev.small.tsv +scores +``` + +Let's also download MS MARCO passage dataset to visualize the actual passages after re-ranking. 
+``` +mkdir collections/msmarco-passage + +wget https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz -P collections/msmarco-passage + +tar xvfz collections/msmarco-passage/collectionandqueries.tar.gz -C collections/msmarco-passage +``` + As a sanity check, we can evaluate the first-stage retrieved documents using the official MS MARCO evaluation script. ``` @@ -61,6 +79,48 @@ QueriesRanked: 105 ##################### ``` +
+What's going on here? + +If you peek inside the `data/msmarco_ans_small/run.dev.small.tsv` file +``` +head -5 data/msmarco_ans_small/run.dev.small.tsv +188714 2133570 1 +188714 4321742 2 +188714 4321745 3 +188714 8523352 4 +188714 3573129 5 +``` + +you will notice that the first column is the `qid` corresponding to a query from `data/msmarco_ans_small/queries.dev.small.tsv` and the second column is the `docid` of the retrieved result (i.e., the hit), and the third column is the rank position. That is, in a search interface, for `qid` 188714 `docid` 2133570 would be shown in the top position, `docid` 4321742 would be shown in the second position, etc. + +Now, let's see the actual query with `qid` 188714 +``` +grep 188714 data/msmarco_ans_small/queries.dev.small.tsv +188714 foods and supplements to lower blood sugar +``` + + +Let's see the passage text of the first hit by grepping `docid` 2133570 +``` +grep 2133570 collections/msmarco-passage/collection.tsv +2133570 A healthy diet is essential to reversing prediabetes. There are no foods, herbs, drinks, or supplements that lower blood sugar. Only medication and exercise can. But there are things you can eat and drink that are low on the glycemic index (GI). This means these foods won't raise your blood sugar and may help you avoid a blood sugar spike. +``` +Let's verify if `docid` 2133570 is actually a relevant hit to our query (`qid` 188714) by checking the `data/msmarco_ans_small/qrels.dev.small.tsv` generated by human annotators + +``` +grep 188714 collections/msmarco-passage/qrels.dev.small.tsv +188714 0 8003843 1 +188714 0 4321745 1 +188714 0 8003849 1 +``` + +Recall that in a `qrel` file, the first column is the `qid` of a certain query, the third is the `docid` of a passage, and the last column is whether or not the `docid` is a hit to the `qid`(`1` is a hit and `0` is not). 
In this case, notice that `docid` 2133570 does not appear in the third column of the passage hits for `qid` 188714, thus it is not a relevant passage that should be displayed to the user, especially at the top location! + +We will later see if re-ranking using MonoBert and MonoT5 has helped with improving our hit rankings. +
+
+ Let's download and extract the pre-built MS MARCO index into `indexes`: ``` @@ -102,6 +162,60 @@ In this case, assigning a batch size (using option `--batch-size`) which is smal The re-ranked run file `run.monobert.ans_small.dev.tsv` will also be available in the `runs` directory upon completion. +
+What's going on here? + +If you peek inside the generated `runs/run.monobert.ans_small.dev.tsv` +``` +head -5 runs/run.monobert.ans_small.dev.tsv +188714 4321745 1 +188714 6301923 2 +188714 6442308 3 +188714 1051360 4 +188714 4816868 5 +``` +you will notice that the first column is the `qid` corresponding to a query from `data/msmarco_ans_small/queries.dev.small.tsv` and the second column is the `docid` of the retrieved result (i.e., the hit), and the third column is the rank position. That is, in a search interface, for `qid` 188714 `docid` 4321745 would be shown in the top position, `docid` 6301923 would be shown in the second position, etc. + +Now, let's see the actual query with `qid` 188714 +``` +grep 188714 data/msmarco_ans_small/queries.dev.small.tsv +188714 foods and supplements to lower blood sugar +``` + +let's also see the passage text of the first hit by grepping `docid` 4321745 +``` +grep 4321745 collections/msmarco-passage/collection.tsv +4321745 Food And Supplements That Lower Blood Sugar Levels. Cinnamon: Researchers are finding that cinnamon reduces blood sugar levels naturally when taken daily. If you absolutely love cinnamon you can sprinkle the recommended six grams of cinnamon on your food throughout the day to achieve the desired effect. +``` +In this case, the passage seems relevant to the query. Let's now compare this passage with the top passage hit from the original `data/msmarco_ans_small/run.dev.small.tsv`run file. Grep the first passage hit for `qid` 188714 + +``` +grep 188714 data/msmarco_ans_small/run.dev.small.tsv | head -1 +188714 2133570 1 +``` + +Now, let's grep the passage with `docid` 2133570 +``` +grep 2133570 collections/msmarco-passage/collection.tsv +2133570 A healthy diet is essential to reversing prediabetes. There are no foods, herbs, drinks, or supplements that lower blood sugar. Only medication and exercise can. But there are things you can eat and drink that are low on the glycemic index (GI). 
This means these foods won't raise your blood sugar and may help you avoid a blood sugar spike. +``` + +Notice that the top hit(`docid` 4321745) from the MonoBert re-ranked run file seems more relevant to the query with `qid` 188714. Let's verify if `docid` 4321745 is actually a relevant hit to our query (`qid` 188714) by checking the `data/msmarco_ans_small/qrels.dev.small.tsv` generated by human annotators + +``` +grep 188714 collections/msmarco-passage/qrels.dev.small.tsv +188714 0 8003843 1 +188714 0 4321745 1 +188714 0 8003849 1 +``` + +Recall that in a `qrel` file, the first column is the `qid` of a certain query, the third is the `docid` of a passage, and the last column is whether or not the `docid` is a hit to the `qid`(`1` is a hit and `0` is not). In this case, notice that `docid` 4321745 does appear in the third column of the passage hits relevant to `qid` 188714, thus it is a relevant passage that should be displayed to the user, unlike `docid` 2133570 (the top hit from the original run file) which does not appear at all as a relevant passage to `qid` 188714. + + +Thus, re-ranking with MonoBert certainly improved the top hit results. +
+
+ We can use the official MS MARCO evaluation script to verify the MRR@10: ``` @@ -142,6 +256,60 @@ It is worth noting again that you might need to modify the batch size to best fi Upon completion, the re-ranked run file `run.monot5.ans_small.dev.tsv` will be available in the `runs` directory. +
+What's going on here? + +If you peek inside the generated `runs/run.monot5.ans_small.dev.tsv` +``` +head -5 runs/run.monot5.ans_small.dev.tsv +188714 4321745 1 +188714 1051360 2 +188714 6442308 3 +188714 5499899 4 +188714 1022485 5 +``` +you will notice that the first column is the `qid` corresponding to a query from `data/msmarco_ans_small/queries.dev.small.tsv` and the second column is the `docid` of the retrieved result (i.e., the hit), and the third column is the rank position. That is, in a search interface, for `qid` 188714 `docid` 4321745 would be shown in the top position, `docid` 1051360 would be shown in the second position, etc. + +Now, let's see the actual query with `qid` 188714 +``` +grep 188714 data/msmarco_ans_small/queries.dev.small.tsv +188714 foods and supplements to lower blood sugar +``` + +let's also see the passage text of the first hit by grepping `docid` 4321745 +``` +grep 4321745 collections/msmarco-passage/collection.tsv +4321745 Food And Supplements That Lower Blood Sugar Levels. Cinnamon: Researchers are finding that cinnamon reduces blood sugar levels naturally when taken daily. If you absolutely love cinnamon you can sprinkle the recommended six grams of cinnamon on your food throughout the day to achieve the desired effect. +``` +In this case, the passage seems relevant to the query. Let's now compare this passage with the top passage hit from the original `data/msmarco_ans_small/run.dev.small.tsv`run file. Grep the first passage hit for `qid` 188714 + +``` +grep 188714 data/msmarco_ans_small/run.dev.small.tsv | head -1 +188714 2133570 1 +``` + +Now, let's grep the passage with `docid` 2133570 +``` +grep 2133570 collections/msmarco-passage/collection.tsv +2133570 A healthy diet is essential to reversing prediabetes. There are no foods, herbs, drinks, or supplements that lower blood sugar. Only medication and exercise can. But there are things you can eat and drink that are low on the glycemic index (GI). 
This means these foods won't raise your blood sugar and may help you avoid a blood sugar spike. +``` + +Notice that the top hit(`docid` 4321745) from the MonoT5 re-ranked run file seems more relevant to the query with `qid` 188714. Let's verify if `docid` 4321745 is actually a relevant hit to our query (`qid` 188714) by checking the `data/msmarco_ans_small/qrels.dev.small.tsv` generated by human annotators + +``` +grep 188714 collections/msmarco-passage/qrels.dev.small.tsv +188714 0 8003843 1 +188714 0 4321745 1 +188714 0 8003849 1 +``` + +Recall that in a `qrel` file, the first column is the `qid` of a certain query, the third is the `docid` of a passage, and the last column is whether or not the `docid` is a hit to the `qid`(`1` is a hit and `0` is not). In this case, notice that `docid` 4321745 does appear in the third column of the passage hits relevant to `qid` 188714, thus it is a relevant passage that should be displayed to the user, unlike `docid` 2133570 (the top hit from the original run file) which does not appear at all as a relevant passage to `qid` 188714. + + +Thus, re-ranking with MonoT5 certainly improved the top hit results. +
+
+ We can use the official MS MARCO evaluation script to verify the MRR@10: ``` @@ -182,5 +350,4 @@ If you were able to replicate these results, please submit a PR adding to the re + Results replicated by [@lingwei-gu](https://github.com/lingwei-gu) on 2022-01-05 (commit [`d671f62`](https://github.com/castorini/pygaggle/commit/d671f62e4a269b5d79068f25267edd6078e568b5)) (Tesla T4 on Colab) + Results replicated by [@jx3yang](https://github.com/jx3yang) on 2022-05-10 (commit[`a326d49`](https://github.com/castorini/pygaggle/commit/a326d4983db6f84e4c519efa9e2dec91f776268e)) (Tesla T4 on Colab) + Results replicated by [@alvind1](https://github.com/alvind1) on 2022-05-12 (commit[`9d859a1`](https://github.com/castorini/pygaggle/commit/9d859a16d38e1c4281ac3c0588a4fa00e9e39e9a)) (Tesla T4 on Colab) -+ Results replicated by [@aivan6842](https://github.com/aivan6842) on 2022-08-09 (commit[`f54ae53`](https://github.com/castorini/pygaggle/commit/f54ae53d6183c1b66444fa5a0542301e0d1090f5)) (GeForce RTX 3070) -+ + Results replicated by [@farazkh80](https://github.com/farazkh80) on 2022-12-25 (commit[`c1eb3bb`](https://github.com/castorini/pygaggle/commit/c1eb3bb963e119118807fe9b132f926b8aa14d7d)) (Tesla T4 on Colab) ++ Results replicated by [@aivan6842](https://github.com/aivan6842) on 2022-08-09 (commit[`f54ae53`](https://github.com/castorini/pygaggle/commit/f54ae53d6183c1b66444fa5a0542301e0d1090f5)) (GeForce RTX 3070) \ No newline at end of file From 0abab7b9712a6e82cdbf8a19dfb2badc52092915 Mon Sep 17 00:00:00 2001 From: faraz Date: Thu, 29 Dec 2022 06:30:00 +0000 Subject: [PATCH 3/5] fixed some spelling errors --- docs/experiments-msmarco-passage-subset.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/docs/experiments-msmarco-passage-subset.md b/docs/experiments-msmarco-passage-subset.md index 6ae2a8b..5033f56 100644 --- a/docs/experiments-msmarco-passage-subset.md +++ b/docs/experiments-msmarco-passage-subset.md @@ -101,7 +101,7 @@ grep 
188714 data/msmarco_ans_small/qrels.dev.small.tsv ``` -Let's see the passage text of the first hit by grepping `docid` 2133570 +Let's also see the passage text of the first hit by grepping `docid` 2133570 ``` grep 2133570 collections/msmarco-passage/collection.tsv 2133570 A healthy diet is essential to reversing prediabetes. There are no foods, herbs, drinks, or supplements that lower blood sugar. Only medication and exercise can. But there are things you can eat and drink that are low on the glycemic index (GI). This means these foods won't raise your blood sugar and may help you avoid a blood sugar spike. @@ -115,7 +115,7 @@ grep 188714 collections/msmarco-passage/qrels.dev.small.tsv 188714 0 8003849 1 ``` -Recall that in a `qrel` file, the first column is the `qid` of a certain query, the third is the `docid` of a passage, and the last column is whether or not the `docid` is a hit to the `qid`(`1` is a hit and `0` is not). In this case, notice that `docid` 2133570 does not appear in the third column of the passage hits for `qid` 188714, thus it is not a relevant passage that should be displayed to the user, especially at the top location! +Recall that in a `qrel` file, the first column is the `qid` of a certain query, the third is the `docid` of a passage, and the last column is whether or not the `docid` is a hit to the `qid` (`1` is a hit and `0` is not). In this case, notice that `docid` 2133570 does not appear in the third column of the passage hits for `qid` 188714, thus it is not a relevant passage that should be displayed to the user, especially at the top location! We will later see if re-ranking using MonoBert and MonoT5 has helped with improving our hit rankings. 
@@ -182,7 +182,7 @@ grep 188714 data/msmarco_ans_small/qrels.dev.small.tsv 188714 foods and supplements to lower blood sugar ``` -let's also see the passage text of the first hit by grepping `docid` 4321745 +Let's also see the passage text of the first hit by grepping `docid` 4321745 ``` grep 4321745 collections/msmarco-passage/collection.tsv 4321745 Food And Supplements That Lower Blood Sugar Levels. Cinnamon: Researchers are finding that cinnamon reduces blood sugar levels naturally when taken daily. If you absolutely love cinnamon you can sprinkle the recommended six grams of cinnamon on your food throughout the day to achieve the desired effect. @@ -200,7 +200,7 @@ grep 2133570 collections/msmarco-passage/collection.tsv 2133570 A healthy diet is essential to reversing prediabetes. There are no foods, herbs, drinks, or supplements that lower blood sugar. Only medication and exercise can. But there are things you can eat and drink that are low on the glycemic index (GI). This means these foods won't raise your blood sugar and may help you avoid a blood sugar spike. ``` -Notice that the top hit(`docid` 4321745) from the MonoBert re-ranked run file seems more relevant to the query with `qid` 188714. Let's verify if `docid` 4321745 is actually a relevant hit to our query (`qid` 188714) by checking the `data/msmarco_ans_small/qrels.dev.small.tsv` generated by human annotators +Notice that the top hit from the MonoBert re-ranked run file(`docid` 4321745) seems more relevant than the top hit from the original run file(`docid` 2133570) to the query with `qid` 188714. 
Let's verify if `docid` 4321745 is actually a relevant hit to our query (`qid` 188714) by checking the `data/msmarco_ans_small/qrels.dev.small.tsv` generated by human annotators ``` grep 188714 collections/msmarco-passage/qrels.dev.small.tsv @@ -209,7 +209,7 @@ grep 188714 collections/msmarco-passage/qrels.dev.small.tsv 188714 0 8003849 1 ``` -Recall that in a `qrel` file, the first column is the `qid` of a certain query, the third is the `docid` of a passage, and the last column is whether or not the `docid` is a hit to the `qid`(`1` is a hit and `0` is not). In this case, notice that `docid` 4321745 does appear in the third column of the passage hits relevant to `qid` 188714, thus it is a relevant passage that should be displayed to the user, unlike `docid` 2133570 (the top hit from the original run file) which does not appear at all as a relevant passage to `qid` 188714. +Recall that in a `qrel` file, the first column is the `qid` of a certain query, the third is the `docid` of a passage, and the last column is whether or not the `docid` is a hit to the `qid` (`1` is a hit and `0` is not). In this case, notice that `docid` 4321745 does appear in the third column of the passage hits relevant to `qid` 188714, thus it is a relevant passage that should be displayed to the user, unlike `docid` 2133570 (the top hit from the original run file) which does not appear at all as a relevant passage to `qid` 188714. Thus, re-ranking with MonoBert certainly improved the top hit results. @@ -276,12 +276,12 @@ grep 188714 data/msmarco_ans_small/qrels.dev.small.tsv 188714 foods and supplements to lower blood sugar ``` -let's also see the passage text of the first hit by grepping `docid` 4321745 +Let's also see the passage text of the first hit by grepping `docid` 4321745 ``` grep 4321745 collections/msmarco-passage/collection.tsv 4321745 Food And Supplements That Lower Blood Sugar Levels. 
Cinnamon: Researchers are finding that cinnamon reduces blood sugar levels naturally when taken daily. If you absolutely love cinnamon you can sprinkle the recommended six grams of cinnamon on your food throughout the day to achieve the desired effect. ``` -In this case, the passage seems relevant to the query. Let's now compare this passage with the top passage hit from the original `data/msmarco_ans_small/run.dev.small.tsv`run file. Grep the first passage hit for `qid` 188714 +In this case, the passage seems relevant to the query. Let's now compare this passage with the top passage hit from the original `data/msmarco_ans_small/run.dev.small.tsv`run file. Grep the top passage hit for `qid` 188714 ``` grep 188714 data/msmarco_ans_small/run.dev.small.tsv | head -1 @@ -294,7 +294,7 @@ grep 2133570 collections/msmarco-passage/collection.tsv 2133570 A healthy diet is essential to reversing prediabetes. There are no foods, herbs, drinks, or supplements that lower blood sugar. Only medication and exercise can. But there are things you can eat and drink that are low on the glycemic index (GI). This means these foods won't raise your blood sugar and may help you avoid a blood sugar spike. ``` -Notice that the top hit(`docid` 4321745) from the MonoT5 re-ranked run file seems more relevant to the query with `qid` 188714. Let's verify if `docid` 4321745 is actually a relevant hit to our query (`qid` 188714) by checking the `data/msmarco_ans_small/qrels.dev.small.tsv` generated by human annotators +Notice that the top hit from the MonoT5 re-ranked run file(`docid` 4321745) seems more relevant than the top hit from the original run file(`docid` 2133570) to the query with `qid` 188714. 
Let's verify if `docid` 4321745 is actually a relevant hit to our query (`qid` 188714) by checking the `data/msmarco_ans_small/qrels.dev.small.tsv` generated by human annotators ``` grep 188714 collections/msmarco-passage/qrels.dev.small.tsv From eda23ea9eb59c23d5497e291602873e727f7bff8 Mon Sep 17 00:00:00 2001 From: faraz Date: Thu, 29 Dec 2022 06:42:21 +0000 Subject: [PATCH 4/5] formatted --- docs/experiments-msmarco-passage-subset.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/docs/experiments-msmarco-passage-subset.md b/docs/experiments-msmarco-passage-subset.md index 5033f56..a40ece8 100644 --- a/docs/experiments-msmarco-passage-subset.md +++ b/docs/experiments-msmarco-passage-subset.md @@ -92,7 +92,7 @@ head -5 data/msmarco_ans_small/run.dev.small.tsv 188714 3573129 5 ``` -you will notice that the first column is the `qid` corresponding to a query from `data/msmarco_ans_small/queries.dev.small.tsv` and the second column is the `docid` of the retrieved result (i.e., the hit), and the third column is the rank position. That is, in a search interface, for `qid` 188714 `docid` 2133570 would be shown in the top position, `docid` 4321742 would be shown in the second position, etc. +You will notice that the first column is the `qid` corresponding to a query from `data/msmarco_ans_small/queries.dev.small.tsv` and the second column is the `docid` of the retrieved result (i.e., the hit), and the third column is the rank position. That is, in a search interface, for `qid` 188714 `docid` 2133570 would be shown in the top position, `docid` 4321742 would be shown in the second position, etc. 
Now, let's see the actual query with `qid` 188714 ``` @@ -174,7 +174,7 @@ head -5 runs/run.monobert.ans_small.dev.tsv 188714 1051360 4 188714 4816868 5 ``` -you will notice that the first column is the `qid` corresponding to a query from `data/msmarco_ans_small/queries.dev.small.tsv` and the second column is the `docid` of the retrieved result (i.e., the hit), and the third column is the rank position. That is, in a search interface, for `qid` 188714 `docid` 4321745 would be shown in the top position, `docid` 6301923 would be shown in the second position, etc. +You will notice that the first column is the `qid` corresponding to a query from `data/msmarco_ans_small/queries.dev.small.tsv` and the second column is the `docid` of the retrieved result (i.e., the hit), and the third column is the rank position. That is, in a search interface, for `qid` 188714 `docid` 4321745 would be shown in the top position, `docid` 6301923 would be shown in the second position, etc. Now, let's see the actual query with `qid` 188714 ``` @@ -187,8 +187,8 @@ Let's also see the passage text of the first hit by grepping `docid` 4321745 grep 4321745 collections/msmarco-passage/collection.tsv 4321745 Food And Supplements That Lower Blood Sugar Levels. Cinnamon: Researchers are finding that cinnamon reduces blood sugar levels naturally when taken daily. If you absolutely love cinnamon you can sprinkle the recommended six grams of cinnamon on your food throughout the day to achieve the desired effect. ``` -In this case, the passage seems relevant to the query. Let's now compare this passage with the top passage hit from the original `data/msmarco_ans_small/run.dev.small.tsv`run file. Grep the first passage hit for `qid` 188714 +In this case, the passage seems relevant to the query. Let's now compare this passage with the top passage hit from the original `data/msmarco_ans_small/run.dev.small.tsv`run file. 
Grep the first passage hit for `qid` 188714 ``` grep 188714 data/msmarco_ans_small/run.dev.small.tsv | head -1 188714 2133570 1 @@ -200,8 +200,9 @@ grep 2133570 collections/msmarco-passage/collection.tsv 2133570 A healthy diet is essential to reversing prediabetes. There are no foods, herbs, drinks, or supplements that lower blood sugar. Only medication and exercise can. But there are things you can eat and drink that are low on the glycemic index (GI). This means these foods won't raise your blood sugar and may help you avoid a blood sugar spike. ``` -Notice that the top hit from the MonoBert re-ranked run file(`docid` 4321745) seems more relevant than the top hit from the original run file(`docid` 2133570) to the query with `qid` 188714. Let's verify if `docid` 4321745 is actually a relevant hit to our query (`qid` 188714) by checking the `data/msmarco_ans_small/qrels.dev.small.tsv` generated by human annotators +Notice that the top hit from the MonoBert re-ranked run file(`docid` 4321745) seems more relevant than the top hit from the original run file(`docid` 2133570) to the query with `qid` 188714. +Let's verify if `docid` 4321745 is actually a relevant hit to our query (`qid` 188714) by checking the `data/msmarco_ans_small/qrels.dev.small.tsv` generated by human annotators ``` grep 188714 collections/msmarco-passage/qrels.dev.small.tsv 188714 0 8003843 1 @@ -211,7 +212,6 @@ grep 188714 collections/msmarco-passage/qrels.dev.small.tsv Recall that in a `qrel` file, the first column is the `qid` of a certain query, the third is the `docid` of a passage, and the last column is whether or not the `docid` is a hit to the `qid` (`1` is a hit and `0` is not). 
In this case, notice that `docid` 4321745 does appear in the third column of the passage hits relevant to `qid` 188714, thus it is a relevant passage that should be displayed to the user, unlike `docid` 2133570 (the top hit from the original run file) which does not appear at all as a relevant passage to `qid` 188714. - Thus, re-ranking with MonoBert certainly improved the top hit results.
@@ -268,7 +268,7 @@ head -5 runs/run.monot5.ans_small.dev.tsv 188714 5499899 4 188714 1022485 5 ``` -you will notice that the first column is the `qid` corresponding to a query from `data/msmarco_ans_small/queries.dev.small.tsv` and the second column is the `docid` of the retrieved result (i.e., the hit), and the third column is the rank position. That is, in a search interface, for `qid` 188714 `docid` 4321745 would be shown in the top position, `docid` 1051360 would be shown in the second position, etc. +You will notice that the first column is the `qid` corresponding to a query from `data/msmarco_ans_small/queries.dev.small.tsv` and the second column is the `docid` of the retrieved result (i.e., the hit), and the third column is the rank position. That is, in a search interface, for `qid` 188714 `docid` 4321745 would be shown in the top position, `docid` 1051360 would be shown in the second position, etc. Now, let's see the actual query with `qid` 188714 ``` @@ -281,8 +281,8 @@ Let's also see the passage text of the first hit by grepping `docid` 4321745 grep 4321745 collections/msmarco-passage/collection.tsv 4321745 Food And Supplements That Lower Blood Sugar Levels. Cinnamon: Researchers are finding that cinnamon reduces blood sugar levels naturally when taken daily. If you absolutely love cinnamon you can sprinkle the recommended six grams of cinnamon on your food throughout the day to achieve the desired effect. ``` -In this case, the passage seems relevant to the query. Let's now compare this passage with the top passage hit from the original `data/msmarco_ans_small/run.dev.small.tsv`run file. Grep the top passage hit for `qid` 188714 +In this case, the passage seems relevant to the query. Let's now compare this passage with the top passage hit from the original `data/msmarco_ans_small/run.dev.small.tsv`run file. 
Grep the top passage hit for `qid` 188714 ``` grep 188714 data/msmarco_ans_small/run.dev.small.tsv | head -1 188714 2133570 1 @@ -294,8 +294,9 @@ grep 2133570 collections/msmarco-passage/collection.tsv 2133570 A healthy diet is essential to reversing prediabetes. There are no foods, herbs, drinks, or supplements that lower blood sugar. Only medication and exercise can. But there are things you can eat and drink that are low on the glycemic index (GI). This means these foods won't raise your blood sugar and may help you avoid a blood sugar spike. ``` -Notice that the top hit from the MonoT5 re-ranked run file(`docid` 4321745) seems more relevant than the top hit from the original run file(`docid` 2133570) to the query with `qid` 188714. Let's verify if `docid` 4321745 is actually a relevant hit to our query (`qid` 188714) by checking the `data/msmarco_ans_small/qrels.dev.small.tsv` generated by human annotators +Notice that the top hit from the MonoT5 re-ranked run file(`docid` 4321745) seems more relevant than the top hit from the original run file(`docid` 2133570) to the query with `qid` 188714. +Let's verify if `docid` 4321745 is actually a relevant hit to our query (`qid` 188714) by checking the `data/msmarco_ans_small/qrels.dev.small.tsv` generated by human annotators ``` grep 188714 collections/msmarco-passage/qrels.dev.small.tsv 188714 0 8003843 1 @@ -305,7 +306,6 @@ grep 188714 collections/msmarco-passage/qrels.dev.small.tsv Recall that in a `qrel` file, the first column is the `qid` of a certain query, the third is the `docid` of a passage, and the last column is whether or not the `docid` is a hit to the `qid`(`1` is a hit and `0` is not). In this case, notice that `docid` 4321745 does appear in the third column of the passage hits relevant to `qid` 188714, thus it is a relevant passage that should be displayed to the user, unlike `docid` 2133570 (the top hit from the original run file) which does not appear at all as a relevant passage to `qid` 188714. 
- Thus, re-ranking with MonoT5 certainly improved the top hit results.
From f71e5f004c3860c74a4ea8318b2b2dd923fad274 Mon Sep 17 00:00:00 2001 From: faraz Date: Mon, 2 Jan 2023 23:39:42 +0000 Subject: [PATCH 5/5] added faiss instalation --- docs/experiments-msmarco-passage-subset.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/docs/experiments-msmarco-passage-subset.md b/docs/experiments-msmarco-passage-subset.md index a40ece8..5306e19 100644 --- a/docs/experiments-msmarco-passage-subset.md +++ b/docs/experiments-msmarco-passage-subset.md @@ -25,6 +25,12 @@ Then install PyGaggle using: pip install pygaggle/ ``` +Lastly install `faiss` using: + +``` +pip install faiss-cpu +``` + ## Models + monoBERT-Large: Passage Re-ranking with BERT [(Nogueira et al., 2019)](https://arxiv.org/pdf/1901.04085.pdf)
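The manual `grep` checks that this series adds to the doc (look up the top hit in a run file, then see whether its `docid` appears in the qrels) could also be scripted. The following is only a minimal sketch, not part of the patch or of PyGaggle: the function names `parse_run`, `parse_qrels` and `reciprocal_rank` are hypothetical, the rows are the five-line samples for `qid` 188714 quoted in the walkthrough, and the reciprocal rank is computed over those sample rows only, so it is not a substitute for the official MS MARCO evaluation script.

```python
from collections import defaultdict

# Sample rows copied from the walkthrough (columns: qid, docid, rank).
BM25_RUN = """\
188714 2133570 1
188714 4321742 2
188714 4321745 3
188714 8523352 4
188714 3573129 5
"""

MONOBERT_RUN = """\
188714 4321745 1
188714 6301923 2
188714 6442308 3
188714 1051360 4
188714 4816868 5
"""

# qrels rows from the walkthrough (columns: qid, unused, docid, relevance label).
QRELS = """\
188714 0 8003843 1
188714 0 4321745 1
188714 0 8003849 1
"""


def parse_run(text):
    """Return {qid: [docid, ...]} with docids ordered by rank (hypothetical helper)."""
    hits = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        qid, docid, rank = line.split()
        hits[qid].append((int(rank), docid))
    return {qid: [docid for _, docid in sorted(pairs)] for qid, pairs in hits.items()}


def parse_qrels(text):
    """Return {qid: set of docids judged relevant} (hypothetical helper)."""
    relevant = defaultdict(set)
    for line in text.splitlines():
        if not line.strip():
            continue
        qid, _, docid, label = line.split()
        if int(label) > 0:
            relevant[qid].add(docid)
    return relevant


def reciprocal_rank(ranking, relevant, k=10):
    """1/rank of the first relevant docid within the top k, or 0.0 if none."""
    for position, docid in enumerate(ranking[:k], start=1):
        if docid in relevant:
            return 1.0 / position
    return 0.0


if __name__ == "__main__":
    relevant = parse_qrels(QRELS)["188714"]
    for name, run in (("BM25", BM25_RUN), ("monoBERT", MONOBERT_RUN)):
        ranking = parse_run(run)["188714"]
        print(name, "top hit:", ranking[0],
              "relevant:", ranking[0] in relevant,
              "RR over sample rows: %.3f" % reciprocal_rank(ranking, relevant))
```

To run the same check on the real files, the sample strings could be replaced by the contents of `data/msmarco_ans_small/run.dev.small.tsv`, `runs/run.monobert.ans_small.dev.tsv` and the qrels file, assuming they follow the whitespace-separated column layouts shown in the walkthrough.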