Feature - compute doc vectors on the fly #1984

AileenLin · 2022-09-27T15:33:57Z

[RESOLVED] Results are consistent with the stored-vector version.

TODO: the stored-vector results need to be updated:

[OK] expected: 0.1926 actual: 0.1926 - metric: AP@1000 model: bm25-default topics: dev
[OK] expected: 0.1840 actual: 0.1840 - metric: RR@10 model: bm25-default topics: dev
[OK] expected: 0.6578 actual: 0.6578 - metric: R@100 model: bm25-default topics: dev
[OK] expected: 0.8526 actual: 0.8526 - metric: R@1000 model: bm25-default topics: dev
[python] [FAIL] expected: 0.1661 actual: 0.1663 - metric: AP@1000 model: bm25-default+rm3 topics: dev
[python] [FAIL] expected: 0.1564 actual: 0.1566 - metric: RR@10 model: bm25-default+rm3 topics: dev
[python] [FAIL] expected: 0.6494 actual: 0.6538 - metric: R@100 model: bm25-default+rm3 topics: dev
[OK] expected: 0.8606 actual: 0.8606 - metric: R@1000 model: bm25-default+rm3 topics: dev
[python] [FAIL] expected: 0.1692 actual: 0.1690 - metric: AP@1000 model: bm25-default+rocchio topics: dev
[python] [FAIL] expected: 0.1597 actual: 0.1595 - metric: RR@10 model: bm25-default+rocchio topics: dev
[python] [FAIL] expected: 0.6552 actual: 0.6553 - metric: R@100 model: bm25-default+rocchio topics: dev
[OK] expected: 0.8620 actual: 0.8620 - metric: R@1000 model: bm25-default+rocchio topics: dev
[python] [FAIL] expected: 0.1677 actual: 0.1676 - metric: AP@1000 model: bm25-default+rocchio-neg topics: dev
[python] [FAIL] expected: 0.1578 actual: 0.1576 - metric: RR@10 model: bm25-default+rocchio-neg topics: dev
[python] [FAIL] expected: 0.6561 actual: 0.6559 - metric: R@100 model: bm25-default+rocchio-neg topics: dev
[python] [FAIL] expected: 0.8649 actual: 0.8652 - metric: R@1000 model: bm25-default+rocchio-neg topics: dev
[OK] expected: 0.1625 actual: 0.1625 - metric: AP@1000 model: bm25-default+ax topics: dev
[OK] expected: 0.1517 actual: 0.1517 - metric: RR@10 model: bm25-default+ax topics: dev
[OK] expected: 0.6556 actual: 0.6556 - metric: R@100 model: bm25-default+ax topics: dev
[OK] expected: 0.8747 actual: 0.8747 - metric: R@1000 model: bm25-default+ax topics: dev
[OK] expected: 0.1520 actual: 0.1520 - metric: AP@1000 model: bm25-default+prf topics: dev
[OK] expected: 0.1421 actual: 0.1421 - metric: RR@10 model: bm25-default+prf topics: dev
[OK] expected: 0.6535 actual: 0.6535 - metric: R@100 model: bm25-default+prf topics: dev
[OK] expected: 0.8537 actual: 0.8537 - metric: R@1000 model: bm25-default+prf topics: dev
[OK] expected: 0.1958 actual: 0.1958 - metric: AP@1000 model: bm25-tuned topics: dev
[OK] expected: 0.1875 actual: 0.1875 - metric: RR@10 model: bm25-tuned topics: dev
[OK] expected: 0.6701 actual: 0.6701 - metric: R@100 model: bm25-tuned topics: dev
[OK] expected: 0.8573 actual: 0.8573 - metric: R@1000 model: bm25-tuned topics: dev
[python] [FAIL] expected: 0.1762 actual: 0.1741 - metric: AP@1000 model: bm25-tuned+rm3 topics: dev
[python] [FAIL] expected: 0.1668 actual: 0.1646 - metric: RR@10 model: bm25-tuned+rm3 topics: dev
[python] [FAIL] expected: 0.6655 actual: 0.6674 - metric: R@100 model: bm25-tuned+rm3 topics: dev
[python] [FAIL] expected: 0.8687 actual: 0.8704 - metric: R@1000 model: bm25-tuned+rm3 topics: dev
[OK] expected: 0.1777 actual: 0.1777 - metric: AP@1000 model: bm25-tuned+rocchio topics: dev
[python] [FAIL] expected: 0.1685 actual: 0.1684 - metric: RR@10 model: bm25-tuned+rocchio topics: dev
[python] [FAIL] expected: 0.6702 actual: 0.6706 - metric: R@100 model: bm25-tuned+rocchio topics: dev
[OK] expected: 0.8726 actual: 0.8726 - metric: R@1000 model: bm25-tuned+rocchio topics: dev
[OK] expected: 0.1762 actual: 0.1762 - metric: AP@1000 model: bm25-tuned+rocchio-neg topics: dev
[OK] expected: 0.1669 actual: 0.1669 - metric: RR@10 model: bm25-tuned+rocchio-neg topics: dev
[python] [FAIL] expected: 0.6744 actual: 0.6748 - metric: R@100 model: bm25-tuned+rocchio-neg topics: dev
[python] [FAIL] expected: 0.8756 actual: 0.8757 - metric: R@1000 model: bm25-tuned+rocchio-neg topics: dev
[OK] expected: 0.1699 actual: 0.1699 - metric: AP@1000 model: bm25-tuned+ax topics: dev
[OK] expected: 0.1594 actual: 0.1594 - metric: RR@10 model: bm25-tuned+ax topics: dev
[OK] expected: 0.6721 actual: 0.6721 - metric: R@100 model: bm25-tuned+ax topics: dev
[OK] expected: 0.8809 actual: 0.8809 - metric: R@1000 model: bm25-tuned+ax topics: dev
[OK] expected: 0.1582 actual: 0.1582 - metric: AP@1000 model: bm25-tuned+prf topics: dev
[OK] expected: 0.1484 actual: 0.1484 - metric: RR@10 model: bm25-tuned+prf topics: dev
[OK] expected: 0.6589 actual: 0.6589 - metric: R@100 model: bm25-tuned+prf topics: dev
[OK] expected: 0.8561 actual: 0.8561 - metric: R@1000 model: bm25-tuned+prf topics: dev
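Judging from the lines above, a difference of 0.0001 between the expected and actual score is already reported as FAIL. To see how far each run has drifted, the delta can be recomputed from a single result line. The following is a minimal sketch that assumes the line format shown in this log; the class name `RegressionLine` and the method `delta` are hypothetical and are not part of Anserini or its regression script.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for reading the result lines in this log; not part of Anserini.
final class RegressionLine {
  // Matches lines such as:
  // [FAIL] expected: 0.6494 actual: 0.6538 - metric: R@100 model: bm25-default+rm3 topics: dev
  private static final Pattern LINE = Pattern.compile(
      "\\[(OK|FAIL)\\] expected: (\\d+\\.\\d+) actual: (\\d+\\.\\d+)"
          + " - metric: (\\S+) model: (\\S+) topics: (\\S+)");

  /** Returns (actual - expected), rounded to four decimal places. */
  static double delta(String line) {
    Matcher m = LINE.matcher(line);
    if (!m.find()) {
      throw new IllegalArgumentException("not a regression result line: " + line);
    }
    double expected = Double.parseDouble(m.group(2));
    double actual = Double.parseDouble(m.group(3));
    // Round to the four decimals that the log itself reports.
    return Math.round((actual - expected) * 10000.0) / 10000.0;
  }
}
```

The pattern uses `find()` rather than `matches()` so that a leading `[python]` logger prefix, present on some lines here, does not prevent a match.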


lintool · 2022-09-27T16:20:41Z

Hi @AileenLin - if you merge in the main trunk, the scores should be updated and should match?

Also, can you compare the speed of using -storeDocVectors against your new implementation?

lintool · 2022-09-30T00:12:03Z

src/main/java/io/anserini/rerank/lib/AxiomReranker.java

@@ -109,14 +111,14 @@
  private final long seed;
  private final String originalIndexPath;
  private final String externalIndexPath;  // Axiomatic reranking can opt to use
-                                           // external sources for searching the expansion


Can you undo these edits? I don't think they're supposed to be part of the PR?

It seems like you ran a linter... which is fine. We can fix these minor issues... but the comment alignment should be reverted.

codecov-commenter · 2022-10-05T14:23:11Z

Codecov Report

Base: 58.51% // Head: 58.83% // Increases project coverage by +0.32% 🎉

Coverage data is based on head (a050430) compared to base (b5ecc5a).
Patch coverage: 57.62% of modified lines in pull request are covered.

Additional details and impacted files

@@             Coverage Diff              @@
##             master    #1984      +/-   ##
============================================
+ Coverage     58.51%   58.83%   +0.32%     
- Complexity     1092     1129      +37     
============================================
  Files           187      187              
  Lines         10217    10780     +563     
  Branches       1413     1479      +66     
============================================
+ Hits           5978     6342     +364     
- Misses         3760     3948     +188     
- Partials        479      490      +11

Impacted Files	Coverage Δ
...ain/java/io/anserini/collection/CarCollection.java	`0.00% <0.00%> (ø)`
...ava/io/anserini/collection/DocumentCollection.java	`57.33% <ø> (ø)`
...in/java/io/anserini/collection/HtmlCollection.java	`51.85% <0.00%> (-6.49%)`	⬇️
...o/anserini/collection/TrialstreamerCollection.java	`0.00% <0.00%> (ø)`
.../java/io/anserini/collection/VectorCollection.java	`0.00% <0.00%> (ø)`
...ain/java/io/anserini/rerank/lib/AxiomReranker.java	`0.00% <0.00%> (ø)`
...n/java/io/anserini/rerank/lib/BM25PrfReranker.java	`0.00% <0.00%> (ø)`
...rini/rerank/lib/NewsBackgroundLinkingReranker.java	`0.00% <0.00%> (ø)`
...main/java/io/anserini/search/SearchCollection.java	`43.36% <12.50%> (-0.59%)`	⬇️
.../main/java/io/anserini/rerank/lib/Rm3Reranker.java	`43.93% <12.90%> (-9.84%)`	⬇️
... and 59 more


…i#1984) This means that we can perform pseudo-relevance feedback on an index that does not have docvectors stored.
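The idea of computing a document vector on the fly can be sketched as follows: instead of reading a stored term vector from the index, the raw document text is re-tokenized at query time and the term frequencies are counted. This is only an illustrative sketch, not the code merged in this PR. The class `OnTheFlyDocVector` and its method are hypothetical, and the lowercase-and-split tokenizer is a stand-in assumption for whatever analyzer the index was built with.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration; not the Anserini implementation.
final class OnTheFlyDocVector {
  /**
   * Builds a term -> frequency map from raw document text.
   * Assumption: lowercasing and splitting on non-alphanumeric characters
   * stands in for the real analyzer (no stemming or stopword removal here).
   */
  static Map<String, Integer> termFrequencies(String rawText) {
    Map<String, Integer> tf = new LinkedHashMap<>();
    for (String token : rawText.toLowerCase(Locale.ROOT).split("[^\\p{Alnum}]+")) {
      if (token.isEmpty()) {
        continue;  // split() can yield a leading empty string
      }
      tf.merge(token, 1, Integer::sum);
    }
    return tf;
  }
}
```

A feedback method such as RM3 or Rocchio could then consume this map in place of a stored docvector, at the cost of re-analyzing each feedback document at query time.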

AileenLin added 8 commits September 14, 2022 13:56

result reproduced (p40 server)

f6ebf21

Merge branch 'castorini:master' into master

ed0b7df

Merge remote-tracking branch 'origin/master'

e929123

add on the fly term vector calculations to analyzer utils, add term vector test

258746b

stash the work before adding "collection" param to the main

610567d

add collection param to searchArg, tested index with vec and without vec stored

8ff1d65

Merge branch 'castorini:master' into master

77bca71

Merge branch 'master' of https://github.com/AileenLin/anserini

d709706

Conflicts: src/main/java/io/anserini/rerank/lib/Rm3Reranker.java, src/main/java/io/anserini/rerank/lib/RocchioReranker.java

lintool reviewed Sep 30, 2022

revert unnecessary comment format

6140da0

AileenLin added 3 commits October 10, 2022 22:28

add raw -> content to collections

a050430

fix condition order, all regression tests passed

5bd940c

Merge branch 'castorini:master' into master

f70c665

lintool self-requested a review November 6, 2022 12:47

lintool approved these changes Nov 6, 2022

lintool merged commit 1273619 into castorini:master Nov 6, 2022

lintool mentioned this pull request May 23, 2023

Add ability to parse raw text into docvectors on-the-fly for impact indexes #2122

Closed
