add option to output external document ids in ExtractDocumentLengths #1283

Merged 8 commits on Aug 17, 2020

nsndimt (Contributor) commented Jun 16, 2020

The current implementation outputs only the internal document ids. To get the external document ids, users have to look up each id one by one with Pyserini. Adding an option to output them directly saves users that trouble.

codecov bot commented Jun 16, 2020

Codecov Report

Merging #1283 into master will decrease coverage by 0.16%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##             master    #1283      +/-   ##
============================================
- Coverage     51.86%   51.69%   -0.17%     
+ Complexity      810      803       -7     
============================================
  Files           154      154              
  Lines          8625     8627       +2     
  Branches       1224     1224              
============================================
- Hits           4473     4460      -13     
- Misses         3781     3796      +15     
  Partials        371      371              
Impacted Files                                         Coverage Δ                 Complexity Δ
.../java/io/anserini/util/ExtractDocumentLengths.java  97.61% <100.00%> (+0.11%)  3.00 <0.00> (ø)
...java/io/anserini/ltr/feature/CountBigramPairs.java  70.12% <0.00%> (-19.49%)   26.00% <0.00%> (-7.00%)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ac266de...4757427.

lintool (Member) commented Jun 17, 2020

I think it's fine if we output the external docids by default also... so no need for an option.

Can you improve the test case accordingly also? Thanks!

@@ -90,7 +92,8 @@ public static void main(String[] args) throws Exception {
   // See https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java
   int lossyDoclength = SmallFloat.byte4ToInt(SmallFloat.intToByte4((int) exactDoclength));
   int lossyTermCount = SmallFloat.byte4ToInt(SmallFloat.intToByte4((int) exactTermCount));
-  out.println(String.format("%d\t%d\t%d\t%d\t%d", i, exactDoclength, exactTermCount, lossyDoclength, lossyTermCount));
+  out.println(String.format("%s\t%d\t%d\t%d\t%d", IndexReaderUtils.convertLuceneDocidToDocid(reader, i),
Member

So sorry about the delayed response... can we keep both the internal id and the collection docid?

Contributor Author

finished
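
For readers following the thread, the request to keep both ids amounts to widening each tab-separated row by one column. Below is a minimal, self-contained sketch of that idea; it is not the merged code. The class `DocLengthRows`, the helpers `lookupCollectionDocid` and `formatRow`, the column order, and the sample ids and lengths are all assumptions made for illustration. The array lookup merely stands in for `IndexReaderUtils.convertLuceneDocidToDocid(reader, i)` from the diff, and the lossy values are passed in directly rather than computed with `SmallFloat`.

```java
public class DocLengthRows {

    // Hypothetical stand-in for the index lookup that maps an internal id to a
    // collection docid. In the diff above, that role is played by
    // IndexReaderUtils.convertLuceneDocidToDocid(reader, i).
    static String lookupCollectionDocid(String[] collectionDocids, int internalId) {
        return collectionDocids[internalId];
    }

    // Builds one tab-separated row that carries both ids.
    // Assumption: the internal id comes first and the collection docid second;
    // the thread does not state the final column order.
    static String formatRow(int internalId, String collectionDocid,
                            long exactDoclength, long exactTermCount,
                            int lossyDoclength, int lossyTermCount) {
        return String.format("%d\t%s\t%d\t%d\t%d\t%d",
                internalId, collectionDocid,
                exactDoclength, exactTermCount,
                lossyDoclength, lossyTermCount);
    }

    public static void main(String[] args) {
        // Made-up collection docids and statistics, for illustration only.
        String[] collectionDocids = {"doc-a", "doc-b"};
        long[][] exactStats = {{12, 9}, {7, 7}};

        for (int i = 0; i < collectionDocids.length; i++) {
            String docid = lookupCollectionDocid(collectionDocids, i);
            // The lossy columns simply repeat the exact values here; the real
            // tool derives them with SmallFloat as shown in the diff.
            System.out.println(formatRow(i, docid,
                    exactStats[i][0], exactStats[i][1],
                    (int) exactStats[i][0], (int) exactStats[i][1]));
        }
    }
}
```

With both columns in every row, a consumer can join on either id without a separate per-document lookup, which is the inconvenience the PR description set out to remove.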

@lintool lintool self-requested a review August 17, 2020 14:26
@lintool lintool merged commit 857f6da into castorini:master Aug 17, 2020
@nsndimt nsndimt deleted the external_did_dump_doc_length branch September 26, 2020 14:34
crystina-z pushed a commit to crystina-z/anserini that referenced this pull request Oct 28, 2022