add option to output external document ids in ExtractDocumentLengths #1283
Codecov Report
@@ Coverage Diff @@
## master #1283 +/- ##
============================================
- Coverage 51.86% 51.69% -0.17%
+ Complexity 810 803 -7
============================================
Files 154 154
Lines 8625 8627 +2
Branches 1224 1224
============================================
- Hits 4473 4460 -13
- Misses 3781 3796 +15
Partials 371 371
I think it's fine to output the external docids by default, so there's no need for an option. Can you also update the test case accordingly? Thanks!
…dimt/anserini into external_did_dump_doc_length
@@ -90,7 +92,8 @@ public static void main(String[] args) throws Exception {
 // See https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java
 int lossyDoclength = SmallFloat.byte4ToInt(SmallFloat.intToByte4((int) exactDoclength));
 int lossyTermCount = SmallFloat.byte4ToInt(SmallFloat.intToByte4((int) exactTermCount));
-out.println(String.format("%d\t%d\t%d\t%d\t%d", i, exactDoclength, exactTermCount, lossyDoclength, lossyTermCount));
+out.println(String.format("%s\t%d\t%d\t%d\t%d", IndexReaderUtils.convertLuceneDocidToDocid(reader, i),
So sorry about the delayed response... can we keep both the internal id and the collection docid?
finished
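For reference, here is a minimal sketch of what a row carrying both the internal id and the collection docid might look like, per the request above. It is not the code merged in this PR: the class `DocLengthRow`, the helper `formatRow`, the column order, and the sample values are all hypothetical, and the Lucene lookup of the external docid is replaced by a plain string argument so the example runs on its own.

```java
// Hypothetical illustration only; names and column order are assumptions,
// not taken from the actual ExtractDocumentLengths implementation.
class DocLengthRow {

  // Builds one tab-separated row: internal Lucene docid first, then the
  // external (collection) docid, then the exact and lossy length statistics.
  static String formatRow(int internalDocid, String externalDocid,
                          long exactDoclength, long exactTermCount,
                          int lossyDoclength, int lossyTermCount) {
    return String.format("%d\t%s\t%d\t%d\t%d\t%d",
        internalDocid, externalDocid,
        exactDoclength, exactTermCount,
        lossyDoclength, lossyTermCount);
  }

  public static void main(String[] args) {
    // Made-up sample values, used only to show the row layout.
    System.out.println(formatRow(0, "doc-A", 120, 95, 120, 88));
  }
}
```

Putting the internal id in the first column would keep the existing first column unchanged for any consumer that already parses it, at the cost of shifting the remaining columns by one.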
The current implementation outputs only the internal document ids. To get the external document ids, users have to look up each id one by one with Pyserini. Adding this option saves them that trouble.