Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Covid Passage Indexing #1049

Merged
merged 4 commits into from
Mar 22, 2020
Merged

Covid Passage Indexing #1049

merged 4 commits into from
Mar 22, 2020

Conversation

nikhilro
Copy link
Member

Resolves #1044 and #1045

Per the issue, a record is decomposed into (id, content, raw):

  • ID, TITLE + ABSTRACT, FULL_TEXT_JSON
  • ID.0001, TITLE + ABSTRACT + P1, ""
  • ID.0002, TITLE + ABSTRACT + P2, ""

Edits to documentation to resolve xargs issues and updates to expected number of documents.

In regards to #1048, the CovidCollection's SourceDocument adds raw which is ultimately what goes to raw for the document. This is the simplest implementation I think

@nikhilro nikhilro requested review from lintool and edwinzhng March 22, 2020 15:15
@lintool
Copy link
Member

lintool commented Mar 22, 2020

hey @nikhilro I'm thinking (but up for discussion) that the paragraph chopping should be performed in the Generator?

That is each SourceDocument still presents an independent article, but the Generator does the chopping into multiple paragraphs and a Lucene document for each.

@codecov
Copy link

codecov bot commented Mar 22, 2020

Codecov Report

Merging #1049 into master will decrease coverage by 0.5%.
The diff coverage is 0%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #1049      +/-   ##
============================================
- Coverage     43.54%   43.03%   -0.51%     
- Complexity      626      636      +10     
============================================
  Files           130      132       +2     
  Lines          7943     8081     +138     
  Branches       1150     1168      +18     
============================================
+ Hits           3459     3478      +19     
- Misses         4155     4274     +119     
  Partials        329      329
Impacted Files Coverage Δ Complexity Δ
...o/anserini/collection/CovidFullTextCollection.java 0% <0%> (ø) 0 <0> (?)
...n/java/io/anserini/collection/CovidCollection.java 0% <0%> (ø) 0 <0> (ø) ⬇️
.../anserini/collection/CovidParagraphCollection.java 0% <0%> (ø) 0 <0> (?)
...va/io/anserini/index/generator/CovidGenerator.java 0% <0%> (ø) 0 <0> (ø) ⬇️
...anserini/ltr/feature/base/PMIFeatureExtractor.java 86.53% <0%> (+1.92%) 13% <0%> (+1%) ⬆️
...a/io/anserini/ann/fw/FakeWordsEncoderAnalyzer.java 100% <0%> (+18.18%) 2% <0%> (ø) ⬇️
...java/io/anserini/ltr/feature/CountBigramPairs.java 89.61% <0%> (+23.37%) 33% <0%> (+9%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 99e9709...590e7f6. Read the comment docs.

@nikhilro
Copy link
Member Author

My mental model for Collection vs Generator is that the collection deals with actual files on the system whereas the generator never has to worry about that. To generator, there is a guarantee that the document data exists and it's job is to convert it to LuceneDocument. So, I think it makes more sense to keep it in CovidCollection

@nikhilro nikhilro closed this Mar 22, 2020
@nikhilro nikhilro reopened this Mar 22, 2020
@lintool
Copy link
Member

lintool commented Mar 22, 2020

In this design, each Collection has arbitrary flexibility in presenting what it thinks its "content" should be.

In this case, it'd be CSVRecord(\nJSON_FULL_TEXT)?. The Generator knows what to do with this in generating Lucene Documents.

@nikhilro
Copy link
Member Author

In this design, each Collection has arbitrary flexibility in presenting what it thinks its "content" should be.

confused, the current implementation?

@lintool
Copy link
Member

lintool commented Mar 22, 2020

My mental model for Collection vs Generator is that the collection deals with actual files on the system whereas the generator never has to worry about that. To generator, there is a guarantee that the document data exists and it's job is to convert it to LuceneDocument. So, I think it makes more sense to keep it in CovidCollection

But what about the case if we want to change the indexing scheme? Let's say later on we find out that paragraph by paragraph isn't that good, and we want to change to, say sliding window of 2 paragraphs?

In your design, you'd need to modify the collection.

And if you want to keep both, you'd need two collections.

@lintool
Copy link
Member

lintool commented Mar 22, 2020

In this design, each Collection has arbitrary flexibility in presenting what it thinks its "content" should be.

confused, the current implementation?

Yes, that's the implicit contract of the current design, I think.

@nikhilro nikhilro force-pushed the covid-passage-indexing branch from b81e181 to 7ff68c0 Compare March 22, 2020 15:28
@nikhilro nikhilro force-pushed the covid-passage-indexing branch from 7ff68c0 to adc1aa3 Compare March 22, 2020 15:28
@nikhilro
Copy link
Member Author

nikhilro commented Mar 22, 2020

In your design, you'd need to modify the collection.
And if you want to keep both, you'd need two collections.

Yes, that seems reasonable to me. Maybe Collection isn't the best word for it but Generator, to me, seems has a simple job. Convert 1 SourceDocument to 1 LuceneDocument

Collection is the one doing the logic handling with actual files into what it wants in SourceDocument

@lintool
Copy link
Member

lintool commented Mar 22, 2020

In your design, you'd need to modify the collection.
And if you want to keep both, you'd need two collections.

Yes, that seems reasonable to me. Maybe Collection isn't the best word for it but Generator to seems has a simple job. Convert 1 SourceDocument to 1 LuceneDocument

I think I like the contract of Generator as Convert 1 SourceDocument to N LuceneDocument.

So in my design, you'd have 1 Collection, but multiple generators. E.g.,

  • CovidFullDocumentGenerator
  • CovidParagraphGenerator
  • CovidSlidingParagraphWindowGenerator

@edwinzhng edwinzhng closed this Mar 22, 2020
@edwinzhng edwinzhng reopened this Mar 22, 2020
@nikhilro
Copy link
Member Author

nikhilro commented Mar 22, 2020

To implement 1 SourceDocument to N LuceneDocument, there'd be two complications:

  • We need to change pieces of our core design. Right now, all a generator does is createDocument(SourceDocument) which is expected to return 1 LuceneDocument. Each generator will have to implement some complex iterator logic to support the aforementioned 1:N SourceDocument:LuceneDocument.

  • Even if we could change the above, there is an additional complexity on creating those N LuceneDocuments in the Generator. This will involve JsonNode, CSVParser, BufferedReader etc. having to move to Generator in addition to staying in Collection.

So, I think it's cleaner to have 1:1 for SourceDocument:LuceneDocument because

  • The file handling logic all stays in Collection
  • Generators are simple and can deal with issues related to Lucene indexing
  • Collection's SourceDocument will clearly say what content and raw fields are expected

And yea, there'd be multiple Collections like: (I do agree with you Collection not the best name)

  • CovidFullDocumentCollection
  • CovidParagraphCollection
  • CovidSlidingParagraphWindowCollection

@edwinzhng
Copy link
Member

To implement 1 SourceDocument to N LuceneDocument, there'd be two complications:

  • We need to change pieces of our core design. Right now, all a generator does is createDocument(SourceDocument) which is expected to return 1 LuceneDocument. Each generator will have to implement some complex iterator logic to support the aforementioned 1:N SourceDocument:LuceneDocument.
  • Even if we could change the above, there is an additional complexity on creating those N LuceneDocuments in the Generator. This will involve JsonNode, CSVParser, BufferedReader etc. having to move to Generator in addition to staying in Collection.

In that case, we would just need to modify the IndexCollection to support a list of returned documents instead of a single one (and probably modify all existing generators to return a list of size == 1)

https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/index/IndexCollection.java#L195

@lintool
Copy link
Member

lintool commented Mar 22, 2020

Okay, I think you've sold me. We'll go with your design.

But let's keep both CovidCollection (the old one) and also CovidParagraphCollection (new one)? Maybe refactor to inherit from same super class?

@nikhilro
Copy link
Member Author

Sounds good!

@lintool
Copy link
Member

lintool commented Mar 22, 2020

Actually, @nikhilro can we have three versions? (1) title/abstract only, (2) full text, and (3) title/abstract/each paragraph.

Down the road, we probably want to evaluate them...

@nikhilro
Copy link
Member Author

nikhilro commented Mar 22, 2020

and to clarify those three versions

  • (1) title/abstract only: CovidCollection

    • (id, content, raw)
    • (ROW_NUMBER, TITLE + ABSTRACT, FULL_JSON_TEXT)
  • (2) full text: CovidFullTextCollection

    • (id, content, raw)
    • (ROW_NUMBER, TITLE + ABSTRACT + FULL_JSON_TEXT, FULL_JSON_TEXT)
  • and (3) title/abstract/each paragraph: CovidParagraphCollection

    • (id, content, raw)
    • (ROW_NUMBER, TITLE + ABSTRACT, FULL_JSON_TEXT)
    • (ROW_NUMBER.001, TITLE + ABSTRACT + P1, "")
    • (ROW_NUMBER.002, TITLE + ABSTRACT + P2, "")
    • ....

@lintool
Copy link
Member

lintool commented Mar 22, 2020

and to clarify those three versions

  • (1) title/abstract only: CovidCollection

    • (id, content, raw)
    • (ROW_NUMBER, TITLE + ABSTRACT, FULL_JSON_TEXT)

First bullet is input, and remaining our outputs Lucene docs, right? If so, yes.

@nikhilro
Copy link
Member Author

yea, I was mentioning the format in each for the upcoming bullets (id, content, raw) 😅. Sounds good, on it.

@nikhilro
Copy link
Member Author

Gonna get @edwinzhng 's help to get the inheritance working in Java

Copy link
Member

@edwinzhng edwinzhng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just merge? I'll take over the inheritance refactoring

@nikhilro
Copy link
Member Author

nikhilro commented Mar 22, 2020

sh target/appassembler/bin/IndexCollection \
    -collection CovidCollection -generator CovidGenerator \
    -threads 8 -input "${DATA_DIR}" \
    -index "${DATA_DIR}"/lucene-index-covid \
    -storePositions -storeDocvectors -storeRawDocs -storeTransformedDocs

works but same with CovidFullTextCollection does not because of inheritance issues that @edwinzhng will fix. Thanks!

@edwinzhng edwinzhng merged commit a90a0ce into master Mar 22, 2020
@edwinzhng edwinzhng deleted the covid-passage-indexing branch March 22, 2020 17:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

COVID-19 collection: Implement passage indexing
3 participants