Covid Passage Indexing #1049

nikhilro · 2020-03-22T15:15:46Z

Per the issue, a record is decomposed into (id, content, raw):

ID, TITLE + ABSTRACT, FULL_TEXT_JSON
ID.0001, TITLE + ABSTRACT + P1, ""
ID.0002, TITLE + ABSTRACT + P2, ""

Edits to documentation to resolve xargs issues and updates to expected number of documents.

In regards to #1048, the CovidCollection's SourceDocument adds raw which is ultimately what goes to raw for the document. This is the simplest implementation I think

…text

lintool · 2020-03-22T15:19:40Z

hey @nikhilro I'm thinking (but up for discussion) that the paragraph chopping should be performed in the Generator?

That is each SourceDocument still presents an independent article, but the Generator does the chopping into multiple paragraphs and a Lucene document for each.

codecov · 2020-03-22T15:20:38Z

Codecov Report

Merging #1049 into master will decrease coverage by 0.5%.
The diff coverage is 0%.

@@             Coverage Diff              @@
##             master    #1049      +/-   ##
============================================
- Coverage     43.54%   43.03%   -0.51%     
- Complexity      626      636      +10     
============================================
  Files           130      132       +2     
  Lines          7943     8081     +138     
  Branches       1150     1168      +18     
============================================
+ Hits           3459     3478      +19     
- Misses         4155     4274     +119     
  Partials        329      329

Impacted Files	Coverage Δ	Complexity Δ
...o/anserini/collection/CovidFullTextCollection.java	`0% <0%> (ø)`	`0 <0> (?)`
...n/java/io/anserini/collection/CovidCollection.java	`0% <0%> (ø)`	`0 <0> (ø)`	⬇️
.../anserini/collection/CovidParagraphCollection.java	`0% <0%> (ø)`	`0 <0> (?)`
...va/io/anserini/index/generator/CovidGenerator.java	`0% <0%> (ø)`	`0 <0> (ø)`	⬇️
...anserini/ltr/feature/base/PMIFeatureExtractor.java	`86.53% <0%> (+1.92%)`	`13% <0%> (+1%)`	⬆️
...a/io/anserini/ann/fw/FakeWordsEncoderAnalyzer.java	`100% <0%> (+18.18%)`	`2% <0%> (ø)`	⬇️
...java/io/anserini/ltr/feature/CountBigramPairs.java	`89.61% <0%> (+23.37%)`	`33% <0%> (+9%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 99e9709...590e7f6. Read the comment docs.

nikhilro · 2020-03-22T15:23:26Z

My mental model for Collection vs Generator is that the collection deals with actual files on the system whereas the generator never has to worry about that. To generator, there is a guarantee that the document data exists and it's job is to convert it to LuceneDocument. So, I think it makes more sense to keep it in CovidCollection

lintool · 2020-03-22T15:23:37Z

In this design, each Collection has arbitrary flexibility in presenting what it thinks its "content" should be.

In this case, it'd be CSVRecord(\nJSON_FULL_TEXT)?. The Generator knows what to do with this in generating Lucene Documents.

nikhilro · 2020-03-22T15:25:05Z

In this design, each Collection has arbitrary flexibility in presenting what it thinks its "content" should be.

confused, the current implementation?

lintool · 2020-03-22T15:25:09Z

My mental model for Collection vs Generator is that the collection deals with actual files on the system whereas the generator never has to worry about that. To generator, there is a guarantee that the document data exists and it's job is to convert it to LuceneDocument. So, I think it makes more sense to keep it in CovidCollection

But what about the case if we want to change the indexing scheme? Let's say later on we find out that paragraph by paragraph isn't that good, and we want to change to, say sliding window of 2 paragraphs?

In your design, you'd need to modify the collection.

And if you want to keep both, you'd need two collections.

lintool · 2020-03-22T15:25:42Z

In this design, each Collection has arbitrary flexibility in presenting what it thinks its "content" should be.

confused, the current implementation?

Yes, that's the implicit contract of the current design, I think.

…st one contains it now

nikhilro · 2020-03-22T15:29:53Z

In your design, you'd need to modify the collection.
And if you want to keep both, you'd need two collections.

Yes, that seems reasonable to me. Maybe Collection isn't the best word for it but Generator, to me, seems has a simple job. Convert 1 SourceDocument to 1 LuceneDocument

Collection is the one doing the logic handling with actual files into what it wants in SourceDocument

lintool · 2020-03-22T15:32:17Z

In your design, you'd need to modify the collection.
And if you want to keep both, you'd need two collections.

Yes, that seems reasonable to me. Maybe Collection isn't the best word for it but Generator to seems has a simple job. Convert 1 SourceDocument to 1 LuceneDocument

I think I like the contract of Generator as Convert 1 SourceDocument to N LuceneDocument.

So in my design, you'd have 1 Collection, but multiple generators. E.g.,

CovidFullDocumentGenerator
CovidParagraphGenerator
CovidSlidingParagraphWindowGenerator

nikhilro · 2020-03-22T15:42:37Z

To implement 1 SourceDocument to N LuceneDocument, there'd be two complications:

We need to change pieces of our core design. Right now, all a generator does is createDocument(SourceDocument) which is expected to return 1 LuceneDocument. Each generator will have to implement some complex iterator logic to support the aforementioned 1:N SourceDocument:LuceneDocument.
Even if we could change the above, there is an additional complexity on creating those N LuceneDocuments in the Generator. This will involve JsonNode, CSVParser, BufferedReader etc. having to move to Generator in addition to staying in Collection.

So, I think it's cleaner to have 1:1 for SourceDocument:LuceneDocument because

The file handling logic all stays in Collection
Generators are simple and can deal with issues related to Lucene indexing
Collection's SourceDocument will clearly say what content and raw fields are expected

And yea, there'd be multiple Collections like: (I do agree with you Collection not the best name)

CovidFullDocumentCollection
CovidParagraphCollection
CovidSlidingParagraphWindowCollection

edwinzhng · 2020-03-22T15:44:54Z

To implement 1 SourceDocument to N LuceneDocument, there'd be two complications:

We need to change pieces of our core design. Right now, all a generator does is createDocument(SourceDocument) which is expected to return 1 LuceneDocument. Each generator will have to implement some complex iterator logic to support the aforementioned 1:N SourceDocument:LuceneDocument.

Even if we could change the above, there is an additional complexity on creating those N LuceneDocuments in the Generator. This will involve JsonNode, CSVParser, BufferedReader etc. having to move to Generator in addition to staying in Collection.

In that case, we would just need to modify the IndexCollection to support a list of returned documents instead of a single one (and probably modify all existing generators to return a list of size == 1)

https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/index/IndexCollection.java#L195

lintool · 2020-03-22T15:47:51Z

Okay, I think you've sold me. We'll go with your design.

But let's keep both CovidCollection (the old one) and also CovidParagraphCollection (new one)? Maybe refactor to inherit from same super class?

nikhilro · 2020-03-22T15:48:16Z

Sounds good!

lintool · 2020-03-22T16:03:24Z

Actually, @nikhilro can we have three versions? (1) title/abstract only, (2) full text, and (3) title/abstract/each paragraph.

Down the road, we probably want to evaluate them...

nikhilro · 2020-03-22T16:18:51Z

and to clarify those three versions

(1) title/abstract only: CovidCollection
- (id, content, raw)
- (ROW_NUMBER, TITLE + ABSTRACT, FULL_JSON_TEXT)
(2) full text: CovidFullTextCollection
- (id, content, raw)
- (ROW_NUMBER, TITLE + ABSTRACT + FULL_JSON_TEXT, FULL_JSON_TEXT)
and (3) title/abstract/each paragraph: CovidParagraphCollection
- (id, content, raw)
- (ROW_NUMBER, TITLE + ABSTRACT, FULL_JSON_TEXT)
- (ROW_NUMBER.001, TITLE + ABSTRACT + P1, "")
- (ROW_NUMBER.002, TITLE + ABSTRACT + P2, "")
- ....

lintool · 2020-03-22T16:20:51Z

and to clarify those three versions

(1) title/abstract only: CovidCollection

(id, content, raw)

(ROW_NUMBER, TITLE + ABSTRACT, FULL_JSON_TEXT)

First bullet is input, and remaining our outputs Lucene docs, right? If so, yes.

nikhilro · 2020-03-22T16:23:01Z

yea, I was mentioning the format in each for the upcoming bullets (id, content, raw) 😅. Sounds good, on it.

… techniques

nikhilro · 2020-03-22T17:07:20Z

Gonna get @edwinzhng 's help to get the inheritance working in Java

edwinzhng

Let's just merge? I'll take over the inheritance refactoring

nikhilro · 2020-03-22T17:27:41Z

sh target/appassembler/bin/IndexCollection \
    -collection CovidCollection -generator CovidGenerator \
    -threads 8 -input "${DATA_DIR}" \
    -index "${DATA_DIR}"/lucene-index-covid \
    -storePositions -storeDocvectors -storeRawDocs -storeTransformedDocs

works but same with CovidFullTextCollection does not because of inheritance issues that @edwinzhng will fix. Thanks!

Passage Indexing Works. Content contain paragraph. Raw contains full …

75c8850

…text

nikhilro requested review from lintool and edwinzhng March 22, 2020 15:15

nikhilro closed this Mar 22, 2020

nikhilro reopened this Mar 22, 2020

nikhilro force-pushed the covid-passage-indexing branch from b81e181 to 7ff68c0 Compare March 22, 2020 15:28

Readme edit to link covid doc. Change to raw for passage docs. Only 1…

adc1aa3

…st one contains it now

nikhilro force-pushed the covid-passage-indexing branch from 7ff68c0 to adc1aa3 Compare March 22, 2020 15:28

edwinzhng closed this Mar 22, 2020

edwinzhng reopened this Mar 22, 2020

nikhilro mentioned this pull request Mar 22, 2020

Discussion point: SourceDocument design #1048

Closed

CR. Add 3 versions of Covid Collections to support different indexing…

5aafcd6

… techniques

edwinzhng approved these changes Mar 22, 2020

View reviewed changes

Merge branch 'master' into covid-passage-indexing

590e7f6

edwinzhng merged commit a90a0ce into master Mar 22, 2020

edwinzhng deleted the covid-passage-indexing branch March 22, 2020 17:28

This was referenced Mar 22, 2020

COVID-19 documentation tweaks #1045

Closed

Update experiments-covid.md #1052

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Covid Passage Indexing #1049

Covid Passage Indexing #1049

nikhilro commented Mar 22, 2020

lintool commented Mar 22, 2020

codecov bot commented Mar 22, 2020 •

edited

Loading

nikhilro commented Mar 22, 2020

lintool commented Mar 22, 2020

nikhilro commented Mar 22, 2020

lintool commented Mar 22, 2020

lintool commented Mar 22, 2020

nikhilro commented Mar 22, 2020 •

edited

Loading

lintool commented Mar 22, 2020

nikhilro commented Mar 22, 2020 •

edited

Loading

edwinzhng commented Mar 22, 2020

lintool commented Mar 22, 2020

nikhilro commented Mar 22, 2020

lintool commented Mar 22, 2020

nikhilro commented Mar 22, 2020 •

edited

Loading

lintool commented Mar 22, 2020

nikhilro commented Mar 22, 2020

nikhilro commented Mar 22, 2020

edwinzhng left a comment

nikhilro commented Mar 22, 2020 •

edited

Loading

Covid Passage Indexing #1049

Covid Passage Indexing #1049

Conversation

nikhilro commented Mar 22, 2020

lintool commented Mar 22, 2020

codecov bot commented Mar 22, 2020 • edited Loading

Codecov Report

nikhilro commented Mar 22, 2020

lintool commented Mar 22, 2020

nikhilro commented Mar 22, 2020

lintool commented Mar 22, 2020

lintool commented Mar 22, 2020

nikhilro commented Mar 22, 2020 • edited Loading

lintool commented Mar 22, 2020

nikhilro commented Mar 22, 2020 • edited Loading

edwinzhng commented Mar 22, 2020

lintool commented Mar 22, 2020

nikhilro commented Mar 22, 2020

lintool commented Mar 22, 2020

nikhilro commented Mar 22, 2020 • edited Loading

lintool commented Mar 22, 2020

nikhilro commented Mar 22, 2020

nikhilro commented Mar 22, 2020

edwinzhng left a comment

Choose a reason for hiding this comment

nikhilro commented Mar 22, 2020 • edited Loading

codecov bot commented Mar 22, 2020 •

edited

Loading

nikhilro commented Mar 22, 2020 •

edited

Loading

nikhilro commented Mar 22, 2020 •

edited

Loading

nikhilro commented Mar 22, 2020 •

edited

Loading

nikhilro commented Mar 22, 2020 •

edited

Loading