Covid Passage Indexing #1049
hey @nikhilro I'm thinking (but up for discussion) that the paragraph chopping should be performed in the That is each |
Codecov Report
@@ Coverage Diff @@
## master #1049 +/- ##
============================================
- Coverage 43.54% 43.03% -0.51%
- Complexity 626 636 +10
============================================
Files 130 132 +2
Lines 7943 8081 +138
Branches 1150 1168 +18
============================================
+ Hits 3459 3478 +19
- Misses 4155 4274 +119
Partials 329 329
My mental model for |
In this design, each In this case, it'd be |
confused, the current implementation? |
But what about the case where we want to change the indexing scheme? Let's say later on we find out that paragraph by paragraph isn't that good, and we want to change to, say, a sliding window of 2 paragraphs? In your design, you'd need to modify the collection. And if you want to keep both, you'd need two collections.
Yes, that's the implicit contract of the current design, I think.
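To make the two schemes being compared concrete, here is a minimal sketch of paragraph-by-paragraph segmentation versus a sliding window of 2 paragraphs. The class `ParagraphWindows` and its `windows` method are hypothetical names for illustration only; they are not part of Anserini, and joining paragraphs with a newline is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not an Anserini class.
final class ParagraphWindows {
  private ParagraphWindows() {}

  // Joins each run of `size` consecutive paragraphs, advancing by `stride`.
  // size = 1, stride = 1 gives paragraph-by-paragraph segments;
  // size = 2, stride = 1 gives a sliding window of 2 paragraphs.
  // With stride > 1, trailing paragraphs that do not fill a window are dropped.
  static List<String> windows(List<String> paragraphs, int size, int stride) {
    if (size < 1 || stride < 1) {
      throw new IllegalArgumentException("size and stride must be >= 1");
    }
    List<String> out = new ArrayList<>();
    if (paragraphs.isEmpty()) {
      return out;
    }
    // A document shorter than one window becomes a single segment.
    if (paragraphs.size() <= size) {
      out.add(String.join("\n", paragraphs));
      return out;
    }
    for (int start = 0; start + size <= paragraphs.size(); start += stride) {
      out.add(String.join("\n", paragraphs.subList(start, start + size)));
    }
    return out;
  }
}
```

Under this sketch, switching schemes is a change of parameters to one function, which is the kind of flexibility the question above is asking about.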
b81e181 to 7ff68c0
…st one contains it now
7ff68c0 to adc1aa3
Yes, that seems reasonable to me. Maybe
I think I like the contract of Generator as Convert 1 So in my design, you'd have 1 Collection, but multiple generators. E.g.,
To implement
So, I think it's cleaner to have 1:1 for
And yea, there'd be multiple Collections like: (I do agree with you
In that case, we would just need to modify the IndexCollection to support a list of returned documents instead of a single one (and probably modify all existing generators to return a list of size == 1).
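A rough sketch of what that change could look like: a generator contract that returns a list, with an existing one-to-one generator adapted to return a singleton list. Every name here (`GeneratorSketch`, `MultiDocGenerator`, `Source`, `IndexDoc`, the `id.N` suffix) is a hypothetical stand-in chosen for illustration; none of it is Anserini's actual API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// All names are hypothetical stand-ins, not Anserini's real classes.
final class GeneratorSketch {
  private GeneratorSketch() {}

  // A source record as read from the collection.
  static final class Source {
    final String id;
    final List<String> paragraphs;
    Source(String id, List<String> paragraphs) {
      this.id = id;
      this.paragraphs = paragraphs;
    }
  }

  // One unit that would be written to the index.
  static final class IndexDoc {
    final String id;
    final String content;
    IndexDoc(String id, String content) {
      this.id = id;
      this.content = content;
    }
  }

  // The generator contract returns a list instead of a single document.
  interface MultiDocGenerator {
    List<IndexDoc> createDocuments(Source src);
  }

  // An existing one-to-one generator, adapted to return a list of size 1.
  static final class WholeDocGenerator implements MultiDocGenerator {
    @Override
    public List<IndexDoc> createDocuments(Source src) {
      String content = String.join("\n", src.paragraphs);
      return Collections.singletonList(new IndexDoc(src.id, content));
    }
  }

  // A one-to-many generator: one indexable document per paragraph.
  // The "id.N" naming is an assumption made for this sketch.
  static final class ParagraphGenerator implements MultiDocGenerator {
    @Override
    public List<IndexDoc> createDocuments(Source src) {
      List<IndexDoc> docs = new ArrayList<>();
      for (int i = 0; i < src.paragraphs.size(); i++) {
        docs.add(new IndexDoc(src.id + "." + (i + 1), src.paragraphs.get(i)));
      }
      return docs;
    }
  }

  // The indexing loop adds every document the generator returns.
  static List<IndexDoc> index(List<Source> sources, MultiDocGenerator gen) {
    List<IndexDoc> indexed = new ArrayList<>();
    for (Source src : sources) {
      indexed.addAll(gen.createDocuments(src));
    }
    return indexed;
  }
}
```

With this shape, the collection stays the same and only the generator passed to the indexing loop decides how many documents each record produces.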
Okay, I think you've sold me. We'll go with your design. But let's keep both |
Sounds good! |
Actually, @nikhilro can we have three versions? (1) title/abstract only, (2) full text, and (3) title/abstract/each paragraph. Down the road, we probably want to evaluate them... |
and to clarify those three versions
The first bullet is the input, and the remaining ones are the output Lucene docs, right? If so, yes.
Yea, I was giving the format of each of the upcoming bullets (id, content, raw) 😅. Sounds good, on it.
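For reference, the three requested versions could be sketched as below, with each output carrying (id, content, raw). This is only one reading of the request: the interpretation of version (3) as "title + abstract + one paragraph per document", the newline separator, the `id.N` suffix, and all names (`IndexVariants`, `Variant`, `Doc`, `build`) are assumptions for illustration, not the PR's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names and formats are assumptions, not Anserini code.
final class IndexVariants {
  private IndexVariants() {}

  enum Variant { ABSTRACT, FULL_TEXT, PARAGRAPH }

  // One output document in (id, content, raw) form.
  static final class Doc {
    final String id;
    final String content;
    final String raw;
    Doc(String id, String content, String raw) {
      this.id = id;
      this.content = content;
      this.raw = raw;
    }
  }

  static List<Doc> build(String id, String title, String abstractText,
                         List<String> paragraphs, String raw, Variant variant) {
    String head = title + "\n" + abstractText;
    List<Doc> docs = new ArrayList<>();
    switch (variant) {
      case ABSTRACT:
        // (1) title and abstract only
        docs.add(new Doc(id, head, raw));
        break;
      case FULL_TEXT:
        // (2) title, abstract, and every paragraph in one document
        String body = paragraphs.isEmpty() ? "" : "\n" + String.join("\n", paragraphs);
        docs.add(new Doc(id, head + body, raw));
        break;
      case PARAGRAPH:
        // (3) assumed reading: one document per paragraph, each with title and abstract
        for (int i = 0; i < paragraphs.size(); i++) {
          docs.add(new Doc(id + "." + (i + 1), head + "\n" + paragraphs.get(i), raw));
        }
        // Fall back to title/abstract when the record has no paragraphs.
        if (docs.isEmpty()) {
          docs.add(new Doc(id, head, raw));
        }
        break;
    }
    return docs;
  }
}
```

Keeping all three behind one switch makes it straightforward to build the indexes side by side and evaluate them later.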
Gonna get @edwinzhng's help to get the inheritance working in Java.
Let's just merge? I'll take over the inheritance refactoring
works but same with |
Resolves #1044 and #1045
Per the issue, a record is decomposed into (id, content, raw):
Edits to documentation to resolve xargs issues and updates to expected number of documents.
In regards to #1048, the CovidCollection's SourceDocument adds raw, which is ultimately what goes to raw for the document. This is the simplest implementation, I think.