-
Notifications
You must be signed in to change notification settings - Fork 467
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Discussion point: SourceDocument design #1048
Comments
I'm about to create a PR which could serve as an example |
This doesn't hold because the way I implemented there is one-to-one mapping
I like this. Also, there is an ongoing confusion because we use |
I like the separate fields too for contents and raw, allowing the SourceDocument to build it's own separation makes things more organized imo |
For posterity, meat of the resolution in this comment in PR #1049: #1049 (comment) |
Okay, this will be a fairly painful refactoring to get everything in this format, so we'll punt it for later. |
Actually, if we take this design to the logical conclusion, should the generator also be folded into the collection? That is, each |
I don't even see the need for any refactoring because there is an implicit assumption that
I like the separation between Lucene and non-Lucene code. It's nice and clean |
I see, I suppose we can change |
I don't know Java too well but that sounds reasonable. cc @edwinzhng |
cf: #1048 - contents vs. raw fields of source documents
Refactoring of WashingtonPost collection based on the clarified definition of contents() and raw() in SourceDocument, per #1048.
* move hsearch into search.hybrid * rename hsearch -> search.hybrid * fake api to support original hsearch * hsearch -> search.hybird in integration * _hybrid.py -> _searcher.py
Follow up to #1047
We need to rethink the indexing pipeline re: "contents" and "raw", because there's a lot of confusion, mostly because the design has evolved over time...
In the Lucene document, we've agreed (and this is clear) that "contents" field should be the content as indexed for searching, and "raw" should be the raw source document.
What about
SourceDocument
then? Currently, it only hascontent()
method, and it's underspecified as to what it should provide. TheGenerator
is responsible for callingcontent()
and pulling out the necessary parts to populate the Lucene "contents" field and "raw" field.Does this design make sense? The alternative is that
SourceDocument
hascontents()
andraw()
methods, i.e., it "knows" its own contents and raw form.One argument against this is when we have one-to-many
SourceDocument
to LuceneDocument
mappings - like in COVID19 (see #1044), where we take a singleSourceDocument
, which may be full text, and index it at the paragraph level (i.e., multiple LuceneDocuments
)@nikhilro @edwinzhng thoughts?
If we stick to the original design, then it's a matter of clarifying in documentation. If we adopted the alternative design, it's going to be a painful refactoring job...
The text was updated successfully, but these errors were encountered: