Discussion point: SourceDocument design #1048

Closed
lintool opened this issue Mar 22, 2020 · 9 comments

@lintool
Member

lintool commented Mar 22, 2020

Follow up to #1047

We need to rethink the indexing pipeline re: "contents" and "raw", because there's a lot of confusion, mostly because the design has evolved over time...

In the Lucene document, we've agreed (and this is clear) that the "contents" field should be the content as indexed for searching, and "raw" should be the raw source document.

What about SourceDocument then? Currently, it only has a content() method, and it's underspecified as to what it should provide. The Generator is responsible for calling content() and pulling out the necessary parts to populate the Lucene "contents" field and "raw" field.

Does this design make sense? The alternative is that SourceDocument has contents() and raw() methods, i.e., it "knows" its own contents and raw form.
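
A rough sketch of what the alternative might look like. This is an illustration only, not the actual Anserini interface: the exact method set, the `MarkupDocument` class, and the crude tag stripping are all assumptions made for the example.

```java
// Hypothetical shape of the alternative design: the document exposes both views itself.
interface SourceDocument {
  String id();
  String contents();  // the text that should be indexed for searching
  String raw();       // the source document as it appeared in the collection
}

// Hypothetical collection-specific document: raw is markup, contents is the text inside it.
final class MarkupDocument implements SourceDocument {
  private final String id;
  private final String raw;

  MarkupDocument(String id, String raw) {
    this.id = id;
    this.raw = raw;
  }

  @Override
  public String id() {
    return id;
  }

  @Override
  public String contents() {
    // Crude tag stripping, for illustration only: replace tags with spaces,
    // collapse runs of whitespace, then trim.
    return raw.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
  }

  @Override
  public String raw() {
    return raw;
  }
}
```

With something like this, the Generator would no longer need to know how to pull "contents" and "raw" out of a single content() string; it would just copy the two values into the corresponding Lucene fields.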

One argument against this is the case of one-to-many SourceDocument-to-Lucene-Document mappings, as in COVID19 (see #1044), where we take a single SourceDocument, which may be full text, and index it at the paragraph level (i.e., as multiple Lucene Documents).
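
To make the one-to-many case concrete, here is a minimal sketch of a paragraph-level generator. Everything in it is assumed for illustration: `IndexRecord` is a stand-in for a Lucene Document so the example runs without Lucene, and the blank-line splitting rule and the `docId.N` naming scheme are made up, not what #1044 does.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for a Lucene Document (hypothetical), holding just an id and the searchable text.
final class IndexRecord {
  final String id;
  final String contents;

  IndexRecord(String id, String contents) {
    this.id = id;
    this.contents = contents;
  }
}

// Hypothetical generator that turns one source document into several index records.
final class ParagraphGenerator {
  // Splits the full text on blank lines and emits one record per non-empty paragraph.
  static List<IndexRecord> generate(String docId, String fullText) {
    List<IndexRecord> records = new ArrayList<>();
    int n = 0;
    for (String paragraph : fullText.split("\\n\\s*\\n")) {
      String trimmed = paragraph.trim();
      if (trimmed.isEmpty()) {
        continue;
      }
      n++;
      records.add(new IndexRecord(docId + "." + n, trimmed));
    }
    return records;
  }
}
```

The tension is visible here: a single contents() on the SourceDocument cannot say which paragraph it refers to, so in this kind of setup the splitting logic has to live somewhere outside the document.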

@nikhilro @edwinzhng thoughts?

If we stick to the original design, then it's a matter of clarifying in documentation. If we adopt the alternative design, it's going to be a painful refactoring job...

@nikhilro
Member

I'm about to create a PR which could serve as an example

@nikhilro
Member

nikhilro commented Mar 22, 2020

> One argument against this is when we have one-to-many SourceDocument to Lucene Document mappings - like in COVID19 (see #1044), where we take a single SourceDocument, which may be full text, and index it at the paragraph level (i.e., multiple Lucene Documents)

This doesn't hold because, the way I implemented it, there is a one-to-one mapping.

> alternative is that SourceDocument has contents() and raw() methods, i.e., it "knows" its own contents and raw form.

I like this. Also, there is ongoing confusion because we use contents in some places and content in others. In SourceDocument, it's content. So, ultimately I'd vote in favor of each SourceDocument having content and raw.

@edwinzhng
Member

I like the separate fields for contents and raw too; allowing the SourceDocument to build its own separation makes things more organized imo.

@nikhilro
Member

nikhilro commented Mar 22, 2020

For posterity, the meat of the resolution is in this comment in PR #1049: #1049 (comment)

@lintool
Member Author

lintool commented Mar 22, 2020

Okay, this will be a fairly painful refactoring to get everything in this format, so we'll punt it for later.

@lintool
Member Author

lintool commented Mar 22, 2020

Actually, if we take this design to the logical conclusion, should the generator also be folded into the collection?

That is, each SourceDocument knows how to generate its own corresponding Lucene document? Or, do we like the separation between Lucene and non-Lucene code?

@nikhilro
Member

nikhilro commented Mar 22, 2020

I don't even see the need for any refactoring because there is an implicit assumption that content == raw unless SourceDocument for that Collection specifies raw, no?

> That is, each SourceDocument knows how to generate its own corresponding Lucene document? Or, do we like the separation between Lucene and non-Lucene code?

I like the separation between Lucene and non-Lucene code. It's nice and clean.

@lintool
Member Author

lintool commented Mar 22, 2020

> I don't even see the need for any refactoring because there is an implicit assumption that content == raw unless SourceDocument for that Collection specifies raw, no?

I see, I suppose we can change SourceDocument to abstract, and make one dispatch to the other, overridable by subclasses if need be?
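
Something along these lines, as a minimal sketch. The direction of the dispatch (raw() falling back to contents()) follows the "content == raw unless specified" assumption above; the two subclasses and their names are hypothetical, invented only to show the default and the override.

```java
// Hypothetical abstract base: subclasses must supply contents(); raw() has a default.
abstract class SourceDocument {
  abstract String id();

  abstract String contents();

  // Default assumption: raw is the same as contents unless a subclass says otherwise.
  String raw() {
    return contents();
  }
}

// Hypothetical collection where the source is already plain text, so the default is enough.
final class PlainTextDocument extends SourceDocument {
  private final String id;
  private final String text;

  PlainTextDocument(String id, String text) {
    this.id = id;
    this.text = text;
  }

  @Override
  String id() {
    return id;
  }

  @Override
  String contents() {
    return text;
  }
}

// Hypothetical collection where the source is a JSON record wrapping the searchable text.
final class JsonWrappedDocument extends SourceDocument {
  private final String id;
  private final String text;
  private final String json;

  JsonWrappedDocument(String id, String text, String json) {
    this.id = id;
    this.text = text;
    this.json = json;
  }

  @Override
  String id() {
    return id;
  }

  @Override
  String contents() {
    return text;
  }

  // Overrides the default because raw differs from contents for this collection.
  @Override
  String raw() {
    return json;
  }
}
```

Existing collections where the two coincide would need no changes beyond extending the base class, which is what would keep the refactoring small.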

@nikhilro
Member

nikhilro commented Mar 22, 2020

> I see, I suppose we can change SourceDocument to abstract, and make one dispatch to the other, overridable by subclasses if need be?

I don't know Java too well but that sounds reasonable. cc @edwinzhng

@edwinzhng edwinzhng self-assigned this Mar 23, 2020
lintool added a commit that referenced this issue Apr 7, 2020
cf: #1048 - contents vs. raw fields of source documents
lintool added a commit that referenced this issue Apr 12, 2020
Refactoring of WashingtonPost collection based on the clarified definition of
contents() and raw() in SourceDocument, per #1048.
crystina-z added a commit to crystina-z/anserini that referenced this issue Oct 28, 2022
* move hsearch into search.hybrid

* rename hsearch -> search.hybrid

* fake api to support original hsearch

* hsearch -> search.hybird in integration

* _hybrid.py -> _searcher.py