Discussion point: SourceDocument design #1048

lintool · 2020-03-22T15:10:00Z

Follow up to #1047

We need to rethink the indexing pipeline re: "contents" and "raw", because there's a lot of confusion, mostly because the design has evolved over time...

In the Lucene document, we've agreed (and this is clear) that the "contents" field should be the content as indexed for searching, and the "raw" field should be the raw source document.

What about SourceDocument then? Currently, it only has a content() method, and what that method should provide is underspecified. The Generator is responsible for calling content() and pulling out the necessary parts to populate the Lucene "contents" field and "raw" field.

Does this design make sense? The alternative is that SourceDocument has contents() and raw() methods, i.e., it "knows" its own contents and raw form.
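
To make the two options concrete, here is a minimal sketch. It assumes nothing about the real code beyond the content() method mentioned above; the type names (SingleAccessorDocument, TwoAccessorDocument, ToyDocument) are hypothetical and only for illustration.

```java
import java.util.Objects;

// Hypothetical sketch of the two designs; these are not the actual Anserini types.
class SourceDocumentSketch {

  // Current design: a single accessor. What it returns is up to each collection,
  // and the Generator has to pull out the searchable text and the raw form itself.
  interface SingleAccessorDocument {
    String content();
  }

  // Alternative design: the document exposes both forms explicitly.
  interface TwoAccessorDocument {
    String contents(); // text to be indexed for searching
    String raw();      // the raw source document
  }

  // Toy implementation: keeps the raw record and the text extracted from it.
  static final class ToyDocument implements TwoAccessorDocument {
    private final String raw;
    private final String contents;

    ToyDocument(String raw, String contents) {
      this.raw = Objects.requireNonNull(raw);
      this.contents = Objects.requireNonNull(contents);
    }

    @Override
    public String contents() {
      return contents;
    }

    @Override
    public String raw() {
      return raw;
    }
  }

  public static void main(String[] args) {
    TwoAccessorDocument doc = new ToyDocument(
        "{\"title\": \"A title\", \"body\": \"Some body text.\"}",
        "A title Some body text.");
    // Under the alternative design, a generator could copy these
    // straight into the two Lucene fields.
    System.out.println("contents = " + doc.contents());
    System.out.println("raw      = " + doc.raw());
  }
}
```

With the second interface, the Generator no longer has to know how each collection packs its data into a single string.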

One argument against this is when we have one-to-many SourceDocument to Lucene Document mappings - like in COVID19 (see #1044), where we take a single SourceDocument, which may be full text, and index it at the paragraph level (i.e., multiple Lucene Documents)

@nikhilro @edwinzhng thoughts?

If we stick to the original design, then it's a matter of clarifying in documentation. If we adopted the alternative design, it's going to be a painful refactoring job...

nikhilro · 2020-03-22T15:11:24Z

I'm about to create a PR which could serve as an example

nikhilro · 2020-03-22T15:18:10Z

> One argument against this is when we have one-to-many SourceDocument to Lucene Document mappings - like in COVID19 (see #1044), where we take a single SourceDocument, which may be full text, and index it at the paragraph level (i.e., multiple Lucene Documents)

This doesn't hold, because the way I implemented it there, the mapping is one-to-one.

> alternative is that SourceDocument has contents() and raw() methods, i.e., it "knows" its own contents and raw form.

I like this. Also, there is ongoing confusion because we use contents in some places and content in others. In SourceDocument, it's content. So, ultimately I'd vote in favor of each SourceDocument having content and raw.

edwinzhng · 2020-03-22T15:22:51Z

I like the separate fields for contents and raw too. Allowing the SourceDocument to build its own separation makes things more organized, imo.

nikhilro · 2020-03-22T15:49:34Z

For posterity, the meat of the resolution is in this comment in PR #1049: #1049 (comment)

lintool · 2020-03-22T15:51:06Z

Okay, this will be a fairly painful refactoring to get everything in this format, so we'll punt it for later.

lintool · 2020-03-22T15:55:05Z

Actually, if we take this design to the logical conclusion, should the generator also be folded into the collection?

That is, each SourceDocument knows how to generate its own corresponding Lucene document? Or, do we like the separation between Lucene and non-Lucene code?

nikhilro · 2020-03-22T16:03:36Z

I don't even see the need for any refactoring because there is an implicit assumption that content == raw unless SourceDocument for that Collection specifies raw, no?

> That is, each SourceDocument knows how to generate its own corresponding Lucene document? Or, do we like the separation between Lucene and non-Lucene code?

I like the separation between Lucene and non-Lucene code. It's nice and clean.

lintool · 2020-03-22T16:06:41Z

> I don't even see the need for any refactoring because there is an implicit assumption that content == raw unless SourceDocument for that Collection specifies raw, no?

I see, I suppose we can change SourceDocument to abstract, and make one dispatch to the other, overridable by subclasses if need be?
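
Roughly what that could look like, as a sketch only: the class names (BaseSourceDocument, PlainTextDocument, JsonBackedDocument) are made up for illustration, and the choice of raw() falling back to contents() is an assumption based on the "content == raw unless specified" point above.

```java
// Hypothetical sketch; class names are illustrative, not Anserini's.
class RawDispatchSketch {

  abstract static class BaseSourceDocument {
    // Text to be indexed for searching; every collection must supply this.
    abstract String contents();

    // Raw source form. Defaults to contents(), so collections that do not
    // distinguish the two need no changes.
    String raw() {
      return contents();
    }
  }

  // A collection where the searchable text and the raw form are the same.
  static final class PlainTextDocument extends BaseSourceDocument {
    private final String text;

    PlainTextDocument(String text) {
      this.text = text;
    }

    @Override
    String contents() {
      return text;
    }
  }

  // A collection that keeps the original record and overrides raw().
  static final class JsonBackedDocument extends BaseSourceDocument {
    private final String json;
    private final String body;

    JsonBackedDocument(String json, String body) {
      this.json = json;
      this.body = body;
    }

    @Override
    String contents() {
      return body;
    }

    @Override
    String raw() {
      return json;
    }
  }

  public static void main(String[] args) {
    BaseSourceDocument plain = new PlainTextDocument("some text");
    BaseSourceDocument json =
        new JsonBackedDocument("{\"body\": \"some text\"}", "some text");
    System.out.println(plain.raw().equals(plain.contents())); // default dispatch
    System.out.println(json.raw().equals(json.contents()));   // overridden raw()
  }
}
```

Only the collections that actually need a distinct raw form would have to override anything.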

nikhilro · 2020-03-22T16:10:30Z

> I see, I suppose we can change SourceDocument to abstract, and make one dispatch to the other, overridable by subclasses if need be?

I don't know Java too well but that sounds reasonable. cc @edwinzhng

nikhilro mentioned this issue Mar 22, 2020

Covid Passage Indexing #1049

Merged

edwinzhng mentioned this issue Mar 22, 2020

SourceDocument refactor for separate raw field #1054

Merged

edwinzhng self-assigned this Mar 23, 2020

nikhilro closed this as completed Mar 23, 2020

lintool mentioned this issue Mar 24, 2020

SourceDocument refactoring #1059

Merged

lintool added a commit that referenced this issue Apr 7, 2020

SourceDocument refactoring (#1059)

9a28a09

cf: #1048 - contents vs. raw fields of source documents

lintool mentioned this issue Apr 11, 2020

Refactoring of WashingtonPost collection for Core18 #1092

Merged

lintool added a commit that referenced this issue Apr 12, 2020

Refactoring of WashingtonPost collection for Core18 (#1092)

35f9f82

Refactoring of WashingtonPost collection based on the clarified definition of contents() and raw() in SourceDocument, per #1048.
