increase accuracy and test coverage for PlaintextRenderer #250

jenshalm · 2024-10-25T12:18:06Z

I had this on my radar for quite a while: the previous implementation of PlaintextRenderer was somewhat rudimentary, had a very small test suite and a fallback for unknown nodes that would create unwanted index entries.

This PR aims to:

Significantly increase the test coverage to safeguard future maintenance - this is particularly relevant as many AST nodes in Laika implement more than one of the traits we match on, so a simple, seemingly innocent reordering of patterns matched on could break things.
Avoid the default fallback that includes productPrefix in the index, as this would produce entries such as Rule or PageBreak which are unwanted. Instead the fallback now discards unknown nodes as the last step.
Include more explicit handling of special node types that do not implement the main traits like BlockContainer or TextContainer since the fallback would now discard them. The relevant additions here are:
- Tables
- BlockContainer types that hold child nodes in more than just their content property (e.g. Figure)
- Image - add alt and title attributes to the index like most search engines do
- Extract text nodes from verbatim HTML
- Extract AST nodes originating from markup documents from the template AST (for the unlikely case someone specifies a template for the indexer)
- Handle some special node types like TargetFormat, Selection, Fallback - see comments in the new implementation for the low level details of what they do
Also add more explicit exclusions to avoid unwanted entries. These are:
- Hidden - marker trait for nodes which do not represent visual content
- Unresolved - marker trait for temporary nodes that should not occur in the final result
- Invalid - marker trait for invalid nodes, their presence will usually cause the transformation to fail (depending on user config)
- Comment - search engines usually do not index those (AFAIK)
- Raw content - this could be markup for example, where we cannot easily extract textual information
- RuntimeMessage - this is just embedded debug info
- NavigationList, NavigationItem, SectionInfo - these represent either unwanted entries (e.g. the headline of a different page linked to) or duplicate entries (e.g. a link to a section title on the current page)
- All template nodes that do not represent AST from the merged text markup document
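
To illustrate why the order of the patterns matters once a node implements more than one trait, and what a discarding fallback looks like, here is a minimal sketch. The types below (`Element`, `HiddenNote`, `Rule` and so on) are hypothetical stand-ins defined only for this example, not Laika's actual AST, and `render` is not the code from this PR.

```scala
object RenderOrderSketch {
  // Hypothetical mini-AST, loosely modelled on the trait names mentioned above.
  sealed trait Element
  trait Hidden extends Element // marker trait: no visual content
  trait TextContainer extends Element { def content: String }
  trait BlockContainer extends Element { def content: Seq[Element] }

  case class Text(content: String) extends TextContainer
  case class Paragraph(content: Seq[Element]) extends BlockContainer
  // A node implementing two of the traits we match on.
  case class HiddenNote(content: String) extends TextContainer with Hidden
  // A node implementing none of the container traits.
  case object Rule extends Element

  def render(e: Element): String = e match {
    // Exclusions come first: moving this case below TextContainer
    // would leak the text of HiddenNote into the index.
    case _: Hidden         => ""
    case t: TextContainer  => t.content
    case b: BlockContainer => b.content.map(render).filter(_.nonEmpty).mkString(" ")
    // Fallback: discard unknown nodes instead of emitting their name.
    case _                 => ""
  }
}
```

With these definitions, `RenderOrderSketch.render` applied to a `Paragraph` holding `Text("hello")`, `HiddenNote("secret")`, `Rule` and `Text("world")` yields `hello world`: the hidden note and the rule contribute nothing.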

This PR can be reviewed per commit, if you prefer to look at more bite-sized changes.

The expansion of the test suite also required a set of new helper methods:

Some tests now use reStructuredText instead of Markdown - some nodes we now test against are produced neither by Markdown nor by Laika's directives, so we need a more feature-rich text markup language here. (The downside is that most people are not familiar with this syntax; the upside is that it serves as a good reminder that nothing about the indexer is specific to Markdown.)
Some tests now run in IO - a small number of tests now check template nodes and we need the effectful transformer for applying templates.

valencik

It made my day when I saw this PR, thank you so much for all your hard work here. ❤️
And sorry for the delayed review, it was a chaotic week.

I've gone over this a few times and it's very clear and easy to follow.
I think we're in much better shape now indexing content from Laika.
The explicit testing with both Markdown and reStructuredText is awesome, because of course you are right, the Indexer shouldn't care at all about the input format.

Thank you again, so much. This is very much appreciated.

valencik · 2024-11-02T14:45:25Z

laikaIO/src/main/scala/pink/cozydev/protosearch/analysis/PlaintextRenderer.scala

+      val cells = (table.head.content ++ table.body.content).flatMap(_.content)
+      renderBlocks(cells.flatMap(_.content)) + renderBlock(table.caption.content)


That's a lot of .content 😆
It did prompt me to double check, but it all makes sense, we go from table bits to rows, to cells, to blocks.
Nice.

Yes, just looking at those lines you might think the naming is unfortunate. But as much as I'd love to be more explicit, the content is always forced by implementing one of the container traits.
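
For readers following along, the chain of `.content` calls can be spelled out with explicit intermediate types. This is only a sketch: the case classes below are hypothetical stand-ins that mirror the names in the snippet, the caption is assumed to hold blocks for simplicity, and `renderBlocks` here is a toy helper rather than the one in `PlaintextRenderer`.

```scala
object TableSketch {
  // Hypothetical stand-ins; every level exposes its children as `content`.
  case class Block(text: String)
  case class Cell(content: Seq[Block])
  case class Row(content: Seq[Cell])
  case class TableHead(content: Seq[Row])
  case class TableBody(content: Seq[Row])
  case class Caption(content: Seq[Block]) // assumption: caption modelled as blocks
  case class Table(head: TableHead, body: TableBody, caption: Caption)

  // Toy renderer: one line per block.
  def renderBlocks(blocks: Seq[Block]): String =
    blocks.map(_.text + "\n").mkString

  def renderTable(table: Table): String = {
    val rows: Seq[Row]     = table.head.content ++ table.body.content // table bits -> rows
    val cells: Seq[Cell]   = rows.flatMap(_.content)                  // rows -> cells
    val blocks: Seq[Block] = cells.flatMap(_.content)                 // cells -> blocks
    renderBlocks(blocks) + renderBlocks(table.caption.content)
  }
}
```

Naming the intermediate values makes each step of the flattening visible, at the cost of a few more lines than the two-line version in the diff.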

valencik · 2024-11-02T14:56:16Z

laikaIO/src/main/scala/pink/cozydev/protosearch/analysis/PlaintextRenderer.scala

+    def renderElement(e: Element): String = e match {
+
+      /* search engines tend to index alt and title attributes of images */
+      case img: Image => (img.alt.toList ++ img.title.toList).mkString(" ")


(not a change request, just me thinking out loud)

This line of code made me realize a cheap and cheerful way to identify "fragments" within the "plain text" in order to divide the text up into logical pieces during highlighting (#205). Which is to say, delimiting "fragments" by newlines. Effectively making every line in the plain text format a "fragment" and then, during highlighting we figure out the fragment that best matches the query.
So in some future effort I might change this line to use .mkString("\n"), making the caption and the title two different fragments to potentially highlight. Or maybe not, maybe it's better to have them together in one fragment. It's just nice that there is a pretty easy way to control this in this renderer, if we go that route.
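
A minimal sketch of the newline-delimited fragment idea, under stated assumptions: `fragments` and `bestFragment` are hypothetical helpers written for this example, and the scoring (counting term occurrences that match the query) is a deliberately naive placeholder, not how protosearch scores or highlights.

```scala
object FragmentSketch {
  // Every non-empty line of the rendered plain text is one fragment.
  def fragments(plainText: String): List[String] =
    plainText.split("\n").toList.map(_.trim).filter(_.nonEmpty)

  // Naive lowercase word splitting, for illustration only.
  private def terms(s: String): List[String] =
    s.toLowerCase.split("\\W+").toList.filter(_.nonEmpty)

  // Pick the fragment containing the most query-term occurrences;
  // ties go to the earliest fragment, no match yields None.
  def bestFragment(plainText: String, query: String): Option[String] = {
    val queryTerms = terms(query).toSet
    val scored = fragments(plainText).map(f => (f, terms(f).count(queryTerms.contains)))
    scored.filter(_._2 > 0).sortBy(-_._2).headOption.map(_._1)
  }
}
```

Under this scheme the separator chosen in the renderer directly controls granularity: joining two strings with `" "` produces a single fragment, while joining them with `"\n"` produces two.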

Haven't thought about that aspect yet, but yes, makes a lot of sense. Another aspect I'm wondering about is scoring - in theory the AST model is ideal for that, you could have headlines score more than matches in body text, but I'm not sure how you'd ever carry that information in a plain text renderer without using an interim format that needs to be parsed again?

Yes, I think you're right about needing to reparse.
Because rendering in Laika requires us to output String, we're a bit limited in how we can preserve structure without requiring an additional parsing step.
But that's not the end of the world or anything. We have to tokenize the text anyways, so ideally we can handle the secondary parsing there.
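
One way the "secondary parsing during tokenization" could look, purely as a thought experiment: the renderer prefixes each line with a kind marker, and the tokenizer strips the marker and attaches a weight. Everything here is an assumption made for illustration, including the tab-separated line format, the `h`/`p` markers, the weights and the `WeightedToken` type; none of it exists in Laika or protosearch.

```scala
object WeightedLinesSketch {
  // Hypothetical interim format: "<kind>\t<text>", one entry per line.
  case class WeightedToken(term: String, weight: Double)

  // Arbitrary example weights: headline matches count double.
  private val weights = Map("h" -> 2.0, "p" -> 1.0)

  // What a renderer would emit for a single node.
  def renderLine(kind: String, text: String): String = s"$kind\t$text"

  // The tokenizer doubles as the parser for the interim format.
  def tokenize(rendered: String): List[WeightedToken] =
    rendered.split("\n").toList.flatMap { line =>
      line.split("\t", 2) match {
        case Array(kind, text) =>
          val w = weights.getOrElse(kind, 1.0)
          text.toLowerCase.split("\\W+").toList.filter(_.nonEmpty).map(WeightedToken(_, w))
        case _ => Nil // line without a marker: ignored in this sketch
      }
    }
}
```

The renderer still returns a plain String, and the structure survives only as a per-line prefix that the tokenizer has to understand, which is exactly the reparsing cost discussed above.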

jenshalm added 12 commits October 23, 2024 05:26

simplify PlaintextRenderer - remove Content indirection

0133ccf

add tests for block quotes and definition lists + rst transformer

e85729f

PlaintextRenderer - handle ListContainer

7729f8e

PlaintextRenderer - handle BlockContainer

d547ee1

PlaintextRenderer - handle SpanContainer + verbatim HTML

9dd9979

PlaintextRenderer - handle ElementContainer

975e0b7

PlaintextRenderer - handle TextContainer

52299fc

PlaintextRenderer - exclude some special marker traits

15f7992

PlaintextRenderer - handle tables

82624a6

PlaintextRenderer - handle non-container nodes

f0a063a

PlaintextRenderer - handle template spans

a152caa

ignore "possible missing interpolator" in PlaintextRendererSuite

fb34009

valencik approved these changes Nov 2, 2024

valencik merged commit ba9d057 into cozydev-pink:main Nov 2, 2024
10 checks passed
