increase accuracy and test coverage for PlaintextRenderer #250

Merged: 12 commits, Nov 2, 2024

Conversation

jenshalm
Contributor

I've had this on my radar for quite a while: the previous implementation of PlaintextRenderer was somewhat rudimentary, had a very small test suite, and used a fallback for unknown nodes that would create unwanted index entries.

This PR aims to:

  • Significantly increase the test coverage to safeguard future maintenance - this is particularly relevant as many AST nodes in Laika implement more than one of the traits we match on, so a simple, seemingly innocent reordering of the match patterns could break things.
  • Avoid the default fallback that includes productPrefix in the index, as this would produce entries such as Rule or PageBreak which are unwanted. Instead the fallback now discards unknown nodes as the last step.
  • Include more explicit handling of special node types that do not implement the main traits like BlockContainer or TextContainer since the fallback would now discard them. The relevant additions here are:
    • Tables
    • BlockContainer types that hold child nodes in more than just their content property (e.g. Figure)
    • Image - add alt and title attributes to the index like most search engines do
    • Extract text nodes from verbatim HTML
    • Extract AST nodes originating from markup documents from the template AST (for the unlikely case someone specifies a template for the indexer)
    • Handle some special node types like TargetFormat, Selection, Fallback - see comments in the new implementation for the low level details of what they do
  • Also add more explicit exclusions to avoid unwanted entries. These are:
    • Hidden - marker trait for nodes which do not represent visual content
    • Unresolved - marker trait for temporary nodes that should not occur in the final result
    • Invalid - marker trait for invalid nodes; their presence will usually cause the transformation to fail (depending on user config)
    • Comment - search engines usually do not index those (AFAIK)
    • Raw content - this could be markup for example, where we cannot easily extract textual information
    • RuntimeMessage - this is just embedded debug info
    • NavigationList, NavigationItem, SectionInfo - these represent either unwanted entries (e.g. the headline of a different page linked to) or duplicate entries (e.g. a link to a section title on the current page)
    • All template nodes that do not represent AST from the merged text markup document
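
To illustrate why the order of patterns matters and what "discard as the last step" means, here is a minimal sketch. It uses a hypothetical toy AST (`Node`, `HiddenNote`, `Rule` and the trait definitions below are stand-ins written for this example, not Laika's real types), so treat it as an approximation of the idea rather than the actual implementation in this PR.

```scala
object PlaintextSketch {
  // Hypothetical toy AST for illustration only; these are NOT Laika's real types.
  sealed trait Node
  trait Hidden extends Node                                  // marker: no visual content
  trait TextContainer extends Node { def content: String }
  trait BlockContainer extends Node { def content: Seq[Node] }

  case class Text(content: String) extends TextContainer
  case class Section(content: Seq[Node]) extends BlockContainer
  // Implements two traits at once, so the order of the cases below matters.
  case class HiddenNote(content: String) extends TextContainer with Hidden
  case object Rule extends Node                              // matches none of the traits

  def render(node: Node): String = node match {
    // Exclusions come first; moving this case below TextContainer would leak HiddenNote.
    case _: Hidden         => ""
    case t: TextContainer  => t.content
    case b: BlockContainer => b.content.map(render).filter(_.nonEmpty).mkString("\n")
    // Fallback: discard unknown nodes instead of emitting their productPrefix.
    case _                 => ""
  }
}
```

With this ordering, `HiddenNote` and `Rule` contribute nothing to the output, while swapping the first two cases would silently index the hidden text. That is the kind of regression the larger test suite is meant to catch.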

This PR can be reviewed per commit, if you prefer to look at more bite-sized changes.

The expansion of the test suite also required a set of new helper methods:

  • Some tests now use reStructuredText instead of Markdown - some nodes we now test against are produced neither by Markdown nor by Laika's directives, so we need a more feature-rich text markup language here. (The downside is that most people are not familiar with this syntax; the upside is that it serves as a good reminder that nothing about the indexer is specific to Markdown.)
  • Some tests now run in IO - a small number of tests now check template nodes and we need the effectful transformer for applying templates.

Contributor

@valencik valencik left a comment

It made my day when I saw this PR, thank you so much for all your hard work here. ❤️
And sorry for the delayed review, it was a chaotic week.

I've gone over this a few times and it's very clear and easy to follow.
I think we're in much better shape now indexing content from Laika.
The explicit testing with both Markdown and reStructuredText is awesome because, of course, you are right: the indexer shouldn't care at all about the input format.

Thank you again, so much. This is very much appreciated.

Comment on lines +95 to +96
val cells = (table.head.content ++ table.body.content).flatMap(_.content)
renderBlocks(cells.flatMap(_.content)) + renderBlock(table.caption.content)
Contributor

That's a lot of .content 😆
It did prompt me to double check, but it all makes sense, we go from table bits to rows, to cells, to blocks.
Nice.

Contributor Author

Yes, just looking at those lines you might think the naming is unfortunate. But as much as I'd love to be more explicit, the content is always forced by implementing one of the container traits.
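
For readers following the chain of `.content` calls, the walk from table parts to rows, cells and blocks can be spelled out with named intermediate values. The case classes below are hypothetical stand-ins written for this sketch (not Laika's real classes), and joining the results with newlines is an assumption made only to keep the example observable.

```scala
object TableSketch {
  // Hypothetical stand-ins for illustration; not Laika's real classes.
  case class Paragraph(text: String)
  case class Cell(content: Seq[Paragraph])
  case class Row(content: Seq[Cell])
  case class TableHead(content: Seq[Row])
  case class TableBody(content: Seq[Row])
  case class Caption(content: Seq[Paragraph])
  case class Table(head: TableHead, body: TableBody, caption: Caption)

  // Assumption: blocks are joined with newlines in this sketch.
  def renderBlocks(blocks: Seq[Paragraph]): String =
    blocks.map(_.text).mkString("\n")

  def renderTable(table: Table): String = {
    val rows: Seq[Row]         = table.head.content ++ table.body.content // table parts -> rows
    val cells: Seq[Cell]       = rows.flatMap(_.content)                  // rows -> cells
    val blocks: Seq[Paragraph] = cells.flatMap(_.content)                 // cells -> blocks
    renderBlocks(blocks ++ table.caption.content)                         // caption comes last
  }
}
```

Every level is called `content` because each level implements a container trait, which is why the real code reads as a run of identical property names.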

def renderElement(e: Element): String = e match {

/* search engines tend to index alt and title attributes of images */
case img: Image => (img.alt.toList ++ img.title.toList).mkString(" ")
Contributor

(not a change request, just me thinking out loud)

This line of code made me realize there is a cheap and cheerful way to identify "fragments" within the "plain text", so that we can divide the text into logical pieces during highlighting (#205): delimit fragments by newlines. That effectively makes every line in the plain text format a fragment, and during highlighting we figure out which fragment best matches the query.
So in some future effort I might change this line to use .mkString("\n"), making the alt text and the title two different fragments to potentially highlight. Or maybe not, maybe it's better to have them together in one fragment. It's just nice that there is a pretty easy way to control this in this renderer, if we go that route.
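
A rough sketch of the newline-delimited fragment idea, under stated assumptions: `fragments`, `score` and `bestFragment` are hypothetical helpers invented for this example, and the scoring (counting distinct query terms found by case-insensitive substring match) is a placeholder, not how the project's highlighter works.

```scala
object FragmentSketch {
  // Every non-empty line of the rendered plain text is treated as one fragment.
  def fragments(plainText: String): Seq[String] =
    plainText.split("\n").toSeq.map(_.trim).filter(_.nonEmpty)

  // Placeholder scoring: number of distinct query terms contained in the fragment
  // (case-insensitive substring match).
  def score(fragment: String, queryTerms: Seq[String]): Int = {
    val lower = fragment.toLowerCase
    queryTerms.map(_.toLowerCase).distinct.count(term => lower.contains(term))
  }

  // The fragment with the highest score, or None if no fragment matches at all.
  def bestFragment(plainText: String, queryTerms: Seq[String]): Option[String] = {
    val scored = fragments(plainText).map(f => (f, score(f, queryTerms))).filter(_._2 > 0)
    if (scored.isEmpty) None else Some(scored.maxBy(_._2)._1)
  }
}
```

Under this scheme, choosing `mkString(" ")` or `mkString("\n")` in the renderer decides whether two pieces of text compete as separate fragments or are highlighted together.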

Contributor Author

Haven't thought about that aspect yet, but yes, makes a lot of sense. Another aspect I'm wondering about is scoring - in theory the AST model is ideal for that, since matches in headlines could score higher than matches in body text, but I'm not sure how you'd ever carry that information through a plain text renderer without using an interim format that needs to be parsed again?

Contributor

Yes, I think you're right about needing to reparse.
Because rendering in Laika requires us to output String, we're a bit limited in how we can preserve structure without requiring an additional parsing step.
But that's not the end of the world or anything. We have to tokenize the text anyways, so ideally we can handle the secondary parsing there.
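
One way to picture the "interim format that is parsed again" trade-off is the sketch below. Everything in it is hypothetical: the `h|` and `b|` line prefixes, the `Fragment` type and the weights 3 and 1 are invented for illustration, and nothing here reflects a format the project has actually chosen. It also assumes fragment text contains no newlines.

```scala
object ScoringSketch {
  sealed trait Kind
  case object Headline extends Kind
  case object Body extends Kind

  case class Fragment(kind: Kind, text: String)

  // Hypothetical interim format: one fragment per line, prefixed with "h|" or "b|".
  // Assumes the fragment text itself contains no newline.
  def render(fragments: Seq[Fragment]): String =
    fragments.map {
      case Fragment(Headline, t) => "h|" + t
      case Fragment(Body, t)     => "b|" + t
    }.mkString("\n")

  // The secondary parsing step, which could live next to tokenization.
  def parse(rendered: String): Seq[Fragment] =
    rendered.split("\n").toSeq.collect {
      case line if line.startsWith("h|") => Fragment(Headline, line.drop(2))
      case line if line.startsWith("b|") => Fragment(Body, line.drop(2))
    }

  // Made-up weights: a headline match counts three times as much as a body match.
  def weight(kind: Kind): Int = kind match {
    case Headline => 3
    case Body     => 1
  }

  def score(rendered: String, term: String): Int =
    parse(rendered)
      .filter(_.text.toLowerCase.contains(term.toLowerCase))
      .map(f => weight(f.kind))
      .sum
}
```

The renderer still returns a plain String, and the structure is recovered in a second pass. The cost is that the prefix convention has to be kept in sync between the renderer and the tokenizer.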

@valencik valencik merged commit ba9d057 into cozydev-pink:main Nov 2, 2024
10 checks passed