-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
increase accuracy and test coverage for PlaintextRenderer #250
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It made my day when I say this PR, thank you so much for all your hard work here. ❤️
And sorry for the delayed review, it was a chaotic week.
I've gone over this a few times and it's very clear and easy to follow.
I think we're in much better shape now indexing content from Laika.
The explicit testing with both markdown and ReStructured text is awesome, because of course you are right, the Indexer shouldn't care at all about the input format.
Thank you again, so much. This is very much appreciated.
val cells = (table.head.content ++ table.body.content).flatMap(_.content) | ||
renderBlocks(cells.flatMap(_.content)) + renderBlock(table.caption.content) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a lot of .content
😆
It did prompt me to double check, but it all makes sense, we go from table bits to rows, to cells, to blocks.
Nice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, just looking at those lines you might think the naming is unfortunate. But as much as I'd love to be more explicit, the content
is always forced by implementing one of the container traits.
def renderElement(e: Element): String = e match { | ||
|
||
/* search engines tend to index alt and title attributes of images */ | ||
case img: Image => (img.alt.toList ++ img.title.toList).mkString(" ") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(not a change request, just me thinking outloud)
This line of code made me realize a cheap and cheerful way to identify "fragments" within the "plain text" in order to divide the text up into logical pieces during highlighting (#205). Which is to say, delimiting "fragments" by newlines. Effectively making every line in the plain text format a "fragment" and then, during highlighting we figure out the fragment that best matches the query.
So in some future effort I might change this line to use .mkString("\n")
, making the caption and the title to different fragments to potentially highlight. Or maybe not, maybe it's better to have them together in one fragment. It's just nice that there is a pretty easy way to control this in this renderer, if we go that route.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't thought about that aspect yet, but yes, makes a lot of sense. Another aspect I'm wondering about is scoring - in theory the AST model is ideal for that, you could have headlines score more than matches in body text, but I'm not sure how you'd ever carry that information in a plain text renderer without using an interim format that needs to be parsed again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think you're right about needing to reparse.
Because rendering in Laika requires us to output String
, we're a bit limited in how we can preserve structure without requiring an additional parsing step.
But that's not the end of the world or anything. We have to tokenize the text anyways, so ideally we can handle the secondary parsing there.
I had this on my radar for quite a while: the previous implementation of
PlaintextRenderer
was somewhat rudimentary, had a very small test suite and a fallback for unknown nodes that would create unwanted index entries.This PR aims to:
productPrefix
in the index, as this would produce entries such asRule
orPageBreak
which are unwanted. Instead the fallback now discards unknown nodes as the last step.BlockContainer
orTextContainer
since the fallback would now discard them. The relevant additions here are:BlockContainer
types that hold child nodes in more than just itscontent
property (e.g.Figure
)Image
- addalt
andtitle
attributes to the index like most search engines doTargetFormat
,Selection
,Fallback
- see comments in the new implementation for the low level details of what they doHidden
- marker trait for nodes which do not represent visual contentUnresolved
- marker trait for temporary nodes that should not occur in the final resultInvalid
- marker trait for invalid nodes, their presence will usually cause the transformation to fail (depending on user config)Comment
- search engines usually do not index those (AFAIK)RuntimeMessage
- this is just embedded debug infoNavigationList
,NavigationItem
,SectionInfo
- these represent either unwanted entries (e.g. the headline of a different page linked to) or duplicate entries (e.g. a link to a section title on the current page)This PR can be reviewed per commit, if you prefer to look at more bite-sized changes.
The expansion of the test suite also required a set of new helper methods:
IO
- a small number of tests now check template nodes and we need the effect-full transformer for applying templates.