
Laika integration #102

Open
armanbilge opened this issue Aug 19, 2023 · 8 comments
@armanbilge
Collaborator

An umbrella issue to discuss the Laika integration. Some work has already been done in this area.

In typelevel/Laika#495 I sketched an integration as follows:

  1. Indexing is implemented as a Laika renderer that produces an index artifact. This can be deployed as part of the site (similar to how e.g. epub / pdfs are included in the site).

  2. protosearch publishes a JS library to NPM that is also available via CDNs. This JS library can load an index file and run queries on it.

  3. A search page is added to the site which uses a small bit of JS to glue together a search bar, the protosearch.js library, and the index file.

@jenshalm
Contributor

jenshalm commented Aug 20, 2023

I had a quick look at the existing code and have a bunch of questions. I think the answers will make it easier for me to make a few useful suggestions.

  1. I noticed the use of parseUnresolved in IngestMarkdown. Was there a specific reason to avoid a fully rewritten AST? I'm asking as I see a number of issues with this approach (e.g. incomplete indexes).

  2. Is the final structure of the index in fact just one large JSON array for the entire site or did you use some interim representation here?

  3. Do you already have a concrete idea about how it will be integrated into the Helium theme? Will it be a simple search box in the left navigation pane for example? And will results be presented in a popup or on a separate page? I'm asking as the above points explicitly talk about a "search page" whereas most doctools integrate a search bar into the standard pages.

  4. And on a more general level, I assume some of the entry point APIs (e.g. IngestMarkdown) were exploratory and not necessarily meant to be the final API? For seamless integration for end users and correctness of the index a very different entry point API would be required. Just want to check whether a different public API would be fine by you, before I go into more detail.

@valencik
Contributor

Hi folks, sorry for the delay.

  1. Regarding parseUnresolved, I believe that is an old temporary hack; I'm removing it in Remove usage of parseUnresolved #108.
    One thing to note: the way I'm currently building up the index of http4s docs for the demo, I have to disable link validation, as I'm grabbing just the raw markdown docs prior to mdoc running. With link validation on we run into errors like:

unresolved internal reference: @API_URL@org/http4s/server/middleware/Throttle$$TokenBucket.html

because the @API_URL@ variable hasn't been substituted yet.

  2. The final structure of the index is actually a binary format built with scodec (rough sketch below, after this list). The json file floating around is just some metadata about the docs themselves. The index is capable of answering a query with a list of document IDs; the json file is what tells us what those doc IDs represent (title, filename, etc).
    I haven't yet given much thought to merging these two files so that we have one nice self-contained single-file index.

  3. I haven't really played around with integrating it into Helium yet. I was indeed thinking a simple search box in the left nav pane. That seems to be the most common experience.

  4. Yes, in fact all APIs here are experimental. We've been prioritizing just getting things glued together end to end. The API has grown out of necessity; it wasn't particularly "designed". Feedback on a different public API would be lovely.
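To make the scodec idea concrete, here is a minimal illustrative sketch. The Posting shape and codec names are hypothetical, not the actual protosearch format:

import scodec.Codec
import scodec.codecs._

// Hypothetical posting: a term plus the IDs of the documents containing it.
final case class Posting(term: String, docIds: List[Int])

// utf8_32 is a UTF-8 string prefixed by its 32-bit byte length;
// listOfN prefixes the list with its element count.
val postingCodec: Codec[Posting] =
  (utf8_32 :: listOfN(int32, int32)).as[Posting]

// The whole index is then just a sized list of postings.
val indexCodec: Codec[List[Posting]] =
  listOfN(int32, postingCodec)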

@jenshalm
Contributor

jenshalm commented Aug 30, 2023

No problem with the delay, I'm not building this, I'm just assisting... 🙂

Although it would probably be to your advantage if we had at least a rough design idea before Laika 1.0 goes into its RC cycle, which is only a few weeks away. The intention there is to seal the APIs for the next 127 years 🙂, so if we spot something where you'd need new hooks, it would be good to squeeze them in before that.

Unfortunately I still cannot give any concrete suggestions, as the info you gave raises a few more questions...

  1. Regarding the AST you look at, I think the only way this would ever work is by using the exact same AST that is later fed to the site's HTML renderer, for two reasons: having user APIs similar to those of other renderers, and having correct results. The link validation config for that purpose would always need to be the config provided by the end user, not some hard-coded settings/defaults within the library. In fact, the library should not parse at all! The input to your APIs would need to be the AST. If you look at the raw input, it would fail if it contains even a single user-defined directive that your lib is not aware of, for example. Likewise, the term Markdown should ideally not appear anywhere in type names, as there is no reason why this would not work for reStructuredText (yes, there are Laika users out there using it!).

  2. Regarding the index, there is one ideal format that would fit precisely into all existing API hooks, which would mean that users get familiar APIs and you have minimal work (in the sense of mostly relying on internal Laika machinery without dealing with it yourself). But I am not sure whether this is feasible with the design you have (or need to have). The one approach that would work involves some pre-processing on a per-document level that produces a single string result per document; all those strings for all the documents are then passed to a post-processor which merges them and produces one output (which can be binary). The advantages of this would be so big that even if the string result per document needed to be split again by the post-processor, it might be worth it for the simple fact that everything else would fall into place. If that's not feasible at all, your library would need to provide a hook that takes a DocumentTree and produces an index, but then you'd need to provide your own APIs for Stream/File I/O (for the output only, not the inputs), which is somewhat less convenient for users and for you, as you could not hook into Laika's existing render APIs for that.

When I looked at the code I could only see the interim structure (SubDocument) and the JSON it produces. Is there some example of how you get from those interim structures to the final binary format? It might be easier for me to recommend an approach if I understand the design a bit better.

@valencik
Contributor

valencik commented Nov 4, 2023

Hey @jenshalm, thanks for your thoughts on the matter :)
I'm hoping to get back to this in the next couple days and wanted to clarify some things.

It sounds like you're suggesting that we should have two functions:

  • an AST => String function that Laika runs for every document
  • some type of List[String] => Index function that runs as a post-processor

Does that sound about right? If so, that sounds doable to me.
What should be the initial input type to the first function? Is it laika.ast.Document (https://javadoc.io/doc/org.typelevel/laika-docs_2.12/latest/laika/ast/Document.html)?
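For concreteness, here's roughly what I'm picturing, assuming laika.ast.Document is indeed the input (a sketch with placeholder names, not working code):

import laika.ast.Document

// Phase 1: extract the indexable text from a single document's AST.
// Laika would run this for every document, potentially in parallel.
def renderDoc(doc: Document): String = ???

// Phase 2: merge the per-document strings into the final binary index.
def buildIndex(renderedDocs: List[String]): Array[Byte] = ???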

And is there a good example of the post-processor workflow you describe?

@jenshalm
Contributor

jenshalm commented Nov 5, 2023

The key issue for deeper integration into the existing APIs is only whether you can merge the two index files into one. Everything I write below assumes that this is possible. If you look at the code examples for renderers in the manual, you could then simply add the search index as the 4th renderer. Meaning in the 2nd code block you would add

val indexRenderer  = Renderer.of(SearchIndex).withConfig(config).parallel[IO].build

and in the 4th code block you would then add

val indexOp = indexRenderer.from(tree.root).toFile("whatever-your-name-pattern-is").render

This would be the integration layer for API users. For users of sbt-typelevel it can obviously then be even simpler (e.g. simply driven by a boolean setting for whether an index is created or not).

This means the only public type your glue library would need to provide is the SearchIndex type in the example above, and this would need to implement the same API as the other binary renderers (EPUB and PDF). The trait is called TwoPhaseRenderFormat and all of the work would happen in the two members interimFormat and postProcessor (the prepareTree method can most likely be a no-op in your case).

The interimFormat property is for the first step (AST to string). The type is RenderFormat, which you already implemented in the initial prototype, but it only did part of the work needed per document. It's best to get as much processing as possible into this step, as it's more optimized (it runs in parallel for each document).

The postProcessor property is for the second step (all strings to binary). It is a generic type, but the only concrete type Laika currently knows how to deal with is BinaryPostProcessor.Builder. This would get a tree structure of the strings produced by the first step and write the binary to the output. Unfortunately it will be a java.io.OutputStream that gets passed to the implementation, as the existing binary renderers have to integrate with Java APIs, but of course you can use fs2 internally. The second step works on the results for all documents and therefore does not run in parallel.
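Putting that together, a rough skeleton would look something like this (untested; do double-check the FMT type parameter and the exact signatures against the trait in the Laika sources):

object SearchIndex extends TwoPhaseRenderFormat[Formatter, BinaryPostProcessor.Builder] {

  // Phase 1: renders each document's AST to an interim string, in parallel.
  def interimFormat: RenderFormat[Formatter] = ???

  // No AST rewriting needed before rendering, so this can be a no-op.
  def prepareTree(tree: DocumentTreeRoot): Either[Throwable, DocumentTreeRoot] =
    Right(tree)

  // Phase 2: receives the interim strings for all documents and writes
  // the merged binary index to the provided OutputStream.
  def postProcessor: BinaryPostProcessor.Builder = ???
}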

If you want to look at existing implementations you can look at the EPUB renderer and the PDF renderer, but of course their processing logic is very different from what you need (and in fact very different from each other, too).

I think overall it's most likely quite straightforward compared to the complexity you have to deal with for the underlying search engine.

@valencik
Contributor

valencik commented Nov 7, 2023

Thanks again @jenshalm

I think I have a silly Formatter and TwoPhaseRenderFormat implemented in #140. These aren't real implementations, just stubs that output some plaintext.

I'm now wondering how I can use this in an existing sbt build using laika. I was poking around the laika sbt plugin but couldn't find anything obvious to me.
Is there a setting or something I can pass a TwoPhaseRenderFormat to?

@valencik
Contributor

valencik commented Nov 7, 2023

Oh, I should also confirm that I think the per-document rendering and then merging into one index approach will work fine. :)
So I think we're definitely on the right path for an integration here.

@jenshalm
Contributor

jenshalm commented Nov 8, 2023

> Oh, I should also confirm that I think the per-document rendering and then merging into one index approach will work fine.

That's excellent, and good for both the end user and the maintainers.

> I'm now wondering how I can use this in an existing sbt build using laika. I was poking around the laika sbt plugin but couldn't find anything obvious to me.

Yes, there is no corresponding hook in the plugin setup yet. I was aware of that, but since our discussion did not reach the concrete stage before 1.0 was out, and I knew a hook could be added later in 1.1 in a backwards-compatible way, I didn't do anything yet.

What you can do as a temporary workaround until such a hook exists is to create a custom task that invokes the renderer manually. Just make sure you use the configured parser for creating the AST:

val tree = Settings.parser.value.use(_.fromInput(laikaInputs.value.delegate).parse).unsafeRunSync()
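For example, a rough sketch (the task name and output path are placeholders; indexRenderer is the renderer built as in my earlier comment):

lazy val renderSearchIndex = taskKey[Unit]("Renders the protosearch index from the Laika AST")

renderSearchIndex := {
  // Use the plugin's configured parser so the AST matches the site output.
  val tree = Settings.parser.value.use(_.fromInput(laikaInputs.value.delegate).parse).unsafeRunSync()
  // Render the index from the same tree root, as shown above.
  indexRenderer.from(tree.root).toFile("target/search/searchindex.dat").render.unsafeRunSync()
}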

The downsides of this temporary workaround are: a) some unnecessary temporary boilerplate, b) the need to parse twice, once for the site output and once for the index, and c) not participating in the sbt caching that the Laika plugin taps into.

Once you are getting closer to having something you want to release as part of sbt-typelevel we can look into adding a more convenient and more deeply integrated hook to the Laika plugin.

It would add a customization option for the existing laikaSite and laikaGenerate tasks. Currently laikaSite simply delegates to laikaGenerate html epub pdf, where the presence of the 2nd and 3rd arguments is driven by the laikaIncludeEPUB and laikaIncludePDF settings. The current limitation is that this flexibility only works with known/builtin formats.

We could add a new setting laikaCustomRenderFormats with a list of case classes that provide all the information required to integrate a 3rd-party render format. It would run as part of the laikaSite task, gated by a boolean flag the user has set, and it could be run via laikaGenerate searchIndex independently of that flag.
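Purely to illustrate the shape (none of this exists yet; all names are guesses):

// Hypothetical descriptor for a 3rd-party render format.
case class CustomRenderFormat(
  alias: String,                      // e.g. "searchIndex", used by laikaGenerate
  format: TwoPhaseRenderFormat[_, _], // e.g. the SearchIndex object from above
  includeInSite: Boolean              // whether laikaSite should run it
)

laikaCustomRenderFormats += CustomRenderFormat("searchIndex", SearchIndex, includeInSite = true)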

Ideally I'd add that hook once you are getting closer to integrating with sbt-typelevel and have you testing everything based on snapshots before releasing to avoid the risk of needing to break the new API after 1.1 is out.

@valencik valencik pinned this issue Dec 7, 2023
@valencik valencik added the laika Laika integration label Dec 7, 2023
@valencik valencik self-assigned this Mar 24, 2024