Ability to override tags processing in FlexmarkHtmlParser #313

dmitrymurashenkov · 2019-02-24T13:37:37Z

Is your feature request related to a problem? Please describe.

I need to convert HTML to Markdown. The input HTML has an <iframe> tag with a YouTube video. I want this tag preserved as-is, but FlexmarkHtmlParser removes it and tries to output the tag's content.

Describe the solution you'd like

Make FlexmarkHtmlParser take an option holding a map of tag processors, plus a method to get the default processors map. The user will then be able to override the TagParam for a specific tag before conversion.

Describe alternatives you've considered

Refactor all FlexmarkHtmlParser.processXXX() methods into separate classes and provide an API for supplying a custom class for a specific tag.

This seems like overkill in most cases, since HTML-to-Markdown conversion is usually straightforward and doesn't need complex customization.

vsch · 2019-02-25T21:51:35Z

@dmitrymurashenkov, I did not implement it originally because I needed the functionality fast and had no clue as to what was involved. Now that I have added tons of processing and option flags, I am in a better position to create an extension API that will work.

It will take a bit of time since I am excruciatingly busy for the next couple of weeks but if you have some ideas and want to throw around options I can exchange comments or emails with you.

dmitrymurashenkov · 2019-02-26T15:26:04Z

I'm currently thoroughly testing FlexmarkHtmlParser and have found several minor problems related to this issue. I'll report them as separate issues (because they are) and will write a summary of ideas in this thread in a day or two, after I've gone through all the cases.

vsch · 2019-02-26T15:28:39Z

@dmitrymurashenkov, much appreciated.

Please don't bother with the full issue form. That is for people who think "It does not work" is a complete issue report.

Just include the information you think is pertinent and I will ask for clarification if needed.

dmitrymurashenkov · 2019-02-28T13:20:09Z

Ok, I've tested most cases and it works!

The only case that is unsupported now is lists inside a table cell: the list markup is formed correctly, but is then output as a single line. This is an outlier case which likely shouldn't be supported by default, but GitHub supports it via HTML in the table cell:

https://stackoverflow.com/questions/19950648/how-to-write-lists-inside-a-markdown-table

I think this case shows one of the purposes of customizing the conversion process: the ability to pass certain content through as original or slightly modified HTML.

Regarding the API - I'm not sure what is technically possible, but it seems that a straightforward approach should work fine:

// Proposed API sketch: build(), TAG_PROCESSOR_UL, TagProcessor and DefaultULProcessor do not exist yet
FlexmarkHtmlParser.build(new MutableDataSet().set(TAG_PROCESSOR_UL, new MyULTagProcessor()));

private static class MyULTagProcessor implements FlexmarkHtmlParser.TagProcessor {
    // running number used for the custom caption
    private int counter = 1;

    public void process(FormattingAppendable out, Element element) {
        // element.parents() returns the jsoup ancestors of this element in the parsed tree
        if (element.parents().is("table")) {
            // pass through as original HTML if anywhere inside <table>
            out.append(element.outerHtml());
        }
        else {
            // For example, we want to output a custom caption before each list:
            out.append("*List " + counter++ + "*");
            new DefaultULProcessor().process(out, element);
        }
    }
}

There are two possible cases:

We want to render an element in a custom way depending on its parent elements
We want to render an element in a custom way depending on its child elements

Both cases are addressed by providing jsoup Element which has an API to lookup neighbor elements.

vsch · 2019-03-02T16:09:57Z

@dmitrymurashenkov, For your use case the simpler API works but it is not sufficient to implement core tag processors.

The best way to make an extension API usable and address general needs is to use it for implementing the core functionality. Effectively, dog-fooding the API by the core.

I think that the custom HTML tag processors will need an HTML parser context to expose internal HTML parser state so that core tag processors can be implemented using the same API.

I will do a quick next release with bug fixes and add typographic mapping and table options.

I will add converter from Parser options to HTML parser options and HTML parser extensions in the release after that because it will take a bit more effort and I do not want to delay the bug fixes.

I will post my progress and propose an API in this issue.

vsch · 2019-03-02T16:10:08Z

@dmitrymurashenkov, to expand on my terse comment about context.

What I am thinking is making HTML parser follow the same model as Parser, HtmlRenderer, Formatter where you make a builder then build to get the parser/renderer with the same extension model as ParserExtension, RendererExtension, FormatterExtension where you provide customization for specific extensions and the extension point provides the nodes it overrides.

At invocation time the node customization gets a Context parameter which gives insight into the parsing/rendering state. For HTML this will give access to the context stack used for nested element processing.

In the case of the HTML parser the nodes will be based on jsoup's Element tag name so there will not be a need for me to provide all possible tag variations as processor extension points. If I don't provide any then I cannot forget to provide some that you may need.

Instead, the processor extension point will use something like the NodeAdaptingVisitor<> except in this case it will be based on the element tag name, not node class.
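
As a rough illustration of dispatching on tag name rather than node class, here is a minimal sketch using only the JDK. TagDispatcher, register() and render() are hypothetical names made up for this example and are not part of flexmark; processors are simplified to functions from inner text to Markdown text.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical model: processors keyed by lower-cased tag name, with a fallback
// for tags that have no dedicated processor. Not the flexmark API.
class TagDispatcher {
    private final Map<String, Function<String, String>> processors = new HashMap<>();
    private final Function<String, String> fallback;

    TagDispatcher(Function<String, String> fallback) {
        this.fallback = fallback;
    }

    // Register or override the processor for a tag name; names are stored lower-cased.
    TagDispatcher register(String tagName, Function<String, String> processor) {
        processors.put(tagName.toLowerCase(Locale.ROOT), processor);
        return this;
    }

    // Render an element's inner text with the processor selected by its tag name.
    String render(String tagName, String innerText) {
        return processors
                .getOrDefault(tagName.toLowerCase(Locale.ROOT), fallback)
                .apply(innerText);
    }

    public static void main(String[] args) {
        TagDispatcher dispatcher = new TagDispatcher(text -> text)
                .register("strong", text -> "**" + text + "**")
                .register("em", text -> "*" + text + "*");
        System.out.println(dispatcher.render("STRONG", "bold"));
        System.out.println(dispatcher.render("article", "plain"));
    }
}
```

Because the table is keyed by name, no fixed list of extension points is needed: any tag can be registered, and unregistered tags fall through to the default.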

I also want to change the model for processor results so that each processor will not append to a FormattingAppendable but will return a Collection<CharSequence> where each sequence is a rendered line. This will:

allow the customized processor to invoke the default processor, manipulate its output and return it as its own, or just return its own output. The flexibility will make re-use of default processors easier.
give each parent element access to all the child lines, which it can prefix with indentation, use to make decisions based on child content, etc. All without the convoluted mess of FormattingAppendable getting in the way.

The convoluted implementation of FormattingAppendable also makes generating output much slower than with a plain Appendable.
Instead, a LineFormattingAppendable will be used, with an interface similar to FormattingAppendable. It will:
1. provide append(...), line(), blankLine(), indent(), prefix(), etc.
2. add append(Collection<CharSequence>) for easy appending of child rendered results
3. be without the callbacks and convoluted logic implementing indentations, since these can be applied easily to individual lines once they are generated.
4. return its result as a Collection<CharSequence> where each char sequence represents a single line of output.
This will make generating the lines as easy as it is now, be simpler and faster in implementation, and have the added benefit of providing individual lines and a line count for making decisions in the parent element.
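
The line-based result model described above can be sketched with plain JDK collections. LineRendering and its methods are hypothetical names for illustration only; they are not the LineFormattingAppendable interface, which is not specified in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model: processors return rendered lines instead of appending
// to a shared output object. Not the flexmark API.
class LineRendering {
    // Stand-in for a default list processor: one Markdown line per item.
    static List<CharSequence> renderList(List<String> items) {
        List<CharSequence> lines = new ArrayList<>();
        for (String item : items) {
            lines.add("- " + item);
        }
        return lines;
    }

    // A parent element applies its prefix to child lines after they are generated.
    static List<CharSequence> prefixLines(String prefix, List<CharSequence> childLines) {
        List<CharSequence> lines = new ArrayList<>();
        for (CharSequence line : childLines) {
            lines.add(prefix + line);
        }
        return lines;
    }

    // A custom processor reuses the default output and uses its line count for a caption.
    static List<CharSequence> renderListWithCaption(List<String> items) {
        List<CharSequence> defaultLines = renderList(items);
        List<CharSequence> lines = new ArrayList<>();
        lines.add("*List of " + defaultLines.size() + " items*");
        lines.add("");
        lines.addAll(defaultLines);
        return lines;
    }

    public static void main(String[] args) {
        List<CharSequence> quoted =
                prefixLines("> ", renderListWithCaption(List.of("one", "two")));
        quoted.forEach(System.out::println);
    }
}
```

Since each processor hands back complete lines, indentation and prefixes become a simple per-line transformation in the parent, with no callbacks into the child.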

I want to make the above change to all renderer/formatter extensions for the same reason. I just need to find a way to do it with an easy migration path for existing user-implemented extensions.

dmitrymurashenkov · 2019-03-04T09:10:32Z

Agree on replacing FormattingAppendable with Collection<CharSequence>.

On other points it's difficult for me to comment, because I'm not that familiar with current internals. I guess - only one way to find out :)

Meanwhile here is a couple of possible additional cases for this functionality:

CSS passthrough - for example, if HTML element has specific class or style - output its text as bold/italic.
Element replace - sort of XSLT, for example, we have <table> with specific class and know that it contains a single column and we want to render it as list.
Skip element - this can be done before invoking flexmark, but may be simpler to just skip it during parsing based on attribute criteria.
Link redirect - manipulate certain links in some way. Make them relative/absolute, redirect to another domain, etc.

vsch · 2019-03-04T09:59:08Z

@dmitrymurashenkov,

CSS passthrough - for example, if HTML element has specific class or style - output its text as bold/italic.

Element replace - sort of XSLT, for example, we have <table> with specific class and know that it contains a single column and we want to render it as list.

Skip element - this can be done before invoking flexmark, but may be simpler to just skip it during parsing based on attribute criteria.

All 3 fall under the same implementation/API. Effectively, custom element handling or pass-through to default. I have a ToDo item for table to text conversion when the table has a single cell without a header.

Link redirect - manipulate certain links in some way. Make them relative/absolute, redirect to another domain, etc.

This one can be addressed by re-using the LinkResolver from HtmlRenderer for HTML to Markdown conversion. Another option is not to duplicate functionality: keep HTML -> Markdown conversion separate, and do any further manipulation by parsing the resulting markdown and using custom formatting for the final output.

I think this is a better option since it is difficult to get the full Markdown context when generating it from HTML and trying to do everything in HTML -> Markdown conversion will duplicate code already handling the particular use cases.

This is the reason I use Formatter table for final output. It was impossible to produce formatted table during conversion.

Another example is the escaping of special markdown chars in inline text, done just to be sure they are not combined with other special characters further in the stream. There is no way to suppress this at the time of conversion. You really need the full markdown to see if the escaped char can be un-escaped without consequences.

vsch · 2019-03-06T20:12:07Z

@dmitrymurashenkov, I took a look at the full rework and it is panning out to be extensive and since you don't need all those extras, making you wait for it makes no sense.

I can quickly add extension points for you to accomplish what you need right now and when the full reworked HTML parser is out you will need to convert your current implementation to the new model. It should not be a big task since the new model will be a superset of what you need.

Let me know what elements you need to override to get the job done and I can make a quick kludge/enhancement release so you don't have to wait for the rewrite.

dmitrymurashenkov · 2019-03-06T20:31:33Z

@vsch At the moment I use a hack replacing tagProcessors map using reflection and it covered my current cases, so I'm good and can wait for a proper future fix.
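
For readers curious what such a reflection workaround generally looks like, here is a minimal sketch against a stand-in class. LegacyParser and its string-valued map are invented for this example; only the field name tagProcessors comes from the comment above, and the real FlexmarkHtmlParser field may be declared differently.

```java
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

// Stand-in for a library class that keeps its tag table in a private field.
// This is NOT FlexmarkHtmlParser; it only mimics the situation.
class LegacyParser {
    private Map<String, String> tagProcessors = new HashMap<>();

    LegacyParser() {
        tagProcessors.put("iframe", "unwrap");
    }

    String actionFor(String tag) {
        return tagProcessors.getOrDefault(tag, "default");
    }
}

class ReflectionOverride {
    // Replaces the private map with a copy that carries one overridden entry.
    @SuppressWarnings("unchecked")
    static void override(LegacyParser parser, String tag, String action) {
        try {
            Field field = LegacyParser.class.getDeclaredField("tagProcessors");
            field.setAccessible(true);
            Map<String, String> copy = new HashMap<>((Map<String, String>) field.get(parser));
            copy.put(tag, action);
            field.set(parser, copy);
        } catch (ReflectiveOperationException e) {
            // The field was renamed or removed: this is why the approach is fragile.
            throw new IllegalStateException("tagProcessors field not accessible", e);
        }
    }

    public static void main(String[] args) {
        LegacyParser parser = new LegacyParser();
        override(parser, "iframe", "passthrough");
        System.out.println(parser.actionFor("iframe"));
    }
}
```

A workaround of this kind depends on a private field name and type, so it can break silently on any library upgrade, which is the motivation for a supported extension API.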

vsch · 2019-03-06T20:35:56Z

Good to hear you are not stuck waiting. I agree, there is no need to replace one working hack with another.

vsch · 2019-07-09T20:45:48Z

@dmitrymurashenkov, I got the new extensible HTML conversion working. The extension API is limited to custom renderers, selected by case-insensitive tag name, with the ability to override and customize existing renderers. It also has a custom link resolver to allow modifying URLs during the conversion. Branch 0.50 is now merged into master. The older version is in branch 0.42.

Add: flexmark-html2md-converter module which implements HTML to Markdown conversion with an extension API to allow customizing the conversion process. Sample: HtmlToMarkdownCustomizedSample.java

Tags that are output together with their content, and tags for which only the content is output, can now also be defined through options or overridden with a custom renderer for the tag.

UNWRAPPED_TAGS, default new String[] { "article", "address", "frameset", "section", "small", "iframe", }, defines tags whose inner html content should be rendered
WRAPPED_TAGS, default new String[] { "kbd", "var" }, defines tags which should render as outer HTML. Inner text will be converted to markdown.
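
One way to read the two options is sketched below with a plain JDK model. TagWrapping and render() are hypothetical, the tag sets are shortened, and the behavior is an interpretation of the descriptions above rather than the converter's actual implementation.

```java
import java.util.Set;

// Hypothetical model of the two options; not the flexmark implementation.
class TagWrapping {
    // Shortened stand-ins for the UNWRAPPED_TAGS and WRAPPED_TAGS defaults.
    static final Set<String> UNWRAPPED = Set.of("article", "section", "iframe");
    static final Set<String> WRAPPED = Set.of("kbd", "var");

    // convertedInner is the element's inner content, already converted to Markdown.
    static String render(String tag, String convertedInner) {
        if (WRAPPED.contains(tag)) {
            // Outer HTML is kept around the converted inner content.
            return "<" + tag + ">" + convertedInner + "</" + tag + ">";
        }
        if (UNWRAPPED.contains(tag)) {
            // The tag itself is dropped; only its inner content is output.
            return convertedInner;
        }
        // Other tags have their own renderers, which this sketch does not model.
        throw new UnsupportedOperationException("no model for <" + tag + ">");
    }

    public static void main(String[] args) {
        System.out.println(render("kbd", "Ctrl"));
        System.out.println(render("section", "Some **bold** text"));
    }
}
```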

The extension mechanism is the same as for HTML renderer and Formatter. The converter now has a builder(options).build() to create an instance of the converter with given options.

Extensions implement HtmlConverterExtension interface and register extensions through the passed builder parameter.

vsch · 2019-07-09T20:46:01Z

Fix for this is available. Repo updated, maven updated but may take a while to show up in maven central.

vsch added the 🔥 enhancement label Feb 25, 2019

dmitrymurashenkov mentioned this issue Feb 26, 2019

Ability to disable table caption in FlexmarkHtmlParser #318

Closed

vsch added this to the Version 0.40.22 milestone Mar 1, 2019

vsch modified the milestones: V 0.40.22, V 0.50.0 Mar 9, 2019

vsch added 🚰 fix available 🎉 fixed labels Jul 9, 2019

vsch mentioned this issue Jul 9, 2019

How to self modify parse method when htmltomarkdown #353

Open

vsch closed this as completed Aug 4, 2019
