-
Notifications
You must be signed in to change notification settings - Fork 275
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ability to override tags processing in FlexmarkHtmlParser #313
Comments
@dmitrymurashenkov, I did not implement it originally because I needed the functionality fast and had no clue to what is involved. Now that I have added tons of processing and option flags I am in a better position to create an extension API that will work. It will take a bit of time since I am excruciatingly busy for the next couple of weeks but if you have some ideas and want to throw around options I can exchange comments or emails with you. |
I'm currently thoroughly testing FlexmarkHtmlParser and found several minor problems related to this issue. I'll report them as separate issues (because they are) and will write summary of ideas to this thread in a day or two, after I've gone through all the cases. |
@dmitrymurashenkov, much appreciated. Please don't bother with the full issue form. That is for people who think "It does not work" is a complete issue report. Just include the information you think is pertinent and I will ask for clarification if needed. |
Ok, I've tested most cases and it works! The only case that is unsupported now is lists inside table cell - the list markup is formed correctly, but then output as a single line. This is an outlier case which likely shouldn't be supported by default, but GitHub supports it with html in table cell: https://stackoverflow.com/questions/19950648/how-to-write-lists-inside-a-markdown-table I think this case shows one of the purposes of conversion process customization - ability to pass certain content though as original or slightly modified html. Regarding the API - I'm not sure what is technically possible, but seems that some straightforward approach should work fine:
There are two possible cases:
Both cases are addressed by providing jsoup Element which has an API to lookup neighbor elements. |
@dmitrymurashenkov, For your use case the simpler API works but it is not sufficient to implement core tag processors. The best way to make an extension API usable and address general needs is to use for implementing the core functionality. Effectively, dog-fooding the API by the core. I think that the custom HTML tag processors will need an HTML parser context to expose internal HTML parser state so that core tag processors can be implemented using the same API. I will do a quick next release with bug fixes and add typographic mapping and table options. I will add converter from Parser options to HTML parser options and HTML parser extensions in the release after that because it will take a bit more effort and I do not want to delay the bug fixes. I will post my progress and propose an API in this issue. |
@dmitrymurashenkov, to expand on my terse comment about context. What I am thinking is making HTML parser follow the same model as At invocation time the node customization gets a In the case of the HTML parser the nodes will be based on jsoup's Element tag name so there will not be a need for me to provide all possible tag variations as processor extension points. If I don't provide any then I cannot forget to provide some that you may need. Instead, the processor extension point will use something like the I also want to change the model for processor results so that each processor will not append to a
I want to make the above change to all renderer/formatter extensions for the same reason just need to find a way to do with an easy migration path for existing user implemented extensions. |
Agree on replacing On other points it's difficult for me to comment, because I'm not that familiar with current internals. I guess - only one way to find out :) Meanwhile here is a couple of possible additional cases for this functionality:
|
All 3 fall under the same implementation/API. Effectively, custom element handling or pass-through to default. I have a ToDo item for table to text conversion when the table has a single cell without a header.
This one can be addressed by re-use of I think this is a better option since it is difficult to get the full Markdown context when generating it from HTML and trying to do everything in HTML -> Markdown conversion will duplicate code already handling the particular use cases. This is the reason I use Formatter table for final output. It was impossible to produce formatted table during conversion. Another example, is escaping of special markdown chars for inline text just to be sure they are not combined with other special characters further in the stream. There is no way to suppress this at time of conversion. You really need the full markdown to see if the escaped char can be un-escaped without consequences. |
@dmitrymurashenkov, I took a look at the full rework and it is panning out to be extensive and since you don't need all those extras, making you wait for it makes no sense. I can quickly add extension points for you to accomplish what you need right now and when the full reworked HTML parser is out you will need to convert your current implementation to the new model. It should not be a big task since the new model will be a superset of what you need. Let me know what elements you need to override to get the job done and I can make a quick kludge/enhancement release so you don't have to wait for the rewrite. |
@vsch At the moment I use a hack replacing tagProcessors map using reflection and it covered my current cases, so I'm good and can wait for a proper future fix. |
Good to hear you are not stuck waiting. I agree, there is no need to replace one working hack with another. |
@dmitrymurashenkov, I got the new extensible HTML conversion working. The extension API is limited to custom renderers, selected by case insensitive tag with ability to override and customize existing renderers. Also has a custom link resolver to allow modifying URLs during the conversion. Branch 0.50 is now merged into master. Older version is in brach 0.42.
Now the tags which output tags and content and tags which output only their content are also definable through options or can be overridden with a custom renderer for the tag.
The extension mechanism is the same as for HTML renderer and Formatter. The converter now has a Extensions implement |
Fix for this is available. Repo updated, maven updated but may take a while to show up in maven central. |
Is your feature request related to a problem? Please describe.
I need to convert html to markdown. Input html has <iframe> tag with YouTube video. I want this tag preserved as-is, but FlexmarkHtmlParser removes it and tries to output tag content
Describe the solution you'd like
Make FlexmarkHtmlParser take option with tag processors map and method to get default processors map. User will be able to override TagParam for specific tag before conversion.
Describe alternatives you've considered
Refactor all FlexmarkHtmlParser.processXXX() methods into separate classes and make API to provide custom class for specic tag.
This seems an overkill in most cases, since html-markdown conversion is usually straightforward and doesn't need complex customization.
The text was updated successfully, but these errors were encountered: