How to self modify parse method when htmltomarkdown #353

Open
gaofeiseu opened this issue Jun 25, 2019 · 5 comments

@gaofeiseu

Is your feature request related to a problem? Please describe.
Hi, I'm from China. flexmark is a really good tool, but during development I ran into a problem.
I need to convert HTML to Markdown. Some tags in the HTML have an unusual src, like this:
<img src="//img.alicdn.com/tfscom/TB1mR4xPpXXXXXvapXXXXXXXXXX.jpg" >
Such a src is not converted to Markdown that behaves correctly.

Describe the solution you'd like
How can I change how the src of an img tag is handled, for example through an extension option, so that
<img src="//abc.com/cde/efg.jpg" >
is converted to
![](https://abc.com/cde/efg.jpg)
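
To make the requested result concrete, here is a minimal sketch in plain Java that post-processes Markdown text with a regular expression. It does not use any flexmark API; the class name `ProtocolRelativeImages` and the method `addHttps` are hypothetical names chosen for illustration. Because it works on raw text, it would also rewrite matching image syntax inside code blocks.

```java
import java.util.regex.Pattern;

class ProtocolRelativeImages {
    // Matches a Markdown image whose destination starts with "//".
    // Group 1 is the "![alt](" part, group 2 is the protocol-relative URL.
    private static final Pattern IMAGE =
            Pattern.compile("(!\\[[^\\]]*\\]\\()(//[^)\\s]+)");

    /** Hypothetical helper: inserts "https:" before protocol-relative image URLs. */
    static String addHttps(String markdown) {
        return IMAGE.matcher(markdown).replaceAll("$1https:$2");
    }

    public static void main(String[] args) {
        System.out.println(addHttps("![](//abc.com/cde/efg.jpg)"));
    }
}
```

Image URLs that already carry a scheme are left unchanged, since the pattern only matches destinations beginning with `//`.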

Describe alternatives you've considered
Is there an extension option for this, or an existing option that I have overlooked?

Additional context

@vsch
Owner

vsch commented Jun 25, 2019

@gaofeiseu, I tried the HTML you gave and the converted Markdown seems to be correct. Below, the Markdown is the first line and the HTML is the last:

![](//img.alicdn.com/tfscom/TB1mR4xPpXXXXXvapXXXXXXXXXX.jpg)
.
<img src="//img.alicdn.com/tfscom/TB1mR4xPpXXXXXvapXXXXXXXXXX.jpg" >

Can you create a small test with options which does not work for you?

You can use the sample as a starting point and add the configuration you use in your code:

HtmlToMarkdownSample.java

@vsch
Owner

vsch commented Jun 25, 2019

@gaofeiseu, sorry, I just realized that what you really wanted was to add the "https:" prefix to the image URL if it is missing.

The easiest way to do this in the current implementation is to use the standard HTML parser to get the Markdown, then parse the Markdown and replace the URLs in the AST with what you want before passing the AST document node to the formatter, which will output the changed Markdown.

The sample FormatterWithMods.java shows how to change the URLs in the AST so that the formatted Markdown has replaced URLs.

All you need to do is replace the logic in FormatterWithMods.java: Lines 68-71 with:

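            // Only touch URLs whose page reference starts with "/"
            if (node.getPageRef().startsWith("/")) {
                node.setUrlChars(PrefixedSubSequence.of("https:", node.getPageRef()));
                // Rebuild the node's text from its segments so it reflects the new URL
                node.setChars(SegmentedSequence.of(Arrays.asList(node.getSegmentsForChars())));
            }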

This prefixes every URL that starts with / with https:
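
One caveat with the check: `startsWith("/")` also matches root-relative paths such as `/images/a.png`, which would become `https:/images/a.png`. If only protocol-relative URLs should change, the test can be narrowed to `//`. The rule is sketched here on a plain `String`, without any flexmark types; `UrlPrefix` and `prefixHttps` are hypothetical names used only for illustration.

```java
class UrlPrefix {
    /** Hypothetical helper: adds "https:" only to protocol-relative URLs. */
    static String prefixHttps(String url) {
        // "//host/path" is protocol-relative: adding a scheme makes it absolute.
        if (url.startsWith("//")) {
            return "https:" + url;
        }
        // "/path" is root-relative and names no host, so it is left unchanged,
        // as is any URL that already has a scheme.
        return url;
    }

    public static void main(String[] args) {
        System.out.println(prefixHttps("//abc.com/cde/efg.jpg"));
        System.out.println(prefixHttps("/images/a.png"));
    }
}
```

In the AST visitor, the same narrowing would amount to testing for `"//"` instead of `"/"` in the condition.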

@gaofeiseu
Author

@vsch
Thanks a lot for your patience! Using the standard HTML parser to get Markdown from the HTML content is what I have already done. Do you mean that I then need to parse the Markdown to HTML and replace the URLs with a method similar to the one in your demo code, FormatterWithMods.java? Would I then still need to convert the modified HTML back to Markdown?
I agree this would be a solution, but as you can see, it requires too many conversions between HTML and Markdown.
Is there another solution that is more lightweight, with fewer conversions, going directly from HTML to Markdown?

@vsch
Owner

vsch commented Jun 26, 2019

@gaofeiseu, what you need to do is simply convert the HTML to Markdown, parse the Markdown to an AST, replace the URLs in the AST, and render the AST as Markdown using the formatter. This combines the two samples I mentioned into a single process.

If you take the modified FormatterWithMods you can see the needed steps: FormatterWithMods2.java

The current HTML to Markdown implementation is not extensible, so there is no easy way to modify the Markdown it generates. I am working on a new version that supports extensions, similar to the HTML Renderer and Markdown Formatter. It will allow some customization of the generated Markdown without needing to re-parse it, but it is not yet available.

@vsch
Owner

vsch commented Jul 9, 2019

@gaofeiseu, a new module with an extension API for HTML to Markdown conversion is now implemented.

See #313; the last comment has a link to a sample which modifies some link URLs during conversion.
