HTML-to-Markdown converter that adaptively preserves HTML when needed (eg. when center-aligning, or resizing images)

EvitanRelta/htmlarkdown

HTMLarkdown


HTMLarkdown is an HTML-to-Markdown converter that can output HTML-syntax when required,
such as when center-aligning or resizing images:

(image: Switching to HTML showcase)


How is this different?

Switching to HTML-syntax

Whenever elements cannot be represented in markdown-syntax, HTMLarkdown will switch to HTML-syntax:

Input HTML:

<h1>Normal-heading is <strong>boring</strong></h1>

<h1 align="center">
  Centered-heading is <strong>da wae</strong>
</h1>

<p><img src="https://image.src" /></p>

<p><img width="80%" src="https://image.src" /></p>

Output Markdown:

# Normal-heading is **boring**

<h1 align="center">
  Centered-heading is <b>da wae</b>
</h1>

![](https://image.src)

<img width="80%" src="https://image.src" />

Note: The HTML-switching is controlled by the rules' Rule.toUseHtmlPredicate.
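
To make the idea of such a predicate concrete, here is a minimal sketch. It does NOT use the library's real types or the actual Rule.toUseHtmlPredicate signature: SketchElement, htmlOnlyAttributes and toUseHtmlSketch are hypothetical names, and the attribute list is an assumption based on the examples above.

```typescript
// Hypothetical, simplified stand-in for a DOM element (not the library's types).
interface SketchElement {
    tagName: string
    attributes: Record<string, string>
}

// Assumed list of attributes that plain markdown cannot express, per tag.
const htmlOnlyAttributes: Record<string, string[]> = {
    H1: ['align'],
    P: ['align'],
    IMG: ['width', 'height'],
}

// Returns true when the element carries an attribute markdown can't represent.
function toUseHtmlSketch(element: SketchElement): boolean {
    const blockers = htmlOnlyAttributes[element.tagName] ?? []
    return blockers.some((attr) => attr in element.attributes)
}

const plainImage: SketchElement = {
    tagName: 'IMG',
    attributes: { src: 'https://image.src' },
}
const resizedImage: SketchElement = {
    tagName: 'IMG',
    attributes: { src: 'https://image.src', width: '80%' },
}

const plainNeedsHtml = toUseHtmlSketch(plainImage) // false: markdown image syntax is enough
const resizedNeedsHtml = toUseHtmlSketch(resizedImage) // true: the width forces an <img> tag
```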


But HTMLarkdown tries to use as little HTML-syntax as possible, mixing markdown and HTML if needed:

Input HTML:

<blockquote>
  <p align="center">
    Centered-paragraph
  </p>
  <p>Below is a horizontal-rule in blockquote:</p>
  <hr>
</blockquote>

Output Markdown:

> <p align="center">
>   Centered-paragraph
> </p>
> Below is a horizontal-rule in blockquote:
> 
> <hr>

Depending on the situation, HTMLarkdown will switch between markdown's backslash-escaping and HTML-escaping:

Input HTML:

<!-- In markdown -->
<p>&lt;TAG&gt;, **NOT BOLD**</p>

<!-- In in-line HTML -->
<p>
  <sup>&lt;TAG&gt;, **NOT BOLD**</sup>
</p>

<!-- In block HTML -->
<p align="center">
  &lt;TAG&gt;, **NOT BOLD**
</p>

Output Markdown:

\<TAG>, \*\*NOT BOLD\*\*

<sup>\<TAG>, \*\*NOT BOLD\*\*</sup>

<p align="center">
  &lt;TAG>, **NOT BOLD**
</p>
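
The escaping switch above can be sketched roughly like this. It's only an illustration: EscapeContext, escapeForMarkdownSketch, escapeForHtmlSketch and escapeTextSketch are hypothetical names, the escaped character sets are assumptions, and the library's real text-processes are likely more thorough.

```typescript
// Where the text will end up in the output (illustrative names).
type EscapeContext = 'markdown' | 'inline-html' | 'block-html'

// Backslash-escape characters that markdown would otherwise interpret.
function escapeForMarkdownSketch(text: string): string {
    return text.replace(/[\\*_`<]/g, '\\$&')
}

// Entity-escape characters that would start an entity or a tag in raw HTML.
function escapeForHtmlSketch(text: string): string {
    return text.replace(/&/g, '&amp;').replace(/</g, '&lt;')
}

// Assumption: renderers still parse markdown inside in-line HTML,
// but treat the contents of block HTML as raw HTML.
function escapeTextSketch(text: string, context: EscapeContext): string {
    return context === 'block-html'
        ? escapeForHtmlSketch(text)
        : escapeForMarkdownSketch(text)
}

const sampleText = '<TAG>, **NOT BOLD**'
const inMarkdown = escapeTextSketch(sampleText, 'markdown')
// => '\<TAG>, \*\*NOT BOLD\*\*'
const inBlockHtml = escapeTextSketch(sampleText, 'block-html')
// => '&lt;TAG>, **NOT BOLD**'
```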

Handling of edge cases

HTMLarkdown adds separators between adjacent lists to prevent markdown-renderers from combining them:

Input HTML:

<ul>
  <li>List 1 > item 1</li>
  <li>List 1 > item 2</li>
</ul>
<ul>
  <li>List 2 > item 1</li>
  <li>List 2 > item 2</li>
</ul>

Output Markdown:

- List 1 > item 1
- List 1 > item 2

<!-- LIST_SEPARATOR -->

- List 2 > item 1
- List 2 > item 2
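
As a rough sketch of the separator idea (not the library's actual code), adjacent lists can be joined with an HTML comment so a renderer sees two lists instead of one. joinAdjacentListsSketch is a hypothetical helper that only handles flat, unordered lists:

```typescript
// Each inner array is one list; each string is one list-item's text.
function joinAdjacentListsSketch(lists: string[][]): string {
    // The HTML comment renders as nothing, but stops the lists from merging.
    const separator = '\n\n<!-- LIST_SEPARATOR -->\n\n'
    return lists
        .map((items) => items.map((item) => `- ${item}`).join('\n'))
        .join(separator)
}

const joinedLists = joinAdjacentListsSketch([
    ['List 1 > item 1', 'List 1 > item 2'],
    ['List 2 > item 1', 'List 2 > item 2'],
])
```

A single list comes out unchanged, since the separator is only placed between lists.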

And more!
But this section is getting too long so...


Installation

npm install htmlarkdown

Usage

Markdown conversion (either from Element or string)

import { HTMLarkdown } from 'htmlarkdown'

/** Convert an element! */
const htmlarkdown = new HTMLarkdown()
const container = document.getElementById('container')
console.log(container.outerHTML)
// => '<div id="container"><h1>Heading</h1></div>'
htmlarkdown.convert(container)
// => '# Heading'


/** 
 * Or a HTML string! 
 * Whichever u prefer. It's 2022, I don't judge :^)
 */
const htmlString = `
<h1>Heading</h1>
<p>Paragraph</p>
`
const htmlStrWithContainer = `<div>${htmlString}</div>`
htmlarkdown.convert(htmlString)
// Set 2nd param 'hasContainer' to true, for container-wrapped string.
htmlarkdown.convert(htmlStrWithContainer, true)
// Both output => '# Heading\n\nParagraph'

Note: If an element is given to convert, it's deep-cloned before any processing/conversion.
Thus, you don't have to worry about it mutating the original element :)


Configuring

/** Configure when creating an instance. */
const htmlarkdown = new HTMLarkdown({
    htmlEscapingMode: '&<>',
    maxPrettyTableWidth: Number.POSITIVE_INFINITY,
    addTrailingLinebreak: true
})

/** Or on an existing instance. */
htmlarkdown.options.maxPrettyTableWidth = -1

Plugins

Plugins are of type (htmlarkdown: HTMLarkdown): void.
They take in a HTMLarkdown instance and configure it by mutating it.

There are 2 plugin-options available in the options object: preloadPlugins and plugins.
The difference is:

  • preloadPlugins loads the plugins first, before your other options (like "presets"),
    allowing you to overwrite the plugins' changes:
    const enableTrailingLinebreak: Plugin = (htmlarkdown) => {
        htmlarkdown.options.addTrailingLinebreak = true
    }
    const htmlarkdown = new HTMLarkdown({
        addTrailingLinebreak: false,
        preloadPlugins: [enableTrailingLinebreak],
    })
    htmlarkdown.options.addTrailingLinebreak // false
  • plugins loads the plugins after your other options.
    Meaning, plugins can overwrite your options.
    const enableTrailingLinebreak: Plugin = (htmlarkdown) => {
        htmlarkdown.options.addTrailingLinebreak = true
    }
    const htmlarkdown = new HTMLarkdown({
        addTrailingLinebreak: false,
        plugins: [enableTrailingLinebreak],
    })
    htmlarkdown.options.addTrailingLinebreak // true
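
The load order described above can be modelled with a toy class. This is a sketch of the ordering only, not HTMLarkdown's real constructor: SketchConverter, SketchOptions and SketchPlugin are hypothetical stand-ins, and the default value is an assumption.

```typescript
interface SketchOptions {
    addTrailingLinebreak: boolean
}

type SketchPlugin = (converter: SketchConverter) => void

interface SketchInit extends Partial<SketchOptions> {
    preloadPlugins?: SketchPlugin[]
    plugins?: SketchPlugin[]
}

class SketchConverter {
    options: SketchOptions = { addTrailingLinebreak: false } // assumed default

    constructor(init: SketchInit = {}) {
        const { preloadPlugins = [], plugins = [], ...userOptions } = init
        preloadPlugins.forEach((plugin) => plugin(this)) // 1. presets go first
        Object.assign(this.options, userOptions) // 2. then the user's own options
        plugins.forEach((plugin) => plugin(this)) // 3. plugins get the last word
    }
}

const enableTrailingLinebreakSketch: SketchPlugin = (converter) => {
    converter.options.addTrailingLinebreak = true
}

const viaPreload = new SketchConverter({
    addTrailingLinebreak: false,
    preloadPlugins: [enableTrailingLinebreakSketch],
})
const viaPlugins = new SketchConverter({
    addTrailingLinebreak: false,
    plugins: [enableTrailingLinebreakSketch],
})

const preloadResult = viaPreload.options.addTrailingLinebreak // false (option wins)
const pluginsResult = viaPlugins.options.addTrailingLinebreak // true (plugin wins)
```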

You can also load plugins on existing instances:

htmlarkdown.loadPlugins([myPlugin])

Making a copy of an instance

The conversion done by a HTMLarkdown instance depends solely on its options property.
Meaning, you can create a copy of an instance like this:

const htmlarkdown = new HTMLarkdown()
const copy = new HTMLarkdown(htmlarkdown.options)

Configuring rules/processes

See the "How it works" section for info on what the rules/processes do.

/**
 * Overwriting default rules/processes.
 * (does NOT include the defaults)
 */
const htmlarkdown = new HTMLarkdown({
    preProcesses: [myPreProcess1, myPreProcess2],
    rules: [myRule1, myRule2],
    textProcesses: [myTextProcess1, myTextProcess2],
    postProcesses: [myPostProcess1, myPostProcess2]
})

/**
 * Adding on to default rules/processes.
 * (includes the defaults)
 */
const htmlarkdown = new HTMLarkdown()
htmlarkdown.addPreProcess(myPreProcess)
htmlarkdown.addRule(myRule)
htmlarkdown.addTextProcess(myTextProcess)
htmlarkdown.addPostProcess(myPostProcess)

How it works

HTMLarkdown has 3 distinct phases:

  1. Pre-processing
    The container-element that's received (and deep-cloned) by the convert method is passed consecutively to each PreProcess in options.preProcesses.

  2. Conversion
    The pre-processed container-element is then recursively converted to markdown.
    Elements are converted by the Rules in options.rules.
    Text-nodes are converted by the TextProcesses in options.textProcesses.
    The rules' and text-processes' output strings are then appended to each other, to give the raw markdown.

  3. Post-processing
    The raw markdown string is then passed consecutively to each PostProcess in options.postProcesses, to give the final markdown.
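
The 3 phases above can be sketched as a tiny pipeline over a toy node tree. Everything here is a simplified assumption: SketchNode, SketchRule, SketchPipeline and convertSketch are hypothetical names, and the real PreProcess/Rule/TextProcess/PostProcess types work on DOM nodes and carry more information.

```typescript
// Toy node tree standing in for the DOM (hypothetical types).
type SketchNode =
    | { kind: 'text'; text: string }
    | { kind: 'element'; tag: string; children: SketchNode[] }
type SketchElementNode = Extract<SketchNode, { kind: 'element' }>

interface SketchRule {
    matches(element: SketchElementNode): boolean
    toMarkdown(element: SketchElementNode, innerContent: string): string
}

interface SketchPipeline {
    preProcesses: Array<(root: SketchElementNode) => SketchElementNode>
    rules: SketchRule[]
    textProcesses: Array<(text: string) => string>
    postProcesses: Array<(markdown: string) => string>
}

// Phase 2: recursively convert a node, children first.
function convertNodeSketch(node: SketchNode, pipeline: SketchPipeline): string {
    if (node.kind === 'text') {
        return pipeline.textProcesses.reduce((text, step) => step(text), node.text)
    }
    const element: SketchElementNode = node
    const inner = element.children
        .map((child) => convertNodeSketch(child, pipeline))
        .join('')
    const rule = pipeline.rules.find((candidate) => candidate.matches(element))
    // No matching rule: pass the inner content through unchanged.
    return rule ? rule.toMarkdown(element, inner) : inner
}

function convertSketch(container: SketchElementNode, pipeline: SketchPipeline): string {
    // Phase 1: each pre-process receives the previous one's output.
    const prepared = pipeline.preProcesses.reduce((root, step) => step(root), container)
    const rawMarkdown = convertNodeSketch(prepared, pipeline)
    // Phase 3: each post-process receives the previous one's output.
    return pipeline.postProcesses.reduce((markdown, step) => step(markdown), rawMarkdown)
}

const sketchPipeline: SketchPipeline = {
    preProcesses: [],
    rules: [
        { matches: (el) => el.tag === 'h1', toMarkdown: (_el, inner) => `# ${inner}\n\n` },
        { matches: (el) => el.tag === 'strong', toMarkdown: (_el, inner) => `**${inner}**` },
    ],
    textProcesses: [(text) => text.replace(/\s+/g, ' ')],
    postProcesses: [(markdown) => markdown.trim()],
}

const sketchTree: SketchElementNode = {
    kind: 'element',
    tag: 'div',
    children: [{
        kind: 'element',
        tag: 'h1',
        children: [
            { kind: 'text', text: 'Normal-heading is ' },
            { kind: 'element', tag: 'strong', children: [{ kind: 'text', text: 'boring' }] },
        ],
    }],
}

const sketchOutput = convertSketch(sketchTree, sketchPipeline)
// => '# Normal-heading is **boring**'
```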

Rule-processes flowchart
(image: the general conversion flow of HTMLarkdown)


Contributing

Bugs

HTMLarkdown is still under development, so there'll likely be bugs.

So the easiest way to contribute is to submit an issue (with the bug label), especially for any incorrect markdown-conversions :)

For any incorrect markdown-conversions, state the:

  • input HTML
  • current incorrect markdown output
  • expected markdown output

New conversions, ideas, features, tests

If you have any new element-conversions / ideas / features / tests that you think should be added, leave an issue with the feature or improve label!

  • feature label is for new features
  • improve label is for improvements on existing features

Understandably, there are gray areas on what is a "feature" and what is an "improvement". So just go with whichever seems more appropriate :)


Other markdown specs

Currently, HTMLarkdown has been designed to output markdown for GitHub specifically (ie. GFM).
BUT, if there's another markdown spec. that you'd like to design for (maybe as a plugin?), do leave an issue/discussion :D


Coding-related stuff

Code-formatting is handled by Prettier, so no need to worry bout it :)

Any new feature should

  • be documented via TSDoc
  • come with new unit-tests for them
  • and should pass all new/existing tests

As for which merging method to use, check out the discussion.


Contributors

So far it's just me, so pls send help! :^)


Roadmap

If you've any new ideas / features, check out the Contributing section for it!


Element conversions

Block-elements:

  • Headings (For now, only ATX-style)
  • Paragraph
  • Codeblock
  • Blockquote
  • Lists
    (ordered, unordered, tight and loose)
  • (GFM) Table
  • (GFM) Task-list

    (Below are some planned block-elements that don't have a markdown-equivalent)
  • <span> (handled by a noop-rule)
  • <div> (For now, handled by a noop-rule)
  • Definition list (ie. <dl>, <dt>, <dd>)
  • Collapsible section (ie. <details>)

Text-formattings:

  • Bold (For now, only outputs in asterisks **BOLD**)
  • Italic (For now, only outputs in asterisks *ITALIC*)
  • (GFM) Strikethrough
  • Code
  • Link (For now, only inline links)
  • Superscript (ie. <sup>)
  • Subscript (ie. <sub>)
  • Underline (ie. <u>, <ins>)
    (didn't know underlines were possible till recently)

Misc:

  • Images (For now, only inline links)
  • Horizontal-rule (ie. <hr>)
  • Linebreaks (ie. <br>)
  • Preserved HTML comments (Issue #25) (eg. <!-- COMMENT -->)

Features to be added:

  • Custom id attributes
    Go to [section with id](#my-section)
    
    <p id="my-section">
      My section
    </p>
  • Reversing GitHub's Issue/PR autolinks
    Input HTML:
    <p>
      Issue autolink: 
      <a href="https://github.com/user/repo/issues/7">#7</a>
    </p>
    Output Markdown:
    Issue autolink: #7
  • Ability to customise how codeblock's syntax-highlighting language is obtained from the <pre><code> elements
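
Since the autolink-reversal above is still only planned, here's one way it could work, purely as a sketch. reverseIssueAutolinkSketch is a hypothetical helper, and it assumes the link points at the same repo the markdown will live in (it doesn't check that):

```typescript
// Collapse a GitHub issue/PR link back to '#N' when its text is already '#N'.
// Otherwise, fall back to a normal inline markdown link.
function reverseIssueAutolinkSketch(href: string, text: string): string {
    const match = /^https:\/\/github\.com\/[^/]+\/[^/]+\/(?:issues|pull)\/(\d+)$/.exec(href)
    if (match !== null && text === `#${match[1]}`) {
        return text
    }
    return `[${text}](${href})`
}

const reversedAutolink = reverseIssueAutolinkSketch(
    'https://github.com/user/repo/issues/7',
    '#7'
)
// => '#7'
```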

noop-rule:
Noop-rules only pass on their converted inner-contents to their parents.
They themselves don't have any markdown conversions, not even in HTML-syntax.

License

The MIT License (MIT).
So it's freeeeeee