Markdig supports parsing trivia characters and tracks the source position of these characters. This gives the ability to parse a document and then render a slightly changed document back. Without tracking trivia characters, the renderer must make all kinds of assumptions on newlines, tabs, whitespace characters and other document details.
To use this functionality, set the optional trackTrivia parameter to true when using the static Markdown class:
MarkdownDocument markdownDocument = Markdown.Parse(inputMarkdown, trackTrivia: true);
You will get a parse tree where Block and Inline instances now have various Trivia* properties.
To write a document to Markdown using this tree, use the RoundtripRenderer:
var sw = new StringWriter();
var rr = new RoundtripRenderer(sw);
rr.Write(markdownDocument);
var outputMarkdown = sw.ToString();
You should expect outputMarkdown to be equal to inputMarkdown.
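The two snippets above can be combined into one small program. This is a minimal sketch, assuming the Markdig NuGet package is referenced and that RoundtripRenderer lives in the Markdig.Renderers.Roundtrip namespace; the class and method names (RoundtripExample, Roundtrip) are made up for illustration.

```csharp
using System;
using System.IO;
using Markdig;
using Markdig.Renderers.Roundtrip; // assumed namespace of RoundtripRenderer
using Markdig.Syntax;

public static class RoundtripExample
{
    // Parses Markdown with trivia tracking and writes the tree back to Markdown.
    public static string Roundtrip(string inputMarkdown)
    {
        MarkdownDocument document = Markdown.Parse(inputMarkdown, trackTrivia: true);

        var writer = new StringWriter();
        var renderer = new RoundtripRenderer(writer);
        renderer.Write(document);
        return writer.ToString();
    }

    public static void Main()
    {
        string input = "p\n\np\n";
        string output = Roundtrip(input);

        // Compare with ordinal semantics so that every character must match.
        bool same = string.Equals(input, output, StringComparison.Ordinal);
        Console.WriteLine(same ? "roundtrip preserved the input" : "roundtrip changed the input");
    }
}
```

Comparing the two strings ordinally is a deliberate choice here: a roundtrip is only useful if whitespace and newlines survive byte for byte.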
For a simple test showcasing the feature, see TestExample.cs.
Trivia are not specified by the CommonMark standard. As such, any implementation decides for itself which tree nodes trivia are attached to.
Trivia characters are:
- newlines: \n, \r, \r\n
- \f (form feed), \v (vertical tab)
- unescaped string characters
Blocks almost always end with a newline, therefore the Block class has it defined as a property:
/// <summary>
/// The last newline of this block
/// </summary>
public NewLine NewLine { get; set; }
Consider a very simple valid Markdown document (for clarity's sake, the \n characters are shown explicitly):
p\n
\n
p\n
The above document consists of 5 characters: p, \n, \n, p, \n, in sequence.
Obviously, each of the two p characters belongs to its own paragraph block.
The \n right next to each p is easy: we attach it to the paragraph block of that p.
However, it is not clear what we should do with the middle \n: should it be attached to the first p or the second p?
Let's look at a different example:
\n
p\n
\n
Here, we only have one (paragraph) block, and thus must attach both the first \n and the last \n to that paragraph block.
The Block class therefore has LinesBefore and LinesAfter defined:
/// <summary>
/// Gets or sets the empty lines occurring before this block.
/// Trivia: only parsed when <see cref="MarkdownPipeline.TrackTrivia"/> is enabled, otherwise null.
/// </summary>
public List<StringSlice> LinesBefore { get; set; }
/// <summary>
/// Gets or sets the empty lines occurring after this block.
/// Trivia: only parsed when <see cref="MarkdownPipeline.TrackTrivia"/> is enabled, otherwise null.
/// </summary>
public List<StringSlice> LinesAfter { get; set; }
The choice of where to attach the middle \n from the first example is arbitrary.
When parsing, it is simpler to attach it to the first occurring block, so that is what Markdig does.
Rule: Newlines are attached to the first occurring node
The parse tree of the first example then becomes:
- paragraph block
  - p
  - newline: \n
  - after: \n
- paragraph block
  - p
  - newline: \n
In the second example, the parse tree is:
- paragraph block
  - p
  - newline: \n
  - before: \n
  - after: \n
Stated differently: Blocks almost always have a newline, often have trivia after and sometimes have trivia before.
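To see how the newlines of the first example are distributed, the properties shown earlier can be printed per paragraph. This is a sketch only: it assumes the Descendants extension method from the Markdig.Syntax namespace, and the class and method names (BlockNewlineExample, Describe) are hypothetical. No particular counts are claimed here; run it to inspect what the parser produced.

```csharp
using System;
using System.Collections.Generic;
using Markdig;
using Markdig.Syntax;

public static class BlockNewlineExample
{
    // Returns one summary line per paragraph block found in the document.
    public static List<string> Describe(string markdown)
    {
        MarkdownDocument document = Markdown.Parse(markdown, trackTrivia: true);
        var summaries = new List<string>();

        foreach (ParagraphBlock paragraph in document.Descendants<ParagraphBlock>())
        {
            // LinesBefore and LinesAfter may be null, so fall back to zero.
            int before = paragraph.LinesBefore?.Count ?? 0;
            int after = paragraph.LinesAfter?.Count ?? 0;
            summaries.Add($"newline={paragraph.NewLine}, linesBefore={before}, linesAfter={after}");
        }
        return summaries;
    }

    public static void Main()
    {
        foreach (string summary in Describe("p\n\np\n"))
        {
            Console.WriteLine(summary);
        }
    }
}
```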
Keep in mind that paragraphs are a bit of a special case in Markdown.
This is also the case with trivia parsing, where the LineBreakInline is considered part of the paragraph block, and not part of the trivia.
Consider the following example:
\n
text1\n
text2\n
\n
The first \n is attached to the paragraph block as trivia before.
The second \n is a LineBreakInline inline element, and is not considered trivia.
The third \n is the newline of the paragraph block.
The fourth \n is attached as trivia after.
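The distinction between a LineBreakInline and trivia can be checked by counting the line break inlines the parser creates for the example above. This is a minimal sketch, assuming LineBreakInline is in the Markdig.Syntax.Inlines namespace and that Descendants also walks inline nodes; the names LineBreakExample and CountLineBreaks are made up for illustration.

```csharp
using System;
using System.Linq;
using Markdig;
using Markdig.Syntax;
using Markdig.Syntax.Inlines;

public static class LineBreakExample
{
    // Counts the LineBreakInline nodes anywhere in the parsed document.
    public static int CountLineBreaks(string markdown)
    {
        MarkdownDocument document = Markdown.Parse(markdown, trackTrivia: true);
        return document.Descendants<LineBreakInline>().Count();
    }

    public static void Main()
    {
        // Same document as above: blank line, two text lines, blank line.
        int count = CountLineBreaks("\ntext1\ntext2\n\n");
        Console.WriteLine($"LineBreakInline nodes: {count}");
    }
}
```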
All trivia in a document should be attached to a node. The Block class defines two properties to capture this:
/// <summary>
/// Gets or sets the trivia right before this block.
/// Trivia: only parsed when <see cref="MarkdownPipeline.TrackTrivia"/> is enabled, otherwise
/// <see cref="StringSlice.IsEmpty"/>.
/// </summary>
public StringSlice TriviaBefore { get; set; }
/// <summary>
/// Gets or sets trivia occurring after this block.
/// Trivia: only parsed when <see cref="MarkdownPipeline.TrackTrivia"/> is enabled, otherwise
/// <see cref="StringSlice.IsEmpty"/>.
/// </summary>
public StringSlice TriviaAfter { get; set; }
Typically, this trivia occurs within the document before, in between or after blocks.
Take these examples (the interpunct, or middle dot, ·, is used to visualize a space character):
·*·item1
··*·item1
·*··item1
All three are valid Markdown, and each defines an unordered list with one paragraph block. The parse tree looks like this:
- ListBlock
  - ListItemBlock
    - Paragraph
      - LiteralInline "item1"
The parser assigns the trivia (the spaces in the above examples) to the ListItemBlock and ParagraphBlock nodes respectively.
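The assignment of spaces to nodes can be inspected by printing TriviaBefore for each list item and paragraph. This is a sketch under the same assumptions as before (Markdig package referenced, Descendants extension available); ListTriviaExample and DescribeTrivia are hypothetical names, and the output is left for the reader to inspect rather than stated here.

```csharp
using System;
using System.Collections.Generic;
using Markdig;
using Markdig.Syntax;

public static class ListTriviaExample
{
    // Collects the trivia stored before each list item and each paragraph.
    public static List<string> DescribeTrivia(string markdown)
    {
        MarkdownDocument document = Markdown.Parse(markdown, trackTrivia: true);
        var lines = new List<string>();

        foreach (ListItemBlock item in document.Descendants<ListItemBlock>())
        {
            lines.Add($"ListItemBlock  TriviaBefore=\"{item.TriviaBefore}\"");
        }
        foreach (ParagraphBlock paragraph in document.Descendants<ParagraphBlock>())
        {
            lines.Add($"ParagraphBlock TriviaBefore=\"{paragraph.TriviaBefore}\"");
        }
        return lines;
    }

    public static void Main()
    {
        // The three inputs from the text, with real spaces instead of middle dots.
        foreach (string input in new[] { " * item1", "  * item1", " *  item1" })
        {
            Console.WriteLine($"input: \"{input}\"");
            foreach (string line in DescribeTrivia(input))
            {
                Console.WriteLine("  " + line);
            }
        }
    }
}
```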
Trivia may occur within nodes. In such cases, a property is defined for each part of the syntax where trivia may occur. Some inlines have escaped strings. These strings are set separately on the parse tree of that inline.
LinkInline and FencedCodeBlock are both examples where trivia is parsed within the node and the node contains properties for both escaped and unescaped strings.
Links and LinkReferences have a complex parsing implementation. The codebase currently contains a separate set of Parse*Trivia methods.
These methods are duplicated from their source Parse* methods for simplicity's sake.
Abstracting the trivia parsing in the source methods was considered, but that would make already complex parsing logic even more complex.
Instead, maintaining a duplicated (but mature) codebase was judged to be the easier and less complex option.
While LinkReferences are parsed, the LinkReferenceDefinitionGroup is not added to the document.
The reason for this is to have the parse tree represent the input text as precisely as possible.
Adding the LinkReferenceDefinitionGroup would add a node that does not represent input text, so it is omitted.
As per the CommonMark 0.29 spec, the \0 (U+0000) character is replaced with \uFFFD (U+FFFD, the replacement character).
Therefore, whenever the input contains a \0 character, it is not, and never will be, possible to produce output Markdown that is exactly equal to the input.
Rule: Exactly equal output Markdown for a given input Markdown is only possible when the \0 character is not present in the input Markdown
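The rule above can be expressed as a small pre-check. The sketch below does not use Markdig at all; NullCharacterRule, CanRoundtripExactly and ReplaceNullCharacters are hypothetical helpers that only illustrate the rule and the substitution the spec prescribes.

```csharp
using System;

public static class NullCharacterRule
{
    // An exact roundtrip is only possible when the input has no U+0000 character.
    public static bool CanRoundtripExactly(string inputMarkdown)
    {
        return inputMarkdown.IndexOf('\0') < 0;
    }

    // Mirrors the substitution described by the spec: U+0000 becomes U+FFFD.
    public static string ReplaceNullCharacters(string inputMarkdown)
    {
        return inputMarkdown.Replace('\0', '\uFFFD');
    }

    public static void Main()
    {
        string input = "a\0b";
        Console.WriteLine(CanRoundtripExactly(input));
        Console.WriteLine(ReplaceNullCharacters(input) == input);
    }
}
```

A caller could run such a check before comparing input and output, so that a mismatch caused by a \0 character is not mistaken for a renderer bug.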
The spec states:
"Any sequence of characters is a valid CommonMark document."
where:
"A character is a Unicode code point. Although some code points (for example, combining accents) do not correspond to characters in an intuitive sense, all code points count as characters for purposes of this spec."
As such, an input document consisting only of trivia is, technically, also valid Markdown.
To support roundtrip parsing for documents that contain input characters that do not resolve to any block, the EmptyBlock is defined.
Rule: the EmptyBlock is a Block representing a run of Markdown trivia on which no other Block type matches
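One way to see which blocks the parser produces for a trivia-only document is to list the type names of the top-level blocks. This is a sketch: it avoids referencing EmptyBlock directly, makes no claim about the exact output, and uses hypothetical names (EmptyBlockExample, TopLevelBlockTypes).

```csharp
using System;
using System.Collections.Generic;
using Markdig;
using Markdig.Syntax;

public static class EmptyBlockExample
{
    // Lists the runtime type name of every top-level block in the document.
    public static List<string> TopLevelBlockTypes(string markdown)
    {
        MarkdownDocument document = Markdown.Parse(markdown, trackTrivia: true);
        var names = new List<string>();

        foreach (Block block in document)
        {
            names.Add(block.GetType().Name);
        }
        return names;
    }

    public static void Main()
    {
        // A document made of whitespace and newlines only.
        foreach (string name in TopLevelBlockTypes("  \n\n"))
        {
            Console.WriteLine(name);
        }
    }
}
```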
Extensions are currently not supported. If you're a writer or maintainer of an existing extension, would you be interested in writing a pull request to have your extension support roundtrip parsing? If you need any assistance, please reach out to @generateui. I'd be happy to help.