Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Serializing mdast to markdown #64

Closed
Enoumy opened this issue Apr 29, 2023 · 19 comments · Fixed by #127
Closed

Serializing mdast to markdown #64

Enoumy opened this issue Apr 29, 2023 · 19 comments · Fixed by #127

Comments

@Enoumy
Copy link

Enoumy commented Apr 29, 2023

(Whoops accidentally hit enter before drafting a content for this question, my apologies for the noise!)

Hi! I have a perhaps newbie question! I can use markdown::to_mdast to go from &str -> Node. Is it possible/is there a function to go back to a string - Node -> &str` - in a way that roundtrips?

I came across Node::to_string, and it does seem to convert nodes into a string but it also deletes the links/titles/and most other ast nodes, which if re-parsed again, results in a different ast. Unsure if this question is reasonable/within the context of this crate, but is there an alternate function elsewhere that is round-trippable to/from &str <-> Node? I am also happy to take a stab at implementing this "rountrippable" unparser function myself, but was wondering if a function like it already existed.

For further clarification, by "roundtripping", I would be writing a property based test, like markdown::to_mdast(to_string(node)) == node be true for all node's.

Thanks!

@Enoumy Enoumy changed the title Is it possible to roundtrip parse? Is it possible to roundtrip parse between Node <-> mdast? Apr 29, 2023
@Enoumy Enoumy changed the title Is it possible to roundtrip parse between Node <-> mdast? Is it possible to roundtrip parse between mdast nodes <-> strings? Apr 29, 2023
@wooorm
Copy link
Owner

wooorm commented Apr 29, 2023

No, this is not yet possible, as mdast-util-to-markdown has not been implemented in Rust yet.

You can work on this. Though, it is involved work that takes a while. The good part is that everything has already been implemented in JavaScript.

Finally, “complete” roundtripping (toString(fromString(x)) == x) is impossible with ASTs. ASTs are abstract. They loose information. That is intentional. So the results will never be exact, but the results will be equivalent.

@wooorm wooorm changed the title Is it possible to roundtrip parse between mdast nodes <-> strings? Serializing mdast to markdown Apr 29, 2023
@h7kanna
Copy link
Contributor

h7kanna commented May 5, 2023

Will this work? Passing on the 'serde_json' serialized format to mdast-util-to-markdown?

@wooorm
Copy link
Owner

wooorm commented May 9, 2023

perhaps

@a-viv-a
Copy link

a-viv-a commented Aug 18, 2023

I wrote a likely crummy implementation of this for a personal project here, would something like this make sense as a PR or a new crate?

It passes a (much) weaker version of the proptest @Enoumy proposes, where string -> mdast -> string2 -> mdast -> string3 produces an equivalent string2 and string3 (assuming I understand how proptest works 😁 )

I don't think it covers all the possible nodes mdasts can include, and it applies some opinionated formatting. I also suspect this recursive approach is bad for performance. (I'm learning rust through this project, so I wouldn't be surprised to learn something about this code is very far from best practices)

@wooorm
Copy link
Owner

wooorm commented Aug 18, 2023

Nice start and welcome to rust :)

  • this project is no_std, looks like you’re using a bunch of that?
  • some potential bugs are fine, but it should be good from the start I think, have you looked at mdast-util-to-markdown? it’s battle tested and supports everything. Being mostly compatible across JS and Rust is also important to me!

@a-viv-a
Copy link

a-viv-a commented Aug 18, 2023

I'll leave this code in my own project then. I found this issue when I was already mostly done with this implementation, so I couldn't until it was too late. I'll take a look now, but I don't plan to write something new when I have something that works for me.

Edit: if nothing else I need to copy the unsafe character support...

@moy2010
Copy link

moy2010 commented Nov 16, 2023

@wooorm, do you know why wouldn't leveraging the ToString implementation for this be a good idea? Or is the intention to have a separate method for this?

@wooorm
Copy link
Owner

wooorm commented Nov 17, 2023

“to string” is already a thing in the mdast world, getting just the text out.
Formatting markdown is complex. And not always needed.
Yes, separate methods. See the first comment. https://github.com/syntax-tree/mdast-util-to-markdown

@moy2010
Copy link

moy2010 commented Nov 17, 2023

I see. I will try to work on a PR then 🙂.

@bnchi
Copy link
Contributor

bnchi commented Aug 20, 2024

Thanks for you work @wooorm it will be so handy to have this as crate/feature, is there any plans soon to add it into the project ?

@wooorm
Copy link
Owner

wooorm commented Aug 21, 2024

Please read the comments above yours. Thanks.

@bnchi
Copy link
Contributor

bnchi commented Aug 21, 2024

Yeah, my bad I asked that question because the comment was in 2023 and I thought there might be other plans.

I have time now which I can spend to help with this issue, @moy2010 mind me if I ask, do you need any help here, have you started working on this issue if not maybe I can start kicking things off with the Rust version ?

@moy2010
Copy link

moy2010 commented Aug 21, 2024

Yeah, my bad I asked that question because the comment was in 2023 and I thought there might be other plans.

I have time now which I can spend to help with this issue, @moy2010 mind me if I ask, do you need any help here, have you started working on this issue if not maybe I can start kicking things off with the Rust version ?

Sorry, @bnchi . I started looking into it, but ultimately decided to ditch it because my original plan was to convert from my domain format -> mdast -> markdown, and instead I just did the conversion directly without mdast as the intermediary model.

@bnchi
Copy link
Contributor

bnchi commented Aug 22, 2024

@wooorm

I know a little bit about parsers and I read part of the JS implementation, appreciate if you can help me here and let me know if my intuition and the steps in which I'm going to be implementing this make sense.

The JS depends on the use of unist-util-visit which seems to help visit nodes of some type in a tree, it has a little bit of useful methods, and the Rust implementation doesn't seem to have that abstraction.

It probably make sense to do the following :

  • Introduce a Visitor trait
  • Add a new method on the Node type to accept a type that implements the Visitor trait in this case it's the markdown serialization type, the accept method matches over all the variants of a Node and call the correct method.
  • The markdown serialization type will impl Visitor and will have the logic that allow us to serialize and walk down the tree, I'm going to be copying the logic from the handle methods in the JS implementation here: https://github.com/syntax-tree/mdast-util-to-markdown/tree/main/lib/handle
  • In the end convert all the tests into Rust to make sure the implementation matches the JS one (this will probably be with the help of an LLM since there are a lot of tests written in JS and manually doing them is going to be alot of work)

One part which I don't have a clear mental model of in the JS implementation :

state.enter
tracker.move

what's the difference between a state and a tracker, is there a documentation that describes the logic behind them ?

@bnchi
Copy link
Contributor

bnchi commented Aug 22, 2024

Hmm the Visitor abstraction might be a nice to have and is not needed for this specific case since we're only going to be serializing straight to markdown so we can match against the Node variant and call the appropriate method to do the serializations.

The only part which is not clear to me is state.enter('type?') How this type help in traversing the tree ?

@wooorm
Copy link
Owner

wooorm commented Aug 22, 2024

There are indeed many utilities in the javascript ecosystem already, which are not there yet in Rust. unist-util-visit is very important in javascript. But for mdast-util-to-markdown, it’s only used in one place. https://github.com/syntax-tree/mdast-util-to-markdown/blob/fd6a508cc619b862f75b762dcf876c6b8315d330/lib/util/format-heading-as-setext.js#L17. So, perhaps you can do without it, in this case.

what's the difference between a state and a tracker, is there a documentation that describes the logic behind them ?

The types might help:

https://github.com/syntax-tree/mdast-util-to-markdown/blob/fd6a508cc619b862f75b762dcf876c6b8315d330/lib/types.js#L56-L76

The tracker tracks positional info. There are several trackers. Small objects.

The state is a bigger thing, it tracks the entire process.

The only part which is not clear to me is state.enter('type?') How this type help in traversing the tree ?

It is often important to know where you are, when serializing markdown.
Are you in a heading? Code? Something else?

@bnchi
Copy link
Contributor

bnchi commented Aug 23, 2024

If I understood what you said correctly does that mean state is conceptually a state machine that track where you're in the tree and it's used to call the correct handle method for the children of a parent from within the containerPhrasing and conteinerFlow to serialize into markdown.

If my above assumption is correct do you think it makes sense to replicate the same model in Rust or do you have a better idea ?

Also do you have any resources or a way to help me know which type is considered a phrasing content or a flow content ?

@wooorm

@wooorm
Copy link
Owner

wooorm commented Aug 23, 2024

a) state does more than tracking in which things you are, it also has helper utilities, and tracks other state (such as which bullet marker was just used)
b) it’s not about which handle has to be called. That has more to do with nodes. Nodes are simplified: they are “abstract”. There is one heading node. enter/exit/stack have to do with “constructs”, which really has to do with how things will be serialized: there are different ways to write headings: ## ATX, and Setext\n======. Which one is chosen, affects how things work. In this case, ATX headings cannot include \n, but Setext headings can

If my above assumption is correct do you think it makes sense to replicate the same model in Rust or do you have a better idea ?

I consider mdast-util-to-markdown to be very good. I think it can be copied over to rust. Essentially one-to-one.

Also do you have any resources or a way to help me know which type is considered a phrasing content or a flow content ?

This is defined in mdast:

@bnchi
Copy link
Contributor

bnchi commented Aug 24, 2024

Thanks for the clarification @wooorm

I made a small initial commit to lay down the foundation to what I think make sense to serialize in rust : bnchi@973f760

I'm thinking that the traits PhrasingParent and FlowParent will be generated for node type using a macro

Let me know if that's an approach you think make sense to move forward and start the full implementation

edit
I added a very simple way to serialize texts here to show how this will work :
bnchi@691bb87

@bnchi bnchi mentioned this issue Aug 27, 2024
21 tasks
@wooorm wooorm closed this as completed in 567e7e0 Oct 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants