Extensions redux #154
Conversation
The object-loading code and the extension-specific changes in core were two things I didn't like about #123, so this seems a step in the right direction. However, I'd be interested in comments from @MathieuDuponchelle and @nwellnhof. This is a big code revision, and it's hard to see the forest for the trees. To ease review, could you please rebase it so that (a) all of the changes are on top of current master, and (b) the commits are reordered and consolidated into a smaller number of logically coherent commits? Part of the difficulty with #123 was that it was hard to survey. Also note that Travis builds are failing due to a warning.
Travis was failing due to an older Valgrind on the Linux instances. That's been fixed up. I'll look into cleaning up the PR. It is rather large; I probably won't touch commits from #123.
We should really try to split this up into smaller chunks. First of all, I'd suggest tackling block and inline extensions separately. Then, before writing any new code, we should start by discussing the internals.

**Discuss changes to core data structures**

*How to identify extension nodes.* The new PR improves upon the previous approach by generating numeric ids for new node types dynamically. Personally, I'd still prefer to have two single node types.

*Render hooks and extension-specific node data.* I'd propose to add two new fields to the `cmark_node` struct. The PR adds a separate field for the destructor and stores other callbacks in the extension struct. I think it's cleaner to have a separate struct for each node type which can contain callbacks and other static data.

**Discuss public extension API**

I think the best approach is to look at the core node types and restructure the code in … These callbacks only need access to the current line and offset in … This isn't too different from the PR, but it seems a bit simpler to me. This also makes it straightforward to support core blocks inside extension blocks; I'm not sure how this would be handled in the PR. I can publish my changes to the block parser for review if anyone is interested.

I haven't looked at the inline extension API yet. It will probably be a lot more complicated than the API for extension blocks. I only noted that the PR exposes the delimiter struct.

**Agree on the design**

As I already mentioned in #123, I'm still not happy with some design decisions. See "Discuss changes to core data structures" above, where I reiterate my criticism.

**Implement**

This should really be the last step.
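A rough C sketch of the data layout being proposed here; the names are illustrative stand-ins, not actual cmark API:

```c
/* Illustrative only -- a sketch of the proposal above, not cmark API:
 * one static struct per extension node type holding callbacks, plus
 * two new per-node fields (shared type info + dynamic payload). */
struct cmark_node; /* stands in for cmark's opaque node type */

typedef struct extension_node_type {
  const char *name;                             /* e.g. "table_cell" */
  void (*render_html)(struct cmark_node *node); /* render hook */
  void (*free_data)(void *data);                /* payload destructor */
} extension_node_type;

typedef struct extension_node_fields {
  const extension_node_type *type; /* static, shared per node type */
  void *data;                      /* dynamic, node-specific data */
} extension_node_fields;
```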
These are all very good suggestions, @nwellnhof! Unfortunately we're short on time, as we're rolling out these changes to the website in the upcoming weeks, so we can't afford to sit down and discuss the extensions API for the hundredth time. Making progress on this stuff has been extremely painful so far. The truth is that CMark is not gaining traction because nobody is using it in production, and nobody is using it in production because it doesn't support any of the features that large websites (like us at GitHub) need. So, to speed things up, we'll soon be publishing our own fork of CMark. Hopefully you can pick up all the bits and pieces from there until this CMark implementation becomes more usable, or maybe other websites will just start using our fork -- either way we all win, because it means the CMark standard is gaining adoption. Thank you and everybody involved for the fantastic work so far.
woot :) I'll take some time to look at it later today, thanks a bunch :)
I have to admit, adding an extension mechanism hasn't been a big priority for me. The primary aim of this library is to be a reference implementation for the CommonMark spec and a demonstration of algorithms for efficiently parsing the syntax defined by the spec. It was never intended to be the only implementation. (See Markdig and markdown-it for extensible CommonMark implementations in C# and JavaScript.) In principle I like the idea of an extension mechanism, but everything I've seen so far has been too complex for my taste, and I haven't had time to try to come up with anything I like better. (Maybe it's not possible.) Publishing a fork that explores ideas for extensibility is a reasonable thing to do.

That said, I completely agree with the things @nwellnhof has said here, and I'm a bit taken aback by @vmg's response. As a user of GitHub who has often been plagued by Markdown-related problems, I think that GitHub ought to be interested in thinking through the details of the extension architecture and API very carefully before rushing code into production. It seems foolish to give up the chance to discuss this with people who have thought quite a lot about it.
OK, I'll take the time to formulate a more constructive answer tomorrow, but I'm not so enthusiastic anymore. @vmg, saying you're gonna use your own fork is fine and a good thing to do, and I'm happy if my code can be useful in some manner to you folks, but seriously this pull request is a joke, and stating 13 hours after you made it that "maybe other websites will start using our fork" is just ruthlessly bad-mannered, in my very humble opinion, especially coming from someone representing a company that does have the power to carry out this kind of aggressive fork -- which is what that would be, no matter how you may sugar-coat it.

More constructively, the reason why I didn't complain about the perceived lack of reactivity from upstream is that the reduction of scope needed to be done. I would have indeed liked a bit more guidance from @jgm, but we all have our time constraints, and I respect that, faced with 20+ large commits, it's hard to take the time to sit down and go through them. However, the code dump that's proposed there is the exact opposite of "redux", with partial manual reverts and obviously WIP patches such as "Super hack to move table state out of HTML". I don't know how you guys even have the nerve to propose this seriously!

Importantly, the main reason why @jgm was unhappy with my pull request was a very meaningful design decision that I made, which was to not allow registration of new types of nodes from the outside, because then mixing and matching extensions from different providers that wouldn't know about each other, and thus couldn't correctly specify containing rules, was not possible. That use case might have been a bit too extreme, and it imposed severe restrictions on the design; as a side note, it's for this use case that I half-implemented the plugin loading mechanism.

As I don't even need this for my use case, since I want to expose a specific set of extensions over which I have complete control -- much like I suppose you guys at GitHub do -- I decided some time ago I'd undertake a rework, and provide an actual redux version which upstream would be happy with. You guys should have come discuss on #123 to figure out how to approach this task, which you visibly agree needed doing. I'd have been more than happy to help, and we would have ended up with something more mergeable than at the beginning. In its current state that branch is, in my opinion, way less mergeable than the original one, especially as it entirely contains it and then partly deconstructs it ...
Document the parser structure for internal usage.

As opposed to the previous commit, where I exposed getters for the private parser structure, this exposes two methods that will influence parsing: `cmark_parser_add_child` and `cmark_parser_advance_offset`.
This is necessary for extensions implementing vertical rules. An example is setext headings:

```
A heading
---------
```

When cmark parses 'A heading', it first creates a paragraph block to contain it; it's only when cmark parses the second line that the type of the block is changed to the `CMARK_NODE_HEADING` type.
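A sketch of how a block rule might use the two methods named in this commit; the prototypes below are assumptions based on the commit description, not a published header:

```c
#include "cmark.h"

/* The two methods this commit exposes; prototypes assumed here from
 * the commit description, for illustration only. */
extern cmark_node *cmark_parser_add_child(cmark_parser *parser,
                                          cmark_node *parent,
                                          cmark_node_type block_type,
                                          int start_column);
extern void cmark_parser_advance_offset(cmark_parser *parser,
                                        const char *input, int count,
                                        int columns);

/* Hypothetical block rule: open a paragraph that a later line may
 * retype, setext-style, into a heading. */
static cmark_node *open_block(cmark_parser *parser, cmark_node *parent,
                              const char *input, int len) {
  cmark_node *block =
      cmark_parser_add_child(parser, parent, CMARK_NODE_PARAGRAPH, 1);
  /* Consume the matched characters so core parsing resumes after them. */
  cmark_parser_advance_offset(parser, input, len, 0);
  return block;
}
```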
Ideally, this would be passed in set_user_data, but this would break API.
By implementing `cmark_node_get_string_content` and `cmark_node_set_string_content`. This is useful for vertical rules in extensions, as they may need to access the string content in order to decide whether to update the block. Unfortunately, this overlaps with `get_literal` and `set_literal`. As far as I can tell we should deprecate these functions and have them follow the `get_string_content` and `set_string_content` code paths for the time being.
And expose and implement block parsing hooks
Linux-only support
Allow listing and attaching extensions. Also clean up valgrind a little by removing exits and using cleanup gotos.
We have no syntax rules yet for creating them natively, but future extensions may provide some.
This slightly breaks API as `finish` will now return `NULL` after it has been called once, but I believe that's for the best. This fixes potential leaks of `root` when a parser was fed but not finished, and makes reusing a parser instance multiple times possible.
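For context, a minimal sketch of the streaming API whose behaviour this commit changes; these are standard cmark calls, and the NULL-on-second-finish behaviour is the change described above:

```c
#include <string.h>
#include "cmark.h"

int main(void) {
  cmark_parser *parser = cmark_parser_new(CMARK_OPT_DEFAULT);
  const char *chunk = "some *markdown*\n";

  cmark_parser_feed(parser, chunk, strlen(chunk));
  cmark_node *doc = cmark_parser_finish(parser);

  /* Per this commit: calling finish again without feeding now returns
   * NULL, and the same parser instance can be fed again and reused. */
  cmark_node_free(doc);
  cmark_parser_free(parser);
  return 0;
}
```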
And a pointer to a free function as well.
This can be useful for transclusion extensions, for example: http://talk.commonmark.org/t/transclusion-or-including-sub-documents-for-reuse/270
Bug fixes in strikethrough included; in some edge cases it would segfault.
Please "super-squash" what I actually did in a commit authored by me, and split out your own additions in a separate one, that would be way more polite |
@MathieuDuponchelle: can you suggest a solution you'd be happy with? I'm not really interested in authorship itself as a concept, so if retagging the commit as yours would be sufficient, I'd be happy to do that.
@kivikakk, see above, thanks. You don't get to decide whether your "lack of interest for authorship as a concept" extends to other people's work as well ;)
Same for the strikethrough extension, and the table extension. You seem to have reworked the latter, but it would have been way more practical to actually see what your changes were if you had committed them separately. In any case I do have authorship on the original work; you didn't even mention it in the commit messages there ...
@MathieuDuponchelle: I've split the commits up accordingly. It's not possible to separate the table and strikethrough extensions into separate commits, because they're in entirely different files to begin with; the diff is simply … minus … I hope this helps.
@kivikakk thanks. I'd really like to know what your intentions are with this fork in the long term.
Thanks everyone, this is cool. You guys mention that this is already in use to render GitHub comments. Is the syntax highlighting in GitHub Flavored Markdown done as a preprocessing step or something?
@MathieuDuponchelle: We don't have a specific goal with the fork. It's the code we use to render Markdown at GitHub, and we wanted to OSS it so users can refer to it. Of course, we would love it if @jgm were interested in reviewing the extensions we've implemented and considering merging those upstream! But he's already busy enough as it is, so we're happy to just start rendering all the documents we currently have as CommonMark + our unofficial extensions.

@jeroenooms: It's actually a post-processing step on the generated HTML, just like all the other small GitHub-specific features, like autolinking commits and user names. Check out http://github.com/jch/html-pipeline for an example of how to do this yourself.
OK. One final question: if no extensions are enabled in the parser, the output of this fork will be exactly identical to the original jgm/cmark code, correct? If we default to not enabling extensions, I should be able to use this fork as a drop-in replacement without any side effects?
@jeroenooms: Yes! That's been the intention since day one, and we've been trying really hard to stick to that. Just this morning we upstreamed the only slight modification we had made to the CommonMark spec (a small change to the way links are parsed), and @jgm has merged it. So we should be 100% compatible. By the way, I was going to follow up by email, but yeah... There's the fork. Hopefully it'll be useful for you. :)
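A quick way to exercise the drop-in claim: with no extensions attached, the fork should be usable through the same one-shot entry point as upstream cmark. A minimal sketch:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "cmark.h"

int main(void) {
  const char *md = "Hello *world*\n";
  /* No extensions attached, so the output should be byte-identical to
   * what upstream jgm/cmark produces for the same input and options. */
  char *html = cmark_markdown_to_html(md, strlen(md), CMARK_OPT_DEFAULT);
  printf("%s", html);
  free(html);
  return 0;
}
```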
+++ Vicent Martí [Dec 01 16 02:53]:

> @jeroenooms: It's actually a post-processing step on the generated HTML, just like all the other small GitHub-specific features, like autolinking commits and user names. Check out http://github.com/jch/html-pipeline for an example of how to do this yourself.

It's pretty easy to use cmark's iterator interface to do this kind of thing prior to rendering, instead of postprocessing. Here's an example of doing this in lcmark, my lua cmark wrapper: https://github.com/jgm/lcmark/blob/master/filters/highlight.lua

I would guess this approach would be more efficient than postprocessing HTML, because you're crawling an efficient representation of the document's nodes, and you don't need to parse HTML.
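For readers who haven't used it, a minimal sketch of the iterator interface @jgm is describing; here we just report fenced code blocks, which is where a highlighter would do its work:

```c
#include <stdio.h>
#include "cmark.h"

/* Walk the document tree before rendering; a real highlighter would
 * replace matching CODE_BLOCK nodes instead of printing their info. */
static void walk_code_blocks(cmark_node *document) {
  cmark_iter *iter = cmark_iter_new(document);
  cmark_event_type ev;
  while ((ev = cmark_iter_next(iter)) != CMARK_EVENT_DONE) {
    cmark_node *node = cmark_iter_get_node(iter);
    if (ev == CMARK_EVENT_ENTER &&
        cmark_node_get_type(node) == CMARK_NODE_CODE_BLOCK) {
      printf("code block with info string: %s\n",
             cmark_node_get_fence_info(node));
    }
  }
  cmark_iter_free(iter);
}
```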
OK, this is all very cool stuff. I am going to release R bindings soon. Many R users have been waiting for a complete and portable markdown renderer which does not depend on Haskell ;) I do hope that the split will be temporary and we will eventually converge to combine efforts into a single "cmark" implementation.
Another question: in an embedded application, should …
@vmg I am getting the following compiler warnings on Windows 64 bit: …

@MathieuDuponchelle what is the point of these double casts?

```c
unsigned char c = (unsigned char) (unsigned long) tmp_char->data;
```

Windows doesn't like that. Can I just take out the … ?
@jeroenooms, I don't know why they even kept this API; it's absolutely useless given there's no plugin loading mechanism anymore. What you want to do is instantiate extensions and attach them on a per-parser basis. That way you can have multiple parsers with different extensions enabled in the same program. Here is an example of how that worked with a previous version of the extensions API: https://github.com/hotdoc/hotdoc/blob/master/hotdoc/parsers/cmark_module.c#L462 and https://github.com/hotdoc/hotdoc/blob/master/hotdoc/parsers/cmark_include_extension.c
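For comparison, this is the shape of the per-parser flow against the github/cmark fork as far as I can tell; the entry-point names are assumptions based on the fork's extension API, not upstream cmark:

```c
#include "cmark.h"
/* cmark_find_syntax_extension and cmark_parser_attach_syntax_extension
 * are assumed from the github/cmark fork; they are not upstream API. */

static cmark_parser *make_table_parser(void) {
  cmark_parser *parser = cmark_parser_new(CMARK_OPT_DEFAULT);
  cmark_syntax_extension *ext = cmark_find_syntax_extension("table");
  if (ext)
    cmark_parser_attach_syntax_extension(parser, ext);
  /* Other parsers in the same program are unaffected. */
  return parser;
}
```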
@MathieuDuponchelle hmmm OK. Well, I need to get rid of these warnings to publish my bindings: …

So it looks like on Win64, the pointer is not the same size as `unsigned long`.
Hm, the problem is similar to this: https://developer.gnome.org/glib/stable/glib-Type-Conversion-Macros.html and should be handled in the same way. I would suggest going with a solution along these lines as well.
Unfortunately, the code for this is a bit obfuscated in glib's configure.ac file and not readily copy-pastable, I'm afraid, @jeroenooms. The API could also be reworked to take a null-terminated char array, fwiw.
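The essence of those GLib macros is small, though. A distilled sketch of the portable cast, assuming the same integer-in-pointer payload as the snippet above:

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
  /* Round-trip a small integer through a void* the way GLib's
   * GUINT_TO_POINTER/GPOINTER_TO_UINT macros do: cast via uintptr_t,
   * which is pointer-sized on every platform, instead of unsigned
   * long, which is only 32 bits on Win64 (LLP64). */
  void *data = (void *)(uintptr_t)'A';
  unsigned char c = (unsigned char)(uintptr_t)data;
  printf("%c\n", c);
  return 0;
}
```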
@jeroenooms I'd either stop using linked lists, which are unnecessarily complex here, or make the payload a union of pointer and integer types. But all of this doesn't belong here. @vmg Can you allow opening issues on github/cmark?
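The union suggestion sketched out, with illustrative type names rather than the actual cmark list API:

```c
/* Illustrative only: if the list payload is a union, small integers
 * can be stored directly and no pointer/integer casting is needed. */
typedef union {
  void *ptr;
  unsigned int uint;
} llist_payload;

typedef struct llist_node {
  struct llist_node *next;
  llist_payload data;
} llist_node;
```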
That's my second suggestion; it shouldn't be very hard to implement.

There shouldn't be any reason to do that; I don't think cmark is intended to be compilable on platforms with pointer sizes under 32 bits?
@nwellnhof @kivikakk @vmg it would be great to open issues / PRs for the fork. I'd like to add support for aligning tables (example).
I opened up issues on the github/cmark fork; I forgot they might not be enabled by default.

We kept the plugin registration mechanism so that, e.g., the following works:

```
$ build/src/cmark --list-extensions
Available extensions:
table
strikethrough
autolink
tagfilter
$
```

(Per @MathieuDuponchelle's original design!) Without a registration mechanism, there's no way for the library to list the available extensions ahead of time. This is important for wrapping the library.

@jeroenooms: here's how we've done it in the Ruby `commonmarker` gem: we register the known core extensions once, on initialisation of the gem, and then provide a method to enable them.

As for table alignment, it's already added! See the relevant part of the extensions tests, and this example:

```
| longer table | headers to | demonstrate alignment |
| :--          | :-:        | --:                   |
| a            | b          | c                     |
```
@jeroenooms: could you try building this branch for your R bindings? https://github.com/github/cmark/tree/kivikakk/win32 Unfortunately I'm unable to get an exact build environment up that simulates yours (or perhaps CRAN's), but this might clear all the warnings.
If there are any more issues with our fork, please feel free to open them at https://github.com/github/cmark/issues/new!
Given the reduction of scope in your branch, and the removal of plugins, the whole `cmark_plugin` API is simply useless. You could simply (and should, in order to provide an API that makes a minimal amount of sense) directly instantiate the extensions at cmark_init time, as you compile them along with the rest of your sources, instead of going through convoluted maneuvers in the cmark executable, involving (sic) a …
For those following this thread, I have posted an announcement of the R bindings here: https://ropensci.org/blog/blog/2016/12/02/commonmark
@MathieuDuponchelle: (a) there is no more …

I hope this has helped elucidate some of the design factors for you. Seeing as we've published our fork, I'll close this issue -- please feel free to open issues on github/cmark if you have questions about our fork specifically! @jeroenooms: the R package release notes look great! :)
One last remark: the design that was implemented by this pull request does not make it possible for a third party to implement a new inline extension and have it containable in, for example, a table cell as implemented by the table extension. So much for compiling the core once ;)
* fix(table): recoginse-without-empty-line (commonmark#141)
* fix(table): fix bufsize_t not convert to uint16_t
* fix(table): fix uint16_6 not convert to int
* fix(table): fix uint16_6 not convert to int
* fix(table): clear unused type conversion
* restore whitespace
* Always free `paragraph_content`

`cmark_node_set_string_content` allocates and copies the data in `paragraph_content`, so it is not needed afterwards.

```
=================================================================
==14==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 24 byte(s) in 1 object(s) allocated from:
    #0 0x4dd330 in calloc /src/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:97
    #1 0x59e243 in xcalloc /src/octofuzz/src/cmark.c:18:15
    #2 0x58fd75 in unescape_pipes /src/octofuzz/extensions/table.c:95:39
    #3 0x58fd75 in try_inserting_table_header_paragraph /src/octofuzz/extensions/table.c:187
    #4 0x58fd75 in try_opening_table_header /src/octofuzz/extensions/table.c:254
    #5 0x58fd75 in try_opening_table_block /src/octofuzz/extensions/table.c:370
    #6 0x5b22d5 in open_new_blocks /src/octofuzz/src/blocks.c:1275:27
    #7 0x5b22d5 in S_process_line /src/octofuzz/src/blocks.c:1465
    #8 0x5aa7f0 in cmark_parser_finish /src/octofuzz/src/blocks.c:1492:5
    #9 0x58f2fc in LLVMFuzzerTestOneInput /src/octofuzz/test/cmark-fuzz.c:46:23

Indirect leak of 8 byte(s) in 1 object(s) allocated from:
    #0 0x4dd580 in realloc /src/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:107
    #1 0x59e2d3 in xrealloc /src/octofuzz/src/cmark.c:27:19
    #2 0x640364 in cmark_strbuf_grow /src/octofuzz/src/buffer.c:57:31
    #3 0x640364 in cmark_strbuf_init /src/octofuzz/src/buffer.c:31
    #4 0x58fd8b in unescape_pipes /src/octofuzz/extensions/table.c:98:3
    #5 0x58fd8b in try_inserting_table_header_paragraph /src/octofuzz/extensions/table.c:187
    #6 0x58fd8b in try_opening_table_header /src/octofuzz/extensions/table.c:254
    #7 0x58fd8b in try_opening_table_block /src/octofuzz/extensions/table.c:370
    #8 0x5b22d5 in open_new_blocks /src/octofuzz/src/blocks.c:1275:27
    #9 0x5b22d5 in S_process_line /src/octofuzz/src/blocks.c:1465
    #10 0x5aa7f0 in cmark_parser_finish /src/octofuzz/src/blocks.c:1492:5
    #11 0x58f2fc in LLVMFuzzerTestOneInput /src/octofuzz/test/cmark-fuzz.c:46:23

SUMMARY: AddressSanitizer: 32 byte(s) leaked in 2 allocation(s).
```
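The shape of the fix, per the commit message: since `cmark_node_set_string_content` copies its argument, the temporary buffer stays owned by the caller and must be freed. A self-contained sketch of that ownership pattern (`set_string_content` below is a stand-in, not the cmark function):

```c
#include <stdlib.h>
#include <string.h>

static char *stored; /* stands in for the node's own copy */

/* Stand-in for cmark_node_set_string_content: copies its input. */
static void set_string_content(const char *s) { stored = strdup(s); }

int main(void) {
  char *paragraph_content = strdup("| a | b |");
  set_string_content(paragraph_content);
  free(paragraph_content); /* required: the callee kept its own copy */
  free(stored);
  return 0;
}
```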
Hi there!
I've taken the work in #123 and rejigged it a bit. At GitHub we're currently using a Sundown-based parser/renderer, but it's not super extensible. So, we've decided to roll out CommonMark to replace it.
Here are some of the changes to #123 in this PR: …

This functionality has all been exposed as opt-in in the Ruby gem `commonmarker`, which is our primary interface.