Revisit TextType to include Markdown #275
I agree with you in principle, but this would be a backwards-incompatible change, and so would be out of scope for 2.2. If I had it to do today, it might make more sense to support markdown over docbook. Of course, in 2002 or so, docbook had a similar lock on the markup market as markdown does today, so in 10 to 15 years more we might very well have something else that is ascendant. RST came in between and was looking good for a while. Not sure how we predict the future of markup languages. Our bottom line is we need some simple formatting to support some emphasis text, some lists, subscript and superscript as these are common in scientific texts, and a few other things. But you are right that it is hard to just support these. Let's continue this discussion to figure out how to move forward. |
Yup, guess that's as I surmised. I'm curious what insight you have into how metadata authors are generating TextType content in general though -- I don't imagine many people just type out abstracts and methods with the relevant tags. As you've probably seen, in the R package we've attempted to suggest authors can write such long-form docs in markdown, html, or probably most popularly, MS Word, and the package will attempt to convert it into DocBook via pandoc. Obviously this is asking for trouble though, since full Docbook isn't supported. At the very least the R package needs some mechanism to filter out invalid tags created this way... I suppose moving in this direction with the EML spec, e.g. for full Docbook support, would be more backwards compatible (e.g. since it would add rather than remove tags), but I guess would open a larger can of worms (albeit one that might be largely addressed by existing Docbook tooling). I completely agree with you that markdown isn't the ideal here; it's great for content creation but only if being rendered into something else; it's far too fractured and loosely defined to be desirable here. (Docbook, being the kindred open XML format, is probably still the most natural choice). In any event, I think any revision should of course be driven by understanding how users are creating these EML elements (and also how they want to create these elements) and how they are consuming them (or would like to be able to consume them -- e.g. render a nice Scientific Data article from the relevant EML methods/protocols sections etc). Still probably all beyond what you want to tackle with 2.2 |
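Editorial sketch: the "mechanism to filter out invalid tags" mentioned above could look roughly like the following. It is only an illustration, not what the R package does. The `ALLOWED_TAGS` set is an assumed approximation of the DocBook subset that TextType accepts, and `filter_docbook` and `unwrap_disallowed` are hypothetical names. Disallowed elements are unwrapped (their text and children are kept) rather than dropped, so no prose is lost.

```python
import xml.etree.ElementTree as ET

# Assumption: an illustrative allow-list approximating the DocBook subset
# accepted by EML's TextType. It is not copied from the EML schema.
ALLOWED_TAGS = {
    "section", "title", "para", "itemizedlist", "orderedlist", "listitem",
    "emphasis", "subscript", "superscript", "literalLayout", "ulink", "citetitle",
}


def _append_text(parent, index, text):
    """Attach text immediately before position `index` among parent's children."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        prev = parent[index - 1]
        prev.tail = (prev.tail or "") + text


def unwrap_disallowed(parent):
    """Replace each disallowed descendant with its own text and children, in place."""
    i = 0
    while i < len(parent):
        child = parent[i]
        if child.tag in ALLOWED_TAGS:
            unwrap_disallowed(child)
            i += 1
            continue
        grandchildren = list(child)
        parent.remove(child)
        # Text that preceded the first grandchild moves up into the parent.
        _append_text(parent, i, child.text)
        for offset, grandchild in enumerate(grandchildren):
            parent.insert(i + offset, grandchild)
        # Text that followed the removed element goes after the spliced content.
        _append_text(parent, i + len(grandchildren), child.tail)
        # `i` is not advanced, so spliced-in grandchildren are checked next.


def filter_docbook(xml_text):
    """Return xml_text with disallowed elements below the root unwrapped."""
    root = ET.fromstring(xml_text)
    unwrap_disallowed(root)
    return ET.tostring(root, encoding="unicode")
```

For example, a `footnote` produced by pandoc inside a `para` would be flattened into the surrounding paragraph text while any nested `emphasis` survives.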
Not necessarily out of scope. I'd note that #269 is intended to allow formatting and inline images that let us build the equivalent of Scientific Data articles in EML as a structured metadata language. I'm guessing that will require changes to TextType that are similar to what you are requesting. Maybe this becomes EML 3? Yikes! |
After some further contemplation, I have moved this into the EML 2.2 project with the intention of seeing if we can support the full docbook schema in Advantages:
Disadvantages
Overall, though, I think it's a win. Commentary, @cboettig ? |
Nice, I think this is the way to go. Certainly it gets more complex to render, but it opens up all the docbook tooling to deal with that, which I think is a big win. Just because it can be more complex doesn't necessarily mean that complexity will be widely used, which I think is already illustrated by the current textType complexity (e.g. display tools already don't support all textType fields and it's largely not a big problem). Full docbook support in textType seems like it would appeal to journals with the data articles stuff from ESA etc, right? |
sounds like an improvement. Another common use of text type is methods descriptions, which often include images and tables. Currently, the only option is to include these in an external doc (protocol PDF). We'd need good guidelines for when these objects should be in text and when they belong in the dataset entities. |
Thanks, @mobb. I was thinking that inline references would be to data entities in the package, as those are available. So, the steps for including, for example, a site layout figure in the methods would be to first include the figure as an otherEntity, and then reference it inline in the methods section using |
@amoeba — now that you’ve implemented the XSL for TextType for EML 2.1.1, do you have further thoughts on the feasibility of extending TextType to support all (or most?) of docbook, as described in this issue #275? How much more work would be involved in supporting the display of the full docbook spec, and in particular the |
Hey @mbjones. I think supporting full DocBook makes a lot of sense here. Certainly would be a big help for authoring EML from Word/md/etc via Pandoc. As far as updating our display systems to fully support all of DocBook in EML As a potential tangent, my recollection is that, for display systems, we've been considering moving away from an XSLT-based rendering process toward a fully client-side process. If we were to support full DocBook in EML |
Yikes, DocBook is huge. Looking into inclusion of DocBook, it's more intimidating than ever. Here's the content model for just the mediaobject element, which is what we might want for graphics (along with inlinemediaobject):
and here's an example of it in use: <mediaobject>
<imageobject>
<imagedata fileref="figures/eiffeltower.eps" format="EPS"/>
</imageobject>
<imageobject>
<imagedata fileref="figures/eiffeltower.png" format="PNG"/>
</imageobject>
<textobject>
<phrase>The Eiffel Tower</phrase>
</textobject>
<caption>
<para>Designed by Gustave Eiffel in 1889, The Eiffel Tower is one of the
most widely recognized buildings in the world.
</para>
</caption>
</mediaobject> While I think it is still true that processors like pandoc will handle this with aplomb, as will the standard As I see it now, we could follow a few different paths:
The pros and cons of these need to be discussed. Here are some of the key things I think we need to support:
If we were to consider Markdown support, which flavor du jour would be best? |
yeah, I was feeling queasy about going all in on doc-book the other day as well. For markdown, I suspect CommonMark would be the best choice: http://commonmark.org/ |
@mbjones said:
Of your list of options, I'd add a
<abstract><![CDATA[
# title
## subheader
![alt][image.png]
]]></abstract> I like this one because I'm imagining it to be more common to want an entire methods section or abstract to be in one form or other (i.e. all in markdown)
<abstract>
<para mime-type="markdown">
<!-- markdown here -->
</para>
</abstract> I like this idea because it allows you to mix but I think mixing is pointless so I'm not a fan of this one. @mbjones said:
Can you elaborate more on this:
Why inline? That would make for some really ugly rendered output if you could inline any of those. IMHO they should always be block-level. As for the rest of the features list, Markdown appears to be the best choice for getting us there and getting us there reasonably fast. |
<abstract>
<markdown><![CDATA[
# title
## subheader
![alt][image.png]
]]>
</markdown>
</abstract>
<abstract>
# title
This is a markdown line with <emphasis>docbook formatting</emphasis> and an <para>embedded para</para> so how should it be rendered?
## subheader
![alt][image.png]
</abstract>
|
OK, so I got a start on this in commit SHA 54c7cd7, implementing option 4 from my list above, which is to create a dedicated <abstract>
<markdown><![CDATA[
Some intro text in abstract, then break into subsections.
## Level 2 heading
We use a level 2 heading because Level 1 would be at the same level as
the main sections of the paper.
## Another level 2 heading
With some information.
Plus, it can include all of the other features of
[Github Flavored Markdown (GFM)](https://github.github.com/gfm/). Note that this
version of GFM is a superset of CommonMark, and is intended to eventually be an
official extension of CommonMark.
]]>
</markdown>
</abstract> Still need to improve the documentation, and include explicit instructions for how to:
|
Added documentation on indenting, CDATA, and image reference links. At this point, I think the addition of |
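Editorial sketch: a consumer could pull the Markdown back out of the proposed `markdown` element roughly as below. This is a minimal illustration under stated assumptions, not the reference implementation. `extract_markdown` is a hypothetical helper, and stripping the common indentation with `textwrap.dedent` is one assumed way to handle XML pretty-printing. The XML parser already resolves the CDATA section, so characters such as `<` and `&` arrive as plain text.

```python
import textwrap
import xml.etree.ElementTree as ET


def extract_markdown(eml_fragment, parent_tag="abstract"):
    """Return the dedented Markdown text inside <parent_tag><markdown>...</markdown>.

    Hypothetical helper: returns an empty string when no markdown child exists.
    """
    root = ET.fromstring(eml_fragment)
    parent = root if root.tag == parent_tag else root.find(f".//{parent_tag}")
    if parent is None:
        raise ValueError(f"no <{parent_tag}> element found")
    node = parent.find("markdown")
    if node is None or node.text is None:
        return ""
    # Remove indentation shared by all lines (Markdown is whitespace-sensitive),
    # then trim the blank lines left around the CDATA section.
    return textwrap.dedent(node.text).strip()


sample = """<abstract>
  <markdown><![CDATA[
    ## Level 2 heading

    Text with <angle> brackets & ampersands.
  ]]>
  </markdown>
</abstract>"""

print(extract_markdown(sample))
```

The returned string can then be handed to whichever CommonMark or GFM renderer the display system uses.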
At this point, feedback from the broad EML community would be helpful on this markdown feature addition. Please leave a comment in this ticket with your take on whether it would be useful and whether you think you would use it in your EML documents. Thanks! |
@mbjones et al. - thanks for your hard work on this issue. This would be a very welcome addition, that I would find extremely helpful. |
👍 Proposal looks great to me and would find this very useful. |
This proposal looks like the most flexible to me, and it would be very useful to be able to use either TextType or markdown. But it is not clear to me what will be the best way to handle equations -- a solution for them that can both be rendered and interpreted. It does not look like markdown supports equations on its own (see github/markup#897). I really am not a fan of using inline images for equations, per comment #275 (comment). I’ve played around with embedding LaTeX equations in EML text, but am no expert - it was a relatively simple case. MathJax.js renders it, but needs to be added to stylesheets. I know a lot of EML-based groups are moving toward using the R package, specifically so that the text elements (abstract, methods) can be created by scientists, instead of by a data manager - which is a great improvement to the workflow! @cboettig , what does the R package do with equations in word docs? Pandoc seems to have no problem with Math (with the proper switches). |
The R package just calls out to pandoc on TextType. Given an EML TextType object this is pretty robust: the TextType node is read into pandoc's DocBook parser, and then pandoc renders to the desired format. This should handle docbook equations, though not sure if those are valid EML? Going the other way this can create problems, pandoc's parsing of Word isn't perfect; and when this is converted to Docbook the package does nothing to make sure that only the EML subset of Docbook is included. In general I don't think this workflow is really well tested. I think things will be cleaner with the Markdown version. Matt's proposal uses GitHub markdown as the default, which doesn't support equations, but since it's all just plain text nothing stops a user from writing |
For maximum acceptance, equations should look as pretty coming from EML as they do in print -- ie, typeset. I think this will be particularly desirable for a data paper. Some strings are readable:
So I will spend some time looking at a few options (e.g., mathML, LaTeX), and see how they fit with this proposal and some of the ways people are generating their EML. |
@mobb Thanks, but not sure I follow what you mean by "coming from EML". I would imagine the only way a user saw such a raw string is if they were reading the XML document directly. Users writing equations probably do want to write them in LaTeX, just like in your hard-to-read string example. How that string looks to a consumer viewing the EML will depend on the tool to view it, right? Assuming the markdown is rendered to HTML and dumped to a website, then it's just a matter of turning on mathjax on the website for such an equation string to be formatted properly. |
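Editorial sketch: because an equation in a markdown field is just plain text, a display tool can locate it before deciding how to typeset it. The snippet below assumes `$$ ... $$` as the display-math delimiter, which is a common MathJax configuration but is not something this thread settles on. `find_display_math` is a hypothetical name.

```python
import re

# Assumption: display equations are wrapped in $$ ... $$ (not specified in the thread).
DISPLAY_MATH = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)


def find_display_math(markdown_text):
    """Return the LaTeX source of every $$...$$ span, with outer whitespace trimmed."""
    return [match.strip() for match in DISPLAY_MATH.findall(markdown_text)]


methods = "Growth follows\n\n$$ N_t = N_0 e^{rt} $$\n\nwhere r is the growth rate."
print(find_display_math(methods))
```

A renderer could use the same pattern to protect these spans from Markdown processing and leave them in the HTML for MathJax to typeset.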
Sorry @cboettig - I was imprecise. Should have said rendered from EML (with HTML being the most common). [Side note: even data mgrs (who script EML construction, and understand the schema) are using word-doc and the eml-R package for text elements, because it lets them delegate the science-related tasks to someone who has no knowledge of arcane metadata models like EML. This is a huge improvement. ] I have checked: the eml-R package converts MS-word-equations into LaTeX, so this pathway will work as it does now. I added a LaTeX equation to the eml-data-paper.xml example in a markdown field, so its rendering can be tested as well. |
👏 thanks, that's awesome to know pandoc can convert MS word equations into nice LaTeX, and that the Word -> R EML -> EML pipeline is useful! I'm still nervous it will break when pandoc uses some docbook that EML doesn't support; so switching to Markdown will avoid that. I had a vague recollection that Word supports LaTeX input in its equation editor, but I see your point that many people prefer the WYSIWYG equation editor (to be honest I'd forgotten about those; I mostly interact with equations with other theorists who hate doing equations in those editors!) |
I agree that equation handling is critical. Unfortunately, I could not find a version of markdown that unambiguously supports all of the features that we want. I thought that GFM extensions to CommonMark were most likely to be long-term stable, pretty complete, and well supported. So I suggested that. I also considered adding a This has been a good discussion, and I think we need to add some equations to our testing docs, but otherwise I don't see a proposal to change anything from this thread. @mobb, if you think something needs to be changed, could you provide a proposal? Thanks. |
This is probably out of scope for 2.2, and my apologies if it's already discussed in parts elsewhere that I've overlooked.
It seems like the use of partial doc-book markup in TextType is often falling in a funny middle ground where it's not complex enough to just treat as full doc-book, but otherwise is too complex for common tools. For instance, it appears that common rendering tools for EML, including (I think) the obvious reference implementation, the MetaCat web display, aren't actually rendering all text-type elements in full but are stripping them out. Meanwhile, the structure often appears to be challenging for users to generate or manipulate easily (think word-counts, common words and other text mining operations).
See ropensci/EML#217 for more discussion.
From my limited experience so far here, it seems like it would be preferable to either adopt some rich-text format for which there is more comprehensive tool support (e.g. full docbook, or any other open, archival quality format) or just opt for something much simpler (sections & paras with plain-text content).
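Editorial sketch: the text-mining difficulty mentioned above (word counts over TextType content) comes from having to flatten inline and block markup differently. The following is a minimal illustration, assuming a small set of block-level tag names. `BLOCK_TAGS`, `plain_text`, and `word_count` are hypothetical names, not part of any EML tool.

```python
import xml.etree.ElementTree as ET

# Assumption: tags treated as block-level, so a line break follows them.
BLOCK_TAGS = {"title", "para", "listitem", "section"}


def plain_text(elem):
    """Flatten an element to text, keeping inline runs joined and blocks separated."""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(plain_text(child))
        if child.tag in BLOCK_TAGS:
            parts.append("\n")
        parts.append(child.tail or "")
    return "".join(parts)


def word_count(xml_text):
    """Count whitespace-separated words in the flattened text."""
    return len(plain_text(ET.fromstring(xml_text)).split())


abstract = (
    "<abstract><section><title>Site</title>"
    "<para>We sampled <emphasis>Quercus</emphasis> leaves at "
    "H<subscript>2</subscript>O-limited plots.</para></section></abstract>"
)
print(word_count(abstract))
```

Inline elements such as `subscript` are joined without a space so that "H2O-limited" stays one word, while the title and paragraph are kept apart.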