Revisit TextType to include Markdown #275

Closed
cboettig opened this issue Jul 17, 2017 · 25 comments

@cboettig
Member

This is probably out of scope for 2.2, and my apologies if it's already discussed in parts elsewhere that I've overlooked.

It seems like the partial DocBook markup in TextType often falls into a funny middle ground: it's not complex enough to just treat as full DocBook, but it's too complex for common tools. For instance, common rendering tools for EML, including (I think) the obvious reference implementation, the MetaCat web display, aren't rendering all TextType elements in full but are stripping them out. Meanwhile, the structure is often challenging for users to generate or manipulate easily (think word counts, common words, and other text-mining operations).

See ropensci/EML#217 for more discussion.

From my limited experience so far here, it seems like it would be preferable to either adopt some rich-text format for which there is more comprehensive tool support (e.g. full docbook, or any other open, archival quality format) or just opt for something much simpler (sections & paras with plain-text content).

@mbjones
Member

mbjones commented Jul 17, 2017

I agree with you in principle, but this would be a backwards-incompatible change, and so would be out of scope for 2.2. If I had it to do today, it might make more sense to support markdown over docbook. Of course, in 2002 or so, docbook had a similar lock on the markup market as markdown does today, so in 10 to 15 years more we might very well have something else that is ascendant. RST came in between and was looking good for a while. Not sure how we predict the future of markup languages. Our bottom line is we need some simple formatting to support some emphasis text, some lists, subscript and superscript as these are common in scientific texts, and a few other things. But you are right that it is hard to just support these. Let's continue this discussion to figure out how to move forward.

@cboettig
Member Author

Yup, guess that's as I surmised. I'm curious what insight you have into how metadata authors are generating TextType content in general though -- I don't imagine many people just type out abstracts and methods with the relevant tags.

As you've probably seen, in the R package we've attempted to suggest authors can write such long-form docs in markdown, HTML, or, probably most popularly, MS Word, and the package will attempt to convert it into DocBook via pandoc. Obviously this is asking for trouble though, since full DocBook isn't supported. At the very least the R package needs some mechanism to filter out invalid tags created this way...
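
To make the "filter out invalid tags" idea concrete, here is a minimal Python sketch (not the R package's code) that reports which elements in a converted fragment fall outside an allowed set. The tag list in `EML_TEXT_TAGS` is an assumption for illustration and should be checked against the EML text schema; the function name is hypothetical, and the sketch assumes un-namespaced (DocBook 4 style) element names.

```python
import xml.etree.ElementTree as ET

# Assumed subset of DocBook-like elements accepted by EML TextType.
# This list is illustrative only; verify it against the EML schema.
EML_TEXT_TAGS = frozenset({
    "section", "title", "para", "itemizedlist", "orderedlist", "listitem",
    "emphasis", "subscript", "superscript", "literalLayout", "ulink",
    "citetitle",
})


def find_unsupported_tags(fragment_xml, allowed=EML_TEXT_TAGS):
    """Return the sorted names of descendant elements not in `allowed`.

    The root element (e.g. <abstract>) is the container and is not checked.
    """
    root = ET.fromstring(fragment_xml)
    return sorted({
        element.tag
        for element in root.iter()
        if element is not root and element.tag not in allowed
    })


if __name__ == "__main__":
    converted = (
        "<abstract><para>Text with <emphasis>stress</emphasis>"
        "<footnote><para>a note</para></footnote></para>"
        "<informaltable/></abstract>"
    )
    print(find_unsupported_tags(converted))
```

A report like this only flags the problem; deciding whether to drop, unwrap, or rewrite each unsupported element is a separate design choice.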

I suppose moving in this direction with the EML spec, e.g. for full DocBook support, would be more backwards compatible (e.g. since it would add rather than remove tags), but I guess would open a larger can of worms (albeit one that might be largely addressed by existing DocBook tooling).

I completely agree with you that markdown isn't the ideal here; it's great for content creation, but only if it's being rendered into something else. It's far too fractured and loosely defined to be desirable here. (DocBook, being the kindred open XML format, is probably still the most natural choice.)

In any event, I think any revision should of course be driven by understanding how users are creating these EML elements, (and also how they want to create these elements) and how they are consuming them (or would like to be able to consume them -- e.g. render a nice Scientific Data article from the relevant EML methods/protocols sections etc). Still probably all beyond what you want to tackle with 2.2

@mbjones
Member

mbjones commented Jul 18, 2017

Not necessarily out of scope. I'd note that #269 is intended to allow formatting and inline images that let us build the equivalent of Scientific Data articles in EML as a structured metadata language. I'm guessing that will require changes to TextType that are similar to what you are requesting. Maybe this becomes EML 3? Yikes!

@mbjones
Member

mbjones commented Sep 9, 2017

After some further contemplation, I have moved this into the EML 2.2 project with the intention of seeing if we can support the full docbook schema in eml:TextType, which would allow inline insertion of figures and tables and other needed formats for data papers. As these would be additions, this would not be a breaking change (all previously valid TextType elements would still be valid here).

Advantages:

  • users can use any tool that generates valid DocBook to both create and consume the contents of TextType (whereas now it's hard to use the current format)
  • rich features of DocBook would be available, including the use of inlinemediaobject to provide for in-text images and tables, which are a requirement to support data papers as described in issue #269 (add fields to support data papers)

Disadvantages

  • the TextType field is more complex, and so display software will need to be richer to properly display the contents

Overall, though, I think it's a win. Commentary, @cboettig?

@cboettig
Member Author

cboettig commented Sep 9, 2017

Nice, I think this is the way to go. Certainly it gets more complex to render, but it opens up all the DocBook tooling to deal with that, which I think is a big win.

Just because it can be more complex doesn't necessarily mean that complexity will be widely used, which I think is already illustrated by the current textType complexity (e.g. display tools already don't support all textType fields and it's largely not a big problem).

Full docbook support in textType seems like it would appeal to journals with the data articles stuff from ESA etc, right?

@mobb
Contributor

mobb commented Sep 11, 2017

sounds like an improvement. Another common use of text type is methods descriptions, which often include images and tables. Currently, the only option is to include these in an external doc (protocol PDF). We'd need good guidelines for when these objects should be in text and when they belong in the dataset entities.

@mbjones
Member

mbjones commented Sep 11, 2017

Thanks, @mobb. I was thinking that inline references would be to data entities in the package, as those are available. So, the steps for including, for example, a site layout figure in the methods would be to first include the figure as an otherEntity, and then reference it inline in the methods section using inlinemediaobject.

@mbjones
Member

mbjones commented Oct 5, 2017

@amoeba — now that you’ve implemented the XSL for TextType for EML 2.1.1, do you have further thoughts on the feasibility of extending TextType to support all (or most?) of docbook, as described in this issue #275? How much more work would be involved in supporting the display of the full docbook spec, and in particular the inlinemediaobject tag to render images and other media in the body of the text?

@amoeba
Contributor

amoeba commented Oct 5, 2017

Hey @mbjones. I think supporting full DocBook makes a lot of sense here. Certainly would be a big help for authoring EML from Word/md/etc via Pandoc. As far as updating our display systems to fully support all of DocBook in EML TextType modules, I don't expect it would be a ton of work. Supporting EML 2.1.1's TextType as I've recently done was less than a day's work. There already exist official DocBook XML -> HTML5 XSLs (though I'm not sure if they'd be of use to us). If fully supporting DocBook rendering turned out to be too hard, supporting the rendering of a subset might work for our use case.

As a potential tangent, my recollection is that, for display systems, we've been considering moving away from an XSLT-based rendering process toward a fully client-side process. If we were to support full DocBook in EML TextType, this would be a pain because in-browser rendering facilities for DocBook appear non-existent while the ecosystem on the web for Markdown already exists.

@mbjones mbjones added the next label Oct 30, 2017
@mbjones
Member

mbjones commented Oct 31, 2017

Yikes, DocBook is huge. Looking into inclusion of DocBook, it's more intimidating than ever. Here's the content model for just the mediaobject element, which is what we might want for graphics (along with inlinemediaobject):

mediaobject ::= (objectinfo?,
 (videoobject|audioobject|imageobject|imageobjectco|textobject)+, 
  caption?)

and here's an example of it in use:

<mediaobject>
    <imageobject>
        <imagedata fileref="figures/eiffeltower.eps" format="EPS"/>
    </imageobject>
    <imageobject>
        <imagedata fileref="figures/eiffeltower.png" format="PNG"/>
    </imageobject>
    <textobject>
        <phrase>The Eiffel Tower</phrase>
    </textobject>
    <caption>
        <para>Designed by Gustave Eiffel in 1889, The Eiffel Tower is one of the
            most widely recognized buildings in the world.
        </para>
    </caption>
</mediaobject>

While I think it is still true that processors like pandoc will handle this with aplomb, as will the standard docbook.xsl, it's not as clear to me how we would handle the complexity in JavaScript. Do we really want to parse and display a content model using JavaScript that has literally thousands of permutations of formatting directives? I think it'll be a challenge. This issue needs discussion.

As I see it now, we could follow a few different paths:

  1. add select fields to TextType to support embedded images and a few new formatting features
  2. replace all of TextType with a new version that allows docbook child content; this means that TextType would be redefined as one of the components, most likely a DocBook Article
    • despite what I said earlier, I think this would be a backwards incompatible change because TextType currently doesn't really follow the DocBook model
  3. redefine TextType to Markdown or something similar (backwards incompatible)
  4. add a new choice to TextType that allows an alternate formatting model in its own section, such as a markdown element. This would mean redefining TextType as:
    • TextType (section | para | markdown)+
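
As a rough illustration of path 4, the following Python sketch checks a fragment against the `(section | para | markdown)+` content model written above. It is only a sketch under stated assumptions: the function name is hypothetical, it looks at direct children only, and it ignores namespaces and any mixed text content, which a real schema validator would handle.

```python
import xml.etree.ElementTree as ET

# Child elements permitted by the proposed model: (section | para | markdown)+
ALLOWED_CHILDREN = {"section", "para", "markdown"}


def check_text_type(fragment_xml):
    """Return a list of problems found; an empty list means the check passed."""
    root = ET.fromstring(fragment_xml)
    child_tags = [child.tag for child in root]
    if not child_tags:
        # The "+" in the model requires at least one child element.
        return ["<no child elements>"]
    return [tag for tag in child_tags if tag not in ALLOWED_CHILDREN]


if __name__ == "__main__":
    print(check_text_type(
        "<abstract><para>Intro</para><markdown>## Details</markdown></abstract>"
    ))
    print(check_text_type("<abstract><figure/></abstract>"))
```

In practice this rule would live in the XML Schema itself; the sketch just shows how little a consumer has to dispatch on when markdown is confined to its own element.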

The pros and cons of these need to be discussed. Here are some of the key things I think we need to support:

  • enable inline and block level images, videos, and interactive objects to be embedded in text sections, with figure captions.
    • format control for these, including justification, size/scaling, wrapping, and other key layout features
  • enable inline and block level tables in text sections, with table captions
    • enable merging across row and column cells
    • demarcation of header rows
    • need to discuss formatting needs for these tables
  • inline citations with Literature Cited section
  • enable streamlined authoring through various editing tools with GUI formatting support
  • enable complete transformation into display-ready HTML
  • preferably enable complete transformation into display-ready PDF as well
  • inline equations?
  • others?

If we were to consider Markdown support, which flavor du jour would be best?

@cboettig
Member Author

Yeah, I was feeling queasy about going all in on DocBook the other day as well. For markdown, I suspect CommonMark would be the best choice: http://commonmark.org/

@amoeba
Contributor

amoeba commented Oct 31, 2017

@mbjones said:

As I see it now, we could follow a few different paths:

To your list of options, I'd add two more:

  5. (Similar to 3) Define the current set of TextType elements as { TextType | MarkdownType }, allowing for mixed content where text nodes are interpreted as Markdown and element nodes are TextType. Can that be done?
<abstract><![CDATA[
  # title
  ## subheader
  ![alt][image.png]
]]></abstract>

I like this one because I'm imagining it to be more common to want an entire methods section or abstract to be in one form or other (i.e. all in markdown)

  6. Add an attribute to para and section like mime-type, with a default of text and an allowable value of markdown, e.g.,
<abstract>
  <para mime-type="markdown">
    <!-- markdown here -->
  </para>
</abstract>

I like this idea because it allows you to mix but I think mixing is pointless so I'm not a fan of this one.

@mbjones said:

Here are some of the key things I think we need to support:

Can you elaborate more on this:

enable inline and block level images, videos, and interactive objects to be embedded in text sections...

Why inline? That would make for some really ugly rendered output if you could inline any of those. IMHO they should always be block-level.

As for the rest of the features list, Markdown appears to be the best choice for getting us there and getting us there reasonably fast.

@mbjones
Member

mbjones commented Oct 31, 2017

  • For your option 5, we can't really do { TextType | MarkdownType }, as any given element must be assigned to one and only one ComplexType. However, we can redefine TextType to allow a markdown element (my path 4), which admittedly is not as clean as having the markdown directly in the text. This would then look like this:
<abstract>
  <markdown><![CDATA[
  # title
  ## subheader
  ![alt][image.png]
  ]]>
  </markdown>
</abstract>
  • However, we could if desired reinterpret TextType, as it is mixed content already, and so we could change the definition of TextType to say that any text outside of para/section elements is to be interpreted as Markdown. It would be a bit weird, and may result in some formatting oddities for existing EML documents that aren't expecting that interpretation. Which is why I didn't propose it. I think it's cleaner to keep the markdown and the DocBook separate.
  • I like your option 6 in principle, although I would call the attribute something like 'text-type' rather than mime-type; we could create a new SimpleType called MarkdownText or something like that and use it for the data type for those fields. This would completely break our linkage to docbook however (which doesn't seem to have much love anyways). The problem I see is the mix of markdown and docbook formatting that would then be possible in those fields. I think that could be quite confusing, which is why I didn't include it as an option (just as for the mixed content model for TextType itself). For example, I'm not sure how we would render this:
<abstract>
  # title
  This is a markdown line with <emphasis>docbook formatting</emphasis> and an <para>embedded para</para> so how should it be rendered?

  ## subheader
  ![alt][image.png]
</abstract>
  • regarding inline versus block: I agree we generally would want block. The inline was for supporting inline equations and things like that, which are actually shown in the data paper example that ESA prototyped. The nice thing about docbook is it completely supports that type of markup, including both inline and block level equations, images, videos, etc. (e.g., see inlineequation)

mbjones added a commit that referenced this issue Feb 10, 2018
This includes a new `markdown` element in `txt:TextType` to support
Github flavored markdown.  And new elements for an introduction,
gettingStarted, and acknowledgements.  See issue #269 and #275.
@mbjones mbjones self-assigned this Feb 12, 2018
@mbjones
Member

mbjones commented Feb 12, 2018

OK, so I got a start on this in commit 54c7cd7, implementing option 4 from my list above, which is to create a dedicated markdown element within the TextType choice. This has the advantage of making it crystal clear where markdown parsing rules will apply, and allows interleaving of markdown blocks with DocBook blocks and plain text. Here's an example of a markdown section in action:

<abstract>
    <markdown><![CDATA[
        Some intro text in abstract, then break into subsections.

        ## Level 2 heading 

        We use a level 2 heading because Level 1 would be at the same level as
        the main sections of the paper.

        ## Another level 2 heading 
        With some information.

        Plus, it can include all of the other features of 
        [Github Flavored Markdown (GFM)](https://github.github.com/gfm/).  Note that this
        version of GFM is a superset of CommonMark, and is intended to eventually be an
        official extension of CommonMark.
        ]]>
    </markdown>
</abstract>

Still need to improve the documentation, and include explicit instructions for how to:

  • Ensure text is wrapped properly in CDATA sections
  • Make sure alignment is consistent so that markdown indentation works
  • Specify how inline image references should work in terms of relative URLs
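
The CDATA and alignment points above can be sketched from the consumer's side. The Python below is a minimal, hypothetical example (the function name and sample fragment are not part of EML): it pulls the text out of each `markdown` element and strips the common leading indentation so that headings start in column 0, while deeper, relative indentation survives. It assumes consistent indentation with spaces; mixed tabs and spaces would defeat the common-prefix detection.

```python
import textwrap
import xml.etree.ElementTree as ET


def extract_markdown(eml_fragment):
    """Return the dedented text of every <markdown> element in the fragment."""
    root = ET.fromstring(eml_fragment)
    blocks = []
    for node in root.iter("markdown"):
        raw = node.text or ""  # the parser delivers CDATA content as plain text
        # Remove the shared left margin, then trim blank lines at either end.
        blocks.append(textwrap.dedent(raw).strip("\n"))
    return blocks


EXAMPLE = """<abstract>
    <markdown><![CDATA[
        Some intro text.

        ## Level 2 heading

            indented code stays indented
        ]]>
    </markdown>
</abstract>"""

if __name__ == "__main__":
    print(extract_markdown(EXAMPLE)[0])
```

Without the dedent step, a markdown parser would likely treat the whole block as an indented code block, since every line starts with four or more spaces.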

@mbjones mbjones added in progress and removed next labels Feb 12, 2018
@mbjones mbjones changed the title from "Revisiting TextType?" to "Revisit TextType to include Markdown" Feb 12, 2018
@mbjones
Member

mbjones commented Feb 12, 2018

Added documentation on indenting, CDATA, and image reference links. At this point, I think the addition of markdown elements to TextType is ready for review.

@mbjones
Member

mbjones commented Feb 12, 2018

At this point, feedback from the broad EML community would be helpful on this markdown feature addition. Please leave a comment in this ticket with your take on whether it would be useful and whether you think you would use it in your EML documents. Thanks!

@srearl
Contributor

srearl commented Feb 12, 2018

@mbjones et al. - thanks for your hard work on this issue. This would be a very welcome addition, that I would find extremely helpful.

@cboettig
Member Author

👍 Proposal looks great to me and would find this very useful.

@mobb
Contributor

mobb commented Feb 21, 2018

This proposal looks like the most flexible to me, and it would be very useful to be able to use either TextType or markdown.

But it is not clear to me what will be the best way to handle equations: a solution for them that can be both rendered and interpreted. It does not look like markdown supports equations on its own (see github/markup#897). And even if it did, if markdown and TextType cannot be mixed, then it means converting an existing system wholesale (i.e., away from TextType). Not a deal breaker, but would be good to know.

I really am not a fan of using inline images for equations, as suggested in an earlier comment on this issue (#275). Having to build/store/describe those images as other entities is tedious, and clutters up the package. The super- and subscripting currently offered by TextType is too simplistic. The DocBook documentation says that equation can hold MathML as well as an image reference (http://tdg.docbook.org/tdg/4.5/equation.html), so one alternative is to add MathML to TextType.

I’ve played around with embedding LaTeX equations in EML text, but am no expert - it was a relatively simple case. MathJax.js renders it, but needs to be added to stylesheets.

I know a lot of EML-based groups are moving toward using the R package, specifically so that the text elements (abstract, methods) can be created by scientists, instead of by a data manager - which is a great improvement to the workflow! @cboettig , what does the R package do with equations in word docs? Pandoc seems to have no problem with Math (with the proper switches).
I think that before we finalize this (if there is a solution for equations), it would be best to know more about the EML text generation options. Carl had asked something similar way back in this thread.

@cboettig
Member Author

The R package just calls out to pandoc on TextType. Given an EML TextType object this is pretty robust: the TextType node is read into pandoc's DocBook parser, and then pandoc renders to the desired format. This should handle docbook equations, though not sure if those are valid EML? Going the other way this can create problems, pandoc's parsing of Word isn't perfect; and when this is converted to Docbook the package does nothing to make sure that only the EML subset of Docbook is included. In general I don't think this workflow is really well tested.

I think things will be cleaner with the Markdown version. Matt's proposal uses GitHub markdown as the default, which doesn't support equations, but since it's all just plain text nothing stops a user from writing $e = m c ^ 2$. A user could render the markdown with a different markdown parser (i.e. pandoc flavor), or simply let this appear in the HTML as raw $e = m c ^2$ but manually add MathJax styles to render that, (as you suggest). Markdown also permits raw HTML, so a user could just write MathML right into the markdown if they really wanted to (hey, MathML is built into the HTML5 spec, despite not being implemented by many browsers); the key thing being that from EML's perspective, it's all just a string; which is exactly what it should be, in my opinion (EML being a metadata language and not a markup language)
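
Since the equations are just strings inside the markdown, a display tool could scan for them to decide whether to load something like MathJax. The Python sketch below is one hedged way to do that; the function name and the `$...$` / `$$...$$` delimiter convention are assumptions (GFM itself defines no math syntax), and a naive pattern like this will also match stray dollar signs, such as prices written on one line.

```python
import re

# Display math ($$...$$) is tried first, then single-line inline math ($...$).
MATH_PATTERN = re.compile(r"\$\$(.+?)\$\$|\$([^$\n]+?)\$", re.DOTALL)


def find_math_spans(markdown_text):
    """Return (kind, latex) tuples for each delimited math span found."""
    spans = []
    for match in MATH_PATTERN.finditer(markdown_text):
        display, inline = match.group(1), match.group(2)
        if display is not None:
            spans.append(("display", display.strip()))
        else:
            spans.append(("inline", inline.strip()))
    return spans


if __name__ == "__main__":
    sample = "Energy: $e = m c ^ 2$.\n\n$$ \\frac{d_{ij,t}}{l_{ij}} $$"
    for kind, latex in find_math_spans(sample):
        print(kind, latex)
```

The EML document stays untouched either way; the scan only informs how a renderer chooses to present the string.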

@mobb
Contributor

mobb commented Feb 21, 2018

For maximum acceptance, equations should look as pretty coming from EML as they do in print -- ie, typeset. I think this will be particularly desirable for a data paper.

Some strings are readable: $e = m c ^ 2$ or d_{ij,t}, but most people consider that rather primitive. And it doesn't take long for these to become hard to read:

$$ \frac{(D_{xy,t,t}+D_{yz,t})}{(L_{xy}+L_{yz})}=\frac{d_{ij,t}}{l_{ij}} $$

So I will spend some time looking at a few options (e.g., mathML, LaTeX), and see how they fit with this proposal and some of the ways people are generating their EML.

@cboettig
Member Author

@mobb Thanks, but not sure I follow what you mean by "coming from EML". I would imagine the only way a user would see such a raw string is if they were reading the XML document directly. Users writing equations probably do want to write them in LaTeX, just like in your hard-to-read string example. How that string looks to a consumer viewing the EML will depend on the tool used to view it, right? Assuming the markdown is rendered to HTML and dumped to a website, then it's just a matter of turning on MathJax on the website for such an equation string to be formatted properly.

@mobb
Contributor

mobb commented Feb 21, 2018

sorry @cboettig - i was imprecise. should have said rendered from EML (with HTML being the most common).
I agree-- LaTeX + mathjax is my preference too. However, IMO, the typical users writing equations do not want to write them in LaTeX -- they are too used to WYSIWYG editors. That's why I asked about interpreting equations from Word docs - I was concerned that it might be outputting mathML, or something else altogether.

[Side note: even data mgrs (who script EML construction, and understand the schema) are using word-doc and the eml-R package for text elements, because it lets them delegate the science-related tasks to someone who has no knowledge of arcane metadata models like EML. This is a huge improvement. ]

I have checked: the eml-R package converts MS Word equations into LaTeX, so this pathway will work as it does now. I added a LaTeX equation to the eml-data-paper.xml example in a markdown field, so its rendering can be tested as well.

@cboettig
Member Author

👏 thanks, that's awesome to know pandoc can convert MS Word equations into nice LaTeX, and that the Word -> R EML -> EML pipeline is useful! I'm still nervous it will break when pandoc uses some DocBook that EML doesn't support, so switching to Markdown will avoid that.

I had a vague recollection that Word supports LaTeX input in its equation editor, but I see your point that many people prefer the WYSIWYG equation editor (to be honest I'd forgotten about those; I mostly interact with equations with other theorists who hate doing equations in those editors!)

@mbjones
Member

mbjones commented Feb 23, 2018

I agree that equation handling is critical. Unfortunately, I could not find a version of markdown that unambiguously supports all of the features that we want. I thought that GFM extensions to CommonMark were most likely to be long-term stable, pretty complete, and well supported. So I suggested that. I also considered adding a syntax attribute on the markdown element for people to indicate what markdown syntax they follow, but I thought that would get messy as processors might then be tasked with supporting all of the markdown variants, which is not really feasible given their incompatibilities. CommonMark is really trying to tackle that problem. As @cboettig said, we're likely to just deploy this through pandoc or another common markdown library to convert to HTML, so any features like MathML and LaTeX equations should be supported, and hopefully will be added to CommonMark as extensions eventually.

This has been a good discussion, and I think we need to add some equations to our testing docs, but otherwise I don't see a proposal to change anything from this thread. @mobb, if you think something needs to be changed, could you provide a proposal? Thanks.
