Pandoc keeps span and div tags when converting html to org #3771

Closed
tshu-w opened this issue Jun 28, 2017 · 20 comments

Comments

@tshu-w

tshu-w commented Jun 28, 2017

Version: 1.19.2.1

pandoc -f html -t org and pandoc -f html -t org-raw_html-native_divs-native_spans produce identical output.

Input html:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title></title>
</head>
<body>
  <div class="Section1">
    <p class="Question"><span style="FONT-SIZE: 10pt">Today</span> <span style=
    "FONT-SIZE: 10pt">is</span> <span lang="HR" style=
    "FONT-SIZE: 10pt; mso-ansi-language: HR">a</span><span style=
    "FONT-SIZE: 10pt">nice</span> <span style="FONT-SIZE: 10pt">day</span> 
    </p>
  </div>
</body>
</html>

Org output from both commands:

#+BEGIN_HTML
  <div class="Section1">
#+END_HTML

Today is anice day

#+BEGIN_HTML
  </div>
#+END_HTML

Expected:

Today is anice day

pandoc -f html -t markdown-raw_html-native_divs-native_spans does not seem to have this problem.

@jgm
Owner

jgm commented Jun 28, 2017 via email

@jgm
Owner

jgm commented Jun 28, 2017

The question is just how native spans should be rendered in Org mode. I don't know enough about Org mode to know all the options - pinging @tarleb.

@tarleb
Collaborator

tarleb commented Jun 28, 2017

One method would be to use special blocks, like this:

#+ATTR_HTML: :class all classes but the first :key-value pairs
#+BEGIN_section1
here goes the content
#+END_section1

I've shied away from this solution for two reasons:

  1. Nesting is not possible (i.e., problems will result if nested divs share the same first class).
  2. Exporting from org will be ok when writing HTML, but LaTeX will give \begin{section1}…\end{section1}, which might be unexpected.
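
As a rough illustration of the special-block idea and its nesting problem, here is a minimal Python sketch. It is not pandoc's writer code: the function name `special_block` and the sample classes and attributes are made up for this example, and it only assembles the text layout shown above (first class as block name, remaining classes and key-value pairs in `#+ATTR_HTML`).

```python
from typing import Dict, List


def special_block(classes: List[str], attrs: Dict[str, str], body: str) -> str:
    """Render a div as an Org special block named after its first class.

    Hypothetical helper: remaining classes and key-value pairs go into
    an #+ATTR_HTML line, following the layout sketched in the comment.
    """
    name, rest = classes[0], classes[1:]
    header = []
    if rest:
        header.append(":class " + " ".join(rest))
    header += [f":{key} {value}" for key, value in attrs.items()]

    lines = []
    if header:
        lines.append("#+ATTR_HTML: " + " ".join(header))
    lines += [f"#+BEGIN_{name}", body, f"#+END_{name}"]
    return "\n".join(lines)


# Two nested divs that share the same first class.
inner = special_block(["section1"], {}, "inner text")
outer = special_block(["section1", "wide"], {"lang": "en"}, inner)
print(outer)
```

The printed text contains two `#+BEGIN_section1` lines and two identical `#+END_section1` lines, so nothing in the markup says which end line belongs to which block. That is the ambiguity the first objection refers to.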

It might be the lesser evil though. Input of other org-mode users would be appreciated.

@alphapapa

alphapapa commented Jun 29, 2017

There is no need for these DIVs to be represented in the Org output in the first place. Org organizes content by headings, which are H1, H2, etc. in HTML, and those are already converted correctly. DIVs like these are primarily used in HTML to organize and style content with CSS, which is irrelevant to Org.

These DIVs are simply useless clutter in the converted Org output. Instead of a user getting a clean, useful conversion of an HTML document to Org, he must spend extra time manually removing these irrelevant HTML DIV code blocks.

Previous Pandoc versions did not output this spurious HTML content; e.g. I'm using Pandoc 1.12.2.1 from Ubuntu 14.04, and it outputs only useful content (see example screenshot at https://github.com/alphapapa/org-protocol-capture-html).

I'm sure the ability to track and output these HTML DIVs in some output formats is very useful. But it's actually a detriment to Org output. And, as @tarleb pointed out, they cannot be nested properly in Org's native syntax. If there's a way to preserve them with extra options, that's great, but that's definitely a corner-case compared to a user simply wanting to capture the plain-text and outline-structured content from a simple HTML page. No one is expecting HTML->Pandoc->Org->Pandoc->HTML to produce a 1:1 conversion. So these HTML DIVs should simply be disabled for normal org output.

Thanks.

@tshu-w
Author

tshu-w commented Jun 29, 2017

From my point of view, it would be nice to have ways to meet different needs.
With markdown, for example, I can avoid span and div tags by turning off the raw_html, native_divs and native_spans extensions.

@jgm
Owner

jgm commented Jun 29, 2017

I'm sympathetic to the idea that native Divs should not come across as raw HTML in Org.
But what about Divs with an id attribute that serve as anchors?
Is there a way to represent internal anchors in Org?

@tarleb
Collaborator

tarleb commented Jun 29, 2017

Divs with only an id and no class or key-value pairs are currently unwrapped and the id is inserted as an <<anchor>>. Furthermore, if one of the classes is either quote or center, we wrap the content in the respective block type. If one of the classes is drawer, we create a drawer.

My current preference is to add a rule checking if the content is a single block, and to prefix that block with #+ATTR_HTML if that's the case. Otherwise, we should just keep the id if present and just output the content. This should keep the important information without shoehorning divs into the output. Does this seem reasonable?
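
The rules described in this comment can be sketched roughly as follows. This is an illustrative Python model, not the Org writer's actual code (pandoc itself is written in Haskell). The function name `render_div`, the placement of the anchor line, and the way the drawer name is picked from the remaining classes are all assumptions made for the example.

```python
from typing import Dict, List


def render_div(ident: str, classes: List[str], attrs: Dict[str, str],
               blocks: List[str]) -> str:
    """Model of the div rules; `blocks` are already-rendered Org blocks."""
    body = "\n\n".join(blocks)
    # Assumption: the id is always emitted as an <<anchor>> line on top.
    anchor = f"<<{ident}>>\n" if ident else ""

    # Existing rules: quote/center classes map to the matching block type.
    for kind in ("quote", "center"):
        if kind in classes:
            return f"{anchor}#+BEGIN_{kind.upper()}\n{body}\n#+END_{kind.upper()}"

    # Existing rule: a drawer class produces a drawer.
    if "drawer" in classes:
        # Assumption: the drawer name comes from the first other class.
        name = next((c for c in classes if c != "drawer"), "drawer")
        return f"{anchor}:{name.upper()}:\n{body}\n:END:"

    # Proposed rule: a single block keeps its attributes via #+ATTR_HTML.
    if len(blocks) == 1 and (classes or attrs):
        parts = []
        if classes:
            parts.append(":class " + " ".join(classes))
        parts += [f":{key} {value}" for key, value in attrs.items()]
        return f"{anchor}#+ATTR_HTML: {' '.join(parts)}\n{body}"

    # Otherwise: keep only the id (if any) and output the content.
    return anchor + body


print(render_div("", ["Section1"], {}, ["Today is anice day"]))
```

Under this model the div from the original report would come out as one `#+ATTR_HTML: :class Section1` line followed by the paragraph, with no raw HTML block around it.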

@jgm
Owner

jgm commented Jun 29, 2017 via email

@alphapapa

> Divs with only an id and no class or key-value pairs are currently unwrapped and the id is inserted as an <<anchor>>.

That seems sensible. If it's feasible, it would be nice to be able to disable the anchors in the output, as well.

> Furthermore, if one of the classes is either quote or center, we wrap the content in the respective block type.

Seems like a clever hack to avoid parsing CSS while still getting the idea. Nice.

> If one of the classes is drawer, we create a drawer.

I'm not sure about this one. Drawers are peculiar to Org-mode, and I wouldn't expect HTML elements with a class name drawer to be the equivalent, unless the author of the HTML was an Org user so fond of drawers that he reimplemented them in his HTML/CSS. :)

> My current preference is to add a rule checking if the content is a single block, and to prefix that block with #+ATTR_HTML if that's the case.

I guess by "single block" you mean a non-nested DIV, i.e. a DIV containing no other DIVs?

I guess that's a decent compromise, as it allows users to remove the raw HTML by removing a single line from the Org output, instead of having to unwrap text from an HTML block. It would still be nice to have some kind of text-and-outline-structure-only mode that would leave out any raw HTML. Unless the user is planning to reconvert the Org to HTML for republishing (unlikely, if he's not the author, and the author would have the HTML to begin with), he probably will have no use for HTML like that in the output.

> Otherwise, we should just keep the id if present and just output the content. This should keep the important information without shoehorning divs into the output. Does this seem reasonable?

I guess you mean, if the DIVs are nested, only output the <<anchor>>, and ignore the DIV attributes?

That's fine with me, although it seems inconsistent. But nesting DIVs in Org syntax isn't feasible, and I don't think trying to hack around that would make sense. HTML-to-Org is practically a one-way conversion for capturing or archiving information in an informal way, so it doesn't need raw HTML in the first place.

Thanks.

@alphapapa

alphapapa commented Jul 2, 2017

By the way, I just found this question on Emacs.SE about this very problem: https://emacs.stackexchange.com/questions/24676/html-to-orgmode-via-pandoc-get-rid-of-all-begin-html-blocks It seems like disabling the output of DIVs in the Org output format would help a lot of people.

@zeltak FYI. :)

@zeltak

zeltak commented Jul 2, 2017

thx @alphapapa :)

@tarleb
Collaborator

tarleb commented Jul 3, 2017

Thanks @alphapapa, this is helpful.

The special handling of drawers was added to reduce information loss during org→org translations in pandoc: reading a drawer with pandoc's org reader gives a div containing drawer as one of its classes to make styling easier when writing as HTML via pandoc.

@jgm: with checking if the content is a single block, I meant to inspect whether the div contains just a single block in the pandoc sense. The downside of this approach is that it would fail with some block types (e.g., lists, org is weird that way) and that it becomes more difficult for users to understand why attributes are retained for some divs but not for others.

I guess I agree with @alphapapa that it's better to keep special cases to a minimum here and that dropping everything but the div's id is the best option.

@alphapapa: We are currently integrating a Lua-based filtering system into pandoc; removing unwanted information should become as easy as writing three short lines of Lua code.
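
To illustrate what "removing unwanted information" with a filter can look like, here is a sketch that uses pandoc's separate JSON filter mechanism (`--filter`) rather than the Lua system mentioned above. It is written in Python with only the standard library. It assumes the JSON AST layout in which a div is encoded as `{"t": "Div", "c": [attr, blocks]}`; the names `unwrap_divs` and `run_filter` and the sample document are invented for the example.

```python
import json
from typing import Any, TextIO


def unwrap_divs(node: Any) -> Any:
    """Return a copy of a pandoc JSON AST with every Div replaced by its contents."""
    if isinstance(node, list):
        out = []
        for item in node:
            item = unwrap_divs(item)  # nested divs are unwrapped first
            if isinstance(item, dict) and item.get("t") == "Div":
                # item["c"] is [attr, blocks]; splice the blocks in place.
                out.extend(item["c"][1])
            else:
                out.append(item)
        return out
    if isinstance(node, dict):
        return {key: unwrap_divs(value) for key, value in node.items()}
    return node


def run_filter(src: TextIO, dst: TextIO) -> None:
    """Read a JSON document from src and write the unwrapped version to dst."""
    json.dump(unwrap_divs(json.load(src)), dst)


# In-memory demonstration on a hand-written sample document.
para = {"t": "Para", "c": [{"t": "Str", "c": "Today"}]}
doc = {
    "pandoc-api-version": [1, 17],
    "meta": {},
    "blocks": [{"t": "Div", "c": [["", ["Section1"], []], [para]]}],
}
print(json.dumps(unwrap_divs(doc)["blocks"]))
```

To use this as an actual filter, the script would end with a call such as `run_filter(sys.stdin, sys.stdout)` (after `import sys`) and be passed to pandoc with `--filter`. Note that this drops the div's id as well as its classes.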

@alphapapa

@tarleb Ah, that's clever about the drawers. And thanks, the Lua filtering sounds great. I assume that the filters can be passed as command-line options? That would be ideal for my use-case.

@tarleb tarleb closed this as completed in 6a6c385 Aug 31, 2017
@alphapapa

@tarleb Thank you very much. Which version of Pandoc will that end up in?

@tarleb
Collaborator

tarleb commented Sep 1, 2017

This will be shipped with pandoc v2.0. You can download a nightly from the unofficial nightly builds repo if you want to test this without building pandoc from source.

@tarleb
Collaborator

tarleb commented Sep 1, 2017

Jgm pointed out on the mailing list that disabling the native_divs extension in the reader is a good workaround for this:

pandoc -f html-native_divs -t org …
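
For anyone scripting this workaround, the format string is just the base format followed by `+extension` or `-extension` suffixes. The following Python sketch builds such a command line without running it. The helper names `format_spec` and `pandoc_command` and the file name `page.html` are placeholders for this example, not part of pandoc.

```python
import shlex
from typing import Iterable, List


def format_spec(base: str, enable: Iterable[str] = (),
                disable: Iterable[str] = ()) -> str:
    """Build a pandoc format string such as 'html-native_divs'."""
    return (base
            + "".join(f"+{ext}" for ext in enable)
            + "".join(f"-{ext}" for ext in disable))


def pandoc_command(source: str, reader: str, writer: str) -> List[str]:
    """Assemble the argument list; nothing is executed here."""
    return ["pandoc", "-f", reader, "-t", writer, source]


# 'page.html' is a placeholder input file.
cmd = pandoc_command("page.html",
                     format_spec("html", disable=["native_divs"]),
                     "org")
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run` on a system where pandoc is installed. Building the arguments as a list avoids shell quoting problems with unusual file names.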

@alphapapa

Is there an easy way to find out which version that was added to? The version I've got on my Ubuntu Trusty system doesn't have that extension.

@tarleb
Collaborator

tarleb commented Sep 2, 2017

I checked the git log: it seems that the native_divs extension was added with pandoc 1.13.

@vyp

vyp commented Sep 2, 2017

@alphapapa The changelog is quite comprehensive.

@jgm
Owner

jgm commented Sep 5, 2017 via email
