Pandoc keeps span and div tags when converting html to org #3771
Comments
Can you say why you expected what you expected (rather than what you got)?
The question is just how native spans should be rendered in Org mode. I don't know enough about Org mode to know all the options - pinging @tarleb.
One method would be to use special blocks, like this:
I've shied away from this solution for two reasons:
It might be the lesser evil though. Input of other org-mode users would be appreciated.
There is no need for these DIVs to be represented in the Org output in the first place. Org organizes content by headings, which are H1, H2, etc. in HTML, and those are already converted correctly. DIVs like these are primarily used in HTML to organize and style content with CSS, which is irrelevant to Org. These DIVs are simply useless clutter in the converted Org output. Instead of a user getting a clean, useful conversion of an HTML document to Org, he must spend extra time manually removing these irrelevant HTML DIV code blocks. Previous Pandoc versions did not output this spurious HTML content; e.g. I'm using Pandoc 1.12.2.1 from Ubuntu 14.04, and it outputs only useful content (see example screenshot at https://github.com/alphapapa/org-protocol-capture-html). I'm sure the ability to track and output these HTML DIVs in some output formats is very useful. But it's actually a detriment to Org output. And, as @tarleb pointed out, they cannot be nested properly in Org's native syntax. If there's a way to preserve them with extra options, that's great, but that's definitely a corner-case compared to a user simply wanting to capture the plain-text and outline-structured content from a simple HTML page. No one is expecting HTML->Pandoc->Org->Pandoc->HTML to produce a 1:1 conversion. So these HTML DIVs should simply be disabled for normal Thanks.
From my point of view, it would be nice to have ways to meet different needs.
I'm sympathetic to the idea that native Divs should not come across as raw HTML in Org.
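Divs with only an id and no class or key-value pairs are currently unwrapped and the id is inserted as an <<anchor>>. Furthermore, if one of the classes is either quote or verse, we wrap the content in the respective block type. If one of the classes is drawer, we create a drawer. My current preference is to add a rule checking if the content is a single block, and to prefix that block with #+ATTR_HTML if that's the case. Otherwise, we should just keep the id if present and just output the content. Does this seem reasonable?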
+++ Albert Krewinkel [Jun 29 17 03:32 ]:
Divs with only an id and no class or key-value pairs are currently
unwrapped and the id is inserted as an <<anchor>>. Furthermore, if one
of the classes is either quote or verse, we wrap the content in the
respective block type. If one of the classes is drawer, we create a
drawer.
My current preference is to add a rule checking if the content is a
single block, and to add #+ATTR_HTML if that's the case. Otherwise, we
should just keep the id if present and just output the content. Does
this seem reasonable?
What do you mean, "checking if the content is a single block"?
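The rules quoted above can be pictured with a small sketch. This is a toy model in Python, not pandoc's writer code: `render_div`, its parameters, and the `drawer_name` default are hypothetical names chosen for illustration, the body is assumed to be Org text that has already been rendered, and placing the anchor on its own line before the content is an assumption. For everything that is not a quote, verse, or drawer, it uses the fallback of keeping only the id.

```python
def render_div(identifier, classes, body, drawer_name="DRAWER"):
    """Toy model of the div rules discussed in this thread (not pandoc's code).

    identifier  -- the div's id, or "" if it has none
    classes     -- list of class names on the div
    body        -- the div's content, assumed to be rendered Org text already
    drawer_name -- hypothetical: how the drawer's name is chosen is not modeled
    """
    # An id survives as an Org anchor; a div without an id adds nothing.
    anchor = f"<<{identifier}>>\n" if identifier else ""

    # "quote" and "verse" classes map to the matching Org block type.
    for kind in ("quote", "verse"):
        if kind in classes:
            tag = kind.upper()
            return f"{anchor}#+BEGIN_{tag}\n{body}\n#+END_{tag}"

    # A "drawer" class becomes an Org drawer.
    if "drawer" in classes:
        return f"{anchor}:{drawer_name}:\n{body}\n:END:"

    # Fallback: drop classes and key-value pairs, keep only the id.
    return anchor + body


# A styled div with an id keeps just the anchor and its text.
print(render_div("intro", ["sidebar"], "Some text."))
```

Under this model, nested divs need no special handling, because each level only ever contributes an optional anchor line and its own content.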
That seems sensible. If it's feasible, it would be nice to be able to disable the anchors in the output, as well.
Seems like a clever hack to avoid parsing CSS while still getting the idea. Nice.
I'm not sure about this one. Drawers are peculiar to Org-mode, and I wouldn't expect HTML elements with a class name
I guess by "single block" you mean a non-nested DIV, i.e. a DIV containing no other DIVs? I guess that's a decent compromise, as it allows users to remove the raw HTML by removing a single line from the Org output, instead of having to unwrap text from a HTML block. It would still be nice to have some kind of text-and-outline-structure-only mode that would leave out any raw HTML. Unless the user is planning to reconvert the Org to HTML for republishing (unlikely, if he's not the author, and the author would have the HTML to begin with), he probably will have no use for HTML like that in the output.
I guess you mean, if the DIVs are nested, only output the That's fine with me, although it seems inconsistent. But nesting DIVs in Org syntax isn't feasible, and I don't think trying to hack around that would make sense. HTML-to-Org is practically a one-way conversion for capturing or archiving information in an informal way, so it doesn't need raw HTML in the first place. Thanks.
By the way, I just found this question on Emacs.SE about this very problem: https://emacs.stackexchange.com/questions/24676/html-to-orgmode-via-pandoc-get-rid-of-all-begin-html-blocks It seems like disabling the output of DIVs in the Org output format would help a lot of people. @zeltak FYI. :)
thx @alphapapa :)
Thanks @alphapapa, this is helpful. The special handling of drawers was added to reduce information loss during org→org translations in pandoc: reading a drawer with pandoc's org reader gives a div containing @jgm: with checking if the content is a single block, I meant to inspect whether the div contains just a single block in the pandoc sense. The downside of this approach is that it would fail with some block types (e.g., lists, org is weird that way) and that it becomes more difficult for users to understand why attributes are retained for some divs but not for others. I guess I agree with @alphapapa that it's better to keep special cases to a minimum here and that dropping everything but the div's id is the best option. @alphapapa: We are currently integrating a Lua-based filtering system into pandoc; removing unwanted information should become as easy as writing three short lines of Lua code.
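The Lua filter itself is not shown in this thread. As a stand-in, here is a minimal sketch of the same idea written against pandoc's JSON AST instead of Lua, in Python with only the standard library. The function name `unwrap` is hypothetical, and the sketch assumes the JSON layout in which `Div` and `Span` nodes carry `[attr, content]` in their `c` field.

```python
import json

# Node types whose wrapper is discarded while their content is kept.
UNWRAP = {"Div", "Span"}


def unwrap(node):
    """Return a copy of a pandoc JSON AST with every Div and Span unwrapped."""
    if isinstance(node, list):
        out = []
        for item in node:
            item = unwrap(item)  # children first, so nested wrappers vanish too
            if isinstance(item, dict) and item.get("t") in UNWRAP:
                # item["c"] is [attr, content]; splice the content in place.
                out.extend(item["c"][1])
            else:
                out.append(item)
        return out
    if isinstance(node, dict):
        return {key: unwrap(value) for key, value in node.items()}
    return node  # strings, numbers, etc. pass through unchanged


# Hand-written sample: a Div holding a Para whose text sits inside a Span.
sample = {
    "blocks": [
        {"t": "Div", "c": [["main", ["content"], []], [
            {"t": "Para", "c": [
                {"t": "Span", "c": [["", ["hl"], []], [{"t": "Str", "c": "hi"}]]}
            ]}
        ]]}
    ]
}
print(json.dumps(unwrap(sample)))
```

To try it on real input, one could wrap the call as `json.dump(unwrap(json.load(sys.stdin)), sys.stdout)` and run the script between `pandoc -f html -t json` and `pandoc -f json -t org`. Note that this version drops ids as well, which goes further than the "keep only the id" option favored above.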
@tarleb Ah, that's clever about the drawers. And thanks, the Lua filtering sounds great. I assume that the filters can be passed as command-line options? That would be ideal for my use-case.
@tarleb Thank you very much. Which version of Pandoc will that end up in?
This will be shipped with pandoc v2.0. You can download a nightly from the unofficial nightly builds repo if you want to test this without building pandoc from source.
Jgm pointed out on the mailing list that adding disabling the
Is there an easy way to find out which version that was added to? The version I've got on my Ubuntu Trusty system doesn't have that extension.
I checked the git log: it seems that the
@alphapapa The changelog is quite comprehensive.
You could look at the cumulative changelog (available on pandoc.org
under Releases).
+++ alphapapa [Sep 01 17 21:38 ]:
… Is there an easy way to find out which version that was added to? The
version I've got on my Ubuntu Trusty system doesn't have that
extension.
Version: 1.19.2.1
pandoc -f html -t org
and
pandoc -f html -t org-raw_html-native_divs-native_spans
make no difference.
Input html:
Both output org:
Expected:
pandoc -f html -t markdown-raw_html-native_divs-native_spans
seems to be no problem.