Support for mhchem package for LaTeX to docx #6668
Comments
Pandoc cannot support all LaTeX packages out there, and I'm afraid that this will be one of the cases where we have to draw the line. But you could use a filter to parse and replace the mhchem statements. See also https://stackoverflow.com/q/56387990/. |
Additional resource: related thread on the mailing list. |
I am the inventor and author of mhchem. Let's discuss how I could be of help here. I wrote the LaTeX package and a JavaScript/TypeScript parser. But I don't know any Haskell, yet. My understanding is that transpilation of JavaScript to Haskell is not possible. Correct? It looks like pandoc should stay 100% Haskell. Correct? So, what would be needed here? I could try to learn Haskell and create a Haskell function that takes an mhchem input string and returns the equivalent LaTeX. (This would be a long term project on my side. I cannot do this during a normal work week. And this is subject to me liking Haskell. I don't want to ruin my vacation by doing something I don't like.) The pandoc team would need to create all the wrapping. Detection of What do you think? Are my assumptions correct? Could this work this way? |
Hello Martin, thanks for chiming in! I believe that you are correct, and that there is no transpilation from JS to Haskell; and yes, this would have to be coded in Haskell to be shipped with pandoc (but see below).
The most useful would be a module like SIunitx. It parses LaTeX commands into pandoc's internal document representation.
I very much like your way of thinking. If Haskell turns out not to be your thing, there are two alternatives. One would be to use Lua. The language shares a lot of concepts with JavaScript, so you'd probably be productive in no time. The idea there would be that pandoc can be instructed to keep those LaTeX commands which it doesn't know how to handle, so we can parse them later. The Lua script could then do the parsing and translating, passing the result back to pandoc. Pandoc includes a Lua interpreter, so running the extension would be possible for everyone with a working pandoc installation. The second alternative is probably the easiest: do the same as described above, but use JavaScript to do the processing. The only disadvantages are that node would be required, plus the performance impact of having to pass the document to and from JS by serializing to JSON. But at least the latter point shouldn't matter too much. The approach would only leave the challenge of having to translate the parsed state into pandoc's internal format (or directly into specific output formats). Possible problems with these approaches could stem from pandoc mis-parsing the mhchem commands as a whole, e.g., creating multiple chunks out of something that's really just one command. Not entirely sure how likely that would be. |
Hi Albert. Thanks for explaining the three approaches. I guess they would work for the easy chemical formulas. But when I think about the more complex ones I don't see this fit. We have So, parsing mhchem as a last step wouldn't work, because it could contain further math. Also, parsing directly into pandoc's internal document representation can only work if it has a "This is LaTeX which needs another parser run" object. And if the internal representation could be nested (e.g. as a subscript inside a LaTeX expression). Would it work if the mhchem parser returns a string with LaTeX syntax? I might be biased here, because my other mhchem implementations work this way. But I don't see the other approaches working nicely for things like |
If you can put in your own macro definition for |
The inner workings of |
No, we're not going to modify pandoc so that it shells out to a JavaScript/TypeScript executable. You should be able to use a Lua filter, though, to do this. The Lua filter would match on Math elements (or RawInline (Format "latex"), if these things occur outside of math mode). It could then pipe the content through your program and reinsert the result. |
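A filter of the kind described here could look roughly like the sketch below. It is not code from this thread: `mhchem-to-tex` is a hypothetical command-line wrapper around the existing JavaScript parser, assumed to read mhchem syntax on stdin and print LaTeX math on stdout, and `convert_ce` and `run_converter` are illustrative helper names. `pandoc.pipe` and `pandoc.Math` are part of pandoc's Lua API.

```lua
-- Sketch only: "mhchem-to-tex" is a HYPOTHETICAL external program (for
-- example a small Node wrapper around the JavaScript mhchem parser) that
-- reads mhchem syntax on stdin and prints LaTeX math on stdout.
CONVERTER = "mhchem-to-tex"

-- Replace every \ce{...} in `text` with convert(inner), where `inner` is
-- the content between the outer braces. %b{} matches balanced braces.
function convert_ce(text, convert)
  local replaced = text:gsub("\\ce(%b{})", function(braced)
    return convert(braced:sub(2, -2))
  end)
  return replaced
end

-- Run the external converter on one formula (only works inside pandoc).
function run_converter(inner)
  local out = pandoc.pipe(CONVERTER, {}, inner)
  return (out:gsub("%s+$", ""))  -- drop trailing whitespace/newline
end

-- \ce{...} inside math mode: rewrite the math string and hand it back.
function Math(el)
  el.text = convert_ce(el.text, run_converter)
  return el
end

-- \ce{...} in text mode arrives as raw LaTeX; turn it into inline math.
function RawInline(el)
  if (el.format == "latex" or el.format == "tex")
      and el.text:match("^\\ce%b{}$") then
    return pandoc.Math("InlineMath", convert_ce(el.text, run_converter))
  end
end
```

Keeping the string rewriting in `convert_ce` separate from the call to the external program means the rewriting can be tried with any stand-in function before the real converter is wired up.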
Nobody suggested that. I am discussing how an mhchem parser in Haskell could be integrated into pandoc's scanning process of LaTeX code. |
That sounds good. I just learned (by reading https://pandoc.org/lua-filters.html) that pandoc does not parse (La)TeX in the first run, but retains the whole string as pandoc.Math (or similar). Yes, my filter could modify this string and hand it back. I'll follow up on this. (So, pandoc parses this math string later on? I guess so, otherwise you would not be able to convert this to docx etc, would you?) |
Sorry, I missed that. In principle, we could do that, but I'm a little hesitant. We try to handle commonly used LaTeX packages, but we can't make our ambition that of supporting everything, or it will balloon out of all proportion. If the needed code is fairly compact, I might consider it. If it's a lot, I'd be more inclined to say: people who need this can use a simple Lua filter and shell out to your existing program. |
Exactly, that is done by the jgm/texmath library, which converts between tex math and several other formats (Word equations, MathML, roff eqn). The library has some limitations, so you'd need to make sure that the output of your script can be processed by it. This may not be the case if it uses a lot of lower-level tex. |
I don't want to brag, but mhchem might qualify as a commonly used LaTeX package. I just looked up some numbers: Chemistry StackExchange has 23724 posts using I'll take a closer look at the Lua approach. Thanks. |
@mhchem You may find this pandoc filter helpful as an example - it's used for converting equations from LaTeX to SVG using MathJax, all in JavaScript/TypeScript. In this case the input would be mhchem syntax and the output would be LaTeX, but the process should be similar. |
If I understand the discussion correctly, then full support would require changes which are forbiddingly high effort. There would have to be an mhchem equivalent to texmath. However, a filter seems to be an adequate if imperfect solution. It seems that there is currently not much we can do, thus closing. Let me know if my analysis was incorrect. |
Let me just say that I wanted to mention that KaTeX has implemented filters for |
Point taken about its wide usage. However, as tarleb notes above, it would be a lot of work to implement the macros. If some mhchem user who knows Haskell wants to do it, we can talk! (Rudimentary support would not be too hard, e.g. for things like |
The most basic thing to support would be
Note that these can be used both in text mode and in math mode. One practical approach would be to generate math from every |
At the moment, there is a terrible lack of functional tex-to-word or tex-to-libreoffice filters. Anything moderately functional, even if hacky, is so much better than nothing at all. Just pass the \ce{...} content through as plain text for a start! The syntax is basically human readable. It is understood that hand-editing is unavoidable. Do you have a systematic way of highlighting/flagging 'imperfectly translated' content, perhaps? That would make the hand-editing process much easier. FYI my best alternative at the moment is to create a PDF, then upload it to an Adobe website, download a Word document, then load it into LibreOffice. The results are terrible. Nearly anything would be an improvement. Perfection is not necessary. |
That's very easy to achieve with a small Lua filter.
(Note: this won't cover |
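A minimal sketch of such a pass-through filter is given below. It is an illustration written for this point in the discussion, not the snippet originally posted; `ce_content` is an illustrative helper name. It only looks at raw LaTeX inlines, so formulas inside math mode are not touched, and it assumes pandoc keeps the unknown `\ce{...}` command as a raw inline (for LaTeX input this should require the `raw_tex` extension, e.g. `-f latex+raw_tex`).

```lua
-- Sketch: pass the content of \ce{...} through as plain text.

-- Return the text between the outer braces, or nil if `raw` is not a
-- single \ce{...} command. %b{} matches a balanced pair of braces.
function ce_content(raw)
  local braced = raw:match("^\\ce(%b{})$")
  if braced then
    return braced:sub(2, -2)
  end
end

-- Called by pandoc for every raw inline; returning nil leaves the
-- element unchanged.
function RawInline(el)
  if el.format == "latex" or el.format == "tex" then
    local content = ce_content(el.text)
    if content then
      return pandoc.Str(content)
    end
  end
end
```

With this, `\ce{CO2}` ends up in the output as the literal text `CO2`, without subscripts, which is the "plain text for a start" behaviour asked for above.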
Actually, since you can use lpeg in writing Lua filters, it might be fun to write a little lpeg grammar for mhchem; then the filter could be fairly fully featured, including subscripts and superscripts and bonds and the like. Occurrences in math mode could be handled in the way suggested above. |
Here's a start on a more sophisticated filter that uses a grammar:
-- For better performance we put these functions in local variables:
local P, S, R, Cf, Cc, Ct, V, Cs, Cg, Cb, B, C, Cmt =
  lpeg.P, lpeg.S, lpeg.R, lpeg.Cf, lpeg.Cc, lpeg.Ct, lpeg.V,
  lpeg.Cs, lpeg.Cg, lpeg.Cb, lpeg.B, lpeg.C, lpeg.Cmt
local whitespacechar = S(" \t\r\n")
local number = R"09"^1 * (P"." * R"09"^1)^-1
local fraction = number * "/" * number
local symbol = C(S"()[]") + (P"\\" * C(S"{}"))
local thinspace = utf8.char(0x2009)
Mhchem = P{ "Formula",
  Formula = Ct(( V"Molecule"
               + V"Math"
               + V"Sup"
               + V"Sub"
               + V"Number"
               + V"Letter"
               + V"Symbol"
               + whitespacechar^1
               )^0) * P(-1);
  Molecule = V"MoleculePart"^1 ;
  MoleculePart = V"Element" * V"ElementSub"^-1 ;
  Element = C(R"AZ" * R"az"^0) / pandoc.Str ;
  ElementSub = C(R"09"^1) / pandoc.Str / pandoc.Subscript ;
  Letter = R"az" / pandoc.Str ;
  Number = fraction + C(number) /
             function(s) return pandoc.Str(s .. thinspace) end ;
  Sup = ((P"^" * (V"InBraces" + C(R"09"^0 * S"+-"^-1)))
        + C(S"+-"))
        / pandoc.Str / pandoc.Superscript ;
  -- Digits followed by an optional sign, mirroring Sup. (With "+" instead
  -- of "*" the sign alternative could never be reached, because
  -- R"09"^0 always succeeds.)
  Sub = (P"_" * (V"InBraces" + C(R"09"^0 * S"+-"^-1))) /
        pandoc.Str / pandoc.Subscript ;
  Math = P"$" * C((P(1) - P"$")^1) * P"$" /
         function(s) return pandoc.Math("InlineMath", s) end ;
  Symbol = symbol / pandoc.Str;
  InBraces = P"{" * C(((P(1) - P"}") + V"InBraces")^0) * P"}"
}
function handleCe(s)
  local inner = s:sub(5,-2) -- strip off \ce{ and }
  local result = lpeg.match(Mhchem, inner)
  if not result then
    io.stderr:write("Could not parse mhchem formula " .. inner .. "\n")
  end
  return result
end
function RawInline(el)
  if (el.format == "latex" or el.format == "tex") and
      el.text:match("\\ce{") then
    return handleCe(el.text)
  end
end
function RawBlock(el)
  if (el.format == "latex" or el.format == "tex") and
      el.text:match("\\ce{") then
    local ils = handleCe(el.text)
    if ils then
      return pandoc.Para(ils)
    end
  end
end
Example of use:
|
The approach above looks interesting. Could you help me understand what it does and when? I understand that this filter is called after the document has been parsed a first time, to distinguish text (
I'm not sure LPeg would be the way to go. I don't know it and just had a quick glance at the documentation. I am impressed by its compactness, but I feel that 1500 lines of TypeScript do not easily fit that grammar in a way that fits into a (rather: my) brain. |
This is a transformation of the AST generated by the LaTeX parser. Currently this filter doesn't do anything to handle |
I suppose that we could do the following to handle math mode.
we could use a table,
We could make
And in math mode, we'd pass in a different table:
EDIT: another option would be to use math mode for all the |
[EDIT: scrubbed this idea because |
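The general idea of handing the formatting code a different table of rendering functions per mode can be illustrated with a small standalone sketch. Everything here is an assumption made for illustration, not the code elided from the comment above: the names `text_render`, `math_render` and `render_molecule`, and the use of plain strings (pandoc Markdown subscripts for text mode, LaTeX for math mode) where a real filter would build pandoc elements.

```lua
-- Sketch: one formatting routine, two interchangeable renderer tables.
-- In a real filter the text-mode renderer would build pandoc elements
-- (Str, Subscript, ...); plain strings are used here so the example runs
-- without pandoc.
text_render = {
  element   = function(x) return x end,
  subscript = function(x) return "~" .. x .. "~" end,  -- Markdown subscript
}

math_render = {
  element   = function(x) return "\\mathrm{" .. x .. "}" end,
  subscript = function(x) return "_{" .. x .. "}" end,
}

-- Format a molecule given as a list of {element, count} pairs, e.g.
-- {{"H", "2"}, {"O"}}, using whichever renderer table is passed in.
function render_molecule(render, parts)
  local out = {}
  for _, part in ipairs(parts) do
    out[#out + 1] = render.element(part[1])
    if part[2] then
      out[#out + 1] = render.subscript(part[2])
    end
  end
  return table.concat(out)
end
```

For water, `render_molecule(text_render, {{"H", "2"}, {"O"}})` gives `H~2~O`, while the same call with `math_render` gives `\mathrm{H}_{2}\mathrm{O}`; the parsing side does not need to know which mode it is in.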
I think I see how to do this now. I'll try to produce a version of this filter that gives decent results on most of your test cases, and then I'll link to it. |
So pleased to see the progress on this! :-) |
I just re-read @hubgit's post above. Hmm, if we already have a MathJax filter, why don't we use that? MathJax has perfect mhchem support. |
You could indeed use a filter that uses MathJax to produce SVGs and then includes the SVGs in the document. But that means all your math and chemical formulas turn into images. Wouldn't you rather have the math and chemical formulas be native Word equations (in docx) or mathml (in DocBook) or eqn (in ms)? |
@mhchem the manual says:
Under what conditions is it each of these things? |
Here is the latest version of the filter. This handles around 70% of the examples in pp. 4-12 of the mhchem manual. To use it, save this as
-- For better performance we put these functions in local variables:
local P, S, R, Cf, Cc, Ct, V, Cs, Cg, Cb, B, C, Cmt =
  lpeg.P, lpeg.S, lpeg.R, lpeg.Cf, lpeg.Cc, lpeg.Ct, lpeg.V,
  lpeg.Cs, lpeg.Cg, lpeg.Cb, lpeg.B, lpeg.C, lpeg.Cmt
local whitespacechar = S(" \t\r\n")
local number = (R"09"^1 * (P"." * R"09"^1)^-1)
local symbol = C(S"()[],") + (P"\\" * C(S"{}"))
local function escapeTeX(x)
  -- Escape backslashes first, so that the backslashes added for % and
  -- braces below are not escaped a second time.
  x = x:gsub("\\", "\\\\")
  -- A literal % in a gsub replacement string must be written as %%.
  x = x:gsub("%%", "\\%%")
  x = x:gsub("([{}])", "\\%1")
  return x
end
local arrows = {
  ["->"] = "\\longrightarrow",
  ["<-"] = "\\longleftarrow",
  ["<->"] = "\\longleftrightarrow",
  ["<-->"] = "\\longleftarrow\\longrightarrow",
  ["<=>"] = "\\rightleftharpoons",
  ["<=>>"] = "\\longRightleftharpoons",
  ["<<=>"] = "\\longLeftrightharpoons"
}
local bonds = {
  ["-"] = "{-}",
  ["="] = "{=}",
  ["#"] = "{\\equiv}",
  ["1"] = "{-}",
  ["2"] = "{=}",
  ["3"] = "{\\equiv}",
  ["..."] = "{\\cdot}{\\cdot}{\\cdot}",
  ["->"] = "{\\rightarrow}",
  ["<-"] = "{\\leftarrow}"
}
-- math mode renderer
local render =
  { str = function(x)
      if #x > 0 then
        return "\\text{" .. escapeTeX(x) .. "}"
      else
        return ""
      end
    end,
    element = function(x) return "\\mathrm{" .. escapeTeX(x) .. "}" end,
    superscript = function(x) return "^{" .. x .. "}" end,
    subscript = function(x) return "_{" .. x .. "}" end,
    number = function(x) return x end,
    math = function(x) return x end,
    fraction = function(n,d) return "\\frac{" .. n .. "}{" .. d .. "}" end,
    fractionparens = function(n,d) return "(" .. n .. "/" .. d .. ")" end,
    greek = function(x) return "\\mathrm{" .. x .. "}" end,
    arrow = function(arr, above, below)
      local result = arrows[arr]
      if above then
        result = "\\overset{" .. above .. "}{" .. result .. "}"
      end
      if below then
        result = "\\underset{" .. below .. "}{" .. result .. "}"
      end
      return result
    end,
    precipitate = function() return "\\downarrow " end,
    gas = function() return "\\uparrow " end,
    bond = function(s) return bonds[s] or s end,
    circa = function() return "{\\sim}" end
  }
Mhchem = P{ "Formula",
  Formula = Ct( V"FormulaPart"^0 ) * P(-1) / table.concat;
  FormulaPart = V"Molecule"
              + V"ReactionArrow"
              + V"Bond"
              + V"Sup"
              + V"Sub"
              + V"Charge"
              + V"Fraction"
              + V"Number"
              + V"Math"
              + V"Precipitate"
              + V"Gas"
              + V"Letters"
              + V"GreekLetter"
              + V"Text"
              + V"EquationOp"
              + V"Space"
              + V"Circa"
              + V"Symbol" ;
  Molecule = V"StoichiometricNumber"^-1 * V"MoleculePart"^1 ;
  MoleculePart = V"Element" * V"ElementSub"^-1 ;
  StoichiometricNumber = (V"Number" + C(R"az") + V"Math" + V"Fraction") *
                           Cc("\\;") * whitespacechar^0 ;
  Element = C(R"AZ" * R"az"^0) / render.element ;
  Charge = B(R"AZ" + R"az" + S")]}") * C(S"+-") * #-R"AZ" /
             render.str / render.superscript ;
  ElementSub = C(R"09"^1) / render.str / render.subscript ;
  Precipitate = whitespacechar^0 * (P"(v)" + P"v") * whitespacechar^0 /
                  render.precipitate ;
  Gas = whitespacechar^0 * (P"(^)" + P"^") * whitespacechar^0 /
          render.gas ;
  Bond = (C(S"#=-") * #R"AZ" / render.bond) +
         (P"\\bond{" * C((P(1) - P"}")^0) * P"}" / render.bond) ;
  Letters = R"az"^1 / render.str ;
  Number = C(number) / render.number;
  NumberOrLetter = V"Number" + V"Letters" ;
  Fraction = (P"(" * V"NumberOrLetter"^1 * P"/" * V"NumberOrLetter"^1 * P")"
               / render.fractionparens) +
             (V"NumberOrLetter" * P"/" * V"NumberOrLetter" / render.fraction);
  Sup = P"^" * (V"InBraces" + (C(S"+-"^-1 * R"09"^0 * S"+-"^-1) / render.str)) /
          render.superscript ;
  Sub = P"_" * (V"InBraces" + (C(S"+-"^-1 * R"09"^0 * S"+-"^-1) / render.str)) /
          render.subscript ;
  Math = P"$" * Cs((V"MathPart" + V"CEPart")^1) * P"$" / render.math ;
  MathPart = C((P(1) - (P"$" + V"CEPart"))^1) ;
  CEPart = P"\\ce{" * Ct((V"FormulaPart" - P"}")^0) * P"}" / table.concat ;
  -- "lambda" and "chi" (and their capitalized forms) were missing from the
  -- list; "xi"/"Xi" appeared twice where "chi"/"Chi" belongs.
  GreekLetter = C(P"\\" *
    (( P"alpha" + P"beta" + P"gamma" + P"delta" + P"epsilon" +
       P"zeta" + P"eta" + P"theta" + P"iota" + P"kappa" + P"lambda" +
       P"mu" + P"nu" + P"xi" + P"omicron" + P"pi" + P"rho" + P"sigma" +
       P"tau" + P"upsilon" + P"phi" + P"chi" + P"psi" + P"omega"
     ) +
     (( P"Alpha" + P"Beta" + P"Gamma" + P"Delta" + P"Epsilon" +
        P"Zeta" + P"Eta" + P"Theta" + P"Iota" + P"Kappa" + P"Lambda" +
        P"Mu" + P"Nu" + P"Xi" + P"Omicron" + P"Pi" + P"Rho" + P"Sigma" +
        P"Tau" + P"Upsilon" + P"Phi" + P"Chi" + P"Psi" + P"Omega" )))) *
    whitespacechar^0 / render.greek ;
  EquationOp = whitespacechar^0 *
               C(P"+" + P"-" + P"=" + (P"\\pm")) *
               whitespacechar^0 /
               render.math;
  ReactionArrow =
    whitespacechar^0 *
    C(P"->" +
      P"<-->" +
      P"<->" +
      P"<-" +
      P"<=>>" +
      P"<=>" +
      P"<<=>") *
    (P"[" * Cs((V"FormulaPart" - P"]")^0) * P"]")^-2 *
    whitespacechar^0 / render.arrow ;
  Text = V"InBraces" ;
  Circa = P"\\ca" * whitespacechar^0 / render.circa ;
  Space = C(whitespacechar^1) / "~" ;
  Symbol = symbol / render.str;
  InBraces = P"{" * Ct((((V"FormulaPart" - S"{}")^1) + V"InBraces")^0) * P"}" /
               table.concat
}
function handleCe(s)
  local inner = s:sub(5,-2) -- strip off \ce{ and }
  local result = lpeg.match(Mhchem, inner)
  if not result then
    io.stderr:write("Could not parse mhchem formula " .. inner .. "\n")
    return s
  end
  return result
end
function RawInline(el)
  if (el.format == "latex" or el.format == "tex") and
      el.text:match("\\ce{") then
    local result = handleCe(el.text)
    if result then
      -- reuse the result instead of parsing the formula a second time
      return pandoc.Math("InlineMath", result)
    end
  end
end
function RawBlock(el)
  local il = RawInline(el)
  if il then
    return pandoc.Para({il})
  end
end
function Math(el)
  el.text = string.gsub(el.text, "(\\ce%b{})", handleCe)
  return el -- return the modified element so the change is kept
end |
I've improved this more and added it to the pandoc/lua-filters repository: There's a sample there that shows how the manual's examples render in docx: test.docx. As you can see, there are a few that don't convert well (due to lack of support in texmath for the symbols used), and there are some minor infelicities, but it's much better than no support! |
From a discussion I had on this topic: maybe we could use the mhchem MathJax plugin to convert to MathML, then read that back into pandoc. This should (theoretically) result in a fairly good conversion. |
@jgm Could you add the change above so that we can write |
@jiucenglou why don't you submit a pull request at https://github.com/pandoc/lua-filters |
I will try a pull request :D Many thanks for your efforts in developing this filter ! |
Thank you very much for the filter! I ran the test.txt but it shows:
|
It works pretty well on my own |
Currently pandoc (2.10.1) still does not parse the \ce{} command correctly when converting LaTeX to docx. For example, if I type \ce{CO2} in my tex file and then convert it to docx using pandoc, the resulting docx file omits the CO2 entirely, leaving a blank space there.
I found that no one has mentioned the mhchem support issue here before. As a user who relies heavily on LaTeX to write about chemistry, I hope pandoc can support it soon. Thanks!