
LaTeX Reader does not handle environment delimiters or block commands defined with a macro #982

Closed
timtylin opened this issue Sep 13, 2013 · 10 comments


@timtylin
Contributor

This issue pertains to all versions up to current HEAD. Perhaps the most succinct example is with equations:

Command
pandoc -f latex -t native --mathjax

Input

\newcommand{\BEQ}{\begin{equation}}
\newcommand{\EEQ}{\end{equation}}

\BEQ
y=x^2
\EEQ

\begin{equation}
y=x^2
\end{equation}

Output

[Para [Str "y=x",Superscript [Str "2"]]
,Para [Math DisplayMath "y=x^2"]]

Considering that environment delimiters are usually prime candidates for macro shorthand, this is probably a genuine shortcoming. As far as I can tell, block commands have a similar problem.

I can try to fix this, but I'm not sure exactly how you would want to approach the problem. We can either:

  1. Monkey-patch all the individual parsers (and all `controlSeq "end"` calls) with a map to `applyMacros`.
  2. Bite the bullet and do multi-pass parsing: change `readLaTeX` to first resolve all macros in the input string, then do the usual parse.
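For reference, option 2 (resolving all macros before the real parse) amounts to a textual fixed-point expansion. A minimal sketch that handles only zero-argument macros; `Macros`, `expandOnce`, and `expandAll` are hypothetical helpers for illustration, not pandoc's API:

```haskell
import qualified Data.Map as M
import Data.List (isPrefixOf)

-- Toy sketch of option 2: substitute zero-argument macros repeatedly
-- until the text stops changing (a fixed point).  Real \newcommand
-- support needs argument parsing, longest-match lookup, and a depth
-- cap to guard against self-referential macros.
type Macros = M.Map String String

expandOnce :: Macros -> String -> String
expandOnce _  []            = []
expandOnce ms s@('\\':rest) =
  case [ (n, b) | (n, b) <- M.toList ms, n `isPrefixOf` rest ] of
    (name, body) : _ -> body ++ expandOnce ms (drop (1 + length name) s)
    []               -> '\\' : expandOnce ms rest
expandOnce ms (c:cs)        = c : expandOnce ms cs

-- Iterate to a fixed point, so macros that expand to other macros
-- (e.g. \BEQ expanding to \begin{equation}) are fully resolved.
expandAll :: Macros -> String -> String
expandAll ms s = let s' = expandOnce ms s
                 in if s' == s then s else expandAll ms s'
```

With `\BEQ`/`\EEQ` defined as in the example above, `expandAll` turns `\BEQ y=x^2 \EEQ` into the plain `\begin{equation} ... \end{equation}` form, after which the ordinary environment parser applies.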
@jgm
Owner

jgm commented Sep 14, 2013

+++ Tim Lin [Sep 13 13 03:33 ]:

I can try to fix this, but I'm not sure exactly how you would want to
approach the problem. We can either:
1. monkey-patch all the individual parsers (and all controlSeq "end"
calls) with a map to applyMacros.
2. bite the bullet and do multi-pass parsing, change readLaTeX to
first resolve all macros on the input string, then do the usual
parse.

Multi-pass parsing might be the cleanest. (I've tried to avoid it in
general, but note that there's already a function to go through the
whole input text and process include files.)

Another option, perhaps a nice middle ground, would be to have a
macro parser in both block and inline (not for macro definitions
but for actual macros), which would resolve the macro and simply
insert the result into the input stream.

Something like

result <- processMacro
inp <- getInput
setInput (result ++ inp)
return mempty

timtylin added a commit to timtylin/scholdoc that referenced this issue Sep 16, 2013
…andleMacros`.

This approach results in additional passes through the document on top of `handleIncludes`, but it's ultimately cleaner and more maintainable. Many uses of macros were not properly handled by the old paradigm, such as when environment begin/end delimiters were defined as macros. The new paradigm preprocesses all macro definitions and recursively applies them to the text until reaching a fixed point, so the LaTeX Reader is free to assume that no macros exist in the input.

Closes jgm#982.

As a side effect, the `macro` parser now produces String instead of Inline.

Note that this commit introduces a regression when `\newcommand` is in the argument of another command. (For example, inside `\pdfstringdefDisableCommands` as in the test files.) Perhaps the most sensible policy for now is to ignore all such occurrences.
@timtylin
Contributor Author

I've made significant progress towards the multi-pass approach.

I first considered the latter suggestion (which requires texmath to export macroParser in order to write an Inlines parser for macro commands), but then realized that the number of edge cases to handle would quickly escalate if I munged it into the blocks/inlines parsing paradigm. Considering the large number of look-aheads already in the code, I don't feel comfortable going down this path. Hence, I bit the bullet and stripped out all the macro definitions.

Like you said, the scaffolding is already in place for the `handleIncludes` preprocessor, so writing another one isn't too bad.

If you'd like, I'll finish cleaning up some regressions (such as the \pdfstringdefDisableCommands{\renewcommand{\sout}{}} command in the unit tests), then issue a pull request for the feature branch.

@jgm
Owner

jgm commented Sep 16, 2013

Hm. I still don't see the problem with the "apply macro and push the result into the input stream" idea.
You'd need something like this:

processMacro :: Monoid m => LP m
processMacro = try $ do
  name <- anyControlSeq
  guard $ name /= "begin" && name /= "end"
  guard $ not $ isBlockCommand name
  star <- option "" (string "*")
  let name' = name ++ star
  rawargs <- withRaw (skipopts *> option "" dimenarg *> many braced)
  let rawcommand = '\\' : name' ++ snd rawargs
  transformed <- applyMacros' rawcommand
  if ('\\':name) `isPrefixOf` transformed
     then mzero  -- no macro was applied, move on
     else getInput >>= setInput . (transformed ++) >> return mempty

Then, just add processMacro to the beginning of inline, and to the beginning of block. (If we did this, we could get rid of some of the similar code that now exists in the handlers for block and inline commands.)

You'd need to do pretty similar processing anyway in a multipass approach, right?

Ultimately it would be good to try both approaches and benchmark to see which has the best performance.

@timtylin
Contributor Author

I was mainly persuaded to go for the preprocessor approach after realizing that it's a closer conceptual approximation of the expansion step in the actual rendering process used by TeX. My intuition is that future macro edge cases related to TeX quirkiness can be fixed more easily in this paradigm, and that in general it provides a cleaner conceptual separation. Without it, anyone trying to help fix macro issues would have to spend quite some time grokking the entire LaTeX reader parsing model.

Performance-wise, the `processMacro` parser is likely to be cheaper to run, but to resolve my test case it would also have to be included in `verbEnv` (to skip verbatim environments, etc.). Before you know it, the number of look-aheads may incur performance penalties asymptotically approaching those of the expansion-preprocessor approach. I do hate how hand-wavy this argument sounds, so I'll try to set up a benchmark comparing the two approaches when I have the chance. Particularly troubling for the `handleMacros` preprocessor is the massive number of single-character tokens appended to the end of a list (perhaps a nice chance to break out `Data.Sequence`?).

I wonder if there are other objections to the macro-expansion preprocessor model aside from performance issues?
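On the single-character-append worry: appending one element at a time to the end of a Haskell list copies the accumulator on every step, while `Data.Sequence`'s `(|>)` is amortized O(1). A toy comparison, with hypothetical helper names, purely to illustrate the asymptotics:

```haskell
import qualified Data.Sequence as Seq
import Data.Sequence (Seq, (|>))
import Data.Foldable (toList)

-- Appending at the right end, one element at a time:
-- for a list, (++ [x]) copies the whole accumulator each step
-- (O(n^2) total); for a Seq, each (|>) is amortized O(1).
appendAllList :: [a] -> [a] -> [a]
appendAllList = foldl (\acc x -> acc ++ [x])

appendAllSeq :: Seq a -> [a] -> Seq a
appendAllSeq = foldl (|>)
```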

@timtylin
Contributor Author

I partially take back what I said about using `Data.Sequence` leading to substantial performance gains. I just realized that Parsec's `many` (used in `many macroToken`) is already defined as

many :: ParsecT s u m a -> ParsecT s u m [a]
many p
  = do xs <- manyAccum (:) p
       return (reverse xs)

@jgm
Owner

jgm commented Sep 17, 2013

No objections in principle to the two-pass method; it just seems a bit uglier to me, but you may be right that it's simpler and better.

I think that eventually we should probably put handleMacros right into readLaTeX from Text.Pandoc.LaTeX, rather than applying it in pandoc.hs. (In addition, Text.Pandoc.LaTeX should export something like readLaTeXWithIncludes in the IO monad, so we can get this out of pandoc.hs too.)

On the performance problem with lots of one-character strings: one simple improvement would be to replace the `anyChar` with something like `many (noneOf "\\%")`. It would still be good to know what the performance impact of `handleMacros` is.
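The chunking suggestion can be illustrated outside Parsec as well: stop only at the characters that can start a macro or a comment, and emit everything in between as one piece instead of one token per character. A minimal sketch with a hypothetical `chunks` helper:

```haskell
-- Split text into maximal runs of ordinary characters, stopping only
-- at '\\' and '%'.  This mirrors replacing a per-character anyChar
-- loop with many (noneOf "\\%"): far fewer, larger string pieces.
chunks :: String -> [String]
chunks [] = []
chunks s  = case break (`elem` "\\%") s of
  (pre, [])       -> [pre]
  ("",  c : rest) -> [c] : chunks rest
  (pre, c : rest) -> pre : [c] : chunks rest
```

Concatenating the chunks always reproduces the input, so a consumer can treat each run as a single token.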

@jgm
Owner

jgm commented Mar 9, 2017

This would also handle #934

@jgm jgm added this to the pandoc 2.0 milestone Mar 9, 2017
@jgm
Owner

jgm commented Mar 9, 2017

Be sure to look at these related macro issues: #3236, #987, #2118, #1390, #2114

@jgm
Owner

jgm commented Mar 9, 2017

One issue with modifying the input stream is that you lose good source position information.
(If the macro adds lines, line numbers of errors will be misleading.)

@jgm
Owner

jgm commented Jun 28, 2017

I've got an idea for how to solve this, keeping source position info.

Instead of parsing a [Char], we parse a stream of tokens (produced by a tokenizer).

data Tok = Tok (Line, Column) TokType Text
data TokType = CtrlSeq | Spaces | Newline | Symbol | Word

Then, when we expand a macro, tokenize the expansion and set the source position of each token to the source position of the original unexpanded macro.

This would be a significant rewrite but might have other advantages as well.
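The position-stamping idea above can be sketched directly from the `Tok` type: when a macro expands, retokenize the expansion and overwrite each new token's position with the macro's own. A minimal illustration, using `String` instead of `Text` and a hypothetical `expandAt` helper:

```haskell
type Line   = Int
type Column = Int

data TokType = CtrlSeq | Spaces | Newline | Symbol | Word
  deriving (Eq, Show)

data Tok = Tok (Line, Column) TokType String
  deriving (Eq, Show)

-- When a macro at position pos expands, stamp every token of the
-- (re)tokenized expansion with pos, so later parse errors point at
-- the unexpanded macro in the user's source, not at synthetic text.
expandAt :: (Line, Column) -> [Tok] -> [Tok]
expandAt pos = map (\(Tok _ ty t) -> Tok pos ty t)
```

Every token produced by the expansion then reports the line and column of the original `\BEQ`-style macro, no matter how many lines the expansion adds.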

jgm added a commit that referenced this issue Jul 5, 2017
This rewrite is primarily motivated by the need to
get macros working properly (#982, #934, #3779, #3236,
 #1390, #2888, #2118).

We now tokenize the input text, then parse the token stream.
Macros modify the token stream, so they should now be effective in any
context, including math. (Thus, we no longer need the clunky macro
processing capacities of texmath.)

A custom state LaTeXState is used instead of ParserState.
This, plus the tokenization, will require some rewriting
of the exported functions rawLaTeXInline, inlineCommand,
rawLaTeXBlock.
jgm added a commit that referenced this issue Jul 6, 2017
This rewrite is primarily motivated by the need to
get macros working properly (#982, #934, #3779, #3236,
 #1390, #2888, #2118).  A side benefit is that the
reader is significantly faster (27s -> 19s in one
benchmark, and there is a lot of room for further
optimization).

We now tokenize the input text, then parse the token stream.

Macros modify the token stream, so they should now be effective
in any context, including math. Thus, we no longer need the clunky
macro processing capacities of texmath.

A custom state LaTeXState is used instead of ParserState.
This, plus the tokenization, will require some rewriting
of the exported functions rawLaTeXInline, inlineCommand,
rawLaTeXBlock.

* Added Text.Pandoc.Readers.LaTeX.Types (new exported module).
  Exports Macro, Tok, TokType, Line, Column.  [API change]
* Text.Pandoc.Parsing: adjusted type of `insertIncludedFile`
  so it can be used with token parser.
* Removed old texmath macro stuff from Parsing.
  Use Macro from Text.Pandoc.Readers.LaTeX.Types instead.
* Removed texmath macro material from Markdown reader.
* Changed types for Text.Pandoc.Readers.LaTeX's
  rawLaTeXInline and rawLaTeXBlock.  (Both now return a String,
  and they are polymorphic in state.)
* Added orgMacros field to OrgState.  [API change]
* Removed readerApplyMacros from ReaderOptions.
  Now we just check the `latex_macros` reader extension.
jgm added a commit that referenced this issue Jul 7, 2017
@jgm jgm closed this as completed in 0feb750 Jul 7, 2017