
LaTeX Reader does not handle environment delimiters or block commands defined with a macro #982

Closed
timtylin opened this issue Sep 13, 2013 · 10 comments


@timtylin
Contributor

This issue pertains to all versions up to current HEAD. Perhaps the most succinct example is with equations:

Command
pandoc -f latex -t native --mathjax

Input

\newcommand{\BEQ}{\begin{equation}}
\newcommand{\EEQ}{\end{equation}}

\BEQ
y=x^2
\EEQ

\begin{equation}
y=x^2
\end{equation}

Output

[Para [Str "y=x",Superscript [Str "2"]]
,Para [Math DisplayMath "y=x^2"]]

Considering that environment delimiters are usually prime candidates for macro shorthand, this is probably a genuine shortcoming. As far as I can tell, block commands have a similar problem.

I can try to fix this, but I'm not sure exactly how you would want to approach the problem. We can either:

  1. Monkey-patch all the individual parsers (and all `controlSeq "end"` calls) with a map to `applyMacros`.
  2. Bite the bullet and do multi-pass parsing: change `readLaTeX` to first resolve all macros in the input string, then do the usual parse.
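For reference, option 2 (resolving all macros before the real parse) amounts to a textual fixed-point expansion. A minimal sketch that handles only zero-argument macros; `Macros`, `expandOnce`, and `expandAll` are hypothetical helpers for illustration, not pandoc's API:

```haskell
import qualified Data.Map as M
import Data.List (isPrefixOf)

-- Toy sketch of option 2: substitute zero-argument macros repeatedly
-- until the text stops changing (a fixed point).  Real \newcommand
-- support needs argument parsing, longest-match lookup, and a depth
-- cap to guard against self-referential macros.
type Macros = M.Map String String

expandOnce :: Macros -> String -> String
expandOnce _  []            = []
expandOnce ms s@('\\':rest) =
  case [ (n, b) | (n, b) <- M.toList ms, n `isPrefixOf` rest ] of
    (name, body) : _ -> body ++ expandOnce ms (drop (1 + length name) s)
    []               -> '\\' : expandOnce ms rest
expandOnce ms (c:cs)        = c : expandOnce ms cs

-- Iterate to a fixed point, so macros that expand to other macros
-- (e.g. \BEQ expanding to \begin{equation}) are fully resolved.
expandAll :: Macros -> String -> String
expandAll ms s = let s' = expandOnce ms s
                 in if s' == s then s else expandAll ms s'
```

With `\BEQ`/`\EEQ` defined as in the example above, `expandAll` turns `\BEQ y=x^2 \EEQ` into the plain `\begin{equation} ... \end{equation}` form, after which the ordinary environment parser applies.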
@jgm
Owner

jgm commented Sep 14, 2013

+++ Tim Lin [Sep 13 13 03:33 ]:

I can try to fix this, but I'm not sure exactly how you would want to
approach the problem. We can either:
1. monkey-patch all the individual parsers (and all controlSeq "end"
calls) with a map to applyMacros.
2. bite the bullet and do multi-pass parsing, change readLaTeX to
first resolve all macros on the input string, then do the usual
parse.

Multi-pass parsing might be the cleanest. (I've tried to avoid it in
general, but note that there's already a function to go through the
whole input text and process include files.)

Another option, perhaps a nice middle ground, would be to have a
macro parser in both block and inline (not for macro definitions
but for actual macros), which would resolve the macro and simply
insert the result into the input stream.

Something like

result <- processMacro
inp <- getInput
setInput (result ++ inp)
return mempty

timtylin added a commit to timtylin/scholdoc that referenced this issue Sep 16, 2013
…andleMacros`.

This approach results in additional passes through the document on top of `handleIncludes`, but it's ultimately cleaner and more maintainable. Many uses of macros were not properly handled by the old paradigm, such as when environment begin/end delimiters were defined as macros. The new paradigm preprocesses all macro definitions and recursively applies them to the text until reaching a fixed point, so the LaTeX Reader is free to assume that no macros exist in the input.

Closes jgm#982.

As a side effect, the `macro` parser now produces String instead of Inline.

Note that this commit introduces a regression when `\newcommand` is in the argument of another command. (For example, inside `\pdfstringdefDisableCommands` as in the test files.) Perhaps the most sensible policy for now is to ignore all such occurrences.
@timtylin
Contributor Author

I've made significant progress towards the multi-pass approach.

I first considered the latter suggestion (which requires texmath to export macroParser in order to write an Inlines parser for macro commands), but then realized that the number of edge cases to handle would quickly escalate if I munged it into the blocks/inlines parsing paradigm. Considering the large number of look-aheads already in the code, I don't feel comfortable going down this path. Hence, I bit the bullet and stripped out all the macro definitions.

Like you said, the scaffolding is already in place for the `handleIncludes` preprocessor, so writing another one isn't too bad.

If you'd like, I'll finish cleaning up some regressions (such as the \pdfstringdefDisableCommands{\renewcommand{\sout}{}} command in the unit tests), then issue a pull request for the feature branch.

@jgm
Owner

jgm commented Sep 16, 2013

Hm. I still don't see the problem with the "apply macro and push the result into the input stream" idea.
You'd need something like this:

processMacro :: Monoid m => LP m
processMacro = try $ do
  name <- anyControlSeq
  guard $ name /= "begin" && name /= "end"
  guard $ not $ isBlockCommand name
  star <- option "" (string "*")
  let name' = name ++ star
  rawargs <- withRaw (skipopts *> option "" dimenarg *> many braced)
  let rawcommand = '\\' : name' ++ snd rawargs
  transformed <- applyMacros' rawcommand
  if ('\\':name) `isPrefixOf` transformed
     then mzero  -- no macro was applied, move on
     else getInput >>= setInput . (transformed ++) >> return mempty

Then, just add processMacro to the beginning of inline, and to the beginning of block. (If we did this, we could get rid of some of the similar code that now exists in the handlers for block and inline commands.)

You'd need to do pretty similar processing anyway in a multipass approach, right?

Ultimately it would be good to try both approaches and benchmark to see which has the best performance.

@timtylin
Contributor Author

I was mainly persuaded to go for the preprocessor approach after realizing that it's a closer conceptual approximation of the expansion step in the actual rendering process used by TeX. My intuition is that future macro edge cases related to TeX quirkiness can be fixed more easily in this paradigm, and that in general it provides a cleaner conceptual separation. Without it, anyone trying to help fix macro issues would have to spend quite some time grokking the entire LaTeX reader parsing model.

Performance-wise, the `processMacro` parser is likely to be cheaper to run, but to resolve my test case it would also have to be included in `verbEnv` (to skip verbatim environments, etc.). Before you know it, the number of look-aheads may incur performance penalties asymptotically approaching those of the expansion-preprocessor approach. I do hate how hand-wavy this argument sounds, so I'll try to set up a benchmark comparing the two approaches when I have the chance. Particularly troubling for the `handleMacros` preprocessor is the massive number of single-character tokens appended to the end of a list (perhaps a nice chance to break out `Data.Sequence`?).

I wonder if there are other objections to the macro-expansion preprocessor model aside from performance issues?
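On the single-character-append worry: appending one element at a time to the end of a Haskell list copies the accumulator on every step, while `Data.Sequence`'s `(|>)` is amortized O(1). A toy comparison, with hypothetical helper names, purely to illustrate the asymptotics:

```haskell
import qualified Data.Sequence as Seq
import Data.Sequence (Seq, (|>))
import Data.Foldable (toList)

-- Appending at the right end, one element at a time:
-- for a list, (++ [x]) copies the whole accumulator each step
-- (O(n^2) total); for a Seq, each (|>) is amortized O(1).
appendAllList :: [a] -> [a] -> [a]
appendAllList = foldl (\acc x -> acc ++ [x])

appendAllSeq :: Seq a -> [a] -> Seq a
appendAllSeq = foldl (|>)
```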

@timtylin
Contributor Author

I partially take back what I said about using `Data.Sequence` leading to substantial performance gains. I just realized that Parsec's `many` (used in `many macroToken`) is already defined as

many :: ParsecT s u m a -> ParsecT s u m [a]
many p
  = do xs <- manyAccum (:) p
       return (reverse xs)

@jgm
Owner

jgm commented Sep 17, 2013

No objections in principle to the two-pass method; it just seems a bit uglier to me, but you may be right that it's simpler and better.

I think that eventually we should probably put handleMacros right into readLaTeX from Text.Pandoc.LaTeX, rather than applying it in pandoc.hs. (In addition, Text.Pandoc.LaTeX should export something like readLaTeXWithIncludes in the IO monad, so we can get this out of pandoc.hs too.)

On the performance problem with lots of one-character strings: one simple improvement would be to replace the `anyChar` with something like `many (noneOf "\\%")`. It would still be good to know what the performance impact of `handleMacros` is.
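The chunking suggestion can be illustrated outside Parsec as well: stop only at the characters that can start a macro or a comment, and emit everything in between as one piece instead of one token per character. A minimal sketch with a hypothetical `chunks` helper:

```haskell
-- Split text into maximal runs of ordinary characters, stopping only
-- at '\\' and '%'.  This mirrors replacing a per-character anyChar
-- loop with many (noneOf "\\%"): far fewer, larger string pieces.
chunks :: String -> [String]
chunks [] = []
chunks s  = case break (`elem` "\\%") s of
  (pre, [])       -> [pre]
  ("",  c : rest) -> [c] : chunks rest
  (pre, c : rest) -> pre : [c] : chunks rest
```

Concatenating the chunks always reproduces the input, so a consumer can treat each run as a single token.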

@jgm
Owner

jgm commented Mar 9, 2017

This would also handle #934

@jgm jgm added this to the pandoc 2.0 milestone Mar 9, 2017
@jgm
Owner

jgm commented Mar 9, 2017

Be sure to look at these related macro issues: #3236, #987, #2118, #1390, #2114

@jgm
Owner

jgm commented Mar 9, 2017

One issue with modifying the input stream is that you lose good source position information.
(If the macro adds lines, line numbers of errors will be misleading.)

@jgm
Owner

jgm commented Jun 28, 2017

I've got an idea for how to solve this, keeping source position info.

Instead of parsing a [Char], we parse a stream of tokens (produced by a tokenizer).

data Tok = Tok (Line, Column) TokType Text
data TokType = CtrlSeq | Spaces | Newline | Symbol | Word

Then, when we expand a macro, tokenize the expansion and set the source position of each token to the source position of the original unexpanded macro.

This would be a significant rewrite but might have other advantages as well.
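The position-stamping idea above can be sketched directly from the `Tok` type: when a macro expands, retokenize the expansion and overwrite each new token's position with the macro's own. A minimal illustration, using `String` instead of `Text` and a hypothetical `expandAt` helper:

```haskell
type Line   = Int
type Column = Int

data TokType = CtrlSeq | Spaces | Newline | Symbol | Word
  deriving (Eq, Show)

data Tok = Tok (Line, Column) TokType String
  deriving (Eq, Show)

-- When a macro at position pos expands, stamp every token of the
-- (re)tokenized expansion with pos, so later parse errors point at
-- the unexpanded macro in the user's source, not at synthetic text.
expandAt :: (Line, Column) -> [Tok] -> [Tok]
expandAt pos = map (\(Tok _ ty t) -> Tok pos ty t)
```

Every token produced by the expansion then reports the line and column of the original `\BEQ`-style macro, no matter how many lines the expansion adds.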

jgm added a commit that referenced this issue Jul 5, 2017
This rewrite is primarily motivated by the need to
get macros working properly (#982, #934, #3779, #3236,
 #1390, #2888, #2118).

We now tokenize the input text, then parse the token stream.
Macros modify the token stream, so they should now be effective in any
context, including math. (Thus, we no longer need the clunky macro
processing capacities of texmath.)

A custom state LaTeXState is used instead of ParserState.
This, plus the tokenization, will require some rewriting
of the exported functions rawLaTeXInline, inlineCommand,
rawLaTeXBlock.
jgm added a commit that referenced this issue Jul 6, 2017
This rewrite is primarily motivated by the need to
get macros working properly (#982, #934, #3779, #3236,
 #1390, #2888, #2118).  A side benefit is that the
reader is significantly faster (27s -> 19s in one
benchmark, and there is a lot of room for further
optimization).

We now tokenize the input text, then parse the token stream.

Macros modify the token stream, so they should now be effective
in any context, including math. Thus, we no longer need the clunky
macro processing capacities of texmath.

A custom state LaTeXState is used instead of ParserState.
This, plus the tokenization, will require some rewriting
of the exported functions rawLaTeXInline, inlineCommand,
rawLaTeXBlock.

* Added Text.Pandoc.Readers.LaTeX.Types (new exported module).
  Exports Macro, Tok, TokType, Line, Column.  [API change]
* Text.Pandoc.Parsing: adjusted type of `insertIncludedFile`
  so it can be used with token parser.
* Removed old texmath macro stuff from Parsing.
  Use Macro from Text.Pandoc.Readers.LaTeX.Types instead.
* Removed texmath macro material from Markdown reader.
* Changed types for Text.Pandoc.Readers.LaTeX's
  rawLaTeXInline and rawLaTeXBlock.  (Both now return a String,
  and they are polymorphic in state.)
* Added orgMacros field to OrgState.  [API change]
* Removed readerApplyMacros from ReaderOptions.
  Now we just check the `latex_macros` reader extension.
jgm added a commit that referenced this issue Jul 7, 2017
@jgm jgm closed this as completed in 0feb750 Jul 7, 2017