LaTeX Reader does not handle environment delimiters or block commands defined with a macro #982
+++ Tim Lin [Sep 13 13 03:33 ]:
Multi-pass parsing might be the cleanest. (I've tried to avoid it in the past […].) Another option, perhaps a nice middle ground, would be to have […]. Something like […].
[…] `handleMacros`. This approach results in additional passes through the document, on top of `handleIncludes`, but it's ultimately cleaner and more maintainable. Many uses of macros were not properly handled by the old paradigm, for example when environment begin/end delimiters were themselves defined as macros. The new paradigm preprocesses all macro definitions and recursively applies them to the text until a fixed point is reached, so the LaTeX Reader is free to assume that no macros exist in the input. Closes jgm#982. As a side effect, the `macro` parser now produces a String instead of an Inline. Note that this commit introduces a regression when `\newcommand` appears in the argument of another command (for example, inside `\pdfstringdefDisableCommands`, as in the test files). Perhaps the most sensible policy for now is to ignore all such occurrences.
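The fixed-point expansion described above can be sketched as follows. This is a hypothetical illustration, not pandoc's actual implementation: `expandOnce` stands in for a single macro-substitution pass over the text, and the iteration cap is an assumed guard against non-terminating (recursive) macro definitions.

```haskell
-- Hypothetical sketch of expanding macros to a fixed point.
-- `expandOnce` stands in for one substitution pass over the text.
expandToFixedPoint :: (String -> String) -> String -> String
expandToFixedPoint expandOnce = go (100 :: Int)
  where
    go 0 s = s                      -- give up: likely a recursive macro
    go n s =
      let s' = expandOnce s
      in if s' == s
           then s                   -- fixed point: no macros left to expand
           else go (n - 1) s'
```

After such a pass the parser can assume its input is macro-free, which is the conceptual separation the commit message argues for.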
I've made significant progress towards the multi-pass approach. I first considered the latter suggestion (which requires texmath to export […]). Like you said, the scaffolding is already in place for the […]. If you'd like, I'll finish cleaning up some regressions (such as the […]).
Hm. I still don't see the problem with the "apply macro and push the result into the input stream" idea.

```haskell
processMacro :: Monoid m => LP m
processMacro = try $ do
  name <- anyControlSeq
  guard $ name /= "begin" && name /= "end"
  guard $ not $ isBlockCommand name
  star <- option "" (string "*")
  let name' = name ++ star
  rawargs <- withRaw (skipopts *> option "" dimenarg *> many braced)
  let rawcommand = '\\' : name ++ star ++ snd rawargs
  transformed <- applyMacros' rawcommand
  if ('\\':name) `isPrefixOf` transformed
     then mzero -- no macro was applied, move on
     else getInput >>= setInput . (transformed ++) >> return mempty
```

Then, just add […]. You'd need to do pretty similar processing anyway in a multi-pass approach, right? Ultimately it would be good to try both approaches and benchmark them to see which has the best performance.
I was mainly persuaded to go for the preprocessor approach after realizing that it's a closer conceptual approximation to the expansion step in the actual rendering process used by TeX. My intuition is that future macro edge cases related to TeX quirkiness can be fixed more easily in this paradigm, and that in general it provides a cleaner conceptual separation. Without it, anyone trying to help fix macro issues would have to spend quite some time grokking the entire LaTeX reader parsing model. Performance-wise, the […]. I wonder if there are other objections to the macro-expansion preprocessor model aside from performance issues?
I partially take back what I said about using Data.Sequence leading to substantial performance gains. I just realized that Parsec's `many` builds its result by prepending and then reversing:

```haskell
many :: ParsecT s u m a -> ParsecT s u m [a]
many p = do
  xs <- manyAccum (:) p
  return (reverse xs)
```
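The point about the final `reverse` can be illustrated with a small comparison (a sketch, not pandoc or Parsec code): building a list with `(:)` requires one reversal at the end, whereas `Data.Sequence` supports amortized O(1) append on the right, so no reversal is needed.

```haskell
import qualified Data.Sequence as Seq
import Data.Sequence ((|>))
import Data.Foldable (toList)

-- List-style accumulation: prepend each element, then reverse once at the end.
collectList :: [a] -> [a]
collectList = reverse . foldl (flip (:)) []

-- Seq-style accumulation: append directly; (|>) is amortized O(1).
collectSeq :: [a] -> [a]
collectSeq = toList . foldl (|>) Seq.empty
```

Both produce the same list; the difference is only in how the intermediate structure is built.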
No objections in principle to the two-pass method; it just seems a bit uglier to me, but you may be right that it's simpler and better. I think that eventually we should probably put `handleMacros` right into `readLaTeX` from Text.Pandoc.LaTeX, rather than applying it in pandoc.hs. (In addition, Text.Pandoc.LaTeX should export something like `readLaTeXWithIncludes` in the IO monad, so we can get this out of pandoc.hs too.) On the performance problem with lots of one-character strings: one simple improvement would be to replace the […].
This would also handle #934.
One issue with modifying the input stream is that you lose good source position information.
I've got an idea for how to solve this, keeping source position info. Instead of parsing a String directly, we could first tokenize the input and parse the token stream, with each token carrying a source position. Then, when we expand a macro, tokenize the expansion and set the source position of each token to the source position of the original unexpanded macro. This would be a significant rewrite but might have other advantages as well.
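The retagging idea can be sketched like this. The type and field names here are hypothetical simplifications (the real reader's token type is more elaborate): every token records where it came from, and expanded tokens inherit the position of the macro call that produced them.

```haskell
-- Hypothetical token type carrying a source position.
data SrcPos = SrcPos { srcLine :: Int, srcCol :: Int }
  deriving (Eq, Show)

data Tok = Tok { tokPos :: SrcPos, tokText :: String }
  deriving (Eq, Show)

-- When a macro call at `macroPos` is expanded, every token in the
-- expansion is retagged with the call site's position, so errors
-- still point at the unexpanded source.
retagExpansion :: SrcPos -> [Tok] -> [Tok]
retagExpansion macroPos = map (\t -> t { tokPos = macroPos })
```

This is why expansion can happen in the token stream without destroying error reporting: positions are preserved by construction.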
This rewrite is primarily motivated by the need to get macros working properly (#982, #934, #3779, #3236, #1390, #2888, #2118). A side benefit is that the reader is significantly faster (27s -> 19s in one benchmark, and there is a lot of room for further optimization). We now tokenize the input text, then parse the token stream. Macros modify the token stream, so they should now be effective in any context, including math. Thus, we no longer need the clunky macro processing capacities of texmath. A custom state LaTeXState is used instead of ParserState. This, plus the tokenization, will require some rewriting of the exported functions rawLaTeXInline, inlineCommand, rawLaTeXBlock.

* Added Text.Pandoc.Readers.LaTeX.Types (new exported module). Exports Macro, Tok, TokType, Line, Column. [API change]
* Text.Pandoc.Parsing: adjusted type of `insertIncludedFile` so it can be used with the token parser.
* Removed old texmath macro stuff from Parsing. Use Macro from Text.Pandoc.Readers.LaTeX.Types instead.
* Removed texmath macro material from the Markdown reader.
* Changed types of Text.Pandoc.Readers.LaTeX's rawLaTeXInline and rawLaTeXBlock. (Both now return a String, and they are polymorphic in state.)
* Added orgMacros field to OrgState. [API change]
* Removed readerApplyMacros from ReaderOptions. Now we just check the `latex_macros` reader extension.
This issue pertains to all versions up to current HEAD. Perhaps the most succinct example is with equations:
Command:

```
pandoc -f latex -t native --mathjax
```

Input: […]

Output: […]
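The original input and output attachments did not survive; a minimal input of the kind described (environment delimiters defined via macros; a hypothetical reconstruction, not the reporter's exact file) would be:

```latex
\newcommand{\beq}{\begin{equation}}
\newcommand{\eeq}{\end{equation}}
\beq
e = mc^2
\eeq
```

With macros unexpanded, `\beq` and `\eeq` are not recognized as environment delimiters, so the equation is not parsed as math.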
Considering that environment delimiters are usually prime candidates for macro shorthand, this is probably a genuine shortcoming. As far as I can tell, block commands have a similar problem.
I can try to fix this, but I'm not sure exactly how you would want to approach the problem. We can either:

1. augment the environment-delimiter parsing (the `controlSeq "end"` and related calls) with a map to `applyMacros`, or
2. change `readLaTeX` to first resolve all macros on the input string, then do the usual parse.