Conversion of ipynb to pdf fails because of ANSI escape codes in stacktrace #5633

Closed
wstomv opened this issue Jul 5, 2019 · 11 comments

@wstomv

wstomv commented Jul 5, 2019

This issue concerns the conversion of a Jupyter notebook (-f ipynb) to PDF (-t latex -o *.pdf) when the notebook contains a Python stack trace (example attached) produced by a runtime error in a code cell. Such a stack trace includes ANSI escape sequences that color parts of the output, and LaTeX chokes on the result that pandoc produces. The message with pdflatex as the engine:

Error producing PDF.
! Package inputenc Error: Unicode character ^^[ (U+001B)
(inputenc)                not set up for use with LaTeX.

See the inputenc package documentation for explanation.
Type  H <return>  for immediate help.
 ...                                              
                                                  
l.102 \end{verbatim}

Try running pandoc with --pdf-engine=xelatex.

With xelatex as engine (lualatex responds similarly):

Error producing PDF.
! Text line contains an invalid character.
l.96 ^^[

These ANSI escape sequences are stored as plain ASCII in the Jupyter notebook (as \u001b escapes inside JSON strings), but they appear as raw escape characters (U+001B) in the LaTeX source that pandoc produces, inside a verbatim environment.
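To illustrate the point about the encoding, here is a minimal Python sketch (not pandoc's own code) showing how a JSON string that is pure ASCII on disk decodes to text containing the raw ESC control character. The sample string is taken from the traceback in the attached notebook; the variable names are made up for illustration.

```python
import json

# One traceback line as stored in the .ipynb file: pure ASCII, with the
# escape character written as the six characters \u001b.
raw_json = r'"\u001b[0;31mZeroDivisionError\u001b[0m: division by zero"'

decoded = json.loads(raw_json)

# On disk the string is ASCII only ...
assert raw_json.isascii()
# ... but after JSON decoding it starts with the control character ESC (0x1B),
# which is what ends up inside the verbatim environment.
assert decoded[0] == "\x1b"
print(repr(decoded))
```

Any JSON-aware reader will perform this decoding, so the control characters reach the writer unless something removes them along the way.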

Further details:

  • Version of pandoc:
$ pandoc -v
pandoc 2.7.3
Compiled with pandoc-types 1.17.5.4, texmath 0.11.2.2, skylighting 0.8.1
  • Command line:
$ pandoc -f ipynb -t latex --pdf-engine=xelatex -o Stacktrace.pdf Stacktrace.ipynb

A workaround would be nice to have.

(One option is to strip the ASCII-encoded ANSI escape sequences from the notebook before converting it with pandoc, e.g. using sed. Alternatively, the ANSI escape sequences can be stripped from the LaTeX source generated by pandoc, e.g. using ansifilter, and the result run through a LaTeX-to-PDF engine separately. Neither option is very appealing.)
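The first workaround could also be done with a JSON-aware script instead of sed. The following is only a sketch: it assumes an nbformat 4 notebook (top-level "cells", each with an optional "outputs" list) and only removes color/style sequences of the form ESC[...m. The function names and the regular expression are illustrative, not part of pandoc or Jupyter.

```python
import json
import re

# Matches SGR (color/style) sequences such as ESC[0;31m and ESC[0m.
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")


def strip_ansi(text):
    return ANSI_SGR.sub("", text)


def strip_lines(value):
    """Notebook text fields hold either one string or a list of strings."""
    if isinstance(value, list):
        return [strip_ansi(line) for line in value]
    return strip_ansi(value)


def clean_notebook(nb):
    """Strip color codes from tracebacks, stream text, and text/plain data (in place)."""
    for cell in nb.get("cells", []):
        for output in cell.get("outputs", []):
            if "traceback" in output:
                output["traceback"] = strip_lines(output["traceback"])
            if "text" in output:
                output["text"] = strip_lines(output["text"])
            data = output.get("data", {})
            if "text/plain" in data:
                data["text/plain"] = strip_lines(data["text/plain"])
    return nb


def clean_file(src, dst):
    """Read the notebook at src, strip color codes, and write the result to dst."""
    with open(src, encoding="utf-8") as f:
        nb = json.load(f)
    with open(dst, "w", encoding="utf-8") as f:
        json.dump(clean_notebook(nb), f, indent=1)
```

For example, clean_file("Stacktrace.ipynb", "Stacktrace.clean.ipynb") would write a copy (the output file name is hypothetical) that can then be passed to the pandoc command above.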

@jgm
Owner

jgm commented Jul 6, 2019

For convenience, this is what's in the ipynb:

   "outputs": [
    {
     "ename": "ZeroDivisionError",
     "evalue": "division by zero",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mZeroDivisionError\u001b[0m                         Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-1-9e1622b385b6>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0;36m1\u001b[0m\u001b[0;34m/\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;31mZeroDivisionError\u001b[0m: division by zero"
     ]
    }
   ],

and this is how pandoc parses it:

jgm@macbook-air-3:~/src/pandoc % pandoc -t native ~/Downloads/Stacktrace.ipynb
[Div ("",["cell","code"],[("execution_count","1"),("ExecuteTime","{\"start_time\":\"2019-07-05T17:24:16.423114Z\",\"end_time\":\"2019-07-05T17:24:16.477717Z\"}")])
 [CodeBlock ("",["python"],[]) "1/0"
 ,Div ("",["output","error"],[("ename","ZeroDivisionError"),("evalue","division by zero")])
  [CodeBlock ("",[],[]) "\ESC[0;31m---------------------------------------------------------------------------\ESC[0m\n\ESC[0;31mZeroDivisionError\ESC[0m                         Traceback (most recent call last)\n\ESC[0;32m<ipython-input-1-9e1622b385b6>\ESC[0m in \ESC[0;36m<module>\ESC[0;34m\ESC[0m\n\ESC[0;32m----> 1\ESC[0;31m \ESC[0;36m1\ESC[0m\ESC[0;34m/\ESC[0m\ESC[0;36m0\ESC[0m\ESC[0;34m\ESC[0m\ESC[0;34m\ESC[0m\ESC[0m\n\ESC[0m\n\ESC[0;31mZeroDivisionError\ESC[0m: division by zero\n"]]]

Here's a hexdump of the latex output:

0000000   \   b   e   g   i   n   {   S   h   a   d   e   d   }  \n   \
0000010   b   e   g   i   n   {   H   i   g   h   l   i   g   h   t   i
0000020   n   g   }   [   ]  \n   \   D   e   c   V   a   l   T   o   k
0000030   {   1   }   \   O   p   e   r   a   t   o   r   T   o   k   {
0000040   /   }   \   D   e   c   V   a   l   T   o   k   {   0   }  \n
0000050   \   e   n   d   {   H   i   g   h   l   i   g   h   t   i   n
0000060   g   }  \n   \   e   n   d   {   S   h   a   d   e   d   }  \n
0000070  \n   \   b   e   g   i   n   {   v   e   r   b   a   t   i   m
0000080   }  \n 033   [   0   ;   3   1   m   -   -   -   -   -   -   -
0000090   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -
*
00000d0   -   -   -   - 033   [   0   m  \n 033   [   0   ;   3   1   m
00000e0   Z   e   r   o   D   i   v   i   s   i   o   n   E   r   r   o
00000f0   r 033   [   0   m                                            
0000100                                                           T   r
0000110   a   c   e   b   a   c   k       (   m   o   s   t       r   e
0000120   c   e   n   t       c   a   l   l       l   a   s   t   )  \n
0000130 033   [   0   ;   3   2   m   <   i   p   y   t   h   o   n   -
0000140   i   n   p   u   t   -   1   -   9   e   1   6   2   2   b   3
0000150   8   5   b   6   > 033   [   0   m       i   n     033   [   0
0000160   ;   3   6   m   <   m   o   d   u   l   e   > 033   [   0   ;
0000170   3   4   m 033   [   0   m  \n 033   [   0   ;   3   2   m   -
0000180   -   -   -   >       1 033   [   0   ;   3   1   m     033   [
0000190   0   ;   3   6   m   1 033   [   0   m 033   [   0   ;   3   4
00001a0   m   / 033   [   0   m 033   [   0   ;   3   6   m   0 033   [
00001b0   0   m 033   [   0   ;   3   4   m 033   [   0   m 033   [   0
00001c0   ;   3   4   m 033   [   0   m 033   [   0   m  \n 033   [   0
00001d0   m  \n 033   [   0   ;   3   1   m   Z   e   r   o   D   i   v
00001e0   i   s   i   o   n   E   r   r   o   r 033   [   0   m   :    
00001f0   d   i   v   i   s   i   o   n       b   y       z   e   r   o
0000200  \n   \   e   n   d   {   v   e   r   b   a   t   i   m   }  \n
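A quick way to locate such bytes without reading a hexdump is to scan the generated LaTeX for control characters. This is a small Python sketch under the assumption that the .tex file has already been read into a string; the function name is made up for illustration.

```python
def find_control_chars(text):
    """Return (line, column, code point) for each control character
    other than tab, newline, and carriage return."""
    hits = []
    # Split on "\n" only: str.splitlines() would also split on some
    # control characters and hide them from the scan.
    for lineno, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) < 0x20 and ch not in "\t\r":
                hits.append((lineno, col, ord(ch)))
    return hits


sample = "\\begin{verbatim}\n\x1b[0;31mZeroDivisionError\x1b[0m\n\\end{verbatim}\n"
for lineno, col, code in find_control_chars(sample):
    print(f"line {lineno}, column {col}: U+{code:04X}")
```

Each reported U+001B corresponds to one of the 033 bytes in the dump above.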

@jgm
Owner

jgm commented Jul 6, 2019

How do you think pandoc should deal with this? We could easily modify the latex writer to strip out ANSI escape sequences. Is that a good solution?

@wstomv
Author

wstomv commented Jul 6, 2019 via email

@wstomv
Author

wstomv commented Jul 8, 2019

Just in case someone needs it, here is the sed command that I use to strip the ANSI color codes:

sed -E 's/\\u001b[^m]*m//g' file.ipynb | pandoc ...

@jgm
Owner

jgm commented Jul 11, 2019

We could either modify the ipynb reader to strip out ANSI escape sequences in code, or modify the latex writer to strip them out. The former approach seems more sensible since we'd get similar problems in other output formats (HTML, docx?). However, this has the drawback that ipynb would not round trip the escape sequences when going ipynb -> ipynb. Maybe that's not an issue?

@panisson

I've found the same problem when using the question mark to access the documentation of an object.

Very simple example:

s = "a"
s.strip?

produces an output like this:

   "outputs": [
    {
     "data": {
      "text/plain": [
       "\u001b[0;31mSignature:\u001b[0m \u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstrip\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mchars\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m/\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
       "\u001b[0;31mDocstring:\u001b[0m\n",
       "Return a copy of the string with leading and trailing whitespace remove.\n",
       "\n",
       "If chars is given and not None, remove characters in chars instead.\n",
       "\u001b[0;31mType:\u001b[0m      builtin_function_or_method\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],

Running XeTeX on the resulting LaTeX code produces the error "Text line contains an invalid character."

jgm closed this as completed in 5454aad Jul 16, 2019
@jgm
Owner

jgm commented Jul 16, 2019

I came up with a good solution that strips them in most cases but still allows for round-trip.

@aarchiba

Could this be fixed by jupyter/nbconvert#1181 which converts the escape sequences into colors?

@maegul

maegul commented Sep 1, 2020

Could this be fixed by jupyter/nbconvert#1181 which converts the escape sequences into colors?

Hmmm ... having read the changelog, it seems that you've addressed this in 2.10.1. Would it be possible to have an option for stripping the ANSI escape sequences without using --ipynb-output=best? I find that converting notebooks with lots of images can take some time when using --ipynb-output=best.

@ickc
Contributor

ickc commented Apr 11, 2022

Hi, I encountered this problem in my custom workflow, using pandoc 2.18.

Is it true that the ANSI sequences are stripped only if the input is ipynb?

My workflow has steps that

  1. first combine many ipynb files into a single file in native format
    (each converted from ipynb to native with --ipynb-output=all, since we want to keep everything at this step),
  2. then, in a separate step, convert the native file to pdf/tex. Applying --ipynb-output=best at this step doesn't remove the ANSI sequences.

I can provide MWE and open a new issue if needed.
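For a workflow like this, where the input to the final step is no longer ipynb, one possible stopgap is a JSON filter that strips the sequences from code elements regardless of the input format. The sketch below is not pandoc's built-in behavior: it assumes pandoc's JSON AST layout, in which CodeBlock and Code elements carry [attr, text] in their "c" field, and it handles only color/style sequences of the form ESC[...m. The function names are illustrative.

```python
import json
import re

# Matches SGR (color/style) sequences such as ESC[0;31m and ESC[0m.
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")


def strip_ansi_in_ast(node):
    """Recursively walk a pandoc JSON AST and strip color codes from
    CodeBlock and Code elements, whose content is [attr, text]."""
    if isinstance(node, list):
        for item in node:
            strip_ansi_in_ast(item)
    elif isinstance(node, dict):
        if node.get("t") in ("CodeBlock", "Code"):
            attr, text = node["c"]
            node["c"] = [attr, ANSI_SGR.sub("", text)]
        else:
            for value in node.values():
                strip_ansi_in_ast(value)
    return node


def run_filter(instream, outstream):
    """Read a JSON AST from instream, clean it, and write it to outstream."""
    doc = json.load(instream)
    json.dump(strip_ansi_in_ast(doc), outstream)
```

Saved as a script (say, a hypothetical strip_ansi.py) that ends by calling run_filter(sys.stdin, sys.stdout) after importing sys, this could be passed to pandoc with --filter in the native-to-pdf step.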

@jgm
Owner

jgm commented Apr 11, 2022

I can provide MWE and open a new issue if needed.

Yes please.
