Syntax highlighting skipped on literal_block with unicode #4225

abooij · 2017-11-08T00:00:56Z

When using unicode in a language that accepts it, on a block of code that is accepted by pygments, the doc build breaks.

Problem

I am trying to get syntax highlighting for the following code:

Definition logeq_both_false {X Y : UU} : ¬X -> ¬Y -> (X <-> Y).

As you can see, this uses the unicode symbol ¬.

Procedure to reproduce the problem

As a test case, I try to compile the following index.rst in an otherwise vanilla sphinx project:

This breaks:

.. code-block:: coq

    Definition logeq_both_false {X Y : UU} : ¬X -> ¬Y -> (X <-> Y).

But this works:

.. code-block:: coq

    Definition logeq_both_false {X Y : UU} : X -> Y -> (X <-> Y).

Error logs / results

$ make html                     
Running Sphinx v1.6.5
loading pickled environment... done
building [mo]: targets for 0 po files that are out of date
building [html]: targets for 1 source files that are out of date
updating environment: 0 added, 1 changed, 0 removed
reading sources... [100%] index                                                                          
looking for now-outdated files... none found
pickling environment... done
checking consistency... done
preparing documents... done
writing output... [100%] index                                                                           
/tmp/test/index.rst:3: WARNING: Could not lex literal_block as "coq". Highlighting skipped.
generating indices... genindex
writing additional pages... search
copying static files... done
copying extra files... done
dumping search index in English (code: en) ... done
dumping object inventory... done
build succeeded, 1 warning.

Build finished. The HTML pages are in _build/html.

Expected results

The original code is processed fine by pygments directly, hence sphinx should, too.
And if sphinx insists on breaking, it should tell me why.

Environment info

OS: linux
Python version: 3.6.2
Sphinx version: 1.6.5
Pygments version: 2.2.0


jfbu · 2017-11-08T16:05:13Z

Additional info: the html produced from pygmentize -v -l coq -f html -o temp.coqout temp.coq looks like this

<div class="highlight"><pre><span></span><span class="kn">Definition</span> <span class="n">logeq_both_false</span> <span class="o">{</span><span class="n">X</span> <span class="n">Y</span> <span class="o">:</span> <span class="n">UU</span><span class="o">}</span> <span class="o">:</span> <span class="err">¬</span><span class="n">X</span> <span class="o">-&gt;</span> <span class="err">¬</span><span class="n">Y</span> <span class="o">-&gt;</span> <span class="o">(</span><span class="n">X</span> <span class="o">&lt;-&gt;</span> <span class="n">Y</span><span class="o">).</span>
</pre></div>

The bits <span class="err">¬</span> do not look encouraging... does err stand for "erroneous input"? I inserted this manually in some HTML produced by Sphinx (I used your working input and injected directly in html the above) and here is what it gives;

Similarly I tried with the latex formatter of pygmentize, I picked up its produced code, replaced all \PY macros by \PYG as used by Sphinx, inserted it by manual surgery in a Sphinx LaTeX document and this gives this PDF output:

So, it does seem that even on Pygmentize side the Unicode code-points are treated as some sort of error.

jfbu · 2017-11-08T16:12:51Z

Even if Pygmentize 2.2.0's Coq lexer does not yet handle such input, the question remains why Sphinx would abort syntax highlighting altogether rather than keep the "err" class tags, which may be better than nothing, as the user can override the CSS to remove the visual display of a Pygments error condition.

jfbu · 2017-11-08T16:24:22Z

about the good first issue tag, I don't think we have a definitive policy here, so in the end I removed it as I don't like so much the idea of it actually... ;-)

tk0miya · 2017-11-11T12:33:18Z

Sphinx currently uses the raiseonerror filter of Pygments. On my local machine, Pygments treats the Unicode character as an error.

bash-3.2$ cat failed.coq
Definition logeq_both_false {X Y : UU} : ¬X -> ¬Y -> (X <-> Y).
bash-3.2$ pygmentize -l coq -F raiseonerror < failed.coq

*** Error while highlighting:
ErrorToken: \xac
   (file "/Users/tkomiya/work/sphinx/.tox/py27/lib/python2.7/site-packages/pygments/filters/__init__.py", line 196, in filter)
*** If this is a bug you want to report, please rerun with -v.
Definition logeq_both_false {X Y : UU} :

bash-3.2$ cat suceeded.coq
Definition logeq_both_false {X Y : UU} : X -> Y -> (X <-> Y).
bash-3.2$ pygmentize -l coq -F raiseonerror < suceeded.coq
Definition logeq_both_false {X Y : UU} : X -> Y -> (X <-> Y).
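
The same behaviour can be reproduced through the Pygments Python API. The following is a minimal sketch, not Sphinx's actual code: `AsciiOnlyLexer` and `render` are hypothetical names, and the toy lexer stands in for the Coq lexer so that the result does not depend on the installed Pygments version. Any character the lexer has no rule for becomes an error token, and the raiseonerror filter turns the first such token into an `ErrorToken` exception.

```python
from pygments import highlight
from pygments.filters import ErrorToken
from pygments.formatters import HtmlFormatter
from pygments.lexer import RegexLexer
from pygments.token import Name, Operator, Text


class AsciiOnlyLexer(RegexLexer):
    """Hypothetical toy lexer: input matching no rule becomes an Error token."""

    name = "AsciiOnly"
    tokens = {
        "root": [
            (r"\s+", Text),
            (r"[A-Za-z_]\w*", Name),
            (r"[-<>(){}:.]+", Operator),
        ]
    }


SOURCE = "Definition logeq_both_false {X Y : UU} : ¬X -> ¬Y -> (X <-> Y)."


def render(code, strict):
    """Highlight code as HTML; strict=True adds the raiseonerror filter."""
    lexer = AsciiOnlyLexer()
    if strict:
        lexer.add_filter("raiseonerror")
    return highlight(code, lexer, HtmlFormatter())


# Without the filter, the unknown character is kept and tagged as an error.
lenient_html = render(SOURCE, strict=False)

# With the filter, the first error token aborts highlighting altogether.
try:
    render(SOURCE, strict=True)
    strict_failed = False
except ErrorToken:
    strict_failed = True
```

In the lenient case the `¬` characters end up in `<span class="err">` elements, which matches the pygmentize HTML output shown earlier in this thread.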

jfbu · 2017-11-11T15:22:05Z

Oh, I had missed the -F raiseonerror pygmentize option. If Sphinx has no way to instruct Pygments to distinguish Unicode character caused error from other types of error then prospects are bleak here.

tk0miya · 2017-11-11T17:06:26Z

I don't know why Pygments raises an error because I don't know the Coq language. Anyway, current Sphinx always uses the raiseonerror mode of Pygments. So it would be nice to make this setting configurable through an option.

abooij · 2017-11-20T12:22:09Z

The code is valid Coq, as it is accepted by the compiler. Does that mean this should additionally be reported to Pygments?

jfbu · 2017-11-20T13:00:43Z

@abooij it does look as being primarily a Pygments problem.

abooij · 2017-11-20T13:33:33Z

@jfbu, why do you say that this is "primarily" a pygments problem? After all, pygments' own output is usable, although indeed not perfect.

jfbu · 2017-11-20T13:52:53Z

@abooij 1) the "raiseonerror" Pygments filter does not seem to have great granularity, it generates an exception when the lexer generates an error token, and 2) if Unicode characters are ok in Coq input (I don't know), then why does the Coq lexer generate an error token? Except if something is wrong in the way Sphinx calls Pygments, this puts most of the problem on the Pygments side.

abooij · 2017-11-20T15:12:40Z

@jfbu Well indeed there is a problem on the pygments side. However, I highly doubt pygments will ever be able to get Coq syntax (or any other moderately complicated modern syntax, for that matter) completely right. In those cases, it would be preferable if sphinx generated something workable, rather than give up completely as soon as some subtle aspect of a language's syntax is used.

(Unicode is OK in coq and widely used.)

jfbu · 2017-11-20T16:56:37Z

I think @tk0miya concurred that an option to let Sphinx not use raiseonerror Pygments filter could be added to Sphinx. But then the user still has to use CSS for html or a custom redefinition in LaTeX preamble of \PYG@tok@err macro to decide what to do with the error highlighting itself.

Issue #4249 explains that Pygments' LaTeX formatter has a bug in that respect: it uses a \strut inside an \fcolorbox to highlight error tokens, which makes the total height+depth of the line larger than the normal baseline distance (the character width is also increased). That issue currently remains invisible because Sphinx discards the Pygments output whenever the lexer reports error tokens.
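
Besides CSS or a LaTeX macro redefinition, the error highlighting could probably also be neutralised at the Pygments style level. The sketch below assumes the stock `DefaultStyle`; `QuietErrorStyle` is a hypothetical name, and wiring such a class into a Sphinx project (presumably through the `pygments_style` setting) is left out.

```python
from pygments.formatters import HtmlFormatter
from pygments.styles.default import DefaultStyle
from pygments.token import Error


class QuietErrorStyle(DefaultStyle):
    """Hypothetical style: like the default one, but error tokens are unstyled."""

    # Copy the parent's style table and blank out the Error entry, so error
    # tokens inherit the plain token style instead of getting a border.
    styles = {**DefaultStyle.styles, Error: ""}


# Stylesheet generated for the custom style, scoped to the .highlight class.
css = HtmlFormatter(style=QuietErrorStyle).get_style_defs(".highlight")

default_border = DefaultStyle.style_for_token(Error)["border"]
quiet_border = QuietErrorStyle.style_for_token(Error)["border"]
```

Because the change lives in the style rather than in the stylesheet, it should apply to any formatter that reads the style, not only to HTML.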

asmeurer · 2023-01-10T08:26:55Z

Can this option to disable raiseonerror be added, or alternatively, could Sphinx just not use it at all? I want to syntax highlight some IPython console code and if the output contains a Unicode character (like it does with %timeit, with ±), the lexer produces an error token and it fails. But it could easily just syntax highlight it ignoring the error token.

This is how, for instance, the highlighter on GitHub works, and also virtually every Markdown parser I've ever used. If you put some text in ```python that contains some not valid Python, it just highlights what it knows and ignores the rest, e.g.,

In [1]: %timeit sum([i for i in range(10000)])
335 µs ± 15.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

It seems that in general it should be possible to choose a highlighter and have it just do the best it can with the input. It will at least definitely color the parts of it that definitely are Python (or whatever language), but the way it works now, a single "bad" character means it doesn't color anything.
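
A "do the best it can" strategy could look roughly like this. It is only a sketch under assumptions, not the change that was eventually made in Sphinx: `highlight_best_effort` is a hypothetical helper, and it uses the stock `PythonLexer` rather than an IPython lexer. Whether a given character actually produces an error token may vary between Pygments versions, so the helper reports which path was taken.

```python
from pygments import highlight
from pygments.filters import ErrorToken
from pygments.formatters import HtmlFormatter
from pygments.lexers import PythonLexer


def highlight_best_effort(code):
    """Return (html, relaxed): try strict lexing first, then retry leniently."""
    formatter = HtmlFormatter()
    strict_lexer = PythonLexer()
    strict_lexer.add_filter("raiseonerror")
    try:
        return highlight(code, strict_lexer, formatter), False
    except ErrorToken:
        # Retry with a fresh lexer that has no raiseonerror filter, so unknown
        # characters are kept as error tokens instead of aborting the block.
        return highlight(code, PythonLexer(), formatter), True


html, relaxed = highlight_best_effort("335 µs ± 15.2 µs per loop\n")
```

Either way the caller gets highlighted markup back; the `relaxed` flag only tells it whether a warning might be worth emitting.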

AA-Turner · 2023-08-12T05:28:02Z

Fixed by 7d8df06.


jfbu added the type:question label Nov 8, 2017

jfbu added good first issue and removed good first issue labels Nov 8, 2017

jfbu mentioned this issue Nov 20, 2017

PDF output: Pygments error highlighting increases line spacing in code blocks #4249

Closed

jfbu mentioned this issue Nov 20, 2017

Fix #4249 by overriding Pygments latex formatter error highlighting #4250

Merged

dagwieers mentioned this issue Sep 1, 2018

Code example improvements in Windows documentation ansible/ansible#45055

Merged

AA-Turner added this to the some future version milestone Sep 29, 2022

asmeurer mentioned this issue Jan 10, 2023

Pygments lexer does not handle %timeit magic ipython/ipython#13887

Open

AA-Turner closed this as completed Aug 12, 2023

github-actions bot locked as resolved and limited conversation to collaborators Sep 12, 2023
