about slow syntax highlighting #2877

Closed
ces42 opened this issue Jan 26, 2024 · 21 comments
@ces42
Contributor

ces42 commented Jan 26, 2024

Description

vimtex's syntax highlighting is a bit slow at times. It's not terrible, but if I open a large tex file and scroll up and down with my touchpad it is noticeably not smooth. I've tried to look at the output of :syntime report to see if there's anything that can be improved. Here's the output

  TOTAL      COUNT  MATCH   SLOWEST     AVERAGE   NAME               PATTERN
  0.222296   81139  74810   0.000459    0.000003  texMathDelim       [()[\]]\|\\[{}]
  0.146481   21673  49      0.000507    0.000007  texLigature        \v%(``|''|,,)
  0.105568   25166  79      0.000556    0.000004  texSpecialChar     \%(\\\@<!\)\@<=\~
  0.090826   21627  0       0.000207    0.000004  texMathZoneLI      \%(\\\@<!\)\@<=\\(
  0.084873   70591  61790   0.000521    0.000001  texMathSuperSub    [_^]
  0.060983   74427  498     0.000408    0.000001  texMathZoneTI      \\\\\|\\\$
  0.060684   38720  30212   0.000448    0.000002  texMathOper        [-+=/<>|]
  0.039987   51544  45011   0.000128    0.000001  texMathCmd         \\\a\+
  0.022891   26430  6532    0.000241    0.000001  texComment         %.*$
  0.022842   25192  16306   0.000065    0.000001  texMathDelimMod    \\\(left\|right\)\>
  0.021043   23855  4193    0.000384    0.000001  texMathGroup       \\\\\|\\}
  0.020043   32248  17368   0.000079    0.000001  texCmd             \\[a-zA-Z@]\+
  0.020011   5365   209     0.000412    0.000004  texCommentAcronym  \v<(\u|\d){3,}s?>
  0.019337   5194   3       0.000303    0.000004  texCommentURL      \w\+:\/\/[^[:space:]]\+
  0.018543   21627  0       0.000066    0.000001  texCmdConditionalINC \\\w*@ifnextchar\>
  0.016475   21627  0       0.000101    0.000001  texCmdLigature     \v\\%([ijolL]|ae|oe|ss|AA|AE|OE)\ze[^a-zA-Z@]
  0.016086   21627  0       0.000052    0.000001  texSynIgnoreZone   ^\c\s*% VimTeX: SynIgnore\%( on\| enable\)\?\s*$
  0.014915   16338  3614    0.000067    0.000001  texMathArg         \\\\\|\\}
  0.014743   21627  0       0.000047    0.000001  texCmdSpaceCode    \v\\%(math|cat|del|lc|sf|uc)code`
  0.014660   21627  38      0.000082    0.000001  texMathZoneEnv     \\begin{\z(cd\*\?\)}
  0.014640   14477  12771   0.000071    0.000001  texMathTextAfter   \w\+
  0.014227   22104  563     0.000064    0.000001  texCmdCRef         \v\\%(%(label)?c%(page)?|C)ref>
  0.014106   25136  0       0.000119    0.000001  texCmdRef          \\\(page\|eq\)ref\>
  0.014085   23605  6905    0.000066    0.000001  texCmdEnv          \v\\%(begin|end)>
  0.013869   21627  0       0.000103    0.000001  texCmdLigature     \v\\%([ijolL]|ae|oe|ss|AA|AE|OE)$
  0.013798   25136  0       0.000148    0.000001  texComment         ^\s*\\iffalse\>
  0.013527   25136  0       0.000438    0.000001  texCmdRef          \\v\?ref\>
  0.013382   25136  0       0.000124    0.000001  texComment         ^\s*%\s*!.*
  0.012572   21627  0       0.000061    0.000001  texCmdPart         \\\(front\|main\|back\)matter\>
  0.012235   25172  48      0.000079    0.000000  texSpecialChar     \\[,;:!>]
  0.011866   21654  103     0.000117    0.000001  texCmdConditional  \\\(if[a-zA-Z@]\+\|fi\|else\)\>
  0.011844   21627  0       0.000047    0.000001  texConditionalTrueZone ^\s*\\iftrue\>

First of all, I think the very slow \v%(``|''|,,) can be replaced by the equivalent \([`',]\)\1, which was slightly faster for me, averaging 4us instead of 7us.
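
The equivalence claimed above can be checked outside Vim. The following is a minimal sketch, assuming the two Vim patterns translate to the Python `re` spellings shown (the translation and the sample string are illustrative, not taken from vimtex). It only compares which spans match; it says nothing about how fast either form is in Vim's regex engines.

```python
import re

# Python `re` spellings of the two Vim patterns (an assumed translation).
ALTERNATION = re.compile(r"``|''|,,")   # the explicit alternation
BACKREF = re.compile(r"([`',])\1")      # one of ` ' , followed by itself


def spans(pattern, text):
    """Return the (start, end) span of every non-overlapping match."""
    return [m.span() for m in pattern.finditer(text)]


# Illustrative sample with doubled, tripled, lone and mixed characters.
sample = "``quoted'' text,, with a lone ' and , and ` plus ''' and `'"
assert spans(ALTERNATION, sample) == spans(BACKREF, sample)
print(spans(BACKREF, sample))
```

Both forms accept exactly the same two-character strings, so any timing difference comes from how the engine evaluates them, not from what they match.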

I was very confused by \%(\\\@<!\)\@<=\~. Am I correct in understanding that

  • it is equivalent to \\\@<!\~
  • the point of making it more complicated is that it will match faster (with a naive regex engine): After finding a ~, it will only try to check if there's a backslash before the ~ once, instead of trying to match every substring ending before ~ against the regex \\?
    If so then the same behavior could be achieved with \\\@1<!\~ which looks simpler. Unfortunately it doesn't seem to give a speedup.

Another point is that this regex will parse something like a\\~b wrongly. This is more relevant for parsing something like \\\(a^2\) -- this is valid latex but vimtex's highlighting currently doesn't recognize the math mode (OTOH I don't know why anyone would ever write that). The regex \%(\\\@<!\%(\\\\\)*\)\@<=\\( would fix this, checking if there's an even number of backslashes before the \(. Same goes for detecting ~. The performance of this seems to be slightly worse than \%(\\\@<!\)\@<=\\( though. I got 11us vs 9us.
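
The parity argument can be spelled out in a few lines. This is a minimal sketch in Python rather than Vim regex, and the function names are made up for illustration: a character is escaped only when an odd number of backslashes precedes it, whereas a one-character lookbehind only sees the nearest backslash.

```python
def is_escaped(text, index):
    """True if text[index] is preceded by an odd number of backslashes."""
    count = 0
    pos = index - 1
    while pos >= 0 and text[pos] == "\\":
        count += 1
        pos -= 1
    return count % 2 == 1


def preceded_by_backslash(text, index):
    """The simpler check: look only at the single character before index."""
    return index > 0 and text[index - 1] == "\\"


# a\\~b : the tilde follows a line break (\\), so it is not escaped.
tilde_case = r"a\\~b"
print(is_escaped(tilde_case, 3), preceded_by_backslash(tilde_case, 3))

# \\\(a^2\) : the backslash of \( sits at index 2, after a complete \\ pair.
math_case = r"\\\(a^2\)"
print(is_escaped(math_case, 2), preceded_by_backslash(math_case, 2))
```

In both samples the parity check reports "not escaped" while the single-character check reports "escaped", which is the misparse described above.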

Do you use a latexmkrc file?

No

VimtexInfo

System info:
  OS: Ubuntu 23.10
  Vim version: NVIM v0.10.0-dev-2175+g85a041716
  Has clientserver: true
  Servername: /run/user/1000/nvim.242708.0

VimTeX project: m
  base: m.tex
  root: /home/ca/vim
  tex: /home/ca/vim/m.tex
  main parser: current file verified
  document class: article
  packages: accents aliascnt aliasctr amsbsy amsfonts amsgen amsmath amsopn amssymb amstext amsthm atbegshi atbegshi-ltx atveryend atveryend-ltx autonum auxhook bigintcalc bitset calc cleveref color csquotes enumitem epstopdf-base etex etextools etoolbox geometry gettitlestring graphics graphicx hycolor hypcap hyperref iftex ifthen ifvtex infwarerr inputenc intcalc keyval kvdefinekeys kvoptions kvsetkeys letltxmacro ltxcmds mathrsfs mathtools mhsetup mleftright nameref parseargs pdfescape pdftexcmds pgf pgfcomp-version-0-65 pgfcomp-version-1-18 pgfcore pgffor pgfkeys pgfmath pgfrcs pgfsys refcount rerunfilecheck rotating textpos tgpagella thm-amsthm thm-autoref thm-kv thm-listof thm-patch thm-restate thmtools tikz tikz-cd todonotes trig uniquecounter url xcolor xkeyval
  source files:
    m.tex
    ../texmf/tex/latex/preamble.tex
  compiler: latexmk
    engine: -pdf
    options:
      -verbose
      -file-line-error
      -synctex=1
      -interaction=nonstopmode
    callback: 1
    continuous: 1
    executable: latexmk
  viewer: Zathura
    xwin id: 0
  qf method: LaTeX logfile
@ces42 ces42 added the bug label Jan 26, 2024
@ces42
Contributor Author

ces42 commented Jan 26, 2024

To test the slow \(\) more I just replaced all $-math in my tex file with \(\) and I got some pretty bad timings:

  TOTAL      COUNT  MATCH   SLOWEST     AVERAGE   NAME               PATTERN           
  0.339383   8439   8439    0.000610    0.000040  texMathZoneLI      \%(\\\@<!\)\@<=\\)
  0.117477   12558  2847    0.000994    0.000009  texMathZoneLI      \%(\\\@<!\)\@<=\\(

I don't understand why finding the closing \) is so slow but it seems like replacing

  execute 'syntax region texMathZoneLI matchgroup=texMathDelimZoneLI'
          \ 'start="\%(\\\@<!\)\@<=\\("'
          \ 'end="\%(\\\@<!\)\@<=\\)"'
          \ 'contains=@texClusterMath'
          \ l:conceal

by

  execute 'syntax region texMathZoneLI matchgroup=texMathDelimZoneLI'
          \ 'start="\%(\\\@<!\)\@<=\\("'
          \ 'skip="\\\\"'
          \ 'end="\\)"'
          \ 'contains=@texClusterMath'
          \ l:conceal

in vimtex/autoload/vimtex/syntax/core.vim makes it much better (and also fixes some wrong highlighting in e.g. \(x^2\\\))
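
Why the skip-based version also fixes \(x^2\\\) can be seen with a rough model of the scan. This is a sketch in Python under the assumption that a region's skip pattern is consumed before the end pattern is tried at each position; it is not Vim's actual implementation, and `find_close` is a made-up helper name.

```python
import re

# Lookbehind-style end pattern: \) not preceded by a backslash
# (an assumed Python translation of the original Vim end pattern).
LOOKBEHIND_END = re.compile(r"(?<!\\)\\\)")


def find_close(text, start):
    """Scan from start; skip each \\\\ pair, return the index of the first \\)."""
    i = start
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "\\\\":      # skip pattern: a line break, consume both chars
            i += 2
        elif pair == "\\)":     # end pattern: closing math delimiter
            return i
        else:
            i += 1
    return -1


sample = r"\(x^2\\\)"  # inline math whose content ends with a line break
print(find_close(sample, 2))          # index where the closing \) starts
print(LOOKBEHIND_END.search(sample))  # the lookbehind form finds no closer
```

Because the \\ pair is consumed as a unit, the following \) is seen as a closer, while the lookbehind form rejects it for being preceded by a backslash.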

lervag added a commit that referenced this issue Jan 26, 2024
lervag added a commit that referenced this issue Jan 26, 2024
@lervag
Owner

lervag commented Jan 26, 2024

vimtex's syntax highlighting is a bit slow at times. It's not terrible, but if I open a large tex file and scroll up and down with my touchpad it is noticeably not smooth. I've tried to look at the output of :syntime report to see if there's anything that can be improved. Here's the output

Thanks for looking into this and for providing some profiling numbers!

First of all, I think the very slow \v%(``|''|,,) can be replaced by the equivalent \([`',]\)\1, which was slightly faster for me, averaging 4us instead of 7us.

Could you check the original pattern without the group, i.e.

  syntax match texLigature "``\|''\|,,"

I would think it should be faster still, but it would be nice to see how it compares to your current numbers. (I've pushed an update that does this already, because I can't see how it would not be an improvement. But I'm curious if your suggested version may be even faster.)

I was very confused by \%(\\\@<!\)\@<=\~.

Not surprising. It's quite complicated; perhaps needlessly so. I have to admit that it does look equivalent to \\\@<!\~. I'm updating that now.

Am I correct in understanding that …

  • the point of making it more complicated is that it will match faster (with a naive regex engine): After finding a ~, it will only try to check if there's a backslash before the ~ once, instead of trying to match every substring ending before ~ against the regex \\?

Did you already check if the original pattern matches faster than the simplified pattern? That would be surprising to me.

Another point is that this regex will parse something like a\\~b wrongly.

I've pushed a simplification of the pattern now, and it seems to work well on a\\~b.

This is more relevant for parsing something like \\\(a^2\) -- this is valid latex but vimtex's highlighting currently doesn't recognize the math mode (OTOH I don't know why anyone would ever write that). The regex \%(\\\@<!\%(\\\\\)*\)\@<=\\( would fix this, checking if there's an even number of backslashes before the \(. Same goes for detecting ~. The performance of this seems to be slightly worse than \%(\\\@<!\)\@<=\\( though. I got 11us vs 9us.

I've tested this a little bit further, and I believe that the complexity is not really needed here. \\ is already matched early as a texTabularChar. I'm therefore pushing a further simplification on this that I believe should also work as expected and improve things somewhat.

To test the slow \(\) more I just replaced all $-math in my tex file with \(\) and I got some pretty bad time: …

I've simplified this even further. How do the timings look now?

@ces42
Contributor Author

ces42 commented Jan 26, 2024

A quick test seems to indicate that \%([`',]\)\1 (average 3.0us) might be faster than ``\|''\|,, (average 4.7us). But I'm not sure this sample is representative.

@ces42
Contributor Author

ces42 commented Jan 26, 2024

Did you already check if the original pattern matches faster than the simplified pattern? That would be surprising to me.

Some data for this: I created some files with a couple of lines that each consist of the letter i repeated 999 times followed by a single ~ (in math mode). This should be a worst-case scenario for lookbehinds. These are the results

  TOTAL      COUNT  MATCH   SLOWEST     AVERAGE   NAME               PATTERN          
  0.846513   1964   933     0.002901    0.000431  texSpecialChar     \%(\\\@<!\)\@<=\~
  0.070227   1944   928     0.000472    0.000036  texSpecialChar     \\\@1<!\~
  0.052229   1529   727     0.000437    0.000034  texSpecialChar     \\\@<!\~                                                      
  1.125284   2296   1072    0.003568    0.000490  texSpecialChar     \%(\\\@<!\%(\\\\\)*\)\@<=\~
  0.000506   2146   1004    0.000017    0.000000  texSpecialChar     \~

So it seems like the change you already pushed is better than the way it was. However the pattern \\\@<!\~ is wrong in situations like a\\~b so maybe it would be preferable to just match \~ and rely on texTabularChar matching double backslashes first. This leads to somewhat weird highlighting of strings like \~, but that's not valid tex in math-mode anyway.

@ces42
Contributor Author

ces42 commented Jan 26, 2024

Here's another idea that might improve syntax highlighting performance. Currently there are a lot of syntax definitions that match specific commands. It might be faster to just have a syntax group that matches commands, i.e. \\[a-zA-Z@]\+, and then have this syntax group contain all the specific commands, e.g. texCmdAccent, texCmdLigature. A quick test with those two syntax groups looks quite promising.
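
The idea can be illustrated as "match once, then dispatch". The following Python sketch is only an analogy, not how Vim's `contains=` works internally: the generic pattern runs over the line once, and the specific group is then decided by looking only at the matched command. The group names come from the syntime report above; the lookup table and `classify` helper are hypothetical.

```python
import re

# One generic rule for any command (Python spelling of \\[a-zA-Z@]\+).
GENERIC_CMD = re.compile(r"\\[a-zA-Z@]+")

# Hypothetical subset of specific commands, keyed by the matched text.
SPECIFIC = {
    "\\ref": "texCmdRef",
    "\\vref": "texCmdRef",
    "\\pageref": "texCmdRef",
    "\\eqref": "texCmdRef",
    "\\frontmatter": "texCmdPart",
    "\\mainmatter": "texCmdPart",
    "\\backmatter": "texCmdPart",
}


def classify(line):
    """Return (command, group) pairs; unknown commands fall back to texCmd."""
    return [
        (m.group(), SPECIFIC.get(m.group(), "texCmd"))
        for m in GENERIC_CMD.finditer(line)
    ]


print(classify(r"see \ref{a} and \textbf{b} \mainmatter"))
```

The point of the sketch is that the per-command checks only run on text that is already known to be a command, instead of every specific rule being tried at every position.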

@lervag
Owner

lervag commented Jan 28, 2024

A quick test seems to indicate that \%([`',]\)\1 (average 3.0us) might be faster than ``\|''\|,, (average 4.7us). But I'm not sure this sample is representative.

Interesting. I can't understand why it would be faster, but I'll switch based on your evidence.

Did you already check if the original pattern matches faster than the simplified pattern? That would be surprising to me.

Some data for this: I created some files with a couple of lines that each consist of the letter i repeated 999 times followed by a single ~ (in math mode). This should be a worst-case scenario for lookbehinds. These are the results

  TOTAL      COUNT  MATCH   SLOWEST     AVERAGE   NAME               PATTERN          
  1.125284   2296   1072    0.003568    0.000490  texSpecialChar     \%(\\\@<!\%(\\\\\)*\)\@<=\~
  0.846513   1964   933     0.002901    0.000431  texSpecialChar     \%(\\\@<!\)\@<=\~
  0.070227   1944   928     0.000472    0.000036  texSpecialChar     \\\@1<!\~
  0.052229   1529   727     0.000437    0.000034  texSpecialChar     \\\@<!\~                                                      
  0.000506   2146   1004    0.000017    0.000000  texSpecialChar     \~

Ok, so the current version is very fast now. That's good. But …

So it seems like the change you already pushed is better than the way it was. However the pattern \\\@<!\~ is wrong in situations like a\\~b so maybe it would be preferable to just match \~ and rely on texTabularChar matching double backslashes first. This leads to somewhat weird highlighting of strings like \~, but that's not valid tex in math-mode anyway.

Yes, you are right. I'm sorry for first insisting otherwise. I think using the "trivial" \~ is really fine here, because \\ is properly matched already as texTabularChar and \~ is matched as texCmdAccent. In math mode this latter command does not exist and will typically be an error anyway, so why worry about it?

Here's another idea that might improve syntax highlighting performance. Currently there are a lot of syntax definitions that match specific commands. It might be faster to just have a syntax group that matches commands, i.e. \\[a-zA-Z@]\+, and then have this syntax group contain all the specific commands, e.g. texCmdAccent, texCmdLigature. A quick test with those two syntax groups looks quite promising.

Yes, you may be right. But it does seem like a large amount of work to do this. And in my experience, syntax performance is not really a big issue?

lervag added a commit that referenced this issue Jan 28, 2024
@lervag
Owner

lervag commented Jan 28, 2024

I'll close this, but feel free to continue the discussion.

@lervag lervag closed this as completed Jan 28, 2024
@lervag
Owner

lervag commented Jan 28, 2024

From your original list of slow patterns, it seems we should consider the texMathDelim pattern. Do you have any ideas on this one?

@lervag
Owner

lervag commented Jan 28, 2024

Also: If you care to share a nice example file with which you are now testing the syntax speed, that would be nice. I'm thinking of adding an example to the test files so that I have a nice way to reproduce timings.

lervag added a commit that referenced this issue Jan 28, 2024
lervag added a commit that referenced this issue Jan 28, 2024
@lervag
Owner

lervag commented Jan 28, 2024

I've added a very tiny example here: 2477b87#diff-b6fcc94b4e1e1c06afd70f6fa03d63100d069eed363f356fe23eb30bbe2af033

@ces42
Contributor Author

ces42 commented Feb 19, 2024

I modified your script slightly:

set nolazyredraw
let LINES = line('$')
syntime on
for s:x in range(2*LINES/winheight(0))
  norm! �
  redraw!
endfor

and ran it on the thesis.tex example file included with vimtex. The top syntimes are

  TOTAL      COUNT  MATCH   SLOWEST     AVERAGE   NAME               PATTERN
  0.358374   58463  7456    0.000086    0.000006  texLength          \<\d\+\([.,]\d\+\)\?\s*\(true\)\?\s*\(bp\|cc\|cm\|dd\|em\|ex\|in\|mm\|pc\|pt\|sp\)\>
  0.152289   135299 83807   0.000049    0.000001  texCmd             \\[a-zA-Z@]\+
  0.066881   53552  0       0.000042    0.000001  texComment         ^\s*\\iffalse\>
  0.044951   53552  0       0.000043    0.000001  texComment         ^\s*%\s*!.*
  0.043604   82835  35604   0.000042    0.000001  texOptSep          ,\s*
  0.038223   180644 31556   0.000038    0.000000  texOpt             \]
  0.038166   58993  3851    0.000034    0.000001  texArg             \\\\\|\\}

This is interesting, because for one of my files I get

  TOTAL      COUNT  MATCH   SLOWEST     AVERAGE   NAME               PATTERN
  0.017515   11303  10559   0.000023    0.000002  texMathDelim       [()[\]]
  0.010766   11929  100     0.000008    0.000001  texMathZoneTI      \\\\\|\\\$
  0.010017   10471  9433    0.000030    0.000001  texMathSuperSub    [_^]
  0.006673   9823   9159    0.000013    0.000001  texMathCmd         \\\a\+
  0.006661   5076   4084    0.000018    0.000001  texMathOper        [-+=/<>|]
  0.003620   3829   516     0.000006    0.000001  texMathGroup       \\\\\|\\}
  0.002590   2689   0       0.000019    0.000001  texCmdConditionalINC \\\w*@ifnextchar\>
  0.002480   4763   2925    0.000009    0.000001  texCmd             \\[a-zA-Z@]\+
  0.002314   2457   1411    0.000012    0.000001  texMathDelimMod    \\\(left\|right\)\>

So which particular rules take long might vary from case to case. Anyway I also plotted the results (from thesis.tex)
[image: plot_thesis]
and I think looking at the top syntimes might be barking up the wrong tree. It seems like the large number of (fast) syntax rules is a bigger issue than some individual slow ones.
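
One way to check the "many fast rules" hypothesis numerically is to compute what share of the total time the top few rules account for. The following is a minimal Python sketch, assuming the report has been copied out of Vim as plain text in the column layout shown above; the helper names are made up, and the three sample rows are taken from the thesis.txt report above purely to exercise the parser (a meaningful share needs the full report).

```python
def parse_syntime(report):
    """Parse ':syntime report' text into a list of row dicts."""
    rows = []
    for line in report.splitlines():
        # The pattern column may contain spaces, so split off only 6 fields.
        fields = line.split(None, 6)
        if len(fields) < 7:
            continue
        try:
            total, count = float(fields[0]), int(fields[1])
        except ValueError:  # header line or other non-data text
            continue
        rows.append({"total": total, "count": count,
                     "name": fields[5], "pattern": fields[6].rstrip()})
    return rows


def top_share(rows, n):
    """Fraction of the summed TOTAL time spent in the n most expensive rules."""
    totals = sorted((row["total"] for row in rows), reverse=True)
    grand_total = sum(totals)
    return sum(totals[:n]) / grand_total if grand_total else 0.0


SAMPLE = r"""
  TOTAL      COUNT  MATCH   SLOWEST     AVERAGE   NAME               PATTERN
  0.152289   135299 83807   0.000049    0.000001  texCmd             \\[a-zA-Z@]\+
  0.066881   53552  0       0.000042    0.000001  texComment         ^\s*\\iffalse\>
  0.043604   82835  35604   0.000042    0.000001  texOptSep          ,\s*
"""

rows = parse_syntime(SAMPLE)
print(len(rows), top_share(rows, 1))
```

If `top_share(rows, 10)` on a full report stays small, the time is spread over the long tail of rules, and optimizing individual patterns will not help much.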

@lervag
Owner

lervag commented Feb 20, 2024

I modified your script slightly:

set nolazyredraw
let LINES = line('$')
syntime on
for s:x in range(2*LINES/winheight(0))
  norm! �
  redraw!
endfor

So, the idea here is to scroll through a file, right? So it's norm! <c-f> or something?

and ran it on the thesis.tex example file included with vimtex. The top syntimes are …

This is interesting, because for one of my files I get …

So which particular rules take long might vary from case to case.
Anyway I also plotted the results … and I think looking at the top syntimes might be barking up the wrong tree. It seems like the large number of (fast) syntax rules is a bigger issue than some individual slow ones.

The thesis.tex file is not really a very good example of a common LaTeX project. First, it does not contain very much math. Next, the content is repeated several times to increase the length of the file so that it becomes much bigger than most projects.

Thus, it is not so strange that there are big differences in which rules take long.

Further, the main thing we want is for a single screen render to be quick. For this, we want to have low average (and slowest) times for all rules. We don't want slow rules, or at least we want them to be very rare.

@ces42
Contributor Author

ces42 commented Jul 15, 2024

I found some time to spend on this today. I ended up using the source of this paper https://arxiv.org/abs/1512.07213 to time things with that scrolling script (the non-printable character is ^D). It seems like I was able to get a 20% speedup by trying to reduce the number of syntax rules created. I've put my changes in the faster-syntax branch on my fork.

Most of it is just "merging" regular expressions, although I also tried changing the vimtex#syntax#core#new_env function so that it only creates one (big) syntax rule for texMathEnvBgnEnd, texMathZoneEnv and texMathError (so every time the function is called I delete the old syntax rule and replace it). For this I had to limit what you can do with vimtex#syntax#core#new_env when {'math': v:true}; in particular you can't pass the __predicate argument. I'm not exactly sure what the use case of that is. The new function just throws errors whenever the combination of arguments could cause trouble. I think this maybe doesn't limit functionality too much.

Looking at the code probably makes things more clear than I can explain here.

@lervag
Owner

lervag commented Jul 31, 2024

Interesting. With your branch I do get a very noticeable speedup on my example:

[image]

Now, I notice you do a lot of different stuff, e.g. changing to the old regex engine. It's a little bit hard to read which of your changes are the most significant. But I'm beginning to think that one of the most significant factors is the number of rules. Thus, as you say, reducing the number of rules by using more complex regexes seems to be a useful trick.

@lervag
Owner

lervag commented Jul 31, 2024

Could you explain the timings you've added in your commits? E.g.

[image]

[image]

Are the numbers the current runtime? If so, it seems to be increasing with the commits and the latest one is the slowest. Clearly, that's not the correct understanding, but perhaps you see my confusion?

@lervag
Owner

lervag commented Jul 31, 2024

Now, it looks like you've done a very good and thorough job here. I believe it may be a good idea to add a comment to the top of the core.vim file that summarizes some of the key reflections here?

Also, I am wondering if you are proposing that I merge this or if you want to open a PR with your work more cleaned up?

@ces42
Contributor Author

ces42 commented Jul 31, 2024

Could you explain the timings you've added in your commits? E.g.

They are just the runtimes of test.vim (using that arxiv paper I linked as main.tex) on my computer (while fixing cpu frequency). They are not very meaningful by themselves, I was just adding them to keep track of how much mileage I was getting out of every commit.

@lervag
Owner

lervag commented Aug 1, 2024

Ok, thanks for clarifying. How about my other questions?

@lervag lervag reopened this Aug 16, 2024
@ces42
Contributor Author

ces42 commented Aug 19, 2024

Hi,
sorry this has been taking so long. I'm confused about one of your questions. The timing numbers do get smaller with every commit.
Reducing the total number of syntax rules is one of the main factors contributing to the speedup, although switching to the old regex engine actually had a slightly bigger impact :/

I would like to clean this up a bit before opening a PR, and I can write a comment at the top of the file explaining what the main tricks are.

@lervag
Owner

lervag commented Aug 20, 2024

Hi, sorry this has been taking so long. I'm confused about one of your questions. The timing numbers do get smaller with every commit.

Sorry, my fault. I must have read the commit log in the wrong order or something.

Reducing the total number of syntax rules is one of the main factors contributing to the speedup, although switching to the old regex engine actually had a slightly bigger impact :/

It's also confusing to me that changing to the old regex engine should have such an impact. The docs themselves say that the old regex engine is slower. I also tested this on a few slow rules when looking into lags related to the fontawesome5 package, and there I did find that the old engine was slower.

I would like to clean this up a bit before opening a PR, and I can write a comment at the top of the file explaining what the main tricks are.

That does sound great! I will be very glad to see a PR for this!

@lervag
Owner

lervag commented Oct 5, 2024

I'm closing this again, since there is now a PR: #3006.

@lervag lervag closed this as completed Oct 5, 2024