[YAML] Add entirely new YAML syntax #90

FichteFoll · 2015-08-15T23:34:30Z

I spent the last couple of weeks (every now and then) writing this from scratch, as I promised. It is based on the YAML 1.2 spec: http://www.yaml.org/spec/1.2/spec.html.

Package now also includes a preview file (for "good measure"), so you can test your color scheme against it or something.

Also has a lot of tests, of course.

I have decided to not assign any special scopes to the punctuation characters, e.g. :[]{},?|>% because everything else is already colored.

Notable differences:

Properties are now properly highlighted
Directives are highlighted
Ending and beginning of plain scalars are correctly highlighted, in hopefully all situations. This is context-aware (block vs flow context)
More accurate highlighting of implicit plain scalar types (int, float, bool ...)
Explicit keys are matched, even though they don't receive special highlighting in most cases
Probably more things that I can't remember

Screenshot with new syntax:

(unfortunately constant.numeric and string have a very similar color in my color scheme, so you hardly see the difference here)

Screenshot with old syntax:

Fixes jskinner/DefaultPackages#41
Fixes jskinner/DefaultPackages#167

FichteFoll · 2015-08-15T23:36:23Z

As a side note, I discovered a couple of inconveniences while working on this, which I reported here: https://github.com/SublimeTextIssues/Core/labels/C%3A%20Syntax%20Highlighting

aziz · 2015-08-16T02:07:28Z

👍 This is awesome man. YAML is a tricky one! Thanks @FichteFoll. Next stop Markdown 😉

FichteFoll · 2015-10-12T11:02:39Z

By the way, I would really like to use this as a base for a .sublime-syntax definition, but I am still unsure of how to tackle it best since I want to wait for this to be merged first. It would be cool if I could somehow inject patterns (i.e. prototypes) only into certain named contexts, such as key names, so that I could highlight the 'special' keys without rewriting half of the yaml def.

This would also be interesting for other Sublime Text resource file types based on JSON.

Don't know if this would be a feasible addition.

FichteFoll · 2016-01-18T09:12:09Z

When you are reviewing this, I'd like to hear about your opinion on coloring the punctuation characters, i.e. -|<[]{},?:%. I currently do highlight &* in anchors and references.

Edit: Just saw that convert_syntax.py is included in this PR. This shouldn't be there. Will squash soon™.

FichteFoll · 2016-01-24T02:18:50Z

Done.

Regarding punctuation characters, I came to the following stance: For the same reason that JSON punctuation is not highlighted, I believe that YAML punctuation should not be highlighted either. If users prefer to have it highlighted, they can do so easily by editing their color scheme, which they could also do for other languages following the same punctuation scope naming, which will hopefully be standardized at some point.

The other way around (having to edit a color scheme to override punctuation coloring caused by scope names like keyword.flow) is not preferable.

wbond · 2016-01-29T19:58:44Z

In 3098 we added the "Performance" variant of the Syntax Tests build suite.

Since this is a complete rewrite, could you take a couple of minutes and run some tests on some decently large YAML files with this and the existing YAML syntax? This can help ensure that, in addition to the excellent coverage of different syntax you already have, there aren't any performance issues with the regular expressions.

FichteFoll · 2016-01-30T00:33:42Z

I ran the performance test a couple of times and recorded the minimum and maximum averages.

The new YAML.sublime-syntax file itself (550 lines):

Syntax "Packages/YAML/YAML.sublime-syntax" took an average of 6-8ms over 10 runs
Syntax "Packages/YAML 1.2/YAML 1.2.sublime-syntax" took an average of 33-34ms over 10 runs

The biggest YAML I could find (you hardly find anything >50 lines with google) was ... PHP Source.sublime-syntax, which has 1193 lines:

Syntax "Packages/YAML/YAML.sublime-syntax" took an average of 30-32ms over 10 runs
Syntax "Packages/YAML 1.2/YAML 1.2.sublime-syntax" took an average of 744-749ms over 10 runs

That's sadly a 25x slowdown (only 5x for the first). To be expected, since YAML is really not easy to parse for computers, but the current one is just wrong in many situations and we can't have that, can we?
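For reference, the slowdown factors quoted above can be reproduced from the reported timings. This is only a small sketch: it takes the midpoint of each min-max range, which is an assumption (the raw per-run numbers are not given), and the dictionary and function names are purely illustrative.

```python
# Reported "min-max average" timings in milliseconds, copied from the results above.
# Using the midpoint of each range is an assumption made for this sketch.
TIMINGS_MS = {
    "YAML.sublime-syntax (550 lines)": {"old": (6, 8), "new": (33, 34)},
    "PHP Source.sublime-syntax (1193 lines)": {"old": (30, 32), "new": (744, 749)},
}


def midpoint(bounds):
    """Return the midpoint of a (min, max) pair."""
    low, high = bounds
    return (low + high) / 2


def slowdown(entry):
    """Ratio of new-syntax time to old-syntax time for one test file."""
    return midpoint(entry["new"]) / midpoint(entry["old"])


for name, entry in TIMINGS_MS.items():
    print(f"{name}: {slowdown(entry):.1f}x slower")
```

With midpoints this comes out at roughly 5x for the smaller file and roughly 24x for the larger one, in line with the rounded figures above.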

wbond · 2016-02-03T11:32:30Z

@FichteFoll Hmm, 750ms for a 1200 line file is definitely less than ideal. Considering that the second example is only about twice as long as the first, the 20x slowdown seems like there is probably something we can optimize in there.

I'll take a look and see if there is anything I can identify.

wbond · 2016-02-03T12:10:13Z

Quoting the long strings in PHP Source.sublime-syntax brings the average (on my machine) to 125ms. This leads me to believe a lot of the inefficiency right now is in unquoted string processing.

With all of the variables and includes, and being unfamiliar with all of the YAML terminology, I have not yet identified what is causing the issue. I see there are a number of patterns with multiple options that are part of negative lookaheads. My hunch is it may be related to this.

The other option is that I may be able to instrument the regex engine for the performance test to help identify the regex pattern performance.

FichteFoll · 2016-02-03T13:20:50Z

It is very likely that unquoted scalars are the performance bottleneck in this, just because of how "expensive" they are in a computer parsing sense, since multiple checks have to be performed for each character. The way I check this currently is, as you mentioned, by using negative look-aheads that tell us to terminate a plain scalar, but I do think there is room for improvement.
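To illustrate why plain scalars are expensive, here is a rough sketch of the per-character check written as a plain loop. The end conditions are a simplified stand-in (an assumption, not the actual patterns used in the syntax), and `scan_plain_scalar` is a hypothetical helper that exists only for this example.

```python
from typing import Tuple


def scan_plain_scalar(line: str) -> Tuple[str, int]:
    """Return the plain scalar at the start of `line` and how many end-checks ran.

    Simplified end conditions (assumption): ": ", a trailing ":", or " #".
    """
    checks = 0
    i = 0
    while i < len(line):
        checks += 1  # one "should the scalar end here?" test per character
        rest = line[i:]
        if rest.startswith(": ") or rest == ":" or rest.startswith(" #"):
            break
        i += 1
    return line[:i], checks


print(scan_plain_scalar("name: PHP Source"))
print(scan_plain_scalar("value # comment"))
print(scan_plain_scalar("plain text"))
```

Every character of an unquoted scalar triggers the full set of end tests, whereas a quoted string only needs to look for its closing quote.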

I don't have it in my head exactly right now since it's been a while since I worked on this, though, and I'm rather busy at the moment with deadlines. I will be able to take a look at this again after next week (2016-02-15). Maybe earlier, but unlikely.

FichteFoll · 2016-02-19T01:43:13Z

Adding the results of performance tests without {} repetitions, which the sregex engine does not support at this moment, for reference:

Highlighting time of PHP Source.sublime-syntax is reduced by ~100ms to ~650ms, which is a 13% improvement.
For YAML.sublime-syntax the improvement is 40%. (~20ms)

haad · 2016-02-22T23:52:05Z

+1

wbond · 2016-03-10T03:04:39Z

YAML/YAML.sublime-syntax

+        (?x)
+        (?=
+          {{ns_plain_first_plain_out}}
+          ((?!{{_flow_scalar_end_plain_out}}).)*


Here you have a lookahead, containing a negative lookahead, containing a positive lookahead.

Proposed change: ([^ :#]|\:[^ ]| [^#])*
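The difference between the two styles can be tried out with Python's `re` module. This is a sketch under stated assumptions: `END` is a simplified stand-in for `{{_flow_scalar_end_plain_out}}` (the real variable covers more cases), the proposed pattern is written with a non-capturing group, and `scalar_prefix` is a hypothetical helper.

```python
import re

# Simplified stand-in for the scalar-end look-ahead (assumption: the real
# {{_flow_scalar_end_plain_out}} variable covers more cases than this).
END = r"(?=: | #|$)"

# Per-character variant: a negative look-ahead wrapping the positive one.
TEMPERED = re.compile(r"(?:(?!" + END + r").)*")

# Character-class variant, taken from the proposed change
# (written here with a non-capturing group).
CLASSES = re.compile(r"(?:[^ :#]|:[^ ]| [^#])*")


def scalar_prefix(pattern, text):
    """Return what `pattern` consumes from the start of `text`."""
    return pattern.match(text).group(0)


for sample in ("key: value", "a b:c # note", "plain#hash"):
    print(repr(sample), "->",
          repr(scalar_prefix(TEMPERED, sample)),
          repr(scalar_prefix(CLASSES, sample)))
```

Under these stand-in definitions the first two samples give the same result for both patterns, while the third (`plain#hash`) does not, which shows how such a rewrite can change behavior in edge cases. Python's `re` is itself a backtracking engine, so this only illustrates matching behavior and says nothing about the speedup inside Sublime Text.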

wbond · 2016-03-10T03:09:38Z

With the two proposed changes I just made (removing nested lookahead, negative lookahead, lookahead patterns), the syntax is an order of magnitude faster.

It processes PHP Source.sublime-syntax in 55ms on my machine.

wbond · 2016-03-10T15:59:40Z

I tweaked a few regexes further. With the unreleased dev build of Sublime Text I am seeing full-file highlighting of PHP Source.sublime-syntax in under 20ms. This makes it possible to edit line 91 of PHP Source.sublime-syntax without any lag.

In short, I replaced a number of negative lookaheads that were being applied to every character with patterns that use character classes to do positive matching. It is possible I introduced a change in behavior, although I tried not to. It would be good to have you review this, @FichteFoll.

https://gist.github.com/wbond/c2a846b92c873f5b7153

jcberquist · 2016-03-10T19:38:14Z

@wbond: while you are waiting for feedback, I thought this might interest you. I was curious and so I diffed the scope names applied by the two versions of the syntax to the PHP Source.sublime-syntax file and it looks like your new version assigns the string.unquoted.plain.out.yaml scope to newlines. You can easily see this by placing the cursor at the end of the name: PHP Source line and viewing the scopes. In your version you will see source.yaml string.unquoted.plain.out.yaml whereas in @FichteFoll's version it just says source.yaml. I should note that I did this with build 3107 since maybe the dev version you used is different.

FichteFoll · 2016-03-12T18:42:10Z

@wbond
All right, I think I'm done with this now. I hand-reviewed all your changes and adjusted them before packing into commits. The issue pointed out by @jcberquist does not appear in this version.

PHP Source.sublime-syntax gets parsed in ~130ms on my machine with 3107 (which is expected to be better for you since I still have {} repetitions). Please let me know how this one compares to your version on your machine (and your ST build).

I also tweaked the scoping slightly here and there.

There was one thing I did not incorporate, which was the addition of some match patterns before matches like - match: '{{_flow_scalar_end_plain_out}}' which seem to serve the purpose of improving performance by not causing this look-ahead pattern to be run against each character, but in practice I didn't get any performance improvements at all. Please tell me if results are different on your ST build.

PS: I got my hands on some pretty large YAML test files but they are so large that performance testing with my current ST build would not be feasible.

wbond · 2016-03-15T22:36:40Z

This is merged in, thanks for all of your work @FichteFoll!

javiercr · 2016-03-16T10:46:06Z

Awesome! Works like a charm with Rails i18n files. Thanks!

FichteFoll · 2016-03-16T14:09:32Z

Thanks for the merge. Glad we finally have sane YAML highlighting. :)

Here are some performance tests on big files with the slower 3107 build:

~11600 lines (400kB): 840ms
~161000 lines (4MB): 7900ms (a bit laggy when editing)
~980000 lines (60MB): 57,800ms (also laggy when editing, but only twice as bad as the previous one)

Can't wait to compare against the next build.
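To compare the three files on equal footing, the timings can be normalized per 1000 lines. This is just a quick sketch using the approximate numbers listed above; the helper name is made up for the example.

```python
# (approximate line count, highlighting time in ms), copied from the list above.
RESULTS = [
    (11_600, 840),
    (161_000, 7_900),
    (980_000, 57_800),
]


def ms_per_thousand_lines(lines, millis):
    """Normalize a timing to milliseconds per 1000 lines."""
    return millis / (lines / 1000)


for lines, millis in RESULTS:
    print(f"{lines:>7} lines: {ms_per_thousand_lines(lines, millis):.1f} ms per 1000 lines")
```

The per-line cost stays in the same range for all three files (roughly 50 to 70 ms per 1000 lines), so highlighting time grows about linearly with file size here.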

I noticed that syntax highlighting, or especially the syntax tests, only use a single core. Maybe some concurrency optimization wrt sregex could speed this up a little bit, but it'd be quite some work I suppose.
I also noticed that RAM usage grows very linearly, but that will likely not be an issue and is unavoidable in order to store the tokens. I think it should drop between test iterations however, which it does not.

What I'm curious about is RAM usage of plugin_host.exe however, which grew significantly over the course of performance testing but eventually dropped before the tests were finished. Maybe it's unrelated?
See screenshot below:

wbond · 2016-03-16T15:45:20Z

Remember that with 3107, this syntax is still using the Oniguruma engine. So all of the performance, memory characteristics, etc. will likely be affected by not utilizing patterns that require that engine.

I'm thinking the memory usage of plugin_host is unrelated. Perhaps another package you have installed? It happens after the performance test on the 2nd file, and starts before the performance test on the 3rd file.

I don't know off of the top of my head how allocations are happening related to lexing files in a buffer. My guess is that some detail of that is why the memory usage increases until the end of the performance test.

FichteFoll · 2016-04-13T00:33:52Z

Did more tests with 3110 (and removal of the possessive quantifiers):

300kB: 340ms
4MB: 3,117.5ms (old syntax: 1,880.7ms)

The 4MB file still seems editable. It does lag behind slightly but it's not as bad as it was previously.
I'd say this is probably as fast as we can get while maintaining accuracy.

If tokenizing could somehow be multithreaded, then performance would probably improve greatly (I have 4 cores with HT).

FichteFoll mentioned this pull request Aug 27, 2015

sublime-syntax.sublime-syntax #95

Closed

FichteFoll mentioned this pull request Oct 2, 2015

Broken YAML syntax highlighting using mustache brackets SublimeText/PackageDev#65

Closed

FichteFoll mentioned this pull request Dec 4, 2015

Better function call matching (also decorators) MattDMo/PythonImproved#17

Closed

FichteFoll mentioned this pull request Dec 25, 2015

YAML syntax highlighting breaks with double colons in key names jskinner/DefaultPackages#167

Closed

wbond added the significant label Jan 13, 2016

[YAML] Add entirely new YAML syntax

a9c002e

Written from scratch. Package now also includes a preview file (for "good measure"), so you can test your color scheme against it or something. Also has a lot of tests, of course.

FichteFoll force-pushed the new_yaml_syntax branch from 68669bf to a9c002e January 24, 2016 02:15

wbond reviewed Mar 10, 2016

FichteFoll added 6 commits March 11, 2016 19:10

[YAML] Wrap null values variable in non-capt group

a75f15e

[YAML] Remove look-behinds for directives

b83ea66

Courtesy of @wbond.

[YAML] Replace some negative look-aheads with sets

e384ec8

[YAML] Don't use atomic groups

a004fe4

sregex doesn't support them but does not need them either, since they are essentially a hack for backtracking regex engines. Courtesy of @wbond.

[YAML] Whitespace and comment cleanups

40d00cd

[YAML] Significantly improve parsing speed

23e6566

The nested look-aheads were a huge bottleneck for plain scalar key-value pair parsing. By utilizing linear matches in the single look-ahead, parsing speed for `PHP Source.sublime-syntax` is improved by 80%. Partly courtesy of @wbond.

FichteFoll added 5 commits March 12, 2016 18:55

[YAML] Prototype context must not be used in plain

4792692

Remove the misleading comment about that and add test cases.

[YAML] Revise key-value pair meta scopes

d113116

[YAML] Unify whitespace usage in test files

4a07f91

[YAML] Tests: utilize multiple consecutive ^

4d34067

[YAML] Fix directive punctuation scope

83d2d30

ddiachkov mentioned this pull request Mar 12, 2016

Last character being trimmed from keys containg hyphens ddiachkov/sublime-yaml-nav#8

Closed

wbond merged commit 83d2d30 into sublimehq:master Mar 15, 2016

wbond added a commit that referenced this pull request Mar 15, 2016

Merge pull request #90 from FichteForks/new_yaml_syntax

3605fc2

FichteFoll mentioned this pull request Mar 29, 2016

YAML should translate tabs to spaces by default. jskinner/DefaultPackages#115

Closed

FichteFoll deleted the new_yaml_syntax branch March 29, 2016 00:54

FichteFoll mentioned this pull request Jul 26, 2018

[Go] syntax rework #1662

Merged
