perf: sqlparser yacc codegen #7669

Merged
merged 16 commits into vitessio:master on Mar 16, 2021

Conversation

Collaborator

@vmg vmg commented Mar 12, 2021

Description

Following up on last week's zero-copy optimization for the Tokenizer, today we're picking up where we left off with some performance tuning for the generated Yacc parser.

This is the underlying idea: YACC is a parser generator that outputs stack-based LALR parsers from an input grammar. Each action in a YACC grammar receives its input from one or more stack frames, and is supposed to push its output into a brand new stack frame. These are not native stack frames; they're abstracted by YACC itself into an array of "stack structs". These structs are supposed to contain any of the possible outputs for any of the possible actions in a grammar. This is cheap to implement in C (the language for which YACC was originally designed), because C supports union types.
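
For readers who haven't written Yacc before, here is what such an action looks like. This is a hypothetical rule in the style of our grammar (the rule and the BinaryExpr type are illustrative, not taken from sql.y): the action reads its inputs from the stack frames behind $1 and $3 and pushes its result as the new frame for $$.

Grammar sample (hypothetical): a Yacc action and its stack frames
expr:
  expr '+' expr
  {
    $$ = &BinaryExpr{Left: $1, Operator: "+", Right: $3}
  }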

A union type in C can have any number of fields, but only one field can be written or read at a time. Unlike a struct, all the fields in a union share the same underlying memory, so the size of the union is the size of the largest of its fields, instead of the sum of all its fields like in a struct. This is an important memory optimization that can be performed in YACC parsers in C, and that doesn't work at all in Go: the Go programming language does not support union types because they're not memory-safe (if you write into a pointer field in the union and then read from e.g. a float field, undefined behavior happens).

The fact that Go doesn't support unions is a massive performance drawback for our YACC parser, because it means that every stack frame in our parser must be a Go struct with a field for every possible output of a grammar action. In our SQL grammar, this is 94 fields totalling 1424 bytes per stack frame. Because of this, the performance of our parser is overwhelmingly dominated by simply copying and clearing stack frames, as opposed to parsing.
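
As a rough illustration of the shape of the problem, here is a hypothetical miniature of such a stack struct (not the actual generated code; the real one has 94 fields):

Go sketch: every frame pays for every field
package main

import "unsafe"

type (
	Scope     int8
	Statement interface{ isStatement() }
	TableSpec struct{}
)

// Hypothetical miniature of the generated stack struct: only one field is
// ever live per frame, but pushing, copying or clearing a frame moves all
// of them (1424 bytes per frame in our real grammar).
type yySymType struct {
	str       string
	statement Statement
	tableSpec *TableSpec
	scope     Scope
	// ... 90 more fields in the real sql.y
}

func main() {
	println(unsafe.Sizeof(yySymType{})) // grows with every new field
}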

This problem is immediately obvious when running a profiler on the parser, and in fact @sougou and @GuptaManan100 already tried fixing this in a previous PR that didn't get merged. The design behind the PR is compelling: it tries to re-implement unions in Go by storing all the fields in the original struct into a single interface{} field, the closest thing the language has to a union. A Go interface{} can hold any Go type, so it's something that could behave like a union (and hence reduce the memory usage of the parser) while being memory safe.
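
A minimal sketch of that idea (the types are illustrative stand-ins, not the parser's actual code): reads from the interface{} are checked at runtime, which is what makes this union substitute memory-safe, unlike a C union.

Go sketch: interface{} as a memory-safe union
package main

import "fmt"

type (
	Scope     int8
	Statement interface{ isStatement() }
)

// A single interface{} slot plays the role of the C union: it holds exactly
// one of the possible action outputs at a time.
type symbolValue struct {
	union interface{}
}

func main() {
	v := symbolValue{union: Scope(1)}
	fmt.Println(v.union.(Scope)) // asserting the stored type works

	// Asserting a different type never reinterprets memory like a C union:
	// the checked form reports failure (the unchecked form would panic).
	if _, ok := v.union.(Statement); !ok {
		fmt.Println("not a Statement")
	}
}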

Why didn't the original PR get merged? It was a combination of two factors:

  • The PR required carrying a fork of goyacc (the yacc generator we're using) and making changes to our sql.y, because some of the idioms used in our grammar were no longer supported after the fork's optimizations.
  • The new parser was allocating much more memory (200%+ in most benchmarks) than the previous one. It was spending so much time allocating memory that the performance improvement from reducing the stack size was not as significant as it should have been.

The new goyacc generator

This PR attempts to address the two major shortcomings in the previous iteration of these changes, so that we don't have to make a memory/speed tradeoff nor any changes to our grammar to obtain a very significant performance improvement from the parser. How do we accomplish this?

  • I've fixed the increased memory consumption by implementing optional union semantics. To understand the problem in the previous implementation: most of the time, writing to an interface{} in Go allocates. An interface{} is a tuple of pointers: the first one points at a VTable with metadata for the type being held in the interface; the second one points to the actual data contained in the interface. The advantage that our old struct has over interfaces is that when we assign a struct, or a primitive, to one of the fields, it's just copying memory into the parent struct. But when we assign a struct to our interface{} field, we're actually allocating space for the struct in the heap, because we need to store a pointer to it.

    The key to fixing the increase in allocations is storing in the union interface{} field only the types that would not allocate. To accomplish this, the new generator lets us separate the fields in our grammar into unionized fields, which will be stored in an interface{}, and struct fields, which will be stored as they were before.

    We're now storing in an interface{} union:

    • Named interface fields (e.g. Statement): These fields were already allocating memory, because all interfaces allocate. Moving them to a shared interface{} greatly reduces the size of the stack at no penalty whatsoever.
    • Pointer fields (e.g. *TableSpec): These pointers were already heap allocations, so moving them into an interface{} does not allocate either.
    • Slice fields (e.g. []*When): This is actually a memory usage regression, which I've mitigated by using codegen tricks; see the following section for details.
    • Literal fields (e.g. Scope, where type Scope int8): Back in the day, in the early versions of Go, literals could be stored inline in place of the pointer field, preventing an allocation. After some GC improvements in the Go runtime, this optimization was removed because it was much simpler to assume that the data word of an interface{} is always a pointer, so literals are now stored as a pointer to a heap allocation that contains the literal. So why are we placing them in an interface{}, then? It turns out that our literals are small enough that they always benefit from another optimization: the Go runtime has a static array with the numbers 0 through 255; when storing a number in that range in an interface{}, the compiler doesn't create a heap allocation; instead it stores a pointer into the static array, which is not an allocation.

    We're keeping outside of the union:

    • The main string field: a string is a tuple of two words. The first word is a pointer to a heap allocation with the contents of the string, the second word is the length of the string. When storing a string in an interface{}, we're creating an extra heap allocation for those two words, so suddenly every string is two heap allocations (the tuple, and the contents of the string). This was accounting for almost 50% of all the extra allocations because we write into the string field for every tokenization step in the parser -- simply moving it outside of the union is a massive performance improvement.
    • The strs []string field: this slice field is kept out of the union because it's used very frequently by SQL comments.
    • All the fields that were previously embedded structs in our stack frame: these fields are usually very hot, which is why they were being copied by value before. By keeping them outside of the interface{} we prevent them from allocating. (The sketch right after this list illustrates which stores into an interface{} allocate.)
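
A minimal, self-contained sketch of these allocation rules (Scope is borrowed from the grammar for illustration; the printed counts assume a recent Go runtime):

Go sketch: which stores into an interface{} allocate
package main

import (
	"fmt"
	"testing"
)

type Scope int8

func main() {
	var sink interface{}

	// Small integers (0..255) convert to interface{} without allocating:
	// the data word points into the runtime's static array of small values.
	small := testing.AllocsPerRun(100, func() { sink = Scope(7) })

	// A non-empty string allocates: its two-word header must be boxed
	// behind the interface's single data pointer.
	s := "SELECT 1"
	str := testing.AllocsPerRun(100, func() { sink = s })

	fmt.Println(small, str) // prints 0 1
	_ = sink
}
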
  • I've fixed the requirement for grammar changes by making the new generator aware of Go syntax. Most notably, for any $$ variable access, it can tell whether the variable acts as an lvalue or an rvalue (if you need a refresher on C-language semantics, this article is nice) and generate code accordingly. This was the major shortcoming of the previous version of goyacc that was forcing us to rewrite chunks of our grammar: when the current stack frame $$ is accessed as an rvalue, its unionized value needs to be cast into the expected type for the frame; when accessed as an lvalue, the assignment needs to be typechecked and then assigned into the unionized field. The new generator supports both cases.

    For reference, here are some examples of what the new codegen looks like:

Codegen sample: Typesafe insert into union
update_statement:
UPDATE comment_opt ignore_opt table_references SET update_list where_expression_opt order_by_opt limit_opt
{
	$$ = &Update{Comments: Comments($2), Ignore: $3, TableExprs: $4, Exprs: $6, Where: NewWhere(WhereClause, $7), OrderBy: $8, Limit: $9}
}
	case 56:
		yyDollar = yyS[yypt-9 : yypt+1]
		var yyLOCAL Statement
		{
			yyLOCAL = &Update{Comments: Comments(yyDollar[2].strs), Ignore: yyDollar[3].ignoreUnion(), TableExprs: yyDollar[4].tableExprsUnion(), Exprs: yyDollar[6].updateExprsUnion(), Where: NewWhere(WhereClause, yyDollar[7].exprUnion()), OrderBy: yyDollar[8].orderByUnion(), Limit: yyDollar[9].limitUnion()}
		}
		yyVAL.union = yyLOCAL
Codegen sample: In-place access for $$
| DECIMAL decimal_length_opt
  {
    $$ = &ConvertType{Type: string($1)}
    $$.Length = $2.Length
    $$.Scale = $2.Scale
  }
	case 778:
		yyDollar = yyS[yypt-2 : yypt+1]
		var yyLOCAL *ConvertType
		{
			yyLOCAL = &ConvertType{Type: string(yyDollar[1].str)}
			yyLOCAL.Length = yyDollar[2].LengthScaleOption.Length
			yyLOCAL.Scale = yyDollar[2].LengthScaleOption.Scale
		}
		yyVAL.union = yyLOCAL
  • Lastly, I've reduced memory allocations further by adding support for updating slices in-place. Roughly 20% of all the memory allocations in the previous generator were caused by calling append on the interface field. Yacc code that looked like $$ = append($$, 1) was being rewritten into yyVAL.union = append(yyVAL.union.([]int), 1). In theory appending to a slice with enough capacity should not allocate, but this operation does. Why? Similar to the string case discussed earlier, a slice is a triplet of words (pointer to the actual contents; length of the slice; and capacity of the slice). Hence, storing a slice in an interface{} causes two allocations (one for the triplet, one for the contents). We have enough different slices in the parser that it makes sense to union them all, and this would be fine if the extra allocation for the triplet happened only when we create the slice. But the allocation also happens every time we append to it, because the assignment to the interface forces an allocation: the return of the append function is a slice header, which as we know is a triplet of words; assigning this slice header to a local variable is a no-op that the compiler optimizes away (the slice is "appended" in-place), but assigning it to an interface{} means allocating 24 new bytes for the header to replace the existing header. The result of the append reuses the underlying array for the slice, but allocates a brand new header every time.

    This is a big enough problem (since the parser appends for basically every grammar rule) that I designed a specific fix for it: a rewrite step that detects cases where we're appending to the same slice and rewrites the append so it can be performed directly on the slice header, by acquiring a pointer to it. This code uses unsafe, although in practice it's fully safe. I am not madly in love with it, but the improvement in benchmarks is very noticeable; I've configured the new goyacc with a -fast-append flag that lets us turn the optimization on/off. (A sketch of what a helper like yyIaddr can look like follows the codegen sample below.)

Codegen sample: Slice creation and in-place update
view_name_list:
  table_name
  {
    $$ = TableNames{$1.ToViewName()}
  }
| view_name_list ',' table_name
  {
    $$ = append($$, $3.ToViewName())
  }
	case 63:
		yyDollar = yyS[yypt-1 : yypt+1]
		var yyLOCAL TableNames
		{
			yyLOCAL = TableNames{yyDollar[1].tableName.ToViewName()}
		}
		yyVAL.union = yyLOCAL
	case 64:
		yyDollar = yyS[yypt-3 : yypt+1]
		{
			yySLICE := (*TableNames)(yyIaddr(yyVAL.union))
			*yySLICE = append(*yySLICE, yyDollar[3].tableName.ToViewName())
		}
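
For reference, here is a minimal sketch of how a yyIaddr-style helper makes the in-place append work. The helper's body is an assumption about its shape inferred from the generated code above, not the exact code in the fork:

Go sketch: in-place append through the interface's data pointer
package main

import (
	"fmt"
	"unsafe"
)

// An interface{} is a two-word (type, data) pair; this returns the data
// word, i.e. a pointer to the slice header boxed behind the interface.
func yyIaddr(v interface{}) unsafe.Pointer {
	type eface struct {
		typ, data unsafe.Pointer
	}
	return (*eface)(unsafe.Pointer(&v)).data
}

func main() {
	var union interface{} = make([]int, 0, 8)
	for i := 0; i < 3; i++ {
		// Mutate the boxed header directly: append reuses the backing
		// array and no new 24-byte header is allocated per append.
		yySLICE := (*[]int)(yyIaddr(union))
		*yySLICE = append(*yySLICE, i)
	}
	fmt.Println(union) // [0 1 2]
}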

Benchmark results

We're comparing the relative change in performance, memory allocations (in bytes) and memory allocations (by count) for 4 different implementations:

  • single_iface is @sougou and @GuptaManan100's original PR, which I've rebased on the current master and used as the basis for my work; it includes several changes to the grammar that are not needed by the other implementations. It stores every field in the grammar in a single interface{}.

  • multi_union is an overly complex version of the generator that allowed configuring many different union fields (for pointers, for primitives, for interfaces, etc.). I discarded it because it also required changes to the grammar.

  • optimal_union is the final version of the code that only separates between struct fields and an interface{} union.

  • optimal_union_fast_append is like above, but with the in-place append optimization enabled for slices.

For all graphs, the Y axis is percentage change, so smaller is always better.

[Graphs: relative change in CPU time (yacc_cpu), allocated bytes (yacc_alloc_b), and allocation count (yacc_alloc_n)]

Performance-wise, the results are very good. The improvement was already impressive in the original PR after applying the grammar changes (despite the increase in memory consumption), but after unionizing selectively it goes through the roof: over 40% faster parsing in all benchmarks (that's almost twice as fast!) except in the pathological ones, which were designed as tokenization benchmarks so they're not particularly interesting here. This is all with no changes to the idioms in the grammar.

When it comes to memory usage, that's where we see the effort in this PR: we've taken the increase in allocation counts down from +200% to roughly +25% in the realistic benchmarks, and to only +10-12% in the stress benchmarks; total allocated bytes increase by at most ~13%.

Obviously this is not ideal -- we'd want the parser to be both faster and to allocate less memory, but sometimes we need to trade memory for speed because computers aren't magical.

Final benchmark table: master vs optimal_union_fast_append
name                               old time/op    new time/op    delta
ParseTraces/django_queries.txt-16    2.45ms ± 2%    1.43ms ± 3%  -41.45%  (p=0.008 n=5+5)
ParseTraces/lobsters.sql.gz-16        106ms ± 2%      62ms ± 2%  -41.83%  (p=0.008 n=5+5)
ParseStress/sql0-16                  13.1µs ± 2%     7.7µs ± 2%  -41.39%  (p=0.008 n=5+5)
ParseStress/sql1-16                  40.9µs ± 2%    24.4µs ± 2%  -40.38%  (p=0.008 n=5+5)
Parse3/normal-16                     1.75ms ± 3%    1.71ms ± 0%   -2.19%  (p=0.016 n=5+4)
Parse3/escaped-16                    5.84ms ± 1%    6.18ms ± 1%   +5.83%  (p=0.008 n=5+5)

name                               old alloc/op   new alloc/op   delta
ParseTraces/django_queries.txt-16     211kB ± 0%     238kB ± 0%  +12.78%  (p=0.008 n=5+5)
ParseTraces/lobsters.sql.gz-16       9.62MB ± 0%   10.86MB ± 0%  +12.95%  (p=0.008 n=5+5)
ParseStress/sql0-16                  1.43kB ± 0%    1.52kB ± 0%   +6.74%  (p=0.016 n=4+5)
ParseStress/sql1-16                  4.78kB ± 0%    5.00kB ± 0%   +4.67%  (p=0.016 n=5+4)
Parse3/normal-16                     3.10kB ± 0%    3.17kB ± 0%   +2.17%  (p=0.008 n=5+5)
Parse3/escaped-16                    5.23MB ± 0%    5.23MB ± 0%     ~     (p=0.841 n=5+5)

name                               old allocs/op  new allocs/op  delta
ParseTraces/django_queries.txt-16     3.28k ± 0%     4.13k ± 0%  +26.06%  (p=0.008 n=5+5)
ParseTraces/lobsters.sql.gz-16         154k ± 0%      193k ± 0%  +25.38%  (p=0.008 n=5+5)
ParseStress/sql0-16                    24.0 ± 0%      27.0 ± 0%  +12.50%  (p=0.008 n=5+5)
ParseStress/sql1-16                    69.0 ± 0%      76.0 ± 0%  +10.14%  (p=0.008 n=5+5)
Parse3/normal-16                       52.0 ± 0%      55.0 ± 0%   +5.77%  (p=0.008 n=5+5)
Parse3/escaped-16                       294 ± 0%       297 ± 0%   +1.02%  (p=0.008 n=5+5)

Conclusions

I believe that the performance benefits we're getting from this changeset, and the fact that it involves no actual changes to the grammar itself, make it worth carrying our own fork of goyacc in the codebase (which I've suitably placed in the sqlparser directory to make generation trivial).

Related Issue(s)

Checklist

  • Should this PR be backported?
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

Impacted Areas in Vitess

Components that this PR will affect:

  • Query Serving
  • VReplication
  • Cluster Management
  • Build/CI
  • VTAdmin

vmg added 13 commits March 12, 2021 12:30
vmg added 2 commits March 12, 2021 13:07
Collaborator

@systay systay left a comment

Looks great. I have not looked at the goyacc.go file - trusting the benchmarks and unit tests we have.

Member

@GuptaManan100 GuptaManan100 left a comment

🚀

Collaborator

systay commented Mar 16, 2021

I just had one thought before merging this: how do we track upstream goyacc improvements? Wouldn't it be better to fork the repo and make the changes there instead of copying the file over?

That would also make it possible to see what you changed in the goyacc implementation. I still wouldn't understand it, but others might.

Collaborator Author

vmg commented Mar 16, 2021

@systay: Upstream goyacc hasn't received a meaningful change in 5 years, since it is no longer being used in the Go compiler itself. We pretty much own this code now, for better or worse.

@systay systay merged commit 90fda12 into vitessio:master Mar 16, 2021
@askdba askdba added this to the v10.0 milestone Mar 18, 2021
Member

deepthi commented Dec 14, 2023

@vmg is it worth pulling this commit into our copy? golang/tools@59f1f2c
