perf: sqlparser yacc codegen #7669

Merged
merged 16 commits into vitessio:master on Mar 16, 2021

Conversation

Collaborator

@vmg vmg commented Mar 12, 2021

Description

Following up on last week's zero-copy optimization for the Tokenizer, today we're picking up where we left off with some performance tuning for the generated Yacc parser.

This is the underlying idea: YACC is a parser generator that outputs stack-based LALR parsers from an input grammar. Each action in a YACC grammar receives its input from one or more stack frames, and is supposed to push its output into a brand new stack frame. These are not native stack frames; they're abstracted by YACC itself into an array of "stack structs". These structs are supposed to contain any of the possible outputs for any of the possible actions in a grammar. This is cheap to implement in C (the language for which YACC was originally designed), because C supports union types.
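
For readers who haven't written Yacc before, here is what such an action looks like. This is a hypothetical rule in the style of our grammar (the rule and the BinaryExpr type are illustrative, not taken from sql.y): the action reads its inputs from the stack frames behind $1 and $3 and pushes its result as the new frame for $$.

Grammar sample (hypothetical): a Yacc action and its stack frames
expr:
  expr '+' expr
  {
    $$ = &BinaryExpr{Left: $1, Operator: "+", Right: $3}
  }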

A union type in C can have any number of fields, but only one field can be written or read at a time. Unlike a struct, all the fields in a union share the same underlying memory, so the size of the union is the size of the largest of its fields, instead of the sum of all its fields like in a struct. This is an important memory optimization that can be performed in YACC parsers in C, and that doesn't work at all in Go: the Go programming language does not support union types because they're not memory-safe (if you write into a pointer field in the union and then read from e.g. a float field, undefined behavior happens).

The fact that Go doesn't support unions is a massive performance drawback for our YACC parser, because it means that every stack frame in our parser must be a Go struct with a field for every possible output of a grammar action. In our SQL grammar, this is 94 fields totalling 1424 bytes per stack frame. Because of this, the performance of our parser is overwhelmingly dominated by simply copying and clearing stack frames, as opposed to parsing.
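
As a rough illustration of the shape of the problem, here is a hypothetical miniature of such a stack struct (not the actual generated code; the real one has 94 fields):

Go sketch: every frame pays for every field
package main

import "unsafe"

type (
	Scope     int8
	Statement interface{ isStatement() }
	TableSpec struct{}
)

// Hypothetical miniature of the generated stack struct: only one field is
// ever live per frame, but pushing, copying or clearing a frame moves all
// of them (1424 bytes per frame in our real grammar).
type yySymType struct {
	str       string
	statement Statement
	tableSpec *TableSpec
	scope     Scope
	// ... 90 more fields in the real sql.y
}

func main() {
	println(unsafe.Sizeof(yySymType{})) // grows with every new field
}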

This problem is immediately obvious when running a profiler on the parser, and in fact @sougou and @GuptaManan100 already tried fixing this in a previous PR that didn't get merged. The design behind the PR is compelling: it tries to re-implement unions in Go by storing all the fields in the original struct into a single interface{} field, the closest thing the language has to a union. A Go interface{} can hold any Go type, so it's something that could behave like a union (and hence reduce the memory usage of the parser) while being memory safe.
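
A minimal sketch of that idea (the types are illustrative stand-ins, not the parser's actual code): reads from the interface{} are checked at runtime, which is what makes this union substitute memory-safe, unlike a C union.

Go sketch: interface{} as a memory-safe union
package main

import "fmt"

type (
	Scope     int8
	Statement interface{ isStatement() }
)

// A single interface{} slot plays the role of the C union: it holds exactly
// one of the possible action outputs at a time.
type symbolValue struct {
	union interface{}
}

func main() {
	v := symbolValue{union: Scope(1)}
	fmt.Println(v.union.(Scope)) // asserting the stored type works

	// Asserting a different type never reinterprets memory like a C union:
	// the checked form reports failure (the unchecked form would panic).
	if _, ok := v.union.(Statement); !ok {
		fmt.Println("not a Statement")
	}
}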

Why didn't the original PR get merged? It was a combination of two factors:

  • The PR required carrying a fork of goyacc (the yacc generator we're using) and making changes to our sql.y, because some of the idioms used in our grammar were no longer supported after the fork's optimizations.
  • The new parser was allocating much more memory (200%+ in most benchmarks) than the previous one. It was spending so much time allocating memory that the performance improvement from reducing the stack size was not as significant as it should have been.

The new goyacc generator

This PR attempts to address the two major shortcomings in the previous iteration of these changes, so that we don't have to make a memory/speed tradeoff nor any changes to our grammar to obtain a very significant performance improvement from the parser. How do we accomplish this?

  • I've fixed the increased memory consumption by implementing optional union semantics. To understand the problem in the previous implementation: most of the time, writing to an interface{} in Go allocates. An interface{} is a tuple of pointers: the first one points at a VTable with metadata for the type being held in the interface; the second one points to the actual data contained in the interface. The advantage that our old struct has over interfaces is that when we assign a struct, or a primitive, to one of the fields, it's just copying memory into the parent struct. But when we assign a struct to our interface{} field, we're actually allocating space for the struct in the heap, because we need to store a pointer to it.

    The key to fixing the increase in allocations is storing in the union interface{} field only the types that would not allocate. To accomplish this, the new generator lets us separate the fields in our grammar into unionized fields, which will be stored in an interface{}, and struct fields, which will be stored as they were before.

    We're now storing in an interface{} union:

    • Named interface fields (e.g. Statement): These fields were already allocating memory, because all interfaces allocate. Moving them to a shared interface{} greatly reduces the size of the stack at no penalty whatsoever.
    • Pointer fields (e.g. *TableSpec): These pointers were already heap allocations, so moving them into an interface{} does not allocate either.
    • Slice fields (e.g. []*When): This is actually a memory usage regression, which I've mitigated by using codegen tricks; see the following section for details.
    • Literal fields (e.g. Scope, where type Scope int8): Back in the day, in the early versions of Go, literals could be stored inline in place of the pointer field, preventing an allocation. After some GC improvements in the Go runtime, this optimization was removed because it was much simpler to assume that the data word of an interface{} is always a pointer, so literals are now stored as a pointer to a heap allocation that contains the literal. So why are we placing them in an interface{}, then? It turns out that our literals are small enough that they always benefit from another optimization: the Go runtime has a static array with the numbers 0 through 255; when storing a number in that range in an interface{}, the compiler doesn't create a heap allocation; instead it stores a pointer into the static array, which is not an allocation.

    We're keeping outside of the union:

    • The main string field: a string is a tuple of two words. The first word is a pointer to a heap allocation with the contents of the string, the second word is the length of the string. When storing a string in an interface{}, we're creating an extra heap allocation for those two words, so suddenly every string is two heap allocations (the tuple, and the contents of the string). This was accounting for almost 50% of all the extra allocations because we write into the string field for every tokenization step in the parser -- simply moving it outside of the union is a massive performance improvement.
    • The strs []string field: this slice field is kept out of the union because it's used very frequently by SQL comments.
    • All the fields that were previously embedded structs in our stack frame: these fields are usually very hot, which is why they were being copied by value before. By keeping them outside of the interface{} we prevent them from allocating. (The sketch right after this list illustrates which stores into an interface{} allocate.)
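
A minimal, self-contained sketch of these allocation rules (Scope is borrowed from the grammar for illustration; the printed counts assume a recent Go runtime):

Go sketch: which stores into an interface{} allocate
package main

import (
	"fmt"
	"testing"
)

type Scope int8

func main() {
	var sink interface{}

	// Small integers (0..255) convert to interface{} without allocating:
	// the data word points into the runtime's static array of small values.
	small := testing.AllocsPerRun(100, func() { sink = Scope(7) })

	// A non-empty string allocates: its two-word header must be boxed
	// behind the interface's single data pointer.
	s := "SELECT 1"
	str := testing.AllocsPerRun(100, func() { sink = s })

	fmt.Println(small, str) // prints 0 1
	_ = sink
}
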
  • I've fixed the requirement for grammar changes by making the new generator aware of Go syntax. Most notably, for any $$ variable access, it can tell whether the variable acts as an lvalue or an rvalue (if you need a refresher on C-language semantics, this article is nice) and generate code accordingly. This was the major shortcoming of the previous version of goyacc that was forcing us to rewrite chunks of our grammar: when the current stack frame $$ is accessed as an rvalue, its unionized value needs to be cast into the expected type for the frame; when accessed as an lvalue, the assignment needs to be typechecked and then assigned into the unionized field. The new generator supports both cases.

    For reference, here are some examples of what the new codegen looks like:

Codegen sample: Typesafe insert into union
update_statement:
UPDATE comment_opt ignore_opt table_references SET update_list where_expression_opt order_by_opt limit_opt
{
	$$ = &Update{Comments: Comments($2), Ignore: $3, TableExprs: $4, Exprs: $6, Where: NewWhere(WhereClause, $7), OrderBy: $8, Limit: $9}
}
	case 56:
		yyDollar = yyS[yypt-9 : yypt+1]
		var yyLOCAL Statement
		{
			yyLOCAL = &Update{Comments: Comments(yyDollar[2].strs), Ignore: yyDollar[3].ignoreUnion(), TableExprs: yyDollar[4].tableExprsUnion(), Exprs: yyDollar[6].updateExprsUnion(), Where: NewWhere(WhereClause, yyDollar[7].exprUnion()), OrderBy: yyDollar[8].orderByUnion(), Limit: yyDollar[9].limitUnion()}
		}
		yyVAL.union = yyLOCAL
Codegen sample: In-place access for $$
| DECIMAL decimal_length_opt
  {
    $$ = &ConvertType{Type: string($1)}
    $$.Length = $2.Length
    $$.Scale = $2.Scale
  }
	case 778:
		yyDollar = yyS[yypt-2 : yypt+1]
		var yyLOCAL *ConvertType
		{
			yyLOCAL = &ConvertType{Type: string(yyDollar[1].str)}
			yyLOCAL.Length = yyDollar[2].LengthScaleOption.Length
			yyLOCAL.Scale = yyDollar[2].LengthScaleOption.Scale
		}
		yyVAL.union = yyLOCAL
  • Lastly, I've reduced memory allocations further by adding support for updating slices in-place. Roughly 20% of all the memory allocations in the previous generator were caused by calling append on the interface field. Yacc code that looked like $$ = append($$, 1) was being rewritten into yyVAL.union = append(yyVAL.union.([]int), 1). In theory appending to a slice with enough capacity should not allocate, but this operation does. Why? Similar to the string case discussed earlier, a slice is a triplet of words (pointer to the actual contents; length of the slice; and capacity of the slice). Hence, storing a slice in an interface{} causes two allocations (one for the triplet, one for the contents). We have enough different slices in the parser that it makes sense to union them all, and this would be fine if the extra allocation for the triplet happened only when we create the slice. But the allocation also happens every time we append to it, because the assignment to the interface forces an allocation: the return of the append function is a slice header, which as we know is a triplet of words; assigning this slice header to a local variable is a no-op that the compiler optimizes away (the slice is "appended" in-place), but assigning it to an interface{} means allocating 24 new bytes for the header to replace the existing header. The result of the append reuses the underlying array for the slice, but allocates a brand new header every time.

    This is a big enough problem (since the parser appends for basically every grammar rule) that I designed a specific fix for it: a rewrite step that detects cases where we're appending to the same slice and rewrites the append so it can be performed directly on the slice header, by acquiring a pointer to it. This code uses unsafe, although in practice it's fully safe. I am not madly in love with it, but the improvement in benchmarks is very noticeable; I've configured the new goyacc with a -fast-append flag that lets us turn the optimization on/off. (A sketch of what a helper like yyIaddr can look like follows the codegen sample below.)

Codegen sample: Slice creation and in-place update
view_name_list:
  table_name
  {
    $$ = TableNames{$1.ToViewName()}
  }
| view_name_list ',' table_name
  {
    $$ = append($$, $3.ToViewName())
  }
	case 63:
		yyDollar = yyS[yypt-1 : yypt+1]
		var yyLOCAL TableNames
		{
			yyLOCAL = TableNames{yyDollar[1].tableName.ToViewName()}
		}
		yyVAL.union = yyLOCAL
	case 64:
		yyDollar = yyS[yypt-3 : yypt+1]
		{
			yySLICE := (*TableNames)(yyIaddr(yyVAL.union))
			*yySLICE = append(*yySLICE, yyDollar[3].tableName.ToViewName())
		}
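
For reference, here is a minimal sketch of how a yyIaddr-style helper makes the in-place append work. The helper's body is an assumption about its shape inferred from the generated code above, not the exact code in the fork:

Go sketch: in-place append through the interface's data pointer
package main

import (
	"fmt"
	"unsafe"
)

// An interface{} is a two-word (type, data) pair; this returns the data
// word, i.e. a pointer to the slice header boxed behind the interface.
func yyIaddr(v interface{}) unsafe.Pointer {
	type eface struct {
		typ, data unsafe.Pointer
	}
	return (*eface)(unsafe.Pointer(&v)).data
}

func main() {
	var union interface{} = make([]int, 0, 8)
	for i := 0; i < 3; i++ {
		// Mutate the boxed header directly: append reuses the backing
		// array and no new 24-byte header is allocated per append.
		yySLICE := (*[]int)(yyIaddr(union))
		*yySLICE = append(*yySLICE, i)
	}
	fmt.Println(union) // [0 1 2]
}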

Benchmark results

We're comparing the relative change in performance, memory allocations (in bytes) and memory allocations (by count) for 4 different implementations:

  • single_iface is @sougou and @GuptaManan100's original PR, which I've rebased on the current master and used as the basis for my work; it includes several changes to the grammar that are not needed by the other implementations. It stores every field in the grammar in a single interface{}.

  • multi_union is an overly complex version of the generator that allowed configuring many different union fields (for pointers, for primitives, for interfaces, etc.). I discarded it because it also required changes to the grammar.

  • optimal_union is the final version of the code that only separates between struct fields and an interface{} union.

  • optimal_union_fast_append is like above, but with the in-place append optimization enabled for slices.

For all graphs, the Y axis is percentage change, so smaller is always better.

[Graphs: relative change in CPU time (yacc_cpu), allocated bytes (yacc_alloc_b), and allocation count (yacc_alloc_n)]

Performance-wise, the results are very good. The improvement was already impressive in the original PR after applying the grammar changes (despite the increase in memory consumption), but after unionizing selectively it goes through the roof: over 40% faster parsing in all benchmarks (that's almost twice as fast!) except in the pathological ones, which were designed as tokenization benchmarks so they're not particularly interesting here. This is all with no changes to the idioms in the grammar.

When it comes to memory usage, that's where we see the effort in this PR: we've taken the increase in allocation counts down from +200% to roughly +25% in the realistic benchmarks, and to only +10-12% in the stress benchmarks; total allocated bytes increase by at most ~13%.

Obviously this is not ideal -- we'd want the parser to be both faster and to allocate less memory, but sometimes we need to trade memory for speed because computers aren't magical.

Final benchmark table: master vs optimal_union_fast_append
name                               old time/op    new time/op    delta
ParseTraces/django_queries.txt-16    2.45ms ± 2%    1.43ms ± 3%  -41.45%  (p=0.008 n=5+5)
ParseTraces/lobsters.sql.gz-16        106ms ± 2%      62ms ± 2%  -41.83%  (p=0.008 n=5+5)
ParseStress/sql0-16                  13.1µs ± 2%     7.7µs ± 2%  -41.39%  (p=0.008 n=5+5)
ParseStress/sql1-16                  40.9µs ± 2%    24.4µs ± 2%  -40.38%  (p=0.008 n=5+5)
Parse3/normal-16                     1.75ms ± 3%    1.71ms ± 0%   -2.19%  (p=0.016 n=5+4)
Parse3/escaped-16                    5.84ms ± 1%    6.18ms ± 1%   +5.83%  (p=0.008 n=5+5)

name                               old alloc/op   new alloc/op   delta
ParseTraces/django_queries.txt-16     211kB ± 0%     238kB ± 0%  +12.78%  (p=0.008 n=5+5)
ParseTraces/lobsters.sql.gz-16       9.62MB ± 0%   10.86MB ± 0%  +12.95%  (p=0.008 n=5+5)
ParseStress/sql0-16                  1.43kB ± 0%    1.52kB ± 0%   +6.74%  (p=0.016 n=4+5)
ParseStress/sql1-16                  4.78kB ± 0%    5.00kB ± 0%   +4.67%  (p=0.016 n=5+4)
Parse3/normal-16                     3.10kB ± 0%    3.17kB ± 0%   +2.17%  (p=0.008 n=5+5)
Parse3/escaped-16                    5.23MB ± 0%    5.23MB ± 0%     ~     (p=0.841 n=5+5)

name                               old allocs/op  new allocs/op  delta
ParseTraces/django_queries.txt-16     3.28k ± 0%     4.13k ± 0%  +26.06%  (p=0.008 n=5+5)
ParseTraces/lobsters.sql.gz-16         154k ± 0%      193k ± 0%  +25.38%  (p=0.008 n=5+5)
ParseStress/sql0-16                    24.0 ± 0%      27.0 ± 0%  +12.50%  (p=0.008 n=5+5)
ParseStress/sql1-16                    69.0 ± 0%      76.0 ± 0%  +10.14%  (p=0.008 n=5+5)
Parse3/normal-16                       52.0 ± 0%      55.0 ± 0%   +5.77%  (p=0.008 n=5+5)
Parse3/escaped-16                       294 ± 0%       297 ± 0%   +1.02%  (p=0.008 n=5+5)

Conclusions

I believe that the performance benefits we're getting from this changeset, and the fact that it involves no actual changes to the grammar itself, make it worth carrying our own fork of goyacc in the codebase (which I've suitably placed in the sqlparser directory to make generation trivial).

Related Issue(s)

Checklist

  • Should this PR be backported?
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

Impacted Areas in Vitess

Components that this PR will affect:

  • Query Serving
  • VReplication
  • Cluster Management
  • Build/CI
  • VTAdmin

vmg added 13 commits March 12, 2021 12:30
vmg added 2 commits March 12, 2021 13:07
Collaborator

@systay systay left a comment

Looks great. I have not looked at the goyacc.go file - trusting the benchmarks and unit tests we have.

Member

@GuptaManan100 GuptaManan100 left a comment

🚀

Collaborator

systay commented Mar 16, 2021

I just had one thought before merging this: how do we track upstream goyacc improvements? Wouldn't it be better to fork the repo and make the changes there instead of copying the file over?

That would also make it possible to see what you changed in the goyacc implementation. I still wouldn't understand it, but others might.

Collaborator Author

vmg commented Mar 16, 2021

@systay: Upstream goyacc hasn't received a meaningful change in 5 years, since it is no longer being used in the Go compiler itself. We pretty much own this code now, for better or worse.

@systay systay merged commit 90fda12 into vitessio:master Mar 16, 2021
@askdba askdba added this to the v10.0 milestone Mar 18, 2021
Member

deepthi commented Dec 14, 2023

@vmg is it worth pulling this commit into our copy? golang/tools@59f1f2c
