Control memory explosion on large list of queries #102

rahulpowar · 2017-07-07T14:24:28Z

In our use case (server to server), we were making batch queries to our Go server and ran into a problem when a single request started to exceed about 100 queries. We need to pack many queries into one request because we can parallelise them very effectively and want to hide the request latencies.

Performance would collapse and resident memory would spike to tens of GB in production, where requests had ~3000 queries. I added a benchmark, BenchmarkFragmentQueries, to graphql_test.go, and a little profiling showed that almost all of this was due to validation. With an early return from validation, the time and memory overheads vanish.

https://github.com/neelance/graphql-go/blob/master/internal/validation/validation.go#L174-L178 causes a quadratic (n^2) explosion of memory due to https://github.com/neelance/graphql-go/blob/master/internal/validation/validation.go#L389-L393: the keys take up a huge amount of space, and the hashmap implementation struggles as it expands its buckets internally to cope. This memory also does not get released with the GC of the map itself. To control this, this PR stops using the map when the selection length is greater than 100, which causes more computation but is faster than fighting the map. I also made a few other tweaks to improve performance, and it now works OK for our use case, with MBs instead of GBs of RSS and sub-second overheads.

The numbers get much worse if field names overlap, because validateFieldOverlap does not terminate early, but I skipped fixes and investigation as that does not look like a common issue in the real world. However, if you try it, current master exceeds the 10-minute limit for a benchmark test, so it likely needs some guards to prevent DoS issues with public APIs. I believe this could all be a bit smarter, as the queries have the same repeating fragments, but I am new to this codebase.

Performance tests

The headline for us was reducing 65s (of mainly validation) down to 1s for 10,000 queries in the request.

# master with the new test added
$ go test -bench=FragmentQueries -benchmem > /tmp/master-graphql.txt

# this fork
$ go test -bench=FragmentQueries -benchmem > /tmp/branch-graphql.txt

$ benchcmp /tmp/master-graphql.txt /tmp/branch-graphql.txt
benchmark                                                            old ns/op       new ns/op      delta
BenchmarkFragmentQueries/1_queries_non-overlapping_aliases-8         42890           41963          -2.16%
BenchmarkFragmentQueries/10_queries_non-overlapping_aliases-8        239469          201074         -16.03%
BenchmarkFragmentQueries/100_queries_non-overlapping_aliases-8       3337688         2820303        -15.50%
BenchmarkFragmentQueries/1000_queries_non-overlapping_aliases-8      376659442       24807675       -93.41%
BenchmarkFragmentQueries/10000_queries_non-overlapping_aliases-8     65201269084     1067392258     -98.36%

benchmark                                                            old allocs     new allocs     delta
BenchmarkFragmentQueries/1_queries_non-overlapping_aliases-8         254            252            -0.79%
BenchmarkFragmentQueries/10_queries_non-overlapping_aliases-8        1981           1973           -0.40%
BenchmarkFragmentQueries/100_queries_non-overlapping_aliases-8       19559          19245          -1.61%
BenchmarkFragmentQueries/1000_queries_non-overlapping_aliases-8      247891         189323         -23.63%
BenchmarkFragmentQueries/10000_queries_non-overlapping_aliases-8     7900911        1916028        -75.75%

benchmark                                                            old bytes       new bytes     delta
BenchmarkFragmentQueries/1_queries_non-overlapping_aliases-8         16300           14488         -11.12%
BenchmarkFragmentQueries/10_queries_non-overlapping_aliases-8        146670          126303        -13.89%
BenchmarkFragmentQueries/100_queries_non-overlapping_aliases-8       2322911         1674979       -27.89%
BenchmarkFragmentQueries/1000_queries_non-overlapping_aliases-8      169845173       10843020      -93.62%
BenchmarkFragmentQueries/10000_queries_non-overlapping_aliases-8     10961889744     115822224     -98.94%

Overlapping fields is a bit of an artificial test case.

tonyghita · 2017-07-07T16:28:20Z

.gitignore

@@ -1 +1,3 @@
 /internal/tests/testdata/graphql-js
+
+.idea/


This looks like it's specific to your developer environment. It'd probably be best to add it to your global .gitignore.

rahulpowar · 2017-07-22T12:30:27Z

Have zapped the gitignore as requested.

tonyghita

This seems like a very reasonable change. Thank you for including the benchmark.

Would you be able to fix up the merge conflicts? I'm happy to merge the changes once this PR is good to go.

tonyghita · 2018-04-09T21:07:30Z

graphql_test.go

+			if o {
+				a = "non-overlapping"
+			}
+			b.Run(fmt.Sprintf("%d queries %s aliases", c, a), func(b *testing.B) {


Maybe we'd want to b.ResetTimer() after setup? What do you think?

tonyghita · 2018-04-09T21:07:51Z

internal/validation/validation.go

@@ -171,9 +171,10 @@ func validateSelectionSet(c *opContext, sels []query.Selection, t schema.NamedTy
 		validateSelection(c, sel, t)
 	}

+	useCache := len(sels) <= 100


Should this threshold be configurable? How did we arrive at 100?

tonyghita · 2018-04-09T21:09:28Z

internal/validation/validation.go

@@ -381,16 +382,21 @@ func detectFragmentCycleSel(c *context, sel query.Selection, fragVisited map[*qu
 	}
 }

-func (c *context) validateOverlap(a, b query.Selection, reasons *[]string, locs *[]errors.Location) {
+func (c *context) validateOverlap(a, b query.Selection, reasons *[]string, locs *[]errors.Location, useCache bool) {


Just a side comment on the internal function signatures that does not apply to these changes: the internal functions might have nicer APIs if they took structs instead of ever-expanding lists of arguments.

pavelnikolov · 2020-03-25

I agree with @tonyghita, and this is a refactoring that ~~sooner or later has to be done~~ would improve the maintainability of the library. However, it would be a breaking change and would also require a considerable amount of work.

aaahrens · 2018-09-10T20:20:07Z

Can someone merge this?

pavelnikolov · 2018-10-04T02:53:50Z

There are merge conflicts to be fixed first...

pavelnikolov · 2018-10-16T02:43:17Z

graphql_test.go

@@ -8,6 +8,8 @@ import (
 	"github.com/neelance/graphql-go"
 	"github.com/neelance/graphql-go/example/starwars"
 	"github.com/neelance/graphql-go/gqltesting"
+	"fmt"
+	"bytes"


Imports are in the wrong group.

andreiavrammsd · 2019-03-14T14:20:45Z

This situation occurs quickly with large request sizes. I sent a ~100KB query string and memory use rapidly grew to more than 1GB in the overlapValidated map.

c.overlapValidated[selectionPair{a, b}] = struct{}{}

stefanvanburen · 2019-03-14T14:55:25Z

I'd be willing to put together a PR with the merge conflicts fixed here, as well as the benchmark fix. If we did this, how would we want to approach the threshold? Make it configurable? Or should we just allow 100 to be the default?

pavelnikolov · 2019-03-14T20:55:12Z

Thank you, @svanburen! I would prefer if this is opt-in and the threshold is configurable.
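One way an opt-in, configurable threshold could be wired up is a functional option whose zero value keeps the current behaviour. This is only a sketch: the validator struct, the Option type and the OverlapCacheLimit name are hypothetical and not part of graphql-go.

```go
package main

import "fmt"

// validator is a hypothetical stand-in for the validation context.
type validator struct {
	// overlapCacheLimit is the largest selection set for which the pair
	// cache is used. Zero means no limit, which is the existing behaviour.
	overlapCacheLimit int
}

// Option configures a validator. Hypothetical, for illustration only.
type Option func(*validator)

// OverlapCacheLimit opts in to bypassing the cache for selection sets
// larger than n.
func OverlapCacheLimit(n int) Option {
	return func(v *validator) { v.overlapCacheLimit = n }
}

func newValidator(opts ...Option) *validator {
	v := &validator{}
	for _, opt := range opts {
		opt(v)
	}
	return v
}

// useCache reports whether a selection set of the given size should use
// the pair cache.
func (v *validator) useCache(selCount int) bool {
	return v.overlapCacheLimit <= 0 || selCount <= v.overlapCacheLimit
}

func main() {
	def := newValidator()
	limited := newValidator(OverlapCacheLimit(100))
	fmt.Println(def.useCache(3000), limited.useCache(100), limited.useCache(3000))
	// prints "true true false"
}
```

Without the option nothing changes for existing users; with it, only selection sets above the chosen size skip the cache.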

stefanvanburen · 2019-03-15T01:45:10Z

I've been playing around with this for a bit and have it mostly working - the only issue is that given the current implementation, if we make the threshold configurable and it's below a certain value, the validation tests will panic (a value of 1 or 2 seems to trigger this).

I can put up a PR with the failing tests, but any initial thoughts here? I'm wondering if we should ignore the configuration for the time being and just get in a merge-able PR?

dmotylev · 2020-03-25T09:46:57Z

There are merge conflicts to be fixed first...

Fixed

rahulpowar and others added 7 commits June 26, 2017 19:17

Added benchmark for aliases

cdec6f1

Added test for overlapping alias

47c1b55

Added cache bypass for long lists

93175b0

Minor performance tweaks

f14a316

Modifed test cases to remove overlaps

602f734

Overlapping fields is a bit of an artificial test case.

Merge remote-tracking branch 'neelance/master'

88260d5

Merge remote-tracking branch 'neelance/master'

d27e237

tonyghita reviewed Jul 7, 2017

Removed gitignore contamination.

31eb94b

tonyghita reviewed Apr 9, 2018

aaahrens mentioned this pull request Sep 10, 2018

Why is there so many open PR's with no conflicts and resolved discussion #257

Closed

pavelnikolov reviewed Oct 16, 2018

Merge remote-tracking branch 'upstream/master'

0c0545f

# Conflicts: # graphql_test.go

csucu added 4 commits November 1, 2021 11:33

mute scanner errors

720c77c

added collector, fmt & go mod tidy

00e76d5

Revert "added collector, fmt & go mod tidy"

e966f12

This reverts commit 00e76d5.

Merge pull request #1 from redsift/feature/mute-errors

3a8d13b

mute scanner errors
