Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Control memory explosion on large list of queries #102

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

rahulpowar
Copy link

In our use case (server to server), we were making batch queries to our Go server and ran into a problem when a single request started to exceed about queries 100. In our use case, we needed to pack a bunch of queries together as we can very effectively parallelise them and want to hide the request latencies.

Performance would collapse and the resident memory would spike up to 10s of GB in production where the requests had ~3000 queries. I added a benchmark BenchmarkFragmentQueries to graphql_test.go and with a little profiling it looked like almost all of this was due to the validation. With an early return on the validation, the time and memory overheads would vanish.

https://github.com/neelance/graphql-go/blob/master/internal/validation/validation.go#L174-L178 causes a ^2 explosion of memory due to https://github.com/neelance/graphql-go/blob/master/internal/validation/validation.go#L389-L393 - the keys start taking up a huge amount of space causing the hashmap implementation to struggle while it expands buckets internally to cope. This memory also does not get released with the GC of the map itself. To control this, this PR stops the use of the map if the selection length is greater than 100 which causes more computation but is faster than fighting the map. I also made a few other tweaks to improve performance and it now works ok for our usecase with MBs instead of GBs of RSS and sub second overheads.

The numbers get much worse if field name overlaps due to validateFieldOverlap not early terminating but I skipped fixes/investigation as that does not look like a common issue in the real world. However, if you try it the current master will exceed the 10min limit for a benchmark test so likely needs some guards to prevent DoS issues with public APIs. I believe this can all be a bit smarter as the queries have the same repeating fragments but I am new to this codebase.

Performance tests

The headline for us was reducing 65s (of mainly validation) down to 1s for 10,000 queries in the request.

# master with the new test added
$ go test -bench=FragmentQueries -benchmem > /tmp/master-graphql.txt

# this fork
$  go test -bench=FragmentQueries -benchmem > /tmp/branch-graphql.txt

benchcmp /tmp/master-graphql.txt /tmp/branch-graphql.txt
benchmark                                                            old ns/op       new ns/op      delta
BenchmarkFragmentQueries/1_queries_non-overlapping_aliases-8         42890           41963          -2.16%
BenchmarkFragmentQueries/10_queries_non-overlapping_aliases-8        239469          201074         -16.03%
BenchmarkFragmentQueries/100_queries_non-overlapping_aliases-8       3337688         2820303        -15.50%
BenchmarkFragmentQueries/1000_queries_non-overlapping_aliases-8      376659442       24807675       -93.41%
BenchmarkFragmentQueries/10000_queries_non-overlapping_aliases-8     65201269084     1067392258     -98.36%

benchmark                                                            old allocs     new allocs     delta
BenchmarkFragmentQueries/1_queries_non-overlapping_aliases-8         254            252            -0.79%
BenchmarkFragmentQueries/10_queries_non-overlapping_aliases-8        1981           1973           -0.40%
BenchmarkFragmentQueries/100_queries_non-overlapping_aliases-8       19559          19245          -1.61%
BenchmarkFragmentQueries/1000_queries_non-overlapping_aliases-8      247891         189323         -23.63%
BenchmarkFragmentQueries/10000_queries_non-overlapping_aliases-8     7900911        1916028        -75.75%

benchmark                                                            old bytes       new bytes     delta
BenchmarkFragmentQueries/1_queries_non-overlapping_aliases-8         16300           14488         -11.12%
BenchmarkFragmentQueries/10_queries_non-overlapping_aliases-8        146670          126303        -13.89%
BenchmarkFragmentQueries/100_queries_non-overlapping_aliases-8       2322911         1674979       -27.89%
BenchmarkFragmentQueries/1000_queries_non-overlapping_aliases-8      169845173       10843020      -93.62%
BenchmarkFragmentQueries/10000_queries_non-overlapping_aliases-8     10961889744     115822224     -98.94%

.gitignore Outdated
@@ -1 +1,3 @@
/internal/tests/testdata/graphql-js

.idea/
Copy link
Member

@tonyghita tonyghita Jul 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it's specific to your developer environment. It'd probably be best to add in your global .gitignore.

@rahulpowar
Copy link
Author

Have zapped the gitignore as requested.

Copy link
Member

@tonyghita tonyghita left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a very reasonable change. Thank you for including the benchmark.

Would you be able to fix up the merge conflicts? I'm happy to merge the changes once this PR is good to go.

if o {
a = "non-overlapping"
}
b.Run(fmt.Sprintf("%d queries %s aliases", c, a), func(b *testing.B) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we'd want to b.ResetTimer() after setup? What do you think?

@@ -171,9 +171,10 @@ func validateSelectionSet(c *opContext, sels []query.Selection, t schema.NamedTy
validateSelection(c, sel, t)
}

useCache := len(sels) <= 100
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this threshold be configurable? How did we arrive at 100?

@@ -381,16 +382,21 @@ func detectFragmentCycleSel(c *context, sel query.Selection, fragVisited map[*qu
}
}

func (c *context) validateOverlap(a, b query.Selection, reasons *[]string, locs *[]errors.Location) {
func (c *context) validateOverlap(a, b query.Selection, reasons *[]string, locs *[]errors.Location, useCache bool) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a side comment on the internal function signatures, does not apply to these changes: the internal functions may have a nicer APIs if we took structs instead of ever-expanding lists of arguments.

Copy link
Member

@pavelnikolov pavelnikolov Mar 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @tonyghita and this is a refactoring that sooner or later has to be done would improve maintainability of the library. However, it will be a breaking change as well it will require a considerable amount of work.

@aaahrens
Copy link

Can someone merge this?

@pavelnikolov
Copy link
Member

There are merge conflicts to be fixed first...

graphql_test.go Outdated
@@ -8,6 +8,8 @@ import (
"github.com/neelance/graphql-go"
"github.com/neelance/graphql-go/example/starwars"
"github.com/neelance/graphql-go/gqltesting"
"fmt"
"bytes"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imports are in the wrong group.

@andreiavrammsd
Copy link

The situation occurs quickly with large requests size. I've sent a ~100KB query string and it rapidly leaked to more than 1GB of memory at the overlapValidated map.

c.overlapValidated[selectionPair{a, b}] = struct{}{}

@stefanvanburen
Copy link
Contributor

I'd be willing to put together a PR with the merge conflicts fixed here, as well as the benchmark fix. If we did this, how would we want to approach the threshold? Make it configurable? Or should we just allow 100 to be the default?

@pavelnikolov
Copy link
Member

Thank you, @svanburen! I would prefer if this is opt-in and the threshold is configurable.

@stefanvanburen
Copy link
Contributor

I've been playing around with this for a bit and have it mostly working - the only issue is that given the current implementation, if we make the threshold configurable and it's below a certain value, the validation tests will panic (a value of 1 or 2 seems to trigger this).

I can put up a PR with the failing tests, but any initial thoughts here? I'm wondering if we should ignore the configuration for the time being and just get in a merge-able PR?

# Conflicts:
#	graphql_test.go
@dmotylev
Copy link

There are merge conflicts to be fixed first...

Fixed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants