Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow queries in GraphQL after upgrading to v3.4.1 #11291

Closed
glesys-andreas opened this issue Dec 22, 2022 · 14 comments
Closed

Slow queries in GraphQL after upgrading to v3.4.1 #11291

glesys-andreas opened this issue Dec 22, 2022 · 14 comments
Assignees
Labels
status: accepted This issue has been accepted for implementation type: feature Introduction of new functionality to the application
Milestone

Comments

@glesys-andreas
Copy link

NetBox version

v3.4.1

Python version

3.9

Steps to Reproduce

import/create >= 800 vlans
[{"vid": "1", "name": "vlan_1", "status": "active"} ..... {"vid": "800", "name": "vlan_800", "status": "active"}]

run query:

query vlan {
	vlan_list {
		prefixes {
			prefix
			description
			tenant {
				name
			}
			vlan {
				group {
					name
				}
				vid
			}
		}
	}
}

Expected Behavior

Response in less than 10 seconds.

Observed Behavior

In version 3.3 this query took less than 10s, after upgrading to 3.4 this takes between 30-40 seconds.
This is reproduced at the demo site: https://demo.netbox.dev/ipam/vlans/

@glesys-andreas glesys-andreas added the type: bug A confirmed report of unexpected behavior in the application label Dec 22, 2022
@arthanson arthanson self-assigned this Dec 27, 2022
@arthanson
Copy link
Collaborator

This is more of a feature request than a bug, The issue is we aren't doing any prefetch_related or select_related on the GraphQL queries. I replicated the above and 2046 queries are done. I'm not certain why it got worse from 3.3 to 3.4 but the root of the issue is the DB calls.

The bigger issue with GraphQL is we don't know what will be included in the query, so you can either optimize for most common queries or dynamically optimize queries. There is a package https://github.com/tfoxy/graphene-django-optimizer to dynamically optimize queries, unfortunately it doesn't work with the latest Graphene.

@emil-nasso
Copy link

emil-nasso commented Dec 31, 2022

I think this is a standard problem in graphql APIs that is usually solved using batching/data loaders. Basically, it's an n+1 problem/lazy loading problem. There is one query to the database to fetch all prefixes and then when it is time to load the VLAN for a prefix, there is no batching implemented, so it performs lazy-loading/one query to the database per prefix.

The solution is to batch it. When trying to load VLANs, instead of directly loading them, their keys are "cached" and finally, a list of all prefixes that we need to load VLANs for is sent to the batch method and we can load them all at once, using one query to the database.

I have implemented this in several graphql APIs but I'm not "fluent in python or graphene". :)

It seems like graphene has a solution for it though: https://docs.graphene-python.org/en/latest/execution/dataloader

It seems like it could possibly be a solution for this issue to use that.

@arthanson arthanson assigned arthanson and unassigned arthanson Jan 3, 2023
@arthanson arthanson added this to the v3.5 milestone Jan 6, 2023
@neuro42
Copy link

neuro42 commented Jan 23, 2023

After upgrading to 3.4.3 from 3.3.6, we've also noticed our graphql query jump from ~1s to ~7s. Looking at the CPU usage with py-spy, it doesn't seem like the DB queries are being slow, but the django bits around them that are:

2.74s   _fetch_all (django/db/models/query.py:1867)
1.36s   _deepcopy_dict (copy.py:230)
1.38s   deepcopy (copy.py:146)
2.33s   filter_queryset (django_filters/filterset.py:213)
3.25s   __iter__ (django/db/models/query.py)

I haven't had a chance yet to rerun the query against 3.3.6 to see what it looked like there.

@jeremystretch jeremystretch removed this from the v3.5 milestone Feb 16, 2023
@v0tti
Copy link
Contributor

v0tti commented Feb 16, 2023

This issue is causing some trouble for us because most of our queries are taking so long we run into timeouts.

Example query we are using:

{
      device_list(has_primary_ip: true, status__n: "planned") {
        comments
        custom_fields
        device_role {
          name
        }
        id
        name
        platform {
          slug
        }
        primary_ip4 {
          address
          dns_name
        }
        primary_ip6 {
          address
          dns_name
        }
        site {
          slug
        }
        status
        interfaces {
          bridge {
            name
          }
          ip_addresses {
            address
          }
          lag {
            name
          }
          name
          mac_address
          mgmt_only
          parent {
            name
          }
          tagged_vlans {
            vid
          }
          type
          untagged_vlan {
            vid
          }
        }
      }
}

@v0tti
Copy link
Contributor

v0tti commented Feb 21, 2023

I have identified commit 99cf1b1 as responsible for the massive slowdown of the GraphQL queries.
For our worst query it slows down the execution time from about 20 seconds to over 2 minutes.

@v0tti
Copy link
Contributor

v0tti commented Feb 22, 2023

My fork reverts the commit on top of the current NetBox release. GraphQL is now as fast as prior to v3.4. As to be expected, filtering does not work on all levels, but we do not use that at the moment.

@florianschendel
Copy link

Hi guys,

I have a similar issue. Since the upgrade to 3.4.x the graphql query ends with a timeout

{
  prefix_list {
    description
    id
    last_updated
    site {
      name
    }
    vrf {
      name
    }
    tenant {
      name
    }
    role {
      name
    }
    tags {
      name
    }
    custom_fields
    prefix
    vlan {
      vid
      id
      name
      description
      status
      last_updated
    }
  }
}

Less complex queries still work without any issues.

Best regards
Florian

@arthanson
Copy link
Collaborator

Am going to take another look at this later this week, see if there are some tweaks that can be made. Thanks for the sample queries, will test with those.

@florianschendel
Copy link

Am going to take another look at this later this week, see if there are some tweaks that can be made. Thanks for the sample queries, will test with those.

thanks a lot!

@florianschendel
Copy link

I split the one script into three scripts as workaround

I created one for the region "emea", "apac" and "americas". We created super networks/summary networks with the VLAN ID 4094.

Actually we don´t need it in our graphql report, we need only VLAN/Prefixes that are really configured on our switches, routers ....

Maybe its also a workaround for some user people.

{
  prefix_list(vlan_vid__lt: 4094, region: "emea") {
    description
    id
    last_updated
    site {
      name
    }
    vrf {
      name
    }
    tenant {
      name
    }
    role {
      name
    }
    tags {
      name
    }
    custom_fields
    prefix
    vlan {
      vid
      id
      name
      description
      status
      last_updated
    }
  }
}

@arthanson
Copy link
Collaborator

Traced down the issue: https://github.com/netbox-community/netbox/blob/develop/netbox/netbox/filtersets.py#L249 This code is causing custom-fields to be queried even if they aren't in GraphQL request. You don't see this in REST API as all fields are returned anyways and there are prefetch_related calls in-place, but in GraphQL only a specific subset of fields are returned and there is currently no query optimization. This basically causes the query count to double from previous versions as filters are applied correctly now.

I made a bunch of VLANs with the query above, here are the query counts:
old code (before filtering at all levels): 985 queries
current code (with filtering at all levels): 1852 queries
with query optimizer: 84 queries

This will need to be fixed separately from a query optimizer, either the code referenced above needs to be changed so the custom fields aren't added to the query (preferred), or the query optimizer would need to detect the filter is a subclass of NetBoxModelFilterSet and auto add custom field prefetches.

@v0tti
Copy link
Contributor

v0tti commented Mar 10, 2023

@arthanson Thanks for your work. I just tested our use case with your branch and it is (at least) up to speed with previous versions.

@arthanson
Copy link
Collaborator

Added issue #11949 to track the custom-fields being auto added to queries.

@jeremystretch jeremystretch added this to the v3.5 milestone Mar 22, 2023
@jeremystretch jeremystretch added status: accepted This issue has been accepted for implementation type: feature Introduction of new functionality to the application and removed type: bug A confirmed report of unexpected behavior in the application labels Mar 22, 2023
@jeremystretch
Copy link
Member

PR #11943 has been merged into feature and will be included in the upcoming v3.5 release.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 22, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
status: accepted This issue has been accepted for implementation type: feature Introduction of new functionality to the application
Projects
None yet
Development

No branches or pull requests

7 participants