Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes #8715: eliminates duplicates when used in many-to-many field constraints #8793

Merged
merged 1 commit into from
Mar 7, 2022

Conversation

seros1521
Copy link
Contributor

@seros1521 seros1521 commented Mar 4, 2022

Fixes #8715

Some netbox users create object access rights using tags. Inside netbox, all object level permissions are implemented by the RestrictedQuerySet class:
https://github.com/netbox-community/netbox/blob/3436905744c93fec7ba59a8b7d72ef4102f82334/netbox/utilities/querysets.py
When using permissions that use tags, a user may receive multiple permissions of the same type if multiple tags are assigned to the device. This causes the RestrictedQuerySet class to generate a query similar to this:

>>> dcim.models.Device.objects.filter(Q(tags__name='tag1')|Q(tags__name='tag2'))
<ConfigContextModelQuerySet [<Device: device1>, <Device: device1>]>

This query returns the same object twice if both tags are assigned to it. This is due to the use of the django-taggit library. The library's documentation describes this behavior as expected and suggests using an explicit distinct() call in queries to avoid duplicates.
However, the use of DISTINCT in queries has a global side effect - deduplication of responses, which may or may not be acceptable behavior (depending on further use). Since it is not known how RestrictedQuerySet will be used in the rest of the code, it was decided to dedupe using a subquery.

@seros1521
Copy link
Contributor Author

I have 10,000 devices in my database. Tags tag1 and tag2 are assigned to 1 device. The tag_with_many_matches has been assigned to 1500 devices.
Comparison:

Simple request:

>>> print(dcim.models.Device.objects.filter(Q(tags__name='tag1')|Q(tags__name='tag2')).query)
SELECT ***all Device fields*** FROM "dcim_device" INNER JOIN "extras_taggeditem" ON ("dcim_device"."id" = "extras_taggeditem"."object_id" AND ("extras_taggeditem"."content_type_id" = 27)) INNER JOIN "extras_tag" ON ("extras_taggeditem"."tag_id" = "extras_tag"."id") WHERE ("extras_tag"."name" = tag1 OR "extras_tag"."name" = tag2) ORDER BY "dcim_device"."_name" ASC, "dcim_device"."id" ASC
>>>
>>> print(dcim.models.Device.objects.filter(Q(tags__name='tag1')|Q(tags__name='tag2')).explain())
Sort  (cost=1736.76..1736.83 rows=27 width=392)
  Sort Key: dcim_device._name, dcim_device.id
  ->  Nested Loop  (cost=331.31..1736.12 rows=27 width=392)
        ->  Nested Loop  (cost=331.02..1647.68 rows=141 width=4)
              ->  Seq Scan on extras_tag  (cost=0.00..13.61 rows=2 width=8)
                    Filter: (((name)::text = 'tag1'::text) OR ((name)::text = 'tag2'::text))
              ->  Bitmap Heap Scan on extras_taggeditem  (cost=331.02..815.39 rows=164 width=12)
                    Recheck Cond: ((tag_id = extras_tag.id) AND (content_type_id = 27))
                    ->  BitmapAnd  (cost=331.02..331.02 rows=164 width=0)
                          ->  Bitmap Index Scan on extras_taggeditem_tag_id_d48af7c7  (cost=0.00..55.04 rows=2483 width=0)
                                Index Cond: (tag_id = extras_tag.id)
                          ->  Bitmap Index Scan on extras_taggeditem_content_type_id_ba5562ed  (cost=0.00..272.62 rows=12293 width=0)
                                Index Cond: (content_type_id = 27)
        ->  Index Scan using dcim_device_pkey on dcim_device  (cost=0.29..0.63 rows=1 width=392)
              Index Cond: (id = extras_taggeditem.object_id)
>>>
>>> timeit.timeit(setup='import dcim; from django.db.models import Q', number=1000, stmt='list(dcim.models.Device.objects.filter(Q(tags__name="tag1")|Q(tags__name="tag2")))')
2.983009135350585
>>>
>>> timeit.timeit(setup='import dcim; from django.db.models import Q', number=50, stmt='list(dcim.models.Device.objects.filter(Q(tags__name="tag1")|Q(tags__name="tag_with_many_matches")))')
12.054297903552651

Query using DISTINCT:

>>> print(dcim.models.Device.objects.filter(Q(tags__name='tag1')|Q(tags__name='tag2')).distinct().query)
SELECT DISTINCT ***all Device fields*** FROM "dcim_device" INNER JOIN "extras_taggeditem" ON ("dcim_device"."id" = "extras_taggeditem"."object_id" AND ("extras_taggeditem"."content_type_id" = 27)) INNER JOIN "extras_tag" ON ("extras_taggeditem"."tag_id" = "extras_tag"."id") WHERE ("extras_tag"."name" = tag1 OR "extras_tag"."name" = tag2) ORDER BY "dcim_device"."_name" ASC, "dcim_device"."id" ASC
>>>
>>> print(dcim.models.Device.objects.filter(Q(tags__name='tag1')|Q(tags__name='tag2')).distinct().explain())
Unique  (cost=1736.76..1738.65 rows=27 width=392)
  ->  Sort  (cost=1736.76..1736.83 rows=27 width=392)
        Sort Key: dcim_device._name, dcim_device.id, dcim_device.created, dcim_device.last_updated, dcim_device.custom_field_data, dcim_device.local_context_data, dcim_device.device_type_id, dcim_device.device_role_id, dcim_device.tenant_id, dcim_device.platform_id, dcim_device.name, dcim_device.serial, dcim_device.asset_tag, dcim_device.site_id, dcim_device.location_id, dcim_device.rack_id, dcim_device."position", dcim_device.face, dcim_device.status, dcim_device.airflow, dcim_device.primary_ip4_id, dcim_device.primary_ip6_id, dcim_device.cluster_id, dcim_device.virtual_chassis_id, dcim_device.vc_position, dcim_device.vc_priority, dcim_device.comments
        ->  Nested Loop  (cost=331.31..1736.12 rows=27 width=392)
              ->  Nested Loop  (cost=331.02..1647.68 rows=141 width=4)
                    ->  Seq Scan on extras_tag  (cost=0.00..13.61 rows=2 width=8)
                          Filter: (((name)::text = 'tag1'::text) OR ((name)::text = 'tag2'::text))
                    ->  Bitmap Heap Scan on extras_taggeditem  (cost=331.02..815.39 rows=164 width=12)
                          Recheck Cond: ((tag_id = extras_tag.id) AND (content_type_id = 27))
                          ->  BitmapAnd  (cost=331.02..331.02 rows=164 width=0)
                                ->  Bitmap Index Scan on extras_taggeditem_tag_id_d48af7c7  (cost=0.00..55.04 rows=2483 width=0)
                                      Index Cond: (tag_id = extras_tag.id)
                                ->  Bitmap Index Scan on extras_taggeditem_content_type_id_ba5562ed  (cost=0.00..272.62 rows=12293 width=0)
                                      Index Cond: (content_type_id = 27)
              ->  Index Scan using dcim_device_pkey on dcim_device  (cost=0.29..0.63 rows=1 width=392)
                    Index Cond: (id = extras_taggeditem.object_id)
>>>
>>>
>>> timeit.timeit(setup='import dcim; from django.db.models import Q', number=1000, stmt='list(dcim.models.Device.objects.filter(Q(tags__name="tag1")|Q(tags__name="tag2")).distinct())')
3.084764424711466
>>>
>>> timeit.timeit(setup='import dcim; from django.db.models import Q', number=50, stmt='list(dcim.models.Device.objects.filter(Q(tags__name="tag1")|Q(tags__name="tag_with_many_matches")).distinct())')
12.156185979023576

Query with subquery:

>>> allowed_objects=dcim.models.Device.objects.filter(Q(tags__name='tag1')|Q(tags__name='tag2'))
>>> dcim.models.Device.objects.filter(pk__in=allowed_objects)
<ConfigContextModelQuerySet [<Device: device1>]>
>>>
>>> print(dcim.models.Device.objects.filter(pk__in=allowed_objects).query)
SELECT ***all Device fields*** FROM "dcim_device" WHERE "dcim_device"."id" IN (SELECT U0."id" FROM "dcim_device" U0 INNER JOIN "extras_taggeditem" U1 ON (U0."id" = U1."object_id" AND (U1."content_type_id" = 27)) INNER JOIN "extras_tag" U2 ON (U1."tag_id" = U2."id") WHERE (U2."name" = tag1 OR U2."name" = tag2)) ORDER BY "dcim_device"."_name" ASC, "dcim_device"."id" ASC
>>>
>>> print(dcim.models.Device.objects.filter(pk__in=allowed_objects).explain())
Sort  (cost=1713.31..1713.38 rows=27 width=392)
  Sort Key: dcim_device._name, dcim_device.id
  ->  Nested Loop  (cost=1693.90..1712.67 rows=27 width=392)
        ->  HashAggregate  (cost=1693.61..1693.88 rows=27 width=12)
              Group Key: u0.id
              ->  Nested Loop  (cost=331.31..1693.54 rows=27 width=12)
                    ->  Nested Loop  (cost=331.02..1647.68 rows=141 width=4)
                          ->  Seq Scan on extras_tag u2  (cost=0.00..13.61 rows=2 width=8)
                                Filter: (((name)::text = 'tag1'::text) OR ((name)::text = 'tag2'::text))
                          ->  Bitmap Heap Scan on extras_taggeditem u1  (cost=331.02..815.39 rows=164 width=12)
                                Recheck Cond: ((tag_id = u2.id) AND (content_type_id = 27))
                                ->  BitmapAnd  (cost=331.02..331.02 rows=164 width=0)
                                      ->  Bitmap Index Scan on extras_taggeditem_tag_id_d48af7c7  (cost=0.00..55.04 rows=2483 width=0)
                                            Index Cond: (tag_id = u2.id)
                                      ->  Bitmap Index Scan on extras_taggeditem_content_type_id_ba5562ed  (cost=0.00..272.62 rows=12293 width=0)
                                            Index Cond: (content_type_id = 27)
                    ->  Index Only Scan using dcim_device_pkey on dcim_device u0  (cost=0.29..0.33 rows=1 width=8)
                          Index Cond: (id = u1.object_id)
        ->  Index Scan using dcim_device_pkey on dcim_device  (cost=0.29..0.70 rows=1 width=392)
              Index Cond: (id = u0.id)
>>>
>>> timeit.timeit(setup='import dcim; from django.db.models import Q', number=1000, stmt='list(dcim.models.Device.objects.filter(pk__in=dcim.models.Device.objects.filter(Q(tags__name="tag1")|Q(tags__name="tag2"))))')
3.58037918061018
>>>
>>> timeit.timeit(setup='import dcim; from django.db.models import Q', number=50, stmt='list(dcim.models.Device.objects.filter(pk__in=dcim.models.Device.objects.filter(Q(tags__name="tag1")|Q(tags__name="tag_with_many_matches"))))')
12.276001764461398

@seros1521 seros1521 changed the title Fixes #8715 Fixes #8715: eliminates duplicates when used in many-to-many field constraints Mar 4, 2022
@seros1521
Copy link
Contributor Author

Possibly also fixes #8351

@seros1521 seros1521 changed the title Fixes #8715: eliminates duplicates when used in many-to-many field constraints WIP: Fixes #8715: eliminates duplicates when used in many-to-many field constraints Mar 4, 2022
…nstraints

When using permissions that use tags, a user may receive multiple permissions
of the same type if multiple tags are assigned to the device. This causes the
RestrictedQuerySet class to generate a query similar to this:

>>> dcim.models.Device.objects.filter(Q(tags__name='tag1')|Q(tags__name='tag2'))
<ConfigContextModelQuerySet [<Device: device1>, <Device: device1>]>

This query returns the same object twice if both tags are assigned to it. This
is due to the use of the django-taggit library. The library's documentation
describes this behavior as expected and suggests using an explicit distinct()
call in queries to avoid duplicates.

However, the use of DISTINCT in queries has a global side effect -
deduplication of responses, which may or may not be acceptable behavior
(depending on further use). Since it is not known how RestrictedQuerySet will
be used in the rest of the code, it was decided to dedupe using a subquery.
@seros1521 seros1521 changed the title WIP: Fixes #8715: eliminates duplicates when used in many-to-many field constraints Fixes #8715: eliminates duplicates when used in many-to-many field constraints Mar 4, 2022
Comment on lines +46 to +47
allowed_objects = self.model.objects.filter(attrs)
attrs = Q(pk__in=allowed_objects)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach can be optimized by retrieving only the IDs list (rather than a full queryset) in the first query.

Suggested change
allowed_objects = self.model.objects.filter(attrs)
attrs = Q(pk__in=allowed_objects)
allowed_object_ids = self.model.objects.filter(attrs).values_list('pk', flat=True)
attrs = Q(pk__in=allowed_object_ids)

@jeremystretch
Copy link
Member

Thanks for digging into this @seros1521! I'm having a hard time thinking of a scenario where we wouldn't want to call DISTINCT, but I appreciate the consideration.

I'm curious to see whether the change I recommended above reduces the subquery approach substantially. Would you mind testing it and sharing the results?

@seros1521
Copy link
Contributor Author

This change does not change the resulting query. We don't fetch the result of the subquery directly, but use it in another query. django-orm is smart enough to use only the primary keys of the objects in the subquery.
In any case, a query like this is generated:

SELECT
	***all Device fields***
FROM
	"dcim_device"
WHERE
	"dcim_device"."id" IN (
		SELECT
			U0."id"
		FROM
			"dcim_device" U0
		INNER JOIN
			"extras_taggeditem" U1
			ON (
				U0."id" = U1."object_id"
				AND (U1."content_type_id" = 27)
			)
		INNER JOIN
			"extras_tag" U2
			ON (
				U1."tag_id" = U2."id"
			)
		WHERE (
			U2."name" = tag1
			OR U2."name" = tag2
		)
	)
ORDER BY
	"dcim_device"."_name" ASC,
	"dcim_device"."id" ASC

@jeremystretch
Copy link
Member

django-orm is smart enough to use only the primary keys of the objects in the subquery.

Interesting, I didn't think it looked that far into the query. Good to know!

I'm slightly concerned about the modest performance penalty, though it seems unlikely to have a noticeable impact. In the worst case, we might need to revert the change and try using DISTINCT instead.

Thanks for your work on this!

@jeremystretch jeremystretch merged commit bffe63a into netbox-community:develop Mar 7, 2022
@seros1521 seros1521 deleted the fix_8715 branch May 3, 2022 11:12
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 18, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Devices > Edit: (get() returned more than one Device) when using multi custom permissions by tags
2 participants