Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix prefetch_related updates. #4553

Merged
merged 1 commit into from
Oct 11, 2016
Merged

Fix prefetch_related updates. #4553

merged 1 commit into from
Oct 11, 2016

Conversation

tomchristie
Copy link
Member

Closes #2442.

@tomchristie tomchristie added this to the 3.5.0 Release milestone Oct 11, 2016
@tomchristie tomchristie merged commit d0b3b64 into master Oct 11, 2016
@tomchristie tomchristie deleted the fix-prefetch-related-bug branch October 11, 2016 10:07
@spectras
Copy link

spectras commented Nov 4, 2016

Although this does fix the bug where people update an object and forget to reload it, it can have a big performance impact when unneeded. I just noticed this change because many PUT operation on my objects suddenly got bumped from 9 to 14 requests, from 8 to 12, etc. That's quite a performance hit.

The point being: the object is loaded with multiple prefetch_related clauses, but they are either not changed, or not reflected in the serialized form, or the code updating them takes care of keeping the prefetch cache updated manually.
For instance, complex user and group permissions are loaded, used to determine whether current user can modify the object (and which fields he can modify), then discarded.

I would believe that code modifying data is reponsible for reloading it. Plus, that's consistent with Django form views. But at least, is it possible to opt out of this mechanism without making my own UpdateMixin that I use instead of DRF's provided one?

(on a more sentimental issue, I'm feeling after carefully managing my reloads, I'm taking the performance hit because some other users were too lazy to do it themselves)

@tomchristie
Copy link
Member Author

I'm not exactly clear what "carefully managing my reloads" means here - can you give an example of how you're managing this in your codebase so that this hasn't been an issue for you.

We do need to prioritize not breaking user code more highly than making user code more performant, but happy to consider tweaks if a good case can be made.

@spectras
Copy link

spectras commented Nov 7, 2016

Well, for every prefetch_related added when updating an instance there are 4 typical use cases in my codebase:

  • It's used read-only. For instance, permissions. Nothing special needs to be done as I know for sure the update will not affect the permissions.

  • It's used read-write, but only for modifying related instances, knowing none will be added or removed. Just be careful to use the instances from the prefetch cache for updating, and they will be up-to-date.

  • The whole set of related objects is overwritten. In this case, I just replace the prefetch cache with the new set. For instance:

    # pictures comes from a PrimaryKeyRelatedField with many=True
    # <here, I update picture attachments in the database>, then:
    instance._prefetched_objects_cache[instance.pictures.prefetch_cache_name] = pictures

    (Actually, I put this in a small utility function so I only have one place to update if the prefetch_related() internals change).

  • Something too complex to easily maintain the prefetch cache manually. In that case, I finish my perform_update with the single line:

    serializer.instance = self.get_object()

    That is roughly equivalent to what was introduced in this pull request. But in my codebase, this is used only as a last resort. And it's only rarely actually needed. And when it is, it's because complex changes were done which already triggered many queries, making the reload relatively cheap (4 additional queries on top of 40 don't bother me as much as 4 on top of 6).


By the way, Django's generic form views do not reload the object upon update. It's unlikely to bite often because it sends a redirect, but if your `get_absolute_url()` does use some of the prefetched objects to build the url, user code is responsible for reloading the object, either in his `MyView.form_valid()` method or his `MyForm.save()` method. Otherwise, it will still see the old values.

The fact that prefetched data is not updated on saves is basic Django knowledge that any Django developer should know.

instance = Parent.objects.prefetch_related('children').get(pk=42)
Child.objects.filter(parent__pk=42).delete()

instance.children.all()    # still shows all children despite having deleted them all

And tickets on that subject on Django tracker are systematically closed as Invalid with a core developer saying it is by design (example) I'm not sure DRF working around this is really helpful in the long run. Rather, I would believe making it inconsistent with Django just adds confusion… and useless requests in all cases where reloads are not needed.

@tomchristie
Copy link
Member Author

The change here resolves an easily reproducible bug, and is only applied to prefetch_related cases. Can't see us changing that. I'd suggest a view mixin if you don't want the behavior applied.

The ticket linked is closed as invalid because there's no desire to attempt to resolve this at the ORM level. That doesn't mean that we can't or shouldn't resolve it at the generic view level - it's not the same consideration.

@spectras
Copy link

spectras commented Nov 7, 2016

Perhaps the specific ticket was not the best example, I just took the first that showed up in google.

The point being: prefetched data is not refreshed when saving an instance. This is expected behavior for all Django APIs, be it ORM, forms or views, and I would have expected DRF to behave the same.
Also, why address data pre-loaded through prefetch_related but not data pre-loaded through select_related? Or models loaded in a separate query? Or additional context loaded by third-party modules? None of those will be updated automatically.

It seems odd that if I do prefetch_related('parent'), DRF will reload the object, but if I do select_related('parent') it will not.

It seems odd that adding a prefetch_related() clause to my querysets, with the intent of reducing the number of queries, actually increases that number because it has the unpredictable effect of triggering a reload by DRF.

It seems odd also that saving the object might succeed - potentially generating side-effects such as logging an audit entry, then raise a DoesNotExist exception if the object gets deleted in between the save and the reload. (*)

But alright, mixin it will be for sure. I don't want to have to track the exact queryset methods invoked to guess whether DRF will reload the object or not. Explicit is better than implicit.

__
(*) actually it's even worse: if I update the model in such a way it's now filtered out by the queryset, the reload will always fail. Consider:

# This viewset works as expected:
class DraftViewSet(viewsets.ModelViewSet):
    queryset = Document.objects.filter(status=Document.DRAFT)
    serializer_class = DocumentSerializer

# This viewset will *always* return a 404 when changing the document status
# Actual database outcome depends on Django settings:
#    -> if ATOMIC_REQUESTS is enabled, the model update is cancelled
#    -> if ATOMIC_REQUESTS is disabled (default), the model remains updated despite the 404
class DraftViewSet(viewsets.ModelViewSet):
    queryset = Document.objects.filter(status=Document.DRAFT).prefetch_related('attachments')
    serializer_class = DocumentSerializer

Only difference comes from adding the prefetch_related. Good luck tracking the root cause as a user.

@tomchristie
Copy link
Member Author

tomchristie commented Nov 7, 2016

but not data pre-loaded through select_related

I've currently no reason to believe we have a behavioral bug there.

Or additional context loaded by third-party modules?

I'm not aware of any public API for doing so (other than ad-hoc assignments to the instance or queryset, which we shouldn't support), or have any reason to believe we have a bug there.

It seems odd that adding a prefetch_related() clause to my querysets, with the intent of reducing the number of queries, actually increases that number because it has the unpredictable effect of triggering a reload by DRF.

So don't perform prefetch_related in update cases (modify get_queryset to avoid it) - you'd end up with incorrect results if we didn't apply this fix, so you're in a better position anyway (It's just that now the equation is: "don't use prefetch_related on updates in order to avoid a DB reload", rather than "don't use prefetch_related on updates in order to avoid incorrect responses")

@tomchristie
Copy link
Member Author

tomchristie commented Nov 7, 2016

It seems odd that adding a prefetch_related() clause to my querysets

Sure, but it's better than returning incorrect results.

It seems odd also that saving the object might succeed - potentially generating side-effects such as logging an audit entry, then raise a DoesNotExist exception if the object gets deleted in between the save and the reload.

If you think that we've behavior here that can be improved, then sure, let's discuss that. Note that there are already other similar cases, eg. application level validation might pass but database level validation fail in some race conditions.

if I update the model in such a way it's now filtered out by the queryset, the reload will always fail.

Grand, so let's get together a failing test case for that and then use instance.refresh_from_db() instead in our implementation. (Started along that road at #4661)

(Even better, consider if we can alter the code here so that we only invalidate the prefetch_related cache, without having to reload anything else - demonstrate that gives the correct behavior and we can happily take that path.)

@tomchristie
Copy link
Member Author

Interested to know if #4668 addresses your issue sufficiently or not. Certainly a better approach.

@spectras
Copy link

Although it does leave open the case where prefetched data is not updated and is used in the representation, it does fix the odd 404 error.

I will still use a mixin, because I manually keep the prefetch cache up-to-date when I update a full set of objects:

instance._prefetched_objects_cache[instance.pictures.prefetch_cache_name] = pictures

But with the new approach:

  1. this should trigger much less queries, only prefetched data used in serialization will be actually reloaded.
  2. with this, maybe 1/3 of my viewsets will need the mixin, as opposed to all of them using the old approach.

That gives me an idea, posting on the other issue in 5 minutes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants