-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix duplicate results returned by is_accessible #1907
fix duplicate results returned by is_accessible #1907
Conversation
For some reason the queries are returning duplicates. It cannot be from the code that checks access through events, because even if the `accessible_projects_ids` has duplicates, the result returned should not have duplicates ``` >>> PublishedProject.objects.filter(id__in=[4,4,4]) <QuerySet [<PublishedProject: COVID-19 Epidemiology and Vaccination Dataset v1.0.0>]> ``` It looks like it might be because of union operation somehow. I am not sure what exactly, but i have hacky fix with .distinct() (Thanks to @ kshalot) On this commit, i added .distinct() to remove the duplicates so that method only returns unique projects/datasets
@amitupreti Here's what I think is happening. Snippets like these: physionet-build/physionet-django/project/managers/publishedproject.py Lines 22 to 24 in a886204
result in a LEFT OUTER JOIN inside the query. In this case what will happen exactly is:
LEFT OUTER JOIN "project_duasignature" ON ("project_publishedproject"."id" = "project_duasignature"."project_id") So we will join all the DUA signatures for the project. So the intermediate result will contain a row for each DUA signature that for a given project, e.g.:
Now the filtering happens. We specified:
Let's say our user has DUA signature with IDs 1 and 4, so we will end up with:
No duplicates, because the user can have only one signature for a given project. So the duplicates are implicitly removed. Now what happens when you add physionet-build/physionet-django/project/managers/publishedproject.py Lines 55 to 62 in a886204
is effectively we want to find the ids in the intermediate table with all the joined data (because it will use an OR operator in the query). So if accessible_projects_ids = [1] then every row containing 1 as project_id will be matched.
The same will happen for LEFT OUTER JOIN "project_dataaccessrequest" ON ("project_publishedproject"."id" = "project_dataaccessrequest"."project_id") and we will observe the same behavior. So slapping a |
On a different note - the events matching logic could be refactored a bit, e.g.: events_all = Event.objects.filter(Q(host=user) | Q(participants__user=user))
active_events = set(events_all.filter(end_date__gte=datetime.now()))
accessible_datasets = EventDataset.objects.filter(event__in=active_events, is_active=True)
accessible_projects_ids = []
for event_dataset in accessible_datasets:
if event_dataset.has_access(user):
accessible_projects_ids.append(event_dataset.dataset.id)
query |= Q(id__in=accessible_projects_ids) could become something like accessible_events = Event.objects.filter(Q(host=user) | Q(participants__user=user))
active_events = accessible_events.filter(end_date__gte=datetime.now())
accessible_datasets = EventDataset.objects.filter(event__in=active_events, is_active=True)
accessible_projects_ids = [event_dataset.dataset.pk for dataset in accessible_datasets if dataset.has_access(user)]
query |= Q(pk__in=accessible_projects_ids) or something like that. Changing the querysets to Python objects with Another thing is - we are duplicating a lot of checks with
If we check the implementation of
It's exactly the same. I get that we want to hide the implementation so changes to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using distinct
seems like the best way to fix this issue (see my comments).
We could use a second PR with a refactoring though (see my comments).
Thank you. Noted about not using python set for queries and using PK over ID I agree about the duplicate code being executed in I will open another PR with refactoring. |
For some reason the queries are returning duplicates, which causes us to see multiple projects listed on healthdatanexus.ai website
It cannot be from the code that checks access through events, because even if the list
accessible_projects_ids
has duplicates, the result returned should not have duplicates>>> PublishedProject.objects.filter(id__in=[4,4,4]) <QuerySet [<PublishedProject: COVID-19 Epidemiology and Vaccination Dataset v1.0.0>]>
It looks like it might be because of union operation somehow, although I am not sure why exactly, but i have hacky fix with .distinct() (Thanks to @kshalot)
On this commit, i added .distinct() to remove the duplicates so that method only returns unique projects/datasets