Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix duplicate results returned by is_accessible #1907

Conversation

superryeti
Copy link
Contributor

@superryeti superryeti commented Mar 2, 2023

For some reason the queries are returning duplicates, which causes us to see multiple projects listed on healthdatanexus.ai website

Duplicate projects

It cannot be from the code that checks access through events, because even if the list accessible_projects_ids has duplicates, the result returned should not have duplicates
>>> PublishedProject.objects.filter(id__in=[4,4,4]) <QuerySet [<PublishedProject: COVID-19 Epidemiology and Vaccination Dataset v1.0.0>]>

It looks like it might be because of union operation somehow, although I am not sure why exactly, but i have hacky fix with .distinct() (Thanks to @kshalot)

On this commit, i added .distinct() to remove the duplicates so that method only returns unique projects/datasets

    For some reason the queries are returning duplicates. It cannot be from the code that checks access through events, because even if the `accessible_projects_ids` has duplicates, the result returned should not have duplicates
    ```
    >>> PublishedProject.objects.filter(id__in=[4,4,4])
        <QuerySet [<PublishedProject: COVID-19 Epidemiology and Vaccination Dataset v1.0.0>]>
    ```

    It looks like it might be because of union operation somehow. I am not sure what exactly, but i have hacky fix with .distinct() (Thanks to @ kshalot)

    On this commit, i added .distinct() to remove the duplicates so that method only returns unique projects/datasets
@superryeti superryeti marked this pull request as draft March 2, 2023 15:46
@kshalot
Copy link
Member

kshalot commented Mar 6, 2023

@amitupreti Here's what I think is happening.

Snippets like these:

query |= Q(access_policy=AccessPolicy.RESTRICTED) & Q(
duasignature__in=dua_signatures
)

result in a LEFT OUTER JOIN inside the query. In this case what will happen exactly is:

LEFT OUTER JOIN "project_duasignature" ON ("project_publishedproject"."id" = "project_duasignature"."project_id")

So we will join all the DUA signatures for the project. So the intermediate result will contain a row for each DUA signature that for a given project, e.g.:

project_id | dua_signature_id
-----------------------------
1          | 1
1          | 2
1          | 3
2          | 4

Now the filtering happens. We specified:

dua_signatures = DUASignature.objects.filter(user=user)

Let's say our user has DUA signature with IDs 1 and 4, so we will end up with:

project_id | dua_signature_id
-----------------------------
1          | 1
2          | 4

No duplicates, because the user can have only one signature for a given project. So the duplicates are implicitly removed.

Now what happens when you add

events_all = Event.objects.filter(Q(host=user) | Q(participants__user=user))
active_events = set(events_all.filter(end_date__gte=datetime.now()))
accessible_datasets = EventDataset.objects.filter(event__in=active_events, is_active=True)
accessible_projects_ids = []
for event_dataset in accessible_datasets:
if event_dataset.has_access(user):
accessible_projects_ids.append(event_dataset.dataset.id)
query |= Q(id__in=accessible_projects_ids)

is effectively we want to find the ids in the intermediate table with all the joined data (because it will use an OR operator in the query). So if accessible_projects_ids = [1] then every row containing 1 as project_id will be matched.

The same will happen for DataAccessRequests - our query contains:

LEFT OUTER JOIN "project_dataaccessrequest" ON ("project_publishedproject"."id" = "project_dataaccessrequest"."project_id")

and we will observe the same behavior.

So slapping a distinct in there might be a good solution for now.

@kshalot
Copy link
Member

kshalot commented Mar 6, 2023

On a different note - the events matching logic could be refactored a bit, e.g.:

events_all = Event.objects.filter(Q(host=user) | Q(participants__user=user))
active_events = set(events_all.filter(end_date__gte=datetime.now()))
accessible_datasets = EventDataset.objects.filter(event__in=active_events, is_active=True)
accessible_projects_ids = []
for event_dataset in accessible_datasets:
    if event_dataset.has_access(user):
            accessible_projects_ids.append(event_dataset.dataset.id)
query |= Q(id__in=accessible_projects_ids)

could become something like

accessible_events = Event.objects.filter(Q(host=user) | Q(participants__user=user))
active_events = accessible_events.filter(end_date__gte=datetime.now())

accessible_datasets = EventDataset.objects.filter(event__in=active_events, is_active=True)
accessible_projects_ids = [event_dataset.dataset.pk for dataset in accessible_datasets if dataset.has_access(user)]
query |= Q(pk__in=accessible_projects_ids)

or something like that. Changing the querysets to Python objects with set is not good - this will essentially produce less efficient queries and load more stuff into memory. We should also always use pk and not id

Another thing is - we are duplicating a lot of checks with dataset.has_access(user). We already checked that:

  • The event is active
  • The user is part of the event
  • The dataset is active

If we check the implementation of has_access it checks that:

  • The event is active
  • The user is part of the event
  • The dataset is active

It's exactly the same. I get that we want to hide the implementation so changes to has_access don't imply changes to the manager, but it clearly has a bit of duplication. Maybe we should have a EventDataset manager like we do with PublishedProject? It's the same issue.

Copy link
Member

@kshalot kshalot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using distinct seems like the best way to fix this issue (see my comments).

We could use a second PR with a refactoring though (see my comments).

@superryeti
Copy link
Contributor Author

It's exactly the same. I get that we want to hide the implementation so changes to has_access don't imply changes to the manager, but it clearly has a bit of duplication. Maybe we should have a EventDataset manager like we do with PublishedProject? It's the same issue.

Thank you. Noted about not using python set for queries and using PK over ID

I agree about the duplicate code being executed in has_access. and an EventDatasetManager to filter out accessible dataset

I will open another PR with refactoring.

@superryeti superryeti marked this pull request as ready for review March 6, 2023 19:04
@superryeti superryeti changed the title [WIP] fix duplicate results returned by is_accessible fix duplicate results returned by is_accessible Mar 6, 2023
@tompollard tompollard merged commit 3fd01a8 into MIT-LCP:dev Mar 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants