fix duplicate results returned by is_accessible #1907

superryeti · 2023-03-02T15:35:47Z

For some reason the queries are returning duplicates, which causes us to see multiple projects listed on healthdatanexus.ai website

It cannot be from the code that checks access through events, because even if the list accessible_projects_ids has duplicates, the result returned should not have duplicates
>>> PublishedProject.objects.filter(id__in=[4,4,4]) <QuerySet [<PublishedProject: COVID-19 Epidemiology and Vaccination Dataset v1.0.0>]>

It looks like it might be because of union operation somehow, although I am not sure why exactly, but i have hacky fix with .distinct() (Thanks to @kshalot)

On this commit, i added .distinct() to remove the duplicates so that method only returns unique projects/datasets

For some reason the queries are returning duplicates. It cannot be from the code that checks access through events, because even if the `accessible_projects_ids` has duplicates, the result returned should not have duplicates ``` >>> PublishedProject.objects.filter(id__in=[4,4,4]) <QuerySet [<PublishedProject: COVID-19 Epidemiology and Vaccination Dataset v1.0.0>]> ``` It looks like it might be because of union operation somehow. I am not sure what exactly, but i have hacky fix with .distinct() (Thanks to @ kshalot) On this commit, i added .distinct() to remove the duplicates so that method only returns unique projects/datasets

kshalot · 2023-03-06T17:35:13Z

@amitupreti Here's what I think is happening.

Snippets like these:

physionet-build/physionet-django/project/managers/publishedproject.py

Lines 22 to 24 in a886204

    
           query |= Q(access_policy=AccessPolicy.RESTRICTED) & Q( 
        
               duasignature__in=dua_signatures 
        
           )

result in a LEFT OUTER JOIN inside the query. In this case what will happen exactly is:

LEFT OUTER JOIN "project_duasignature" ON ("project_publishedproject"."id" = "project_duasignature"."project_id")

So we will join all the DUA signatures for the project. So the intermediate result will contain a row for each DUA signature that for a given project, e.g.:

project_id | dua_signature_id
-----------------------------
1          | 1
1          | 2
1          | 3
2          | 4

Now the filtering happens. We specified:

physionet-build/physionet-django/project/managers/publishedproject.py

Line 19 in a886204

dua_signatures = DUASignature.objects.filter(user=user)

Let's say our user has DUA signature with IDs 1 and 4, so we will end up with:

project_id | dua_signature_id
-----------------------------
1          | 1
2          | 4

No duplicates, because the user can have only one signature for a given project. So the duplicates are implicitly removed.

Now what happens when you add

physionet-build/physionet-django/project/managers/publishedproject.py

Lines 55 to 62 in a886204

    
           events_all = Event.objects.filter(Q(host=user) | Q(participants__user=user)) 
        
           active_events = set(events_all.filter(end_date__gte=datetime.now())) 
        
           accessible_datasets = EventDataset.objects.filter(event__in=active_events, is_active=True) 
        
           accessible_projects_ids = [] 
        
           for event_dataset in accessible_datasets: 
        
               if event_dataset.has_access(user): 
        
                   accessible_projects_ids.append(event_dataset.dataset.id) 
        
           query |= Q(id__in=accessible_projects_ids)

is effectively we want to find the ids in the intermediate table with all the joined data (because it will use an OR operator in the query). So if accessible_projects_ids = [1] then every row containing 1 as project_id will be matched.

The same will happen for DataAccessRequests - our query contains:

LEFT OUTER JOIN "project_dataaccessrequest" ON ("project_publishedproject"."id" = "project_dataaccessrequest"."project_id")

and we will observe the same behavior.

So slapping a distinct in there might be a good solution for now.

kshalot · 2023-03-06T17:45:08Z

On a different note - the events matching logic could be refactored a bit, e.g.:

events_all = Event.objects.filter(Q(host=user) | Q(participants__user=user))
active_events = set(events_all.filter(end_date__gte=datetime.now()))
accessible_datasets = EventDataset.objects.filter(event__in=active_events, is_active=True)
accessible_projects_ids = []
for event_dataset in accessible_datasets:
    if event_dataset.has_access(user):
            accessible_projects_ids.append(event_dataset.dataset.id)
query |= Q(id__in=accessible_projects_ids)

could become something like

accessible_events = Event.objects.filter(Q(host=user) | Q(participants__user=user))
active_events = accessible_events.filter(end_date__gte=datetime.now())

accessible_datasets = EventDataset.objects.filter(event__in=active_events, is_active=True)
accessible_projects_ids = [event_dataset.dataset.pk for dataset in accessible_datasets if dataset.has_access(user)]
query |= Q(pk__in=accessible_projects_ids)

or something like that. Changing the querysets to Python objects with set is not good - this will essentially produce less efficient queries and load more stuff into memory. We should also always use pk and not id

Another thing is - we are duplicating a lot of checks with dataset.has_access(user). We already checked that:

The event is active
The user is part of the event
The dataset is active

If we check the implementation of has_access it checks that:

The event is active
The user is part of the event
The dataset is active

It's exactly the same. I get that we want to hide the implementation so changes to has_access don't imply changes to the manager, but it clearly has a bit of duplication. Maybe we should have a EventDataset manager like we do with PublishedProject? It's the same issue.

kshalot

Using distinct seems like the best way to fix this issue (see my comments).

We could use a second PR with a refactoring though (see my comments).

superryeti · 2023-03-06T19:03:59Z

It's exactly the same. I get that we want to hide the implementation so changes to has_access don't imply changes to the manager, but it clearly has a bit of duplication. Maybe we should have a EventDataset manager like we do with PublishedProject? It's the same issue.

Thank you. Noted about not using python set for queries and using PK over ID

I agree about the duplicate code being executed in has_access. and an EventDatasetManager to filter out accessible dataset

I will open another PR with refactoring.

superryeti marked this pull request as draft March 2, 2023 15:46

kshalot approved these changes Mar 6, 2023

View reviewed changes

superryeti marked this pull request as ready for review March 6, 2023 19:04

superryeti changed the title ~~[WIP] fix duplicate results returned by is_accessible~~ fix duplicate results returned by is_accessible Mar 6, 2023

superryeti mentioned this pull request Mar 7, 2023

[WIP] refactor access to dataset through events #1918

Draft

tompollard merged commit 3fd01a8 into MIT-LCP:dev Mar 7, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix duplicate results returned by is_accessible #1907

fix duplicate results returned by is_accessible #1907

superryeti commented Mar 2, 2023 •

edited

Loading

kshalot commented Mar 6, 2023

kshalot commented Mar 6, 2023 •

edited

Loading

kshalot left a comment

superryeti commented Mar 6, 2023

fix duplicate results returned by is_accessible #1907

fix duplicate results returned by is_accessible #1907

Conversation

superryeti commented Mar 2, 2023 • edited Loading

kshalot commented Mar 6, 2023

kshalot commented Mar 6, 2023 • edited Loading

kshalot left a comment

Choose a reason for hiding this comment

superryeti commented Mar 6, 2023

superryeti commented Mar 2, 2023 •

edited

Loading

kshalot commented Mar 6, 2023 •

edited

Loading