Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add getReadableIds and use in filterReadAccess #277 #278

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

patrick-austin
Copy link
Contributor

Changes

Added getReadableIds function. The logic here is the same as in getReadable, however we accept and return lists of IDs rather than entities. When dealing with the free text search results, we only have IDs. Before we were finding the bean for each ID in turn in order to run authorisation on it, then proceeding to return the ID. But as we already have the klass of the entitity and the ID from Lucene we can simply operate on these instead.

Instead of hardcoding the number of results to get from Lucene at 1000, instead use the maxIdsInQuery setting. This is the limit we would send in one batch to the DB in getReadable/getReadableIds anyway, and the default value of 500 and max for Oracle of 1000 is the same order of magnitude as what we were currently using.

Impact

I ran a number of searches against my local development components, using the lils.yml as my dummy data. This contains:

Investigations Datasets Datafiles
100 300 29967

I searched for * (returning all entities before authorising) as both root and the non-root icatuser, representing best and worst case scenarios (all data visible, no data visible and needing to check DB for rules) for the old and new setup with maxIdsInQuery set to 500 and 1000. All times are in ms, and is measured between the first EntityBeanManager - Got X results from Lucene and EntityBeanManager - Returning Y results logs made by icat.server.

root

Setup Investigations Datasets Datafiles
Old 91 205 212
New (500) 15 11 12
New (1000) 13 12 11

Old method is comparable for DS and DF since we only loop until we reach 300 accepted results, then return.
New only submits one batch of entities to check, and returns in consistently ~12ms.

non-root

Setup Investigations Datasets Datafiles
Old 215 337 29261
New (500) 21 14 1502
New (1000) 15 14 690

Old method takes around 1 to 2 ms per entity checked as we need to check every ID individually before being sure none are authorised.
New method takes roughly the same amount of time for Investigations and DS as these both require a single batch of IDs (less than the limit in settings. Datafiles takes around half the time when the limit is doubled to 1000, as we only need to send half as many batches (averaging 24ms per batch).

Also searched for a specific single result using id:... in both old and new setup as root:

Setup Investigations Datasets Datafiles
Old 17 14 16
New 11 15 14

Times are comparable with old and new approaches. For the new method, the time taken to authorise 1 result and hundreds (as in the * example) was also comparable.

It's worth noting that in these tests I didn't actually have any rules in the DB to evaluate.

@patrick-austin
Copy link
Contributor Author

Following discussion with @kevinphippsstfc, have:

  • Added explicit config option lucene.searchBlockSize instead of piggy backing maxIdsInQuery
  • Refactored duplicated code in GateKeeper
  • Added HasEntityId interface so that the getReadableIds function can run on either EntityBaseBean or ScoredEntityBaseBean without needing to loop to get their id
  • Changed logic in filterReadAccess so we add all newResults if we have no restrictions, and only iterate and check against the id set if restrictions were present. Should also now throw if we're over maxEntities.

This shouldn't change the performance of existing usage of getReadable too much, as we do not introduce any additional loops, only refactoring some code and adding the if (restrictions == null) check to enable early returns of all beans. Further improvements or changes for clarity are welcome.

Copy link
Contributor

@kevinphippsstfc kevinphippsstfc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this looks great, thanks! I've just requested a few additions to the comments to hopefully help future developers.

I'm wondering how best to test this given that I don't think the part of the code using "restrictions" will get tested much (if at all?) by the unit and integration tests. We could do with testing it on an ICAT containing some of the usual InvestigationUser and InstrumentScientist rules/restrictions.

@patrick-austin
Copy link
Contributor Author

Overall, this looks great, thanks! I've just requested a few additions to the comments to hopefully help future developers.

I'm wondering how best to test this given that I don't think the part of the code using "restrictions" will get tested much (if at all?) by the unit and integration tests. We could do with testing it on an ICAT containing some of the usual InvestigationUser and InstrumentScientist rules/restrictions.

Have added/expanded the documentation of the new/modified functions, if anything still needs clarifying just let me know.

@patrick-austin
Copy link
Contributor Author

Have expanded TestRS to ensure it covers these changes. These changes will impact:

  • All Lucene searches
  • Get and Search requests
  • MD exports

Note that the last two are only affected if an INCLUDE query for a collection is used. For example, Facility INCLUDE InvestigationType will run the changed code as there are multiple types at a facility but Investigation INCLUDE InvestigationType will not, as there is only one per investigation.

Some implicit testing was taking place, as if the authorization were overly restrictive (i.e. if getReadableIds is hacked to return no ids) then the following tests would already fail:

  • TestRS.testGet
  • TestRS.testLuceneDatafiles
  • TestRS.testLuceneDatasets
  • TestRS.testLuceneInvestigations
  • TestRS.testSearch

along with some tests in TestWS (as the functionality in question is in Gatekeeper, I only changed TestRS as part of this PR).

I have expanded these tests to also assert that the tests are not too permissive, by performing searches as the piOne user who cannot see anything unless new rules are set.

The existing exportMetaDataQuery is marked with @Ignore, so I created a new test which performs both "permissive" and "restrictive" checks on the inclusion of related entities in the export dump.

Also refactored code to set up root and piOne sessions, and some automatic formatting of the TestRS file. Note that this branch does not have fixes for the unlrelated test failures: #275 #268

Co-authored-by: Viktor Bozhinov <45173816+VKTB@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants