Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

avoid expensive Solr join for public dvObjects in search (experimental) #10555

Merged
merged 17 commits into from
Jun 4, 2024

Conversation

pdurbin
Copy link
Member

@pdurbin pdurbin commented May 10, 2024

What this PR does / why we need it:

We were advised yesterday by a Solr expert that we should avoid Solr joins whenever possible because they can impact performance.

This pull requests removes the join used when Guest users search. However, dvObjects that would have been found by IP Groups will no longer be found. In addition, a full reindex is required as the new field, a dynamic field of boolean type called publicObject_b, is located on the content documents themselves (dvObjects) rather than permission documents (which we join with).

[Edit from Leonid:] The scope of the PR has been expanded to cover ALL the searches - not just for Guest users - and always rely on the newly-added boolean for selecting publicly-available objects. When the user initiating the search, whether a Guest or an AuthenticatedUser, is also a member of one or more groups (remember, Guests can be parts of some groups, such as IP groups), a join is added that is limited in scope to just these restricted groups. I was able to figure out a query that works on a combination of this boolean publicObject_b:true AND an additional limited join. See under "How to test" on how to observe the effect of these changes.

Which issue(s) this PR closes:

This is related to the ongoing Solr perf effort:

Special notes for your reviewer:

For simplicity, I implemented a single experimental feature flag but we can certainly change this to multiple settings: one to tell Dataverse to start indexing content documents with publicObject_b and another setting to switch search over to start using them (and avoid the join). [Edit from Leonid: this has been done, there are now 2 separate feature flags]

I did very light testing on this (and didn't test IP Groups at all). I did ensure that SearchIT passed on my machine.

Suggestions on how to test this:

After enabling the feature:

  • Make sure guest users can still find expected dvObjects.

[Edit from Leonid:] The QA needs to involve confirming that searches are still working exactly the same as before, without these new experimental feature flags enabled. I'm fairly confident that the code handling that has been left unchanged; but it had to be at least moved around a little, so should be prudent to confirm.
In order to see the full impact of the new optimization in searches, the PR must be tested on the perf. test system, or a similar-size instance.
The "before" part, confirming the scale of performance degradation on an instance the size of ours, with the current implementation:

  • Create a local account in perf (assuming you don't have a local account in prod. yet; note that HarvardKey logins are not going to work there);
  • Deploy develop branch on the perf. cluster;
  • Load some collection pages, as a guest, super- and regular logged in users. Confirm that the performance is somewhat tolerable, on an instance that is NOT under heavy load.
  • Now start a full reindex (http://localhost:8080/api/admin/index). Watch the load time for collection pages become abysmal, for both the guest an a logged in non-superuser. Note that for the superuser, dataverseAdmin the performance will still be somewhat ok - because no permission joins are performed for such.

The "after" part:

  • Deploy this PR build, start the reindex again. Watch the difference in performance, for both the guest, and logged in regular users.
    Notes:
  • A full reindex has already been performed on the perf. system with the new feature enabled; i.e., all the public objects already have the new boolean flag publicObject:true in them.
  • Note that the database is a copy of prod., and therefore has collection creation similarly disabled for non-superusers. So if you need a collection that's only visible to your local user, for testing, the only way is to create the collection as the dataverseAdmin user, then make your local user an admin (under "Permissions"), who can then proceed creating datasets etc. in it.
  • Both feature flags are already enabled as JVM options in the domain.xml.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

No.

Is there a release notes update needed for this change?:

Yes, included.

Additional documentation:

Included.

This comment has been minimized.

@landreev landreev self-assigned this May 16, 2024
@landreev
Copy link
Contributor

@pdurbin Tried it on the perf. system, with the feature flag enabled, started a reindex. It powered through the first 40K or so datasets; then started failing for every dataset (or most of them?). For the first failure the process-failures error message looked like this:

$ cat process-failures/process-failure-VNX8S-2024-05-17T00-21-13.txt
Indexing failed. You can kickoff a re-index of this dataset with: 
 curl http://localhost:8080/api/admin/index/datasets/16935
Error from server at http://34.195.159.0:8983/solr/collection1: ERROR: [doc=dataset_16935] multiple values encountered for non multiValued field publicObject_b: [true, true]

for the last one (right before I killed the reindex this morning:

$ cat process-failures/process-failure-8fz8-r860-2024-05-17T11-26-33.txt
Indexing failed. You can kickoff a re-index of this dataset with: 
 curl http://localhost:8080/api/admin/index/datasets/9673472
Error from server at http://34.195.159.0:8983/solr/collection1: ERROR: [doc=dataset_9673472] multiple values encountered for non multiValued field publicObject_b: [true, true, true, true]

Seems like a simple matter of the new flag being added multiple times (?).

On the search side, everything worked as advertised, with the guest user only seeing the datasets and collections with the new public flag added. We're not going to know just how fast these guest search queries are going to be for a full database until we get it fully re-indexed, but the early results were very promising.

@landreev landreev self-requested a review May 17, 2024 12:41
@coveralls
Copy link

coveralls commented May 17, 2024

Coverage Status

coverage: 20.573% (-0.007%) from 20.58%
when pulling f07ad4c on 10554-avoid-solr-join-guest
into e6b5856 on develop.

This comment has been minimized.

1 similar comment

This comment has been minimized.

@pdurbin
Copy link
Member Author

pdurbin commented May 17, 2024

multiple values encountered for non multiValued field publicObject_b

As discussed in Slack, this was my bad, a copy/paste error I fixed in 6baef6a.

@@ -1006,6 +1007,14 @@ private String getPermissionFilterQuery(DataverseRequest dataverseRequest, SolrQ
// Yes, see if GuestUser is part of any groups such as IP Groups.
// ----------------------------------------------------
if (user instanceof GuestUser) {
if (FeatureFlags.AVOID_EXPENSIVE_SOLR_JOIN.enabled()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you just check to see if groupsFromProviders (or any other reason to not use the public_object) here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qqmyers I'm not sure I follow.

My goal is to do a quick check to see if a feature flag is set or not. If set, return early without a join.

You're proposing something else?

Copy link
Member

@qqmyers qqmyers May 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the flag and no groupsFromProviders (or the subset of those that are IP groups) which is calculated right after this, then skip the join. Guessing that would make this work in all circumstances?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we only talking about making an extra effort NOT to treat members of IP groups as plain guest users here? (I remember very little about how IP groups are supposed to work; I'll let you guys figure it out)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qqmyers what is the problem you are trying to solve? What are you concerned about?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"However, dvObjects that would have been found by IP Groups will no longer be found." If a user is a member of an IP group, why can't you skip adding the SearchFields.PUBLIC_OBJECT + ":" + true; test and just do the normal join? That doesn't give the speedup if everyone is in an IP group, but if that's only a subset of users, you'd get the speedup for others and not break IP groups.

Copy link
Member Author

@pdurbin pdurbin May 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think I get it. I haven't tested the code below, but is it closer to what you're thinking?

// ----------------------------------------------------
// (1) Is this a GuestUser?
// Yes, see if GuestUser is part of any groups such as IP Groups.
// ----------------------------------------------------
if (user instanceof GuestUser) {
    String groupsFromProviders = "";
    Set<Group> groups = groupService.collectAncestors(groupService.groupsFor(dataverseRequest));
    StringBuilder sb = new StringBuilder();
    for (Group group : groups) {
        logger.fine("found group " + group.getIdentifier() + " with alias " + group.getAlias());
        String groupAlias = group.getAlias();
        if (groupAlias != null && !groupAlias.isEmpty()) {
            sb.append(" OR ");
            // i.e. group_builtIn/all-users, ip/ipGroup3
            sb.append(IndexServiceBean.getGroupPrefix()).append(groupAlias);
        }
    }
    groupsFromProviders = sb.toString();
    logger.fine("groupsFromProviders:" + groupsFromProviders);
    // Check if any groups from providers were found.
    if ("".equals(groupsFromProviders)) {
        if (FeatureFlags.AVOID_EXPENSIVE_SOLR_JOIN.enabled()) {
            /**
             * Now that we know that there are no groupsFromProviders
             * (such as IP Groups) involved, instead of doing an
             * expensive join, narrow down to only public objects. This
             * field is indexed on the content document itself, rather
             * than a permission document.
             */
            return SearchFields.PUBLIC_OBJECT + ":" + true;
        }
    }
    /**
     * Return the expensive join if groups from providers (IP Groups)
     * were found or if the feature flag isn't enabled.
     */
    String guestWithGroups = "{!join from=" + SearchFields.DEFINITION_POINT + " to=id}" + SearchFields.DISCOVERABLE_BY + ":(" + IndexServiceBean.getPublicGroupString() + groupsFromProviders + ")";
    logger.fine(guestWithGroups);
    return guestWithGroups;
}

Copy link
Contributor

@landreev landreev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is quite awesome. As I mentioned in slack, this PR has a huge positive effect-to-effort invested ratio.

Jim's soft commit fixes from 10547;
A quick experiment, replacing join on public objects with a boolean publicObject_b:true
for logged-in users as well (with a join added for just for their own personal documents;
groups are ignored for now). #10554

This comment has been minimized.

…ublicObject" flag for

published documents - now for logged-in users, AND with support for groups.
Group support experimental, but appears to be working. #10554

This comment has been minimized.

@landreev
Copy link
Contributor

@pdurbin I moved it back into in progress temporarily, for some last minute touches/cleanup. That fall-through for a guest user in an ip group I can quickly add myself, if you want.
Other than than, I may just move some pieces of code around.

@cmbz cmbz added Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) and removed Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) Size: 10 A percentage of a sprint. 7 hours. labels May 22, 2024

This comment has been minimized.

@landreev landreev changed the title avoid expensive Solr join when guest users search (affects IP Groups) avoid expensive Solr join when guest users search (experimental) May 23, 2024

This comment has been minimized.

search optimization feture that was no longer true. #10554

This comment has been minimized.

This comment has been minimized.

landreev added 2 commits May 28, 2024 11:31
…separated from the flag that

enables the search-side optimization;
Fixed the groups sub-query for the guest user. #10554

This comment has been minimized.

2 similar comments

This comment has been minimized.

This comment has been minimized.

@landreev landreev removed their assignment May 28, 2024
Copy link
Member Author

@pdurbin pdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a little cleanup of docs and javadoc.

Tests are failing but @landreev are looking into it.

I'll go ahead and drag this into ready for QA. The code changes seem fine to me. I haven't tested them, but I've been following along in Slack.

This comment has been minimized.

@landreev
Copy link
Contributor

Yeah, the last few Jenkins failures were on account of the ConfirmEmail test; for which we have a merged fix now. Updated the branch, keeping an eye on it.

Copy link

📦 Pushed preview images as

ghcr.io/gdcc/dataverse:10554-avoid-solr-join-guest
ghcr.io/gdcc/configbaker:10554-avoid-solr-join-guest

🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name.

@pdurbin pdurbin removed their assignment Jun 3, 2024
@stevenwinship stevenwinship self-assigned this Jun 3, 2024
@stevenwinship stevenwinship merged commit 3c55c3f into develop Jun 4, 2024
19 of 20 checks passed
@stevenwinship stevenwinship removed their assignment Jun 4, 2024
@pdurbin pdurbin deleted the 10554-avoid-solr-join-guest branch June 4, 2024 20:37
@pdurbin pdurbin added this to the 6.3 milestone Jun 4, 2024
@landreev
Copy link
Contributor

landreev commented Jun 6, 2024

Happy to see that the tests finally passed on this one. :) (and to see it merged, yay)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature: Performance & Stability Feature: Search/Browse Size: 30 A percentage of a sprint. 21 hours. (formerly size:33)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Avoid expensive Solr join when Guest users search (affects IP Groups)
6 participants