Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache terms returned by IslandoraUtils::getTermForUri() (Islandora#2272) #997

Open
wants to merge 5 commits into
base: 2.x
Choose a base branch
from

Conversation

antbrown
Copy link

GitHub Issue: (link)

  • Other Relevant Links (Google Groups discussion, related pull requests,
    Release pull requests, etc.)

What does this Pull Request do?

Caches the term found when calling IslandoraUtils::getTermForUri().

Cache ID is based on user id because the db query has an access check.

What's new?

  • Changes IslandoraUtils service by adding a CacheBackend and CurrentUser argument to its constructor
  • Creates a Cache ID based on user id and external uri
  • Tries to retrieve the term from the cache if it exists
  • If the cached term doesn't exist, run the db query and store the result in cache

How should this be tested?

A description of what steps someone could take to:

  • Install webprofiler, compare number of slow db queries before/after, you should see the queries related to this function disappear once the results are loaded from the cache

Documentation Status

  • Does this change existing behaviour that's currently documented? I don't think so?

Additional Notes:

I'm fairly new to using the Cache API so would appreciate any advice you have. In the meantime I'll try find example test cases so that this can be added to the test suite.

Interested parties

@Islandora/committers

@antbrown
Copy link
Author

antbrown commented Dec 1, 2023

The tests failing seem to be getting a 500 response when generating derivatives along with some deprecations. I'll try figure out if these are related to the changes I've proposed

@antbrown
Copy link
Author

We've deployed this change to a site receiving a decent amount of traffic.

Prior to the change our slow query log (analysed using https://docs.percona.com/percona-toolkit/pt-query-digest.html) showed a total exec time of 229853s for slow queries, with the top offender being the lookup for the "http://pcdm.org/use#ServiceFile" term.

After deploying the change these queries disappear from our slow query log and the total exec time of slow queries drops to 2536s, a 98.9% drop in time spent executing queries on the db server.

@adam-vessey
Copy link

adam-vessey commented Jan 3, 2024

I find myself wondering: Could this also be addressed by introducing a better DB index?

As in, Drupal by default does truncated index things, using the first 30 characters, likely not expecting a significant number of lookups on fields of this type (instead, expecting "links" to be hyperlinks to be rendered, not so much a store of identifiers to be looked up): https://git.drupalcode.org/project/drupal/-/blob/10.2.x/core/modules/link/src/Plugin/Field/FieldType/LinkItem.php?ref_type=heads#L79-81

If URLs end up all having similar prefixes, resulting in constant table scans which might be expected to cause slow downs. Could potentially add in another index that makes use of the full URL value?


... at the same time, I start thinking: If we want caching on one method, why not cache as many as possible? Somewhat of mad science, but something of a shot at things in: 2.x...adam-vessey:islandora:madscience

Somewhat of a mess decorating with a caching proxy, but, should handle things, in theory?

@antbrown antbrown force-pushed the cache-term-for-uri branch from 5dfc8dc to 14a6715 Compare April 10, 2024 04:11
@rosiel
Copy link
Member

rosiel commented Apr 10, 2024

Discussed at the Committers Call. This sounds like a real problem that's related to our unconventional use of the Link field type.

There was a strong lean toward looking into just improving database indexing.
Also a mention that other files in IslandoraUtils might need similar treatment.

I'll test and review.

@rosiel rosiel self-requested a review April 10, 2024 17:26
@joecorall joecorall self-assigned this Apr 17, 2024
@joecorall joecorall requested review from joecorall and removed request for rosiel April 20, 2024 18:59
@joecorall
Copy link
Member

joecorall commented Apr 24, 2024

This PR dramatically improves the speed. Made a simple drush script

<?php

$i = 0;
while (++$i < 100) {
  $term = \Drupal::service('islandora.utils')->getTermForUri("http://purl.org/coar/resource_type/c_18cc");
}

Then ran it and it took 17s to complete

$ time drush scr scripts/997.php 
real    0m 17.46s
user    0m 0.52s
sys     0m 0.38s

Applied this PR and ran again and execution dropped to 1s

$ time drush scr scripts/997.php 
real    0m 1.03s
user    0m 0.44s
sys     0m 0.17s

It's definitely the query here causing the slowdown, as if I replace the query with $results = [12]; I get similar 1s executions. So going to see if we can add any database indexes to have similar speed improvements as caching/

Here's the query that entityTypeManager is running:

SELECT t.revision_id, tid FROM taxonomy_term_data t
LEFT JOIN taxonomy_term__field_authority_link a ON a.entity_id = t.tid
LEFT JOIN taxonomy_term__field_external_uri e ON e.entity_id = t.tid
WHERE (a.field_authority_link_uri = 'http://purl.org/coar/resource_type/c_18cc') or (e.field_external_uri_uri = 'http://purl.org/coar/resource_type/c_18cc')

I think due to the orGroup this is slowing down these queries. It'd actually appears be more efficient to run several queries.

diff --git a/src/IslandoraUtils.php b/src/IslandoraUtils.php
index a2df7589..954f768d 100644
--- a/src/IslandoraUtils.php
+++ b/src/IslandoraUtils.php
@@ -246,24 +246,19 @@ class IslandoraUtils {
     // Add field_external_uri.
     $fields[] = self::EXTERNAL_URI_FIELD;
 
-    $query = $this->entityTypeManager->getStorage('taxonomy_term')->getQuery();
-
-    $orGroup = $query->orConditionGroup();
+    $storage = $this->entityTypeManager->getStorage('taxonomy_term');
     foreach ($fields as $field) {
-      $orGroup->condition("$field.uri", $uri);
-    }
-
-    $results = $query
-      ->accessCheck(TRUE)
-      ->condition($orGroup)
-      ->execute();
-
-    if (empty($results)) {
-      return NULL;
+      $query = $storage->getQuery();
+      $results = $query
+        ->accessCheck(TRUE)
+        ->condition("$field.uri", $uri)
+        ->execute();
+      if (!empty($results)) {
+        return $storage->load(reset($results));
+      }
     }
 
-    return $this->entityTypeManager->getStorage('taxonomy_term')
-      ->load(reset($results));
+    return NULL;
   }
 
   /**
$ time drush scr scripts/997.php 
real    0m 0.99s
user    0m 0.46s
sys     0m 0.18s

Copy link
Member

@joecorall joecorall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should simplify the query for speed improvements instead of adding a cache layer. See #997 (comment) for more details

@antbrown antbrown force-pushed the cache-term-for-uri branch from 14a6715 to dcbddf3 Compare July 30, 2024 01:26
@antbrown
Copy link
Author

Hi @joecorall, I agree with simplifying the query rather than using a cache layer here- I will create a separate pull request with the suggestions you've made and test it out with some heavy traffic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants