Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move all child entries in the cache in a single query #13956

Closed
wants to merge 3 commits into from

Conversation

icewind1991
Copy link
Contributor

Use some sql magic to move the calculation of the updated fields to sql so we can do it all in a single query.

Moves the time it takes to rename a folder with ~5k files down to ~180ms, comparison.

Adds adds support for using MD5() in sql to oracle and mssql and using CONCAT() in sqlite.

cc @DeepDiver1975 @PVince81 @MorrisJobke

@@ -25,6 +25,7 @@ public function fixupStatement($statement) {
$statement = str_replace('`', '"', $statement);
$statement = str_ireplace('NOW()', 'CURRENT_TIMESTAMP', $statement);
$statement = str_ireplace('UNIX_TIMESTAMP()', self::UNIX_TIMESTAMP_REPLACEMENT, $statement);
$statement = preg_replace('/MD5\(([^)]+)\)/i', 'LOWER(DBMS_OBFUSCATION_TOOLKIT.md5 (input => UTL_RAW.cast_to_raw($1)))', $statement);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets all take a moment to appreciate oracle's sql dialect

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol

\OC_DB::executeAudited($query, array($targetPath, md5($targetPath), $child['fileid']));
}
$query = \OC_DB::prepare('UPDATE `*PREFIX*filecache` SET
`path_hash` = MD5(CONCAT(?, SUBSTR(`path`, ?))),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't we do all the magic in php?
Also this is using the old path, not the new one, intended?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah okay, you are not looping anymore

@nickvergessen
Copy link
Contributor

Please also add a test scenario with two folders:

  • foobar and
  • foobar2

Then rename foobar and check whether foobar2 and its children are left untouched.

I can't see it clearly in your code, if that works or not.

@PVince81
Copy link
Contributor

PVince81 commented Feb 9, 2015

@icewind1991 awesome finding.
Are there other places in the code that could benefit of the MD5() for path hash ? (as separate PR if needed)

@@ -149,7 +149,7 @@ public function get($file) {
$where = 'WHERE `fileid` = ?';
$params = array($file);
}
$sql = 'SELECT `fileid`, `storage`, `path`, `parent`, `name`, `mimetype`, `mimepart`, `size`, `mtime`,
$sql = 'SELECT `fileid`, `storage`, `path`, `path_hash`, `parent`, `name`, `mimetype`, `mimepart`, `size`, `mtime`,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not used?

@DeepDiver1975
Copy link
Member

reschedule for 8.2

@DeepDiver1975 DeepDiver1975 modified the milestones: 8.2-next, 8.1-current Apr 9, 2015
@DeepDiver1975
Copy link
Member

rebased - just because ....

@PVince81
Copy link
Contributor

Test\Files\Cache\Wrapper\CacheJail::testMove
Failed asserting that two strings are equal.
--- Expected
+++ Actual
@@ @@
-'dc761cbf774573c685167a9f2aa82df6'
+'308b6040d33824f37124cf3ef584e8c3'

/var/jenkins/workspace/pull-request-analyser-ng-simple@2/label/SLAVE/tests/lib/files/cache/cache.php:407

@butonic
Copy link
Member

butonic commented Apr 30, 2015

Before diving into the filecache table I highly recommend we start educating ourselves on what the existing solutions are. In general there are 4 implementations for storing hierarchical date in a relational database:

  1. Adjacency List
  2. Path enumeration
  3. Nested sets
  4. Closure Tables

Currently oc implements an adjecency list. Unfortunately, we loose all benefits of that model because we are also storing the full path in the db. As a result we have to propagate moves down the tree. An additional requirement is that we need to propagate mtime and etag up the tree. Furthermore, we need to keep in mind how expensive it is to resolve a fileid from a path. That is nothing I could find in any comparison of the above data models. It is also the reason why we store the path, which basically breaks our necks.

Maybe it is enough to cache the path->id mapping in memory?

The only way to find out IMO is to implement the four data models, find out which operations are expensive and especially how they perform in our usecases.

It might make sense to actually change our implementation to allow faster uploads / propagation of etags. Closure tables for example would allow us to do that in one query, preserving referential integrity. Something we currently do not have when a request times out / is interrupted.

@DeepDiver1975
Copy link
Member

Before diving into the filecache table I highly recommend we start educating ourselves on what the existing solutions are. In general there are 4 implementations for storing hierarchical date in a relational database:

Adjacency List
Path enumeration
Nested sets
Closure Tables

that's indeed very interesting - THX for sharing - this actually cries for a 20% research time allocation

@icewind1991 @PVince81 @blizzz interested?

@butonic
Copy link
Member

butonic commented Apr 30, 2015

@felixboehm had the idea to not propagate the etag of shared folders up to the root and instead treat storages as separate trees. the etag for a mount point is then dynamicalld calculated by concatenating and hashing the individual etags for the separate trees. This would prevent having to propagate the etag up into multiple storages (when a file has been shared with multiple users). It would reduce our problem from graphs to trees again.

@icewind1991
Copy link
Contributor Author

I've looked into it in the past and came to the conclusion that the vast majority of operation we do is based on the file path so that's the main case we need to optimize for.

While deleteing/renaming a folder isn't optimal in the current approach those operations dont happen nearly as often as getFileInfo/getDirectoryContent or updating/put cache data (which also doesn't happen nearly as often as reading the cache.

Closure tables seemed like the best way to solve recursive operations like delete/rename to me but since we can't do triggers in all our db backends afaik maintaining the closure table adds a significant amount of complexity and potential for bugs

@butonic
Copy link
Member

butonic commented Apr 30, 2015

There are only two options for keeping referential integrity when inserting / moving / updating trees: Adjacency List or Closure Table. While I agree that we mostly do queries to map paths back to fileids please keep in mind that SELECTS can be cached and scaled out to multiple db servers. UPDATEs can't.

But again, the proof is in the pudding. We need to actually try this and see how it scales in our workloads.

@butonic
Copy link
Member

butonic commented May 2, 2015

Also found this patent on how to model a hierarchical filesystem in a relational database:https://www.google.com/patents/US6427123 it recognizes the problem of path based lookups and adresses it. But it looses referential integrity in the process.

@dragotin
Copy link
Contributor

dragotin commented May 3, 2015

👍 for @icewind1991 s approach of doing that with one statement. This kills pot. concurrency issues as well.

One remark: The LIKE tends to eat kittens with big data sets as it turns slow. One trick to improve that is to shrink the actual data set on which the LIKE is performed. In this case this could be done by excluding all records where the path length is shorter than the source, ie:

            $query = \OC_DB::prepare('UPDATE `*PREFIX*filecache` SET
                `path_hash` = MD5(CONCAT(?, SUBSTR(`path`, ?))),
                `path`      =     CONCAT(?, SUBSTR(`path`, ?))
                WHERE `storage` = ? AND LENGTH(`path`) > ? AND `path` LIKE ?');
            \OC_DB::executeAudited($query, [$target, $sourceLength + 1,$target, $sourceLength+ 1, $this->getNumericStorageId(), length($source), $source . '/%']);

We do that in the client, and I once checked that it was improving speed considerably. We, however, have a column with pathlen in the table which is indexed. If @icewind1991 has a test setup anyhow it might be worth to check.

@ghost
Copy link

ghost commented May 12, 2015

@DeepDiver1975 @butonic We need to agree as to whether this is going into 8.1. This seems required ASAP as per @butonic

@DeepDiver1975
Copy link
Member

@DeepDiver1975 @butonic We need to agree as to whether this is going into 8.1. This seems required ASAP as per @butonic

To e honest with you: the system is already in an unstable state - moving more changes in will help no body. Furthermore we did freeze 8.1 weeks back and this change was move out of scope of 8.1 for a reason.
Finally: nobody really gave a 💩 for weeks -> I stick with my NO as per chat today

@PVince81
Copy link
Contributor

PVince81 commented Jul 3, 2015

Please rebase. Would be cool to push this forward 😄

Then have something similar for etag propagation if possible.

@iGadget
Copy link

iGadget commented Aug 3, 2015

Where's the 'vote' button? I'd very much like to see this implemented 👍

@scrutinizer-notifier
Copy link

A new inspection was created.

@MorrisJobke
Copy link
Contributor

@DeepDiver1975 @cmonteroluque And again we need to move this. Bringing in such a huge change in the current state is quite bad -> 9.0

@MorrisJobke MorrisJobke modified the milestones: 9.0-next, 8.2-current Oct 5, 2015
@ghost
Copy link

ghost commented Oct 5, 2015

@MorrisJobke ok. Yeah, this is definitely 9.0

@MorrisJobke MorrisJobke modified the milestones: 9.1-next, 9.0-current Mar 4, 2016
@MorrisJobke
Copy link
Contributor

@icewind1991 Please rebase this. It would be super nice to have this early in the release cycle. We have all of this covered by tests and rebase it now and merge it would be a good way to proof that it works.

@MorrisJobke
Copy link
Contributor

More conflicts ... what to do here?

@PVince81
Copy link
Contributor

Missed the mark again. Move to 9.2 ?

Or we can merge this (solve conflicts first) before the feature freeze and iron out potential issues during the hardening phase...

@DeepDiver1975
Copy link
Member

It's seriously a joke to move this once more. I'm closing this now.

@DeepDiver1975 DeepDiver1975 deleted the cache-move-single-query branch May 30, 2016 12:45
@iGadget
Copy link

iGadget commented Jun 2, 2016

So... "closed" means "won't fix / won't be implemented"? That would be a real shame...

@lock
Copy link

lock bot commented Aug 5, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Aug 5, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants