Move all child entries in the cache in a single query #13956

icewind1991 · 2015-02-06T19:57:06Z

Use some SQL magic to move the calculation of the updated fields into SQL so we can do it all in a single query.

Brings the time it takes to rename a folder with ~5k files down to ~180ms (comparison).

Adds support for using MD5() in SQL on Oracle and MSSQL, and for using CONCAT() in SQLite.
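For readers who want to try the idea outside the codebase, here is a minimal sketch in Python against an in-memory SQLite database. It is not the PR's code: the table is reduced to three columns, and `open_cache`, `move_children`, `md5_hex` and `concat` are hypothetical names. MD5() and CONCAT() are registered as user functions so the statement can run on SQLite.

```python
import hashlib
import sqlite3


def md5_hex(value: str) -> str:
    """SQL-callable MD5 returning a lowercase hex digest."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()


def concat(*parts) -> str:
    """SQL-callable CONCAT for SQLite."""
    return "".join(str(p) for p in parts)


def open_cache() -> sqlite3.Connection:
    """Create a reduced, hypothetical stand-in for the filecache table."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("MD5", 1, md5_hex)
    conn.create_function("CONCAT", -1, concat)
    conn.execute("CREATE TABLE filecache (storage INTEGER, path TEXT, path_hash TEXT)")
    return conn


def move_children(conn: sqlite3.Connection, storage: int, source: str, target: str) -> int:
    """Rewrite path and path_hash of everything below `source` in one UPDATE."""
    start = len(source) + 1  # SUBSTR is 1-based: keep the part after the source prefix
    cur = conn.execute(
        "UPDATE filecache SET"
        " path_hash = MD5(CONCAT(?, SUBSTR(path, ?))),"
        " path = CONCAT(?, SUBSTR(path, ?))"
        " WHERE storage = ? AND path LIKE ?",
        (target, start, target, start, storage, source + "/%"),
    )
    return cur.rowcount
```

Calling `move_children(conn, 1, "foo", "bar")` turns `foo/a.txt` into `bar/a.txt` and recomputes its hash in the same statement. The entry for `foo` itself is not matched by the pattern and would still need its own update.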

cc @DeepDiver1975 @PVince81 @MorrisJobke

icewind1991 · 2015-02-06T19:57:56Z

lib/private/db/adapteroci8.php

@@ -25,6 +25,7 @@ public function fixupStatement($statement) {
 $statement = str_replace('`', '"', $statement);
 $statement = str_ireplace('NOW()', 'CURRENT_TIMESTAMP', $statement);
 $statement = str_ireplace('UNIX_TIMESTAMP()', self::UNIX_TIMESTAMP_REPLACEMENT, $statement);
+ $statement = preg_replace('/MD5\(([^)]+)\)/i', 'LOWER(DBMS_OBFUSCATION_TOOLKIT.md5 (input => UTL_RAW.cast_to_raw($1)))', $statement);


Let's all take a moment to appreciate Oracle's SQL dialect
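To see what that rewrite produces, here is a small Python port of the two relevant steps (backticks to double quotes, then the MD5() substitution). The function name `fixup_md5` is made up for this sketch; the pattern and replacement text are copied from the diff above.

```python
import re

# Same pattern as in the diff: the argument is everything up to the first ")".
MD5_CALL = re.compile(r"MD5\(([^)]+)\)", re.IGNORECASE)
ORACLE_MD5 = r"LOWER(DBMS_OBFUSCATION_TOOLKIT.md5 (input => UTL_RAW.cast_to_raw(\1)))"


def fixup_md5(statement: str) -> str:
    """Replace backticks with double quotes, then rewrite MD5(...) calls."""
    statement = statement.replace("`", '"')
    return MD5_CALL.sub(ORACLE_MD5, statement)
```

Note that `[^)]+` stops at the first closing parenthesis. For a nested call such as `MD5(CONCAT(?, SUBSTR(path, ?)))` the output is still balanced, because the replacement only wraps the argument and the leftover `))` close the wrapper. An argument with text after an inner `)`, for example `MD5(CONCAT(a, b) || c)`, would be rewritten incorrectly.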

nickvergessen · 2015-02-09T15:51:02Z

lib/private/files/cache/cache.php

- \OC_DB::executeAudited($query, array($targetPath, md5($targetPath), $child['fileid']));
- }
+ $query = \OC_DB::prepare('UPDATE `*PREFIX*filecache` SET
+ `path_hash` = MD5(CONCAT(?, SUBSTR(`path`, ?))),


Why don't we do all the magic in PHP?
Also, this is using the old path, not the new one. Is that intended?

Ah okay, you are not looping anymore

nickvergessen · 2015-02-09T16:01:43Z

Please also add a test scenario with two folders:

foobar and
foobar2

Then rename foobar and check whether foobar2 and its children are left untouched.

I can't clearly tell from your code whether that works or not.
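A quick way to check the scenario in isolation, assuming the WHERE clause matches children with the source path plus `/%` (as in the snippet quoted later in this thread). This is a Python/SQLite sketch with a one-column table, not the requested PHPUnit test, and `children_of` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE filecache (path TEXT)")
paths = ["foobar", "foobar/a.txt", "foobar/sub/b.txt", "foobar2", "foobar2/c.txt"]
conn.executemany("INSERT INTO filecache VALUES (?)", [(p,) for p in paths])


def children_of(folder: str) -> list:
    """Return the paths matched by the `path LIKE 'folder/%'` predicate."""
    rows = conn.execute(
        "SELECT path FROM filecache WHERE path LIKE ? ORDER BY path",
        (folder + "/%",),
    )
    return [row[0] for row in rows]
```

Here `children_of("foobar")` returns only `foobar/a.txt` and `foobar/sub/b.txt`: the trailing slash in the pattern keeps `foobar2` and its children out of the match. Folder names containing `%` or `_` are a separate concern, since LIKE treats them as wildcards.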

PVince81 · 2015-02-09T17:21:10Z

@icewind1991 awesome finding.
Are there other places in the code that could benefit from using MD5() for the path hash? (As a separate PR if needed.)

nickvergessen · 2015-02-24T13:07:54Z

lib/private/files/cache/cache.php

@@ -149,7 +149,7 @@ public function get($file) {
 $where = 'WHERE `fileid` = ?';
 $params = array($file);
 }
- $sql = 'SELECT `fileid`, `storage`, `path`, `parent`, `name`, `mimetype`, `mimepart`, `size`, `mtime`,
+ $sql = 'SELECT `fileid`, `storage`, `path`, `path_hash`, `parent`, `name`, `mimetype`, `mimepart`, `size`, `mtime`,


this is not used?

DeepDiver1975 · 2015-04-09T21:37:26Z

reschedule for 8.2

DeepDiver1975 · 2015-04-09T22:00:47Z

rebased - just because ....

PVince81 · 2015-04-10T10:31:07Z

Test\Files\Cache\Wrapper\CacheJail::testMove
Failed asserting that two strings are equal.
--- Expected
+++ Actual
@@ @@
-'dc761cbf774573c685167a9f2aa82df6'
+'308b6040d33824f37124cf3ef584e8c3'

/var/jenkins/workspace/pull-request-analyser-ng-simple@2/label/SLAVE/tests/lib/files/cache/cache.php:407

butonic · 2015-04-30T13:03:13Z

Before diving into the filecache table I highly recommend we start educating ourselves on what the existing solutions are. In general there are 4 implementations for storing hierarchical data in a relational database:

Adjacency List
Path enumeration
Nested sets
Closure Tables

Currently oc implements an adjacency list. Unfortunately, we lose all benefits of that model because we are also storing the full path in the db. As a result we have to propagate moves down the tree. An additional requirement is that we need to propagate mtime and etag up the tree. Furthermore, we need to keep in mind how expensive it is to resolve a fileid from a path. That is something I could not find in any comparison of the above data models. It is also the reason why we store the path, which basically breaks our necks.

Maybe it is enough to cache the path->id mapping in memory?

The only way to find out IMO is to implement the four data models, find out which operations are expensive and especially how they perform in our use cases.

It might make sense to actually change our implementation to allow faster uploads / propagation of etags. Closure tables for example would allow us to do that in one query, preserving referential integrity. Something we currently do not have when a request times out / is interrupted.
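To make the closure-table claim concrete, here is a minimal sketch in Python with SQLite. The `node` and `closure` tables and the helpers `add_node` and `touch` are invented for illustration and do not reflect the filecache schema. The closure table stores one row per (ancestor, descendant) pair, so propagating a value up the tree becomes a single UPDATE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE node (id INTEGER PRIMARY KEY, name TEXT, mtime INTEGER);
-- One row per (ancestor, descendant) pair, including each node paired with itself.
CREATE TABLE closure (ancestor INTEGER, descendant INTEGER, depth INTEGER);
""")


def add_node(node_id: int, name: str, parent_id=None) -> None:
    """Insert a node and link it to itself and to all ancestors of its parent."""
    conn.execute("INSERT INTO node VALUES (?, ?, 0)", (node_id, name))
    conn.execute("INSERT INTO closure VALUES (?, ?, 0)", (node_id, node_id))
    if parent_id is not None:
        # Inherit every ancestor of the parent, one level deeper.
        conn.execute(
            "INSERT INTO closure SELECT ancestor, ?, depth + 1"
            " FROM closure WHERE descendant = ?",
            (node_id, parent_id),
        )


def touch(node_id: int, mtime: int) -> int:
    """Set mtime on a node and on all of its ancestors in one UPDATE."""
    cur = conn.execute(
        "UPDATE node SET mtime = ? WHERE id IN"
        " (SELECT ancestor FROM closure WHERE descendant = ?)",
        (mtime, node_id),
    )
    return cur.rowcount


add_node(1, "root")
add_node(2, "docs", parent_id=1)
add_node(3, "a.txt", parent_id=2)
add_node(4, "photos", parent_id=1)
```

With this tree, `touch(3, 100)` updates `a.txt`, `docs` and `root` and leaves `photos` alone. The price is the extra bookkeeping on insert and move, which the sketch only covers for inserts.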

DeepDiver1975 · 2015-04-30T13:45:25Z

Before diving into the filecache table I highly recommend we start educating ourselves on what the existing solutions are. In general there are 4 implementations for storing hierarchical data in a relational database:
Adjacency List
Path enumeration
Nested sets
Closure Tables

that's indeed very interesting - THX for sharing - this actually cries for a 20% research time allocation

@icewind1991 @PVince81 @blizzz interested?

butonic · 2015-04-30T14:11:42Z

@felixboehm had the idea to not propagate the etag of shared folders up to the root and instead treat storages as separate trees. The etag for a mount point is then dynamically calculated by concatenating and hashing the individual etags for the separate trees. This would prevent having to propagate the etag up into multiple storages (when a file has been shared with multiple users). It would reduce our problem from graphs to trees again.
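A rough sketch of that calculation in Python. The function name `mount_point_etag`, the choice of MD5, the `:` separator and the sorting are all assumptions made for this example; the comment above only says the etags are concatenated and hashed.

```python
import hashlib


def mount_point_etag(tree_etags) -> str:
    """Combine per-tree etags into one etag by concatenating and hashing them."""
    # Assumption: sort first so the result does not depend on the mount order.
    joined = ":".join(sorted(tree_etags))
    return hashlib.md5(joined.encode("utf-8")).hexdigest()
```

The combined value changes whenever any one of the tree etags changes, so nothing has to be written into the other storages when a shared file is modified.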

icewind1991 · 2015-04-30T14:16:08Z

I've looked into it in the past and came to the conclusion that the vast majority of operations we do are based on the file path, so that's the main case we need to optimize for.

While deleting/renaming a folder isn't optimal in the current approach, those operations don't happen nearly as often as getFileInfo/getDirectoryContent or updating/putting cache data (which also doesn't happen nearly as often as reading the cache).

Closure tables seemed like the best way to solve recursive operations like delete/rename to me, but since we can't do triggers in all our db backends afaik, maintaining the closure table adds a significant amount of complexity and potential for bugs.

butonic · 2015-04-30T14:40:08Z

There are only two options for keeping referential integrity when inserting / moving / updating trees: Adjacency List or Closure Table. While I agree that we mostly do queries to map paths back to fileids, please keep in mind that SELECTs can be cached and scaled out to multiple db servers. UPDATEs can't.

But again, the proof is in the pudding. We need to actually try this and see how it scales in our workloads.

butonic · 2015-05-02T19:59:58Z

Also found this patent on how to model a hierarchical filesystem in a relational database: https://www.google.com/patents/US6427123 It recognizes the problem of path-based lookups and addresses it. But it loses referential integrity in the process.

dragotin · 2015-05-03T08:42:14Z

👍 for @icewind1991's approach of doing that with one statement. This kills potential concurrency issues as well.

One remark: The LIKE tends to eat kittens with big data sets as it turns slow. One trick to improve that is to shrink the actual data set on which the LIKE is performed. In this case this could be done by excluding all records whose path is not longer than the source, i.e.:

            $query = \OC_DB::prepare('UPDATE `*PREFIX*filecache` SET
                `path_hash` = MD5(CONCAT(?, SUBSTR(`path`, ?))),
                `path`      =     CONCAT(?, SUBSTR(`path`, ?))
                WHERE `storage` = ? AND LENGTH(`path`) > ? AND `path` LIKE ?');
            \OC_DB::executeAudited($query, [$target, $sourceLength + 1, $target, $sourceLength + 1, $this->getNumericStorageId(), strlen($source), $source . '/%']);

We do that in the client, and I once checked that it was improving speed considerably. We, however, have a column with pathlen in the table which is indexed. If @icewind1991 has a test setup anyhow it might be worth checking.

ghost · 2015-05-12T17:57:27Z

@DeepDiver1975 @butonic We need to agree as to whether this is going into 8.1. This seems required ASAP as per @butonic

DeepDiver1975 · 2015-05-12T22:54:31Z

@DeepDiver1975 @butonic We need to agree as to whether this is going into 8.1. This seems required ASAP as per @butonic

To be honest with you: the system is already in an unstable state - moving more changes in will help nobody. Furthermore we did freeze 8.1 weeks back and this change was moved out of scope of 8.1 for a reason.
Finally: nobody really gave a 💩 for weeks -> I stick with my NO as per chat today

PVince81 · 2015-07-03T11:24:09Z

Please rebase. Would be cool to push this forward 😄

Then have something similar for etag propagation if possible.

iGadget · 2015-08-03T13:05:37Z

Where's the 'vote' button? I'd very much like to see this implemented 👍

scrutinizer-notifier · 2015-09-01T14:01:44Z

A new inspection was created.

MorrisJobke · 2015-10-05T18:33:46Z

@DeepDiver1975 @cmonteroluque And again we need to move this. Bringing in such a huge change in the current state is quite bad -> 9.0

ghost · 2015-10-05T20:09:35Z

@MorrisJobke ok. Yeah, this is definitely 9.0

MorrisJobke · 2016-03-31T19:31:10Z

@icewind1991 Please rebase this. It would be super nice to have this early in the release cycle. We have all of this covered by tests, so rebasing it now and merging it would be a good way to prove that it works.

MorrisJobke · 2016-05-12T08:04:21Z

More conflicts ... what to do here?

PVince81 · 2016-05-20T13:29:56Z

Missed the mark again. Move to 9.2?

Or we can merge this (solve conflicts first) before the feature freeze and iron out potential issues during the hardening phase...

DeepDiver1975 · 2016-05-30T12:45:09Z

It's seriously a joke to move this once more. I'm closing this now.

iGadget · 2016-06-02T10:05:50Z

So... "closed" means "won't fix / won't be implemented"? That would be a real shame...

lock · 2019-08-05T16:01:23Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

icewind1991 force-pushed the cache-move-single-query branch from 8fb35ac to 94ec169 Compare February 6, 2015 19:57

icewind1991 reviewed Feb 6, 2015

icewind1991 mentioned this pull request Feb 6, 2015

Use transactions when renaming directory contents #13948

Merged

DeepDiver1975 added this to the 8.1-next milestone Feb 6, 2015

DeepDiver1975 added the 3 - To Review label Feb 6, 2015

nickvergessen reviewed Feb 9, 2015

DeepDiver1975 force-pushed the cache-move-single-query branch from 94ec169 to a37de95 Compare February 24, 2015 12:16

nickvergessen reviewed Feb 24, 2015

PVince81 added 1 - To develop and removed 3 - To Review labels Mar 26, 2015

DeepDiver1975 modified the milestones: 8.2-next, 8.1-current Apr 9, 2015

DeepDiver1975 force-pushed the cache-move-single-query branch from a37de95 to a508292 Compare April 9, 2015 22:00

PVince81 mentioned this pull request Aug 3, 2015

Data loss on rename of a 49 GB folder #13391

Closed

icewind1991 force-pushed the cache-move-single-query branch from a508292 to 47af8fb Compare September 1, 2015 14:01

butonic mentioned this pull request Sep 18, 2015

File Handling Scalability Improvements #18722

Closed

MorrisJobke modified the milestones: 9.0-next, 8.2-current Oct 5, 2015

PVince81 added the comp:filesystem label Nov 20, 2015

jospoortvliet added the performance label Jan 12, 2016

MorrisJobke modified the milestones: 9.1-next, 9.0-current Mar 4, 2016

icewind1991 added 3 commits April 14, 2016 16:09

Add MD5() to sqlite and oracle

a0a6c0e

Add CONCAT() to sqlite

f70589c

Move all children of a folder in a single query

1c0903f

icewind1991 force-pushed the cache-move-single-query branch from 47af8fb to 1c0903f Compare April 14, 2016 15:13

DeepDiver1975 closed this May 30, 2016

DeepDiver1975 deleted the cache-move-single-query branch May 30, 2016 12:45

lock bot locked as resolved and limited conversation to collaborators Aug 5, 2019
