
clear memory usages of the processed files in occ files:checksums:verify #36787

Merged
merged 1 commit into master from optimize-verify-checksums on Jan 24, 2020

Conversation

karakayasemi
Contributor

@karakayasemi karakayasemi commented Jan 20, 2020

Description

We are calling the recursive walkNodes function for each user when calculating file checksums.

private function walkNodes(array $nodes, \Closure $callBack) {

In recursive calls, memory is not freed until the root call has finished. Because of that, the worst-case memory usage of the current approach is the combined size of all of a user's node objects. Assuming an average node object size of 1 KB, a user with 1 million files exceeds a 1 GB PHP memory limit.
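For illustration, a recursive walk of that shape looks roughly like this (a hypothetical sketch, not the exact ownCloud code); each level's $nodes array stays referenced on the call stack until the root call returns:

// Hypothetical sketch of the recursive pattern described above.
// Every recursion level keeps its own $nodes array alive until the
// outermost call returns, so node objects accumulate in memory.
private function walkNodes(array $nodes, \Closure $callBack) {
    foreach ($nodes as $node) {
        if ($node instanceof \OCP\Files\Folder) {
            // Recurse; this frame's $nodes stays referenced meanwhile.
            $this->walkNodes($node->getDirectoryListing(), $callBack);
        } else {
            $callBack($node);
        }
    }
    // Only when this call returns can $nodes be garbage collected.
}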

As an easy optimization, this PR first calculates the checksums of the files in a directory and then unsets the processed file and folder objects. With this change, the limiting factor becomes the getDirectoryListing() method, which returns the node object list of a given directory. This means that if a user has 1 million files in a single folder (without nesting), the 1 GB PHP memory limit is still exceeded.

However, since there are already other callers of the getDirectoryListing() method, I do not think we need to optimize that method here. Reducing the worst-case memory usage to the size of a single getDirectoryListing() result should be sufficient.
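A rough sketch of the optimization described above (illustrative only; the actual change is in the diff): handle the files at one directory level first, release them, and only then descend into subfolders.

// Illustrative sketch, not the literal PR diff: process and free the
// file nodes of a directory before descending into its subfolders,
// so at most one directory listing is held in memory at a time.
private function walkNodes(array $nodes, \Closure $callBack) {
    $folders = [];
    foreach ($nodes as $i => $node) {
        if ($node instanceof \OCP\Files\Folder) {
            $folders[] = $node;
        } else {
            $callBack($node);
        }
        unset($nodes[$i]);   // drop the reference to the processed node
    }
    foreach ($folders as $j => $folder) {
        $this->walkNodes($folder->getDirectoryListing(), $callBack);
        unset($folders[$j]); // free the processed folder
    }
}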

Also, the PR improves the command's informational messages by showing the user currently being processed and the result of the command run.

Related Issue

Motivation and Context

How Has This Been Tested?

  • Create 10000 random files in the root of a user's home directory
  • Create a folder and another 10000 random files inside the new folder
  • Create 30000 randomly located folders in the user's home directory
  • Run occ files:checksums:verify and measure peak memory usage with PHP's memory_get_peak_usage() method (see the snippet below the list)
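For reference, the peak memory in the runs above can be read at the end of the command with a snippet along these lines (illustrative, not part of the PR):

// Illustrative: report peak memory after the verification run.
$peakBytes = memory_get_peak_usage(true);
echo sprintf("Peak memory: %.1f MB\n", $peakBytes / 1024 / 1024);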

I applied the above scenario on the current master and on the PR's branch. The results show a clear improvement: peak memory usage is much lower on the PR's branch.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Database schema changes (next release will require increase of minor version instead of patch)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Technical debt
  • Tests only (no source changes)

Checklist:

  • Code changes
  • Unit tests added
  • Acceptance tests added
  • Documentation ticket raised:
  • Changelog item, see TEMPLATE

@update-docs

update-docs bot commented Jan 20, 2020

Thanks for opening this pull request! The maintainers of this repository would appreciate it if you would create a changelog item based on your changes.

@karakayasemi karakayasemi self-assigned this Jan 20, 2020
@karakayasemi karakayasemi added this to the development milestone Jan 20, 2020
@codecov

codecov bot commented Jan 20, 2020

Codecov Report

Merging #36787 into master will increase coverage by 0.54%.
The diff coverage is 86.66%.


@@             Coverage Diff              @@
##             master   #36787      +/-   ##
============================================
+ Coverage     64.14%   64.68%   +0.54%     
- Complexity    19121    19126       +5     
============================================
  Files          1269     1269              
  Lines         74743    74789      +46     
  Branches       1320     1320              
============================================
+ Hits          47946    48380     +434     
+ Misses        26409    26021     -388     
  Partials        388      388
Flag Coverage Δ Complexity Δ
#javascript 54.12% <ø> (ø) 0 <ø> (ø) ⬇️
#phpunit 65.86% <86.66%> (+0.6%) 19126 <13> (+5) ⬆️
Impacted Files Coverage Δ Complexity Δ
apps/files/lib/Command/VerifyChecksums.php 89.24% <86.66%> (-2.32%) 27 <13> (+4)
apps/files/lib/Command/Scan.php 71.92% <0%> (-9.75%) 61% <0%> (ø)
lib/private/Files/Cache/HomePropagator.php 77.77% <0%> (-9.73%) 3% <0%> (ø)
apps/files_sharing/lib/External/Manager.php 90.05% <0%> (+0.05%) 33% <0%> (+1%) ⬆️
lib/private/Files/View.php 84.83% <0%> (+0.29%) 389% <0%> (ø) ⬇️
lib/private/DB/QueryBuilder/QueryBuilder.php 91.34% <0%> (+0.48%) 68% <0%> (ø) ⬇️
apps/dav/lib/Connector/Sabre/File.php 84.19% <0%> (+0.64%) 115% <0%> (ø) ⬇️
lib/private/DB/Connection.php 67.64% <0%> (+0.73%) 49% <0%> (ø) ⬇️
lib/private/DB/MDB2SchemaWriter.php 94.79% <0%> (+1.04%) 34% <0%> (ø) ⬇️
lib/private/DB/MDB2SchemaManager.php 91.22% <0%> (+1.75%) 17% <0%> (ø) ⬇️
... and 24 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c0da7c6...f4ea8df. Read the comment docs.

@jvillafanez
Member

It seems there is a return statement missing in the walkNodes function...
I don't really see any advantage with that code. I mean, you free the memory for the files but not for the directories. If instead of having 10k files you have 10k folders, the memory consumption will remain the same.

@karakayasemi
Contributor Author

It seems there is a return statement missing in the walkNodes function...

There was a copy-paste mistake in the code; I have fixed it now. As before, walkNodes is a void function.

I don't really see any advantage with that code. I mean, you free the memory for the files but not the directories. If instead having 10k files you have 10k folders, the memory consumption will remain the same

If all the files are in the same directory at the same level, there will be no advantage, or, as you said, if the whole tree consists of folders there will be no advantage.
However, I expect the number of files to be much greater than the number of folders. In my approach, when the recursive function branches into a new folder, it first finishes processing the child file nodes and then frees their memory. This way, it holds at most one folder's child file nodes in memory. Currently, we never free processed files from memory until all of the user's files have been processed.

@karakayasemi karakayasemi marked this pull request as ready for review January 22, 2020 07:38
@jvillafanez
Member

Any easy chance to get rid of the recursion and use a queue (or similar) instead? The idea would be:

  1. Start with a queue with some items (this would be the $nodes). Probably we'd just need the folders. Files could have been processed before.
  2. For the current item (assuming it's a folder), list the contents.
  3. For each of those contents, if it's a file then process it; if not, add it to the queue
  4. Remove the current item from the queue, move to the next item and go to step 2

The problem is that I haven't found a "valid" implementation for the queue, mainly because I had trouble freeing the memory of the iterator at step 4. Worst case, we might need to implement our own queue, which doesn't seem worthwhile.

If we need many changes for the iterator approach, I think we can use the solution with the recursion.
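A minimal sketch of that queue idea, assuming a plain PHP array is used as the queue (hypothetical code, not what was merged):

// Hypothetical FIFO sketch of the steps above. Note that array_shift()
// re-indexes the array on every call (O(n)), which is the drawback
// raised later in this discussion.
function walkNodesIteratively(array $nodes, \Closure $callBack) {
    $queue = [$nodes];                    // step 1: start with the initial listing
    while (!empty($queue)) {
        $current = array_shift($queue);   // step 4: take the next item
        foreach ($current as $node) {     // step 2: iterate the contents
            if ($node instanceof \OCP\Files\Folder) {
                $queue[] = $node->getDirectoryListing();  // step 3: enqueue folders
            } else {
                $callBack($node);         // step 3: process files directly
            }
        }
        unset($current);                  // release the processed listing
    }
}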

@karakayasemi karakayasemi force-pushed the optimize-verify-checksums branch 3 times, most recently from 1bd6785 to 5bf5b0d on January 23, 2020 07:11
@karakayasemi
Contributor Author

karakayasemi commented Jan 23, 2020

I re-implemented the recursive method in an iterative way. In addition, folders are now stored in an array used as a queue, and processed folders are popped from it. This way, memory usage is also optimized for folders. I added 30000 folders in random locations to the test scenario. The peak memory usage is lower than before, as expected.
@jvillafanez please take a look at the code one more time. Thanks.
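A simplified sketch of what such a stack-based traversal can look like (names and details are illustrative; the merged code may differ):

// Simplified sketch: folders are kept on a stack and popped (LIFO) once
// reached, so only one directory listing is held in memory at a time.
function walkFolder(\OCP\Files\Folder $rootFolder, \Closure $callBack) {
    $folderStack = [$rootFolder];
    while (!empty($folderStack)) {
        $folder = array_pop($folderStack);           // O(1) removal from the end
        foreach ($folder->getDirectoryListing() as $node) {
            if ($node instanceof \OCP\Files\Folder) {
                $folderStack[] = $node;              // push subfolder for later
            } else {
                $callBack($node);                    // e.g. verify the file's checksum
            }
        }
        unset($folder);                              // release the processed folder
    }
}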

@jvillafanez
Member

Could you double-check that the traversal order makes sense? I initially expected a FIFO queue instead of LIFO, since that would traverse the tree level by level (if I'm not mistaken). With the current code, I think you'll check the first level, then check the last folder's contents, and from there go deeper into the "next" last folder's contents... which sounds weird.

I don't really think this is something important, so if it isn't easy to fix there is no need to change anything. A lower memory usage takes precedence over this.
Note that array_shift will re-index the array, which could take a lot of time, so that isn't a good solution.

@karakayasemi
Contributor Author

Note that array_shift will re-index the array, which could take a lot of time, so that isn't a good solution.

That's why I used a stack (array_pop()) instead of a queue (array_shift()). I initially intended to implement it as you suggested, but array_pop() is much more efficient in terms of time complexity (O(1) vs. O(N)).

The currently implemented traversal is a depth-first search. If we use a queue instead of a stack, it turns into a breadth-first search. However, since we traverse all nodes anyway, the time and memory complexity is the same for both; there is almost no difference between them for our case.
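Side note, not part of the PR: if FIFO order were ever preferred without the re-indexing cost of array_shift(), PHP's SplQueue gives O(1) enqueue/dequeue:

// Illustrative only: SplQueue avoids array_shift()'s re-indexing.
$queue = new \SplQueue();
$queue->enqueue($firstFolder);    // $firstFolder, $secondFolder are placeholders
$queue->enqueue($secondFolder);
while (!$queue->isEmpty()) {
    $current = $queue->dequeue(); // FIFO: first enqueued, first processed
    // ... process $current ...
}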

@jvillafanez
Member

Currently implemented searching algorithm is the depth-first search algorithm

Not entirely true. We'd need to add the files to the queue too for the algorithm to work as expected. Right now, files are being processed while we're still going down the tree.

In any case, we need to process all the files, so the order doesn't really matter. It's more important to keep the memory usage as low as possible, and the current algorithm does a good job in that regard.

@jvillafanez
Member

Taking into account that @karakayasemi says that the memory usage has gone down, I assume this has been tested (dev-wise) and it still works as expected.

@karakayasemi karakayasemi merged commit 5812ff6 into master Jan 24, 2020
@delete-merged-branch delete-merged-branch bot deleted the optimize-verify-checksums branch January 24, 2020 10:31
Development

Successfully merging this pull request may close these issues.

Command files:checksums:verify doesn't print that the verification went well when showing output.