clear memory usages of the processed files in occ files:checksums:verify #36787
Conversation
Thanks for opening this pull request! The maintainers of this repository would appreciate it if you would create a changelog item based on your changes.
Codecov Report

```
@@             Coverage Diff             @@
##              master   #36787    +/-   ##
============================================
+ Coverage      64.14%   64.68%   +0.54%
- Complexity     19121    19126       +5
============================================
  Files           1269     1269
  Lines          74743    74789      +46
  Branches        1320     1320
============================================
+ Hits           47946    48380     +434
+ Misses         26409    26021     -388
  Partials         388      388
```
It seems there is a …
Force-pushed from f29f4e0 to 6de348d.
There was a copy-paste mistake in the code; I fixed it now. As before, if all the files are in the same directory at the same level there will be no advantage, or, as you said, if the whole tree consists of folders there will be no advantage.
Any easy chance to get rid of the recursion and use a queue (or similar) instead? The idea would be:
The problem is that I haven't found a "valid" implementation for the queue, mainly because I had trouble freeing the memory of the iterator at step 4. Worst case, we might need to implement our own queue, which doesn't seem worthwhile. If we need many changes for the iterator approach, I think we can keep the recursive solution.
Force-pushed from 1bd6785 to 5bf5b0d.
I re-implemented the recursive method in an iterative way. In addition, folders are now stored in an array queue, and processed folders are popped from the queue; this way we optimize memory usage for folders as well. I added 30000 folders in random locations to the test scenario. The peak memory usage is better than before, as expected.
Could you double-check that the traversal order makes sense? I initially expected a FIFO queue instead of LIFO, since that would traverse the tree level by level (if I didn't make any mistake). With the current code, I think you'll check the first level, then the last folder's contents, and from there go deeper with the "next" last folder's contents... which sounds weird. I don't really think this is important, so if it isn't easy to fix there is no need to change anything. Lower memory usage takes precedence over this.
That's why I used a stack. The currently implemented search algorithm is depth-first search; if we use a queue instead of a stack, it turns into breadth-first search. However, if you traverse all nodes, the time and memory complexity is the same for both, so there is almost no difference between them in our case.
Not entirely true. We'd need to add the files to the queue too for the algorithm to work as expected; right now, files are being processed while we're down in the tree. In any case, we need to process all the files, so the order doesn't really matter. It's more important to keep the memory usage as low as possible, and the current algorithm does a good job in that regard.
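For illustration, with a plain PHP array as the pending list, the only difference between the two orders discussed here is which end of the array the next folder is taken from (a sketch, not the PR's code):

```php
// LIFO (stack): take from the end => depth-first order, as currently implemented.
$next = array_pop($pending);

// FIFO (queue): take from the front => breadth-first, level-by-level order.
// Note that array_shift() reindexes the array and is O(n), one more reason
// to prefer the stack here.
$next = array_shift($pending);
```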
Taking into account that @karakayasemi says that the memory usage has gone down, I assume this has been tested (dev-wise) and it still works as expected.
Force-pushed from 5bf5b0d to f4ea8df.
Description
We are calling the recursive walkNodes function for each user when calculating file checksums (core/apps/files/lib/Command/VerifyChecksums.php, line 174 in c0da7c6).
In recursive calls, memory is not freed until the root call finishes. Because of that, the worst-case memory complexity of the current approach is the total size of all node objects of a user. Say the average node object is 1 KB: a user with 1 million files then exceeds the 1 GB PHP memory limit with the current approach.
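For context, the recursive pattern being described looks roughly like this (a minimal sketch against the OCP\Files API, not the actual VerifyChecksums.php code; the $onFile callback stands in for the checksum check):

```php
<?php
use OCP\Files\Folder;

// Minimal sketch of the recursive traversal. Every node fetched here stays
// referenced by a live stack frame until the root call returns, so none of
// the node objects can be garbage-collected early.
function walkNodes(Folder $folder, callable $onFile): void {
    foreach ($folder->getDirectoryListing() as $node) {
        if ($node instanceof Folder) {
            walkNodes($node, $onFile); // recursion pins the parent listings in memory
        } else {
            $onFile($node);            // stand-in for the checksum verification
        }
    }
}
```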
As an easy optimization of the current case, this PR first calculates checksums of the files in a directory and then unsets the processed files and folders to free their memory. The limitation then becomes the getDirectoryListing() method, which returns the node object list of a given directory: if a user has 1 million files in a single folder (without nesting), the 1 GB PHP memory limit is still exceeded. However, since there are already other usages of the getDirectoryListing() method, I guess we do not need to think about optimizing it. Reducing the worst-case memory complexity to the size of one getDirectoryListing() result should be sufficient.

Also, the PR improves the information messages of the command by showing the currently processed user and the result of the command run.
Related Issue
Motivation and Context
How Has This Been Tested?
- 10000 random files in the root of a user's home directory
- 10000 random files inside of a new folder
- 30000 randomly located folders in the user's home directory
- Run occ files:checksums:verify and measure peak memory usage with PHP's memory_get_peak_usage() method.

I applied the above scenario on both the current master and the PR's branch. The results confirm the improvement: peak memory usage is much lower on the PR's branch.
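For reference, the peak usage can be read at the end of a run with a snippet like this (generic PHP, not tied to the PR's code):

```php
// Real (OS-allocated) peak memory of the PHP process, reported in MiB.
printf("Peak memory usage: %.1f MiB\n", memory_get_peak_usage(true) / (1024 * 1024));
```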
Types of changes
Checklist: