RFC: various optimizations targeting 'rename' and 'compact' (but not only...) #621
base: master
Monitoring lfs_bd_read usage during time-consuming operations (e.g. rename) shows a very large number of uncached reads of small size (less than the cache size) at consecutive offsets. Implementing a read-ahead strategy when such consecutive reads are detected is an effective optimization.
Enabled with LFS_PERF_STATS define
Rename can be VERY time consuming. One of the reasons is the 4-level recursion depth of lfs_dir_traverse() seen if a compaction happens during the rename.

lfs_dir_compact() size computation:
  [1] lfs_dir_traverse(cb=lfs_dir_commit_size)
      - do 'duplicates and tag update'
    [2] lfs_dir_traverse(cb=lfs_dir_traverse_filter, data=tag[1])
        - Reaching a LFS_FROM_MOVE tag (here)
      [3] lfs_dir_traverse(cb=lfs_dir_traverse_filter, data=tag[1]) <= on 'from' dir
          - do 'duplicates and tag update'
        [4] lfs_dir_traverse(cb=lfs_dir_traverse_filter, data=tag[3])

followed by the compaction itself:
  [1] lfs_dir_traverse(cb=lfs_dir_commit_commit)
      - do 'duplicates and tag update'
    [2] lfs_dir_traverse(cb=lfs_dir_traverse_filter, data=tag[1])
        - Reaching a LFS_FROM_MOVE tag (here)
      [3] lfs_dir_traverse(cb=lfs_dir_traverse_filter, data=tag[1]) <= on 'from' dir
          - do 'duplicates and tag update'
        [4] lfs_dir_traverse(cb=lfs_dir_traverse_filter, data=tag[3])

Yet, analysis shows that levels [3] and [4] don't perform anything if the callback is lfs_dir_traverse_filter...

A practical example:
- format and mount a 4KB block FS
- create 100 files of 256 bytes named "/dummy_%d"
- create a 1024-byte file "/test"
- rename "/test" "/test_rename"
- create a 1024-byte file "/test"
- rename "/test" "/test_rename"

This triggers a compaction where lfs_dir_traverse was called 148393 times, generating 25e6+ lfs_bd_read calls (~100 MB+ of data).
With the optimization, lfs_dir_traverse is now called 3248 times (589e3 lfs_bd_read calls, ~2.3 MB of data)
=> x 43 improvement...
Dude this is seriously -amazing- work!!
This sounds really great! Do you think it's possible to implement optimisation (2) (and possibly (1) as well) at the block device layer? That is, not within littlefs itself but rather within the read/prog calls defined by the user?
@AlexanderCsr8904
Just two minor typos …
Co-authored-by: BenBE <BenBE@geshi.org>
Fix dynamic cache initialization.
Co-authored-by: GIldons <raulgildons@gmail.com>
Hi @invoxiaamo, this is a really great analysis and these are very creative solutions! Thanks for creating this PR and sorry I wasn't able to properly look into it until now. (3) seems immediately actionable. The original algorithm shouldn't be exceeding O(n^2), but I can reproduce what you're seeing, so it appears this was a mistake when translating the design to implementation. Would you be able to create a separate PR for just (3)? We would be able to merge it into the next release then. Sorry if you've already mentioned it and I'm just missing it, but I'm curious how these optimizations compare to setting
Hello.
We have been working with littlefs for a few months and have experienced a few long system stalls of up to 30 s, despite having solid hardware (250+ MHz ARM M33, lots of RAM, 50 MB/s NOR flash bus speed).
Our investigation quickly focused on the rename operations that are used for atomic file substitution.
The issues are similar to #327.
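To reproduce the issue easily, we found the following scenario, which is very representative of the problem:
- format and mount a 4KB block FS
- create 100 files of 256 bytes named "/dummy_%d"
- create a 1024-byte file "/test"
- rename "/test" "/test_rename"
- create a 1024-byte file "/test"
- rename "/test" "/test_rename" [1]
The second rename [1] triggers a compaction of the dir.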
We measured various performance counters:
- lfs_dir_traverse is called 148 393 times
- lfs_bd_read is called 25 205 444 times (for 100MB+ of data requested)
THIS IS HUGE for a 102 files / 28KB filled FS....
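Counters like the ones above could be gathered behind a compile-time switch, in the spirit of the LFS_PERF_STATS define mentioned in this PR. The following is only a minimal sketch under that assumption: PERF_STATS, perf_stats_t, PERF_ADD, demo_bd_read and demo_traverse are hypothetical names for illustration and are not taken from the littlefs sources.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical switch; set to 0 to compile the counters out entirely. */
#define PERF_STATS 1

/* Hypothetical counter set mirroring the figures quoted above. */
typedef struct {
    uint64_t traverse_calls; /* number of traverse invocations */
    uint64_t bd_read_calls;  /* number of block device read calls */
    uint64_t bd_read_bytes;  /* total bytes requested from the device */
} perf_stats_t;

static perf_stats_t perf;

#if PERF_STATS
#define PERF_ADD(field, n) (perf.field += (n))
#else
#define PERF_ADD(field, n) ((void)0)
#endif

/* Stand-in for a block device read: only the accounting is shown. */
static void demo_bd_read(uint32_t size) {
    PERF_ADD(bd_read_calls, 1);
    PERF_ADD(bd_read_bytes, size);
    /* the actual device access would go here */
}

/* Stand-in for a traverse that reads one 4-byte tag per entry. */
static void demo_traverse(unsigned tags) {
    PERF_ADD(traverse_calls, 1);
    for (unsigned i = 0; i < tags; i++) {
        demo_bd_read(4);
    }
}
```

With the switch disabled the macro expands to a no-op, so the instrumentation costs nothing in production builds.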
We provide here 2 cache improvements and a simple but very effective lfs_dir_traverse optimization.
All the optimizations are independent.
(1) The cache "read ahead" optimization simply requests a larger rcache fill when it detects that reads are continuous.
=> 66% reduction of IO operations
=> 20% increase of data transfers on SPI
=> 33% overall time reduction of SPI transfers (since the effective data transfer is only a small part of the SPI command sequence)
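The read-ahead idea can be illustrated with a small self-contained model. This is a sketch under stated assumptions, not the code from this PR: the in-memory flash array, the rcache_t layout, and the BLOCK_SIZE, CACHE_SIZE and MIN_FILL values are all hypothetical. A miss normally fills a small window, and a miss that starts exactly where the previous request ended fills the whole cache buffer instead.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096u /* hypothetical block size */
#define CACHE_SIZE 64u   /* hypothetical cache buffer size */
#define MIN_FILL   16u   /* hypothetical fill size for isolated reads */

static uint8_t flash[BLOCK_SIZE]; /* stand-in for one flash block */
static unsigned bd_reads;         /* number of device transactions */

static void bd_read(uint32_t off, void *dst, uint32_t len) {
    memcpy(dst, &flash[off], len);
    bd_reads++;
}

typedef struct {
    uint32_t off;   /* offset of the first cached byte */
    uint32_t size;  /* number of valid cached bytes */
    uint32_t next;  /* offset just past the previous request */
    int readahead;  /* non-zero enables the read-ahead policy */
    uint8_t buf[CACHE_SIZE];
} rcache_t;

static void rcache_init(rcache_t *c, int readahead) {
    c->off = 0;
    c->size = 0;
    c->next = UINT32_MAX; /* no previous request yet */
    c->readahead = readahead;
}

static void cached_read(rcache_t *c, uint32_t off, void *dst, uint32_t len) {
    assert(len <= MIN_FILL && off + len <= BLOCK_SIZE);
    int hit = off >= c->off && off + len <= c->off + c->size;
    if (!hit) {
        /* A miss that continues the previous request gets a larger fill. */
        uint32_t fill = (c->readahead && off == c->next) ? CACHE_SIZE : MIN_FILL;
        if (fill > BLOCK_SIZE - off) {
            fill = BLOCK_SIZE - off; /* do not read past the block end */
        }
        bd_read(off, c->buf, fill);
        c->off = off;
        c->size = fill;
    }
    memcpy(dst, &c->buf[off - c->off], len);
    c->next = off + len;
}

/* Reads the first 256 bytes in 4-byte steps; returns device transactions. */
static unsigned sequential_scan(int readahead) {
    rcache_t c;
    uint8_t tmp[4];
    rcache_init(&c, readahead);
    bd_reads = 0;
    for (uint32_t off = 0; off < 256; off += 4) {
        cached_read(&c, off, tmp, sizeof tmp);
    }
    return bd_reads;
}
```

In this model a sequential scan needs fewer, larger device transactions once read-ahead kicks in, which matches the trade-off described above: fewer IO operations at the price of somewhat more data transferred.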
(2) A "dynamic" block caching (enabled with the LFS_DYN_CACHE define) mallocs a whole block when we detect random consecutive reads on a single block. The RAM allocated is freed as soon as the user lfs_xxxxx operation returns. Consequently it should not increase heap usage as long as a single block is temporarily available in RAM.
=> 99.997 % reduction of IO operations (compared to (1)) - only 123 remaining block reads vs. 13872154
=> 99.993 % reduction of IO data transfer - only 15KB vs. 222MB.
When RAM (e.g. 4KB for our platform) is not an issue, this is a very interesting optimization.
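The dynamic block cache can be sketched in the same style. Again this is an illustrative model rather than the PR's implementation: dyncache_t, dyn_begin, dyn_read, dyn_end and the MISS_LIMIT threshold are hypothetical, and the flash is simulated in memory. After a few uncached reads on the same block, the whole block is copied to the heap once, and the copy is released when the operation ends.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE  4096u /* hypothetical block size */
#define BLOCK_COUNT 4u    /* hypothetical number of blocks */
#define MISS_LIMIT  4u    /* hypothetical threshold before caching a block */

static uint8_t flash[BLOCK_COUNT][BLOCK_SIZE]; /* simulated device */
static unsigned bd_reads;                      /* device transactions */

static void bd_read(uint32_t block, uint32_t off, void *dst, uint32_t len) {
    memcpy(dst, &flash[block][off], len);
    bd_reads++;
}

typedef struct {
    uint32_t block;  /* block targeted by the most recent reads */
    unsigned misses; /* uncached reads seen on that block */
    uint8_t *copy;   /* heap copy of the whole block, NULL when inactive */
} dyncache_t;

/* Called when a user-level operation starts. */
static void dyn_begin(dyncache_t *c) {
    c->block = UINT32_MAX;
    c->misses = 0;
    c->copy = NULL;
}

static void dyn_read(dyncache_t *c, uint32_t block, uint32_t off,
                     void *dst, uint32_t len) {
    assert(block < BLOCK_COUNT && off + len <= BLOCK_SIZE);
    if (block != c->block) {
        /* Different block: drop any copy and restart the miss count. */
        free(c->copy);
        c->copy = NULL;
        c->block = block;
        c->misses = 0;
    }
    if (c->copy == NULL && ++c->misses > MISS_LIMIT) {
        /* Too many small reads here: fetch the whole block once. */
        c->copy = malloc(BLOCK_SIZE);
        if (c->copy != NULL) {
            bd_read(block, 0, c->copy, BLOCK_SIZE);
        }
    }
    if (c->copy != NULL) {
        memcpy(dst, &c->copy[off], len);
    } else {
        bd_read(block, off, dst, len); /* fallback, also if malloc failed */
    }
}

/* Called when the user-level operation returns: the RAM is given back. */
static void dyn_end(dyncache_t *c) {
    free(c->copy);
    c->copy = NULL;
}

/* Issues n scattered 8-byte reads on block 1; returns device transactions. */
static unsigned scattered_reads(unsigned n) {
    dyncache_t c;
    uint8_t tmp[8];
    bd_reads = 0;
    dyn_begin(&c);
    for (unsigned i = 0; i < n; i++) {
        uint32_t off = (i * 1237u) % (BLOCK_SIZE - 8u);
        dyn_read(&c, 1, off, tmp, sizeof tmp);
    }
    dyn_end(&c);
    return bd_reads;
}
```

Falling back to direct reads when the allocation fails keeps the optimization optional, so a platform short on heap behaves as before.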
(3) Yet, despite (2), the overall CPU time only decreases by 75%, leaving the huge number of times lfs_dir_traverse is called.
We analyzed the call graph and saw that 4 levels of recursive calls of lfs_dir_traverse are made during compaction.
An O(n^4) is obviously a problem...
The 4 levels are due to the presence of an LFS_FROM_MOVE tag in the attribute walk list.
Yet, for the particular case of cb=lfs_dir_traverse_filter for [3] and [4], we saw that all the calls are useless.
Shortcutting this particular case reduces the number of lfs_dir_traverse calls by 97.8% (3248 vs 148393).
With all 3 optimizations, the overall compaction time on our platform drops by 98.3%.
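To see why skipping the two innermost levels matters so much, here is a deliberately simplified toy model. It is not littlefs code: it assumes the worst case where every one of n tags triggers a nested traversal at each level, and toy_traverse, commit_cb, filter_cb and count_calls are hypothetical names. The shortcut returns immediately when the callback is the filter and the recursion is at level 3 or deeper.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*traverse_cb)(void *data, unsigned tag);

/* Stand-ins for the commit and filter callbacks; they do no real work. */
static int commit_cb(void *data, unsigned tag) { (void)data; (void)tag; return 0; }
static int filter_cb(void *data, unsigned tag) { (void)data; (void)tag; return 0; }

static unsigned long calls; /* number of traversals actually executed */

/* Toy traversal: each of the n tags starts a nested filter traversal,
 * down to 4 levels, mimicking the call graph described in this PR. */
static void toy_traverse(unsigned n, unsigned level, traverse_cb cb, int shortcut) {
    if (shortcut && level >= 3 && cb == filter_cb) {
        return; /* nothing useful happens at these levels for the filter */
    }
    calls++;
    for (unsigned tag = 0; tag < n; tag++) {
        cb(NULL, tag);
        if (level < 4) {
            toy_traverse(n, level + 1, filter_cb, shortcut);
        }
    }
}

/* Runs one top-level traversal over n tags and returns the call count. */
static unsigned long count_calls(unsigned n, int shortcut) {
    calls = 0;
    toy_traverse(n, 1, commit_cb, shortcut);
    return calls;
}
```

In this toy model the number of traversals is 1 + n + n^2 + n^3 without the shortcut and 1 + n with it. The real gain depends on how many tags actually trigger the nested levels, which is why the measured figure above is 97.8% rather than the toy model's ratio.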