rm / remove / delete a large file causes high load and irresponsiveness #4227
@michaelxwang do you have lz4 compression enabled? A file of zeros would compress to almost nothing, which would explain the fast dd case; the "performance" for the other files with non-zero content, however, is concerning. Referencing: #3976 (High CPU usage by "z_fr_iss" after deleting large files). Judging from the comments in that issue, lock contention (held locks) could also be the underlying issue; thus also referencing: #4106 (ZFS 0.6.5.3 servers hang trying to get mutexes).
Yes, I have lz4 enabled:
@michaelxwang please post the output of zcat /proc/config.gz | grep -i preempt to see which preemption model the underlying kernel uses.
How is the system holding up during that delete operation? Still responsive? Also, could you do a quick inspection of what iotop shows? Any lvm, device-mapper, etc.? How many snapshots exist on that pool? How many on the system in general? Is deduplication enabled? Hardware specs? (RAM and pool capacity are crucial.) What is the output of /proc/spl/kstat/zfs/arcstats?
Also, what ashift was used during pool creation? What hard drives, in what configuration, are used?
(01) Which preemption model does the underlying kernel use (zcat /proc/config.gz | grep -i preempt)? I do not have a /proc/config.gz file; my kernel is "3.19.0-43-generic #49~14.04.1-Ubuntu SMP".
(02) How is the system holding up during that delete operation? The system did not crash but was not responsive for some commands, for example "vi README.txt".
(03) What does iotop show? iotop shows very little activity.
(04) Any lvm, device-mapper, etc.? I did not use lvm, device-mapper, etc.
(05) How many snapshots exist on that pool? How many on the system in general? Zero and zero.
(06) Hardware specs (RAM, pool capacity are crucial):
(07) What is the output of /proc/spl/kstat/zfs/arcstats?
(08) What ashift was used during pool creation?
(09) What hard drives, in what configuration, are used? In this test, it is Amazon AWS EBS, but the problem happened with
(10) Other info:
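On Ubuntu kernels that lack /proc/config.gz, the build config is shipped under /boot, so (01) can still be answered:

```sh
# /proc/config.gz needs CONFIG_IKCONFIG_PROC; Ubuntu ships the config under /boot instead
grep -i preempt "/boot/config-$(uname -r)"
# stock Ubuntu server kernels typically show CONFIG_PREEMPT_VOLUNTARY=y
```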
Okay, so the memory pressure can't be too high; lots of memory appears to still be free. Please post the output of dmesg.
dmesg does not have the info:
I do not know if this is related or not, but I encountered the same problem, yielding stuck processes on delete. Could you please check if you have messages similar to these in your dmesg or syslog?
I'm on 0.6.5.4 too.
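Messages of the kind being asked about are typically the kernel's hung-task warnings; a quick way to check for them (paths assume a stock Ubuntu syslog setup):

```sh
# hung-task warnings and ZFS thread stalls usually show up here
dmesg | grep -iE 'blocked for more than|hung_task|txg_sync|z_fr_iss'
sudo grep -iE 'blocked for more than|txg_sync' /var/log/syslog
```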
@michaelxwang I'm most interested in the CPU being used by the sync task. I'd suggest running perf top.
@dweeezil Here is the output of "perf top":
@lnxbil I do not have any messages related to this in either dmesg or syslog.
@michaelxwang I think a flame graph would be helpful here. If you're not familiar with the steps, here's how:
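A typical recipe, assuming perf is installed and Brendan Gregg's FlameGraph scripts are checked out:

```sh
# sample all CPUs at 99Hz for 60 seconds while the rm is running
sudo perf record -F 99 -a -g -- sleep 60
# render the samples into a flame graph
git clone https://github.com/brendangregg/FlameGraph
sudo perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > fg.svg
```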
And then post fg.svg to a paste site. Hopefully it will shed some light as to where all the calls to
@dweeezil Here is the flame graph: http://wikisend.com/download/745082/fg.svg Thanks!
@michaelxwang That's not showing terribly excessive CPU usage by the sync task at all; the sync task is only 1.53% of the total time. Looking back at your original issue report, I see now that there likely wasn't very high CPU load, but instead a lot of blocked zio processes. My guess would be the
I think a reference to #3976 might be in order.
@dweeezil It is an Amazon memory-optimized EC2 instance, "r3.8xlarge":
with details at https://aws.amazon.com/ec2/instance-types/. There are 32 vCPUs; they are all identical, and the last one is:
So I guess it has 32 threads at least.
@michaelxwang The kernel hangcheck timer backtrace of the
Data from
There might be a bug in the code that keeps us from honoring
That funclatency tool is from brendangregg/perf-tools#43. Here is a direct link: https://github.com/ryao/perf-tools/blob/funclatency/kernel/funclatency
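A usage sketch for that tool, assuming it follows the conventions of the other perf-tools scripts (the flags and traced function below are guesses; check the script's -h output):

```sh
# grab the funclatency branch of the perf-tools fork linked above
git clone -b funclatency https://github.com/ryao/perf-tools.git
cd perf-tools/kernel
# hypothetical invocation: trace the latency of one kernel function for 10 seconds
sudo ./funclatency -d 10 txg_wait_open
```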
I'd be interested to know what
If the number of hotplug CPUs shows up as some very large number, you can use the
@michaelxwang In addition to my previous comment, what is the output of
As for the other contributors posting here, here is some food for thought. If we have the problem of too much dirty data, is there any reason why we could not put the file in the unlinked set and then free the file contents in the background asynchronously, at a more manageable rate?
@dweeezil Why would extreme overhead in creating and destroying various objects cause everything to stop until the
@dweeezil Here is the smpboot output:
@ryao My thinking is that it might slow txg commits sufficiently to cause the problem you've described above. As to the unlinked set, why would this operation be bypassing it? The
@michaelxwang That's not as much overshoot as I expected to see, but it's still excessively high. Even if it doesn't help this problem, you probably want to be booting with
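The boot parameter meant above is presumably one of the standard options for capping possible CPUs (possible_cpus= or nr_cpus=, though that is an assumption). To inspect how far the possible-CPU count overshoots the hardware:

```sh
# how many CPUs the kernel considers possible vs. actually present
cat /sys/devices/system/cpu/possible
cat /sys/devices/system/cpu/present
dmesg | grep -i smpboot
```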
@michaelxwang Since you mentioned MySQL, has the recordsize been tuned on this filesystem? I'm wondering how many blocks might be involved. If the recordsize is very small, there's a whole lot more to free.
@dweeezil I believe by recordsize you are referring to the
@michaelxwang I was referring to the ZFS filesystem recordsize, which you can retrieve with zfs get recordsize.
@dweeezil The datafile that I am deleting via DROP TABLE inside MySQL, and the copy of the datafile created by the cp command that I am deleting via rm, both reside in the data filesystem, which has recordsize 16K:
@michaelxwang The 16K recordsize will certainly amplify the number of blocks which need to be freed.
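Illustrative arithmetic for that amplification, using the ~300GB table mentioned earlier:

```sh
# L0 (data) blocks to free = file size / recordsize
echo $(( 300 * 1024 * 1024 / 16 ))    # 300GB at recordsize=16K   -> 19,660,800 blocks
echo $(( 300 * 1024 * 1024 / 128 ))   # 300GB at the 128K default ->  2,457,600 blocks
```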
I've finally got a decent test scenario set up. The short story is that deleting a giant file with a 16K recordsize takes a long time. If, however, the file is currently open by another process, the unlinked set comes into play and the removal is instantaneous. It seems that, as @ryao suggested, there ought to be a way to always use the unlinked set. Maybe only for very large objects? I suspect the reason it's not used is that most people expect the space to be immediately available for re-use once a delete is "finished". I'm not sure how these operations could be accelerated. My test used a 1.7TB file with a 16K recordsize. In addition to the 103M L0 (data) blocks, there are many, many indirect blocks which need to be freed, and to make matters worse on a low-IOPS pool, they're all dittoed, which doubles the IO load. Furthermore, if there are any snapshots, deleting a file of this size requires writing a couple of gigabytes to a deadlist.
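A minimal sketch of that experiment, assuming a pool mounted at /tank and a scaled-down file size:

```sh
# case (a): plain unlink -- the frees happen in the foreground and rm blocks
dd if=/dev/urandom of=/tank/big bs=1M count=10240
time rm /tank/big

# case (b): hold the file open so the unlink goes through the unlinked set
dd if=/dev/urandom of=/tank/big bs=1M count=10240
exec 3</tank/big     # keep an open file descriptor in this shell
time rm /tank/big    # returns almost immediately
exec 3<&-            # dropping the last reference frees the blocks in the background
```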
@ryao This is to respond to your post on dmu_tx. I ran a loop to capture the rm process, uptime, and the content of dmu_tx every 15 seconds: before the rm started, during the rm (while the process was visible), and after the rm finished. The load reached as high as 49, and it started to drop before the rm process finished. Across my repeated tests, sometimes the load reached this high a number; sometimes it went up but remained in the single digits. I noticed that during the rm operation the dmu_tx_assigned counter increased sharply. Is this test conclusive? Can you interpret the results further?
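The capture loop was presumably something of this shape (a sketch; the original script was not posted):

```sh
# sample load and DMU transaction kstats every 15 seconds around the rm
while true; do
    date
    uptime
    ps -ef | grep '[r]m '             # is the rm still running?
    cat /proc/spl/kstat/zfs/dmu_tx    # dmu_tx_assigned and friends
    sleep 15
done | tee dmu_tx_trace.log
```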
@michaelxwang Did you post the output of
@dweeezil and @ryao I have not posted the
The data resides on vol1 for the test I am doing. It is huge:
vol1 is an Amazon io1 EBS volume with Provisioned IOPS (SSD): http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html. The doc says it "Consistently performs at provisioned level, up to 20,000 IOPS maximum", and we did provision it at the maximum 20,000 IOPS.
@dweeezil there was some additional logic in
@behlendorf Very interesting. I actually did poke around a bit in Illumos to see if things were different but I wasn't looking in the right place.
We've spent some time on the large file delete problem at @Nexenta, and this sounds like it could be related. The core issue I've found is that there is no throttle on how many deletes get assigned to one TXG. As a result, when deleting large files we end up filling consecutive TXGs with deletes/frees, and then write-throttling other (more important) ops. There is an easy test case for this problem: try deleting several large files while you do write ops on the same pool. What we've seen is that performance of these write ops (let's call it sideload I/O) drops to zero. More specifically, the problem is that dmu_free_long_range_impl() can/will fill up all of the dirty data in the pool "instantly", before many of the sideload ops can get in, so sideload performance will be impacted until all the files are freed.
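A sketch of that test case, assuming a pool mounted at /tank and some pre-created large files:

```sh
# sideload: a steady stream of modest synchronous writes, reporting throughput
( while true; do
      dd if=/dev/zero of=/tank/sideload bs=128k count=100 oflag=sync 2>&1 | tail -n1
      sleep 1
  done ) &
SIDELOAD=$!

# now delete several large files and watch the sideload throughput collapse
time rm /tank/bigfile1 /tank/bigfile2 /tank/bigfile3
kill "$SIDELOAD"
```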
@alek-p I'm not totally sure what the original problem was in this issue. Is it that the unlink blocks for a long time, or is it its impact on other writes? I've been assuming the former, but the problem you're describing sounds like the latter, for which your solution sounds good. Large deletes also cause stress because they require writes to deadlists if there are any snapshots.
@dweeezil I should have mentioned that we have a separate patch for the unlink blocking problem. It calls zfs_inactive() asynchronously for files bigger than 64 GB.
@ryao Here is the funclatency output, with comments in /* ... */:
This post is to confirm a fact that @dweeezil mentioned. In one session I did a
@dweeezil In your test scenario, have you reproduced both problems: (a) removing a large file takes long; (b) other processes are blocked while the deletion is in progress? To me, they are the same issue. While deleting a large file, I cannot vi a new file on the same filesystem, but I can vi a file on another filesystem. This tells me that this is not a global CPU or memory issue but a filesystem IO issue. Thanks for looking into this problem.
referencing: #3725 (comment)
I was able to get to this last night, see openzfs/openzfs#61
@kernelOfTruth Users performing even ONE
referencing: #4259 Illumos zfs_remove() patches
Thanks @kernelOfTruth, I meant to link this. The patches in #4259 are lightly tested, but they should help in part with this issue by making the unlinks of very large files asynchronous even if no other process has the file open. We should definitely work on porting the DMU bits of @alek-p's patch as well. The whole thing isn't completely applicable because of significant VFS differences, but throttling the number of deletes in a txg is definitely a good idea.
The biggest issue I see with the VFS differences is that there seem to be no VFS flags (particularly VFS_UNMOUNTED). Is there an existing way to tell that a filesystem is unmounting/unmounted? I can port the delete-throttle patch by itself to start with, but that exposes the issues I've mentioned in the pull request.
@alek-p right, this is one of those small but critical differences in the Linux VFS model. Most notably, there isn't a 1:1 mapping between the mounted namespace and the filesystem super block (zfsvfs_t). There is a single super block for the filesystem, but there can be multiple mounted namespaces for that super block, which means there isn't a simple mapping back to a single mounted namespace. This is managed in the super block with the usual sort of reference-counting scheme, with the namespaces each taking a reference. Once they all drop their references, the super block can be unmounted/destroyed. Additionally, file handles hold references on dentries, which hold references on inodes, which hold references on the super block. All these references need to be dropped before the filesystem can be unmounted. I need to spend some time looking at your patch, but it may be that this case just can't happen under Linux.
@michaelxwang I was tied down chasing bugs at work last week. As a belated reply, we probably could use data from
@alek-p To elaborate on what @behlendorf said, the tree of mounted filesystems is a per-process thing on Linux that is inherited across
I'm running into an apparently similar problem with txg_sync etc. when I rm -rf a large tree of files: if I don't pause the rm process every few seconds, txg_sync and friends start taking some CPU (but doing little actual writing) and the load starts to skyrocket, even though actual CPU usage is still rather low (90%+ idle). If I do nothing, deleting 50k or so files/directories may lead to a total system hang it never recovers from. Sometimes the hang comes some time after rm has actually completed. I seem to have 24 of each of z_fr_iss_N, z_rd_int_N, and z_wr_int_N, where N is 0..7 (as it is an 8-thread CPU). In my case dedup is enabled; I'm not sure if this is part of the problem (I assume the DDT must be searched and refcounts reduced, and then particular block hashes removed if the refcount is zero, etc.). In total there are 655 z_* processes.
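A crude workaround in the spirit of the manual pausing described above (batch size and sleep interval are arbitrary; tune to taste):

```sh
# delete a big tree in batches, pausing so txg_sync can drain between bursts
n=0
find /tank/tree -type f -print0 |
while IFS= read -r -d '' f; do
    rm -f -- "$f"
    n=$((n + 1))
    if [ $((n % 1000)) -eq 0 ]; then
        sleep 5
    fi
done
```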
@jonathanvaughn: I suspect, given that you have dedup turned on, you are running into #3725, wherein I noted:
@nwf, @jonathanvaughn, @ryao, @behlendorf, @alek-p, @kernelOfTruth, @dweeezil and @lnxbil: As you can see from the output below, I had the problem in ZFS 0.6.5.4 (rebooted after installation of this version), and it is resolved in ZFS 0.6.5.7 (after that version was installed and rebooted). The problem could have been resolved in version 0.6.5.6 or 0.6.5.5, but I do not know, as I did not reboot and test after those versions; the ZFS version only takes effect after a reboot. Can you tell me from the release notes which fix resolved the problem? If the problem is still not resolved (this issue is not marked resolved as of this post), then mine must be a different problem. I always had dedup disabled from the beginning. Thank you for your attention to the problem.
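A quick way to confirm which ZFS module is actually loaded after a reboot (the package query assumes Debian/Ubuntu):

```sh
# version of the loaded zfs kernel module (changes only take effect on reload/reboot)
cat /sys/module/zfs/version
modinfo zfs | grep -iw version
dpkg -l | grep zfs    # installed package versions on Debian/Ubuntu
```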
In reference to @behlendorf's comment earlier: I had been experiencing the issue where deleting large files caused a performance problem resulting in the NFS server for the ZFS filesystem becoming unresponsive. This was a very repeatable issue for me. I rsync files from several servers as backups, so there are multiple copies of essentially the same files, grouped by date. I would delete the oldest set and see "NFS server not responding" messages every time. I applied the fixes from commit 1a04bab and have been running batch deletes all week without the NFS server becoming unresponsive once. Although not a rigorous test, I thought the information might be helpful.
Closing. This issue was resolved by the patches referenced above for the latest 0.6.5.x versions and 0.7.
@behlendorf We are experiencing this exact issue right now, against ZFS pools on single EBS volumes. In our case, I was cleaning up PostgreSQL WAL archives on a few machines (in one case, 255 directories with 255 files inside). This caused all IO on the EBS volume to stop for an extended period (several minutes). I had to cancel the operation, and instead destroy the dedicated ZFS filesystem and re-create it. We are using 0.6.5.11-1~trusty (packaged from https://launchpad.net/~zfs-native/+archive/ubuntu/stable).
@davidblewett you may have gotten bit by the large file delete problem; see #5706.
What constitutes a large file? I'm deleting folders with about 12 or so 800MB files, and I'm also seeing very high IO in
This is on
Here is a demonstration of the problem:
The top processes are things like:
The zfs version I have is 0.6.5.4 (rebooted after installing the package):
The "junk" file is copied from a MySQL database file. I tried creating a large file with dd:
but as seen, it is removed without issue. Obviously, dd'ing a file and cp'ing a file are different.
I originally encountered the problem when dropping a large table (300GB) in a MySQL database, and found it is due to the removal of the database file, so rm of a large file is the more basic issue.
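A condensed sketch of the reproduction (pool name, paths, and sizes are illustrative). The dd case is most likely fast because a file of zeros under lz4 compresses to almost no allocated blocks, leaving little to free:

```sh
# problematic case: a large file with real (incompressible) content
cp /var/lib/mysql/mydb/bigtable.ibd /tank/junk    # hypothetical source path
time rm /tank/junk     # drives load up; other IO on the same filesystem stalls

# benign case: zeros compress away under lz4, so there is little to free
dd if=/dev/zero of=/tank/junk2 bs=1M count=307200
time rm /tank/junk2
```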