No zed alert if pool is degraded or if vdev unavailable #10123
Comments
After doing a little more research, it seems that this issue was brought up and closed as fixed back in 2017 per #4653. Granted, the issue in #4653 was for corrupted metadata, but the disk was still showing as 'UNAVAIL', as it is in my case. The last comment by @tonyhutter indicates that after this fix, zed must be running. I double checked using Is this possibly a bug that was resolved but has now reared its head again? |
Doing a little more testing on this with regard to faulting the drive on bad IOs, per #4653 above: I tried saving four VMs totaling about 1 TB to my pool, but still didn't receive any error from zed regarding the pool being degraded. |
I have been letting these VMs run for the last 5 days, but there's still no alert from zed regarding the pool being degraded. |
I just ran into this problem myself. The problem is caused by the statechange-notify.sh zedlet not handling the UNAVAIL vdev state. I have a modified version of the zedlet that I'm testing out. I'm planning to open a pull request with this change soon, assuming that it works correctly. |
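For context, the gist of the change being discussed is to widen the state check near the top of statechange-notify.sh. A rough sketch (not the exact upstream diff), assuming the zedlet sees the vdev state in the ZEVENT_VDEV_STATE_STR variable that ZED exports to zedlets:

```sh
# Sketch: the stock zedlet at the time only matched FAULTED and DEGRADED,
# so a vdev that went UNAVAIL (or REMOVED) never produced a notification.
if [ "${ZEVENT_VDEV_STATE_STR}" != "FAULTED" ] \
        && [ "${ZEVENT_VDEV_STATE_STR}" != "DEGRADED" ] \
        && [ "${ZEVENT_VDEV_STATE_STR}" != "REMOVED" ] \
        && [ "${ZEVENT_VDEV_STATE_STR}" != "UNAVAIL" ]; then
    exit 3   # not a state we care about; do nothing
fi
# ... otherwise fall through to the notification logic ...
```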
Hey Courtney! Glad to hear it wasn't just me experiencing this issue. I also looked at that script and am pretty confident I added a line to include the UNAVAIL state. I ended up resolving this with a script I found from a user on the Proxmox forums, which I run as a cronjob every 5 minutes to check the status. I also modified this script into a second one that runs once a night and emails me the status of the pool, just for good measure. Both are attached below. |
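For anyone wanting a similar stopgap while ZED isn't alerting, here is a minimal sketch of that kind of cron check (this is not the forum script itself; the recipient address and the reliance on a working local mail setup are assumptions):

```sh
#!/bin/sh
# Minimal pool health check, intended to run from cron every few minutes.
# `zpool status -x` prints "all pools are healthy" when nothing is wrong.
RECIPIENT="root"   # change to your alert address
STATUS="$(zpool status -x)"

if [ "${STATUS}" != "all pools are healthy" ]; then
    echo "${STATUS}" | mail -s "ZFS pool problem on $(hostname)" "${RECIPIENT}"
fi
```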
ran into this today... |
Just an update on this. Unfortunately I had a drive fail on me yesterday at around 11:45pm and was greeted with almost 3 dozen emails the following AM. I'm thankful to report that ZED reported my pool had an issue. I replaced the drive, ZFS resilvered, and about 3 hrs later everything's wonderful again. F@ck yeah @openzfs!!! Way to go! I'd also like to thank WD (@westerndigital) for making a small exception to their RMA process, since this drive technically failed on the LAST day within warranty but I didn't notice it until the next day (unfortunately I do need sleep). This is why I have always used, and will continue to use, WD drives. Thanks again @westerndigital!!

My first email stated that the drive in question was "faulted" due to "too many errors". My server is a Dell R515 (Proxmox 6.1-3, 128 GB RAM) with an 8-bay hot-swap backplane. As soon as I removed the failed drive from the system, I received another email from ZED stating that the device had been faulted. All in all, it seems that ZED is indeed working to a large degree, as this poor WD Gold drive is indeed toast. Just wanted to share my experience of a real-world ZED failure alert. Here's an excerpt from the (many) failure emails:

```
NAME  SIZE  ALLOC  FREE  CKPOINT  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH  ALTROOT
  pool: Vol1
...
errors: No known data errors
```

Excerpt from the all-clear email:

```
ZFS has finished a resilver:
  eid: 109
...
errors: No known data errors
```
|
I came across this today after experiencing the same issue. I decided to poke around cbane's GitHub to see if they had it in a repo somewhere, and sure enough they did: cbane@f4f1638 is the commit. It's pretty simple to add anyway. Hopefully this gets added to the repo. I would submit the pull request myself, but it's not my code, so I really don't want to butt heads. |
@cbane if you can open a PR with your fix I'd be happy to merge it. |
I just experienced this last night: one of my Intel SSDs in a ZFS mirror disconnected. The drives are used and seem to have been hammered by the previous owner. The SSDs were showing signs of issues, because I've been watching their wearout indicators slowly tick up over the last couple of months, so I have replacements on the way. I was surprised to find the array degraded with no email warning, though I do have emails configured and working. The Intel SSD just showed as unavailable, disconnected, not responding. That is actually a common failure mode for SSDs, and absolutely should be a mode that triggers an email, since it degrades the pool. For those interested, in my case I was able to manually pull the bad drive, plug it right back in, and it came back to life. I was able to offline/online the drive to trigger a resilver, and the pool went back to healthy with a clean scrub. Fingers crossed until the replacements come in. I'm going to manually apply @cbane's patch, but I hope someone else picks this up if @cbane is too busy; this is critical in my opinion. |
The more I think about this, the more I wonder: why is the statechange script checking only certain vdev states? At the very least, it should be reduced to checking for "not online" rather than specific states (for example, REMOVED is also not considered). I also wonder why this is only checking vdev state. Arguably, this script should only be checking pool status and sending emails based on the state of the pool, not individual vdevs in the pool. Unless I'm ZFS-ignorant and certain configurations can have unhealthy drives but still a healthy pool... |
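To illustrate that suggestion, a purely illustrative sketch of what a "not online" gate could look like in the zedlet (this is not the shipped code; it assumes the vdev state string is available as ZEVENT_VDEV_STATE_STR):

```sh
# Alert on anything other than a healthy vdev, instead of enumerating
# each individual bad state (FAULTED, DEGRADED, REMOVED, UNAVAIL, ...).
if [ "${ZEVENT_VDEV_STATE_STR}" = "ONLINE" ]; then
    exit 3   # healthy vdev, nothing to report
fi
# ... continue on to send the notification ...
```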
I was tearing my hair out for the last 30 minutes trying to figure out why my test system would not get "degraded" emails, while I did get emails once I reconnected the disk. I am shocked that this issue has existed since 2020 and still hasn't been fixed...
I've edited the statechange-notify.sh file by hand, way back when I filed #10123; however, even after editing and testing today on a new machine, I still don't get any sort of notice, aside from a resilver-finish event. I'm running Proxmox 7.0-13 and ZFS zfs-2.0.6-pve1~bpo10+1. Adding the changes in #12630 doesn't result in any notifications for me. If anyone would like to have me do further testing on this, I'm available to do so, as I have a multi-drive server available for testing, and would really like to see this (critical, IMO) bug squashed. |
@exitcomestothis I run Proxmox 7 with ZFS as well and found the same problem a couple of months ago. I reviewed my notes and noticed that, apart from applying cbane@f4f1638, I changed these lines in zed.rc:

```diff
-ZED_NOTIFY_VERBOSE=0
+ZED_NOTIFY_VERBOSE=1
-ZED_SYSLOG_SUBCLASS_INCLUDE="checksum|scrub_*|vdev.*"
+ZED_SYSLOG_SUBCLASS_INCLUDE="*"
```
|
With this in place I DON'T get a notification when a drive is disconnected. Only when I add UNAVAIL, as per this patch, does it work. |
I just did another test install and added these changes: cbane@f4f1638 |
Yeah, I assumed this patch cbane@f4f1638 was already applied. I edited the original comment to clarify it. |
So it is not fixed in Proxmox 7.1? I just tested it myself and I don't get mails for disconnected drives, but I do for resilvering... Anyway, an awesome piece of software: I could disconnect and reconnect on the fly while moving big files, with no corruption in the end. But an alert would still be nice. |
This fix was included as of the OpenZFS 2.1.3 release. Commit 487bb77. |
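If you're not sure whether the packages you're running already carry that change, one quick way to check (assuming the default zedlet location of /etc/zfs/zed.d, which is what most distro packages use) is to look for the UNAVAIL case in the installed script:

```sh
# Prints a matching line if the statechange zedlet knows about UNAVAIL;
# no output suggests the fix is not present in this installation.
grep -n "UNAVAIL" /etc/zfs/zed.d/statechange-notify.sh
```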
I am running Proxmox. After taking the only disk out of a hot-swap bay in a single-disk pool, I see the same missing alert. Should I report this here, or over at the Proxmox community? I mean, they seem to be using their own flavor of the packages.
This is either a resurrection bug with more lives than a cat, or people (including myself) keep misunderstanding the expected behavior of ZED alerts for degraded ZFS pools. I'm also expecting, but not receiving, the degraded-pool email alert when I boot my system after removing a disk from a pool (which makes it degraded), while I do get the alert after resilvering completes, as expected (well, most of the time; exceptions are below). Here is a recent and thorough series of manual tests from the Proxmox community that wraps it up very well:

Note that the resilver-completed alert is also not always sent if the resilver was very quick. I've seen this on my system recently, e.g. a resilver that took 5 s did not trigger an alert email, whereas a longer resilver sent the email as expected. I'm running proxmox-ve: 7.3-1 (running kernel: 5.15.102-1-pve) and zfs-zed 2.1.9-pve1 |
System information
Describe the problem you're observing
If I physically remove a drive from the array to simulate a disk completely failing, ZED does not send an alert about the pool being in a degraded state. My system is configured to send emails properly, and I receive emails on scrub completion and resilver completion, just not for issues like this.
I see that there are some other people with this issue, but those reports have been for "offline" vdevs and not "unavail" vdevs.
Is there an option for zed to alert if the pool is degraded, regardless of how it was degraded?
I'm using a Dell R515 server with 128 GB ECC RAM; there are 8x WD RE drives connected to a PERC H310 that's been flashed to IT mode.
zed.rc config:
zpool status - before drive removed
zpool status - after drive was removed
Describe how to reproduce the problem
Create a raidz2 array within Proxmox; load data onto the array (I'm running just one Linux VM); power off the system and remove a drive; power on the system.
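A rough command-line sketch of that reproduction, in case it helps someone test outside the Proxmox UI (the pool name "tank" and the device names are placeholders, and the disk removal itself is a physical step that can't be scripted):

```sh
# Create a raidz2 pool from six disks (device names are placeholders).
zpool create tank raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg

# Load some data, power off, physically pull one drive, power back on.
# After boot the pool should show DEGRADED with the missing vdev UNAVAIL:
zpool status -v tank
```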
Include any warning/errors/backtraces from the system logs
Syslog after drive removed