Random LZ4_uncompress_unknownOutputSize error #12775
Comments
I'm guessing wildly, but what does |
One of the systems has this:
|
I hate being right (or at least, not proven wrong). The only bug I know of in recent memory that can cause what looks for all the world like "hey, this buffer suddenly turned into a NULL pointer when I was midway through operating on it, after it passed various points where it should have crashed first if it were NULL" is #11679, which requires encryption. (The bug, as I currently understand it, basically involves the reference count on a buffer being screwed up, so two places think they have the only pointer to the buffer, and then (among other possible fun outcomes) if one of them frees or destroys it, Bad Times Ensue.) I do not have any other proof of this being the case, so please don't take this as definitive, but that would be my "thing to investigate next" if I were picking bets at the moment, since I haven't seen any other reports of this turn up. |
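To make the suspected failure mode concrete, here is a toy model of a miscounted reference, written in Python purely for illustration. The `Buffer` class, `read_first_byte`, and `demo` are hypothetical names and are not OpenZFS code; this only sketches the general "two holders, one reference" pattern described above, not the actual mechanism behind #11679.

```python
class Buffer:
    """Toy reference-counted buffer (hypothetical; not an OpenZFS type)."""

    def __init__(self, data: bytes):
        self.data = bytearray(data)
        self.refcount = 1  # the creator holds the first reference

    def hold(self) -> "Buffer":
        # A second user must take its own reference before using the buffer.
        self.refcount += 1
        return self

    def release(self) -> None:
        self.refcount -= 1
        if self.refcount == 0:
            self.data = None  # stands in for freeing the memory


def read_first_byte(buf: Buffer) -> int:
    # Stands in for a routine that is midway through reading its source buffer.
    if buf.data is None:
        raise RuntimeError("use after free: buffer data is gone")
    return buf.data[0]


def demo(take_hold: bool) -> str:
    owner_a = Buffer(b"compressed bytes")
    # Correct code takes a hold for the second user. The buggy path skips it,
    # so both users believe they own the only reference.
    owner_b = owner_a.hold() if take_hold else owner_a

    owner_a.release()  # the first user is done with the buffer
    try:
        read_first_byte(owner_b)  # the second user is still mid-operation
        return "ok"
    except RuntimeError as exc:
        return str(exc)
```

With `demo(True)` the second holder keeps the buffer alive; with `demo(False)` the first release frees it while the second user is still reading, which is the toy equivalent of a buffer that appears to turn into a NULL pointer partway through an operation.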
Thanks a lot for your insight, we'll follow #11679. For completeness, we noticed a slightly different trace in each of the 3 incidents.
Incident 1:
Incident 2:
Incident 3:
|
Ah, a 4th incident, on the same machine as incident 2. The BUG message matches (unable to handle page fault for address)
|
We have hit the same issue today.
One noticeable operation that was running at that moment (and froze) was a recv of an unencrypted dataset.
|
Curious. Does it repeat every time if you receive that same dataset? |
No. It has been running every 15 minutes for months, and it happened only once so far. |
Hello, I would like to re-launch the discussion here. We have moved forward with our ZFS deployment on Ubuntu 20.04 / 22.04, and we have hit several tracebacks leading to a freeze of the zpool.
Incident 5
|
Incident 6
|
Incident 7
|
In the end, I don't know how / if these are related to #11679. We use an encrypted dataset, but we do not use the zfs send / receive feature. Although these crashes are uncommon, the increasing number of machines involved makes them appear more and more frequently. |
Incident 8
|
Incident 9
This one occurs the most frequently. The only reference I can find is #10401
|
Incident 10
This one didn't cause the system to hang, but apparently led to a corrupted file.
Few seconds later:
|
Incident 11
Occurrences: 1
Moreover:
|
Incident 12
Happens more and more frequently since the layout change.
|
This should be fixed by 13f2b8f in 2.1.6. That said, many (all?) of these backtraces represent different issues. They probably should be filed as such. |
Good to know 👍
I'm aware of this, at this point I'm just using this issue to collect all the call traces we have with ZFS for further analysis. |
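For anyone collecting call traces in the same way, a minimal Python sketch for grouping them by faulting function might look like the following. It assumes the traces come from x86-64 kernel logs that contain the usual `RIP: <segment>:<symbol>+0x<offset>/0x<size>` line; the function name `tally_faulting_symbols` is made up for this example. Note that a single oops can print its RIP line more than once, so the counts are per matching line, not per incident.

```python
import re
from collections import Counter

# Matches the faulting symbol on a kernel "RIP:" line of the form
# "RIP: 0010:some_function+0x1a/0x200 [module]".
RIP_RE = re.compile(
    r"RIP: [0-9a-f]{4}:([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def tally_faulting_symbols(log_text: str) -> Counter:
    """Count RIP lines per faulting symbol in a kernel log dump."""
    counts = Counter()
    for line in log_text.splitlines():
        match = RIP_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it the saved output of the kernel log from each affected machine gives a rough view of which functions the traces cluster around, which may help when deciding how to split them into separate issues.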
Incident 13
Occurrences: 1
|
Incident 14
Occurrences: 3
|
@nim-odoo have you considered trying a FreeBSD box as a test? I'd be interested to know if you get similar problems. This issue is pretty horrifying. I got a panic with a similar looking stack trace to the first one here, but the address in my case was In my case I am also not using send/receive, but unlike you I am not using encryption. (<- @rincebrain)
Fortunately this has only happened to me once so far, but it's put a damper on my enthusiasm towards ZFS. |
Thanks for the reply. Yes, mine happened on Ubuntu 22.04 (filesystems created on 20.04 and later upgraded). |
System information
Describe the problem you're observing
On our system, we observe kernel errors due to LZ4_uncompress_unknownOutputSize. It happens very rarely: 3 times within a 6-week period, on 3 different servers (we have a total of 20 bare-metal servers at provider A). It hasn't happened (yet) on the virtualized servers (24 servers at provider B).
The main issue is that the complete zpool becomes unavailable: the services using it cannot be restarted, and there is no way to unmount/remount the datasets. Even a soft reboot doesn't succeed: the only solution is a hard reboot.
Therefore, the question is: are we observing the cause or the consequence of a problem? How can this LZ4_uncompress_unknownOutputSize situation be triggered?
Moreover, is there a way to recover the zpool without a hard reboot? It seems that when a kernel traceback occurs, our only solution is a hard reboot (we already faced this with zfs send / receive).
Describe how to reproduce the problem
It happens randomly on very rare occasions. We don't have a way to reproduce it.
Include any warning/errors/backtraces from the system logs