SIGBUS error with 1.2.2 #8099
Comments
They're not on K8S, but are they using containers at all, or running bare metal? This is the same customer which had too many segments memory mapped, right? Could it be they have so many they ran out of memory to address? 😄 Otherwise, regarding potential alignment issues, it would be interesting to know what CPU arch they run on, but it's not crazy for us to enforce page alignment in general (it would probably even be beneficial to do so), so we could do it regardless of the issue. There are other cases where this can occur which might point to a bug: if a file is truncated while mapped, further reads may cause SIGBUS in some JVMs (I believe this was patched, but I don't know in which versions and whether it was backported).
So we have no alignment? 🤔 You can see the CPU in the HotSpot error report file.
No, but I'm not sure the alignment is the problem in this case (though I could be wrong). With mmap'd files, SIGBUS more likely means we're writing/reading outside of the buffer. This can mean there was not enough virtual address space (though the SIGBUS would happen when we map the file, not on append), the file was truncated concurrently (unlikely), we're writing to a buffer that was since unmapped (though I would expect a SIGSEGV there), or there was some weird I/O error that wasn't well handled by the JVM (e.g. writing to network storage which was detached concurrently). See https://www.sublimetext.com/blog/articles/use-mmap-with-care#:~:text=SIGBUS%20(bus%20error)%20is%20a,we%20failed%20to%20read%2Fwrite for example.
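For illustration, here is a minimal sketch (mine, not from this thread) of the truncation case in Java; the class and file names are made up. Depending on the JVM version, the final write may crash the process with SIGBUS or surface as an InternalError:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public final class TruncatedMappingRepro {
  public static void main(final String[] args) throws IOException {
    final Path file = Files.createTempFile("segment", ".bin");
    try (final FileChannel channel =
        FileChannel.open(file, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
      // mapping past the current size grows the file to 4 KiB
      final MappedByteBuffer buffer = channel.map(MapMode.READ_WRITE, 0, 4096);
      // shrink the file while the mapping is still live
      channel.truncate(0);
      // touches a page with no backing storage anymore: SIGBUS territory
      buffer.put(0, (byte) 1);
    }
  }
}
```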
What I can see from the customer data is that the broker which received the SIGBUS ran out of disk space and was not able to delete data on several partitions; on other partitions it was able to delete data. No exporters are configured. We see endless log statements of:
SIGBUS message which includes CPD and jdk
@deepthidevaki @npepinpe can we have an endless loop in deleting, or why else would the same log be printed for the same partition so often, milliseconds apart?
Log deletion is not retried, so I expect this log only once, when a new snapshot is taken or received.
Oh... wait! Maybe we try to compact when Raft detects it is out of disk (OOD)? Maybe in LeaderRole or FollowerRole.
@Zelldon FYI OOD may be due to this issue #7767
Please timebox some effort to root cause it, then put it back in triage when we have more information on this.
SIGBUS seems to happen with Java in connection with a full disk.
From the looks of it, it's usually because it's memory mapping the files in /tmp - which succeeds, but when it tries to write, it fails. I'm not sure why it's a SIGBUS and not a regular exception, though.
Could it be that if the disk is full and we end up in an endless loop like #8103, we print so many logs that we fill up /tmp as well, and end in a SIGBUS because the JVM fails with the full /tmp?
Could be! I would feel a bit better about this than knowing we have to handle SIGBUS with our own memory mapped files 😄
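For context: a well-known file the JVM memory maps under /tmp is its perf data file (/tmp/hsperfdata_<user>/<pid>). If that is the mapping at fault here - an assumption, not something confirmed in this thread - it could be ruled out by disabling perf data:

```sh
# Assumes the /tmp mapping is the JVM's hsperfdata file; disabling perf data
# removes that mapping entirely (at the cost of jps/jstat no longer seeing
# the process). The jar name is just a placeholder.
java -XX:-UsePerfData -jar zeebe-broker.jar
```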
Was able to reproduce it.
After failing once it seems to fail in a loop; I can't access the pod anymore.
HOW: I set up the cluster (without Elastic and without the Elastic exporter), ran a benchmark, checked who is a follower, and created several big files to fill the disk.
Unfortunately the error log is empty after several restarts 🤷
Are you filling the data volume or /tmp?
Please see above, the data volume.
So writing to a memory mapped file when there is no space doesn't produce an error. At any rate, the fix is then to pre-allocate files. We've already discussed this: pre-allocate the files before memory mapping them to ensure we have enough disk space and don't need to deal with weird errors. See #7607. EDIT: also, nice job on the quick reproduction 🎉
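A minimal sketch of what such pre-allocation could look like (illustrative only, not the actual patch; class and method names are made up): write blocks of zeroes up to the segment length before mapping, so a full disk surfaces as an IOException at allocation time instead of a SIGBUS on append.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

final class SegmentPreallocator {
  private static final int BLOCK_SIZE = 4 * 1024;

  /** Reserves disk space by writing zeroes until the file reaches the given length. */
  static void preallocate(final FileChannel channel, final long length) throws IOException {
    final ByteBuffer zeroes = ByteBuffer.allocate(BLOCK_SIZE);
    long position = 0;
    while (position < length) {
      zeroes.clear();
      // may overshoot by up to one block, as the PR below also notes
      position += channel.write(zeroes, position);
    }
    // flush now so ENOSPC shows up here rather than as a SIGBUS later
    channel.force(true);
  }
}
```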
9731: Preallocate segment files r=npepinpe a=npepinpe

## Description

This PR introduces segment file pre-allocation in the journal. This is on by default, but can be disabled via an experimental configuration option. At the moment, the pre-allocation is done in a "dumb" fashion - we allocate a 4Kb block of zeroes and write it until we've reached the expected file length. Note that this means there may be one extra block allocated on disk.

One thing to note: to verify this, we used [jnr-posix](https://github.com/jnr/jnr-posix). The reason behind this is we want to know the actual number of blocks on disk reserved for this file. `Files#size`, or `File#length`, return the reported file size, which is part of the file's metadata (on UNIX systems anyway). If you mmap a file with a size of 1Mb, write one byte, then flush it, the reported size will be 1Mb, but the actual size on disk will be a single block (on most modern UNIX systems anyway). By using [stat](https://linux.die.net/man/2/stat), we can get the actual file size in terms of 512-byte allocated blocks, so we get a pretty accurate measurement of the actual disk space used by the file.

I would've liked to capture this in a test utility, but since `test-util` depends on `util`, there wasn't an easy way to do this, so I just copied the method in two places. One possibility I thought of is moving the whole pre-allocation stuff into `journal`, since we only use it there. The only downside I can see there is about discovery and cohesion, but I'd like to hear your thoughts on this.

A follow-up PR will come which will optimize the pre-allocation by using [posix_fallocate](https://man7.org/linux/man-pages/man3/posix_fallocate.3.html) on POSIX systems.

Finally, I opted for an experimental configuration option instead of a feature flag. My reasoning is that it isn't a "new" feature; instead, we want the option of disabling it (potentially for performance reasons). So it's more of an advanced option. But I'd also like to hear your thoughts here.

## Related issues

closes #6504
closes #8099
related to #7607

Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
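To illustrate the stat(2) trick the PR description relies on, here is a sketch using jnr-posix (the names are mine, not the PR's): st_blocks counts 512-byte blocks actually reserved on disk, unlike `File#length`, which only reports the size recorded in the file's metadata.

```java
import jnr.posix.FileStat;
import jnr.posix.POSIX;
import jnr.posix.POSIXFactory;

final class OnDiskSize {
  /** Returns the space actually reserved on disk for the file, in bytes. */
  static long actualSizeInBytes(final String path) {
    final POSIX posix = POSIXFactory.getPOSIX();
    final FileStat stat = posix.stat(path);
    // st_blocks is reported in 512-byte units, regardless of the FS block size
    return stat.blocks() * 512;
  }
}
```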
Describe the bug
Customer has reported that they got a SIGBUS error with the latest patch release while running their benchmark; it looks like it is happening in the journal.
Impact: Since the customer doesn't run on Kubernetes, the system had to be restarted.
To Reproduce
I think we can ask them what exactly they are doing; what I know is that they have 24 partitions and 4 brokers with replication factor 4.
Expected behavior
No SIGBUS, I assume?
Log/Stacktrace
hs_err_pid8.log
Environment:
Support Case: SUPPORT-11966