-
Notifications
You must be signed in to change notification settings - Fork 638
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pre-allocate files to prevent no space errors during flush #7607
Comments
Prioritized up as it's part of Ole's onboarding plan. |
Writing a test for pre-allocation will be difficult for two reasons: Testing that pre-allocation works as expectedTo test that pre-allocation works as expected we need to verify that the on-disk size of the segment file is as expected. Using @npepinpe Do you think it is okay to have a platform specific tests for this? It'd only work on linux and not on all filesystems. Testing that pre-allocation can failAs discussed with @npepinpe, testing this is difficult because it relies on the filesystem to be full. We could, in theory, create this condition by using docker but this comes with considerable overhead and we could only check that Zeebe fails, not why. We decided that it is okay if we don't test this condition. |
Let's check in tomorrow - I'm having second thoughts as to whether or not we should do this now instead of proceeding with the Java 17 migration, especially because writing the test I think will be more difficult/complex than expected. |
Okay, let's discuss this in detail. Just as a note to myself: We could also do this on a best-effort basis where we try to pre-allocate on systems which support |
Did someone already tried that and can share some references or code? Might be easier to start with. |
I did, and I think Ole might have as well, but I didn't save anything. There's two portable ways of pre-allocating files: using a template file (so create a temporary blank file of the right size at startup/the first time, and copy it every other time), or using the same trick Agrona does of mmap'ing the file, then at every 4Kb page, setting one byte to 0, then flushing the whole thing at the end. The fast/non-portable way would be to use jni-ffr and LibC, and a simple call to fallocate. It's pretty easy to do: public interface LibC {
int fallocate(int fd, int mode, int offset, int len);
}
public static void main(String[] args) throws Exception {
final var libc = LibraryLoader.create(LibC.class).load("c");
try(final var file = new RandomAccessFile("/path/to/file", "rw")) {
final var fd = MethodHandles.lookup().findVarHandle(int.class, "fd", FileDescriptor.getClass()).get(file.getFD());
libc.fallocate(fd, 0, 0, 1024 * 1024 * 1024); // allocates a 1MB file
}
} Maybe there's a way to get the file descriptor without reflection, but I'm not aware of it OTOH :( |
It seems there's a /** A repository of "shared secrets", which are a mechanism for
calling implementation-private methods in another package without
using reflection. A package-private class implements a public
interface and provides the ability to call package-private methods
within that package; the object implementing that interface is
provided through a third package to which access is restricted.
This framework avoids the primary disadvantage of using reflection
for this purpose, namely the loss of compile-time checking. */
public class SharedSecrets {
...
} However this is not exported by default, so you'd need to add Anyway, if you do have it, you can do |
For testability, I think it might make sense to combine this issue with #7604 I would also omit using FFI to bind to libc for now. We can look into pre-allocating sparse files in a second iteration as a performance optimization. |
Are template files actually a portable and reliable mechanism? I'd imagine that for copy-on-write filesystems (btrfs, zfs) this will not have any effect. |
To be portable you need to really write the other files, using something like FileChannel.transferTo. This is potentially less efficient but still relatively OK since it doesn't go through the JVM. That said, we can add optimizations after, I.e fallocate. It seems even RandomFile setLength might work for our Docker image, as setLength maps to ftruncate, which on ext3/ext4 does allocate when the size is greater. But fallocate is probably the best optimization for Linux deployments in general. It's just the error handling is awkward in Java (I.e. handling errno). |
Alright, so I did a quick prototype today, and I would propose the following:
I did a prototype already, which you can see #9688. I ran into some weird issues:
Anyway, for the real implementation, I would split this into 2 PRs - adding the allocation with the "dumb" way, and then another one to add the C binding to EDIT: jnr-ffi might be a better option in the end. While it has a smaller community (though it's arguably used by many large projects), it has just better support to variable-width types (e.g. |
As discussed, we'll start with a first PR which adds a global feature flag for pre-allocation, and uses A second PR can add an optimization to use |
9731: Preallocate segment files r=npepinpe a=npepinpe ## Description This PR introduces segment file pre-allocation in the journal. This is on by default, but can be disabled via an experimental configuration option. At the moment, the pre-allocation is done in a "dumb" fashion - we allocate a 4Kb blocks of zeroes, and write this until we've reached the expected file length. Note that this means there may be one extra block allocated on disk. One thing to note, to verify this, we used [jnr-posix](https://github.com/jnr/jnr-posix). The reason behind this is we want to know the actual number of blocks on disk reserved for this file. `Files#size`, or `File#length`, return the reported file size, which is part of the file's metadata (on UNIX systems anyway). If you mmap a file with a size of 1Mb, write one byte, then flush it, the reported size will be 1Mb, but the actual size on disk will be a single block (on most modern UNIX systems anyway). By using [stat](https://linux.die.net/man/2/stat), we can get the actual file size in terms of 512-bytes allocated blocks, so we get a pretty accurate measurement of the actual disk space used by the file. I would've like to capture this in a test utility, but since `test-util` depends on `util`, there wasn't an easy way to do this, so I just copied the method in two places. One possibility I thought of is moving the whole pre-allocation stuff in `journal`, since we only use it there. The only downside I can see there is about discovery and cohesion, but I'd like to hear your thoughts on this. A follow-up PR will come which will optimize the pre-allocation by using the [posix_fallocate](https://man7.org/linux/man-pages/man3/posix_fallocate.3.html) on POSIX systems. Finally, I opted for an experimental configuration option instead of a feature flag. My reasoning is that it isn't a "new" feature, but instead we want to option of disabling this (for performance reasons potentially). So it's more of an advanced option. But I'd also like to hear your thoughts here. ## Related issues closes #6504 closes #8099 related to #7607 Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
9731: Preallocate segment files r=npepinpe a=npepinpe ## Description This PR introduces segment file pre-allocation in the journal. This is on by default, but can be disabled via an experimental configuration option. At the moment, the pre-allocation is done in a "dumb" fashion - we allocate a 4Kb blocks of zeroes, and write this until we've reached the expected file length. Note that this means there may be one extra block allocated on disk. One thing to note, to verify this, we used [jnr-posix](https://github.com/jnr/jnr-posix). The reason behind this is we want to know the actual number of blocks on disk reserved for this file. `Files#size`, or `File#length`, return the reported file size, which is part of the file's metadata (on UNIX systems anyway). If you mmap a file with a size of 1Mb, write one byte, then flush it, the reported size will be 1Mb, but the actual size on disk will be a single block (on most modern UNIX systems anyway). By using [stat](https://linux.die.net/man/2/stat), we can get the actual file size in terms of 512-bytes allocated blocks, so we get a pretty accurate measurement of the actual disk space used by the file. I would've like to capture this in a test utility, but since `test-util` depends on `util`, there wasn't an easy way to do this, so I just copied the method in two places. One possibility I thought of is moving the whole pre-allocation stuff in `journal`, since we only use it there. The only downside I can see there is about discovery and cohesion, but I'd like to hear your thoughts on this. A follow-up PR will come which will optimize the pre-allocation by using the [posix_fallocate](https://man7.org/linux/man-pages/man3/posix_fallocate.3.html) on POSIX systems. Finally, I opted for an experimental configuration option instead of a feature flag. My reasoning is that it isn't a "new" feature, but instead we want to option of disabling this (for performance reasons potentially). So it's more of an advanced option. But I'd also like to hear your thoughts here. ## Related issues closes #6504 closes #8099 related to #7607 Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
9731: Preallocate segment files r=npepinpe a=npepinpe ## Description This PR introduces segment file pre-allocation in the journal. This is on by default, but can be disabled via an experimental configuration option. At the moment, the pre-allocation is done in a "dumb" fashion - we allocate a 4Kb blocks of zeroes, and write this until we've reached the expected file length. Note that this means there may be one extra block allocated on disk. One thing to note, to verify this, we used [jnr-posix](https://github.com/jnr/jnr-posix). The reason behind this is we want to know the actual number of blocks on disk reserved for this file. `Files#size`, or `File#length`, return the reported file size, which is part of the file's metadata (on UNIX systems anyway). If you mmap a file with a size of 1Mb, write one byte, then flush it, the reported size will be 1Mb, but the actual size on disk will be a single block (on most modern UNIX systems anyway). By using [stat](https://linux.die.net/man/2/stat), we can get the actual file size in terms of 512-bytes allocated blocks, so we get a pretty accurate measurement of the actual disk space used by the file. I would've like to capture this in a test utility, but since `test-util` depends on `util`, there wasn't an easy way to do this, so I just copied the method in two places. One possibility I thought of is moving the whole pre-allocation stuff in `journal`, since we only use it there. The only downside I can see there is about discovery and cohesion, but I'd like to hear your thoughts on this. A follow-up PR will come which will optimize the pre-allocation by using the [posix_fallocate](https://man7.org/linux/man-pages/man3/posix_fallocate.3.html) on POSIX systems. Finally, I opted for an experimental configuration option instead of a feature flag. My reasoning is that it isn't a "new" feature, but instead we want to option of disabling this (for performance reasons potentially). So it's more of an advanced option. But I'd also like to hear your thoughts here. ## Related issues closes #6504 closes #8099 related to #7607 Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
9834: Restructure journal module r=npepinpe a=npepinpe ## Description This PR mostly restructures the journal module to better reflect its usage. Instead of the main API in the root package and everything else under the `file` package, we now have: ``` root <--- journal API and common classes meant to be consumed outside | |-- file <--- persistence layer, i.e. segmented journal implementation |-- record <-- record (de)serialization and manipulation |-- util <-- internal utilities (only ChecksumGenerator for now) ``` We could alternatively think about moving `file`, `record`, and `util` under a `impl` package as well to clearly indicate these aren't meant to be used outside of the module. This came out of #7607. ## Related issues related to #7607 Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
Since we memory mapped files, the files are not pre-allocated. When the buffer is flush there is no guarantee that there is enough space in disk. This can lead to errors which are hard to diagnose. We have seen before that it returns "internalError" which is not really useful to know the root cause.
To prevent such cases, we can pre-allocate the segment when it is created. It is like reserving that space for the file.
Followup #6221
The text was updated successfully, but these errors were encountered: