Avoid unnecessary allocations when using FileStream #15088
Comments
I do think this is a very good issue. If we do the buffer pool option, we first need to design a good general purpose buffer pool. We have made many attempts at it in the past, and none of them turned out to be truly general purpose (i.e. such that the pool works for many different scenarios). But I think the time is ripe for this; we need a good buffer pool built into the platform.
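For context, the general-purpose pool that eventually shipped in the platform is System.Buffers.ArrayPool&lt;T&gt;; the rent/return pattern such a pool enables looks roughly like this (the size here is illustrative):

```csharp
using System;
using System.Buffers;

// Rent a buffer of at least 4096 bytes from the shared pool; the array returned
// may be larger than requested.
byte[] buffer = ArrayPool<byte>.Shared.Rent(4096);
try
{
    // ... use buffer for I/O ...
}
finally
{
    // Hand the buffer back so other callers can reuse it instead of allocating.
    ArrayPool<byte>.Shared.Return(buffer);
}
```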
A general purpose buffer pool would be wonderful; there are quite a few places that need it. At the same time, I don't think that IBufferPool is a good idea either.
Hopefully we'd all agree that we did this 👍 We were looking at addressing this pretty recently in ASP.NET, and our options are really limited unless it comes from CoreFx.
We can revisit it, but see the discussion here dotnet/corefx#5954 (comment), then dotnet/corefx#6473. cc: @jkotas, @socket, @KrzysztofCwalina
Thanks @stephentoub - perhaps it's my unfamiliarity with the details, but from looking at the API definitions here, it seems that the only API that could leak a reference to the buffer is the […]. Do you think we'd solve any of the objections by using a pooled buffer inside FileStream for […]? As I understand it, the objection to using pooling in […]
Not really. It's also about the internal buffer used by FileStream (FileStream doesn't actually override CopyTo{Async}). Let's say FileStream grabs a buffer from the pool when it's constructed and returns it when it's Dispose'd. What happens if misuse of the stream causes it to be Dispose'd while a ReadAsync operation is in flight? With the current implementation, we'd end up putting a buffer back into the pool and then potentially still writing into it as part of the in-flight ReadAsync operation. We could add synchronization (at a run-time cost) to only return the buffer to the pool in Dispose if there aren't any async operations in flight, and that would address this particular case.

But depending on the degree to which we care about corruption, there's still the case that something else in the process could erroneously put a buffer back into the pool, FileStream could use that buffer for reads/writes, but the original holder of the buffer could still be using it. There's nothing we can do about that, and we'd end up in a situation where corrupted data was being read from or written to the file.

The concern here is that we'd be introducing the potential for non-local corruption where it never existed before; something elsewhere in the process, completely unrelated to a particular FileStream instance, could end up corrupting that instance. Is that a security issue? Maybe, maybe not. Is it difficult to debug? Almost certainly.
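To make that scenario concrete, here is a sketch of the misuse being described, assuming a hypothetical FileStream that rented its internal buffer from a pool in its constructor and returned it in Dispose (the file name and timing are illustrative):

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

// Hypothetical: imagine FileStream rented its internal buffer from a shared pool
// in its constructor and returned that buffer to the pool in Dispose.
var fs = new FileStream("data.bin", FileMode.Open, FileAccess.Read,
                        FileShare.Read, bufferSize: 4096, useAsync: true);

byte[] destination = new byte[128];
Task<int> pending = fs.ReadAsync(destination, 0, destination.Length); // read in flight

fs.Dispose(); // misuse: disposing while the read is still running. In the pooled
              // design, Dispose hands the internal buffer back to the pool while the
              // in-flight read may still write into it, corrupting whatever code
              // rents that same array next.

await pending; // with today's FileStream this typically faults; with a naive pooled
               // buffer it could silently corrupt another consumer's data
```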
What about ref counting the buffers? So if there are outstanding operations, the buffer is only returned once they have all completed, even if the stream is disposed midway through?
That's what I was referring to with "We could add synchronization (at a run-time cost) to only return the buffer to the pool in Dispose if there aren't any async operations in flight". I think you're suggesting on top of that we could delay the return of the buffer until the operation completed, whereas I was suggesting we simply wouldn't return the buffer in that case. I don't think it's worth optimizing for cases of misuse (it's considered misuse to Dispose of a FileStream while operations are still in flight).
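For illustration only, the kind of guard being discussed might look roughly like this; the type and member names are made up, and a real implementation would also have to prevent new operations from starting once Dispose has run:

```csharp
using System;
using System.Buffers;
using System.Threading;

// Hypothetical sketch of "only return the buffer when nothing is in flight".
sealed class PooledStreamBuffer : IDisposable
{
    private byte[]? _array = ArrayPool<byte>.Shared.Rent(4096);
    private int _inFlight;
    private bool _disposed;

    public void OnOperationStarted() => Interlocked.Increment(ref _inFlight);

    public void OnOperationCompleted()
    {
        // Ref-counting variant: the last completing operation returns the buffer
        // if Dispose has already run.
        if (Interlocked.Decrement(ref _inFlight) == 0 && Volatile.Read(ref _disposed))
            ReturnToPool();
    }

    public void Dispose()
    {
        Volatile.Write(ref _disposed, true);
        // Simpler variant: if anything is still in flight, just never return the
        // buffer at all (misuse isn't worth optimizing for).
        if (Volatile.Read(ref _inFlight) == 0)
            ReturnToPool();
    }

    private void ReturnToPool()
    {
        byte[]? array = Interlocked.Exchange(ref _array, null);
        if (array is not null)
            ArrayPool<byte>.Shared.Return(array);
    }
}
```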
Thanks for the summary, Stephen.

This really seems more like a discussion of the principles the runtime wants to follow than of whether or not we can solve the issues related to FileStream. Should every framework component behave as much of a 'clean room' as possible? I think the logical conclusion of this is that there ends up being a 'framework only' instance of the pool, or no pooling at all in corefx. Every other mitigation will have an Achilles heel, and there would still be cases in existing BCL APIs (like […]).

The escape hatch would be to provide a constructor or method overload that accepts a caller-provided buffer. This way […].

I'm comfortable waiting a while on […] to resolve exactly what we (ASP.NET) want to do, because we don't yet have much data about the scenario in question (serving static files). I think in an ideal world, I'd have the ability to write more unsafe code to solve IO problems using stack-allocated or manually managed memory. That isn't compatible with a lot of existing APIs, of course, which is why we aren't just doing that 😆
In general, having some way to get a Stream over […]
Background and Motivation

We have recently got rid of all managed allocations for […]. The buffer allocation in BufferedFileStreamStrategy (runtime/src/libraries/System.Private.CoreLib/src/System/IO/Strategies/BufferedFileStreamStrategy.cs, line 1088 in dd8b090) is the last allocation that could be avoided. We can avoid it in either of the ways described below.

Proposed API

namespace System.IO
{
public sealed class FileStreamOptions
{
public FileStreamOptions();
public FileMode Mode { get; set; }
public FileAccess Access { get; set; } = FileAccess.Read;
public FileShare Share { get; set; } = FileShare.Read;
public FileOptions Options { get; set; }
public long PreallocationSize { get; set; }
+ public Memory<byte>? Buffer { get; set; } // default value == null => use default buffer size and allocate the buffer (current behaviour)
    }
}

Usage Examples

byte[] array = ArrayPool<byte>.Shared.Rent(16_000);
var advanced = new FileStreamOptions
{
Mode = FileMode.CreateNew,
Access = FileAccess.Write,
Options = FileOptions.Asynchronous | FileOptions.WriteThrough,
Buffer = array
};
using FileStream fileStream = new FileStream(advanced);
// use FileStream
ArrayPool<byte>.Shared.Return(array);

To disable the buffering, users would have to pass an empty buffer:

var noBuffering = new FileStreamOptions
{
Buffer = default(Memory<byte>) // Array.Empty<byte>() would also work
};

Alternative Designs

Don't let the user provide the buffer (to minimize the risk of misuse), but instead provide a BufferSize property and a new FileOptions.PoolBuffer flag:

namespace System.IO
{
public sealed class FileStreamOptions
{
public FileStreamOptions();
public FileMode Mode { get; set; }
public FileAccess Access { get; set; } = FileAccess.Read;
public FileShare Share { get; set; } = FileShare.Read;
public FileOptions Options { get; set; }
public long PreallocationSize { get; set; }
+ public int BufferSize { get; set; }
}
public enum FileOptions
{
WriteThrough,
None,
Encrypted,
DeleteOnClose,
SequentialScan,
RandomAccess,
Asynchronous,
+ PoolBuffer // new option
}
}

Risks

Allowing the users to pass the buffer creates the risk of the user misusing the buffer, for example by returning it to the pool or writing to it while the FileStream is still using it.
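One way for callers to reduce that risk under the proposed design would be to scope the rented array with try/finally and dispose the stream before the buffer goes back to the pool. A sketch, assuming the API shape proposed above (the Buffer property and the options-only constructor are hypothetical and not part of the shipped API):

```csharp
using System.Buffers;
using System.IO;

byte[] array = ArrayPool<byte>.Shared.Rent(16_000);
try
{
    var options = new FileStreamOptions
    {
        Mode = FileMode.CreateNew,
        Access = FileAccess.Write,
        Options = FileOptions.Asynchronous | FileOptions.WriteThrough,
        Buffer = array // proposed property, not part of the shipped API
    };

    using FileStream fileStream = new FileStream(options); // constructor shape follows the example above
    // ... write to fileStream ...
}   // fileStream is disposed at the end of the try block, before the buffer is returned
finally
{
    ArrayPool<byte>.Shared.Return(array);
}
```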
We already support creating the FileStream unbuffered, at which point the consumer is fully in control of buffering via the buffers they pass to read/write. I'd rather we just stick with that than expose this scheme, which is yet another way to shoot oneself in the foot with pooling and yet another scheme in FileStream for letting the user control buffering. This also ends up being FileStream-specific... if we really think this internal buffer needs to be configurable further, we should think it through for what pattern should be used across all streams/writers/readers. And on top of that, this prescribed pattern ends up forcing a buffer to be rented/allocated in case it might be needed, even if access patterns are such that it would never otherwise be reified.
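The unbuffered approach referred to here is already possible today: create the FileStream with a tiny bufferSize (the classic idiom is 1) and pass a caller-managed, possibly pooled, array to every read or write. For example (the file name is illustrative):

```csharp
using System;
using System.Buffers;
using System.IO;
using System.Threading.Tasks;

// Caller-controlled buffering: the FileStream does (effectively) no buffering of its
// own, and the array handed to ReadAsync comes from the shared pool.
byte[] buffer = ArrayPool<byte>.Shared.Rent(16_000);
try
{
    using var fs = new FileStream("data.bin", FileMode.Open, FileAccess.Read,
                                  FileShare.Read, bufferSize: 1, useAsync: true);
    int bytesRead;
    while ((bytesRead = await fs.ReadAsync(buffer, 0, buffer.Length)) > 0)
    {
        // process buffer[0..bytesRead) ...
    }
}
finally
{
    ArrayPool<byte>.Shared.Return(buffer);
}
```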
Buffer management is not restricted only to allocation. The current implementation controls […].

Because this is the responsibility of the user. The same applies to […].
Just to add my 2 cents: having the buffer management (position, etc.) in […]
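As one existing way to keep buffer management (position tracking, flushing) out of FileStream itself, an effectively unbuffered FileStream can be composed with BufferedStream; a sketch (the file name and sizes are illustrative):

```csharp
using System.IO;

// Composition: the FileStream itself is effectively unbuffered, and BufferedStream
// owns the buffer, the buffered position, and the flushing logic.
using var raw = new FileStream("data.bin", FileMode.Open, FileAccess.Read,
                               FileShare.Read, bufferSize: 1);
using var buffered = new BufferedStream(raw, bufferSize: 16_000);

int next;
while ((next = buffered.ReadByte()) != -1)
{
    // consume the byte ...
}
```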
@tannergooding mentioned that there may be a more generalized allocator/management feature in the works, so rather than accepting a buffer now as well as an allocator "soon", we feel that the right answer for now is just to take the buffer size, not a user provided buffer.

namespace System.IO
{
partial class FileStreamOptions
{
public int BufferSize { get; set; }
}
}
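With the BufferSize property as ultimately shipped in FileStreamOptions, typical usage looks like this (the path and options are illustrative; a BufferSize of 0 or 1 disables FileStream's internal buffer):

```csharp
using System.IO;

var options = new FileStreamOptions
{
    Mode = FileMode.Open,
    Access = FileAccess.Read,
    Options = FileOptions.SequentialScan,
    BufferSize = 0 // 0 (or 1) disables FileStream's internal buffer entirely
};

using var fs = new FileStream("data.bin", options);
// The caller now controls buffering via the arrays it passes to Read/ReadAsync.
```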
Implemented by #52928
This was originally a PR (dotnet/coreclr#1429) that was turned into an issue as a result of the comments there.
The idea is to avoid the 4KB buffer allocation whenever we need to work with a large number of files.
Consider code that opens a FileStream for each of a large number of files. The problem is that each instance of FileStream will allocate an independent buffer. If we are reading 10,000 files, that will result in 40MB(!) being allocated, even if we are very careful about allocations in general.
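For example, a sketch of this access pattern (the directory and file names are illustrative):

```csharp
using System.IO;

byte[] scratch = new byte[4096];

// Reading many files: each FileStream instance allocates its own independent
// internal buffer (4KB by default), so 10,000 files means roughly 40MB of
// short-lived buffer allocations.
foreach (string path in Directory.EnumerateFiles("logs", "*.txt"))
{
    using var fs = new FileStream(path, FileMode.Open, FileAccess.Read);
    while (fs.Read(scratch, 0, scratch.Length) > 0)
    {
        // process the bytes just read ...
    }
}
```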
See also: dotnet/corefx#2929
The major problem is that FileStream will allocate its own buffer(s) and provide no way to really manage that. Creating a large number of FileStream instances, or doing big writes using WriteAsync, will allocate a lot of temporary buffers and generate a lot of GC pressure.
As I see it, there are a few options here:
The idea here is that each thread has its own set of buffers, and we'll take the buffers from there. The Dispose method will return them to the current thread's set. Note that there is no requirement to use the same thread for creation / disposal. (Although, to be fair, we'll probably need to handle the case where a dedicated disposal thread is used and all streams are disposed on it.)
The benefit here is that this isn't going to impact the external API, whereas accepting an external buffer would make the change visible in the public API.
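A rough sketch of this thread-local option (illustrative only; the names are made up, and a fuller design would handle the cross-thread disposal case mentioned above):

```csharp
using System;

// Hypothetical per-thread buffer cache: Rent takes the current thread's cached
// buffer (or allocates a new one), Return stashes the buffer on whichever thread
// happens to dispose the stream.
static class ThreadLocalBuffers
{
    [ThreadStatic]
    private static byte[]? t_buffer;

    public static byte[] Rent(int minimumSize)
    {
        byte[]? cached = t_buffer;
        if (cached != null && cached.Length >= minimumSize)
        {
            t_buffer = null;            // hand out the cached buffer
            return cached;
        }
        return new byte[minimumSize];   // cache miss: allocate a fresh buffer
    }

    public static void Return(byte[] buffer)
    {
        // Keep at most one buffer per thread; a fuller design would keep a small set.
        t_buffer = buffer;
    }
}
```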