fallible collection allocation 1.0 #2116
Conversation
I'm really sorry this isn't perfect. I am just deeply exhausted with working on this problem right now, and need to push out what I have just to get it out there and focus on something else for a bit. I'm not 100% convinced with all my "don't rock the boat" rationales for CollectionAllocErr, and could probably be very easily convinced to change that. It's just that my default stance on this kinda stuff is "don't touch anything, because the Servo team probably is relying on it in 17 different ways that will make me sad".
> This strategy is used on many *nix variants/descendants, including Android, iOS, MacOS, and Ubuntu.
>
> Some developers will try to use this as an argument for never *trying* to handle allocation failure. This RFC does not consider this to be a reasonable stance. First and foremost: Windows doesn't do it. So anything that's used a lot on windows (e.g. Firefox) can reasonably try to handle allocation failure there. Similarly, overcommit can be disabled completely or partially on many OSes. For instance the default for Linux is to actually fail on allocations that are "obviously" too large to handle.
Worth mentioning that allocations can fail for reasons other than running out of physical memory, e.g. running out of address space, or running into a ulimit/setrlimit.
I think the plan for unwinding is terrible. I could support the
There is no evidence of this not being considered practical. More generally, the server use case seems a little thin. I've already mentioned in the internals thread that there's a lot more to be considered there: servers come in different shapes and sizes. Considering one of Rust's 2017 goals is “Rust should be well-equipped for writing robust, high-scale servers”, I think this use case (or, I'd like to argue, use cases) should be explored in more detail. Depending on unwinding for error handling is a terrible idea and entirely contrary to Rust best practices. This by itself should be listed under the “drawbacks” section. Besides being counter-idiomatic, recovering from unwinding doesn't work well in at least three cases, two of which are not currently considered by the RFC:
Using unwinding to contain errors at task granularity is completely idiomatic. It's why Rust bothers to have unwinding at all. Allowing OOMs to panic in addition to their current behavior is totally in line with this. It's not a full solution, but it is a necessary part of one.
Update: The suggestion was followed. No need to read the rest of this comment (which I have left below the line). I suggest that the filename for this RFC be changed to something that isn't quite so subtle. (The current filename, "alloc-me-like-one-of-your-french-girls.md", is a meme/quote from the movie "Titanic"; I infer that reference is meant to bring to mind "fallibility", but I needed some help along the way.)
text/0000-alloc-me-maybe.md (Outdated)
Make sure that this previous comment on the earlier filename does not get lost in the shuffle; quoting here for completeness:

> Worth mentioning that allocations can fail for reasons other than running out of physical memory, e.g. running out of address space, or running into a ulimit/setrlimit.
> Here unwinding is available, and seems to be the preferred solution, as it maximizes the chances of allocation failures bubbling out of whatever libraries are used. This is unlikely to be totally robust, but that's ok.
>
> With unwinding there isn't any apparent use for an infallible allocation checker.
I didn't know what "infallible allocation checker" meant when I first read through this. The term does not occur, at least not as written, elsewhere in the document.
Is it meant to be the hypothetical tool listed above under "User Profile: Embedded", namely:

> some system to prevent infallible allocations from ever being used

If so, maybe just add "we'll call this an 'infallible allocation checker'" when the idea is first introduced, just to define local terminology?
Yeah good catch (it's what I end up proposing in the first Future Work section)
> ## User Profile: Runtime
>
> A garbage-collected runtime (such as SpiderMonkey or the Microsoft CLR) is generally expected to avoid crashing due to out-of-memory conditions. Different strategies and allocators are used for different situations here. Most notably, there are allocations on the GC heap for the running script, and allocations on the global heap for the actual runtime's own processing (e.g. performing a JIT compilation).
Are you using the word "crash" here in a narrow sense that refers just to undefined behavior and low-level errors like segmentation faults, etc.? Or is it meant to include unchecked exceptions, which Java's `OutOfMemoryError` qualifies as? (I understand that you did not include the JVM in your list of example garbage-collected runtimes, but I think most reasonable people would include it as an example of one...)
But maybe I misunderstand the real point being made in this sentence, since you draw a distinction between allocations made for the script versus allocations for the internals of the runtime. I.e. is your point that even Java avoids crashing from out-of-memory conditions that arise from the runtime internals (like JIT compilation) ?
(I guess my previous comment is implicitly suggesting that you use a more specific word than "crash" in the first sentence. That, or add text elsewhere to the document that specifies what "crash" denotes in the context of this RFC.)
Yes, I should be more clear. I mostly meant crashing due to runtime internals, but script stuff should also try to recover (e.g. triggering a GC when a script allocation fails and retrying). I ended up cutting all focus from the script side because, as I note just below, the script allocations aren't actually relevant to this RFC (AFAICT).
I didn't include the JVM because I had only found the time to interview SM and CLR people.
Only as a last resort, so that one assertion failure doesn't take down your whole process accidentally. Unwinding should not be used for errors that are more or less expected and that you know how to deal with. https://doc.rust-lang.org/stable/book/second-edition/ch09-03-to-panic-or-not-to-panic.html
Yes, precisely. In many situations, allocation failure is unexpected and has no meaningful response at a granularity smaller than a task. This is a reason to support oom=panic rather than just abort.
> ## try_reserve
>
> `try_reserve` and `try_reserve_exact` would be added to `HashMap`, `Vec`, `String`, and `VecDeque`. These would have the exact same APIs as their infallible counterparts, except that OOM would be exposed as an error case, rather than a call to `Alloc::oom()`. They would have the following signatures:
Clarification request: You'll only be adding `fn try_reserve_exact` to the types that already supply `fn reserve_exact`, right?
I'll be honest, I was working off memory and thought `reserve` and `reserve_exact` always went together. If not, then yeah, what you said.
```rust
/// Tries to reserve capacity for at least `additional` more elements to be inserted
/// in the given `Vec<T>`. The collection may reserve more space to avoid
/// frequent reallocations. After calling `reserve`, capacity will be
```
The doc-comment should be further revised since this is the doc for `fn try_reserve`, not `fn reserve`. In particular, instead of saying "After calling `reserve`, capacity will ..." (which is not relevant to this function), you could instead say:

> If `try_reserve` returns `Ok`, capacity will be greater than or equal to `self.len() + additional`. If capacity is already sufficient, then returns `Ok` (with no side-effects to this collection).
> ## Eliminate the CapacityOverflow distinction
>
> Collections could potentially just create an `AllocErr::Unsupported("capacity overflow")` and feed it to their allocator. Presumably this wouldn't do something bad to the allocator? Then the oom=abort flag could be used to completely control whether allocation failure is a panic or abort (for participating allocators).
clever! (don't know if I like it, but had to give it credit nonetheless)
@gankro Thanks for all the hard work! However, I also don't like unwinding as error handling. Unwinding should only happen when I have signaled that I don't have anything to do to help; IMHO, it is damage control, rather than error handling. Have there been any proposals to add some sort of OOM handler? Something like:

```rust
enum OOMOutcome {
    Resolved(*const u8), // OOM was resolved and here is the allocation
    // Could not resolve the OOM, so do what you have to
    #[cfg(oom = "panic")]
    Panic,
    #[cfg(oom = "abort")]
    Abort,
}

fn set_oom_handler<H>(handler: H)
    where H: Fn(/* failed operation args... */) -> OOMOutcome;
```

You would then call `set_oom_handler` to register the handler, and it would be invoked whenever an allocation fails. The benefit of this approach is that the existing interface doesn't have to change at all, and applications don't have to choose to use new fallible APIs. An alternate approach would be to make the OOM handler part of the allocator itself. Yet another approach would be to make the OOM handler a type that implements a trait. Allocators would then be generic over their OOMHandler type.
Great idea, @mark-i-m. In a server app, I'd grab a chunk of memory on startup, then in my OOM handler:
On operating systems that error on allocation as opposed to access, this would remove the need to manually tune the app for request load as a function of memory consumption. Thanks @gankro for your work on this RFC.
@gankro I have mixed feelings about this RFC. I think this is a workable design; it provides me with the ability to write fallible code. Still, it's much better than the current situation, and it doesn't have any drawbacks I can see as long as the guts are unstable (taking into account that application/firmware code will be written against this), so I'm in favor of merging this even as-is. Thanks for writing this!
Perhaps this could be an experimental RFC until we get experience with what doesn't work so well? That's seemed to work well in the past for the allocator interfaces...
Given the constraints, I'd say this is good enough. I have only one question: what are the semantics of `try_reserve`?
@gnzlbg that's an implementation detail. The stdlib version, for example, is able to guarantee room for any combination of items in advance. Some variants though can't 100% guarantee room without knowing the dataset. Hopscotch and cuckoo hash tables come to mind as examples.
@arthurprs IIUC, the whole point of `try_reserve` is this guarantee. Without this guarantee, one cannot avoid OOM at all, so why would I actually call `try_reserve`? So... I must be missing something, because as I understand it, this guarantee is not an implementation detail of `try_reserve` (*).

(*) unless we add
I don't disagree with your overall idea, I'm just saying that it can't be done in ALL cases.
@arthurprs This is why I was asking (and why I chose HashMap as an example). I agree with you on this. This cannot be guaranteed for all collections.
I was looking into changing some libraries to support fallible allocation. Already a lot of people think that catching allocation errors in C++ and trying to recover is too error-prone and a security disaster, and that's with the language always supporting that kind of recovery. In the case of Rust libraries and programs, they have been written with their security analysis assuming oom=abort. Thus, either an allocation succeeds or it's game over, so it's currently valid to create an inconsistent state in a program (e.g. have a memory-safety invariant left unsatisfied), do an allocation, and then restore the state (e.g. restore the memory-safety invariant) after the allocation. Any such code will be broken, and potentially memory-unsafe, with oom=panic.
@briansmith I think I basically agree that as we get more
To make sure that I understand correctly, you mean that this will make existing sound `unsafe` code unsound?
Very few crates meet that criterion. For example, most use libstd and libstd has a lot of unsafe code within it. Is libstd itself memory-safe with oom=panic?

```rust
struct S<T> {
    ptr: *mut T,
}

impl<T> S<T> {
    fn foo(&mut self) {
        let saved = self.ptr;
        self.ptr = 12345678 as *mut _; // Invalidate invariant
        let _ = vec![0u8; 1_000_000]; // may unwind under oom=panic
        self.ptr = saved; // Restore invariant
    }
}
```

[Pretend there are 10,000 lines of code here]

```rust
impl<T> S<T> where T: Copy {
    fn bar(&self) -> T {
        unsafe { *self.ptr }
    }
}
```

The above code is safe for oom=abort, but unsafe for oom=panic. Note that there is no `unsafe` code in `foo` itself. At a minimum, we should audit libstd to make sure there are no such patterns within it. I think that will be a difficult project.
I agree that finding these kinds of issues is going to be very difficult. I just wanted to clarify that any memory safety issue will ultimately have to involve `unsafe` code.
Once again the lack of "unsafe to touch" fields bites us.
That is probably true for memory safety, but it isn't true for all security invariants. Consider an application that accepts requests with nonces, and appends each nonce it sees into a `Vec` so that replays can be rejected; an OOM panic that is caught and recovered from could leave a processed nonce unrecorded. This makes me think that the kind of panic that OOM would issue must be distinct from the kinds of panics that other failures issue.
I disagree, because I think the ecosystem shouldn't switch to using the
In other words, there's nothing "special" about OOM panics, other than the fact that OOM didn't used to be a panic. For that reason, I don't think they should be a different kind of panic, since it would make a historical accident into a permanent distinction. One could easily imagine that in the future there might be other operations that we'd want to make fallible, and that would have the same implications for program invariants, but we wouldn't want to keep introducing new kinds of panics.
I also think that there is a huge space to explore here.
I don't think the expectation is that Rust developers must guard against any standard library function being able to panic, now or in the future. In any case, people don't program that way. Actually, because we have "Panic" sections in documentation of the standard library, I think one may assume that outside the circumstances already described by the "Panic" section documentation, a standard library function won't panic. So I don't think it is reasonable to make existing APIs panic when they didn't before, unless a crate has opted into that breaking change.
By using
In the narrow case where C calls out to Rust in just a few places, sure. But for more invasive "rewrite it in Rust" scenarios, that breaks down. Same for certain freestanding/embedded projects; e.g. for Rust in the Linux kernel, catching panics would never have been accepted by upstream. I am not saying we all need to go rewrite the world with fallible allocation. This isn't supposed to be a doom prediction, but rather: if the process goes slowly enough, the cost will be amortized and won't hurt :)
The reason I became interested in this issue is because there was a request to do exactly that for Rustls to get it to work according to Apache Web Server's coding rules, and I'm trying to find a solution that doesn't involve adding a giant number of new (untested) branches to Rustls. I did sketch what this would look like in the Rustls codebase and it got ugly really quickly so I'm trying to find alternatives.
Did the issues stem from the fact that many functions would have to return a `Result`? One possibility to address that would be to translate the allocation failure into Rustls's existing error type.
How does having a new error case make your code a mess?
Every time we'd do an allocation, we have to branch on whether the allocation succeeds or fails. Usually this means using `?`.
I don't have a good solution for your case: since the allocator is global state, it's very hard to control. I don't think it's impossible, but it's very hard and would require a lot of work. Even with fuzzing you can't test every possibility; testing that a branch is taken doesn't mean you've tested every way of reaching that branch. For the Rustls case, I suggest that anything that triggers an OOM put the connection into an irrecoverable error state and make it impossible to use again. Say you are a server that doesn't want to abort on OOM, and a client somehow triggers an OOM: you want to kick that client. So on any OOM error from a connection, you want to stop that connection. By having a "state broken, abandon ship" error that the server receives when it tries to keep using Rustls, this would prevent any inconsistent state and allow not aborting on OOM.

I don't see how functional patterns remove any branches; functional patterns remove side effects, not branches.

True, but this is Rust. It's not a 1-to-1 equivalent; Rust has much better tools to prevent bugs. But I think we are getting a little off topic for this feature; your warning about
@briansmith I get the pain of refactoring libs, but I don't understand that latest reason for being wary of it. Now, I understand the following is a matter of opinion, but I would much prefer explicit control flow to panicking control flow when auditing. As we all know, Rust does error handling much better than C, by virtue of having sum types.
Brian seems to want all alloc failures to only abort, if I read their meaning properly. Which is certainly one way to maintain security, but maybe a poor user experience.

OK, aborting is simpler, yes.
Add minimal support for fallible allocations to the standard collection APIs. This is done in two ways:

- An `oom=panic` configuration is added to make global allocators panic on oom.
- A `try_reserve() -> Result<(), CollectionAllocErr>` method is added.

The former is sufficient for unwinding users, but the latter is insufficient for the others (although it is a decent 80/20 solution). Completing the no-unwinding story is left for future work.