Stabilize having the concept of "validity invariant" and "safety invariant"? Under which name? #539
Comments
I don't like these. Many library invariants are not safety invariants (e.g., if you have a non-deterministic So there's at least:
I think that "safety invariant" is a good name, but "validity invariant" is not a particularly good name for "if you violate this demons will pop out of your nose". tho maybe "language invariant" is a good name (so language invariant/safety invariant). |
I think any two names that do not share the first letter are better than two names that do share the same first letter. So I vote for validity/safety over lang/library. But also any two other words of differing first letter would be fine too: lang and safety, for example. |
Well, you've made your own choices here by calling these "library invariants" and "language invariants"; choices I would not agree with. ;)
So do you propose we also call it language UB and safety UB? That doesn't make a lot of sense, and IMO it'd be good to align those terms with the invariants that cause the respective UB when violated. I don't recall us ever abbreviating validity/safety invariant with the first letters, so I don't think it's very important that they have different first letters. |
But there isn't actually safety UB is there? All safety invariants are there because when violated they can allow otherwise fine code to end up breaking a language rule somewhere later. As far as I'm aware, there's just the one kind of UB. |
Library UB is real and not the same as language UB. For instance, it is library UB to call any |
(From the OP) The UB you are referring to is also called "language UB", and "library UB" is used in contrast to mean basically "a violation of library invariants/preconditions/etc that could lead to (language) UB depending on usage/library implementation details/etc", which is useful to have a succinct term for. |
I would call them "language UB" (or just UB as I think there is only one thing that should get that name) and "safety violation" respectively. |
Yeah, see, I'm in agreement with @digama0. If you have a I don't think we should call that state of having invalid bytes "library UB". I don't think we should call any other similar situations "library UB" either. I do like the term "safety violation". Otherwise you have to start explaining to people that sometimes there's something called UB that they're allowed to do anyway, as long as they fix the situation before anything "really bad" happens. And that's a very easy thing for people to misunderstand, or only halfway remember months after the fact. |
Violating such library preconditions is often called UB, and I think it makes sense. UB basically means "you violated the contract of this API and now all guarantees are void". Language UB is about the contract between the programmer and the compiler (where the "API" is the language itself); library UB is about the contract between the library author and the client author. |
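To make the "contract between the library author and the client author" concrete, here is a minimal sketch. The `NonEmpty` type and its methods are hypothetical names invented for this illustration, not anything from std or from this thread. The language invariant of `&[T]` allows an empty slice; the "never empty" rule is purely a library-level contract that the safe method `first` relies on.

```rust
/// Hypothetical wrapper with a library (safety) invariant: the slice is never empty.
/// The language (validity) invariant of `&[T]` does not require this.
pub struct NonEmpty<'a, T>(&'a [T]);

impl<'a, T> NonEmpty<'a, T> {
    /// Checked constructor: establishes the library invariant.
    pub fn new(slice: &'a [T]) -> Option<Self> {
        if slice.is_empty() {
            None
        } else {
            Some(NonEmpty(slice))
        }
    }

    /// # Safety
    ///
    /// `slice` must not be empty. Passing an empty slice violates the library
    /// contract even though no language rule is broken at this point.
    pub unsafe fn new_unchecked(slice: &'a [T]) -> Self {
        NonEmpty(slice)
    }

    /// Safe method whose soundness relies on the library invariant.
    pub fn first(&self) -> &T {
        // SAFETY: the library invariant guarantees that index 0 is in bounds.
        unsafe { self.0.get_unchecked(0) }
    }
}
```

If a client called `new_unchecked` with an empty slice, nothing would go wrong at the language level until `first` is called, which is the situation the terms "library UB" and "safety violation" are both trying to name.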
Right, except I'm telling you that what people often end up hearing is, "I'm allowed to do some types of UB". I think that's bad, and if we're going to deliberately pick terms, we should deliberately pick terms that don't let people think that "sometimes UB is okay". Because that's a real thing people have said to me, and I've had to stop and explain to them what's "actually UB", and that they really can't do it, not even sneakily. So call them lang invariants and lib invariants if you want, but I strongly suggest that you don't call them that if you want to draw a connection to "lang UB" and "lib UB", because those are not good terms: I have seen them do harm to people's understanding of Rust on multiple occasions over the years. |
You're not allowed to do library UB with other people's libraries either. I have seen this analogy work pretty well, too, so I wouldn't be so quick to discard it. |
I know it does work for some people, but if it works for some people and not for others, then I don't think it entirely works. Of course you should not do any library UB, but the fact that you can actually do it, without complaints from miri (e.g., https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=919353ecbc1b830d3f8daa6f3180888e), gives some people wrong impressions about what the terms even mean. I don't think we should design our terms just for experts; we should also design them for the uninformed and forgetful. There are far more uninformed and forgetful programmers than expert ones. |
I agree with @Lokathor that people do library UB and the reason is what I call "robust functions" in the unsafe mental model and what you call "super safe functions". As a library author you can document that you accept more values than what the safety invariant says. Typical examples are non-UTF-8 |
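The idea of a "robust function" mentioned above can be sketched as follows. Everything here is hypothetical and made up for illustration (the `Ascii` type, its methods, and the exact wording of the safety contract); it is not the API of any real crate. The library author documents that `as_bytes` accepts values that break the default library invariant, while `as_str` relies on it.

```rust
/// Hypothetical byte string with a library invariant: every byte is ASCII (< 0x80).
pub struct Ascii(Vec<u8>);

impl Ascii {
    /// Checked constructor: establishes the library invariant.
    pub fn new(bytes: Vec<u8>) -> Option<Self> {
        if bytes.is_ascii() {
            Some(Ascii(bytes))
        } else {
            None
        }
    }

    /// # Safety
    ///
    /// If `bytes` contains non-ASCII bytes, the returned value may only be
    /// passed to functions documented as robust (here: `as_bytes`).
    pub unsafe fn from_bytes_unchecked(bytes: Vec<u8>) -> Self {
        Ascii(bytes)
    }

    /// Relies on the library invariant.
    pub fn as_str(&self) -> &str {
        // SAFETY: ASCII bytes are always valid UTF-8.
        unsafe { std::str::from_utf8_unchecked(&self.0) }
    }

    /// Robust: documented to work even if the ASCII invariant is violated.
    pub fn as_bytes(&self) -> &[u8] {
        &self.0
    }
}
```

Under this (assumed) contract, holding a non-ASCII `Ascii` and calling only `as_bytes` breaks the default library invariant but not the library's documented contract, which is the asymmetry being discussed.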
@Lokathor Nothing works for all people, it's a matter of tradeoffs. 🤷 I have registered your opinion, let's see what other people say. @ia0 Library UB is a violation of the contract between the library author and the client. If the library author documents that it's okay to call some |
I think we all agree that if we use the term UB it should be for situations that always indicate incorrect code. ie. it's never ok for a program to exhibit something we have decided to call UB. It seems the argument is more over whether violating library invariants should always indicate a problem with the program, or whether there are some situations where violating library invariants is acceptable - if there are such cases then we shouldn't call it UB. I tend to agree with @RalfJung that it should be considered UB as long as we:
There are still some problems though:
|
Ah ok my bad. I thought the analogy between "validity invariant" and "language UB" on one side and "safety invariant" and "library UB" on the other, was stronger1. I fear this will bring additional confusion to the confusion @Lokathor is mentioning2. Footnotes
|
This is not a real distinction.
The parallel between language UB and library UB is actually quite strong. |
I fear this is a separate discussion (so probably better opening a thread on zulip). You're always allowed to look at the implementation of a crate. Simply if you rely on it for soundness (or correctness if you care about that too), then you need to pin the version you audited with the
If you do this, you're breaking the contract of the compiler. If you have language UB, your program is undefined (literally, you can't say what it does) and there's no guarantee on the output of the compiler (it may not terminate, it may crash, it may produce a buggy program, or even a program that seems to "work" until an attacker figures out it doesn't). |
Am I also allowed to look at the implementation of the compiler? What about libstd?
And if you break a library invariant you're breaking the contract of the library. Literally you can't say what the library will do. And there's no guarantee on the output of the library. Both the language and the library provide a contract with the user. There's an exact parallel between them. If I pin the version of the compiler and exploit its implementation details, I can be confident in what it will output. I won't be writing "valid" Rust code anymore because I broke the contract of the language. In the same way if I pin a version of a library and exploit its implementation details, I can be confident in what it will output. I won't be a "valid" user of that library anymore because I broke the contract of the library. |
You are allowed to look at both and theoretically could pin the {rustc, cg backend, llvm/cranelift/gcc} triple, and do whatever the heck you want (including language UB) if you know what the result will be (and that result is desirable), yes. But as you alluded to, you would no longer be a valid user (of either Rust as a language or rustc as a compiler). |
But you never were. Your initial question was when the same user writes 2 crates: You're not writing the compiler, so that's a different situation. To push the example further, the standard library authors are somehow the same (ok not really but close enough, and in particular they have an out-of-band contract) as those of the compiler, which is why the standard library is allowed to use language UBs that normal Rust users can't. (This is probably only tolerated because the actual language UBs are not stable yet.) |
Actually I think std doesn't use UB (or at least we try not to), std fixes some underspecified language nondeterminism like struct layout (which is not UB if you "guess it right"). (Cue Jubilee telling me about unspeakable horrors in std or compiler-builtins.) |
It doesn't, though it 100% could from the perspective of an implementor. It's called implementor's privilege. Whether or not it should from a design perspective of course is a separate thing. (Also, nit, struct layout isn't really underspecified - it's unspecified, intentionally) |
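As a small illustration of "unspecified" layout, the sketch below contrasts a `#[repr(C)]` struct, whose layout is specified, with a default-representation struct, whose field order and padding the compiler is free to choose. The type and function names are made up for this example, and the concrete size assumes a typical target where `u32` is 4-byte aligned and `u16` is 2-byte aligned.

```rust
use std::mem::size_of;

/// Layout is specified: fields in declaration order with C-style padding.
#[repr(C)]
pub struct FixedLayout {
    pub a: u8,
    pub b: u32,
    pub c: u16,
}

/// Layout is intentionally unspecified: the compiler may reorder fields.
pub struct DefaultLayout {
    pub a: u8,
    pub b: u32,
    pub c: u16,
}

/// Returns the sizes of both structs so they can be compared.
pub fn sizes() -> (usize, usize) {
    (size_of::<FixedLayout>(), size_of::<DefaultLayout>())
}
```

Code may rely on the size of `FixedLayout` (12 bytes under the stated alignment assumptions), while for `DefaultLayout` the only safe assumption is that it is large enough to hold its fields.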
No, that was one example of many. My point is only that there is a direct parallel between library UB and language UB. The parallel for The parallel between The parallel between Rust does not support this last use-case. The author of There is no qualitative difference between these two types of UB, only that it is more common to rely on library implementation details than it is to rely on language implementation details. |
(Also, there's a fourth case - |
The difference here is that I think one way of putting it is that undefined behavior is always considered to be a program error, and Someone other than |
The reason I believe the distinction between safety invariant and "soundness invariant"1 matters, is that type invariants only matter when writing unsafe code (otherwise the type system is here), and unsafe (or robust) public APIs are where those invariants may differ, so it's important to distinguish them. But I agree that:
Footnotes
|
That's not an invariant, that's just the precondition of a particular function. I don't think we need to come up with a new term for this very well-established concept. |
They're both part of the standard, but it really depends on what you mean by "part". For C++23, the standard library is defined in section 17. It's roughly analogous to "is libcore part of the rust language?". Clause 17 standardizes the "language support library," which does the rough equivalent of defining what we refer to as "lang items" here in Rust.
Yes. A famous example here is how memory allocation isn't considered observable behavior, so this code:
    #include <new>
    int foo(){
        int *five = new int;
        return 42;
    }
will compile to
    foo():
        mov eax, 42
        ret
(This is also the case if you use malloc instead of new) I don't have any strong opinion about if this has any implications on how Rust defines UB, just figured I'd add some context here. |
Those can be seen as the same thing, it's a matter of style. One is "contracts as types" and the other is "contracts as pre-posts". For example, RefinedRust has annotations in the "contracts as pre-posts" style but is actually implemented with a "contracts as types" style. A concrete example would be:
    /// Randomizes the content of a slice.
    ///
    /// # Safety
    ///
    /// `ptr` must be valid to write for `len` bytes
    pub unsafe fn randomize(ptr: *mut u8, len: usize);
In this example, I'm considering In the "contracts as pre-posts" style, we have a lemma to prove by the implementer and to use by the callers:
In the "contracts as types" style, we have a type in a richer type system:
In particular, the invariant of the occurrence of It's important to note that type occurrences constrained by the safety documentation refine the validity invariant, not the safety invariant (in particular, the safety invariant is the default refinement of the validity invariant). This matters to express the type of
We can unfold the definitions of
If
Indeed we don't. It's just a matter of style, which is why I said "This is somehow a matter of style, so not that important". Some people like to prove properties of their programs using pre-posts, others using types. We don't need a new term, but we need to understand the concept to understand when the parallel between language/library UB and validity/safety invariant does not hold. You can violate the safety invariant without library UB (robust functions) and you can get library UB without violating the safety invariant (unsafe functions). Violating the validity invariant, by contrast, is always language UB (the converse is true when restricting to the type invariant section of the language contract). Similarly, violating the "soundness" invariant is always library UB.
Indeed, if one can split the language specification |
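For readers who want something runnable, here is one possible body for the `randomize` signature quoted above, plus a caller that discharges its precondition. The body is an assumption made for illustration (a fixed-seed xorshift generator, so the output is deterministic rather than truly random), and `random_bytes` is a hypothetical helper. The sketch only shows that "valid to write for `len` bytes" is a precondition of this one function, and that it can be satisfied by memory that is not yet initialized.

```rust
/// Randomizes the content of a slice.
///
/// # Safety
///
/// `ptr` must be valid to write for `len` bytes
pub unsafe fn randomize(ptr: *mut u8, len: usize) {
    // Assumed implementation: xorshift32 with a fixed seed (deterministic).
    let mut state: u32 = 0x9E37_79B9;
    for i in 0..len {
        state ^= state << 13;
        state ^= state >> 17;
        state ^= state << 5;
        // SAFETY: the caller guarantees `ptr` is valid for writes of `len`
        // bytes, and `i < len`.
        unsafe { ptr.add(i).write(state as u8) };
    }
}

/// Hypothetical caller: fills a fresh buffer without initializing it first.
pub fn random_bytes(len: usize) -> Vec<u8> {
    let mut buf: Vec<u8> = Vec::with_capacity(len);
    unsafe {
        // SAFETY: the allocation has capacity for at least `len` bytes, so it
        // is valid for writes even though the bytes are uninitialized.
        randomize(buf.as_mut_ptr(), len);
        // SAFETY: `randomize` has just written all `len` bytes.
        buf.set_len(len);
    }
    buf
}
```

Only writes go through `ptr`, so the precondition asks for less than a `&mut [u8]` argument would, since a reference to a slice would require initialized contents.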
I strongly disagree with this. You are causing a completely unnecessary terminology confusion here. Using "invariant" for "has to be true when this type is passed across API boundaries (but exceptions apply within a function)" is fairly standard (in safety/library invariant), even wikipedia calls this an invariant. We are already stretching the notion of "invariant" by using it for things that have to be true "on every typed load" (in validity/language invariant), but in some sense this is similar to the former use of invariant -- it's something that has to be true for all operations on this type. But calling the precondition of a specific function an invariant is unheard of AFAIK and I will strongly push back against that. We use words that have specific technical meaning for a reason. |
I'm not calling the precondition an invariant. In the "contracts as types" style, there are no preconditions, there are just types. Pre- and post-conditions only exist in the "contracts as pre-posts" style. However, I claim there is a parallel between them, which can be seen with those 2 sentences:
Given you mention the "on every typed load" particularity of the validity invariant, it's also interesting to see that this is also true for the "soundness" invariant. You can annotate your program such that every typed load satisfies the "soundness" invariant (which implies the validity invariant). This gives us the parallel you want:
The difference with the validity and safety invariants, is that the "soundness" invariant respects subtyping (hence can be used to prove the program or library sound in a type system fashion). One never casts a value from T to S if the "soundness" invariant of T does not imply the one of S. While one can do so with the validity and safety invariant. You can cast from |
The difference here, AIUI, is that when @RalfJung or I say «type», we are referring to the syntactic type that is written (or inferred) in the Rust source code. Whereas @ia0 is using «type» to mean roughly the specific-to-an-API-signature semantic refinement of the «validity invariant» which captures all of the conditions for the soundness proof of that API signature. Disclaimer: I worded this a bit authoritatively, but it isn't intended to be final, nor is it the expressed opinion of anyone in the mentioned groups other than myself. @ia0, the work you've put in developing your unsafe mental model document is appreciated, and I do believe it's a useful resource; however, it is not shared vocabulary with T-opsem or the UCG. Neither surface Rust, MIR, nor MiniRust has any concept of refinement (sub)typing below the syntactical (language/validity) typing, so the attempt to use your modeling of API contracts as refinement types to influence the common language, which doesn't utilize such refinement typing, fails to translate properly. I think we've identified the ideal resolution here:
Standard documentation can use “language invariant” and “library invariant” to refer to the (syntactic) type requirement that is required for a typed copy1 or for other safe API2 to be actually safe to use, respectively. And your doc which extends Rust's unsafety model with a more formal reasoning model for managing the unsafety in a program can go into refinement typing for attaching unsafe contracts on top of syntactic types, and how under this model the “library invariant” is just the default refinement for a syntactic type unless otherwise specified. Also keep in mind that the terms discussed here are very much intended to be used as the “surface level” terminology that gets seen by everyone who writes Rust and uses the stdlib documentation. It's important for it to not be wrong so the path to the full details of safely using And FWIW, I think “library invariant (of the syntactic type)” is a perfectly fine semantic for your model to replace with “safety invariant (of the specific flow-dependent type refinement at this point of the execution)” as your model is fully about replacing the syntactic type with the semantic type used for soundness reasoning. Footnotes
|
Just for context, because we got lost in a side discussion. My claim is that there is no parallel between language/library UB and validity/safety invariant:
I'm not claiming that we should restore this parallel (that's a matter of style). I'm just describing the situation where the parallel would hold (either outside the scope of unsafe or using the "soundness" invariant) to help understand that the parallel does not always hold. I'm not sure why this was understood as introducing new terms. Introducing new concepts (those are not new, they are as old as me) doesn't imply introducing new terms (we can refer to them by definition and locally use quoted names for brevity). In particular, it's ok to have unnamed concepts (think of the different concepts of "snow" missing from the English vocabulary). In particular, I'm fine with the language/library invariant terminology as I said above. You don't need the parallel with language/library UB to hold to justify this terminology. The parallel that holds is "X invariant is the type invariant defined by X". This doesn't involve UB. This is the parallel I choose to understand the language/library invariant terminology (in particular by implicitly adding the word "definition"). Note that there are 2 ways to look at a user type: the abstraction defined by the library (library invariant) and the representation defined by the language (language invariant). The parallel that does not hold is "Violating the X invariant implies having X UB", because it holds with X = language but doesn't with X = library. The counter-examples are robust functions, for example
That's a good answer to me. That's along the lines of what I suggested with "There might actually be no confusion if people read the definition of library invariant" source. We accept the risks of advertising the parallel, knowing it doesn't always hold. Note that this is not what Ralf says: "I don't agree; I still think there is a completely fine parallel here" source. This is what worries me, that we would advertise this parallel as a strong one, while we know it has weaknesses.
Those should ideally be the same thing. The type only matters where it's used (i.e. a value of that type is "materialized"). The fact that types "compose" doesn't create new program points where their predicate must hold. This composition is just a powerful proof technique to keep track of invariants and ultimately check all parts of a program in a modular way. In particular, in |
I agree with that statement, that is a good observation. However, I am not particularly bothered by this little asymmetry. Robust functions are pretty rare. I think the symmetry overall still holds up sufficiently well to be useful. |
Sounds good to me. |
When using language and library invariant, I realized I don't know what adjectives to use to qualify the word "value" to designate the fact that the value satisfies the language (resp. library) invariant. In the past I was using "valid" (resp. "safe") since they were going well with "validity" (resp. "safety") invariant. What's the new way now? |
TL;DR: I suggest we introduce only terms for the following:
I suggest that we pick names which are evocative and are designed to highlight the practical distinction between these terms - ie, to highlight the distinction in how these terms affect the It seems like the discussion is going in that direction anyway, but there has been a lot of discussion of finer-grained taxonomies, so I want to advocate that we stick with this simple taxonomy, at least in what we document publicly. I'd suggest a guiding principle for this discussion: We should pick nomenclature that is likely to result in Rust code in the ecosystem which is more correct and secure. As a consequence, we should pick nomenclature that is likely to help the average Rust programmer write correct and secure code. In particular:
Thus, I suggest we introduce only terms for the following:
I suggest that we pick names which are evocative and are designed to highlight the practical distinction between these terms - ie, to highlight the distinction in how these terms affect the |
This thread has become very long and it is unrealistic to expect participants to read it, so here's a summary:
Now to address your points and concerns.
I disagree. Having precise terminology is actually a necessity for "advanced or expert" programmers (or I would say, programmers who care about having formal properties about their programs and language designers who care about having formal properties about their language). That said, I agree with your sentiment that there could be 2 levels of terminology the same way there are 2 levels of Rust (safe and unsafe): one for "average" programmers (those who are fine with dynamic analysis, no analysis, or even don't care about specifying their program) and one for "advanced or expert" programmers (those who want static analysis against their program/library specification).
Yes for that second level of terminology.
I believe Rust is doing a pretty good job here, and that's exactly what the
That's one of those myths that has been debunked in this thread. By the definition of UB, the implementation of a contract doesn't matter, you get UB as soon as you violate the contract (regardless if it's the language contract or a library contract).
This is called the library invariant (assuming you have a safe API, otherwise you need to read the safety documentation). In particular, calling a safe function taking
This is called the language invariant. Footnotes
|
In rust-lang/rfcs#3458, I found it necessary to distinguish between invariants required by the language, and invariants required by user abstraction, as well as between invariants required for safety and invariants merely required for correctness. That RFC presently adheres to the following grammar:
or, to be precise:
...with these instantiations:
I'm not married to these terms — I only mention them because that RFC will be a consumer of the outcome of this discussion, and I'd like to see that whatever terminology we agree upon here can be substituted into that RFC without loss of clarity. I'm not even convinced they're the right terms. The unsafe fields feature is, I think, useful for denoting when a field violates an invariant that isn't insta-opsem-UB (or whatever you want to call it) to violate, and is expressly not to be used for cases that are insta-opsem-UB. @ia0, if we redefine UB like you propose:
...then we need a new term for things that are insta-opsem-UB (i.e., a property of a specific execution). It's a useful category. |
You might be interested in rust-lang/compiler-team#759 if you're not already aware of it. This distinction is indeed an important and common one. I guess it was not mentioned in this issue because we're in the Unsafe Code Guidelines repository and thus correctness is not a concern as long as it doesn't affect soundness (and if it does then it's part of the soundness contract).
That's an interesting question and my personal opinion is that such distinction is much less interesting for the language than it is for libraries (and thus probably doesn't exist). The reason is that when proving soundness of your program, you heavily rely on the correctness of the language. So languages have very low incentive to provide you any weaker contract than correctness. So the only contract the language provides is correctness. |
IMO, someone (maybe just "experts" and not "average programmers") needs to make this distinction in order to be able to explain why it's okay to pin a dependency and then violate one of its safety invariants. Consider an |
One has to understand the difference between a public (or default) contract and an out-of-band contract, which you get either because you are the author of that dependency, or because the author is your friend and told you they won't break you, or because you pin the version and use the implementation as the contract. In other words, you can choose the contract of a library out of the options given to you. For most users, that's the public documentation. If you choose a contract where you don't have library UB, then there's no UB, even if another (in particular the public) contract would be violated.
As written in the RFC, an unsafe field is a field that is mentioned in the library (formerly safety) invariant of the type it is part of. If such a field is public, then violating its contract is library UB (the type author must make sure that the field can be modified within its contract without breaking the library invariant). If such a field is private, then it doesn't matter that it's unsafe with regard to UB. It is unsafe only as a way to help the library author ensure they don't themselves break the library invariant by accident. |
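Since the `unsafe` field syntax from the RFC is not available on stable Rust, the sketch below approximates the idea with a private field and an `unsafe` setter. The `SmallStack` type and its methods are hypothetical and invented for this illustration. The `len` field is mentioned in the library invariant, so any write to it that bypasses the checked API carries a safety obligation.

```rust
/// Hypothetical fixed-capacity stack.
/// Library invariant: `len <= 4`, and `items[..len]` are the live elements.
pub struct SmallStack {
    items: [u32; 4],
    // This field is part of the library invariant; under the RFC it is the
    // kind of field one might mark `unsafe`.
    len: usize,
}

impl SmallStack {
    pub fn new() -> Self {
        SmallStack { items: [0; 4], len: 0 }
    }

    /// Safe push: returns `false` when the stack is full.
    pub fn push(&mut self, value: u32) -> bool {
        if self.len == self.items.len() {
            return false;
        }
        self.items[self.len] = value;
        self.len += 1;
        true
    }

    /// # Safety
    ///
    /// `new_len` must be at most 4, otherwise the library invariant is broken.
    pub unsafe fn set_len(&mut self, new_len: usize) {
        self.len = new_len;
    }

    /// Safe method that relies on the library invariant.
    pub fn top(&self) -> Option<&u32> {
        if self.len == 0 {
            return None;
        }
        // SAFETY: the library invariant gives `0 < len <= 4`, so the index
        // `len - 1` is in bounds.
        Some(unsafe { self.items.get_unchecked(self.len - 1) })
    }
}
```

Writing `len = 7` would not be language UB by itself, since any `usize` satisfies the language invariant, but it would let the safe `top` perform an out-of-bounds read later.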
“Valid” isn't super specific; a value is valid for a given set of constraints, e.g. how we consider if a pointer is valid for reads of a certain size. I would consider calling a value “valid” as highly contextual, with the baseline of course being to be valid for typed copies (i.e. the language invariant). And “safe” is still a good descriptor for values that cannot be used to cause UB with safe functionality, in my opinion. I'm actually now starting to wonder if a three way split, more like ia0 was alluding to, is worth it, into:
Most material would still only need to mention the concepts of language and safety invariants, as the concept of non-default library invariants is both intuitive and only really required in order to discuss the uncommon case of functions which can correctly work with “unsafe values” which don't satisfy the safety invariant (as for But I also expect that it should generally be clear when “library invariant” refers to the default or a specific point in time. And in an ideal scenario, “unsafe values” shouldn't ever need to be visible, e.g. how when temporarily breaking So in summary, I talked myself in a circle back to:
|
To be fully pedantic, when it comes to validity invariants, we should be saying that a byte sequence satisfies the validity invariant. To speak of a value presupposes that the byte sequence represents a value, which already implies that the validity invariant holds.
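The byte-sequence view can be illustrated with `bool`, whose validity invariant admits exactly the byte patterns `0x00` and `0x01`. The helper below is a hypothetical function written for this example: it checks the byte sequence first, and only then treats it as a value.

```rust
/// Interprets a byte as a `bool` only if the byte satisfies `bool`'s
/// validity invariant (the only valid patterns are 0x00 and 0x01).
pub fn bool_from_byte(byte: u8) -> Option<bool> {
    match byte {
        // SAFETY: the byte is 0x00 or 0x01, which are exactly the bit
        // patterns that represent a `bool`.
        0 | 1 => Some(unsafe { std::mem::transmute::<u8, bool>(byte) }),
        // Any other byte does not represent a `bool` value at all;
        // transmuting it would be language UB.
        _ => None,
    }
}
```

Before the check there is only a byte; asking whether "the value" is valid only makes sense once the byte is known to represent a `bool`.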
Fully agreed, thanks for saying this. We've definitely been distracted upthread, let's focus back on the main question. @jswrenn I am not entirely sure what you mean by "correctness invariant" in contrast to "safety invariant"? Do you simply mean preconditions/postconditions? I continue to strongly hold the position that we should not use the term "invariant" for those. It goes against all common practice in verification. We will confuse everyone who has ever taken a course in formal verification if we go with terminology like that, and we'll just needlessly diverge from established practice in the field. @CAD97 I don't fully understand what distinction you are trying to draw between "library invariant" and "safety invariant", but I suspect it might also be preconditions? |
I'm convinced. :-) |
Sounds good, here's the thread. That's a similar thread to 2 notions of types created to avoid polluting this thread. Both matter for the original question, but it's simpler to have them on Zulip. (Feel free to hide as off-topic those messages that you consider off-topic, assuming you can, otherwise tell me to do it.) |
It's been more than six years since my blog post that introduced these terms, and I think it is clear at this point that the concept of these two kinds of invariants is here to stay. We should start using this concept in stable docs, so that we can write clear docs.
The main thing that gives me pause here is that I am not happy with the terminology. Personally I regret calling them "validity invariant" and "safety invariant"; I now think that "language invariant" and "library invariant" are better terms. They go along nicely with the terms "language UB" and "library UB" that we have also been using already. I don't feel comfortable pushing for using these terms in stable docs until we have this resolved. We had #95 open for a while to find better terms, but that got closed as part of our backlog cleanup.
@digama0 was also recently confused by the fact that we call these "invariant" when they are not strictly speaking invariants that hold on every step of execution -- they are more invariants that hold about values at given points in the program. In verification we also talk e.g. about "loop invariants" that only hold at the beginning (or some other fixed point) during each loop, not all the time during the loop, and at least in Iris our "invariants" can be temporarily broken while they are being "opened" (and closed again within the same atomic step). The work on the anti-frame rule also uses the term "invariant" for things that are true between the operations on a data structure, but not during an operation. So IMO it is justified to use the term "invariant".
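A small sketch of an invariant that holds between operations but not during them: the hypothetical `SortedVec` below (a name made up for this example) breaks its "sorted" library invariant in the middle of `insert` and restores it before returning, so no caller can ever observe the broken state.

```rust
/// Hypothetical wrapper with a library invariant: the elements are sorted.
pub struct SortedVec(Vec<i32>);

impl SortedVec {
    pub fn new() -> Self {
        SortedVec(Vec::new())
    }

    pub fn insert(&mut self, value: i32) {
        // The invariant may be temporarily broken here: the new element is
        // appended at the end regardless of its order.
        self.0.push(value);
        // The invariant is restored before control returns to the caller.
        self.0.sort_unstable();
    }

    /// Relies on the invariant: binary search needs sorted input.
    pub fn contains(&self, value: i32) -> bool {
        self.0.binary_search(&value).is_ok()
    }

    pub fn as_slice(&self) -> &[i32] {
        &self.0
    }
}
```

This matches the "loop invariant" reading: the property is required at API boundaries, not at every step of execution.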
@rust-lang/opsem I'd like to commit to using the terms "language invariant" and "library invariant", and then start using those in the reference and stable docs. What do you think?