Stabilize having the concept of "validity invariant" and "safety invariant"? Under which name? #539

Open
RalfJung opened this issue Oct 18, 2024 · 90 comments

Comments

@RalfJung
Member

It's been more than six years since my blog post that introduced these terms, and I think it is clear at this point that the concept of these two kinds of invariants is here to stay. We should start using this concept in stable docs, so that we can write clear docs.

The main thing that gives me pause here is that I am not happy with the terminology. Personally I regret calling them "validity invariant" and "safety invariant"; I now think that "language invariant" and "library invariant" are better terms. They go along nicely with the terms "language UB" and "library UB" that we have also been using already. I don't feel comfortable pushing for using these terms in stable docs until we have this resolved. We had #95 open for a while to find better terms, but that got closed as part of our backlog cleanup.

@digama0 was also recently confused by the fact that we call these "invariant" when they are not, strictly speaking, invariants that hold on every step of execution -- they are more invariants that hold about values at given points in the program. In verification we also talk e.g. about "loop invariants" that only hold at the beginning (or some other fixed point) of each loop iteration, not all the time during the loop, and at least in Iris our "invariants" can be temporarily broken while they are being "opened" (and closed again within the same atomic step). The work on the anti-frame rule also uses the term "invariant" for things that are true between the operations on a data structure, but not during an operation. So IMO it is justified to use the term "invariant".
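A minimal sketch of an invariant that holds between operations but not during one. The `Even` type and its methods are hypothetical names made up for this illustration; they do not come from any real crate.

```rust
/// An integer whose invariant is "the value is even" (hypothetical type).
pub struct Even(u32);

impl Even {
    /// Returns `None` for odd inputs, so callers cannot build an odd `Even`.
    pub fn new(value: u32) -> Option<Self> {
        if value % 2 == 0 { Some(Even(value)) } else { None }
    }

    pub fn get(&self) -> u32 {
        self.0
    }

    /// Advances to the next even number in two steps (wrapping on overflow).
    pub fn bump(&mut self) {
        self.0 = self.0.wrapping_add(1);
        // The value is odd here: the invariant is temporarily broken, which is
        // fine because no other code can observe `self` in this state.
        debug_assert!(self.0 % 2 == 1);
        self.0 = self.0.wrapping_add(1);
        // The invariant holds again before the method returns.
        debug_assert!(self.0 % 2 == 0);
    }
}
```

Callers only ever see `Even` before or after `bump`, never in the middle, which is the sense of "invariant" used for data structures above.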

@rust-lang/opsem I'd like to commit to using the terms "language invariant" and "library invariant", and then start using those in the reference and stable docs. What do you think?

@arielb1

arielb1 commented Dec 18, 2024

@rust-lang/opsem I'd like to commit to using the terms "language invariant" and "library invariant", and then start using those in the reference and stable docs. What do you think?

I don't like these. Many library invariants are not safety invariants (e.g., if you have a non-deterministic Hash function, you are very likely to not find your items in a HashMap, but nothing unsafe will happen), and some language invariants are safety invariants (e.g., if you have aliasing &mut references, then as long as you don't use them nothing bad will happen).

So there's at least:

  1. abstract machine invariant - if you violate this demons might fly out of your nose, immediately do not pass go, even if you would immediately afterward call _exit(0) without calling any intermediate library function.
  2. library non-safety invariant - if you violate this, the library might do something safe but undesirable (e.g. a non-deterministic hash on a HashMap)
  3. library "validity" invariant - if you violate this and call a library function on that object, you might have demons flying out of your nose. But if you don't call a library function on that object, nothing bad will happen. e.g. I believe non-UTF8 &str falls here.
  4. safety invariants - if you violate this, you can write more or less carefully-crafted safe code that will violate an abstract machine invariant [but if you don't write that particular safe code, the program will do something very well-defined and possibly useful].
    4.1. including fully library safety invariants like an Rc with an invalid reference count - which should not cause UB as long as it doesn't unexpectedly reach 0.
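A minimal sketch of case 2 in the list above: an invariant that no `unsafe` code relies on, so breaking it can give wrong answers but not undefined behavior. The `Sorted` type and the `from_vec_unchecked_order` escape hatch are hypothetical names for this illustration.

```rust
/// A vector that is supposed to stay sorted. Nothing `unsafe` depends on
/// that, so this is a correctness invariant, not a safety invariant.
pub struct Sorted(Vec<u32>);

impl Sorted {
    /// Establishes the invariant by sorting.
    pub fn new(mut items: Vec<u32>) -> Self {
        items.sort_unstable();
        Sorted(items)
    }

    /// Hypothetical escape hatch that skips sorting. It is a safe function:
    /// misuse can only make `contains` return meaningless results.
    pub fn from_vec_unchecked_order(items: Vec<u32>) -> Self {
        Sorted(items)
    }

    /// Binary search; on unsorted data the result is unspecified but safe.
    pub fn contains(&self, x: u32) -> bool {
        self.0.binary_search(&x).is_ok()
    }
}
```

With `Sorted::new(vec![3, 1, 2])`, `contains(2)` is true and `contains(5)` is false. With `from_vec_unchecked_order`, the answers may be wrong, but the program stays well-defined.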

I think that "safety invariant" is a good name, but "validity invariant" is not a particularly good name for "if you violate this demons will pop out of your nose". Though maybe "language invariant" is a good name (so language invariant/safety invariant).

@Lokathor
Contributor

I think any two names that do not share the first letter are better than two names that do share the same first letter.

So I vote for validity/safety over lang/library. But also any two other words of differing first letter would be fine too: lang and safety, for example.

@RalfJung
Member Author

RalfJung commented Dec 18, 2024

Many library invariants are not safety invariants (e.g., if you have a non-deterministic Hash function, you are very likely to not find your items in a HashMap, but nothing unsafe will happen), and some language invariants are safety invariants (e.g., if you have aliasing &mut references, then as long as you don't use them nothing bad will happen).

Well, you've made your own choices here by calling these "library invariants" and "language invariants"; choices I would not agree with. ;)

So I vote for validity/safety over lang/library. But also any two other words of differing first letter would be fine too: lang and safety, for example.

So do you propose we also call it language UB and safety UB? That doesn't make a lot of sense, and IMO it'd be good to align those terms with the invariants that cause the respective UB when violated.

I don't recall us ever abbreviating validity/safety invariant with the first letters, so I don't think it's very important that they have different first letters.

@Lokathor
Contributor

But there isn't actually safety UB is there? All safety invariants are there because when violated they can allow otherwise fine code to end up breaking a language rule somewhere later. As far as I'm aware, there's just the one kind of UB.

@RalfJung
Member Author

RalfJung commented Dec 18, 2024

Library UB is real and not the same as language UB. For instance, it is library UB to call any &str method with a non-UTF-8 str, but it is language UB only for some operations that happen to rely on the UTF-8 invariant in a way that triggers language UB (i.e., a Miri error) when the invariant is violated. Future library updates may change existing methods so that more (or fewer) of them have language UB when invoked with a non-UTF-8 str.
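A minimal sketch of how a library contract of this kind is usually set up. The `Ascii` wrapper is a hypothetical type invented for this illustration. The only real API it relies on is `std::str::from_utf8_unchecked`, whose documented safety requirement is that the bytes are valid UTF-8.

```rust
/// A byte buffer whose library invariant is "every byte is ASCII"
/// (hypothetical type for this sketch).
pub struct Ascii(Vec<u8>);

impl Ascii {
    /// Checked constructor: safe code can only obtain a valid `Ascii`.
    pub fn new(bytes: Vec<u8>) -> Option<Self> {
        if bytes.is_ascii() { Some(Ascii(bytes)) } else { None }
    }

    /// # Safety
    ///
    /// Every byte in `bytes` must be ASCII. Violating this breaks the
    /// contract of this library, whether or not anything goes wrong
    /// immediately.
    pub unsafe fn new_unchecked(bytes: Vec<u8>) -> Self {
        Ascii(bytes)
    }

    /// Relies on the library invariant to skip UTF-8 validation.
    pub fn as_str(&self) -> &str {
        // SAFETY: ASCII bytes are always valid UTF-8, and the invariant of
        // `Ascii` says all bytes are ASCII.
        unsafe { std::str::from_utf8_unchecked(&self.0) }
    }
}
```

Calling `new_unchecked` with non-ASCII bytes violates the contract of this sketch even if no method happens to misbehave in the current version, which mirrors the `&str` situation described above.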

@zachs18

zachs18 commented Dec 18, 2024

As far as I'm aware, there's just the one kind of UB.

They go along nicely with the terms "language UB" and "library UB" that we have also been using already.

(From the OP)

The UB you are referring to is also called "language UB", and "library UB" is used in contrast to mean basically "a violation of library invariants/preconditions/etc that could lead to (language) UB depending on usage/library implementation details/etc", which is useful to have a succinct term for.

@digama0

digama0 commented Dec 18, 2024

So do you propose we also call it language UB and safety UB? That doesn't make a lot of sense, and IMO it'd be good to align those terms with the invariants that cause the respective UB when violated.

I would call them "language UB" (or just UB as I think there is only one thing that should get that name) and "safety violation" respectively.

@Lokathor
Contributor

Yeah, see, I'm in agreement with @digama0. If you have a &str and unsafely set the bytes to not be UTF-8, then that could lead to UB, but it's not actually UB the moment the bytes are wrong.

I don't think we should call that state of having invalid bytes "library UB". I don't think we should call any other similar situations "library UB" either. I do like the term "safety violation".

Otherwise you have to start explaining to people that sometimes there's something called UB that they're allowed to do anyway, as long as they fix the situation before anything "really bad" happens. And that's a very easy thing for people to misunderstand, or only halfway remember months after the fact.

@RalfJung
Member Author

Violating such library preconditions is often called UB, and I think it makes sense. UB basically means "you violated the contract of this API and now all guarantees are void". Language UB is about the contract between the programmer and the compiler (where the "API" is the language itself); library UB is about the contract between the library author and the client author.

@Lokathor
Contributor

Right, except I'm telling you that what people often end up hearing is, "I'm allowed to do some types of UB".

I think that's bad, and if we're going to deliberately pick terms, we should deliberately pick terms that don't let people think that "sometimes UB is okay". Because that's a real thing people have said to me, and I've had to stop and explain to them what's "actually UB", and that they really can't do it, not even sneakily.

So call them lang invariants and lib invariants if you want, but I strongly suggest that you don't call them that if you want to draw a connection to "lang UB" and "lib UB", because those are not good terms: I have seen them do harm to people's understanding of Rust on multiple occasions over the years.

@RalfJung
Member Author

You're not allowed to do library UB with other people's libraries either. I have seen this analogy work pretty well, too, so I wouldn't be so quick to discard it.

@Lokathor
Contributor

Lokathor commented Dec 18, 2024

I know it does work for some people, but if it works for some people and not for others, then I don't think it entirely works.

Of course you should not do any library UB, but the fact that you can actually do it without complaints from Miri (e.g., https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=919353ecbc1b830d3f8daa6f3180888e) gives some people wrong impressions about what the terms even mean. I don't think we should design our terms just for experts; we should also design them for the uninformed and forgetful. There are a lot more uninformed and forgetful programmers than expert ones.

@ia0

ia0 commented Dec 18, 2024

I agree with @Lokathor that people do library UB and the reason is what I call "robust functions" in the unsafe mental model and what you call "super safe functions". As a library author you can document that you accept more values than what the safety invariant says. Typical examples are non-UTF-8 str and pinned &mut T. You're not breaking the safety contract with the library author, just the safety invariant of the syntactical type. But that's just the default interpretation, not the one used for soundness.

@RalfJung
Member Author

@Lokathor Nothing works for all people, it's a matter of tradeoffs. 🤷

I have registered your opinion, let's see what other people say.

@ia0 Library UB is a violation of the contract between the library author and the client. If the library author documents that it's okay to call some &str function with non-UTF-8 data, then doing such a call does not violate the library contract, and therefore this is not library UB.

@Diggsey

Diggsey commented Dec 18, 2024

I think we all agree that if we use the term UB it should be for situations that always indicate incorrect code. ie. it's never ok for a program to exhibit something we have decided to call UB.

It seems the argument is more over whether violating library invariants should always indicate a problem with the program, or whether there are some situations where violating library invariants is acceptable - if there are such cases then we shouldn't call it UB.

I tend to agree with @RalfJung that it should be considered UB as long as we:

  • Are clear that library UB is still UB and always indicates incorrect code.
  • Are clear that Miri only detects a subset of UB (presumably with the goal of detecting all language UB).

There are still some problems though:

  • It adds a lot more kinds of UB that can't be detected via tools like Miri.

  • Library UB is determined by the library's public interface with the world - that is, when we are working inside the library the concept of library UB for itself doesn't really exist. For example, a library may say that calling foo(0) is UB, but may call foo(0) inside its own implementation, using the knowledge that (in the current version of the library, under these conditions) it will not cause language UB.

    It's unclear to me how to encode the conditions under which a function call is subject to library invariants.

    If I have two related crates (eg. foo-core and foo) can I have foo bypass library invariants of foo-core because it is aware of foo-core's implementation details?

    We'd potentially need some notion of "open" or "closed" when depending on libraries to determine whether calls into such a library are subject to its invariants.

    As for why this would crop up - as a library author I probably want to have a simplified contract with the outside world - i.e., instead of saying it's UB to call foo(0) on a Tuesday, I might want to just document that it's always UB to call foo(0). It makes it simpler for users to understand and it gives me more flexibility to change the implementation.
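The foo(0) scenario can be sketched roughly as follows. The body of `foo` and the helper `bar` are assumptions made up for this illustration. Only the shape matters: a simple public contract paired with an internal caller that relies on what the current implementation happens to do.

```rust
/// # Safety
///
/// Public contract (kept deliberately simple): `n` must not be 0.
pub unsafe fn foo(n: u32) -> u32 {
    // Implementation detail of this version: 0 is tolerated and maps to 0.
    // The documentation does not promise this, so it may change.
    100u32.checked_div(n).unwrap_or(0)
}

/// Lives in the same library as `foo` and knows its current implementation.
pub fn bar() -> u32 {
    // Outside the library this call would violate the documented contract.
    // Inside, the author relies on the implementation detail noted above.
    unsafe { foo(0) }
}
```

An external caller writing `unsafe { foo(0) }` would be relying on behavior the documentation never promised, which is exactly the "open" versus "closed" question raised above.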

@ia0

ia0 commented Dec 18, 2024

Library UB is a violation of the contract between the library author and the client. If the library author documents that it's okay to call some &str function with non-UTF-8 data, then doing such a call does not violate the library contract, and therefore this is not library UB.

Ah ok my bad. I thought the analogy between "validity invariant" and "language UB" on one side and "safety invariant" and "library UB" on the other was stronger [1]. I fear this will add to the confusion @Lokathor is mentioning [2].

Footnotes

  1. "language UB" is when the "validity invariant" of a type is broken, but "library UB" is not when the "safety invariant" of a type is broken, but when the "user-defined invariant" of a type is broken (which is the "safety invariant" by default if there's no additional documentation).

  2. Namely that "library UB" is not as bad as "language UB" (because breaking the "user-defined invariant" may not immediately break soundness), and most people will confuse both concepts and use the properties of one (library UB not a big problem) to justify the other (language UB, which is actually a big problem).

@Diggsey

Diggsey commented Dec 18, 2024

Namely that "library UB" is not as bad as "language UB" (because breaking the "user-defined invariant" may not immediately break soundness)

This is not a real distinction.

  • Causing language UB may not immediately cause problems. But it might do on compiler upgrade or optimization change, etc.
  • Causing library UB may not immediately cause problems. But it might do on library upgrade.

The parallel between language UB and library UB is actually quite strong.

@ia0

ia0 commented Dec 18, 2024

  • If I have two related crates (eg. foo-core and foo) can I have foo bypass library invariants of foo-core because it is aware of foo-core's implementation details?

I fear this is a separate discussion (so it's probably better to open a thread on Zulip). You're always allowed to look at the implementation of a crate. It's simply that if you rely on it for soundness (or correctness, if you care about that too), then you need to pin the version you audited with the = Cargo version requirement.

  • Causing language UB may not immediately cause problems. But it might do on compiler upgrade or optimization change, etc.

If you do this, you're breaking the contract of the compiler. If you have language UB, your program is undefined (literally, you can't say what it does) and there's no guarantee on the output of the compiler (it may not terminate, it may crash, it may produce a buggy program, or even a program that seems to "work" until an attacker figures out it doesn't).

@Diggsey

Diggsey commented Dec 18, 2024

You're always allowed to look at the implementation of a crate.

Am I also allowed to look at the implementation of the compiler? What about libstd?

If you do this, you're breaking the contract of the compiler. If you have language UB, your program is undefined (literally, you can't say what it does) and there's no guarantee on the output of the compiler.

And if you break a library invariant you're breaking the contract of the library. Literally you can't say what the library will do. And there's no guarantee on the output of the library.

Both the language and the library provide a contract with the user. There's an exact parallel between them.

If I pin the version of the compiler and exploit its implementation details, I can be confident in what it will output. I won't be writing "valid" Rust code anymore because I broke the contract of the language.

In the same way if I pin a version of a library and exploit its implementation details, I can be confident in what it will output. I won't be a "valid" user of that library anymore because I broke the contract of the library.

@chorman0773
Contributor

Am I also allowed to look at the implementation of the compiler? What about libstd?

You are allowed to look at both and theoretically could pin the {rustc, cg backend, llvm/cranelift/gcc} triple, and do whatever the heck you want (including language UB) if you know what the result will be (and that result is desirable), yes. But as you alluded to, you would no longer be a valid user (of either Rust as a language or rustc as a compiler).

@ia0

ia0 commented Dec 18, 2024

I won't be a "valid" user of that library anymore

But you never were. Your initial question was when the same user writes 2 crates: foo-core and foo. The crate foo can look at the implementation of foo-core and pin it. Other users of foo-core need to follow the public API of that crate (unless they have an out-of-band contract with the crate author).

You're not writing the compiler, so that's a different situation.

To push the example further, the standard library authors are somehow the same (ok not really but close enough, and in particular they have an out-of-band contract) as those of the compiler, which is why the standard library is allowed to use language UBs that normal Rust users can't. (This is probably only tolerated because the actual language UBs are not stable yet.)

@digama0

digama0 commented Dec 18, 2024

To push the example further, the standard library authors are somehow the same (ok not really but close enough, and in particular they have an out-of-band contract) as those of the compiler, which is why the standard library is allowed to use language UBs that normal Rust users can't. (This is probably only tolerated because the actual language UBs are not stable yet.)

Actually I think std doesn't use UB (or at least we try not to), std fixes some underspecified language nondeterminism like struct layout (which is not UB if you "guess it right"). (Cue Jubilee telling me about unspeakable horrors in std or compiler-builtins.)

@chorman0773
Contributor

It doesn't, though it 100% could from the perspective of an implementor. It's called implementor's privilege. Whether or not it should from a design perspective of course is a separate thing.

(Also, nit, struct layout isn't really underspecified - it's unspecified, intentionally)

@Diggsey

Diggsey commented Dec 18, 2024

Your initial question was when the same user writes 2 crates: foo-core and foo.

No, that was one example of many.

My point is only that there is a direct parallel between library UB and language UB.

The parallel for foo-core and foo would be the compiler and std.

The parallel between foo and foo-fork would be Rust and CrabLang.

The parallel between foo and "user who pins foo=1.2 and then relies on implementation details" would be Rust and "user who pins Rust to version X because they know they can get it to generate a very specific assembly sequence which is needed for their embedded use-case".

Rust does not support this last use-case. The author of foo also does not support this use-case.

There is no qualitative difference between these two types of UB, only that it is more common to rely on library implementation details than it is to rely on language implementation details.

@chorman0773
Contributor

chorman0773 commented Dec 18, 2024

(Also, there's a fourth case - foo and foo-rewrite which copies the public api of foo but not the internals)

@arielb1

arielb1 commented Dec 18, 2024

The parallel for foo-core and foo would be the compiler and std.

The difference here is that std is supposed to never cause undefined behavior in the nasal demons sense. std might depend on particular compiler implementation details, but the "nasal demons" rules apply to it just as they apply to everyone else.

I think one way of putting it is that undefined behavior is always considered to be a program error, and std is supposed to never cause unexpected program errors.

Someone other than std (e.g. random embedded user) can write code that exhibits undefined behavior but that a specific version of the compiler compiles into something that is useful to them. This is of course unsupported.

@ia0

ia0 commented Jan 8, 2025

I don't think the extra fine-grained distinctions you are trying to introduce here are useful.

The reason I believe the distinction between safety invariant and "soundness invariant" [1] matters is that type invariants only matter when writing unsafe code (otherwise the type system enforces them), and unsafe (or robust) public APIs are where those invariants may differ, so it's important to distinguish them. But I agree that:

  • This is somehow a matter of style, so not that important. I'm fine with "library invariant", I understand it as "default library invariant" or "library definition invariant". The word library is good, just not enough.
  • There might actually be no confusion if people read the definition of "library invariant" instead of (or before) looking at the parallel. In particular, as long as they understand it's a type invariant and not a type occurrence invariant, which is enough to disambiguate any possible issue.

Footnotes

  1. Invariant for a given type occurrence in a public API. This is the combination of the safety invariant of the type and the safety documentation of the type occurrence. Violating the soundness invariant is library UB, the same way violating the validity invariant is language UB.

@RalfJung
Member Author

RalfJung commented Jan 9, 2025

Invariant for a given type occurrence in a public API.

That's not an invariant, that's just the precondition of a particular function. I don't think we need to come up with a new term for this very well-established concept.

@steveklabnik
Member

steveklabnik commented Jan 9, 2025

By the way, isn't the C++ standard library part of the C++ language?

They're both part of the standard, but it really depends on what you mean by "part". For C++23, the standard library is defined in section 17.

It's roughly analogous to "is libcore part of the rust language?". Clause 17 standardizes the "language support library," which does the rough equivalent of defining what we refer to as "lang items" here in Rust.

Aren't implementations allowed to look at symbol names and if they recognize something from the standard library optimize based on the specification of this symbol by the language?

Yes. A famous example here is how memory allocation isn't considered observable behavior, so this code:

#include <new>  

int foo(){
    int *five = new int;
    return 42;
}

will compile to

foo():
  mov eax, 42
  ret

(This is also the case if you use malloc instead of new)
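For comparison, a rough Rust analogue of the same shape is sketched below; it is an addition for illustration rather than part of the original comment. The documentation of `std::alloc::GlobalAlloc` states that programs must not rely on allocations actually happening, so an optimizer may remove the `Box` here. No claim is made about the exact machine code any particular compiler version emits.

```rust
pub fn foo() -> i32 {
    // This heap allocation is never observed by the rest of the program,
    // so an optimizing build is permitted to remove it entirely.
    let _five = Box::new(5);
    42
}
```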

I don't have any strong opinion about if this has any implications on how Rust defines UB, just figured I'd add some context here.

@ia0

ia0 commented Jan 10, 2025

That's not an invariant, that's just the precondition of a particular function.

Those can be seen as the same thing, it's a matter of style. One is "contracts as types" and the other is "contracts as pre-posts". For example, RefinedRust has annotations in the "contracts as pre-posts" style but is actually implemented with a "contracts as types" style. A concrete example would be:

/// Randomizes the content of a slice.
///
/// # Safety
///
/// `ptr` must be valid to write for `len` bytes
pub unsafe fn randomize(ptr: *mut u8, len: usize);

In this example, I'm considering ptr constrained by the safety documentation and len unconstrained (which is the default without safety documentation).
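The declaration above has no body. A runnable sketch of one possible implementation follows, assuming a fixed-seed xorshift generator as a stand-in for a real random source. The safe wrapper `randomize_slice` is a hypothetical helper showing a caller that discharges the safety condition.

```rust
/// Randomizes the content of a slice.
///
/// # Safety
///
/// `ptr` must be valid to write for `len` bytes.
pub unsafe fn randomize(ptr: *mut u8, len: usize) {
    // Fixed-seed xorshift: deterministic filler, not a real RNG.
    let mut state: u32 = 0x9E37_79B9;
    for i in 0..len {
        state ^= state << 13;
        state ^= state >> 17;
        state ^= state << 5;
        // SAFETY: the caller promised that `ptr` is valid for writes of
        // `len` bytes, and `i < len`.
        unsafe { ptr.add(i).write(state as u8) };
    }
}

/// Safe caller: a `&mut [u8]` already carries the guarantee `randomize` needs.
pub fn randomize_slice(buf: &mut [u8]) {
    // SAFETY: `buf` is valid for writes of `buf.len()` bytes.
    unsafe { randomize(buf.as_mut_ptr(), buf.len()) }
}
```

Note that `len` needs no safety comment of its own: any `usize` is acceptable as long as the condition on `ptr` holds for it, which matches the reading of `len` as unconstrained.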

In the "contracts as pre-posts" style, we have a lemma to prove by the implementer and to use by the callers:

randomize: ∀ (p ∈ ℤ) (n ∈ ℤ)
    { valid<*mut u8>(p) ∧ valid_write(p, n) ∧ valid<usize>(n) ∧ safe<usize>(n) }
    randomize(p, n)
    { valid_write(p, n) }

In the "contracts as types" style, we have a type in a richer type system:

randomize: Π {p: ℤ} {n: ℤ}
    ( ptr: valid<*mut u8> | repr<*mut u8>(ptr, p) ∧ valid_write(p, n) )
    ( len: valid<usize> | repr<usize>(len, n) ∧ safe<usize>(n) )
    ->
    ( _: safe<()> | valid_write(p, n) )

In particular, the invariant of the occurrence of *mut u8 in the signature of randomize() is not the safety invariant, but something stronger, because it is constrained by being valid for write, while the invariant of the occurrence of usize is the safety invariant, because it's unconstrained and thus uses the default type invariant which is the safety invariant.

It's important to note that type occurrences constrained by the safety documentation refine the validity invariant, not the safety invariant (in particular, the safety invariant is the default refinement of the validity invariant). This matters to express the type of Pin::get_unchecked_mut() (in this richer type system):

get_unchecked_mut: Π {p: ℤ} ∀ (T: Type)  // quantified over semantic types
    ( self: valid<Pin<&mut T>> | repr<Pin<&mut T>>(self, p) ∧ safe<Pin<&mut T>>(p) )
    ->
    ( res: valid<&mut T> | repr<&mut T>(res, p) ∧ pinned<T>(p) )

We can unfold the definitions of valid, repr, and safe on Pin<&mut T> to confirm that we have the identity function (and this function is just a cast):

get_unchecked_mut: Π {p: ℤ} ∀ (T: Type)
    ( self: valid<&mut T> | repr<&mut T>(self, p) ∧ pinned<T>(p) )
    ->
    ( res: valid<&mut T> | repr<&mut T>(res, p) ∧ pinned<T>(p) )

If pinned<T> is different than safe<T>, then the type invariant of the result is weaker than the safety invariant (but still stronger than the validity invariant). This means you cannot call safe functions with the result value, you have to call robust functions.
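In ordinary Rust, the point about the result of `Pin::get_unchecked_mut` can be sketched like this. `Counter`, `new_pinned`, and `bump` are hypothetical names; `Pin::get_unchecked_mut`, `Box::pin`, and `PhantomPinned` are the real standard library items.

```rust
use std::marker::PhantomPinned;
use std::pin::Pin;

/// A value that must not move once pinned (hypothetical example type).
pub struct Counter {
    value: u32,
    _pin: PhantomPinned,
}

impl Counter {
    pub fn new_pinned(value: u32) -> Pin<Box<Counter>> {
        Box::pin(Counter { value, _pin: PhantomPinned })
    }
}

pub fn bump(counter: Pin<&mut Counter>) -> u32 {
    // SAFETY: we only update a field in place and never move the `Counter`
    // out of the reference.
    let inner: &mut Counter = unsafe { counter.get_unchecked_mut() };
    // `inner` has the type `&mut Counter`, but handing it to arbitrary safe
    // code (for example `std::mem::swap`) could move the pinned value, so it
    // must only be used in ways that respect pinning.
    inner.value += 1;
    inner.value
}
```

The `&mut Counter` obtained here is the same pointer as the pinned reference; what changes is which operations the surrounding code is allowed to perform with it.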

I don't think we need to come up with a new term for this very well-established concept.

Indeed we don't. It's just a matter of style, which is why I said "This is somehow a matter of style, so not that important". Some people like to prove properties of their programs using pre-posts, others using types.

We don't need a new term, but we need to understand the concept to understand when the parallel between language/library UB and validity/safety invariant does not hold. You can violate the safety invariant without library UB (robust functions) and you can get library UB without violating the safety invariant (unsafe functions). By contrast, violating the validity invariant is always language UB (the converse is true when restricting to the type invariant section of the language contract). Similarly, violating the "soundness" invariant is always library UB.

They're both part of the standard, but it really depends on what you mean by "part".

Indeed, if one can split the language specification Full into a smaller language specification Lang and a standard library specification Lib, such that it's possible to implement Lang, it's possible to implement Lib on top of Lang, and it's possible to implement Full on top of Lang and Lib, then we're also splitting the notion of Full UB into Lang UB and Lib UB. I agree this makes the distinction rather blurry and essentially implementation-defined (the Full implementation may decide to implement it all at once or split, in which case Lang doesn't know about Lib and can't optimize specifically for it, only generically as for any other function by looking at the function implementation and "reversing a specification" to use at call-sites).

@RalfJung
Member Author

RalfJung commented Jan 11, 2025

Those can be seen as the same thing, it's a matter of style

I strongly disagree with this. You are causing a completely unnecessary terminology confusion here. Using "invariant" for "has to be true when this type is passed across API boundaries (but exceptions apply within a function)" is fairly standard (in safety/library invariant), even Wikipedia calls this an invariant. We are already stretching the notion of "invariant" by using it for things that have to be true "on every typed load" (in validity/language invariant), but in some sense this is similar to the former use of invariant -- it's something that has to be true for all operations on this type.

But calling the precondition of a specific function an invariant is unheard of AFAIK and I will strongly push back against that. We use words that have specific technical meaning for a reason.

@ia0

ia0 commented Jan 11, 2025

But calling the precondition of a specific function an invariant is unheard of AFAIK

I'm not calling the precondition an invariant. In the "contracts as types" style, there are no preconditions, there are just types. Pre- and post-conditions only exist in the "contracts as pre-posts" style. However, I claim there is a parallel between them, which can be seen with those 2 sentences:

  • "contracts as pre-posts": Violating the preconditions of a public API is library UB.
  • "contracts as types": Violating the documented types ("soundness" invariant) of a public API is library UB.

Given you mention the "on every typed load" particularity of the validity invariant, it's also interesting to see that this is also true for the "soundness" invariant. You can annotate your program such that every typed load satisfies the "soundness" invariant (which implies the validity invariant). This gives us the parallel you want:

  • Violating the documented types ("soundness" invariant) of a public API is library UB.
  • Violating the "erased" types (validity invariant) of a language operation is language UB.

The difference with the validity and safety invariants is that the "soundness" invariant respects subtyping (hence can be used to prove the program or library sound in a type system fashion). One never casts a value from T to S if the "soundness" invariant of T does not imply the one of S. While one can do so with the validity and safety invariants. You can cast from u8 to bool if the value is 0 or 1. With the "soundness" invariant, such a type cast is the identity function, it's the same "soundness" invariant on both sides, only the backing type is different. You use subtyping when you need to ignore information, for example when calling a safe fn(bool) with true, you subtype the "soundness" invariant of the constant (which is the singleton true) to the safety invariant (which contains both true and false).
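The u8-to-bool cast mentioned above can be sketched as a checked conversion. `u8_to_bool` is a hypothetical helper; the `transmute` is only reached once the value is known to be 0 or 1, which is what `bool` requires of its bit pattern.

```rust
/// Converts a byte to `bool` only when that is a valid `bool` bit pattern.
pub fn u8_to_bool(x: u8) -> Option<bool> {
    match x {
        // SAFETY: `bool` must be represented as 0 or 1, and this arm has
        // just established that `x` is one of those two values.
        0 | 1 => Some(unsafe { std::mem::transmute::<u8, bool>(x) }),
        // Any other byte would violate the validity invariant of `bool`.
        _ => None,
    }
}
```

The `match` plays the role of the proof that the value already satisfies the stronger condition; the cast itself does no work at runtime.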

@CAD97

CAD97 commented Jan 12, 2025

The difference here, AIUI, is that when @RalfJung or I say «type», we are referring to the syntactic type that is written (or inferred) in the Rust source code. Whereas @ia0 is using «type» to mean roughly the specific-to-an-API-signature semantic refinement of the «validity invariant» which captures all of the conditions for the soundness proof of that API signature.

Warning

Disclaimer: I worded this a bit authoritatively, but it isn't intended to be final, nor is it the expressed opinion of anyone in the mentioned groups other than myself.

@ia0: the work you've put in developing your unsafe mental model document is appreciated, and I do believe it's a useful resource; however, it is not shared vocabulary with T-opsem or the UCG. Neither surface Rust, MIR, nor MiniRust has any concept of refinement (sub)typing below the syntactical (language/validity) typing, so the attempt to use your modeling of API contracts as refinement types to influence the common language, which doesn't utilize such refinement typing, fails to translate properly.

I think we've identified the ideal resolution here:

I'm fine with "library invariant", I understand it as "default library invariant" or "library definition invariant".

Standard documentation can use “language invariant” and “library invariant” to refer to the (syntactic) type requirement that is required for a typed copy[1] or for other safe API[2] to be actually safe to use, respectively. And your doc which extends Rust's unsafety model with a more formal reasoning model for managing the unsafety in a program can go into refinement typing for attaching unsafe contracts on top of syntactic types, and how under this model the “library invariant” is just the default refinement for a syntactic type unless otherwise specified.


Also keep in mind that the terms discussed here are very much intended to be used as the “surface level” terminology that gets seen by everyone who writes Rust and uses the stdlib documentation. It's important for it to not be wrong so the path to the full details of safely using unsafe API is reasonably simple, but it's also fully acceptable for there to be edge cases that require further details and aren't perfectly represented by the base level split between “true because the language requires it” and “true because the library code requires it.”

And FWIW, I think “library invariant (of the syntactic type)” is a perfectly fine semantic for your model to replace with “safety invariant (of the specific flow-dependent type refinement at this point of the execution)” as your model is fully about replacing the syntactic type with the semantic type used for soundness reasoning.

Footnotes

  1. Unless otherwise specified. You may claim that language validity cannot be opened, and it can't directly be opened by the user, but e.g. MaybeUninit and MaybeDangling both are instances of the language providing syntactic types which open specific invariants required for typed copies by the language. The split is not “can't be opened” versus “can be opened” but rather “… by writing careful enough code.”

    The «validity invariant» as used by the UCG was always just the simple byte representation requirement for typed copies in the AM. You've reused the term in your unsafe model doc as the fully inviolable syntactic type invariant, but this is still a novel usage and not how the UCG has utilized the term.

    Plus, Rust always reserves the right to make any current UB into defined behavior. It is not endorsed reasoning to use the validity invariant as justification for doing anything but a typed copy; you are only allowed to do further operations because their requirements are part of your API’s safety requirements.

  2. Including other potentially implicit operations, e.g. dereferencing or casting. Relaxing requirements to give more code defined behavior is always allowed (as long as said code actually has defined behavior).

@ia0

ia0 commented Jan 13, 2025

to influence the common language

Just for context, because we got lost in a side discussion. My claim is that there is no parallel between language/library UB and validity/safety invariant:

  • "I'm pretty sure this is going to be a big source of confusion if we start drawing parallels between language/library UB and validity/safety invariants." source
  • "this parallel between validity/safety invariant and language/library UB does not hold" source
  • "Agree that library UB is not about safety invariant" source
  • "There is no parallel between language/library UB and validity/safety invariant. More precisely, this parallel only holds outside the scope of unsafe." source
  • "We don't need a new term, but we need to understand the concept to understand when the parallel between language/library UB and validity/safety invariant does not hold." source

I'm not claiming that we should restore this parallel (that's a matter of style). I'm just describing the situation where the parallel would hold (either outside the scope of unsafe or using the "soundness" invariant) to help understand that the parallel does not always hold. I'm not sure why this was understood as introducing new terms. Introducing new concepts (those are not new, they are as old as me) doesn't imply introducing new terms (we can refer to them by definition and locally use quoted names for brevity). In particular, it's ok to have unnamed concepts (think of the different concepts of "snow" missing from the English vocabulary).

In particular, I'm fine with the language/library invariant terminology as I said above. You don't need the parallel with language/library UB to hold to justify this terminology.

The parallel that holds is "X invariant is the type invariant defined by X". This doesn't involve UB. This is the parallel I choose to understand the language/library invariant terminology (in particular by implicitly adding the word "definition"). Note that there's 2 ways to look at a user type: the abstraction defined by the library (library invariant) and the representation defined by the language (language invariant).

The parallel that does not hold is "Violating the X invariant implies having X UB", because it holds with X = language but doesn't with X = library. The counter-examples are robust functions, for example str::as_bytes(), assuming it documented that the input doesn't need to be UTF-8. A viable argument here would be that we should not allow robust functions. I'm completely in favor of this, and it is why I don't like APIs like Pin::get_unchecked_mut(), which returns something that you can only use with robust functions.

it's also fully acceptable for there to be edge cases

That's a good answer to me. That's along the lines of what I suggested with "There might actually be no confusion if people read the definition of library invariant" source. We accept the risks of advertising the parallel, knowing it doesn't always hold. Note that this is not what Ralf says: "I don't agree; I still think there is a completely fine parallel here" source. This is what worries me, that we would advertise this parallel as a strong one, while we know it has weaknesses.

  1. The «validity invariant» as used by the UCG was always just the simple byte representation requirement for typed copies in the AM. You've reused the term in your unsafe model doc as the fully inviolable syntactic type invariant, but this is still a novel usage and not how the UCG has utilized the term.

Those should ideally be the same thing. The type only matters where it's used (i.e. a value of that type is "materialized"). The fact that types "compose" doesn't create new program points where their predicate must hold. This composition is just a powerful proof technique to keep track of invariants and ultimately check all parts of a program in a modular way.

In particular, in MaybeUninit<T>, the validity invariant of T does not matter at this point (only its layout and ABI do). The validity invariant of T matters when calling assume_init() because that's the type returned (a value of that type is "materialized"). Similarly for Vec<T>, the validity invariant of T only matters when actually pushing or accessing a T, otherwise it doesn't matter at all.
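As a rough sketch of the MaybeUninit point (the function name `init_flag` is hypothetical, made up for illustration): the slot may hold any byte, and the validity invariant of bool only has to hold at the moment assume_init() returns a bool.

```rust
use std::mem::MaybeUninit;

/// Hypothetical helper: stores a byte in a `MaybeUninit<bool>` slot and only
/// materializes a `bool` when the stored byte is valid for `bool`.
pub fn init_flag(flag_byte: u8) -> Option<bool> {
    // No `bool` exists yet, so the validity invariant of `bool` is not
    // required to hold for the contents of `slot`; only layout matters.
    let mut slot: MaybeUninit<bool> = MaybeUninit::uninit();

    // SAFETY: the pointer is valid for a one-byte write, and writing through
    // `*mut u8` performs no typed copy at type `bool`.
    unsafe { slot.as_mut_ptr().cast::<u8>().write(flag_byte) };

    if flag_byte <= 1 {
        // SAFETY: the slot holds 0x00 or 0x01, which is valid for `bool`.
        Some(unsafe { slot.assume_init() })
    } else {
        // Calling `assume_init()` here would materialize an invalid `bool`.
        None
    }
}
```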

@RalfJung
Member Author

RalfJung commented Mar 5, 2025

The parallel that does not hold is "Violating the X invariant implies having X UB", because it holds with X = language but doesn't with X = library.

I agree with that statement, that is a good observation. However, I am not particularly bothered by this little asymmetry. Robust functions are pretty rare. I think the symmetry overall still holds up sufficiently well to be useful.

@ia0

ia0 commented Mar 5, 2025

However, I am not particularly bothered by this little asymmetry. Robust functions are pretty rare. I think the symmetry overall still holds up sufficiently well to be useful.

Sounds good to me.

@ia0

ia0 commented Mar 13, 2025

When using language and library invariant, I realized I don't know what adjectives to use to qualify the word "value" to designate the fact that the value satisfies the language (resp. library) invariant. In the past I was using "valid" (resp. "safe") since they were going well with "validity" (resp. "safety") invariant. What's the new way now?

@joshlf

joshlf commented Mar 13, 2025

TL;DR: I suggest we introduce only terms for the following:

  • The type of invariant which needs to hold when calling into an API (e.g., strs need to be UTF-8). Calling an API while such an invariant is violated may exhibit UB.
  • The type of invariant which immediately exhibits UB upon violation (e.g., bools need to be only 0x00 or 0x01).

I suggest that we pick names which are evocative and are designed to highlight the practical distinction between these terms - ie, to highlight the distinction in how these terms affect the unsafe code you write.

It seems like the discussion is going in that direction anyway, but there has been a lot of discussion of finer-grained taxonomies, so I want to advocate that we stick with this simple taxonomy, at least in what we document publicly.


I'd suggest a guiding principle for this discussion: We should pick nomenclature that is likely to result in Rust code in the ecosystem which is more correct and secure. As a consequence, we should pick nomenclature that is likely to help the average Rust programmer write correct and secure code.

In particular:

  • Advanced and expert programmers have enough context to know what's going on regardless of nomenclature
  • The nomenclature need not capture details which won't help the average programmer write better code
  • The nomenclature should be as simple as possible to aid in understanding, and to make it clear what code should and should not be written
  • We should not attempt to capture the distinction between what currently exhibits UB and what might exhibit UB depending on how an API (library or language) is implemented
    • This distinction distracts from the distinction between code which is always correct/secure and code which is not necessarily always correct/secure. This is the distinction we care the most about.

Thus, I suggest we introduce only terms for the following:

  • The type of invariant which needs to hold when calling into an API (e.g., strs need to be UTF-8). Calling an API while such an invariant is violated may exhibit UB.
  • The type of invariant which immediately exhibits UB upon violation (e.g., bools need to be only 0x00 or 0x01).

I suggest that we pick names which are evocative and are designed to highlight the practical distinction between these terms - ie, to highlight the distinction in how these terms affect the unsafe code you write.
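For the first kind of invariant, a small sketch of the str example might look as follows. The helper name `checked_str` is hypothetical; `std::str::from_utf8` is the standard checked constructor. The bytes are inspected as plain &[u8], and a &str is only handed to other APIs once the UTF-8 requirement is known to hold.

```rust
/// Hypothetical helper: builds a `&str` only when the bytes satisfy the
/// UTF-8 requirement that safe `str` APIs rely on.
pub fn checked_str(bytes: &[u8]) -> Option<&str> {
    // `from_utf8` validates the bytes; on failure no `&str` is ever created,
    // so no API is called while the invariant is violated.
    std::str::from_utf8(bytes).ok()
}
```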

@ia0

ia0 commented Mar 13, 2025

This thread has become very long and it is unrealistic to expect participants to read it, so here's a summary:

  • Undefined Behavior (UB) is a violation of a "safety" contract. There are 2 main authors of such "safety" contracts: the language and libraries (including the standard library). One can say language UB or library UB depending on the contract being violated.
  • For each type, there are 2 invariants: the one defined by the language and the one defined by the library (this role is taken by the language for language types)[1].

Now to address your points and concerns.

Advanced and expert programmers have enough context to know what's going on regardless of nomenclature

I disagree. Having precise terminology is actually a necessity for "advanced or expert" programmers (or I would say, programmers who care about having formal properties about their programs and language designers who care about having formal properties about their language).

That said, I agree with your sentiment that there could be 2 levels of terminology the same way there are 2 levels of Rust (safe and unsafe): one for "average" programmers (those who are fine with dynamic analysis, no analysis, or even don't care about specifying their program) and one for "advanced or expert" programmers (those who want static analysis against their program/library specification).

The nomenclature need not capture details which won't help the average programmer write better code

Yes for that second level of terminology.

The nomenclature should be as simple as possible to aid in understanding, and to make it clear what code should and should not be written

I believe Rust is doing a pretty good job here, and that's exactly what the unsafe keyword is for. The "average" programmer should not write unsafe code (they obviously can, but they have been properly warned).

We should not attempt to capture the distinction between what currently exhibits UB and what might exhibit UB depending on how an API (library or language) is implemented

That's one of those myths that has been debunked in this thread. By the definition of UB, the implementation of a contract doesn't matter, you get UB as soon as you violate the contract (regardless if it's the language contract or a library contract).

The type of invariant which needs to hold when calling into an API (e.g., strs need to be UTF-8). Calling an API while such an invariant is violated may exhibit UB.

This is called the library invariant (assuming you have a safe API, otherwise you need to read the safety documentation). In particular, calling a safe function taking str (without robustness documentation) with a value that violates this invariant is always UB (not "may exhibit UB" as you state, which is a bit inconsistent with your last goal).

The type of invariant which immediately exhibits UB upon violation (e.g., bools need to be only 0x00 or 0x01).

This is called the language invariant.

Footnotes

  1. Actually, I'd prefer to state this as "the representation (formerly validity) invariant and the author (formerly safety) invariant", but this wasn't discussed in this thread yet and might be too late to bring up.

@jswrenn
Member

jswrenn commented Mar 13, 2025

In rust-lang/rfcs#3458, I found it necessary to distinguish between invariants required by the language, and invariants required by user abstraction, as well as between invariants required for safety and invariants merely required for correctness.

That RFC presently adheres to the following grammar:

(language|library)? (correctness|safety)? invariant

or, to be precise:

(language|library( correctness| safety)?)? invariant

...with these instantiations:

  • language invariant
    An invariant assumed to be true by the language; e.g., that a bool is 0 or 1.
  • library invariant
    An invariant assumed to be true by a user abstraction. Either a:
    • library safety invariant
      An invariant assumed to be true by a library, necessary for memory safety.
    • library correctness invariant
      An invariant assumed to be true by a library, not necessary for memory safety.
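To illustrate the two library sub-categories, here is a minimal sketch with two hypothetical types (`Sorted` and `Cursor` are made up for this example and do not come from the RFC). Breaking the first invariant would only give wrong answers; breaking the second would reach out-of-bounds memory through `get_unchecked`.

```rust
/// Hypothetical type with a library *correctness* invariant: the vector is sorted.
/// If it were violated, `contains` could return wrong results, but no UB follows.
pub struct Sorted(Vec<u32>);

impl Sorted {
    pub fn new(mut values: Vec<u32>) -> Self {
        values.sort_unstable();
        Sorted(values)
    }

    pub fn contains(&self, x: u32) -> bool {
        self.0.binary_search(&x).is_ok()
    }
}

/// Hypothetical type with a library *safety* invariant: `idx < data.len()`.
pub struct Cursor {
    data: Vec<u32>,
    idx: usize,
}

impl Cursor {
    pub fn new(data: Vec<u32>, idx: usize) -> Option<Self> {
        if idx < data.len() {
            Some(Cursor { data, idx })
        } else {
            None
        }
    }

    pub fn current(&self) -> u32 {
        // SAFETY: the fields are private and `new` checks `idx < data.len()`,
        // so the safety invariant holds for every `Cursor`.
        unsafe { *self.data.get_unchecked(self.idx) }
    }
}
```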

I'm not married to these terms — I only mention them because that RFC will be a consumer of the outcome of this discussion, and I'd like to see that whatever terminology we agree upon here can be substituted into that RFC without loss of clarity.

I'm not even convinced they're the right terms. The unsafe fields feature is, I think, useful for denoting when a field violates an invariant that isn't insta-opsem-UB (or whatever you want to call it) to violate, and is expressly not to be used for cases that are insta-opsem-UB. @ia0, if we redefine UB like you propose:

We should not attempt to capture the distinction between what currently exhibits UB and what might exhibit UB depending on how an API (library or language) is implemented

That's one of those myths that has been debunked in this thread. By the definition of UB, the implementation of a contract doesn't matter, you get UB as soon as you violate the contract (regardless if it's the language contract or a library contract).

...then we need a new term for things that are insta-opsem-UB (i.e., a property of a specific execution). It's a useful category.

@ia0

ia0 commented Mar 13, 2025

as well as between invariants required for safety and invariants merely required for correctness

You might be interested in rust-lang/compiler-team#759 if you're not already aware of it. This distinction is indeed an important and common one. I guess it was not mentioned in this issue because we're in the Unsafe Code Guidelines repository and thus correctness is not a concern as long as it doesn't affect soundness (and if it does then it's part of the soundness contract).

I haven't had any need to distinguish between language safety and correctness invariants — if such a distinction exists

That's an interesting question and my personal opinion is that such distinction is much less interesting for the language than it is for libraries (and thus probably doesn't exist). The reason is that when proving soundness of your program, you heavily rely on the correctness of the language. So languages have very low incentive to provide you any weaker contract than correctness. So the only contract the language provides is correctness.

@joshlf

joshlf commented Mar 13, 2025

We should not attempt to capture the distinction between what currently exhibits UB and what might exhibit UB depending on how an API (library or language) is implemented

That's one of those myths that has been debunked in this thread. By the definition of UB, the implementation of a contract doesn't matter, you get UB as soon as you violate the contract (regardless if it's the language contract or a library contract).

IMO, someone (maybe just "experts" and not "average programmers") needs to make this distinction in order to be able to explain why it's okay to pin a dependency and then violate one of its safety invariants. Consider an EvenU8 type with the publicly-documented library safety invariant that it only stores even values. Imagine I have a specific version of that library that doesn't expose any API which, when invoked on an odd value, exhibits opsem-UB (to borrow @jswrenn's term). I can happily pin to that version of the library, violate its library safety invariant (which, IIUC, you would refer to as "library UB" or similar), and justify that my code is sound. But what terminology do I use in this justification? IIUC, you'd need a separate concept of a specific execution exhibiting opsem-UB as distinct from the "library UB" concept of merely violating the library's safety invariant.
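A rough sketch of what such an EvenU8 might look like. The thread only names the type and its invariant; every method below (`new`, `new_unchecked`, `get`, `half`) is an assumption made for illustration.

```rust
/// Hypothetical `EvenU8`: the documented library safety invariant is
/// "the stored value is even".
pub struct EvenU8(u8);

impl EvenU8 {
    /// Checked constructor: upholds the library invariant.
    pub fn new(value: u8) -> Option<Self> {
        if value % 2 == 0 {
            Some(EvenU8(value))
        } else {
            None
        }
    }

    /// # Safety
    /// `value` must be even. Passing an odd value violates the public
    /// contract of this library, even though any `u8` is valid at the
    /// language level.
    pub unsafe fn new_unchecked(value: u8) -> Self {
        EvenU8(value)
    }

    pub fn get(&self) -> u8 {
        self.0
    }

    /// In this particular version, no operation on the value relies on
    /// evenness in a way that could reach language-level UB.
    pub fn half(&self) -> u8 {
        self.0 / 2
    }
}
```

Under the public contract, an odd value passed to `new_unchecked` is a contract violation; whether any specific execution misbehaves depends on the pinned implementation, which is the distinction being asked about here.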

@ia0

ia0 commented Mar 13, 2025

@joshlf

explain why it's okay to pin a dependency and then violate one of its safety invariants

One has to understand the difference between a public (or default) contract and an out-of-band contract, which you get either because you are the author of that dependency, or because the author is your friend and told you they won't break you, or because you pin the version and use the implementation as the contract. In other words, you can choose the contract of a library out of the options given to you. For most users, that's the public documentation. If you choose a contract where you don't have library UB, then there's no UB, even if another (in particular the public) contract would be violated.

@jswrenn

The unsafe fields feature is, I think, useful for denoting when a field violates an invariant that isn't insta-opsem-UB (or whatever you want to call it) to violate, and is expressly not to be used for cases that are insta-opsem-UB

As written in the RFC, an unsafe field is a field that is mentioned in the library (formerly safety) invariant of the type it is part of. If such a field is public, then violating its contract is library UB (the type author must make sure that the field can be modified within its contract without breaking the library invariant). If such a field is private, then it doesn't matter that it's unsafe with regard to UB. It is unsafe only as a way to help the library author ensure they don't themselves break the library invariant by accident.

@CAD97

CAD97 commented Mar 13, 2025

“Valid” isn't super specific; a value is valid for a given set of constraints, e.g. how we consider if a pointer is valid for reads of a certain size. I would consider calling a value “valid” as highly contextual, with the baseline of course being to be valid for typed copies (i.e. the language invariant). And “safe” is still a good descriptor for values that cannot be used to cause UB with safe functionality, in my opinion.

I'm actually now starting to wonder if a three way split, more like ia0 was alluding to, is worth it, into:

  • “language invariant,” the validity requirement of typed copies, causes effectively immediate UB when violated;
  • “library invariant,” the validity requirements imposed by library definitions, whether based on syntactic type or functions refining those; and
  • “safety invariant,” the default library invariant upheld by and for safe code.

Most material would still only need to mention the concepts of language and safety invariants, as the concept of non-default library invariants is both intuitive and only really required in order to discuss the uncommon case of functions which can correctly work with “unsafe values” which don't satisfy the safety invariant (as for unsafe fn adding restrictions, we call it safety requirements).

But I also expect that it should generally be clear when “library invariant” refers to the default or a specific point in time. And in an ideal scenario, “unsafe values” shouldn't ever need to be visible, e.g. how when temporarily breaking str's UTF-8 invariant, this is correctly done in the domain of &mut [u8] instead.
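A minimal sketch of doing the temporary breakage in the byte domain. The function `reverse_twice` is hypothetical, and it uses an owned Vec<u8> rather than &mut [u8] purely to stay in safe code; the idea is the same: the intermediate state is never visible as a str.

```rust
/// Hypothetical example: manipulates text as raw bytes, so the intermediate
/// state (possibly invalid UTF-8) never exists as a `str` or `String`.
pub fn reverse_twice(text: String) -> String {
    // Leave the `String` domain: a `Vec<u8>` has no UTF-8 invariant.
    let mut bytes: Vec<u8> = text.into_bytes();

    // After one reversal, multi-byte characters are scrambled, so the bytes
    // may not be valid UTF-8. That is fine for a `Vec<u8>`.
    bytes.reverse();

    // The second reversal restores the original byte sequence.
    bytes.reverse();

    // Re-enter the `String` domain through the checked constructor.
    String::from_utf8(bytes).expect("bytes were restored to valid UTF-8")
}
```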

So in summary, I talked myself in a circle back to:

  • “language invariant” and “[default] library invariant” are the main two concepts.
  • “Safety invariant” will still stick around in the ecosystem as the intuitive term for the "unsafe contract" of safe API.
  • Highly pedantically correct material may and perhaps should be careful to distinguish the “default library invariant” and the “effective library invariant” of specific signatures and/or points of execution. But the latter should avoid ever being the unqualified “library invariant,” as the unqualified invariant is the default invariant.

@RalfJung
Member Author

@ia0

I realized I don't know what adjectives to use to qualify the word "value" to designate the fact that the value satisfies the language (resp. library) invariant.

To be fully pedantic, when it comes to validity invariants, we should be saying that a byte sequence satisfies the validity invariant. To speak of a value presupposes that the byte sequence represents a value, which already implies that the validity invariant holds.

@joshlf

TL;DR: I suggest we introduce only terms for the following:

Fully agreed, thanks for saying this. We've definitely been distracted upthread, let's focus back on the main question.

@jswrenn I am not entirely sure what you mean by "correctness invariant" in contrast to "safety invariant"? Do you simply mean preconditions/postconditions? I continue to strongly hold the position that we should not use the term "invariant" for those. It goes against all common practice in verification. We will confuse everyone who has ever taken a course in formal verification if we go with terminology like that, and we'll just needlessly diverge from established practice in the field.

@CAD97 I don't fully understand what distinction you are trying to draw between "library invariant" and "safety invariant", but I suspect it might also be preconditions?

@jswrenn
Copy link
Member

jswrenn commented Mar 20, 2025

I continue to strongly hold the position that we should not use the term "invariant" for those. It goes against all common practice in verification. We will confuse everyone who has ever taken a course in formal verification if we go with terminology like that, and we'll just needlessly diverge from established practice in the field.

I'm convinced. :-)

@RalfJung
Member Author

RalfJung commented Mar 21, 2025

This thread has gotten hopelessly derailed. Maybe we can resolve the question of whether we need to invent a new word for contracts / preconditions and postconditions off-thread, on Zulip or so, and keep this thread focused on the original question, which @joshlf helpfully summarized above.

@ia0

ia0 commented Mar 21, 2025

Maybe we can resolve the question on Zulip

Sounds good, here's the thread. That's a similar thread to 2 notions of types created to avoid polluting this thread. Both matter for the original question, but it's simpler to have them on Zulip.

(Feel free to hide as off-topic those messages that you consider off-topic, assuming you can, otherwise tell me to do it.)
