Commit f60bc3a

Auto merge of #44505 - nikomatsakis:lotsa-comments, r=steveklabnik
rework the README.md for rustc and add other readmes

OK, so, long ago I committed to the idea of trying to write some high-level documentation for rustc. This has proved to be much harder for me to get done than I thought it would! This PR is far from as complete as I had hoped, but I wanted to open it so that people can give me feedback on the conventions that it establishes. If this seems like a good way forward, we can land it and I will open an issue with a good check-list of things to write (and try to take down some of them myself).

Here are the conventions I established on which I would like feedback.

**Use README.md files**. First off, I'm aiming to keep most of the high-level docs in `README.md` files, rather than entries on forge. My thought is that such files are (a) more discoverable than forge and (b) closer to the code, and hence can be edited in a single PR. However, since they are not *in the code*, they will naturally get out of date, so the intention is to focus on the highest-level details, which are least likely to bitrot. I've included a few examples of common functions and so forth, but never tried to (e.g.) exhaustively list the names of functions and so forth.
  - I would like to use the tidy scripts to try and check that these do not go out of date. Future work.

**librustc/README.md as the main entrypoint.** This seems like the most natural place people will look first. It lays out how the crates are structured and **is intended** to give pointers to the main data structures of the compiler (I didn't update that yet; the existing material is terribly dated).

**A glossary listing abbreviations and things.** It's much harder to read code if you don't know what some obscure set of letters like `infcx` stands for.

**Major modules each have their own README.md that documents the high-level idea.** For example, I wrote some stuff about `hir` and `ty`. Both of them have many missing topics, but I think that is roughly the level of depth that would be good. The idea is to give people a "feeling" for what the code does.

What is missing primarily here is lots of content. =) Here are some things I'd like to see:

- A description of what a QUERY is and how to define one
- Some comments for `librustc/ty/maps.rs`
- An overview of how compilation proceeds now (i.e., the hybrid demand-driven and forward model) and how we would like to see it going in the future (all demand-driven)
- Some coverage of how incremental will work under red-green
- An updated list of the major IRs in use of the compiler (AST, HIR, TypeckTables, MIR) and major bits of interesting code (typeck, borrowck, etc)
- More advice on how to use `x.py`, or at least pointers to that
- Good choice for `config.toml`
- How to use `RUST_LOG` and other debugging flags (e.g., `-Zverbose`, `-Ztreat-err-as-bug`)
- Helpful conventions for `debug!` statement formatting

cc @rust-lang/compiler @mgattozzi
2 parents 325ba23 + 638958b commit f60bc3a

File tree

20 files changed

+2571
-1757
lines changed

Diff for: src/librustc/README.md

+185-156
Large diffs are not rendered by default.

Diff for: src/librustc/hir/README.md

+119
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,119 @@
1+
# Introduction to the HIR
2+
3+
The HIR -- "High-level IR" -- is the primary IR used in most of
4+
rustc. It is a desugared version of the "abstract syntax tree" (AST)
5+
that is generated after parsing, macro expansion, and name resolution
6+
have completed. Many parts of HIR resemble Rust surface syntax quite
7+
closely, with the exception that some of Rust's expression forms have
8+
been desugared away (as an example, `for` loops are converted into a
9+
`loop` and do not appear in the HIR).
10+
11+
This README covers the main concepts of the HIR.
12+
13+
### Out-of-band storage and the `Crate` type
14+
15+
The top-level data-structure in the HIR is the `Crate`, which stores
16+
the contents of the crate currently being compiled (we only ever
17+
construct HIR for the current crate). Whereas in the AST the crate
18+
data structure basically just contains the root module, the HIR
19+
`Crate` structure contains a number of maps and other things that
20+
serve to organize the content of the crate for easier access.
21+
22+
For example, the contents of individual items (e.g., modules,
23+
functions, traits, impls, etc) in the HIR are not immediately
24+
accessible in the parents. So, for example, if you had a module item `foo`
25+
containing a function `bar()`:
26+
27+
```
mod foo {
    fn bar() { }
}
```
32+
33+
Then in the HIR the representation of module `foo` (the `Mod`
34+
struct) would have only the **`ItemId`** `I` of `bar()`. To get the
35+
details of the function `bar()`, we would look up `I` in the
36+
`items` map.
37+
38+
One nice result from this representation is that one can iterate
39+
over all items in the crate by iterating over the key-value pairs
40+
in these maps (without the need to trawl through the IR in total).
41+
There are similar maps for things like trait items and impl items,
42+
as well as "bodies" (explained below).
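
The out-of-band layout described above can be sketched with a toy model. This is only an illustration under stated assumptions: `ItemId`, `Item`, `ItemKind`, `Crate`, `build_example`, and `child_names` here are simplified, hypothetical stand-ins and not the real definitions in `hir`.

```rust
use std::collections::BTreeMap;

// Hypothetical, simplified stand-ins for the HIR types (not the rustc definitions).
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct ItemId(u32);

#[derive(Debug)]
enum ItemKind {
    // A module stores only the ids of its children, never their contents.
    Mod(Vec<ItemId>),
    Fn,
}

#[derive(Debug)]
struct Item {
    name: &'static str,
    kind: ItemKind,
}

// The crate owns every item in one flat map keyed by id.
struct Crate {
    items: BTreeMap<ItemId, Item>,
}

// Models `mod foo { fn bar() { } }`.
fn build_example() -> Crate {
    let mut items = BTreeMap::new();
    items.insert(ItemId(0), Item { name: "foo", kind: ItemKind::Mod(vec![ItemId(1)]) });
    items.insert(ItemId(1), Item { name: "bar", kind: ItemKind::Fn });
    Crate { items }
}

// Resolve the children of a module by looking each id up in the map.
fn child_names(krate: &Crate, module: ItemId) -> Vec<&'static str> {
    match krate.items.get(&module) {
        Some(Item { kind: ItemKind::Mod(children), .. }) => children
            .iter()
            .filter_map(|id| krate.items.get(id))
            .map(|item| item.name)
            .collect(),
        _ => Vec::new(),
    }
}

fn main() {
    let krate = build_example();
    // Visit every item without walking the module tree.
    for (id, item) in &krate.items {
        println!("{:?} => {} ({:?})", id, item.name, item.kind);
    }
    println!("children of foo: {:?}", child_names(&krate, ItemId(0)));
}
```

The flat map is what makes "iterate over all items" a plain loop, while the module itself stays small because it holds ids only.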
43+
44+
The other reason to set up the representation this way is for better
45+
integration with incremental compilation. This way, if you gain access
46+
to a `&hir::Item` (e.g. for the mod `foo`), you do not immediately
47+
gain access to the contents of the function `bar()`. Instead, you only
48+
gain access to the **id** for `bar()`, and you must invoke some
49+
function to look up the contents of `bar()` given its id; this gives us
50+
a chance to observe that you accessed the data for `bar()` and record
51+
the dependency.
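
A minimal sketch of why id-based access helps dependency tracking, assuming a toy store whose single accessor logs every read. `TrackedItems`, `item`, `recorded_reads`, and `example` are hypothetical names for illustration; the real compiler's dependency tracking is considerably more involved.

```rust
use std::cell::RefCell;
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical toy id type (not the rustc definition).
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct ItemId(u32);

struct TrackedItems {
    items: BTreeMap<ItemId, String>,
    // Ids whose contents have been requested so far.
    reads: RefCell<BTreeSet<ItemId>>,
}

impl TrackedItems {
    fn new(items: BTreeMap<ItemId, String>) -> Self {
        TrackedItems { items, reads: RefCell::new(BTreeSet::new()) }
    }

    // The only way to reach an item's contents; every call is recorded.
    fn item(&self, id: ItemId) -> Option<&str> {
        self.reads.borrow_mut().insert(id);
        self.items.get(&id).map(|s| s.as_str())
    }

    // The ids requested so far, in ascending order.
    fn recorded_reads(&self) -> Vec<ItemId> {
        self.reads.borrow().iter().copied().collect()
    }
}

fn example() -> TrackedItems {
    let mut items = BTreeMap::new();
    items.insert(ItemId(0), "mod foo".to_string());
    items.insert(ItemId(1), "fn bar() { }".to_string());
    TrackedItems::new(items)
}

fn main() {
    let tracked = example();
    // Holding the id alone records nothing; reading the contents does.
    let bar = ItemId(1);
    println!("reads before: {:?}", tracked.recorded_reads());
    println!("contents: {:?}", tracked.item(bar));
    println!("reads after: {:?}", tracked.recorded_reads());
}
```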
52+
53+
### Identifiers in the HIR
54+
55+
Most of the code that has to deal with things in HIR tends not to
56+
carry around references into the HIR, but rather to carry around
57+
*identifier numbers* (or just "ids"). Right now, you will find four
58+
sorts of identifiers in active use:
59+
60+
- `DefId`, which primarily names "definitions" or top-level items.
  - You can think of a `DefId` as being shorthand for a very explicit
    and complete path, like `std::collections::HashMap`. However,
    these paths are able to name things that are not nameable in
    normal Rust (e.g., impls), and they also include extra information
    about the crate (such as its version number, as two versions of
    the same crate can co-exist).
  - A `DefId` really consists of two parts, a `CrateNum` (which
    identifies the crate) and a `DefIndex` (which indexes into a list
    of items that is maintained per crate).
- `HirId`, which combines the index of a particular item with an
  offset within that item.
  - The key point of a `HirId` is that it is *relative* to some item (which is named
    via a `DefId`).
- `BodyId`, which is an absolute identifier that refers to a specific
  body (definition of a function or constant) in the crate. It is currently
  effectively a "newtype'd" `NodeId`.
- `NodeId`, which is an absolute id that identifies a single node in the HIR tree.
  - While these are still in common use, **they are being slowly phased out**.
  - Since they are absolute within the crate, adding a new node
    anywhere in the tree causes the node-ids of all subsequent code in
    the crate to change. This is terrible for incremental compilation,
    as you can perhaps imagine.
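
To make the difference between absolute and item-relative ids concrete, here is a small sketch. The struct layouts and the `first_node_ids` helper are simplifying assumptions for illustration, not the compiler's actual definitions or numbering scheme.

```rust
// Hypothetical, simplified id types (not the rustc definitions).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct CrateNum(u32);

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct DefIndex(u32);

// A `DefId` pairs the crate with an index into that crate's list of items.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct DefId {
    krate: CrateNum,
    index: DefIndex,
}

// A `HirId` is an offset relative to its owning item.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct HirId {
    owner: DefIndex,
    local_id: u32,
}

// `node_counts[i]` is the number of HIR nodes in item `i`. For the first node
// of each item, return its absolute (NodeId-style) number and its relative id.
fn first_node_ids(node_counts: &[u32]) -> Vec<(u32, HirId)> {
    let mut next_absolute = 0;
    let mut ids = Vec::new();
    for (i, &count) in node_counts.iter().enumerate() {
        ids.push((next_absolute, HirId { owner: DefIndex(i as u32), local_id: 0 }));
        next_absolute += count;
    }
    ids
}

fn main() {
    let def_id = DefId { krate: CrateNum(0), index: DefIndex(1) };
    println!("{:?}", def_id);

    // Item 0 grows from 3 nodes to 4; item 1 is untouched.
    let before = first_node_ids(&[3, 2]);
    let after = first_node_ids(&[4, 2]);
    println!("absolute id of item 1: {} -> {}", before[1].0, after[1].0);
    println!("relative id of item 1: {:?} -> {:?}", before[1].1, after[1].1);
}
```

In this toy numbering, editing item 0 shifts the absolute number of every node in item 1, while the relative ids in item 1 stay the same.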
83+
84+
### HIR Map
85+
86+
Most of the time when you are working with the HIR, you will do so via
87+
the **HIR Map**, accessible in the tcx via `tcx.hir` (and defined in
88+
the `hir::map` module). The HIR map contains a number of methods to
89+
convert between ids of various kinds and to look up data associated
90+
with a HIR node.
91+
92+
For example, if you have a `DefId`, and you would like to convert it
93+
to a `NodeId`, you can use `tcx.hir.as_local_node_id(def_id)`. This
94+
returns an `Option<NodeId>` -- this will be `None` if the def-id
95+
refers to something outside of the current crate (since then it has no
96+
HIR node), but otherwise returns `Some(n)` where `n` is the node-id of
97+
the definition.
98+
99+
Similarly, you can use `tcx.hir.find(n)` to look up the node for a
100+
`NodeId`. This returns an `Option<Node<'tcx>>`, where `Node` is an enum
101+
defined in the map; by matching on this you can find out what sort of
102+
node the node-id referred to and also get a pointer to the data
103+
itself. Often, you know what sort of node `n` is -- e.g., if you know
104+
that `n` must be some HIR expression, you can do
105+
`tcx.hir.expect_expr(n)`, which will extract and return the
106+
`&hir::Expr`, panicking if `n` is not in fact an expression.
107+
108+
Finally, you can use the HIR map to find the parents of nodes, via
109+
calls like `tcx.hir.get_parent_node(n)`.
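
The lookups described in this section can be modelled with a toy map. This is a sketch only: `ToyMap` borrows the method names mentioned above, but its fields, signatures, and the `sample_map` helper are assumptions made for illustration and do not match the real `hir::map` API.

```rust
use std::collections::HashMap;

// Hypothetical, simplified types (not the rustc definitions).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct NodeId(u32);

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct DefId {
    krate: u32,
    index: u32,
}

const LOCAL_CRATE: u32 = 0;

#[derive(Debug, PartialEq)]
enum Node<'hir> {
    Item(&'hir str),
    Expr(&'hir str),
}

struct ToyMap<'hir> {
    def_to_node: HashMap<u32, NodeId>, // def index -> node id, local defs only
    nodes: HashMap<NodeId, Node<'hir>>,
    parents: HashMap<NodeId, NodeId>,
}

impl<'hir> ToyMap<'hir> {
    // `None` for definitions from other crates: they have no HIR node here.
    fn as_local_node_id(&self, def_id: DefId) -> Option<NodeId> {
        if def_id.krate != LOCAL_CRATE {
            return None;
        }
        self.def_to_node.get(&def_id.index).copied()
    }

    fn find(&self, id: NodeId) -> Option<&Node<'hir>> {
        self.nodes.get(&id)
    }

    // Panics if `id` does not refer to an expression.
    fn expect_expr(&self, id: NodeId) -> &'hir str {
        match self.find(id) {
            Some(Node::Expr(e)) => *e,
            other => panic!("expected an expression, found {:?}", other),
        }
    }

    // Toy choice: a node with no recorded parent is its own parent.
    fn get_parent_node(&self, id: NodeId) -> NodeId {
        self.parents.get(&id).copied().unwrap_or(id)
    }
}

fn sample_map() -> ToyMap<'static> {
    let mut map = ToyMap {
        def_to_node: HashMap::new(),
        nodes: HashMap::new(),
        parents: HashMap::new(),
    };
    map.def_to_node.insert(0, NodeId(10));
    map.nodes.insert(NodeId(10), Node::Item("fn foo"));
    map.nodes.insert(NodeId(11), Node::Expr("x + y"));
    map.parents.insert(NodeId(11), NodeId(10));
    map
}

fn main() {
    let map = sample_map();
    let local = DefId { krate: LOCAL_CRATE, index: 0 };
    let foreign = DefId { krate: 1, index: 0 };
    println!("{:?}", map.as_local_node_id(local));
    println!("{:?}", map.as_local_node_id(foreign));
    println!("{:?}", map.find(NodeId(10)));
    println!("{}", map.expect_expr(NodeId(11)));
    println!("{:?}", map.get_parent_node(NodeId(11)));
}
```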
110+
111+
### HIR Bodies
112+
113+
A **body** represents some kind of executable code, such as the body
114+
of a function/closure or the definition of a constant. Bodies are
115+
associated with an **owner**, which is typically some kind of item
116+
(e.g., a `fn()` or `const`), but could also be a closure expression
117+
(e.g., `|x, y| x + y`). You can use the HIR map to find the body
118+
associated with a given def-id (`maybe_body_owned_by()`) or to find
119+
the owner of a body (`body_owner_def_id()`).

Diff for: src/librustc/hir/map/README.md

+4
@@ -0,0 +1,4 @@
1+
The HIR map, accessible via `tcx.hir`, allows you to quickly navigate the
2+
HIR and convert between various forms of identifiers. See [the HIR README] for more information.
3+
4+
[the HIR README]: ../README.md

Diff for: src/librustc/hir/mod.rs

+25-1
@@ -413,6 +413,10 @@ pub struct WhereEqPredicate {
413413

414414
pub type CrateConfig = HirVec<P<MetaItem>>;
415415

416+
/// The top-level data structure that stores the entire contents of
417+
/// the crate currently being compiled.
418+
///
419+
/// For more details, see [the module-level README](README.md).
416420
#[derive(Clone, PartialEq, Eq, RustcEncodable, RustcDecodable, Debug)]
417421
pub struct Crate {
418422
pub module: Mod,
@@ -927,7 +931,27 @@ pub struct BodyId {
927931
pub node_id: NodeId,
928932
}
929933

930-
/// The body of a function or constant value.
934+
/// The body of a function, closure, or constant value. In the case of
935+
/// a function, the body contains not only the function body itself
936+
/// (which is an expression), but also the argument patterns, since
937+
/// those are something that the caller doesn't really care about.
938+
///
939+
/// # Examples
940+
///
941+
/// ```
942+
/// fn foo((x, y): (u32, u32)) -> u32 {
943+
/// x + y
944+
/// }
945+
/// ```
946+
///
947+
/// Here, the `Body` associated with `foo()` would contain:
948+
///
949+
/// - an `arguments` array containing the `(x, y)` pattern
950+
/// - a `value` containing the `x + y` expression (maybe wrapped in a block)
951+
/// - `is_generator` would be false
952+
///
953+
/// All bodies have an **owner**, which can be accessed via the HIR
954+
/// map using `body_owner_def_id()`.
931955
#[derive(Clone, PartialEq, Eq, RustcEncodable, RustcDecodable, Hash, Debug)]
932956
pub struct Body {
933957
pub arguments: HirVec<Arg>,

Diff for: src/librustc/lib.rs

+22-1
@@ -8,7 +8,28 @@
88
// option. This file may not be copied, modified, or distributed
99
// except according to those terms.
1010

11-
//! The Rust compiler.
11+
//! The "main crate" of the Rust compiler. This crate contains common
12+
//! type definitions that are used by the other crates in the rustc
13+
//! "family". Some prominent examples (note that each of these modules
14+
//! has its own README with further details).
15+
//!
16+
//! - **HIR.** The "high-level (H) intermediate representation (IR)" is
17+
//! defined in the `hir` module.
18+
//! - **MIR.** The "mid-level (M) intermediate representation (IR)" is
19+
//! defined in the `mir` module. This module contains only the
20+
//! *definition* of the MIR; the passes that transform and operate
21+
//! on MIR are found in the `librustc_mir` crate.
22+
//! - **Types.** The internal representation of types used in rustc is
23+
//! defined in the `ty` module. This includes the **type context**
24+
//! (or `tcx`), which is the central context during most of
25+
//! compilation, containing the interners and other things.
26+
//! - **Traits.** Trait resolution is implemented in the `traits` module.
27+
//! - **Type inference.** The type inference code can be found in the `infer` module;
28+
//! this code handles low-level equality and subtyping operations. The
29+
//! type check pass in the compiler is found in the `librustc_typeck` crate.
30+
//!
31+
//! For a deeper explanation of how the compiler works and is
32+
//! organized, see the README.md file in this directory.
1233
//!
1334
//! # Note
1435
//!

Diff for: src/librustc/ty/README.md

+165
@@ -0,0 +1,165 @@
1+
# Types and the Type Context
2+
3+
The `ty` module defines how the Rust compiler represents types
4+
internally. It also defines the *typing context* (`tcx` or `TyCtxt`),
5+
which is the central data structure in the compiler.
6+
7+
## The tcx and how it uses lifetimes
8+
9+
The `tcx` ("typing context") is the central data structure in the
10+
compiler. It is the context that you use to perform all manner of
11+
queries. The struct `TyCtxt` defines a reference to this shared context:
12+
13+
```rust
tcx: TyCtxt<'a, 'gcx, 'tcx>
//          --  ----  ----
//          |   |     |
//          |   |     innermost arena lifetime (if any)
//          |   "global arena" lifetime
//          lifetime of this reference
```
21+
22+
As you can see, the `TyCtxt` type takes three lifetime parameters.
23+
These lifetimes are perhaps the most complex thing to understand about
24+
the tcx. During Rust compilation, we allocate most of our memory in
25+
**arenas**, which are basically pools of memory that get freed all at
26+
once. When you see a reference with a lifetime like `'tcx` or `'gcx`,
27+
you know that it refers to arena-allocated data (or data that lives as
28+
long as the arenas, anyhow).
29+
30+
We use two distinct levels of arenas. The outer level is the "global
31+
arena". This arena lasts for the entire compilation: so anything you
32+
allocate in there is only freed once compilation is basically over
33+
(actually, when we shift to executing LLVM).
34+
35+
To reduce peak memory usage, when we do type inference, we also use an
36+
inner level of arena. These arenas get thrown away once type inference
37+
is over. This is done because type inference generates a lot of
38+
"throw-away" types that are not particularly interesting after type
39+
inference completes, so keeping around those allocations would be
40+
wasteful.
41+
42+
Often, we wish to write code that explicitly asserts that it is not
43+
taking place during inference. In that case, there is no "local"
44+
arena, and all the types that you can access are allocated in the
45+
global arena. To express this, the idea is to use the same lifetime
46+
for the `'gcx` and `'tcx` parameters of `TyCtxt`. Just to be a touch
47+
confusing, we tend to use the name `'tcx` in such contexts. Here is an
48+
example:
49+
50+
```rust
fn not_in_inference<'a, 'tcx>(tcx: TyCtxt<'a, 'tcx, 'tcx>, def_id: DefId) {
    //                                        ----  ----
    //                                        Using the same lifetime here asserts
    //                                        that the innermost arena accessible through
    //                                        this reference *is* the global arena.
}
```
58+
59+
In contrast, if you want to write code that can be used during type inference, then you
60+
need to declare a distinct `'gcx` and `'tcx` lifetime parameter:
61+
62+
```rust
fn maybe_in_inference<'a, 'gcx, 'tcx>(tcx: TyCtxt<'a, 'gcx, 'tcx>, def_id: DefId) {
    //                                                ----  ----
    //                                                Using different lifetimes here means that
    //                                                the innermost arena *may* be distinct
    //                                                from the global arena (but doesn't have to be).
}
```
70+
71+
### Allocating and working with types
72+
73+
Rust types are represented using the `Ty<'tcx>` defined in the `ty`
74+
module (not to be confused with the `Ty` struct from [the HIR]). This
75+
is in fact a simple type alias for a reference with `'tcx` lifetime:
76+
77+
```rust
pub type Ty<'tcx> = &'tcx TyS<'tcx>;
```
80+
81+
[the HIR]: ../hir/README.md
82+
83+
You can basically ignore the `TyS` struct -- you will basically never
84+
access it explicitly. We always pass it by reference using the
85+
`Ty<'tcx>` alias -- the only exception I think is to define inherent
86+
methods on types. Instances of `TyS` are only ever allocated in one of
87+
the rustc arenas (never e.g. on the stack).
88+
89+
One common operation on types is to **match** and see what kinds of
90+
types they are. This is done by doing `match ty.sty`, sort of like this:
91+
92+
```rust
fn test_type<'tcx>(ty: Ty<'tcx>) {
    match ty.sty {
        ty::TyArray(elem_ty, len) => { ... }
        ...
    }
}
```
100+
101+
The `sty` field (the origin of this name is unclear to me; perhaps
102+
structural type?) is of type `TypeVariants<'tcx>`, which is an enum
103+
defining all of the different kinds of types in the compiler.
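
For readers without a compiler checkout at hand, the matching pattern can be tried on a self-contained toy. The `TypeVariants`, `TyS`, and `Ty` definitions below are heavily simplified assumptions that mirror only the shape described here, and `describe` is a hypothetical helper.

```rust
// Hypothetical, simplified mirror of the types described above.
#[derive(Debug)]
enum TypeVariants<'tcx> {
    TyBool,
    TyUint,
    TyArray(Ty<'tcx>, u64),
    // Stand-in for an inference variable that is not yet resolved.
    TyInfer(u32),
}

#[derive(Debug)]
struct TyS<'tcx> {
    sty: TypeVariants<'tcx>,
}

type Ty<'tcx> = &'tcx TyS<'tcx>;

// Match on the `sty` field to find out what kind of type this is.
fn describe<'tcx>(ty: Ty<'tcx>) -> String {
    match ty.sty {
        TypeVariants::TyBool => "bool".to_string(),
        TypeVariants::TyUint => "u32".to_string(),
        TypeVariants::TyArray(elem_ty, len) => format!("[{}; {}]", describe(elem_ty), len),
        TypeVariants::TyInfer(v) => format!("?{}", v),
    }
}

fn main() {
    // The toy allocates on the stack for brevity; per the text above, rustc
    // only ever allocates `TyS` in its arenas.
    let flag = TyS { sty: TypeVariants::TyBool };
    let elem = TyS { sty: TypeVariants::TyUint };
    let array = TyS { sty: TypeVariants::TyArray(&elem, 4) };
    let unknown = TyS { sty: TypeVariants::TyInfer(0) };
    println!("{}", describe(&flag));
    println!("{}", describe(&array));
    println!("{}", describe(&unknown));
}
```

The `TyInfer` arm hints at why inspecting `sty` during inference needs care: the match may see a placeholder rather than the final type.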
104+
105+
> NB: inspecting the `sty` field on types during type inference can be
106+
> risky, as there may be inference variables and other things to
107+
> consider, or sometimes types are not yet known that will become
108+
> known later.
109+
110+
To allocate a new type, you can use the various `mk_` methods defined
111+
on the `tcx`. These have names that correspond mostly to the various kinds
112+
of type variants. For example:
113+
114+
```rust
let array_ty = tcx.mk_array(elem_ty, len * 2);
```
117+
118+
These methods all return a `Ty<'tcx>` -- note that the lifetime you
119+
get back is the lifetime of the innermost arena that this `tcx` has
120+
access to. In fact, types are always canonicalized and interned (so we
121+
never allocate exactly the same type twice) and are always allocated
122+
in the outermost arena where they can be (so, if they do not contain
123+
any inference variables or other "temporary" types, they will be
124+
allocated in the global arena). However, the lifetime `'tcx` is always
125+
a safe approximation, so that is what you get back.
126+
127+
> NB. Because types are interned, it is possible to compare them for
128+
> equality efficiently using `==` -- however, this is almost never what
129+
> you want to do unless you happen to be hashing and looking for
130+
> duplicates. This is because often in Rust there are multiple ways to
131+
> represent the same type, particularly once inference is involved. If
132+
> you are going to be testing for type equality, you probably need to
133+
> start looking into the inference code to do it right.
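
The interning behaviour can be illustrated with a minimal sketch. It assumes a toy `Interner` that leaks boxes to obtain `'static` references instead of using arenas, so it shows only the "never allocate the same type twice" property; `TyKind`, `Interner`, `mk`, and this `mk_array` are hypothetical and not the `tcx` methods.

```rust
use std::collections::HashSet;

// Hypothetical, simplified type representation (not the rustc definitions).
#[derive(Debug, PartialEq, Eq, Hash)]
enum TyKind {
    Bool,
    Uint,
    Array(Ty, u64),
}

// Assumption: leaked `'static` references stand in for arena references.
type Ty = &'static TyKind;

#[derive(Default)]
struct Interner {
    seen: HashSet<Ty>,
}

impl Interner {
    // Return the existing allocation for `kind` if there is one, else create it.
    fn mk(&mut self, kind: TyKind) -> Ty {
        if let Some(&existing) = self.seen.get(&kind) {
            return existing;
        }
        let ty: Ty = Box::leak(Box::new(kind));
        self.seen.insert(ty);
        ty
    }

    fn mk_array(&mut self, elem: Ty, len: u64) -> Ty {
        self.mk(TyKind::Array(elem, len))
    }
}

fn main() {
    let mut interner = Interner::default();
    let elem = interner.mk(TyKind::Uint);
    let a = interner.mk_array(elem, 4);
    let b = interner.mk_array(elem, 4);
    // Structurally equal requests yield the very same allocation.
    println!("same allocation: {}", std::ptr::eq(a, b));
    let flag = interner.mk(TyKind::Bool);
    println!("{:?} vs {:?}", a, flag);
}
```

The sketch uses `std::ptr::eq` to make the identity check explicit. As the note above says, identical allocations only tell you the representations match, not that two differently written types are equivalent.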
134+
135+
You can also find various common types in the tcx itself by accessing
136+
`tcx.types.bool`, `tcx.types.char`, etc (see `CommonTypes` for more).
137+
138+
### Beyond types: Other kinds of arena-allocated data structures
139+
140+
In addition to types, there are a number of other arena-allocated data
141+
structures that you can allocate, and which are found in this
142+
module. Here are a few examples:
143+
144+
- `Substs`, allocated with `mk_substs` -- this will intern a slice of types, often used to
145+
specify the values to be substituted for generics (e.g., `HashMap<i32, u32>`
146+
would be represented as a slice `&'tcx [tcx.types.i32, tcx.types.u32]`).
147+
- `TraitRef`, typically passed by value -- a **trait reference**
148+
consists of a reference to a trait along with its various type
149+
parameters (including `Self`), like `i32: Display` (here, the def-id
150+
would reference the `Display` trait, and the substs would contain
151+
`i32`).
152+
- `Predicate` defines something the trait system has to prove (see `traits` module).
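
As a rough illustration of how a trait reference pairs a def-id with substs, consider the toy below. The `Ty` enum, the tuple-struct `DefId`, and the `TraitRef` layout are simplifying assumptions; the def-id value `7` is made up.

```rust
// Hypothetical, simplified types (not the rustc definitions).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Ty {
    I32,
    U32,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct DefId(u32);

// A trait reference: the trait's def-id plus the types substituted for its
// parameters, with `Self` stored first.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct TraitRef<'tcx> {
    def_id: DefId,
    substs: &'tcx [Ty],
}

impl<'tcx> TraitRef<'tcx> {
    fn self_ty(&self) -> Ty {
        self.substs[0]
    }
}

fn main() {
    // Substs for `HashMap<i32, u32>`: one entry per generic parameter.
    let map_substs = [Ty::I32, Ty::U32];
    println!("{:?}", map_substs);

    // `i32: Display`: the def-id names the trait, the substs hold `Self`.
    let display_substs = [Ty::I32];
    let trait_ref = TraitRef { def_id: DefId(7), substs: &display_substs };
    println!("{:?} with Self = {:?}", trait_ref.def_id, trait_ref.self_ty());
}
```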
153+
154+
### Import conventions
155+
156+
Although there is no hard and fast rule, the `ty` module tends to be used like so:
157+
158+
```rust
use ty::{self, Ty, TyCtxt};
```
161+
162+
In particular, since they are so common, the `Ty` and `TyCtxt` types
163+
are imported directly. Other types are often referenced with an
164+
explicit `ty::` prefix (e.g., `ty::TraitRef<'tcx>`). But some modules
165+
choose to import a larger or smaller set of names explicitly.

Diff for: src/librustc/ty/context.rs

+4-3
@@ -793,9 +793,10 @@ impl<'tcx> CommonTypes<'tcx> {
793793
}
794794
}
795795

796-
/// The data structure to keep track of all the information that typechecker
797-
/// generates so that so that it can be reused and doesn't have to be redone
798-
/// later on.
796+
/// The central data structure of the compiler. It stores references
797+
/// to the various **arenas** and also houses the results of the
798+
/// various **compiler queries** that have been performed. See [the
799+
/// README](README.md) for more details.
799800
#[derive(Copy, Clone)]
800801
pub struct TyCtxt<'a, 'gcx: 'a+'tcx, 'tcx: 'a> {
801802
gcx: &'a GlobalCtxt<'gcx>,
