diff --git a/src/input-format.md b/src/input-format.md
index 8d921bf8c..79121fe21 100644
--- a/src/input-format.md
+++ b/src/input-format.md
@@ -1,26 +1,41 @@
 # Input format
 
+r[input]
+
+r[input.intro]
 This chapter describes how a source file is interpreted as a sequence of tokens.
 
 See [Crates and source files] for a description of how programs are organised into files.
 
 ## Source encoding
 
+r[input.encoding]
+
+r[input.encoding.utf8]
 Each source file is interpreted as a sequence of Unicode characters encoded in UTF-8.
+
+r[input.encoding.invalid]
 It is an error if the file is not valid UTF-8.
 
 ## Byte order mark removal
 
+r[input.byte-order-mark]
+
 If the first character in the sequence is `U+FEFF` ([BYTE ORDER MARK]), it is removed.
 
 ## CRLF normalization
 
+r[input.crlf]
+
 Each pair of characters `U+000D` (CR) immediately followed by `U+000A` (LF) is replaced by a single `U+000A` (LF).
 
 Other occurrences of the character `U+000D` (CR) are left in place (they are treated as [whitespace]).
 
 ## Shebang removal
 
+r[input.shebang]
+
+r[input.shebang.intro]
 If the remaining sequence begins with the characters `#!`, the characters up to and including the first `U+000A` (LF) are removed from the sequence.
 
 For example, the first line of the following file would be ignored:
@@ -34,6 +49,7 @@ fn main() {
 }
 ```
 
+r[input.shebang.inner-attribute]
 As an exception, if the `#!` characters are followed (ignoring intervening [comments] or [whitespace]) by a `[` token, nothing is removed.
 This prevents an [inner attribute] at the start of a source file being removed.
 
@@ -41,8 +57,9 @@ This prevents an [inner attribute] at the start of a source file being removed.
 
 ## Tokenization
 
-The resulting sequence of characters is then converted into tokens as described in the remainder of this chapter.
+r[input.tokenization]
 
+The resulting sequence of characters is then converted into tokens as described in the remainder of this chapter.
 [`include!`]: ../std/macro.include.md
 [`include_bytes!`]: ../std/macro.include_bytes.md