Implement RFC 3348, `c"foo"` literals #108801

fee1-dead · 2023-03-06T07:12:27Z

RFC: rust-lang/rfcs#3348
Tracking issue: #105723

rustbot · 2023-03-06T07:12:33Z

r? @wesleywiser

(rustbot has picked a reviewer for you, use r? to override)

compiler/rustc_parse/src/lexer/mod.rs

compiler/rustc_parse/locales/en-US.ftl

compiler/rustc_ast/src/ast.rs

compiler/rustc_lexer/src/lib.rs

compiler/rustc_ast_passes/src/feature_gate.rs

compiler/rustc_lexer/src/unescape.rs

petrochenkov · 2023-03-06T14:27:27Z

compiler/rustc_lexer/src/unescape.rs

+    Char(char),
+}
+
+pub fn unescape_c_string<F>(src: &str, mode: Mode, callback: &mut F)


If c"..." requires different unescaping from some other existing strings, then something is going wrong, in general.

Perhaps implementation for c"..." and the stuff from rust-lang/rfcs#3349 should be decoupled.

fee1-dead · 2023-03-06

It has to be different, because returning a `char` doesn't cover all cases for C string literals. If the RFC you mentioned is accepted, byte string literals can't have their units represented as characters either. We need to distinguish Unicode characters, which should be encoded as UTF-8, from raw byte values: `c"À"` encodes to C3 80 although its code point is 0xC0, while `c"\xC0"` encodes to [0xC0] directly. Before this PR, byte strings pass byte values as `char`s, which are later converted into `u8`s. C strings need to pass characters that must be UTF-8 encoded as `char`s, and bytes that must be appended verbatim as `u8`s.
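The byte-level distinction in the comment above can be checked with ordinary character and byte-string literals. This is only a minimal sketch: the helper `utf8_bytes` is a hypothetical name introduced for illustration and is not part of the PR.

```rust
/// UTF-8 encoding of a single character, i.e. the bytes a literal
/// character contributes to a UTF-8 encoded string.
/// (Hypothetical helper, not taken from the PR.)
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

fn main() {
    // The character 'À' has the code point U+00C0 ...
    assert_eq!('À' as u32, 0xC0);
    // ... but its UTF-8 encoding is the two bytes C3 80.
    assert_eq!(utf8_bytes('À'), vec![0xC3, 0x80]);
    // A `\xC0` escape in a byte string stands for the single byte C0.
    assert_eq!(b"\xC0".to_vec(), vec![0xC0]);
}
```

The same code point therefore yields different byte sequences depending on whether it came from a literal character or from a `\x` escape, which is the ambiguity being discussed.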

petrochenkov · 2023-03-06

Sorry, I don't understand what you are saying.
Both byte and C strings support non-UTF-8 data, so (Rust) chars are out of the question.
I'm concerned about the difference between byte strings and C strings: both produce arbitrary non-UTF-8 [u8], and any differences between them should eventually be eliminated (that's the point of rust-lang/rfcs#3349, from what I remember).

fee1-dead · 2023-03-07

What you are saying is true. But currently, both byte strings and normal strings emit chars in their implementation. Byte strings just use the code points to represent the byte values, and that would need to be changed to an enum (just like this PR does for C literals) if we were to implement that RFC.

compiler-errors · 2023-04-19

My understanding is that most of this complication comes from the fact that the C-str RFC explicitly states that it supports both \u and \x escapes in c"" literals. Is that correct?

fee1-dead · 2023-04-19

@compiler-errors Not necessarily about the \u escape, but more about the \x escape, which has a different meaning in byte strings than in characters. nnethercote's comment on the RFC mentioned above suggested that a table would make this clearer:

| | Example | # sets* | Characters | Escapes |
|---|---|---|---|---|
| Character | `'H'` | 0 | All Unicode | Quote & ASCII & Unicode |
| String | `"hello"` | 0 | All Unicode | Quote & ASCII & Unicode |
| Raw string | `r#"hello"#` | <256 | All Unicode | N/A |
| Byte | `b'H'` | 0 | All ASCII | Quote & Byte |
| Byte string | `b"hello"` | 0 | All ASCII | Quote & Byte |
| Raw byte string | `br#"hello"#` | <256 | All ASCII | N/A |
| C string | `c"hello"` | 0 | All Unicode | Quote & Byte & Unicode |

Note that since normal strings accept Unicode, we can emit chars that correspond to the actual characters. Byte strings are different: they allow bytes that are not valid UTF-8 (e.g. \xFF is allowed in byte strings but not in normal strings). How do we unescape them currently? We emit the code point (e.g. \xFF -> ÿ, U+00FF) for byte strings and interpret the values later on.
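The "emit the code point, interpret it later" scheme described above can be sketched as a round trip between `u8` and `char`. The function names are hypothetical and the code does not reproduce the lexer's actual implementation; it only illustrates the idea.

```rust
/// Represent a byte value as the char with the same code point
/// (hypothetical name; a sketch of the scheme described above).
fn byte_as_char(b: u8) -> char {
    char::from(b)
}

/// Recover the byte later on. Returns `None` for chars above U+00FF,
/// which cannot stand for a single byte.
fn char_as_byte(c: char) -> Option<u8> {
    u8::try_from(c as u32).ok()
}

fn main() {
    // The escape `\xFF` is carried around as the char 'ÿ' (U+00FF) ...
    assert_eq!(byte_as_char(0xFF), 'ÿ');
    // ... and turned back into the byte 0xFF afterwards.
    assert_eq!(char_as_byte('ÿ'), Some(0xFF));
    // The char alone does not say whether it stands for one raw byte
    // or for a character whose UTF-8 encoding is two bytes long.
    assert_eq!('ÿ'.len_utf8(), 2);
}
```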

That means a ÿ character emitted for a normal string means "ÿ", code point U+00FF, encoded in UTF-8 as 0xC3 0xBF. The same character emitted for a byte string means only the byte 0xFF. C strings are explicitly allowed to contain both, so an enum is needed to convey either a character to be encoded as UTF-8 or a raw byte value.
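A minimal sketch of the enum approach argued for above. The names `Unit` and `encode` are assumptions made for illustration; the diff excerpt only shows a `Char(char)` variant, and the actual type in the PR may differ.

```rust
/// One unescaped unit of a C string literal: either a raw byte (from a
/// `\x` escape) or a character that still has to be encoded as UTF-8.
/// (Hypothetical type, modelled on the discussion above.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Unit {
    Byte(u8),
    Char(char),
}

/// Turn a sequence of units into the final byte buffer.
fn encode(units: &[Unit]) -> Vec<u8> {
    let mut out = Vec::new();
    for unit in units {
        match *unit {
            // Raw bytes are appended verbatim.
            Unit::Byte(b) => out.push(b),
            // Characters are appended as their UTF-8 encoding.
            Unit::Char(c) => {
                let mut buf = [0u8; 4];
                out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
            }
        }
    }
    out
}

fn main() {
    // A literal 'ÿ' followed by the escape `\xFF`.
    let units = [Unit::Char('ÿ'), Unit::Byte(0xFF)];
    assert_eq!(encode(&units), vec![0xC3, 0xBF, 0xFF]);
}
```

With a plain `char` callback, both units in this example would arrive as `'ÿ'` and could no longer be told apart.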

est31 · 2023-04-27

@fee1-dead In the entry in the "Characters" column, in the "C string" row, do you really mean "all bytes except NUL"? IIRC Rust files are required to be valid UTF-8, and RFC 3348 has changed nothing about that. At least I found nothing in the RFC's text indicating that. The goal was more about the escapes column: the encoded result can be a non-valid Unicode string, but the literal itself still has to be valid UTF-8. Otherwise, programs processing Rust source code could no longer assume UTF-8 validity of the source code. In other words, any program that uses Rust's String type to represent a slice of Rust code (including Rust's proc macro infrastructure!) would fail for specific snippets containing C strings that have invalid UTF-8.

I think that entry should rather read "All Unicode" or "All UTF-8".
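The distinction est31 draws, between the literal's source text and the bytes it denotes, can be illustrated with a byte string. This is a sketch only: `is_valid_utf8` is a hypothetical helper, and a byte string is used here as a stand-in for a C string.

```rust
/// Whether a byte sequence is valid UTF-8 (hypothetical helper).
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

fn main() {
    // The source text of the literal: a `b`, a quote, a backslash,
    // `x`, `F`, `F` and a closing quote. It is plain ASCII, so any
    // file containing it is still valid UTF-8.
    let source_text = r#"b"\xFF""#;
    assert!(is_valid_utf8(source_text.as_bytes()));

    // The value the literal denotes is the single byte 0xFF, which is
    // not valid UTF-8 on its own.
    let value: &[u8] = b"\xFF";
    assert!(!is_valid_utf8(value));
}
```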

fee1-dead · 2023-04-27

@est31: corrected, thanks.

rustbot · 2023-03-10T15:31:46Z

Hey! It looks like you've submitted a new PR for the library teams!

If this PR contains changes to any rust-lang/rust public library APIs then please comment with @rustbot label +T-libs-api -T-libs to tag it appropriately. If this PR contains changes to any unstable APIs please edit the PR description to add a link to the relevant API Change Proposal or create one if you haven't already. If you're unsure where your change falls no worries, just leave it as is and the reviewer will take a look and make a decision to forward on if necessary.

Examples of T-libs-api changes:

Stabilizing library features
Introducing insta-stable changes such as new implementations of existing stable traits on existing stable types
Introducing new or changing existing unstable library APIs (excluding permanently unstable features / features without a tracking issue)
Changing public documentation in ways that create new stability guarantees
Changing observable runtime behavior of library APIs

Some changes occurred in src/tools/clippy

cc @rust-lang/clippy

bors · 2023-03-11T08:19:13Z

☔ The latest upstream changes (presumably #108998) made this pull request unmergeable. Please resolve the merge conflicts.

fee1-dead · 2023-04-07T13:43:23Z

r? compiler

…r-errors Implement RFC 3348, `c"foo"` literals RFC: rust-lang/rfcs#3348 Tracking issue: rust-lang#105723

Rollup of 6 pull requests

Successful merges:

- rust-lang#103056 (Fix `checked_{add,sub}_duration` incorrectly returning `None` when `other` has more than `i64::MAX` seconds)
- rust-lang#108801 (Implement RFC 3348, `c"foo"` literals)
- rust-lang#110773 (Reduce MIR dump file count for MIR-opt tests)
- rust-lang#110876 (Added default target cpu to `--print target-cpus` output and updated docs)
- rust-lang#111068 (Improve check-cfg implementation)
- rust-lang#111238 (btree_map: `Cursor{,Mut}::peek_prev` must agree)

Failed merges:

- rust-lang#110694 (Implement builtin # syntax and use it for offset_of!(...))

r? `@ghost`
`@rustbot` modify labels: rollup

klensy · 2023-05-16T15:42:17Z

Looks like rustfmt doesn't know about these new literals, sadly.

kanashimia · 2023-05-29T23:14:08Z

@fee1-dead this should be feature gated under c_str_literal and not c_str_literals, as mentioned in the RFC and tracking issue, right?

Extreme confusion:

error[E0635]: unknown feature `c_str_literal`

fee1-dead · 2023-05-30T13:25:16Z

feature(c_str_literals) made more sense to me, but I don't really mind c_str_literal either. If anyone has a preference for one over the other, feel free to open a PR to either the rfcs repo or to rust-lang/rust. I've updated the tracking issue description in the meantime.

use c literals in compiler and library

Use c literals rust-lang#108801 in compiler and library

currently blocked on:

* <strike>rustfmt: don't know how to format c literals</strike> nope, nightly one works.
* <strike>bootstrap</strike>

r? `@ghost`
`@rustbot` blocked

…=compiler-errors Revert the lexing of `c"…"` string literals

Fixes \[after beta-backport\] rust-lang#113235. Further progress is tracked in rust-lang#113333.

This PR *manually* reverts parts of rust-lang#108801 (since a git-revert would've been too coarse-grained & messy) and git-reverts rust-lang#111647.

CC `@fee1-dead` (rust-lang#108801) `@klensy` (rust-lang#111647)

r? `@compiler-errors`
`@rustbot` label F-c_str_literals beta-nominated

…ilstrieb Stabilize C string literals

RFC: https://rust-lang.github.io/rfcs/3348-c-str-literal.html
Tracking issue: rust-lang#105723
Documentation PR (reference manual): rust-lang/reference#1423

# Stabilization report

Stabilizes C string and raw C string literals (`c"..."` and `cr#"..."#`), which are expressions of type [`&CStr`](https://doc.rust-lang.org/stable/core/ffi/struct.CStr.html). Both new literals require Rust edition 2021 or later.

```rust
const HELLO: &core::ffi::CStr = c"Hello, world!";
```

C strings may contain any byte other than `NUL` (`b'\x00'`), and their in-memory representation is guaranteed to end with `NUL`.

## Implementation

Originally implemented by PR rust-lang#108801, which was reverted due to unintentional changes to lexer behavior in Rust editions < 2021.

The current implementation landed in PR rust-lang#113476, which restricts C string literals to Rust edition >= 2021.

## Resolutions to open questions from the RFC

* Adding C character literals (`c'.'`) of type `c_char` is not part of this feature.
  * Support for `c"..."` literals does not prevent `c'.'` literals from being added in the future.
* C string literals should not be blocked on making `&CStr` a thin pointer.
  * It's possible to declare constant expressions of type `&'static CStr` in stable Rust (as of v1.59), so C string literals are not adding additional coupling on the internal representation of `CStr`.
* The unstable `concat_bytes!` macro should not accept `c"..."` literals.
  * C strings have two equally valid `&[u8]` representations (with or without terminal `NUL`), so allowing them to be used in `concat_bytes!` would be ambiguous.
* Adding a type to represent C strings containing valid UTF-8 is not part of this feature.
  * Support for a hypothetical `&Utf8CStr` may be explored in the future, should such a type be added to Rust.
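Two points from the stabilization report above can be sketched in code: a `&'static CStr` constant declared without a `c"..."` literal, and the two `&[u8]` views of a C string. This is an illustrative sketch, not text from the report. It assumes the `unsafe` constructor `CStr::from_bytes_with_nul_unchecked` is the const route the report alludes to, and the helper names `without_nul` and `with_nul` are hypothetical.

```rust
use std::ffi::CStr;

// A constant `&'static CStr` built without a `c"..."` literal.
// Safety: the byte string ends with exactly one NUL and contains no
// interior NUL bytes.
const HELLO: &CStr =
    unsafe { CStr::from_bytes_with_nul_unchecked(b"Hello, world!\0") };

/// View of the contents without the terminating NUL.
fn without_nul(s: &CStr) -> &[u8] {
    s.to_bytes()
}

/// View of the contents including the terminating NUL.
fn with_nul(s: &CStr) -> &[u8] {
    s.to_bytes_with_nul()
}

fn main() {
    // Both slices are valid `&[u8]` representations of the same C
    // string, which is the ambiguity noted for `concat_bytes!`.
    assert_eq!(without_nul(HELLO), b"Hello, world!");
    assert_eq!(with_nul(HELLO), b"Hello, world!\0");
}
```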

rustbot assigned wesleywiser Mar 6, 2023

fee1-dead force-pushed the c-str branch from ae3bfc8 to 9f6ce34 Compare March 6, 2023 07:13

petrochenkov self-assigned this Mar 6, 2023

fee1-dead force-pushed the c-str branch from 793c5d9 to c40bc6d Compare March 6, 2023 07:58

fbstj reviewed Mar 6, 2023

compiler/rustc_parse/src/lexer/mod.rs (outdated)

fee1-dead force-pushed the c-str branch from c40bc6d to 973dd61 Compare March 6, 2023 10:35

fee1-dead commented Mar 6, 2023

compiler/rustc_parse/locales/en-US.ftl (outdated)

fmease reviewed Mar 6, 2023

compiler/rustc_ast/src/ast.rs (outdated)

compiler/rustc_lexer/src/lib.rs (outdated)

petrochenkov reviewed Mar 6, 2023

petrochenkov removed their assignment Mar 6, 2023

m-ou-se mentioned this pull request Mar 6, 2023

Tracking Issue for c"…" string literals #105723

Closed

12 tasks

fee1-dead added the S-waiting-on-author label and removed the S-waiting-on-review label Mar 8, 2023

fee1-dead force-pushed the c-str branch from d5d877e to aff63f2 Compare March 10, 2023 14:30

fee1-dead marked this pull request as ready for review March 10, 2023 15:31

fee1-dead added the S-waiting-on-review label and removed the S-waiting-on-author label Mar 10, 2023

fee1-dead force-pushed the c-str branch from aff63f2 to a9d67e8 Compare March 12, 2023 08:54

fee1-dead changed the title ~~[WIP] Implement RFC 3348, c"foo" literals~~ Implement RFC 3348, c"foo" literals Mar 12, 2023

rustbot assigned jackh726 and unassigned wesleywiser Apr 7, 2023

Manishearth mentioned this pull request May 4, 2023

Rollup of 7 pull requests #111204

Closed

matthiaskrgr added a commit to matthiaskrgr/rust that referenced this pull request May 4, 2023

Rollup merge of rust-lang#108801 - fee1-dead-contrib:c-str, r=compile…

7236d10

…r-errors Implement RFC 3348, `c"foo"` literals RFC: rust-lang/rfcs#3348 Tracking issue: rust-lang#105723

matthiaskrgr mentioned this pull request May 4, 2023

Rollup of 8 pull requests #111219

Closed

Dylan-DPC mentioned this pull request May 5, 2023

Rollup of 6 pull requests #111248

Merged

bors merged commit 4891f02 into rust-lang:master May 5, 2023

rustbot added this to the 1.71.0 milestone May 5, 2023

klensy mentioned this pull request May 16, 2023

use c literals in compiler and library #111647

Merged

flip1995 pushed a commit to flip1995/rust that referenced this pull request May 20, 2023

Rollup merge of rust-lang#108801 - fee1-dead-contrib:c-str, r=compile…

a48c735

…r-errors Implement RFC 3348, `c"foo"` literals RFC: rust-lang/rfcs#3348 Tracking issue: rust-lang#105723

fee1-dead deleted the c-str branch May 30, 2023 13:24

celinval mentioned this pull request Jun 21, 2023

Upgrade rust toolchain to nightly-2023-06-20 model-checking/kani#2551

Merged

4 tasks

This was referenced Jul 2, 2023

regression: c"..." are experimental #113235

Closed

Reimplement the lexing of c"…" string literals with backward compatibility in mind #113333

Closed

Revert the lexing of c"…" string literals #113334

Merged

jmillikin mentioned this pull request Nov 1, 2023

Stabilize C string literals #117472

Merged
