Normalize all identifiers to NFC #2489

Merged Aug 9, 2023 (1 commit)

@tamaroning tamaroning commented Jul 30, 2023

Addresses #2287
depends on #2467

Normalize all identifiers (tokens) to their NFC form.
Normalization must be done before any macro expansion.

See https://doc.rust-lang.org/reference/identifiers.html#normalization for details

Changelog

gccrs: Normalize all identifier tokens
gcc/rust/ChangeLog:

	* lex/rust-lex.cc (assert_source_content): Fix namespace specifier
	(test_buffer_input_source): Likewise.
	(test_file_input_source): Likewise.
	* lex/rust-lex.h: Move InputSource ...
	* lex/rust-input-source.h: ... to here. (New file)
	* lex/rust-token.cc (nfc_normalize_token_string): New function
	* lex/rust-token.h (nfc_normalize_token_string): New function
	* rust-lang.cc (run_rust_tests): Modify order of selftests.
	* rust-session-manager.cc (validate_crate_name): Modify interface of Utf8String.
	* util/rust-unicode.cc (lookup_cc): Modify codepoint_t typedef.
	(lookup_recomp): Likewise.
	(recursive_decomp_cano): Likewise.
	(decomp_cano): Likewise.
	(sort_cano): Likewise.
	(compose_hangul): Likewise.
	(assert_normalize): Likewise.
	(Utf8String::nfc_normalize): New function.
	* util/rust-unicode.h: Modify interface of Utf8String.

gcc/testsuite/ChangeLog:

	* rust/compile/unicode_norm1.rs: New test.

@tamaroning tamaroning changed the title Ucnorm parser Normalize all identifiers to NFC Jul 30, 2023
@tamaroning tamaroning force-pushed the ucnorm-parser branch 3 times, most recently from a6273d2 to 6c8bf02 Compare August 6, 2023 09:00
@tamaroning tamaroning marked this pull request as ready for review August 6, 2023 09:01
Comment on lines -207 to -208
// Input source wrapper thing.
class InputSource
Contributor Author

This class has been moved to a new file, rust-input-source.h, to avoid a recursive #include.

Comment on lines +34 to +35
static tl::optional<Utf8String>
make_utf8_string (const std::string &maybe_utf8)
Contributor Author

This is a factory function for Utf8String.
It returns an optional value, which is non-empty only if the given std::string is properly encoded as UTF-8.
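
For readers unfamiliar with the pattern, here is a minimal standalone sketch of an optional-returning UTF-8 decoder. It is not the gccrs implementation: the name `decode_utf8` is hypothetical, and it uses `std::optional` and a plain vector of code points in place of the PR's `tl::optional<Utf8String>`.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for the factory discussed above: decode a byte
// string into code points, or return std::nullopt if the bytes are not
// well-formed UTF-8.
std::optional<std::vector<std::uint32_t>>
decode_utf8 (const std::string &bytes)
{
  // Smallest code point that legitimately needs a sequence of each length.
  static const std::uint32_t min_for_len[] = {0, 0, 0x80, 0x800, 0x10000};

  std::vector<std::uint32_t> out;
  std::size_t i = 0;
  while (i < bytes.size ())
    {
      unsigned char b0 = static_cast<unsigned char> (bytes[i]);
      std::uint32_t cp;
      std::size_t len;
      if (b0 < 0x80)
        { cp = b0; len = 1; }
      else if ((b0 & 0xE0) == 0xC0)
        { cp = b0 & 0x1F; len = 2; }
      else if ((b0 & 0xF0) == 0xE0)
        { cp = b0 & 0x0F; len = 3; }
      else if ((b0 & 0xF8) == 0xF0)
        { cp = b0 & 0x07; len = 4; }
      else
        return std::nullopt; // stray continuation byte or invalid lead byte

      if (i + len > bytes.size ())
        return std::nullopt; // truncated sequence

      for (std::size_t k = 1; k < len; ++k)
        {
          unsigned char b = static_cast<unsigned char> (bytes[i + k]);
          if ((b & 0xC0) != 0x80)
            return std::nullopt; // expected a continuation byte
          cp = (cp << 6) | (b & 0x3F);
        }

      // Reject overlong encodings, surrogates, and out-of-range values.
      if (cp < min_for_len[len] || cp > 0x10FFFF
          || (cp >= 0xD800 && cp <= 0xDFFF))
        return std::nullopt;

      out.push_back (cp);
      i += len;
    }
  return out;
}
```

A caller checks the result before use, for example `if (auto cps = decode_utf8 (s)) { /* use *cps */ }`, so invalid input cannot be turned into a string object by accident.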

Comment on lines +1 to +6
fn main() {
// U+304C
let が = ();
// U+304B + U+3099
let _ = が;
}
Contributor Author

This code compiles even though the two identifiers have different byte strings.
That indicates identifier normalization is working.
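
To illustrate why the two spellings compare equal after normalization, here is a toy sketch of canonical composition. It handles only the single pair used in this test (U+304B followed by U+3099 composes to U+304C); `toy_compose` is a hypothetical name, and real NFC, such as the `Utf8String::nfc_normalize` added in this PR, requires the full Unicode decomposition, reordering, and composition tables.

```cpp
#include <cstdint>
#include <vector>

// Toy canonical composition covering ONLY the pair from the test above:
// U+304B (HIRAGANA LETTER KA) followed by U+3099 (COMBINING
// KATAKANA-HIRAGANA VOICED SOUND MARK) composes to U+304C (HIRAGANA
// LETTER GA). This is an illustration, not a general NFC implementation.
std::vector<std::uint32_t>
toy_compose (const std::vector<std::uint32_t> &input)
{
  std::vector<std::uint32_t> out;
  for (std::uint32_t cp : input)
    {
      if (cp == 0x3099 && !out.empty () && out.back () == 0x304B)
        out.back () = 0x304C; // merge the base letter and the combining mark
      else
        out.push_back (cp);
    }
  return out;
}
```

With this, `toy_compose ({0x304B, 0x3099})` and `toy_compose ({0x304C})` both yield the one-element sequence containing U+304C, which is the property the test relies on.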


return buf;
};

// Returns UTF codepoints when string is valid as UTF-8, returns nullopt
Member

Shouldn't this comment be updated? It looks like the function signature has changed and no longer matches its description.

gcc/rust/util/rust-unicode.h (outdated, resolved)
@@ -309,9 +318,10 @@ is_numeric (uint32_t codepoint)
 namespace selftest {

 void
-assert_normalize (std::vector<uint32_t> origin, std::vector<uint32_t> expected)
+assert_normalize (std::vector<Rust::Codepoint> origin,
+                  std::vector<Rust::Codepoint> expected)
Member

Expected could be const. Also, we should probably take some references here.

gcc/rust/util/rust-unicode.cc (outdated, resolved)
Signed-off-by: Raiki Tamura <tamaron1203@gmail.com>
@tamaroning
Contributor Author

@P-E-P Thank you for your review. Fixed all.

@P-E-P P-E-P added this pull request to the merge queue Aug 9, 2023
Merged via the queue into Rust-GCC:master with commit a4b7e73 Aug 9, 2023
9 checks passed