Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode check for crate_name attribute #2463

Merged
merged 1 commit into from
Jul 30, 2023

Conversation

tamaroning
Copy link
Contributor

@tamaroning tamaroning commented Jul 21, 2023

depends on PR #2425
Addresses #2287

Modify the function validate_crate_name to check if the crate name contains only Unicode alphabetic and numeric.
See https://doc.rust-lang.org/reference/crates-and-source-files.html#the-crate_name-attribute

gcc/rust/ChangeLog:

	* lex/rust-codepoint.h: Add comment
	* lex/rust-lex.h: New method to get decoded characters
	* rust-session-manager.cc (validate_crate_name): Modify unicode check
	(rust_crate_name_validation_test): Add testcases
	* util/rust-unicode.h (RUST_UNICODE_H): New class Utf8String.
	(class Utf8String): New class.

gcc/testsuite/ChangeLog:

	* rust/compile/bad-crate-name.rs: Moved to...
	* rust/compile/bad-crate-name1.rs: ...here.
	* rust/compile/bad-crate-name2.rs: New test.

Signed-off-by: Raiki Tamura <tamaron1203@gmail.com>

@tamaroning tamaroning changed the title [wip} [wip] unicode check for crate_name attribute Jul 21, 2023
@tamaroning tamaroning changed the title [wip] unicode check for crate_name attribute Unicode check for crate_name attribute Jul 21, 2023
Comment on lines +26 to 27
// FIXME: move this to rust-unicode.h?
struct Codepoint
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this class should be moved to rust-unicode.h.
At least it should not be put in under gcc/rust/lex/.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that seems good to me

Comment on lines -1233 to -1277
// FIXME: The next test does not pass as of current implementation
// ASSERT_TRUE (Rust::CompileOptions::validate_crate_name ("惊吓"));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this original comment is probably wrong.
The kanji 惊(U+60CA) are not categorized as Alphabetic.

In https://www.unicode.org/Public/14.0.0/ucd/DerivedCoreProperties.txt

...
30FF          ; Alphabetic # Lo       KATAKANA DIGRAPH KOTO
3105..312F    ; Alphabetic # Lo  [43] BOPOMOFO LETTER B..BOPOMOFO LETTER NN
3131..318E    ; Alphabetic # Lo  [94] HANGUL LETTER KIYEOK..HANGUL LETTER ARAEAE
31A0..31BF    ; Alphabetic # Lo  [32] BOPOMOFO LETTER BU..BOPOMOFO LETTER AH
31F0..31FF    ; Alphabetic # Lo  [16] KATAKANA LETTER SMALL KU..KATAKANA LETTER SMALL RO
3400..4DBF    ; Alphabetic # Lo [6592] CJK UNIFIED IDEOGRAPH-3400..CJK UNIFIED IDEOGRAPH-4DBF
4E00..A014    ; Alphabetic # Lo [21013] CJK UNIFIED IDEOGRAPH-4E00..YI SYLLABLE E
A015          ; Alphabetic # Lm       YI SYLLABLE WU
A016..A48C    ; Alphabetic # Lo [1143] YI SYLLABLE BIT..YI SYLLABLE YYR
A4D0..A4F7    ; Alphabetic # Lo  [40] LISU LETTER BA..LISU LETTER OE
...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for checking haha. did you try it with rustc directly?

Copy link
Contributor Author

@tamaroning tamaroning Jul 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh my god. i just checked and rustc compiles it (@_@)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I overlooked 4E00..A014 containing 0x60CA haha

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hahaha no worries it's all good :D

@tamaroning tamaroning marked this pull request as ready for review July 21, 2023 04:33
@tamaroning tamaroning force-pushed the crate-name branch 3 times, most recently from 7d8619a to 3631430 Compare July 21, 2023 05:20
@tamaroning
Copy link
Contributor Author

The new introduced class Utf8String can be used to fix #2411

@CohenArthur CohenArthur requested review from CohenArthur and philberty and removed request for CohenArthur July 27, 2023 09:18
Copy link
Member

@CohenArthur CohenArthur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thank you! Good work!

Comment on lines +26 to 27
// FIXME: move this to rust-unicode.h?
struct Codepoint
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that seems good to me

error = Error (UNDEF_LOCATION, "crate name is not a valid UTF-8 string");
return false;
}
if (uchars.value ().empty ())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it would be clearer to add a temporary variable after the first if (!uchars.has_value())? You keep calling uchars.value() which maybe hurts readability just a little bit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. fixed

@@ -113,24 +114,32 @@ infer_crate_name (const std::string &filename)
static bool
validate_crate_name (const std::string &crate_name, Error &error)
{
if (crate_name.empty ())
Utf8String utf8_name = {crate_name};
auto uchars = utf8_name.get_chars ();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I would rename this variable to uchars_opt

{
error = Error (UNDEF_LOCATION, "crate name is not a valid UTF-8 string");
return false;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and then add this

Suggested change
}
}
auto uchars = uchars_opt.value();

Comment on lines -1233 to -1277
// FIXME: The next test does not pass as of current implementation
// ASSERT_TRUE (Rust::CompileOptions::validate_crate_name ("惊吓"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for checking haha. did you try it with rustc directly?

static bool
validate_crate_name (const std::string &crate_name, Error &error)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function would probably benefit from the new tl::expected type we have, but that's a problem for another PR!


namespace Rust {

class Utf8String
{
private:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was about to say a nit here to move the private fields to the bottom of the class declaration but then i realised we do it this style in AST and HIR classes.

What is the GCC style? I think having them at the top like this is probably the best now when i think about it.

@dkm @CohenArthur @tschwinge any opinions on the style of putting private fields at the top of the class or bottom?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like having private fields at the bottom of my classes personally. I never gave it much attention haha

Copy link
Member

@philberty philberty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

gcc/rust/ChangeLog:

	* lex/rust-codepoint.h: Add comment
	* lex/rust-lex.h: New method to get decoded characters
	* rust-session-manager.cc (validate_crate_name): Modify unicode check
	(rust_crate_name_validation_test): Add testcases
	* util/rust-unicode.h (RUST_UNICODE_H): New class Utf8String.
	(class Utf8String): New class.
	* util/rust-unicode.cc (binary_search_sorted_array): Add comment.
	(recursive_decomp_cano): Add comment.
	(recomp): Remove dead code.
	(dump_string): Removed.

gcc/testsuite/ChangeLog:

	* rust/compile/bad-crate-name.rs: Moved to...
	* rust/compile/bad-crate-name1.rs: ...here.
	* rust/compile/bad-crate-name2.rs: New test.

Signed-off-by: Raiki Tamura <tamaron1203@gmail.com>
@tamaroning
Copy link
Contributor Author

Thank you for review.
Rebased and fixed parts @CohenArthur pointed out.

@tamaroning tamaroning mentioned this pull request Jul 30, 2023
Copy link
Member

@philberty philberty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@philberty philberty added this pull request to the merge queue Jul 30, 2023
Merged via the queue into Rust-GCC:master with commit a3fc40c Jul 30, 2023
9 checks passed
@tamaroning tamaroning deleted the crate-name branch July 31, 2023 01:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants