Tools: add a script to download Unicode data files, and rename grapheme_break_property_data_gen.py to unicode_properties_data_gen.py #4435

Merged
5 commits merged on Mar 8, 2024

cpplearner
Contributor

These are attempts to make it easier to regenerate source files from the Unicode data.

SG16 recommends specifying Unicode 15.1.0 as the minimum Unicode version for C++23 (as a DR) and C++26. See cplusplus/papers#1736 and https://github.com/sg16-unicode/sg16-meetings.

Currently, __msvc_format_ucd_tables.hpp is based on Unicode 15.0.0. In order to support 15.1.0, this file will need to be regenerated.

(Other changes are also needed. For example, _Grapheme_break_property_iterator might need to account for the changes in UAX 29. I haven't taken a close look, though.)

@cpplearner cpplearner requested a review from a team as a code owner March 3, 2024 05:14
@StephanTLavavej StephanTLavavej added the enhancement Something can be improved label Mar 3, 2024
@StephanTLavavej StephanTLavavej self-assigned this Mar 3, 2024
@StephanTLavavej StephanTLavavej removed their assignment Mar 5, 2024
@StephanTLavavej StephanTLavavej self-assigned this Mar 6, 2024
@StephanTLavavej
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit 1fa7781 into microsoft:main Mar 8, 2024
35 checks passed
@StephanTLavavej
Member

Thanks for simplifying this process - I've wanted a download script for a while! 🪄 😻 🎉

@cpplearner cpplearner deleted the download-unicode-data branch March 8, 2024 03:52
Comment on lines +7 to +14
Unicode_data_files = {
    "DerivedCoreProperties.txt": "https://www.unicode.org/Public/UCD/latest/ucd/DerivedCoreProperties.txt",
    "DerivedGeneralCategory.txt": "https://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedGeneralCategory.txt",
    "EastAsianWidth.txt": "https://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt",
    "GraphemeBreakProperty.txt": "https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakProperty.txt",
    "GraphemeBreakTest.txt": "https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakTest.txt",
    "emoji-data.txt": "https://www.unicode.org/Public/UCD/latest/ucd/emoji/emoji-data.txt",
}

Is it intentional that these URLs pull data from the latest available UCD files? That makes it harder to rerun the script and reproduce what was downloaded previously. If it is intentional, perhaps rename the script to download_latest_unicode_data_files.py. Otherwise, the files for a specific version (Unicode 15.1.0, for example) are available at https://unicode.org/Public/15.1.0/ucd. Another alternative would be to allow a base URL for a specific Unicode version to be supplied on the command line.
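
To make the last suggestion concrete, here is a minimal sketch of a version-aware variant. It is not the script from this PR. The names `build_urls`, `download_all`, `main`, `--unicode-version`, and `--dest` are hypothetical, and the sketch assumes that the versioned directories (such as `Public/15.1.0/ucd`) use the same sub-paths as the `latest` URLs quoted above.

```python
import argparse
import urllib.request
from pathlib import Path

# Paths relative to <base>/ucd/, copied from the URLs in the reviewed snippet.
UNICODE_DATA_FILES = {
    "DerivedCoreProperties.txt": "DerivedCoreProperties.txt",
    "DerivedGeneralCategory.txt": "extracted/DerivedGeneralCategory.txt",
    "EastAsianWidth.txt": "EastAsianWidth.txt",
    "GraphemeBreakProperty.txt": "auxiliary/GraphemeBreakProperty.txt",
    "GraphemeBreakTest.txt": "auxiliary/GraphemeBreakTest.txt",
    "emoji-data.txt": "emoji/emoji-data.txt",
}


def build_urls(version="latest"):
    """Map each local file name to its download URL for the given version."""
    if version == "latest":
        base = "https://www.unicode.org/Public/UCD/latest/ucd"
    else:
        # Assumption: pinned versions mirror the layout of the latest directory.
        base = f"https://www.unicode.org/Public/{version}/ucd"
    return {name: f"{base}/{rel}" for name, rel in UNICODE_DATA_FILES.items()}


def download_all(version, dest):
    """Download every data file for `version` into the directory `dest`."""
    dest.mkdir(parents=True, exist_ok=True)
    for name, url in build_urls(version).items():
        with urllib.request.urlopen(url) as response:
            (dest / name).write_bytes(response.read())


def main(argv=None):
    """Hypothetical command-line entry point; flag names are illustrative."""
    parser = argparse.ArgumentParser(description="Download Unicode data files.")
    parser.add_argument("--unicode-version", default="latest",
                        help='for example "15.1.0"; defaults to "latest"')
    parser.add_argument("--dest", type=Path, default=Path("."),
                        help="directory to write the files into")
    args = parser.parse_args(argv)
    download_all(args.unicode_version, args.dest)
```

Calling `main(["--unicode-version", "15.1.0"])` would fetch a pinned set of files, so a later rerun downloads the same data. Keeping `latest` as the default preserves the behavior of the dictionary under review.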

Labels
enhancement (Something can be improved), format (C++20/23 format)
Projects
Archived in project
3 participants