Tools: add a script to download Unicode data files, and rename grapheme_break_property_data_gen.py to unicode_properties_data_gen.py
#4435
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.
Thanks for simplifying this process - I've wanted a download script for a while! 🪄 😻 🎉
Unicode_data_files = {
    "DerivedCoreProperties.txt": "https://www.unicode.org/Public/UCD/latest/ucd/DerivedCoreProperties.txt",
    "DerivedGeneralCategory.txt": "https://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedGeneralCategory.txt",
    "EastAsianWidth.txt": "https://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt",
    "GraphemeBreakProperty.txt": "https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakProperty.txt",
    "GraphemeBreakTest.txt": "https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakTest.txt",
    "emoji-data.txt": "https://www.unicode.org/Public/UCD/latest/ucd/emoji/emoji-data.txt",
}
Is it desirable that these pull data from the latest available UCD files? That complicates rerunning the scripts with the intention to match what might have been downloaded previously. If it is intentional, perhaps rename the script to download_latest_unicode_data_files.py. Otherwise, the files specific to Unicode 15.1.0 (for example) are available at https://unicode.org/Public/15.1.0/ucd. Another alternative would be to allow a base Unicode version URL to be supplied on the command line.
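The command-line alternative could look roughly like the sketch below. It is not the script from this PR: the `--version` option and the names `UCD_RELATIVE_PATHS`, `ucd_base_url`, `build_urls`, and `main` are hypothetical, and it assumes the versioned directories on unicode.org mirror the layout of `UCD/latest/ucd`.

```python
import argparse
import urllib.request

# File names and relative paths taken from the dictionary in this PR.
UCD_RELATIVE_PATHS = {
    "DerivedCoreProperties.txt": "DerivedCoreProperties.txt",
    "DerivedGeneralCategory.txt": "extracted/DerivedGeneralCategory.txt",
    "EastAsianWidth.txt": "EastAsianWidth.txt",
    "GraphemeBreakProperty.txt": "auxiliary/GraphemeBreakProperty.txt",
    "GraphemeBreakTest.txt": "auxiliary/GraphemeBreakTest.txt",
    "emoji-data.txt": "emoji/emoji-data.txt",
}


def ucd_base_url(version):
    """Return the UCD base URL for "latest" or a pinned version such as "15.1.0"."""
    if version == "latest":
        return "https://www.unicode.org/Public/UCD/latest/ucd/"
    # Assumption: pinned versions live under Public/<version>/ucd/.
    return f"https://www.unicode.org/Public/{version}/ucd/"


def build_urls(version):
    """Map each local file name to its download URL for the given version."""
    base = ucd_base_url(version)
    return {name: base + path for name, path in UCD_RELATIVE_PATHS.items()}


def main(argv=None):
    parser = argparse.ArgumentParser(description="Download Unicode data files.")
    # Hypothetical option; defaults to the behavior currently in the PR.
    parser.add_argument("--version", default="latest",
                        help='Unicode version to fetch, e.g. "15.1.0"')
    args = parser.parse_args(argv)
    for name, url in build_urls(args.version).items():
        print(f"Downloading {url} ...")
        urllib.request.urlretrieve(url, name)

# A real script would call main() under an `if __name__ == "__main__":` guard.
```

With this shape, `main(["--version", "15.1.0"])` would fetch a pinned, reproducible set of files, while omitting the option keeps the "latest" behavior.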
These are attempts to make it easier to regenerate source files from the Unicode data.
SG16 recommends specifying Unicode 15.1.0 as the minimum Unicode version for C++23 (as a DR) and C++26. See cplusplus/papers#1736 and https://github.com/sg16-unicode/sg16-meetings.
Currently, __msvc_format_ucd_tables.hpp is based on Unicode 15.0.0. In order to support 15.1.0, this file will need to be regenerated. (Other changes are also needed. For example, _Grapheme_break_property_iterator might need to account for the changes in UAX 29. I haven't taken a close look, though.)