Make datagen faster by caching more things #3625

Merged: 3 commits, Jul 5, 2023

sffc commented Jul 4, 2023

Fixes #3620

robertbastian previously approved these changes Jul 4, 2023

sffc commented Jul 4, 2023

Good stuff: the full-datagen job finishes in 10m 0s, of which 7m 47s is building datagen in release mode. Spot-checking, the slowest keys seem to be timezones and segmenter again, as they had been before. Displaynames finishes in less than 2 seconds.

@robertbastian

I failed as a linter

@robertbastian robertbastian left a comment

Please change the displaynames locale set to Recommended again

sffc commented Jul 4, 2023

From running icu4x-datagen in release mode with rayon disabled and comparing timestamps, the slowest keys are, in order of slowness (slowest on top):

| Key | Time (seconds) |
| --- | --- |
| segmenter/word@1 | 0.348 |
| segmenter/dictionary/w_auto@1 | 0.343 |
| segmenter/line@1 | 0.303 |
| segmenter/dictionary/wl_ext@1 | 0.296 |
| segmenter/sentence@1 | 0.255 |
| datetime/japanext/datesymbols@1 | 0.247 |
| datetime/japanese/datesymbols@1 | 0.212 |
| list/unit@1 | 0.206 |
| time_zone/specific_long@1 | 0.2 |
| datetime/indian/datesymbols@1 | 0.198 |
| segmenter/grapheme@1 | 0.182 |
| datetime/gregory/datesymbols@1 | 0.176 |
| props/bidiauxiliaryprops@1 | 0.172 |
| decimal/symbols@1 | 0.172 |
| datetime/coptic/datesymbols@1 | 0.163 |
| datetime/buddhist/datesymbols@1 | 0.158 |
| datetime/ethiopic/datesymbols@1 | 0.152 |
| time_zone/generic_short@1 | 0.15 |
| time_zone/generic_long@1 | 0.136 |
| collator/data@1 | 0.122 |
| datetime/japanese/datelengths@1 | 0.122 |
| time_zone/exemplar_cities@1 | 0.122 |
| props/exemplarchars/main@1 | 0.12 |
| list/and@1 | 0.116 |
| props/exemplarchars/index@1 | 0.109 |
| list/or@1 | 0.108 |
| props/exemplarchars/punctuation@1 | 0.107 |
| props/exemplarchars/numbers@1 | 0.107 |
| time_zone/specific_short@1 | 0.105 |
| datetime/gregory/datelengths@1 | 0.102 |
| datetime/timelengths@1 | 0.099 |
| props/exemplarchars/auxiliary@1 | 0.099 |
| datetime/timesymbols@1 | 0.098 |
| datetime/buddhist/datelengths@1 | 0.088 |
| datetime/japanext/datelengths@1 | 0.087 |
| datetime/coptic/datelengths@1 | 0.084 |
| datetime/ethiopic/datelengths@1 | 0.083 |
| time_zone/formats@1 | 0.082 |
| datetime/indian/datelengths@1 | 0.073 |
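
For anyone who wants to reproduce this kind of ranking, here is a minimal sketch of the measurement idea: run each unit of work sequentially (no thread pool), time it, and sort slowest first. This is not how icu4x-datagen itself reports timings; `time_tasks` and the task names are hypothetical, and each closure stands in for generating one key.

```rust
use std::time::{Duration, Instant};

/// Runs each named task one after another (no parallelism, so the
/// measurements do not overlap) and returns `(name, elapsed)` pairs
/// sorted from slowest to fastest.
///
/// Hypothetical helper: each closure stands in for generating one key.
fn time_tasks(tasks: Vec<(&'static str, Box<dyn FnOnce()>)>) -> Vec<(&'static str, Duration)> {
    let mut timings: Vec<(&'static str, Duration)> = tasks
        .into_iter()
        .map(|(name, task)| {
            let start = Instant::now();
            task();
            (name, start.elapsed())
        })
        .collect();

    // Slowest first, matching the ordering of the table above.
    timings.sort_by(|a, b| b.1.cmp(&a.1));
    timings
}

/// Formats one result as "key seconds" with millisecond precision.
fn format_row(name: &str, elapsed: Duration) -> String {
    format!("{} {:.3}", name, elapsed.as_secs_f64())
}
```

Passing a list of boxed closures to `time_tasks` and printing each result with `format_row` yields rows in the same "key, seconds" shape as the table. Running sequentially matters because timings taken while other keys run in parallel are not comparable.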

robertbastian merged commit 67ff684 into unicode-org:main Jul 5, 2023
22 checks passed
pub(crate) struct CldrCache(pub SerdeCache);
pub(crate) struct CldrCache {
    serde_cache: SerdeCache,
    is_full: RwLock<Option<bool>>,

Follow-up for after the Rust version update: use OnceCell here.
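
A rough sketch of what that follow-up could look like, assuming the flag is computed once and must be shared across threads. It uses `std::sync::OnceLock`, the thread-safe counterpart to `OnceCell` in the standard library (stable since Rust 1.70). `LazyFlag` and its `computations` counter are hypothetical stand-ins rather than the actual `CldrCache` code, and the sketch assumes the computation is infallible.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Hypothetical stand-in for the cache struct in the diff; only the
/// lazily computed flag is modeled here.
struct LazyFlag {
    /// Replaces `RwLock<Option<bool>>`: empty until first computed.
    is_full: OnceLock<bool>,
    /// Counts how often the expensive check ran (demonstration only).
    computations: AtomicUsize,
}

impl LazyFlag {
    fn new() -> Self {
        LazyFlag {
            is_full: OnceLock::new(),
            computations: AtomicUsize::new(0),
        }
    }

    /// Returns the cached flag, running `compute` only on first access.
    /// Later calls ignore their closure and return the stored value.
    fn is_full(&self, compute: impl FnOnce() -> bool) -> bool {
        *self.is_full.get_or_init(|| {
            self.computations.fetch_add(1, Ordering::Relaxed);
            compute()
        })
    }
}
```

Compared with `RwLock<Option<bool>>`, this removes the explicit read-then-write locking dance: `get_or_init` handles the "compute if absent" step itself.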

Development

Successfully merging this pull request may close these issues.

Improve displaynames datagen perf and restore all data
3 participants