-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor cldr data processing script #20
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
I added a `CLDR_VERSION` variable to the __init__.py for the linearmoney package. In order to have the CLDR_VERSION update automatically with the version of the cldr data that is parsed into the library, I added a few lines to the process_cldr_data.py script to open the cldr-core package of the cldr-json data and read the `version` string. I then write the version string to a new package resource file `cldr_version.json`, which is loaded at import-time via the `resources` module in order to ensure the CLDR_VERSION gives the correct version information for the cldr data the library is currently using, even if the user manually runs the processing script with a different version of the cldr-json data than what linearmoney is distributed with and rebuilds the package. This should allow proper automated reporting of the correct cldr data version without coupling the cldr version to the library source code. This is not the most ideal solution as it requires a file read at import-time, but it only happens once, and it is a very simple and flexible implementation. A more robust solution that doesn't require a package resource file read is likely possible, but it doesn't provide much practical benefit over the simple package resource solution, and I don't have the time to put into it right now, so I am leaving this as is for now. This is something that can be improved in the future though.
I added a `tests/cldr.py` test script for regression testing the cldr data processing script when making changes. The test script simply checks the produced JSON files against files produced by a known correct version of the script. Basically, it's just a snapshot test. This means that the reference JSON files in the tests directory need to be updated when the version of the CLDR data is updated, but I have the `cldr_version.json` file as part of the test data, so if the version is accidentally updated, then the test suite will fail on the check of that file, and we will know that we need to either roll back changes or verify that the new version works as expected and update the reference JSON.
When I wrapped the loading of locale data from the CLDR json files into memory in a function, I forgot to call the function afterwards, so the cldr data processing script was not creating any locale data. This is fixed. I also added a new `cldr` environment to Hatch for building and testing the cldr data. The tests are working correctly. That's actually how I noticed that the script was broken and where to look for the problem.
I added a section to the CONTRIBUTING.md file explaining the process for updating the CLDR data used by the library. I also changed the structure of the cldr test data to keep the test script and all data in a dedicated `tests/cldr` directory instead of having everything sitting in the top-level `tests` directory.
fix #11 I finished the primary refactoring the `process_cldr_data.py` script. It's still not great, but it's broken up into more clearly delineated function now, so it should be much easier to make changes to the way we parse the data or to change the output format. Through this refactoring process, I also found a couple bugs in the script that were causing the incorrect data to be parsed, so this commit also includes updated currencies and locales data. Specifically, several currencies were being parsed as having `cash_places = 2` when they should have been `cash_places = 0`. This was caused by using `j` instead of `i` in the loop in the old `_build_fractions_data` function. This caused the branch that read the data for `cash_places` to get skipped for all currencies, so any that had specific `cash_places` would instead get the value of the `DEFAULT` currency. The other problem was that the comparisons with `" "` weren't working for locales that use some other unicode whitespace character for their spaces in the formatting patterns. This caused several locales to have incorrect currency symbol and positive/negative sign placement. I added a helper function for normalizing whitespace with the `unicodedata` module, and this is now fixed. The new data should have the correct sign and symbol placement and spacing.
The updates to the locales data affected the fr_FR locale, which was being parsed incorrectly before and not including the space between the numeric portion and the currency symbol. The changes to the cldr data processing script fix this error in the parsed locales data, but I forgot to update the test case that checkes for non-ascii characters in the `scalars.l10n` function. This was not caught because the venv that hatch created did not update with the new data files when running the test suite locally, so it was not caught until CI ran on the PR for this. I manually recreated the venv locally to confirm the changes fix the failing test case and reflect the correct localization formatting.
GrammAcc
added a commit
that referenced
this pull request
Apr 30, 2024
Bumped version to 0.1.3 to create a release since the cldr data included in v0.1.2 on pypi is incorrect. Version 0.1.3 will include the corrected cldr data that was added with the cldr data processing refactor. See #20
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR includes the refactorings in the
process_cldr_data.py
script, which was unmaintainable before.I broke the script up into named functions that should make it easier to make changes to the way the data is parsed or the format of the output without breaking anything going forward.
The actual parsing of the data has been made somewhat more intelligible, but there's still a lot of if chains in this script, so it's still not easy to understand everything that's going on. There's still room for improvement here.
I also added parsing of the version string of the CLDR data itself to the script, so we now have an automated way to keep track of the specific version of CLDR data that is shipped with the library. This is accessible to the user through the
resources
module, which could help with bug reports. Mainly though, this just makes it a lot easier to keep track of what CLDR data is included with specific versions of the linearmoney library. I've documented that a change in CLDR data version constitutes a minor version change in linearmoney since this doesn't break any API compatibility, but it can change the output presented to the user, which could have unexpected consequences.This PR also adds a
cldr
hatch environment with commandscldr:build
andcldr:test
, which parse and test the data respectively. Thecldr:test
command simply does a snapshot test comparing the current resource files in the source directory to a known good version of the resources stored in the tests/cldr directory. This ensures that running the data processing script orcldr:build
does not break anything. It also checks the CLDR version file. This is helpful because if a contributor has cloned the cldr-json repo and have a newer version than what is shipped with linearmoney, then running the processing script will likely change the output of functions likelinearmoney.scalar.l10n
even if the data is correct. We want to avoid making changes to the cldr data version like this unless the decision was explicit.I also added some instructions related to updating the CLDR data to the CONTRIBUTING.md document.