Refactor cldr data processing script #20

GrammAcc · 2024-04-30T00:56:26Z

This PR includes the refactorings in the process_cldr_data.py script, which was unmaintainable before.

I broke the script up into named functions that should make it easier to make changes to the way the data is parsed or the format of the output without breaking anything going forward.

The actual parsing of the data has been made somewhat more intelligible, but there's still a lot of if chains in this script, so it's still not easy to understand everything that's going on. There's still room for improvement here.

I also added parsing of the version string of the CLDR data itself to the script, so we now have an automated way to keep track of the specific version of CLDR data that is shipped with the library. This is accessible to the user through the resources module, which could help with bug reports. Mainly though, this just makes it a lot easier to keep track of what CLDR data is included with specific versions of the linearmoney library. I've documented that a change in CLDR data version constitutes a minor version change in linearmoney since this doesn't break any API compatibility, but it can change the output presented to the user, which could have unexpected consequences.

This PR also adds a cldr hatch environment with commands cldr:build and cldr:test, which parse and test the data respectively. The cldr:test command simply does a snapshot test comparing the current resource files in the source directory to a known good version of the resources stored in the tests/cldr directory. This ensures that running the data processing script or cldr:build does not break anything. It also checks the CLDR version file. This is helpful because if a contributor has cloned the cldr-json repo and have a newer version than what is shipped with linearmoney, then running the processing script will likely change the output of functions like linearmoney.scalar.l10n even if the data is correct. We want to avoid making changes to the cldr data version like this unless the decision was explicit.

I also added some instructions related to updating the CLDR data to the CONTRIBUTING.md document.

I added a `CLDR_VERSION` variable to the __init__.py for the linearmoney package. In order to have the CLDR_VERSION update automatically with the version of the cldr data that is parsed into the library, I added a few lines to the process_cldr_data.py script to open the cldr-core package of the cldr-json data and read the `version` string. I then write the version string to a new package resource file `cldr_version.json`, which is loaded at import-time via the `resources` module in order to ensure the CLDR_VERSION gives the correct version information for the cldr data the library is currently using, even if the user manually runs the processing script with a different version of the cldr-json data than what linearmoney is distributed with and rebuilds the package. This should allow proper automated reporting of the correct cldr data version without coupling the cldr version to the library source code. This is not the most ideal solution as it requires a file read at import-time, but it only happens once, and it is a very simple and flexible implementation. A more robust solution that doesn't require a package resource file read is likely possible, but it doesn't provide much practical benefit over the simple package resource solution, and I don't have the time to put into it right now, so I am leaving this as is for now. This is something that can be improved in the future though.

I added a `tests/cldr.py` test script for regression testing the cldr data processing script when making changes. The test script simply checks the produced JSON files against files produced by a known correct version of the script. Basically, it's just a snapshot test. This means that the reference JSON files in the tests directory need to be updated when the version of the CLDR data is updated, but I have the `cldr_version.json` file as part of the test data, so if the version is accidentally updated, then the test suite will fail on the check of that file, and we will know that we need to either roll back changes or verify that the new version works as expected and update the reference JSON.

When I wrapped the loading of locale data from the CLDR json files into memory in a function, I forgot to call the function afterwards, so the cldr data processing script was not creating any locale data. This is fixed. I also added a new `cldr` environment to Hatch for building and testing the cldr data. The tests are working correctly. That's actually how I noticed that the script was broken and where to look for the problem.

I added a section to the CONTRIBUTING.md file explaining the process for updating the CLDR data used by the library. I also changed the structure of the cldr test data to keep the test script and all data in a dedicated `tests/cldr` directory instead of having everything sitting in the top-level `tests` directory.

fix #11 I finished the primary refactoring the `process_cldr_data.py` script. It's still not great, but it's broken up into more clearly delineated function now, so it should be much easier to make changes to the way we parse the data or to change the output format. Through this refactoring process, I also found a couple bugs in the script that were causing the incorrect data to be parsed, so this commit also includes updated currencies and locales data. Specifically, several currencies were being parsed as having `cash_places = 2` when they should have been `cash_places = 0`. This was caused by using `j` instead of `i` in the loop in the old `_build_fractions_data` function. This caused the branch that read the data for `cash_places` to get skipped for all currencies, so any that had specific `cash_places` would instead get the value of the `DEFAULT` currency. The other problem was that the comparisons with `" "` weren't working for locales that use some other unicode whitespace character for their spaces in the formatting patterns. This caused several locales to have incorrect currency symbol and positive/negative sign placement. I added a helper function for normalizing whitespace with the `unicodedata` module, and this is now fixed. The new data should have the correct sign and symbol placement and spacing.

The updates to the locales data affected the fr_FR locale, which was being parsed incorrectly before and not including the space between the numeric portion and the currency symbol. The changes to the cldr data processing script fix this error in the parsed locales data, but I forgot to update the test case that checkes for non-ascii characters in the `scalars.l10n` function. This was not caught because the venv that hatch created did not update with the new data files when running the test suite locally, so it was not caught until CI ran on the PR for this. I manually recreated the venv locally to confirm the changes fix the failing test case and reflect the correct localization formatting.

Bumped version to 0.1.3 to create a release since the cldr data included in v0.1.2 on pypi is incorrect. Version 0.1.3 will include the corrected cldr data that was added with the cldr data processing refactor. See #20

GrammAcc added 6 commits April 25, 2024 09:12

GrammAcc linked an issue Apr 30, 2024 that may be closed by this pull request

[Tooling] Refactor CLDR Data Processing Script #11

Closed

GrammAcc merged commit 64094b8 into dev Apr 30, 2024
2 checks passed

GrammAcc deleted the refactor-cldr-data-processing-script branch April 30, 2024 01:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor cldr data processing script #20

Refactor cldr data processing script #20

GrammAcc commented Apr 30, 2024

Refactor cldr data processing script #20

Refactor cldr data processing script #20

Conversation

GrammAcc commented Apr 30, 2024