Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor cldr data processing script #20

Merged
merged 6 commits into from
Apr 30, 2024

Conversation

GrammAcc
Copy link
Owner

This PR includes the refactorings in the process_cldr_data.py script, which was unmaintainable before.

I broke the script up into named functions that should make it easier to make changes to the way the data is parsed or the format of the output without breaking anything going forward.

The actual parsing of the data has been made somewhat more intelligible, but there's still a lot of if chains in this script, so it's still not easy to understand everything that's going on. There's still room for improvement here.

I also added parsing of the version string of the CLDR data itself to the script, so we now have an automated way to keep track of the specific version of CLDR data that is shipped with the library. This is accessible to the user through the resources module, which could help with bug reports. Mainly though, this just makes it a lot easier to keep track of what CLDR data is included with specific versions of the linearmoney library. I've documented that a change in CLDR data version constitutes a minor version change in linearmoney since this doesn't break any API compatibility, but it can change the output presented to the user, which could have unexpected consequences.

This PR also adds a cldr hatch environment with commands cldr:build and cldr:test, which parse and test the data respectively. The cldr:test command simply does a snapshot test comparing the current resource files in the source directory to a known good version of the resources stored in the tests/cldr directory. This ensures that running the data processing script or cldr:build does not break anything. It also checks the CLDR version file. This is helpful because if a contributor has cloned the cldr-json repo and have a newer version than what is shipped with linearmoney, then running the processing script will likely change the output of functions like linearmoney.scalar.l10n even if the data is correct. We want to avoid making changes to the cldr data version like this unless the decision was explicit.

I also added some instructions related to updating the CLDR data to the CONTRIBUTING.md document.

I added a `CLDR_VERSION` variable to the __init__.py for the linearmoney
package.

In order to have the CLDR_VERSION update automatically with the version
of the cldr data that is parsed into the library, I added a few lines
to the process_cldr_data.py script to open the
cldr-core package of the cldr-json data and read the `version` string.

I then write the version string to a new package resource file
`cldr_version.json`, which is loaded at import-time via the `resources`
module in order to ensure the CLDR_VERSION gives the correct version
information for the cldr data the library is currently using, even if
the user manually runs the processing script with a different version
of the cldr-json data than what linearmoney is distributed with and
rebuilds the package.

This should allow proper automated reporting of the correct cldr data
version without coupling the cldr version to the library source code.

This is not the most ideal solution as it requires a file read at
import-time, but it only happens once, and it is a very simple and
flexible implementation. A more robust solution that doesn't require
a package resource file read is likely possible, but it doesn't
provide much practical benefit over the simple package resource
solution, and I don't have the time to put into it right now, so I am
leaving this as is for now. This is something that can be improved in
the future though.
I added a `tests/cldr.py` test script for regression testing the cldr
data processing script when making changes.

The test script simply checks the produced JSON files against files
produced by a known correct version of the script. Basically, it's just
a snapshot test.

This means that the reference JSON files in the tests directory need to be
updated when the version of the CLDR data is updated, but I have the
`cldr_version.json` file as part of the test data, so if the version is
accidentally updated, then the test suite will fail on the check of that
file, and we will know that we need to either roll back changes or
verify that the new version works as expected and update the reference
JSON.
When I wrapped the loading of locale data from the CLDR json files into
memory in a function, I forgot to call the function afterwards, so the
cldr data processing script was not creating any locale data.

This is fixed.

I also added a new `cldr` environment to Hatch for building and testing
the cldr data. The tests are working correctly. That's actually how I
noticed that the script was broken and where to look for the problem.
I added a section to the CONTRIBUTING.md file explaining the process for
updating the CLDR data used by the library.

I also changed the structure of the cldr test data to keep the test
script and all data in a dedicated `tests/cldr` directory instead of
having everything sitting in the top-level `tests` directory.
fix #11

I finished the primary refactoring the `process_cldr_data.py` script.

It's still not great, but it's broken up into more clearly delineated
function now, so it should be much easier to make changes to the way we
parse the data or to change the output format.

Through this refactoring process, I also found a couple bugs in the
script that were causing the incorrect data to be parsed, so this commit
also includes updated currencies and locales data.

Specifically, several currencies were being parsed as having
`cash_places = 2` when they should have been `cash_places = 0`. This was
caused by using `j` instead of `i` in the loop in the old
`_build_fractions_data` function. This caused the branch that read the
data for `cash_places` to get skipped for all currencies, so any that
had specific `cash_places` would instead get the value of the `DEFAULT`
currency.

The other problem was that the comparisons with `" "` weren't working
for locales that use some other unicode whitespace character for their
spaces in the formatting patterns. This caused several locales to have
incorrect currency symbol and positive/negative sign placement.

I added a helper function for normalizing whitespace with the
`unicodedata` module, and this is now fixed. The new data should have
the correct sign and symbol placement and spacing.
The updates to the locales data affected the fr_FR locale, which was
being parsed incorrectly before and not including the space between the
numeric portion and the currency symbol.

The changes to the cldr data processing script fix this error in the
parsed locales data, but I forgot to update the test case that checkes
for non-ascii characters in the `scalars.l10n` function. This was not
caught because the venv that hatch created did not update with the new
data files when running the test suite locally, so it was not caught
until CI ran on the PR for this.

I manually recreated the venv locally to confirm the changes fix the
failing test case and reflect the correct localization formatting.
@GrammAcc GrammAcc linked an issue Apr 30, 2024 that may be closed by this pull request
@GrammAcc GrammAcc merged commit 64094b8 into dev Apr 30, 2024
2 checks passed
@GrammAcc GrammAcc deleted the refactor-cldr-data-processing-script branch April 30, 2024 01:19
GrammAcc added a commit that referenced this pull request Apr 30, 2024
Bumped version to 0.1.3 to create a release since the cldr data included
in v0.1.2 on pypi is incorrect. Version 0.1.3 will include the corrected
cldr data that was added with the cldr data processing refactor. See #20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Tooling] Refactor CLDR Data Processing Script
1 participant