@benmss benmss commented Aug 6, 2025

Summary

This PR allows tags containing non-UTF-8 characters to be parsed and handled.

Description of changes

Although it is rare for repositories to have non-UTF-8 characters in their tags, it is possible. Macaron currently retrieves tags to find provenance and to match PURL versions to repository commits. When the tags are accessed via GitPython's repo.tags, a Unicode decode error can be raised if non-UTF-8 characters are encountered. The error only occurs if the affected tag is stored in the .git/packed-refs file, which the git pack-refs command can create in a repository. Tags stored as individual files under .git/refs/tags should be fine (possibly depending on the filesystem encoding).
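The failure mode can be illustrated without GitPython. The sketch below is an assumption-laden stand-in, not GitPython's actual code path: it writes a packed-refs style file with a made-up hash and a tag name ending in the byte 0xC3, then shows that a strict UTF-8 text read fails while a byte-level read still works.

```python
import tempfile
from pathlib import Path

# Illustrative packed-refs content; the hash and tag name are made up.
packed_refs = (
    b"# pack-refs with: peeled fully-peeled sorted \n"
    b"0123456789abcdef0123456789abcdef01234567 refs/tags/v1.0\xc3\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "packed-refs"
    path.write_bytes(packed_refs)

    # A strict UTF-8 text read rejects the lone 0xC3 byte.
    try:
        path.read_text(encoding="utf-8")
        strict_read_failed = False
    except UnicodeDecodeError:
        strict_read_failed = True

    # Reading raw bytes always works; decoding can then be deferred.
    raw_lines = path.read_bytes().split(b"\n")
```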

To fix this, code that needs tags now goes through a set of functions that first try the previous collection method and, if it fails, fall back to a Git subprocess that runs git show-ref --tags. The output of this command is decoded using one of the 10 most common character encodings (by this point UTF-8 has already failed). The list could be extended to cover all encodings supported by Python if desired; there are 97 in total as of Python 3.8. See https://docs.python.org/3.8/library/codecs.html#standard-encodings
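A minimal sketch of the fallback described above, not the code in this PR. The function names and the contents of FALLBACK_ENCODINGS are hypothetical; the actual list of ten encodings is not reproduced here.

```python
from __future__ import annotations

import subprocess

# Hypothetical candidate list, for illustration only.
FALLBACK_ENCODINGS = ["utf-8", "shift_jis", "gb2312", "euc_kr", "windows-1251", "iso-8859-1"]


def decode_with_fallback(raw: bytes, encodings: list[str] = FALLBACK_ENCODINGS) -> str | None:
    """Return raw decoded with the first encoding that succeeds, or None if all fail."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return None


def parse_show_ref(output: str) -> dict[str, str]:
    """Map tag name to object hash from lines shaped like '<hash> refs/tags/<name>'."""
    tags: dict[str, str] = {}
    for line in output.split("\n"):
        object_hash, _, ref = line.partition(" ")
        if ref.startswith("refs/tags/"):
            tags[ref[len("refs/tags/"):]] = object_hash
    return tags


def list_tags_via_git(repo_path: str) -> dict[str, str]:
    """Run `git show-ref --tags` in repo_path and decode its raw output leniently."""
    # check=False: git exits non-zero when the repository has no tags.
    result = subprocess.run(
        ["git", "show-ref", "--tags"], cwd=repo_path, capture_output=True, check=False
    )
    text = decode_with_fallback(result.stdout)
    return parse_show_ref(text) if text is not None else {}
```

Note that a single-byte encoding such as iso-8859-1 accepts any byte sequence, so placing it last in the list makes the loop behave like a catch-all.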

Candidate encodings are tried in turn until one succeeds or all of them fail. Finding the "correct" encoding is not currently important for our use case, because these non-UTF-8 characters end up percent-encoded when they are part of a PURL, so such tags cannot be matched anyway. E.g. v1.0%C3%83 != v1.0Ã
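The mismatch can be reproduced with the standard library alone. This sketch uses urllib.parse.quote as an approximation of PURL percent-encoding; it is an assumption that this matches what a PURL library emits for this input, and the variable names are illustrative.

```python
from urllib.parse import quote, unquote

tag_name = "v1.0\u00c3"  # "v1.0Ã", e.g. the byte 0xC3 decoded as ISO-8859-1

# quote() encodes non-ASCII characters as percent-escaped UTF-8 bytes.
purl_version = quote(tag_name)

# A plain string comparison between the two forms fails.
direct_match = purl_version == tag_name

# Undoing the percent-encoding recovers the decoded tag name.
unquoted_match = unquote(purl_version) == tag_name
```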

An integration test is included that uses the repository where a tag of this type was found: https://github.com/ACRA/acra
A unit test is included that extends the pre-existing commit finder testing repository with a packed-refs file containing a non-UTF-8 tag.

The issue is unlikely to be fixed in GitPython itself, because that project is in "maintenance mode".

@benmss benmss self-assigned this Aug 6, 2025
@benmss benmss added the enhancement Enhancement of a feature label Aug 6, 2025
@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Aug 6, 2025
benmss added 5 commits August 6, 2025 15:22
Signed-off-by: Ben Selwyn-Smith <benselwynsmith@googlemail.com>
Signed-off-by: Ben Selwyn-Smith <benselwynsmith@googlemail.com>
Signed-off-by: Ben Selwyn-Smith <benselwynsmith@googlemail.com>
… for consistency

Signed-off-by: Ben Selwyn-Smith <benselwynsmith@googlemail.com>
Signed-off-by: Ben Selwyn-Smith <benselwynsmith@googlemail.com>
@benmss benmss force-pushed the benmss/tags-outside-utf8 branch from 2a58ecf to bd1e3ff Compare August 6, 2025 05:23
@benmss benmss marked this pull request as ready for review August 7, 2025 00:07
@benmss benmss requested review from behnazh-w and tromai as code owners August 7, 2025 00:07