Closed
Description
git has not always supported encodings, and so there are commits "in the wild" that contain 8-bit non-UTF-8 messages.
Here's an example from the git tree itself that uses ISO-8859-1.
Unfortunately, pygit2 assumes and enforces UTF-8 encoding, in C, at read time. This means that commit metadata that is not valid UTF-8 causes a `UnicodeDecodeError` to be raised, in such a way that there is no possible workaround in Python.
As far as I can tell, there are just a few possible sensible approaches:
- Add a `message_raw` field to `Commit` and `Tag` objects that supplies the raw bytes and allows an app to decide how to decode them. This would give a decent fallback. Do the same for author and committer `Signature` objects.
- Perform decoding in `lenient` or `replace` mode. I don't like this because it can lead to silent data loss, which is slightly worse than the current "unrecoverable exception" approach :-)
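
To illustrate the difference between the two approaches, here is a minimal pure-Python sketch of what an app could do if it were handed the raw bytes. It does not call pygit2 at all: `decode_commit_bytes` and its `declared_encoding` parameter are hypothetical names, and the `raw` value stands in for whatever the proposed `message_raw` field would return.

```python
from typing import Optional


def decode_commit_bytes(raw: bytes, declared_encoding: Optional[str] = None) -> str:
    """Decode raw commit metadata without raising and without dropping bytes.

    Hypothetical helper: `declared_encoding` stands in for the value of the
    commit's `encoding` header, if the app has one available.
    """
    # Try strict UTF-8 first, then the declared encoding (if any).
    candidates = ["utf-8"]
    if declared_encoding:
        candidates.append(declared_encoding)
    for name in candidates:
        try:
            return raw.decode(name)  # strict mode: fail rather than guess
        except (UnicodeDecodeError, LookupError):
            # LookupError covers an unknown or misspelled codec name.
            continue
    # ISO-8859-1 maps every byte to a code point, so this cannot fail and
    # the original bytes can be recovered with .encode("iso-8859-1").
    return raw.decode("iso-8859-1")


# Stand-in for the bytes a `message_raw` field would expose.
raw = "Bj\u00f6rn".encode("iso-8859-1")  # b'Bj\xf6rn'

recovered = decode_commit_bytes(raw)           # app-controlled fallback
lossy = raw.decode("utf-8", errors="replace")  # the "replace" approach

print(repr(recovered))
print(repr(lossy))
```

With raw bytes the app keeps the option to round-trip the original data, whereas `errors="replace"` substitutes U+FFFD for the offending byte and the original character cannot be recovered afterwards.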