commit metadata is incorrectly assumed to be UTF-8 #77

@bos

Git has not always supported commit encodings, so there are commits "in the wild" whose messages contain 8-bit, non-UTF-8 text.

Here's an example from the git tree itself that uses ISO-8859-1.

Unfortunately, pygit2 assumes and enforces UTF-8 encoding, in C, at read time. This means that commit metadata that is not UTF-8 causes a UnicodeDecodeError to be raised, and because the decoding happens in the C layer, there is no possible workaround in Python.
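To make the failure concrete, here is a minimal sketch in plain Python that does not call pygit2 at all. The byte string is a made-up stand-in for an ISO-8859-1 commit message, not the actual commit from the git tree; it only shows why a strict UTF-8 decode of such bytes fails.

```python
# Hypothetical commit message bytes written in ISO-8859-1 (Latin-1).
# 0xF6 is "o with diaeresis" in Latin-1 and is not a valid UTF-8 start byte.
raw_message = b"Bj\xf6rn fixed the Makefile\n"

# A strict UTF-8 decode, which is what the issue says happens at read time.
try:
    raw_message.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

# Decoding with the encoding the author actually used recovers the text.
latin1_text = raw_message.decode("iso-8859-1")

print(type(strict_error).__name__)
print(repr(latin1_text))
```

When the strict decode is performed before Python code ever sees the bytes, the second step is not available to the application, which is the core of the problem described here.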

As far as I can tell, there are just a few possible sensible approaches:

  • Add a message_raw field to Commit and Tag objects that supplies the raw bytes and allows an app to decide how to decode them. This would give a decent fallback. Do the same for author and committer Signature objects.
  • Perform decoding in lenient or replace mode. I don't like this because it can lead to silent data loss, which is slightly worse than the current "unrecoverable exception" approach :-)
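The two approaches can be contrasted with a short sketch. It assumes the application has somehow obtained the raw bytes (for example through the `message_raw` field proposed above, which is a suggestion in this issue and not a confirmed pygit2 attribute). The helper `decode_metadata`, its parameters, and the choice of ISO-8859-1 as a last-resort fallback are all hypothetical and only illustrate one policy an application might pick.

```python
from typing import Optional


def decode_metadata(raw: bytes,
                    declared_encoding: Optional[str] = None,
                    fallback: str = "iso-8859-1") -> str:
    """Decode raw commit metadata, trying the most specific encoding first.

    declared_encoding is whatever the application knows about the commit
    (for instance a value taken from the commit's encoding header); it may
    be None. All names here are illustrative.
    """
    candidates = []
    if declared_encoding:
        candidates.append(declared_encoding)
    candidates.append("utf-8")

    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding: try the next candidate.
            continue

    # ISO-8859-1 maps every byte to a code point, so with the default
    # fallback this final decode cannot fail.
    return raw.decode(fallback)


raw_author = b"Bj\xf6rn"

# Approach 1: the application decides how to decode the raw bytes.
recovered = decode_metadata(raw_author)

# Approach 2: lenient decoding; the original byte is silently lost.
lossy = raw_author.decode("utf-8", errors="replace")

print(repr(recovered))
print(repr(lossy))
```

With raw access the application can recover the original name, while replace mode substitutes U+FFFD for the undecodable byte and the original value cannot be reconstructed from the result.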