mid3v2 crashes with "UnicodeEncodeError: surrogates not allowed" on files with accented characters in the filename #648

Open
martinwguy opened this issue May 20, 2024 · 2 comments

Comments


martinwguy commented May 20, 2024

Trying to see whether ISRC tags are present in a large audio collection using
mid3v2 -l 00*/*3 | grep -a TSRC
it dies halfway through, saying

IDv2 tag info for 00-225167/mina - volami nel cuore.mp3
TIT2=Volami nel cuore
TPE1=MINA
TRCK=1
IDv2 tag info for Traceback (most recent call last):
  File "/usr/bin/mid3v2", line 33, in <module>
    sys.exit(load_entry_point('mutagen==1.46.0', 'console_scripts', 'mid3v2')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/mutagen/_tools/mid3v2.py", line 484, in entry_point
    return main(sys.argv)
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/mutagen/_tools/mid3v2.py", line 469, in main
    list_tags(args)
  File "/usr/lib/python3/dist-packages/mutagen/_tools/mid3v2.py", line 335, in list_tags
    print("IDv2 tag info for", filename)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc85' in position 13: surrogates not allowed

This isn't the Mina file's fault; the crash comes from the next file, whose name is ANSI or CP437 encoded rather than UTF-8: "modà - la notte.mp3", where à is stored as the single byte 0x85.
The same goes for other files whose names contain 0x8A for è, 0xB4 for é, 0x95 for ò, 0x97 for ù, 0xA2 for ó and so on.

On Debian GNU/Linux with LANG=en_GB.UTF-8
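
To illustrate where the '\udc85' in the traceback comes from, here is a minimal sketch, assuming the name on disk is the CP437 byte string described above. On a UTF-8 locale Python decodes file names with the surrogateescape error handler, so the stray byte 0x85 turns into the lone surrogate U+DC85, which a strict UTF-8 encode (as done by print) then rejects. The variable names are only illustrative.

```python
# Raw bytes of the file name as stored on disk (in CP437, 0x85 is "à").
raw_name = b"mod\x85 - la notte.mp3"

# File names are decoded with errors="surrogateescape" on a UTF-8 locale,
# so the undecodable byte 0x85 becomes the lone surrogate U+DC85.
name = raw_name.decode("utf-8", errors="surrogateescape")
print(ascii(name))

# A strict UTF-8 encode, which is what writing to stdout does, fails.
try:
    name.encode("utf-8")
    strict_error = None
except UnicodeEncodeError as exc:
    strict_error = exc.reason

# surrogateescape round-trips back to the original bytes, and decoding
# those bytes as CP437 recovers the intended name.
round_trip = name.encode("utf-8", errors="surrogateescape")
intended = round_trip.decode("cp437")
print(intended)
```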

@martinwguy martinwguy changed the title mid3v2 crashes with UnicodeEncodeError: surrogates not allowed mid3v2 crashes with "UnicodeEncodeError: surrogates not allowed" on files with accented characters in the filename May 20, 2024
Member

lazka commented Jun 30, 2024

Python is sadly still broken in this case. We could reopen stdout etc. with the surrogateescape error handler to work around it.
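
A rough sketch of what such a workaround might look like, assuming Python 3.7+ where text streams have reconfigure(). The helper name allow_surrogates is hypothetical and this is not mutagen's actual code; the demo uses an in-memory stream standing in for sys.stdout.

```python
import io


def allow_surrogates(stream):
    """Make a text stream write escaped surrogates back out as raw bytes.

    Hypothetical helper; relies on TextIOWrapper.reconfigure() (Python 3.7+).
    """
    stream.reconfigure(errors="surrogateescape")
    return stream


# Demonstrate on an in-memory stream standing in for sys.stdout.
buffer = io.BytesIO()
out = allow_surrogates(io.TextIOWrapper(buffer, encoding="utf-8", newline="\n"))

# The same kind of print call that crashed in list_tags now succeeds.
print("IDv2 tag info for", "mod\udc85 - la notte.mp3", file=out)
out.flush()
written = buffer.getvalue()
```

In a command-line tool the same call could be applied to sys.stdout and sys.stderr at startup, so the original bytes of the file name are passed through to the terminal unchanged.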


antlarr commented Jul 4, 2024

I don't think this is mutagen's or Python's fault. If the filename is encoded in CP437 rather than UTF-8, which is what Python expects according to your LANG setting, then I'd say the best fix is to re-encode the filenames correctly.

This can be done with: convmv -f cp437 -t utf-8 *. That only shows how the files would be renamed, without changing anything. Once you have checked that the encoding is right, you can run: convmv -f cp437 -t utf-8 --notest * to actually rename the files on disk.
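
For anyone without convmv at hand, the same dry-run-then-apply idea can be sketched in Python. This is only an illustration under the assumption that the bad names really are CP437; plan_renames and apply_renames are hypothetical helpers, and this sketch skips names that already decode as UTF-8.

```python
import os


def plan_renames(names, source_encoding="cp437"):
    """Return (old, new) byte-name pairs for names that are not valid UTF-8."""
    plan = []
    for raw in names:
        try:
            raw.decode("utf-8")
            continue  # already valid UTF-8, leave it alone
        except UnicodeDecodeError:
            pass
        # Reinterpret the bytes in the assumed source encoding, store as UTF-8.
        plan.append((raw, raw.decode(source_encoding).encode("utf-8")))
    return plan


def apply_renames(directory, plan):
    """Rename the files on disk; directory and names are bytes paths."""
    for old, new in plan:
        os.rename(os.path.join(directory, old), os.path.join(directory, new))


# Dry run on the two names from the report (no files are touched here).
example = plan_renames([b"mina - volami nel cuore.mp3", b"mod\x85 - la notte.mp3"])
for old, new in example:
    print(ascii(old), "->", new.decode("utf-8"))
```

On a real directory the plan could be built from os.listdir(b"."), which returns the raw byte names, and passed to apply_renames only after the printed plan has been checked.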
