UnicodeDecodeError when filename includes non ASCII characters #287

Open
davuses opened this issue Mar 31, 2023 · 2 comments

Comments

davuses commented Mar 31, 2023

I'm trying to read a file whose name contains non-ASCII characters:

magic.from_file("説明.txt")

This raises the following error:

Traceback (most recent call last):
  File "G:\BaiduNet\unarchive.py", line 64, in <module>
    magic.from_file("説明.txt")
  File "C:\Users\davuses\AppData\Local\Programs\Python\Python311\Lib\site-packages\magic\magic.py", line 135, in from_file
    return m.from_file(filename)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\davuses\AppData\Local\Programs\Python\Python311\Lib\site-packages\magic\magic.py", line 89, in from_file
    return maybe_decode(magic_file(self.cookie, filename))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\davuses\AppData\Local\Programs\Python\Python311\Lib\site-packages\magic\magic.py", line 214, in maybe_decode
    return s.decode('utf-8')
           ^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in position 16: invalid continuation byte

If I rename the file to an ASCII name, say file.txt, the problem disappears.

Also, if I use .from_buffer(), there's no issue:

magic.from_buffer(open("説明.txt", "rb").read(2048), mime=True)

Weird. I'm not sure whether this is related to issue #205.

The package was installed with pip install python-magic-bin on Windows 11, Python 3.11.

silente commented Apr 11, 2023

Hi, I have the same problem.

My code is:

magic.from_file(file_path, mime=True)

My error is:

  File "C:\Program Files\Python38\lib\site-packages\magic\magic.py", line 135, in from_file
    return m.from_file(filename)
  File "C:\Program Files\Python38\lib\site-packages\magic\magic.py", line 89, in from_file
    return maybe_decode(magic_file(self.cookie, filename))
  File "C:\Program Files\Python38\lib\site-packages\magic\magic.py", line 214, in maybe_decode
    return s.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 57: invalid continuation byte

I tried editing line 214 of "C:\Program Files\Python38\lib\site-packages\magic\magic.py", changing return s.decode('utf-8') to return s.decode('utf-8', errors='ignore') or return s.decode('utf-8', errors='replace'), but I still encounter the problem.

ember91 commented Jan 10, 2025

This happens when the system encoding is not UTF-8 and the file path contains non-ASCII characters. It is most often a problem on Windows.

For details, see file_or_fd() in src/magic.c in the libmagic source code. It uses open(), which assumes the system encoding unless a locale has been set with e.g. setlocale(). I don't think any locale has been set, though. Meanwhile, coerce_filename() in __init__.py in this repository incorrectly assumes that the file path is UTF-8.
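To make the mismatch concrete, here is a small sketch in plain Python that does not call libmagic at all. The helper name compare_path_encodings is made up for illustration, and cp932 (a Japanese Windows code page) is only an assumed stand-in for "a system encoding that is not UTF-8"; the reporter's actual code page is not stated in this thread.

```python
def compare_path_encodings(path: str, system_encoding: str) -> dict:
    """Show how one path string becomes different bytes under two encodings.

    `system_encoding` is a stand-in for whatever the OS uses; it is an
    assumption for illustration, not something read from libmagic.
    """
    utf8_bytes = path.encode("utf-8")
    system_bytes = path.encode(system_encoding)

    # Check whether the system-encoded bytes happen to be valid UTF-8.
    try:
        system_bytes.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False

    return {
        "utf8_bytes": utf8_bytes,
        "system_bytes": system_bytes,
        "system_bytes_valid_utf8": valid_utf8,
    }


# Example call (cp932 is an assumed code page, see above):
# compare_path_encodings("説明.txt", "cp932")
```

For an ASCII-only name such as file.txt, both encodings yield identical bytes, which is consistent with the rename workaround reported at the top of this issue.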

Some solutions, none of them perfect, are:

  • Ensure the dynamic library loaded by python-magic sets its locale to UTF-8 in some way before calling open(). Unfortunately, I have not succeeded with this.
  • Temporarily move or symlink the file to a more accessible path without non-ASCII characters before calling from_file(). Note that symlink support seems to have been added to master in 2a01b18#diff-ecec88c33adb7591ee6aa88e29b62ad52ef443611cba5e0f0ecac9b5725afdba but is not yet included in any release of this library.
  • Read a number of bytes from the file in Python and pass them to from_buffer() instead. This seems to work, although it's not clear how many bytes should be read for consistent results. README.md in this repository recommends at least 2048 bytes.
  • Open the file in Python and pass its descriptor to from_descriptor(). Unfortunately, I haven't made this work. It may be possible on Linux but not on Windows: https://stackoverflow.com/questions/9200560/passing-a-file-descriptor-to-a-c-library-function-through-ctypes-on-windows
  • Run chcp 65001 before launching the application that uses this library, or supply a side-by-side manifest that sets the encoding to UTF-8. I have not tested this.
  • Use another library.
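The from_buffer() workaround from the list above could be wrapped roughly like this. It is only a sketch: detect_via_buffer and read_head are hypothetical helper names, the detector is passed in as a parameter so the snippet does not depend on how python-magic is installed, and the 2048-byte default simply follows the README recommendation mentioned above.

```python
from typing import Callable


def read_head(path: str, size: int = 2048) -> bytes:
    """Read the first `size` bytes; Python opens non-ASCII paths itself."""
    with open(path, "rb") as fh:
        return fh.read(size)


def detect_via_buffer(
    path: str,
    from_buffer: Callable[..., str],
    mime: bool = True,
    size: int = 2048,
) -> str:
    """Identify a file from its leading bytes instead of its path.

    `from_buffer` is expected to behave like magic.from_buffer(buffer, mime=...).
    The file name never reaches libmagic, so its encoding does not matter.
    """
    return from_buffer(read_head(path, size), mime=mime)


# Intended usage, assuming python-magic is importable:
# import magic
# detect_via_buffer("説明.txt", magic.from_buffer)
```

The trade-off is that detection only sees the first `size` bytes, so results may differ from from_file() for formats that need more data.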

Another idea I had was to encode the file path as a byte string in the system encoding, but due to another bug that doesn't work.
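For completeness, the "temporarily move the file" workaround from the list could look something like the sketch below, using a copy rather than a move. The name detect_via_ascii_copy and the temporary file name input.bin are made up for illustration, and the detector is again passed in as a parameter. It assumes the temporary directory itself has an ASCII-only path, which may not hold on Windows if the user name contains non-ASCII characters, so the code checks for that and fails loudly.

```python
import os
import shutil
import tempfile
from typing import Callable


def detect_via_ascii_copy(
    path: str,
    from_file: Callable[..., str],
    mime: bool = True,
) -> str:
    """Copy the file to an ASCII-only temporary path and identify the copy.

    `from_file` is expected to behave like magic.from_file(filename, mime=...).
    """
    with tempfile.TemporaryDirectory(prefix="magic_") as tmp_dir:
        # Hypothetical fixed name; only the path's character set matters here.
        tmp_path = os.path.join(tmp_dir, "input.bin")
        if not tmp_path.isascii():
            # The temp directory is non-ASCII too, so the copy would not help.
            raise ValueError(f"temporary path is not ASCII-only: {tmp_path!r}")
        shutil.copyfile(path, tmp_path)
        # The copy is removed when the with-block exits.
        return from_file(tmp_path, mime=mime)


# Intended usage, assuming python-magic is importable:
# import magic
# detect_via_ascii_copy("説明.txt", magic.from_file)
```

Copying costs extra I/O for large files, so the from_buffer() approach is probably cheaper when the first few kilobytes are enough.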
