Currently, all text-like files are loaded as UTF-8. This causes problems with, for example, CSV files created on Japanese Windows, which may be encoded as CP932. https://github.com/jawah/charset_normalizer can help with this issue. At present, converting such a file throws this:
markitdown._markitdown.FileConversionException: Could not convert 'test.csv' to Markdown. File type was recognized as ['.csv']. While converting the file, the following error was encountered:
Traceback (most recent call last):
File "/Users/brc-dd/foo/.venv/lib/python3.12/site-packages/markitdown/_markitdown.py", line 1041, in _convert
res = converter.convert(local_path, **_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/brc-dd/foo/.venv/lib/python3.12/site-packages/markitdown/_markitdown.py", line 166, in convert
text_content = fh.read()
^^^^^^^^^
File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x91 in position 1: invalid start byte
chardet doesn't work with such files. It detects them as Shift JIS while Windows uses an extended version of Shift JIS (CP932).
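To illustrate the kind of fallback that would avoid the error above, here is a minimal sketch using only the standard library. It is not markitdown's code and it does not use charset_normalizer's detection. It simply tries UTF-8 first and then CP932. The helper name `decode_with_fallback` and the default encoding order are assumptions made for this example.

```python
from typing import Iterable, Tuple


def decode_with_fallback(
    data: bytes, encodings: Iterable[str] = ("utf-8", "cp932")
) -> Tuple[str, str]:
    """Hypothetical helper: return (text, encoding) for the first codec
    in `encodings` that decodes `data` without errors."""
    last_error = None
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError as exc:
            # Remember the failure and try the next candidate.
            last_error = exc
    if last_error is None:
        raise ValueError("no encodings given")
    # Every candidate failed; surface the last decode error.
    raise last_error


# A CSV row as Japanese Windows might write it (CP932 bytes, not valid UTF-8).
sample = "名前,住所\n".encode("cp932")
text, used = decode_with_fallback(sample)
print(used, repr(text))
```

A fixed fallback list like this only helps when the likely encodings are known in advance. Many byte sequences decode under CP932 without actually being CP932, so a real detector is still the more general fix.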