use charset_normalizer #18

Closed
brc-dd opened this issue Dec 14, 2024 · 0 comments · Fixed by #19
Labels
bug Something isn't working

Comments

Contributor

brc-dd commented Dec 14, 2024

Currently, all text-like files are read as UTF-8. This causes problems with, for example, CSV files created on Japanese Windows, which may be encoded as CP932. https://github.com/jawah/charset_normalizer can help detect the encoding. At the moment, converting such a file fails with:

markitdown._markitdown.FileConversionException: Could not convert 'test.csv' to Markdown. File type was recognized as ['.csv']. While converting the file, the following error was encountered:

Traceback (most recent call last):
  File "/Users/brc-dd/foo/.venv/lib/python3.12/site-packages/markitdown/_markitdown.py", line 1041, in _convert
    res = converter.convert(local_path, **_kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/brc-dd/foo/.venv/lib/python3.12/site-packages/markitdown/_markitdown.py", line 166, in convert
    text_content = fh.read()
                   ^^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x91 in position 1: invalid start byte
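
For illustration, here is a minimal sketch of how the failure above can be reproduced and how encoding detection might avoid it. It is not the change made in #19. The helper name `read_text_detecting_encoding`, the `fallback` parameter, and the sample CSV content are hypothetical; `from_bytes(...).best()` and `str(match)` come from charset_normalizer, which is assumed to be installed (the sketch falls back to a fixed encoding if it is not).

```python
import tempfile
from pathlib import Path

try:
    # Third-party dependency: pip install charset-normalizer
    from charset_normalizer import from_bytes
except ImportError:
    # Keep the sketch runnable without the dependency.
    from_bytes = None


def read_text_detecting_encoding(path, fallback="cp932"):
    """Hypothetical helper: decode a text file without assuming UTF-8."""
    raw = Path(path).read_bytes()
    try:
        # Fast path: most files really are UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    if from_bytes is not None:
        match = from_bytes(raw).best()  # None when nothing plausible is found
        if match is not None:
            return str(match)  # str() of a match is the decoded text
    # Assumption: the last-resort encoding is chosen by the caller.
    return raw.decode(fallback, errors="replace")


with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "test.csv"
    # Hypothetical sample data, encoded the way Japanese Windows would write it.
    csv_path.write_bytes("名前,値\n東京,1\n".encode("cp932"))

    try:
        csv_path.read_text(encoding="utf-8")
    except UnicodeDecodeError as exc:
        print("utf-8 failed:", exc.reason)

    print(read_text_detecting_encoding(csv_path))
```

Detection on very short inputs is a guess, so the decoded text may not always match the original; a caller that knows the encoding should pass it explicitly instead.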

chardet doesn't work for such files: it detects them as Shift JIS, whereas Windows uses an extended version of Shift JIS (CP932).
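
To show why the distinction matters, here is a small sketch using only Python's built-in codecs (it does not exercise chardet itself). The sample character "①" is my own choice: it is one of the characters CP932 adds on top of Shift JIS, so bytes that are valid CP932 can be rejected by the plain `shift_jis` codec. `can_decode` is a hypothetical helper.

```python
# "①" (U+2460) exists in CP932 but not in plain Shift JIS (JIS X 0208).
raw = "①".encode("cp932")


def can_decode(data, encoding):
    """Return True if data decodes cleanly with the given codec."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


print("cp932:", can_decode(raw, "cp932"))
print("shift_jis:", can_decode(raw, "shift_jis"))
```

So even a correct "Shift JIS" guess can still fail at decode time if the file uses CP932-only characters.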

@gagb added the bug, good first issue, and help wanted labels Dec 14, 2024
@gagb added the open for contribution label and removed the help wanted, good first issue, and open for contribution labels Dec 14, 2024
@gagb closed this as completed in #19 Dec 16, 2024