[BUG] JSON not supported? #34

Closed
adhishthite opened this issue Dec 15, 2024 · 6 comments · Fixed by #261
Labels
question: Further information is requested

Comments

@adhishthite

According to the README, JSON is a supported extension. However, when I passed a sample JSON file to it, I got the following error:

markitdown._markitdown.UnsupportedFormatException: Could not convert '/tmp/SAMPLE-FILE.json' to Markdown.
The formats ['.json', '.json'] are not supported.

Please look into this, thanks!

@gagb added the "question" (Further information is requested) label Dec 15, 2024
@olin2011

olin2011 commented Dec 16, 2024

markitdown._markitdown.UnsupportedFormatException: Could not convert 'qa-1213.csv' to Markdown. The formats ['.csv'] are not supported.

I tried CSV and JSON, and both had the same problem.

@realrajaryan
Member

After taking a look, I believe there was an assumption that the PlainTextConverter class would handle the file extensions listed under “Various other text-based formats” in the README.

However, the PlainTextConverter class only handles files that mimetypes identifies as text/*, whereas JSON, for example, has a MIME type of application/json.
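To illustrate the mismatch, here is a minimal sketch using Python's standard `mimetypes` module. The helper `looks_like_plain_text` is hypothetical and only mimics a `text/*` check; it is not the project's actual code.

```python
import mimetypes
from typing import Optional


def looks_like_plain_text(content_type: Optional[str]) -> bool:
    """Hypothetical check: accept only MIME types in the text/* family."""
    return content_type is not None and content_type.startswith("text/")


# guess_type returns a (type, encoding) tuple based on the file extension.
# The result can depend on the Python version and the system's MIME tables.
for name in ("notes.txt", "data.csv", "sample.json"):
    guessed, _ = mimetypes.guess_type(name)
    print(name, guessed, looks_like_plain_text(guessed))
```

On a typical installation `sample.json` is reported as `application/json`, so a `text/*` check rejects it even though the content is plain text.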

Adding converters for these extensions separately would be the best way to go about fixing this.
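As a rough idea of what a separate converter could look like, here is a small sketch that dispatches on the file extension instead of the MIME type. The names `json_to_markdown`, `CONVERTERS`, and `convert` are made up for illustration and are not part of the project's API.

```python
import json

FENCE = "`" * 3  # built this way to avoid a literal fence inside this example


def json_to_markdown(text: str) -> str:
    """Hypothetical converter: validate JSON and wrap it in a fenced block."""
    data = json.loads(text)  # raises json.JSONDecodeError on invalid input
    pretty = json.dumps(data, indent=2, ensure_ascii=False)
    return f"{FENCE}json\n{pretty}\n{FENCE}\n"


# Hypothetical dispatch table keyed by file extension rather than MIME type.
CONVERTERS = {".json": json_to_markdown}


def convert(filename: str, text: str) -> str:
    """Pick a converter by extension; fail loudly if none is registered."""
    for ext, converter in CONVERTERS.items():
        if filename.lower().endswith(ext):
            return converter(text)
    raise ValueError(f"No converter registered for {filename!r}")
```

Keying on the extension sidesteps the `application/json` versus `text/*` mismatch, at the cost of maintaining the table by hand.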

@Gad

Gad commented Dec 18, 2024

For the purpose of "indexing, text analysis", would a converter that strips JSON text of its nested syntax delimiters while keeping some indentation be OK (thus losing, e.g., the distinction between key/value pairs and arrays)? Or would reformatting it into, e.g., tables be more useful for this project?
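To make the first option concrete, here is a minimal sketch of what I mean by stripping the delimiters while keeping indentation. The function name `to_indented_lines` is just for illustration, and this is only one possible reading of the idea.

```python
import json
from typing import Any, List


def to_indented_lines(value: Any, depth: int = 0) -> List[str]:
    """Flatten parsed JSON into indented text lines without braces or quotes."""
    pad = "  " * depth
    if isinstance(value, dict):
        lines: List[str] = []
        for key, child in value.items():
            if isinstance(child, (dict, list)):
                # Nested container: emit the key, then indent its contents.
                lines.append(f"{pad}{key}")
                lines.extend(to_indented_lines(child, depth + 1))
            else:
                lines.append(f"{pad}{key} {child}")
        return lines
    if isinstance(value, list):
        lines = []
        for item in value:
            # Array items sit at the current depth, so arrays and objects
            # become indistinguishable in the output.
            lines.extend(to_indented_lines(item, depth))
        return lines
    return [f"{pad}{value}"]


sample = json.loads('{"name": "demo", "tags": ["a", "b"], "meta": {"size": 3}}')
print("\n".join(to_indented_lines(sample)))
```

The output keeps the words and the nesting depth but no longer shows whether `tags` was an array or an object, which is the trade-off in question.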

@Evgeny105

I could be wrong, but in my case the problem with text formats was solved once I also installed the "mime-support" package in the Docker container.

@Gad mentioned this issue Dec 26, 2024
@afourney
Member

afourney commented Jan 4, 2025

> After taking a look, I believe there was an assumption that the PlainTextConverter class would handle the file extensions listed under “Various other text-based formats” in the README.
>
> However, the PlainTextConverter class only handles files that mimetypes identifies as text/*, whereas JSON, for example, has a MIME type of application/json.
>
> Adding converters for these extensions separately would be the best way to go about fixing this.

You are correct. This was my assumption, and it was wrong. I'll address this ASAP.

@afourney
Member

afourney commented Jan 4, 2025

Addressed in #261. Other PRs (#219 and #251) that do something more sophisticated are still under review.

7 participants