"invalid UTF-8 character index" error in writetable #813
Comments
Not sure offhand. Any chance you can shrink the error to a specific entry in your DataFrame that fails? Probably we need to do UTF-8 translation in …

The file is partly sensitive, but I have located the first faulty line. Can I send it to you offline?

Ok, I trimmed down and sent example files to show the problem to you privately, John.

Thanks.
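For anyone hitting the same error with a file that can't be shared, here is a minimal sketch (not from this thread) of how the first faulty line can be located locally. It assumes a recent Julia, where a String may carry invalid bytes and `isvalid(String, s)` reports whether they are valid UTF-8:

```julia
# Report the number of the first line whose bytes are not valid UTF-8,
# so a sensitive file never has to leave the machine.
function first_invalid_utf8_line(path::AbstractString)
    for (lineno, line) in enumerate(eachline(path))
        isvalid(String, line) || return lineno
    end
    return nothing
end

first_invalid_utf8_line("input.csv")  # -> 2 for the test file built below
```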
Here's a small test case that we can use to check out performance on this:

    using DataFrames

    bytes = [
        0x43, 0x6f, 0x6c, 0x75, 0x6d, 0x6e, 0x31, 0x0a,
        0x4e, 0x5f, 0xd7, 0xd3, 0xbd, 0xd3, 0xbf, 0xda, 0x0a,
    ]

    io = open("input.csv", "w")
    for b in bytes
        write(io, b)
    end
    close(io)

    df = readtable("input.csv")
    df[1, 1]

Ideally, we'd raise an error during parsing that the input is invalid UTF-8.
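The bytes above spell a "Column1" header followed by a row that starts with "N_" and ends with six bytes (0xd7 0xd3 0xbd 0xd3 0xbf 0xda) that do not form valid UTF-8. As a rough illustration of the check being proposed, not the actual DataFrames parser (`parse_field` is a hypothetical name; `isvalid(String, bytes)` is the validity check in recent Julia):

```julia
# Sketch of the proposed behaviour: refuse to build a string from field
# bytes that are not valid UTF-8, instead of storing a corrupt value.
function parse_field(bytes::Vector{UInt8})
    isvalid(String, bytes) || error("invalid UTF-8 in input field: $(repr(bytes))")
    return String(bytes)
end

parse_field([0x4e, 0x5f])                                       # "N_"
parse_field([0x4e, 0x5f, 0xd7, 0xd3, 0xbd, 0xd3, 0xbf, 0xda])   # throws
```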
Maybe @ScottPJones's recent PRs would fix this by checking the …

@nalimilan I've just investigated this... unfortunately, it has led me to yet another place that needs to be overhauled... The DataFrames code used …

In this case I was running 0.3.8, but I also tried 0.3.9 and had the exact same exception/error.

OK, my fixes have only gone into 0.4, not the released versions.
The file is a UTF-8 file that has been corrupted, yes. The problem is that it is read in as a UTF-8 file and I can work on the resulting DataFrame, but when writing it again the error surfaces. Because the file reads in without error, I lost quite some time filtering and working with it before I realized, upon writing, that something must be wrong. Whether one considers this a DataFrames bug or not depends on your viewpoint and on what is considered part of the specification, I guess. If the design decision is that the user is responsible for making sure that files read in with readtable are valid UTF-8, that should be documented imho. Thanks for your efforts, and sorry if I've been unclear.
@robertfeldt That's fine, we were just curious about where the data comes from. I would advocate for raising an error from …
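Until an error like that is raised at read time, one workaround in the spirit of this thread is to validate string cells immediately after reading, so corruption is caught before any filtering work is invested. This is only a sketch, not part of DataFrames, and assumes a recent Julia/DataFrames where `eachcol` iterates columns and `isvalid(String, x)` checks UTF-8 validity:

```julia
using DataFrames

# Hypothetical helper: fail fast if any string cell is not valid UTF-8,
# naming the offending row and column.
function assert_valid_utf8(df::DataFrame)
    for (j, col) in enumerate(eachcol(df)), (i, x) in enumerate(col)
        if x isa AbstractString && !isvalid(String, x)
            error("row $i, column $(names(df)[j]) contains invalid UTF-8")
        end
    end
    return df
end
```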
Well, what I think really should happen, and hopefully I'll be allowed to do this (maybe in 0.5), is for Julia to have consistent conversion and validity-checking methods (it's totally inconsistent now, and outright buggy... I've got about 20 PRs already merged into 0.4 fixing problems, and another 4-5 major ones that have been under review for 2 months now...).

@nalimilan Your comment came in right while I was responding... that's precisely what I want to achieve, but as a general facility, for all conversions, not just DataFrames.