
"invalid UTF-8 character index" error in writetable #813

Closed
robertfeldt opened this issue Jun 14, 2015 · 14 comments

@robertfeldt

I read in a CSV file that contains UTF-8 characters and filter it, but when I try to write it back to disk with writetable I get:

invalid UTF-8 character index
 in next at /Applications/Julia-0.3.8.app/Contents/Resources/julia/lib/julia/sys.dylib
 in need_full_hex at /Applications/Julia-0.3.8.app/Contents/Resources/julia/lib/julia/sys.dylib
 in print_escaped at string.jl:868
 in escapedprint at /Users/feldt/.julia/v0.3/DataFrames/src/abstractdataframe/io.jl:12
 in printtable at /Users/feldt/.julia/v0.3/DataFrames/src/abstractdataframe/io.jl:41
 in anonymous at /Users/feldt/.julia/v0.3/DataFrames/src/abstractdataframe/io.jl:108
 in open at iostream.jl:137
 in writetable at /Users/feldt/.julia/v0.3/DataFrames/src/abstractdataframe/io.jl:107

Not sure this is a DataFrames problem. Might it be print_escaped? Any advice or hints?

@johnmyleswhite
Contributor

Not sure offhand. Any chance you can shrink the error down to a specific entry in your DataFrame that fails? We probably need to do UTF-8 conversion in readtable to make things safe to use later on.

@robertfeldt
Author

The file is partly sensitive, but I have located the first faulty line. Can I send it to you offline?

@robertfeldt
Author

Ok, I trimmed down and sent example files to show the problem to you privately, John.

@johnmyleswhite
Contributor

Thanks.

@johnmyleswhite
Contributor

Here's a small test case that we can use to check our behavior on this:

using DataFrames

bytes = [
    0x43, 0x6f, 0x6c, 0x75, 0x6d, 0x6e, 0x31, 0x0a,  # "Column1\n"
    0x4e, 0x5f,                                      # "N_"
    0xd7, 0xd3, 0xbd, 0xd3, 0xbf, 0xda,              # not valid UTF-8
    0x0a,                                            # "\n"
]

open("input.csv", "w") do io
    write(io, bytes)
end

df = readtable("input.csv")
df[1, 1]

Ideally, we'd raise an error during parsing that the input is invalid UTF-8.
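As a sanity check (illustrative only, and in Python rather than Julia, since the invalidity is a property of the bytes themselves), strict decoding of this byte sequence fails at the first stray lead byte:

```python
# The byte sequence from the test case above.
data = bytes([0x43, 0x6f, 0x6c, 0x75, 0x6d, 0x6e, 0x31, 0x0a,
              0x4e, 0x5f, 0xd7, 0xd3, 0xbd, 0xd3, 0xbf, 0xda, 0x0a])

# Strict decoding fails at index 10: 0xd7 starts a two-byte UTF-8
# sequence, but the next byte 0xd3 is another lead byte, not a
# continuation byte (0x80-0xbf).
try:
    data.decode("utf-8")
    print("valid UTF-8")
except UnicodeDecodeError as e:
    print("invalid UTF-8:", e.reason, "at byte", e.start)
```

This is the check that a stricter readtable would perform up front, instead of letting the bad bytes surface later in writetable.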

@nalimilan
Member

Maybe @ScottPJones's recent PRs would fix this by checking the UTF8String on construction?

@ScottPJones

@nalimilan I've just investigated this... unfortunately, it has led me to yet another place that needs to be overhauled. The DataFrames code uses bytestring() (which has many inconsistent methods), and the particular method used ends up calling jl_pchar_to_string(), which calls jl_array_to_string, which calls u8_isvalid, which I fixed in JuliaLang/julia#11203.
The particular sequence above is not valid UTF-8 and should have been caught on input (with my change).
@robertfeldt What version of Julia are you using? @johnmyleswhite Which version did you test under?

@robertfeldt
Author

In this case I was running 0.3.8 but I also tried 0.3.9 and had the exact same exception/error.

@ScottPJones

OK, my fixes are only in 0.4, not in the released versions.
I can't figure out what character set that might be from... is this exactly the sequence your real file dies on: 0xd7, 0xd3, 0xbd, 0xd3, 0xbf, 0xda?
Also, are there other valid UTF-8 sequences before that in the file, or are all the preceding characters ASCII? Either this file is corrupted, or it isn't UTF-8 at all.
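One guess, hedged and not verified against the original file: those six bytes all form legal double-byte GBK codes (lead 0x81-0xfe, trail 0x40-0xfe), so the file may simply be in a legacy Chinese encoding rather than being corrupted UTF-8. A quick Python check, purely illustrative:

```python
data = bytes([0x43, 0x6f, 0x6c, 0x75, 0x6d, 0x6e, 0x31, 0x0a,
              0x4e, 0x5f, 0xd7, 0xd3, 0xbd, 0xd3, 0xbf, 0xda, 0x0a])

# 0xd7 0xd3 / 0xbd 0xd3 / 0xbf 0xda are all well-formed GBK pairs,
# so the whole file decodes cleanly as GBK: the ASCII header plus
# three CJK characters.
text = data.decode("gbk")
print(repr(text))
```

If that matches the provenance of the data, the fix on the user side would be transcoding the file to UTF-8 before handing it to readtable.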

@robertfeldt
Author

The file is a UTF-8 file that has been corrupted, yes. The problem is that it is read in as a UTF-8 file, and I can work on the resulting DataFrame, but the error only surfaces when writing it out again. Since the file was read in without complaint, I lost quite some time filtering and working with it before realizing, upon writing, that something must be wrong. Whether one considers this a DataFrames bug or not depends on your viewpoint and on what is considered part of the specification, I guess. If the design decision is that the user is responsible for ensuring that files read with readtable are valid UTF-8, that should be documented, IMHO. Thanks for your efforts, and sorry if I've been unclear.

@nalimilan
Member

@robertfeldt That's fine, we were just curious about where the data comes from. I would advocate for raising an error from readtable by default, with an option to replace invalid sequences.

@ScottPJones

Well, what I think really should happen, and hopefully I'll be allowed to do this (maybe in 0.5), is for Julia to have consistent conversion and validity-checking methods. It's totally inconsistent now, and outright buggy: I've got about 20 PRs already merged into 0.4 fixing problems, and another 4-5 major ones that have been under review for 2 months now.
When dealing with corrupted data, you should have the choice of either getting an error or reading the data in with the corrupted bytes replaced by a default or user-supplied (and validated!) replacement character(s).
You should also be able to read in commonly used UTF variants such as "Modified UTF-8" (used by Java and many others) and "CESU-8" (used by Oracle, MySQL, and many others) [my PRs handle that].
Comments in support of my fixes, in the lengthy debates about whether or not to merge my PRs (JuliaLang/julia#11575 JuliaLang/julia#11551 JuliaLang/julia#11607 JuliaLang/julia#11624), would be greatly appreciated!
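The error-or-replace choice described above is the same policy most languages expose on decode; a Python sketch of the two modes, illustrative only and not the proposed Julia API:

```python
# The stray portion of the test-case bytes.
bad = b"N_\xd7\xd3\xbd\xd3\xbf\xda"

# Strict mode: fail fast, so corruption is caught at read time
# rather than hours of downstream work later.
try:
    bad.decode("utf-8", errors="strict")
except UnicodeDecodeError:
    print("rejected at input time")

# Replacement mode: keep going, substituting U+FFFD for each
# undecodable byte so the damage stays visible in the output.
print(bad.decode("utf-8", errors="replace"))
```

Note that replacement is lossy and not round-trippable, which is why it should be opt-in rather than the default.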

@ScottPJones

@nalimilan Your comment came in right while I was responding... that's precisely what I want to achieve, but as a general facility, for all conversions, not just DataFrames.

@quinnj
Member

quinnj commented Sep 7, 2017

readtable/writetable are now deprecated in favor of CSV.jl/TextParse.jl, which shouldn't have any issues here.

@quinnj quinnj closed this as completed Sep 7, 2017