
"invalid UTF-8 character index" error in writetable #813

Closed
robertfeldt opened this issue Jun 14, 2015 · 14 comments

@robertfeldt

I read in a CSV file that contains UTF-8 characters and filter it, but when I try to write it back to disk with writetable I get:

invalid UTF-8 character index
 in next at /Applications/Julia-0.3.8.app/Contents/Resources/julia/lib/julia/sys.dylib
 in need_full_hex at /Applications/Julia-0.3.8.app/Contents/Resources/julia/lib/julia/sys.dylib
 in print_escaped at string.jl:868
 in escapedprint at /Users/feldt/.julia/v0.3/DataFrames/src/abstractdataframe/io.jl:12
 in printtable at /Users/feldt/.julia/v0.3/DataFrames/src/abstractdataframe/io.jl:41
 in anonymous at /Users/feldt/.julia/v0.3/DataFrames/src/abstractdataframe/io.jl:108
 in open at iostream.jl:137
 in writetable at /Users/feldt/.julia/v0.3/DataFrames/src/abstractdataframe/io.jl:107

Not sure this is a DataFrames problem. Might it be print_escaped? Any advice or hints?

@johnmyleswhite
Contributor

Not sure offhand. Any chance you can shrink the error down to a specific entry in your DataFrame that fails? We probably need to do UTF-8 conversion in readtable to make things safe to use later on.

@robertfeldt
Author

The file is partly sensitive, but I have located the first faulty line. Can I send it to you offline?

@robertfeldt
Author

Ok, I trimmed down and sent example files to show the problem to you privately, John.

@johnmyleswhite
Contributor

Thanks.

@johnmyleswhite
Contributor

Here's a small test case that we can use to check our behavior on this:

using DataFrames

bytes = [
    0x43, 0x6f, 0x6c, 0x75, 0x6d, 0x6e, 0x31, 0x0a,  # "Column1\n"
    0x4e, 0x5f,                                      # "N_"
    0xd7, 0xd3, 0xbd, 0xd3, 0xbf, 0xda,              # not valid UTF-8
    0x0a,                                            # "\n"
]

open("input.csv", "w") do io
    write(io, bytes)
end

df = readtable("input.csv")
df[1, 1]

Ideally, we'd raise an error during parsing that the input is invalid UTF-8.
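As a sanity check (illustrative only, and in Python rather than Julia, since the invalidity is a property of the bytes themselves), strict decoding of this byte sequence fails at the first stray lead byte:

```python
# The byte sequence from the test case above.
data = bytes([0x43, 0x6f, 0x6c, 0x75, 0x6d, 0x6e, 0x31, 0x0a,
              0x4e, 0x5f, 0xd7, 0xd3, 0xbd, 0xd3, 0xbf, 0xda, 0x0a])

# Strict decoding fails at index 10: 0xd7 starts a two-byte UTF-8
# sequence, but the next byte 0xd3 is another lead byte, not a
# continuation byte (0x80-0xbf).
try:
    data.decode("utf-8")
    print("valid UTF-8")
except UnicodeDecodeError as e:
    print("invalid UTF-8:", e.reason, "at byte", e.start)
```

This is the check that a stricter readtable would perform up front, instead of letting the bad bytes surface later in writetable.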

@nalimilan
Member

Maybe @ScottPJones's recent PRs would fix this by checking the UTF8String on construction?

@ScottPJones

@nalimilan I've just investigated this... unfortunately, it has led me to yet another place that needs to be overhauled. The DataFrames code uses bytestring() (which has many inconsistent methods), and the particular method used ends up calling jl_pchar_to_string(), which calls jl_array_to_string, which calls u8_isvalid, which I fixed in JuliaLang/julia#11203.
The particular sequence above is not valid UTF-8 and should have been caught on input (with my change).
@robertfeldt What version of Julia are you using? @johnmyleswhite Which version did you test under?

@robertfeldt
Author

In this case I was running 0.3.8 but I also tried 0.3.9 and had the exact same exception/error.

@ScottPJones

OK, my fixes are only in 0.4, not in the released versions.
I can't figure out what character set that might be from... is this exactly the sequence your real file dies on: 0xd7, 0xd3, 0xbd, 0xd3, 0xbf, 0xda?
Also, are there other valid UTF-8 sequences before that in the file, or are all the preceding characters ASCII? Either this file is corrupted, or it isn't UTF-8 at all.
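One guess, hedged and not verified against the original file: those six bytes all form legal double-byte GBK codes (lead 0x81-0xfe, trail 0x40-0xfe), so the file may simply be in a legacy Chinese encoding rather than being corrupted UTF-8. A quick Python check, purely illustrative:

```python
data = bytes([0x43, 0x6f, 0x6c, 0x75, 0x6d, 0x6e, 0x31, 0x0a,
              0x4e, 0x5f, 0xd7, 0xd3, 0xbd, 0xd3, 0xbf, 0xda, 0x0a])

# 0xd7 0xd3 / 0xbd 0xd3 / 0xbf 0xda are all well-formed GBK pairs,
# so the whole file decodes cleanly as GBK: the ASCII header plus
# three CJK characters.
text = data.decode("gbk")
print(repr(text))
```

If that matches the provenance of the data, the fix on the user side would be transcoding the file to UTF-8 before handing it to readtable.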

@robertfeldt
Author

The file is a UTF-8 file that has been corrupted, yes. The problem is that it is read in as a UTF-8 file, and I can work on the resulting DataFrame, but the error only surfaces when writing it out again. Since the file was read in without complaint, I lost quite some time filtering and working with it before realizing, upon writing, that something must be wrong. Whether one considers this a DataFrames bug or not depends on your viewpoint and on what is considered part of the specification, I guess. If the design decision is that the user is responsible for ensuring that files read with readtable are valid UTF-8, that should be documented, IMHO. Thanks for your efforts, and sorry if I've been unclear.

@nalimilan
Member

@robertfeldt That's fine, we were just curious about where the data comes from. I would advocate for raising an error from readtable by default, with an option to replace invalid sequences.

@ScottPJones

Well, what I think really should happen, and hopefully I'll be allowed to do this (maybe in 0.5), is for Julia to have consistent conversion and validity-checking methods. It's totally inconsistent now, and outright buggy: I've got about 20 PRs already merged into 0.4 fixing problems, and another 4-5 major ones that have been under review for 2 months now.
When dealing with corrupted data, you should have the choice of either getting an error or reading the data in with the corrupted bytes replaced by a default or user-supplied (and validated!) replacement character(s).
You should also be able to read in commonly used UTF variants such as "Modified UTF-8" (used by Java and many others) and "CESU-8" (used by Oracle, MySQL, and many others) [my PRs handle that].
Comments in support of my fixes, in the lengthy debates about whether or not to merge my PRs (JuliaLang/julia#11575 JuliaLang/julia#11551 JuliaLang/julia#11607 JuliaLang/julia#11624), would be greatly appreciated!
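The error-or-replace choice described above is the same policy most languages expose on decode; a Python sketch of the two modes, illustrative only and not the proposed Julia API:

```python
# The stray portion of the test-case bytes.
bad = b"N_\xd7\xd3\xbd\xd3\xbf\xda"

# Strict mode: fail fast, so corruption is caught at read time
# rather than hours of downstream work later.
try:
    bad.decode("utf-8", errors="strict")
except UnicodeDecodeError:
    print("rejected at input time")

# Replacement mode: keep going, substituting U+FFFD for each
# undecodable byte so the damage stays visible in the output.
print(bad.decode("utf-8", errors="replace"))
```

Note that replacement is lossy and not round-trippable, which is why it should be opt-in rather than the default.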

@ScottPJones

@nalimilan Your comment came in right while I was responding... that's precisely what I want to achieve, but as a general facility, for all conversions, not just DataFrames.

@quinnj
Member

quinnj commented Sep 7, 2017

readtable/writetable are now deprecated in favor of CSV.jl/TextParse.jl, which shouldn't have any issues here.

@quinnj quinnj closed this as completed Sep 7, 2017