Option to ignore encoding errors #763
Comments
Hmm, not quite sure how we'd implement this. Currently, encoding errors are caught in …
I second this issue. E.g., metmuseum/openaccess#11 is a 224 MB MetObjects.csv from the Met Museum, and to extract some statistics I'd like to skip rows with badly encoded characters. The exception there is …
The world is rife with huge CSV files that contain the occasional "bad" UTF-8 character. The Unicode REPLACEMENT CHARACTER (U+FFFD) exists to handle exactly these situations. While I won't argue the virtue of ignoring all conceivable errors, it seems reasonable to expect at least csvclean, and possibly other csvkit tools, to (optionally) handle invalid UTF-8 characters with replacement. In the meantime, one can work around at least the UTF-8 exceptions by eliding the invalid bytes with iconv (where it exists): …
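A minimal sketch of that replacement behaviour in Python, assuming a UTF-8 file with a few invalid bytes; the function name is illustrative and this is not csvkit's actual implementation:

```python
# Decode a CSV with U+FFFD substitution instead of raising on bad bytes.
import csv


def rows_with_replacement(path):
    # errors="replace" swaps each undecodable byte for U+FFFD instead of
    # raising UnicodeDecodeError, so the whole file can still be parsed.
    with open(path, encoding="utf-8", errors="replace", newline="") as f:
        yield from csv.reader(f)
```

Every undecodable byte comes through as U+FFFD, so the downstream row handling never sees a UnicodeDecodeError.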
I concur. The codec error has caught me a few times, and then it is quite a hassle to create a cleaned version of the CSV file before using …
Agreed with all of the above. This has cost me mountains of time trying to sanitize files; an option like this would be a big help. If I had an ounce of practical skill, I'd make a pull request myself.
I know this is an old thread, but here goes anyway. I also run into this problem all the time with a variety of data files: "special" characters, for example hex 96 bd be e0 e9 91 and many others. It seems the OP was seeing even worse problems, but I think the same principle I outline here may apply.

I want the file to be JUST ASCII, so I use a preprocessing step, in my case a short section of C code reading a UNIX input data file. For each character read, if it is a \n newline, I increment the line count and continue; if it is a tab character and the file is not a TSV file, or if …

I run this as a preprocessing step, but you could also convert the input file to a cleaned version and then keep using the cleaned version. Anyway, the above works well for me. My experience is that it does not take all that long to run; the bottleneck is almost entirely the input and output.

Luckily for me, the "special" characters are ALWAYS in a field that I do not use. For the files I see, there are large text fields with various descriptions that contain the "special" characters, and I never use those fields, so I don't care about wiping them out. I just want the file to import and the process not to crash, as others have said.

Daniel
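A rough Python sketch of that kind of preprocessing filter, assuming the goal is simply to blank out every byte outside printable ASCII before csvkit ever sees the file (the commenter's version is C, and the function and file names here are only illustrative):

```python
# Bytes we keep unchanged: printable ASCII plus tab and newline.
# (A \r byte would be blanked too; add 0x0D here for CRLF files.)
KEEP = set(range(0x20, 0x7F)) | {0x09, 0x0A}
TABLE = bytes(b if b in KEEP else 0x20 for b in range(256))


def ascii_clean(in_path, out_path, chunk_size=1 << 20):
    """Stream in_path to out_path, replacing non-ASCII bytes with spaces;
    returns the number of lines seen."""
    line_count = 0
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            line_count += chunk.count(b"\n")
            dst.write(chunk.translate(TABLE))
    return line_count
```

Running something like `ascii_clean("raw.csv", "cleaned.csv")` keeps the row structure intact while guaranteeing that the cleaned file decodes without errors, since the processing is purely byte-for-byte and I/O-bound.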
I use the tools from 'csvkit' on very, very large files, piped in from stdout - oftentimes in the tens of gigabytes and hundreds of millions of lines - and these processes can take a very long while, sometimes in excess of 12 hours, to run. A major thorn in my side is when I check on a run's progress and find that it bailed out after processing for just a few minutes because of some Python exception that wasn't caught. The older version of csvkit that I had was full of these problems, and the latest version remedied many, but I encountered an error involving a bad UTF-8 character that caused the program to bail out. The error from csvgrep was as follows:
Your file is not "utf-8" encoded. Please specify the correct encoding with the -e flag. Use the -v flag to see the complete error.
I found the line and ran it against iconv / uni2ascii / recode etc., and it's unanimous - there was some bad byte pattern present in the input file for whatever reason. Using -e to specify different encodings (e.g. ascii, utf-8, etc.) did not work. Ultimately, because it would take too long to run iconv or recode over the file, and uni2ascii was bailing out, I just piped the file through the "strings" utility before passing it into csvgrep as ASCII.
So, in order to prevent these types of errors from causing the program to unequivocally exit (crash, in my opinion!), it would be nice to have an option common to all csvkit tools that forces such errors to be ignored, with the content of the offending record(s), the line number(s), and the reason for the exception just written to stderr or to a log file. The bad line could then be left out of the output and, if needed, fixed manually before re-running the csvkit tools.
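A sketch of that skip-and-log behaviour in Python, assuming a UTF-8 CSV read line by line; the function name and log format are illustrative, not csvkit's actual interface:

```python
import csv
import sys


def rows_skipping_bad_lines(path, log=sys.stderr):
    """Yield parsed rows, skipping and logging lines that fail to decode."""
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                line = raw.decode("utf-8")
            except UnicodeDecodeError as exc:
                # Report the line number, the reason, and the raw bytes,
                # then drop the record instead of aborting the whole run.
                print(f"line {lineno}: {exc}: {raw!r}", file=log)
                continue
            yield from csv.reader([line])
```

A real implementation would also need to cope with quoted fields that span physical lines, but the shape of the idea - decode, log the failure, drop the record, keep going - is the same.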
This would make it much, much more friendly for running against large file sets. Again, when it takes, for example, 12 hours to pipe just one of my data sets through csvgrep, it absolutely crushes me to see an error that stopped it cold in its tracks just 45 minutes in, and then to have to grep the original file for the line it crashed on, do the subtraction to get the number of remaining lines, tail that many lines from the source file into another file, try to figure out the problem line, and re-run csvkit only to AGAIN find that a SINGLE BAD BYTE crashed the dang thing.
I hope you understand my frustration, and why an option to forcefully and explicitly continue in the face of errors, ignore the record(s) in error, and just output them to stderr and/or a log would be helpful.
Thank you!