Problem with presence of unicode #4
Comments
Can you provide an example or a few lines from a CSV that is not being processed correctly? Did you get a stack trace of any kind? If so, can you also post this?
Suppose my CSV file contains a column like this: This is the error I am getting: UnicodeDecodeError: 'utf8' codec can't decode byte 0xae in position 8: invalid s
Hi, I've experienced the same problem. Currently, I'm working around it with a hack because I can afford to lose the unicode entries. The bug seems to lie within the simplejson lib itself. You should get a stack trace if you run csv2es on data which contains unicode characters.
The problem seems to be linked to how pyelasticsearch encodes the JSON. If the function crashes while trying to encode a row as UTF-8, the bulk process stops. If the problem persists when importing, we can try other encoding options such as 'latin1'. A workaround is to modify the libraries on the fly (although this works for us, there is a high risk here, as we are mixing installations) as follows:
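A minimal sketch of the fallback idea described above, kept outside the libraries rather than patching them in place. The function name and structure here are illustrative, not part of the csv2es or pyelasticsearch API: decode each raw row as UTF-8 first, and fall back to Latin-1 only when the bytes are not valid UTF-8.

```python
# Hypothetical helper: decode a raw CSV row, trying UTF-8 first and
# falling back to Latin-1, so a single bad row does not abort the bulk
# import. Not part of csv2es; shown only to illustrate the workaround.

def decode_row(raw_bytes):
    """Return a text string, trying UTF-8 first, then Latin-1."""
    try:
        return raw_bytes.decode('utf-8')
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this never fails,
        # though multi-byte characters may be misinterpreted.
        return raw_bytes.decode('latin-1')

# 0xae is the registered-trademark sign in Latin-1 but an invalid
# start byte in UTF-8, which is exactly the reported crash.
print(decode_row(b'caf\xc3\xa9'))   # valid UTF-8
print(decode_row(b'Brand\xae'))     # falls back to Latin-1
```

Whether Latin-1 is the right fallback depends on where the data came from; it guarantees a successful decode but not a correct one.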
We have submitted a pull request to address this issue and provide more flexibility in the future.
I am having the same issue. I removed the unicode characters and then it worked ( iconv -c -f utf-8 -t ascii file.csv ).
Transforming UTF-8 to ASCII may solve the "crash" issue, but it leads to a loss of fidelity. It may also not solve the core problem if there is an "illegal" UTF-8 encoding in your incoming stream. In our cyber domain we face "bad guys/gals" who intentionally use illegal UTF-8 encodings to bypass signature detection and for "typo-squatting". For us, putting a "try:" block around the UTF-8 parsing/encoding/decoding statements, and transforming "bad" characters into their byte-code representations when there is an exception, provides a reasonable compromise. [Note: we have tried modifying core libraries as well. It is a very slippery slope: you will discover more and more codecs, parsers, etc. that you have to "tweak", which will ultimately lead to very bad outcomes for you and anyone trying to leverage your code.]
The code does not work if unicode characters are present in the CSV. Can you add that feature?