Problem with presence of unicode #4

rohtrj · 2015-09-18T08:48:22Z

The code does not work if unicode characters are present in the csv. Can you add that feature?

rholder · 2015-09-18T20:45:44Z

Can you provide an example or a few lines from a CSV that is not being processed correctly? Did you get a stacktrace of any kind? If so, can you also post this?

rohtrj · 2015-09-19T08:51:47Z

Suppose my CSV file contains a column like this:
https://cloud.githubusercontent.com/assets/14344478/9975392/b0e078dc-5ed8-11e5-9633-be084fb04531.png

This is the error I am getting:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xae in position 8: invalid start byte
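
For anyone trying to reproduce this: byte 0xae is the registered-trademark sign in Latin-1 and Windows-1252, but in UTF-8 it can only appear as a continuation byte, which is why the decoder reports an invalid start byte. Below is a minimal Python 3 sketch, assuming the file was saved in one of those single-byte encodings. The sample value is made up for illustration and is not taken from the CSV above.

```python
# Hypothetical cell value: "Acme Inc" followed by the single byte 0xae,
# as a Latin-1 / Windows-1252 editor would write the (R) sign.
raw = b"Acme Inc\xae"

try:
    raw.decode("utf-8")
    failed_at = None
except UnicodeDecodeError as exc:
    # exc.start is the offset of the offending byte within raw.
    failed_at = exc.start
    print(exc)

# Decoding with the encoding the file is assumed to be written in works.
text = raw.decode("latin-1")
print(failed_at, text)
```

If the offset reported here matches the position in the traceback, the input is probably not UTF-8 at all, and the fix may be as simple as converting the file before importing it.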

infosec-au · 2015-10-07T12:07:30Z

Hi,

I've experienced the same problem. For now I'm working around it with a hack, because I can afford to lose the unicode entries. The bug seems to lie in the simplejson library. You should get a stack trace if you run csv2es on data that contains unicode characters.

i3visio · 2016-01-19T12:06:31Z

The problem seems to be linked to how pyelasticsearch encodes the JSON. If the function crashes while encoding to UTF-8, the bulk process stops. If the problem persists when importing, we can try other encodings such as 'latin1'. As a workaround, the installed library can be patched in place as follows (this works for us, but there is A HIGH RISK HERE, AS WE ARE MIXING INSTALLATIONS):

# Downloading patched version
wget https://raw.githubusercontent.com/i3visio/pyelasticsearch/master/pyelasticsearch/client.py
# Backing up the original file:
sudo cp /usr/local/lib/python2.7/dist-packages/pyelasticsearch/client.py /usr/local/lib/python2.7/dist-packages/pyelasticsearch/client.py_old
sudo cp client.py /usr/local/lib/python2.7/dist-packages/pyelasticsearch/client.py
# A new dependency (certifi) may be required, so this may break other things. Do not use in production:
sudo pip install certifi

We have opened a pull request to address this issue and provide more flexibility in the future.
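
The 'latin1' fallback mentioned above can also be applied to the raw bytes before they reach the importer, without touching any installed library. This is only a sketch in Python 3: `decode_with_fallback` is a hypothetical helper, not part of csv2es or pyelasticsearch, and the list of encodings is an assumption about the data.

```python
def decode_with_fallback(raw, encodings=("utf-8", "latin-1")):
    """Return (text, encoding) using the first encoding that decodes raw."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            # Try the next candidate encoding.
            continue
    raise ValueError("none of the encodings could decode the input")


text, used = decode_with_fallback(b"Acme Inc\xae")
print(used, text)
```

Note that latin-1 maps every byte to a character, so it never fails. It should stay last in the list, and a successful decode with it does not prove the data really was Latin-1.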

DRN88 · 2016-10-27T14:31:42Z

I am having the same issue. After stripping the non-ASCII characters from the file, it worked: iconv -c -f utf-8 -t ascii file.csv
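
A rough Python 3 counterpart of that iconv call, for cases where the cleanup has to happen inside a script. `strip_non_ascii` is a hypothetical name, and like the iconv approach it is lossy: every non-ASCII character is silently dropped.

```python
def strip_non_ascii(raw):
    """Drop every character that is not plain ASCII (lossy)."""
    # Invalid UTF-8 bytes are skipped while decoding ...
    text = raw.decode("utf-8", errors="ignore")
    # ... and valid non-ASCII characters are skipped while encoding.
    return text.encode("ascii", errors="ignore")


print(strip_non_ascii("caf\u00e9,1\n".encode("utf-8")))
```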

packet-rat · 2017-02-06T14:05:09Z

Transforming utf-8 to ascii may solve the "crash" issue, but leads to a loss of fidelity.

It may also not solve the core problem if there is an "illegal" utf-8 encoding in your incoming stream. In our cyber domain, "bad guys/gals" intentionally use illegal utf-8 encodings to bypass signature detection and for "typo-squatting". For us, wrapping the utf-8 parsing/encoding/decoding statements in a "try:" block and, when an exception is raised, transforming the "bad" characters into their byte-code representations provides a reasonable compromise.
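
A minimal sketch of that compromise in Python 3 (3.5 or later, where the `backslashreplace` error handler also works for decoding). The function name is hypothetical, and this is one possible reading of "byte-code representation", not the exact code we run.

```python
def decode_preserving_bad_bytes(raw):
    """Decode UTF-8; on failure, keep invalid bytes as \\xNN escapes."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # backslashreplace turns each undecodable byte into a literal
        # backslash escape such as \xae instead of raising or dropping it.
        return raw.decode("utf-8", errors="backslashreplace")


# An overlong (illegal) two-byte encoding of "/" survives as visible escapes.
print(decode_preserving_bad_bytes(b"\xc0\xaf"))
```

Unlike the ASCII conversion, nothing is lost: valid text passes through unchanged, and the suspicious bytes stay visible and searchable in the index.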

[Note: we have tried modifying core libraries as well. It is a very slippery slope: you will discover more and more codecs, parsers, etc. that you have to "tweak", which will ultimately lead to very bad outcomes for you and anyone trying to leverage your code.]
