Problem with presence of unicode #4

Open
rohtrj opened this issue Sep 18, 2015 · 6 comments

Comments

@rohtrj

rohtrj commented Sep 18, 2015

The code does not work if Unicode characters are present in the CSV. Can you add support for that?

@rholder
Owner

rholder commented Sep 18, 2015

Can you provide an example or a few lines from a CSV that is not being processed correctly? Did you get a stacktrace of any kind? If so, can you also post this?

@rohtrj
Author

rohtrj commented Sep 19, 2015

Suppose my CSV file contains a column like this:
https://cloud.githubusercontent.com/assets/14344478/9975392/b0e078dc-5ed8-11e5-9633-be084fb04531.png

This is the error I am getting:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xae in position 8: invalid start byte
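For readers trying to reproduce this, here is a minimal sketch of how such an error can arise. The sample bytes are hypothetical (the actual CSV contents are only visible in the screenshot), and the snippet uses Python 3 syntax even though the thread concerns Python 2.7. It assumes the file was saved in a single-byte encoding such as Latin-1, where 0xAE is the registered sign, which is one plausible cause and not a confirmed diagnosis.

```python
# Hypothetical cell value: eight ASCII bytes followed by 0xAE,
# which is the registered sign in Latin-1 but not a valid UTF-8 start byte.
raw = b"Acme Co \xae"

error = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc  # keep a reference; `exc` is cleared when the block ends
    print(exc)

# The same bytes decode cleanly if the real encoding is Latin-1.
as_latin1 = raw.decode("latin-1")
print(as_latin1)
```

If the data decodes sensibly as Latin-1 (or cp1252), the file is probably not UTF-8 in the first place, and the failure is an encoding mismatch and not a problem with Unicode as such.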

@infosec-au

Hi,

I've experienced the same problem. Currently, I'm working around it with a hack because I can afford to lose the Unicode entries. The bug seems to lie within the simplejson library. You should get a stack trace if you run csv2es on data that contains Unicode characters.

@i3visio

i3visio commented Jan 19, 2016

The problem seems to be linked to how pyelasticsearch encodes the JSON. If the function crashes while trying to encode it as UTF-8, the bulk process stops. If the problem persists when importing, we can try other encodings such as 'latin1'. As a workaround, the installed library can be patched in place as shown below. This works for us, but THERE IS A HIGH RISK HERE, AS WE ARE MIXING INSTALLATIONS:

# Download the patched version
wget https://raw.githubusercontent.com/i3visio/pyelasticsearch/master/pyelasticsearch/client.py
# Back up the original file
sudo cp /usr/local/lib/python2.7/dist-packages/pyelasticsearch/client.py /usr/local/lib/python2.7/dist-packages/pyelasticsearch/client.py_old
# Replace it with the patched file
sudo cp client.py /usr/local/lib/python2.7/dist-packages/pyelasticsearch/client.py
# A new dependency may be required, so this may break other things. Do not use in production:
sudo pip install certifi

We have submitted a pull request to address this issue and provide more flexibility in the future.
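A less invasive way to try the 'latin1' idea is to re-encode the CSV as UTF-8 before importing, so no installed library has to be modified. The sketch below is not part of csv2es or pyelasticsearch; `transcode_to_utf8` and the file names are hypothetical, and it assumes the input really is Latin-1 (Python 3 syntax).

```python
import os
import tempfile


def transcode_to_utf8(src_path, dst_path, source_encoding="latin-1"):
    """Rewrite src_path as UTF-8, assuming it is encoded as source_encoding."""
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding=source_encoding, newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


# Demo on a throwaway file containing the Latin-1 byte 0xAE.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "input_latin1.csv")
dst_path = os.path.join(workdir, "input_utf8.csv")
with open(src_path, "wb") as f:
    f.write(b"name\nAcme Co \xae\n")

transcode_to_utf8(src_path, dst_path)

with open(dst_path, "rb") as f:
    converted = f.read()
print(converted.decode("utf-8"))
```

The converted file can then be given to the importer as usual. Note that Latin-1 accepts every byte value, so a wrong guess about the source encoding will not raise an error; it will silently produce the wrong characters.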

@DRN88

DRN88 commented Oct 27, 2016

I am having the same issue. I removed the non-ASCII characters and then it worked ( iconv -c -f utf-8 -t ascii file.csv ).
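For anyone without iconv at hand, roughly the same stripping can be sketched in Python 3. `strip_non_ascii` is a hypothetical helper that only approximates `iconv -c`; it assumes the data fits in memory, and the sample row is made up.

```python
def strip_non_ascii(data: bytes) -> bytes:
    """Drop every byte sequence that is not plain ASCII (lossy)."""
    # Undecodable bytes are skipped first, then non-ASCII characters are dropped.
    text = data.decode("utf-8", errors="ignore")
    return text.encode("ascii", errors="ignore")


original = "name,city\nJosé,Zürich\n".encode("utf-8")
stripped = strip_non_ascii(original)
print(stripped)  # b'name,city\nJos,Zrich\n'
```

The output shows the cost of this approach: the accented letters are removed outright, not transliterated.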

@packet-rat

packet-rat commented Feb 6, 2017

Transforming utf-8 to ascii may solve the "crash" issue, but leads to a loss of fidelity.

It may also fail to solve the core problem if there is an "illegal" utf-8 encoding in your incoming stream. In our cyber domain we deal with "bad guys/gals" who intentionally use illegal utf-8 encodings to bypass signature detection and for "typo-squatting". For us, wrapping the utf-8 parsing/encoding/decoding statements in a "try:" block and, when an Exception is raised, transforming the "bad" characters into their byte-code representations provides a reasonable compromise.

[Note: we have tried modifying core libraries as well. It is a very slippery slope: you will discover more and more codecs, parsers, etc. that you have to "tweak", which will ultimately lead to very bad outcomes for you and anyone trying to leverage your code.]
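The "try:" block compromise described above could look something like the following sketch. This is an illustration and not the commenter's actual code: `decode_lenient` is a hypothetical name, and it relies on the `backslashreplace` error handler, which supports decoding in Python 3.5 and later.

```python
def decode_lenient(raw: bytes) -> str:
    """Decode as UTF-8; on failure, keep bad bytes as visible \\xNN escapes."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Invalid bytes become literal escape text instead of being dropped,
        # so suspicious input stays inspectable after import.
        return raw.decode("utf-8", errors="backslashreplace")


print(decode_lenient("café".encode("utf-8")))  # valid UTF-8 is returned unchanged
print(decode_lenient(b"Acme Co \xae"))         # stray Latin-1 byte is escaped
print(decode_lenient(b"path\xc0\xafetc"))      # overlong encoding is escaped
```

Unlike the ASCII conversion, valid non-ASCII text is preserved, and invalid sequences remain visible in the indexed data so they can still be searched for or flagged.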
