load_conll, unicode support #19

alexeyev · 2015-12-02T23:10:41Z

Hi, loading data in CoNLL format fails on my custom dataset, which contains non-ASCII characters. When I read the data with the encoding set to 'utf-8', I get the following error:

  File "/usr/local/lib/python2.7/dist-packages/seqlearn/datasets.py", line 65, in <genexpr>
    lines = (str.split(line) for line in  f)
TypeError: descriptor 'split' requires a 'str' object but received a 'unicode'

def _conll_sequences(f, features, labels, lengths, split):
    # Divide input into blocks of empty and non-empty lines.
    lines = (str.strip(line) for line in  f)

Everything works perfectly when I modify the last line like this:

    lines = (line.strip() for line in f)

Is there any reason why such a fix would be unwanted?
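
To illustrate the difference, here is a minimal sketch written for Python 3, where bytes and str play the roles that str and unicode play in Python 2.7. The helper name strip_lines is hypothetical and not part of seqlearn. The unbound form (bytes.strip here, str.strip in the traceback) is tied to a single type, while the method call works on whichever string type the file yields:

```python
import io


def strip_lines(f):
    # Hypothetical helper, not seqlearn code: the method is looked up on
    # each line object, so both text and byte lines are accepted.
    return [line.strip() for line in f]


# A text stream with non-ASCII content and a byte stream.
text_lines = strip_lines(io.StringIO(u"слово NOUN\n\nmot NOUN\n"))
byte_lines = strip_lines(io.BytesIO(b"word NOUN\n\nmot NOUN\n"))

# The unbound form only accepts its own type and rejects the other one,
# which is the same kind of failure as in the traceback above.
try:
    [bytes.strip(line) for line in io.StringIO(u"слово NOUN\n")]
    unbound_ok = True
except TypeError:
    unbound_ok = False
```

Blank lines still come out as empty strings, so grouping lines into blocks of empty and non-empty lines should behave the same as before.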


alexeyev · 2016-06-02T12:27:34Z

Hi, the project isn't supported anymore, is it?
