Issue:
The value returned by token.ent_iob_ is a byte string (str), not unicode.
Code:
The above issue is reproducible with the following:
import spacy
nlp = spacy.load('en')
txt = u'''Lorem Ipsum is simply dummy text of the printing and typesetting industry.'''
doc = nlp(txt)
for tok in doc[:5]:
    print(type(tok.ent_iob_))
Comments:
Pretty sure this is caused by this line in token.pyx.
Possible solutions are to change that line or import unicode_literals in that file. I'm not sure how the project handles strings internally but having all modules use unicode_literals might not be a terrible idea.
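As a rough illustration of the single-line fix, here is a minimal sketch. It assumes the property looks its value up in a tuple of bare string literals; the names `IOB_STRINGS_BARE`, `IOB_STRINGS_FIXED`, and `all_text` are hypothetical stand-ins, not the actual contents of token.pyx.

```python
import sys

# Hypothetical stand-ins for the lookup behind token.ent_iob_;
# the real code in token.pyx may look different.
IOB_STRINGS_BARE = ('I', 'O', 'B')       # byte strings on Python 2
IOB_STRINGS_FIXED = (u'I', u'O', u'B')   # unicode on Python 2 and Python 3

# The text type is `unicode` on Python 2 and `str` on Python 3.
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


def all_text(values):
    """Return True if every value is of the text (unicode) type."""
    return all(isinstance(v, text_type) for v in values)


print(all_text(IOB_STRINGS_FIXED))
```

On Python 2 the bare tuple would fail this check while the u-prefixed one passes. Importing unicode_literals at the top of the module would have the same effect on every bare literal in that file.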
Just fixing the single line would be easy though. If I want to submit a PR as small as this do I need to run a bunch of tests or can I just put u in front of each of those letters? That said, adding some kind of automated test builder to ensure that all properties and return values respect the contracts in the documentation might not be a bad idea. Alternatively, from what little I know about cython, maybe the properties could get type declarations that would be enforced by the compiler?
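The automated check suggested above could be sketched roughly as follows. This is only an assumption about how such a test might look: `FakeToken` is a hypothetical stand-in so the example runs without spaCy, and the rule "every public attribute ending in an underscore returns text" is my reading of the documented convention, not something enforced by the library.

```python
import sys

# The text type is `unicode` on Python 2 and `str` on Python 3.
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


class FakeToken(object):
    """Hypothetical stand-in for a spaCy token, for illustration only."""
    text_ = u'Lorem'
    ent_iob_ = u'O'
    ent_type_ = u''
    i = 0  # non-string attribute, ignored by the check


def non_text_string_attrs(obj):
    """Return names of public attributes ending in '_' whose values are not text."""
    bad = []
    for name in dir(obj):
        # Skip private/dunder names and attributes without the trailing underscore.
        if name.startswith('_') or not name.endswith('_'):
            continue
        if not isinstance(getattr(obj, name), text_type):
            bad.append(name)
    return bad


print(non_text_string_attrs(FakeToken()))
```

In a real test suite the same helper would be called on tokens from a parsed document, and the test would assert that the returned list is empty.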
Follow-up question: is there a page with instructions for contributing?
-- Eric
All modules should definitely have unicode_literals. Good suggestions re the testing, which currently needs to be refactored and improved. I don't know how to add a type declaration to a property in Cython, though. You can only specify return types for cdef and cpdef functions, I believe.
You can find the contribution guidelines here. Thanks again!
Spacy version: 1.3.0
System: Ubuntu 14.04