Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: token.ent_iob_ is str not unicode #672

Closed
ELind77 opened this issue Dec 8, 2016 · 3 comments
Closed

BUG: token.ent_iob_ is str not unicode #672

ELind77 opened this issue Dec 8, 2016 · 3 comments
Labels
bug Bugs and behaviour differing from documentation

Comments

@ELind77
Copy link

ELind77 commented Dec 8, 2016

Spacy version: 1.3.0
System: Ubuntu 14.04

Issue:
The value returned on token.ent_iob_ is a string, not unicode.

Code:
The above issue is reproducible with the following:

import spacy
nlp = spacy.load('en')

txt = u'''Lorem Ipsum is simply dummy text of the printing and typesetting industry.'''

doc = nlp(txt)
for tok in doc[:5]:
    print type(tok.ent_iob_)

Results in:

<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>

Comments:
Pretty sure this is caused by this line in token.pyx.
Possible solutions are to change that line or import unicode_literals in that file. I'm not sure how the project handles strings internally but having all modules use unicode_literals might not be a terrible idea.

Just fixing the single line would be easy though. If I want to submit a PR as small as this do I need to run a bunch of tests or can I just put u in front of each of those letters? That said, adding some kind of automated test builder to ensure that all properties and return values respect the contracts in the documentation might not be a bad idea. Alternatively, from what little I know about cython, maybe the properties could get type declarations that would be enforced by the compiler?

Followup question, is there a page with instructions for contributing?

-- Eric

@honnibal honnibal added the bug Bugs and behaviour differing from documentation label Dec 10, 2016
@honnibal
Copy link
Member

Thanks for the report!

All modules should definitely have unicode_literals. Good suggestions re the testing, which currently needs to be refactored and improved. I don't know how to add a type declaration to a property in Cython, though. You can only specify return types for cdef and cpdef functions, I believe.

You can find the contribution guidelines here. Thanks again!

honnibal added a commit that referenced this issue Dec 18, 2016
@honnibal
Copy link
Member

Should be fixed now.

@lock
Copy link

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Bugs and behaviour differing from documentation
Projects
None yet
Development

No branches or pull requests

2 participants