Creating an instance of Doc with a wrongly-typed word list results in wrong TypeError #9437

Closed
DTWdata opened this issue Oct 12, 2021 · 2 comments · Fixed by #9541
Labels
feat / doc (Feature: Doc, Span and Token objects), feat / ux (Feature: User experience, error messages etc.)

Comments


DTWdata commented Oct 12, 2021

How to reproduce the behaviour

Suppose we forget to cast an object to a string, for example passing potato() instead of str(potato()) in the word list:

import spacy
from spacy.tokens import Doc
nlp = spacy.load("en_core_web_sm")

class potato:
    def __init__(self):
        pass
    def __str__(self):
        return "potato"

token_texts = ["I", "like", potato(), "!"]
labels = [("O"), ("O"), ("I-FOOD"), ("O")]
whitespaces = [True, True, False, False]
doc = Doc(nlp.vocab, words=token_texts, ents=labels, spaces=whitespaces)

results in the error

doc = Doc(nlp.vocab, words=token_texts, ents=labels, spaces=whitespaces)
File "spacy\tokens\doc.pyx", line 268, in spacy.tokens.doc.Doc.__init__
TypeError: an integer is required

whereas it should result in

doc = Doc(nlp.vocab, words=token_texts, ents=labels, spaces=whitespaces)
File "spacy\tokens\doc.pyx", line 268, in spacy.tokens.doc.Doc.__init__
TypeError: a string is required

Unless I am mistaken, however, line 268 in spacy.tokens.doc.Doc.__init__ does not raise this specific error message itself. I propose adding another elif clause that checks whether the word is not a string and raises an error in that case. I don't know what the else clause is for, so I am not opening a pull request (although I could, if desired).
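Until the error message is improved, the proposed type check can be approximated on the caller's side. The following is only a minimal sketch in plain Python: validate_words is a hypothetical helper, not part of spaCy, and the Potato class stands in for the potato class from the reproduction.

```python
def validate_words(words):
    """Hypothetical helper (not a spaCy function): reject non-string words early."""
    for i, word in enumerate(words):
        if not isinstance(word, str):
            raise TypeError(
                f"words[{i}] must be a string, got {type(word).__name__}"
            )
    return list(words)


class Potato:
    """Stand-in for the object that was never cast to str."""

    def __str__(self):
        return "potato"


token_texts = ["I", "like", Potato(), "!"]

# The un-cast object is caught with a message that names the offending position.
try:
    validate_words(token_texts)
    message = None
except TypeError as err:
    message = str(err)

# Casting every item explicitly produces a list that passes the check.
fixed_texts = validate_words([str(w) for w in token_texts])
```

The validated list could then be passed as words= to Doc in place of the original token_texts.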

Your Environment

  • spaCy version: 3.1.1
  • Platform: Windows-10-10.0.19041-SP0
  • Python version: 3.8.2
  • Pipelines: de_core_news_lg (3.1.0), de_core_news_sm (3.1.0), de_dep_news_trf (3.1.0), en_core_web_sm (3.1.0)
polm added the feat / doc (Feature: Doc, Span and Token objects) and feat / ux (Feature: User experience, error messages etc.) labels on Oct 13, 2021
Contributor

polm commented Oct 26, 2021

This error is confusing, but what's happening is that when a word is not a string, it is treated as potentially an ID/hash. That value would be passed as an int argument to get_by_orth, and the type mismatch there is what raises the error.

We can look at making the error clearer. I'm not sure we actually need to support providing IDs in this context anyway, so that's something else we can check.
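The dispatch described in this comment can be illustrated with a rough pure-Python sketch. This is not the Cython code in doc.pyx: every function name here is a hypothetical stand-in, and the wording of the clearer message is an assumption made for illustration only.

```python
def get_by_string_sketch(text):
    # Stand-in for a lookup keyed by the word's string form.
    return ("string", text)


def get_by_orth_sketch(orth):
    # Stand-in for a lookup that only accepts an integer ID/hash.
    if not isinstance(orth, int):
        raise TypeError("an integer is required")
    return ("id", orth)


def lookup_word(word):
    # Strings are looked up directly; anything else is assumed to be an ID/hash.
    if isinstance(word, str):
        return get_by_string_sketch(word)
    return get_by_orth_sketch(word)


def lookup_word_clearer(word):
    # One possible improvement: catch the low-level error and re-raise
    # with a message that names the actual problem (wording is illustrative).
    try:
        return lookup_word(word)
    except TypeError:
        raise TypeError(
            f"words must be strings (or integer IDs), got {type(word).__name__}"
        ) from None


class Potato:
    def __str__(self):
        return "potato"


def error_message(func, word):
    # Return the TypeError message raised by func(word), or None if it succeeds.
    try:
        func(word)
    except TypeError as err:
        return str(err)
    return None


confusing = error_message(lookup_word, Potato())
clearer = error_message(lookup_word_clearer, Potato())
```

In this sketch, the first lookup reports a missing integer even though the caller intended to pass a string, which mirrors the confusion in the report; the wrapped version names the offending type instead.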

polm added a commit to polm/spaCy that referenced this issue Oct 26, 2021
@adrianeboyd adrianeboyd linked a pull request Oct 27, 2021 that will close this issue
3 tasks
adrianeboyd added a commit that referenced this issue Oct 29, 2021
* Clarify error when words are of wrong type

See #9437

* Update docs

* Use try/except

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Contributor

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 29, 2021

2 participants