NerPipeline (TokenClassification) now outputs offsets of words #8781

Merged (2 commits) on Nov 30, 2020

@Narsil (Contributor) commented Nov 25, 2020

What does this PR do?

  • The pipeline output currently lacks offsets, which forces users to
    pattern match the "word" against their input. This is not always
    feasible: if a sentence contains the same word twice, there is no way
    to tell which occurrence an entity refers to.
  • This PR fixes that by adding two new keys to this pipeline's output,
    "start" and "end", which are the string offsets of the word in the
    input. That means that we should always have the invariant:
input[entity["start"]: entity["end"]] == entity["entity_group"]
                                    # or entity["entity"] if not grouped
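
As a rough illustration of why the offsets help, here is a minimal sketch that checks the slicing invariant on hand-written entity dicts (not real model output). It assumes the matched text is stored under a `"word"` key; that key name, the helper `find_offset_mismatches`, and the example values are assumptions for illustration, and only `"start"`, `"end"`, `"entity_group"` and `"entity"` come from the description above.

```python
def find_offset_mismatches(text, entities, key="word"):
    """Return the entities whose text[start:end] slice differs from entity[key].

    `key` is an assumption: it names the field holding the matched text.
    """
    return [e for e in entities if text[e["start"]:e["end"]] != e[key]]


text = "Hello Sarah Jessica Parker who Jessica lives in New York"

# Hand-written dicts for illustration only, shaped like the grouped output
# described in this PR. They are not produced by running a model.
entities = [
    {"entity_group": "PER", "word": "Sarah Jessica Parker", "start": 6, "end": 26},
    {"entity_group": "PER", "word": "Jessica", "start": 31, "end": 38},
    {"entity_group": "LOC", "word": "New York", "start": 48, "end": 56},
]

# Every slice matches its entity text, so no mismatches are reported.
print(find_offset_mismatches(text, entities))  # → []

# Pattern matching alone finds the first "Jessica", not the one the
# second entity refers to; the offsets remove that ambiguity.
second = entities[1]
print(text.find(second["word"]))  # → 12
print(second["start"])            # → 31
```

With subword tokenizers the ungrouped "word" may be a token piece rather than a literal substring, so a check like this is most meaningful on grouped entities.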

Example of users that encounter problems:

https://huggingface.co/dslim/bert-base-NER?text=Hello+Sarah+Jessica+Parker+who+Jessica+lives+in+New+York
https://discuss.huggingface.co/t/token-positions-when-using-the-inference-api/2188

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@LysandreJik (Member) left a comment

Great, thanks for working on this @Narsil.

@LysandreJik merged commit d8fc26e into huggingface:master on Nov 30, 2020
stas00 pushed a commit to stas00/transformers that referenced this pull request Dec 5, 2020
…ngface#8781)

* NerPipeline (TokenClassification) now outputs offsets of words

* Fixing doc style