
Tokenizer — problems with punctuation? #801

Closed
ematvey opened this issue Feb 2, 2017 · 3 comments

Comments

@ematvey
Contributor

ematvey commented Feb 2, 2017

Consider the following snippet:

text = """Vanity is one of the things which are perhaps most difficult for a noble man to understand: he will be tempted to deny it, where another kind of man thinks he sees it self-evidently. The problem for him is to represent to his mind beings who seek to arouse a good opinion of themselves which they themselves do not possess--and consequently also do not "deserve,"--and who yet BELIEVE in this good opinion afterwards."""
# verbatim portion of http://www.gutenberg.org/cache/epub/4363/pg4363.txt

import spacy
en = spacy.load('en')
doc = en(text)
for token in doc:
  if token.pos_ == 'X':
    print(token.orth_)

This outputs deserve,"--and as a single token.

Running the tokenizer on the full text turns up the following tokens (I assume X marks an unidentifiable POS):

faculties"--of
de
la
exception;--exclusive
mandeikagati
day.--Is
O
cherche
le
vrai
que
pour
faire
le
bien"--I
il
faut
etre
sec
des
decouvertes
c'est
voir
clair
dans
ce
qui
kind,--that
liben
pensatori
net,--to
or-
enveloppe
le
corps
[Greek
JE
L'ART
et
de
mulierel
suffers,--who
history"--an
national'--what
do.--What
fatherlands"--they
co
deserve,"--and
are!"--even
himself."--Goethe
refinement:--just
thee-
teeth.--Perhaps
memories?--To
rue
_must
etc
es
Useful.=--Therefore
intelligibeln
unegoistic."--In
=Hope.=--Pandora
hope.--Zeus
=Man
_must

Most of these tokenization failures are due to punctuation. Admittedly, Project Gutenberg's texts are not the cleanest, but perhaps the tokenization rules could be improved?

@ines
Member

ines commented Feb 2, 2017

Thanks for the report – this seems to be caused by the global infix rules being too specific. Currently, they cover all common hyphens, but no combinations of other punctuation characters plus hyphens.
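
For reference, here's a minimal way to reproduce and poke at this (written against the spaCy 2+/3 API, where the infix patterns are exposed on nlp.Defaults and the tokenizer's infix_finditer; the 1.x internals at the time of this thread differed):

import spacy

nlp = spacy.load('en_core_web_sm')

# The infix rules are plain regex patterns; at the time of this report,
# none of them matched a punctuation run like ,"-- between two words.
for pattern in nlp.Defaults.infixes:
    print(pattern)

# Reproduce the report: per the output above, this came back as the
# single token deserve,"--and instead of being split.
print([token.text for token in nlp('deserve,"--and')])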

I'll add a regression test for this issue and see if I can fix the rules to handle these cases, without breaking anything else. (Now that we have a much better test suite in place for the tokenizers, it's definitely a lot easier to make sure changes to the regexes don't produce unintended results elsewhere.)
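
A sketch of what such a regression test could look like, in the style of spaCy's tokenizer tests (en_tokenizer is the pytest fixture spaCy's own suite uses; the test name and expected splits here are illustrative):

import pytest

@pytest.mark.parametrize('text,expected', [
    ('deserve,"--and', ['deserve', ',"--', 'and']),
    ('exception;--exclusive', ['exception', ';--', 'exclusive']),
    ('day.--Is', ['day', '.--', 'Is']),
])
def test_issue801(en_tokenizer, text, expected):
    # The punctuation run should be split off both surrounding words.
    tokens = en_tokenizer(text)
    assert [token.text for token in tokens] == expected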

I definitely have the vision of one day being able to handle all Gutenberg texts perfectly and out-of-the-box, though 😉 In some cases, the formatting markup is tricky and may conflict with other rules, so if you're working with a lot of texts like these, it might be worth creating a custom tokenizer subclass and overriding some of the punctuation rules (see the sketch below).
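
As a sketch of that customization (using spacy.util.compile_infix_regex from spaCy 2+/3, which postdates this thread, and an illustrative pattern for the punctuation-plus-double-hyphen runs seen above):

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load('en_core_web_sm')

# Extend the default infixes with a pattern that splits a run of
# punctuation ending in a double hyphen off the surrounding words.
extra_infix = r'(?<=[A-Za-z])[,;:.?!=]*"?--(?=[A-Za-z])'
infixes = list(nlp.Defaults.infixes) + [extra_infix]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

# Expected along the lines of ['deserve', ',"--', 'and']
print([token.text for token in nlp('deserve,"--and')])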

@ines
Member

ines commented Feb 2, 2017

Okay, so after playing around with it for a bit, here's a compromise for now:

There's currently no easy way to define rules for splitting multiple infixes. But in any case, we definitely don't want to end up with punctuation attached to a token.

I've modified the infix rules to not split off hyphens when they follow certain punctuation characters. Still not perfect, but at least the non-punctuation tokens are now correct and spaCy will be able to assign the correct POS tags. The problem cases above now come out as:

['"', 'deserve', ',"--', 'and']
['exception', ';--', 'exclusive']
['day', '.--', 'Is']
['refinement', ':--', 'just']
['memories', '?--', 'To']
['Useful', '.=--', 'Therefore']
['=', 'Hope', '.=--', 'Pandora']
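
For illustration, here's a toy pure-re sketch of that rule (not spaCy's actual regex) that reproduces the infix splits above:

import re

# A run of punctuation ending in a double hyphen between two words is
# kept together as a single infix token.
INFIX = re.compile(r'(?<=[A-Za-z])([,;:.?!=]*"?--)(?=[A-Za-z])')

def split(text):
    tokens, start = [], 0
    for match in INFIX.finditer(text):
        tokens += [text[start:match.start(1)], match.group(1)]
        start = match.end(1)
    return [t for t in tokens + [text[start:]] if t]

print(split('deserve,"--and'))       # ['deserve', ',"--', 'and']
print(split('Useful.=--Therefore'))  # ['Useful', '.=--', 'Therefore']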

So for now, we'll have to assume that this is the "correct" behaviour. I've updated the regression test to expect the above output, and I'm closing this issue since it's technically fixed. But handling multiple infixes would be a nice additional feature, so if you have a suggestion (or an idea for a PR along those lines), it would definitely be appreciated! 👍

@ines ines closed this as completed Feb 2, 2017
@lock

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018