Tokenization with exception patterns #700

Merged: 53 commits merged into explosion:master on Jan 2, 2017

Conversation

@oroszgy (Contributor) commented Dec 21, 2016

Using regular expression for exception handling during tokenization

Description

Modified the tokenizer algorithm, enabling users to incorporate regexp patterns for handling tokenization exceptions.

Motivation and Context

This PR fixes #344 and allows the tokenizer to use arbitrary patterns as exceptions.

How Has This Been Tested?

New tests are added in tokenizer/test_urls.py.
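
For reference, a minimal standalone sketch of the kind of check these tests make. The en_tokenizer fixture name follows spaCy's test-suite conventions but is only an assumption here; the URLs are illustrative, not copied from the actual test file.

```python
# Illustrative only; the real tests live in tokenizer/test_urls.py and use
# spaCy's shared test fixtures. The en_tokenizer fixture name is assumed.
import pytest

URLS = [
    "http://www.nytimes.com/2016/04/20/us/politics/new-york-primary-preview.html",
    "www.google.com?q=google",
]

@pytest.mark.parametrize("url", URLS)
def test_url_stays_one_token(en_tokenizer, url):
    # A URL chunk should not be split on its internal punctuation.
    tokens = en_tokenizer(url)
    assert len(tokens) == 1
```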

Screenshots (if appropriate):

NA

Types of changes

  • Bug fix (non-breaking change fixing an issue)
  • New feature (non-breaking change adding functionality to spaCy)
  • Breaking change (fix or feature causing change to spaCy's existing functionality)
  • Documentation (Addition to documentation of spaCy)

Checklist:

  • My code follows spaCy's code style.
  • My change requires a change to spaCy's documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@honnibal (Member) commented Dec 22, 2016

Thanks! Reactions, in stream-of-consciousness form:

Interesting approach. I've so far resisted introducing more regex-based logic into the tokenizer. My two concerns have been:

  1. Performance — It's easy to write unfortunate regexes, and to require multiple passes over the data
  2. Maintainability — It's easy to write tokenizer logic that's hard to change, because all the rules depend on what other rules are doing.

So, my kneejerk reaction was "Oh, this isn't how we want to do this". But, then again: it's currently difficult to express the necessary logic for the URL tokenization in the tokenizer. So maybe we do need a mechanism like this.

If you have a minute, it would be nice to benchmark this. The toolset I use is in the spacy-benchmarks repository.

I expect that the way you're doing this, there shouldn't be much, if any, additional performance problem. It's just like the prefix and suffix expressions: the question is only asked on chunks that can't be tokenized using vocabulary items, and the expression will only match deeply on strings that are actually URLs. So I think the benchmark will come out to be no problem.

I do have one suggested improvement, though.

Let's say we have the string:

Visit www.spam.com!

We want this to be tokenized into:

['Visit', 'www.spam.com', '!']

I suggest we rely on the prefix and suffix expressions to strip the attached tokens. This way, we only need to handle is_match('www.spam.com') -> True in the new rule-match logic. The '!' will be split off using the suffix rule. We can then make the match rule a boolean function over the entire string. This way, the user can actually supply an arbitrary boolean function to the tokenizer. We'll usually want to use the .match() method of a regex object, but the user will be free to do something else.
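
To make that concrete, here is a rough sketch of what supplying such a boolean function could look like. It uses the token_match hook this PR introduces, but it is written against the present-day spaCy API (import paths and the Tokenizer signature differed in the 1.x code base this PR targeted), and the URL regex is deliberately simplified.

```python
import re
import spacy
from spacy.tokenizer import Tokenizer

# Simplified URL pattern; it deliberately stops before trailing punctuation,
# so the '!' is left for the suffix rules to split off.
url_re = re.compile(r"^(?:https?://|www\.)[\w.-]+(?:/\S*)?$")

nlp = spacy.blank("en")
nlp.tokenizer = Tokenizer(
    nlp.vocab,
    rules=nlp.tokenizer.rules,
    prefix_search=nlp.tokenizer.prefix_search,
    suffix_search=nlp.tokenizer.suffix_search,
    infix_finditer=nlp.tokenizer.infix_finditer,
    token_match=url_re.match,  # any callable returning a truthy value will do
)

doc = nlp("Visit www.spam.com!")
print([t.text for t in doc])  # expected: ['Visit', 'www.spam.com', '!']
```

The point of the design is visible in the example: the suffix rules strip the '!', and the boolean match only has to answer "is the remaining chunk a single token?" for 'www.spam.com'.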

I think this will give a clearer division of labour between the different parts of the tokenizer. We'll have the following:

  • Exceptions: Literally match a whole chunk, and expand it into a pre-defined set of tokens.
  • Prefix match: Match N characters at the beginning of the chunk. The prefix becomes a token, the remainder of the chunk is tokenized further.
  • Suffix match: Match N characters from the end of the chunk. The suffix becomes a token, and the remainder of the chunk is tokenized further.
  • Infix match: Match N characters within the chunk, splitting the token into (head, infix, tail). Head and infix become tokens, and tail is tokenized further.
  • Entire match: Match an entire chunk, which becomes a single token.

What do you think?

@oroszgy (Contributor, Author) commented Dec 22, 2016

Hi @honnibal,

Thanks for the feedback! I tried to use your benchmark repo, but ran into several problems. :( The biggest obstacles were that the Gigaword corpus is not freely accessible and that conda does not play well with virtualenv. I ended up using 1174 docs from the UD_English corpus and writing my own benchmarking scripts.
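
The benchmark scripts themselves aren't part of this thread; the following is only a rough sketch of the kind of per-document timing being reported. The corpus path and model name are placeholders, not the actual setup.

```python
# Not the actual benchmark_spacy.py; a minimal per-document timing sketch.
import glob
import time
import spacy

nlp = spacy.load("en_core_web_sm")  # model name is a placeholder
texts = [open(p, encoding="utf8").read() for p in glob.glob("ud_english/*.txt")]

times = []
for text in texts:
    start = time.perf_counter()
    nlp.tokenizer(text)  # time the tokenizer only, no tagger/parser
    times.append((time.perf_counter() - start) * 1000.0)

print(f"{len(times)} files, {sum(times):.2f}ms total, "
      f"{sum(times) / len(times):.2f}ms average, "
      f"{min(times):.2f}ms min, {max(times):.2f}ms max")
```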

The results are a bit disappointing. Tokenizing the corpus with spaCy 1.4 and with my changes:

➜  python3 benchmark_spacy.py
1174 files, 412.20ms total, 0.35ms average, 0.04ms min, 9.90ms max
➜  source activate spacy-dev
(spacy-dev) ➜  python3 benchmark_spacy.py
1174 files, 4226.19ms total, 3.60ms average, 0.11ms min, 96.49ms max

What is strange is that when I explicitly set the matcher to None, the running time did not change much. Do you have any idea what could be going wrong?

Anyway, I really liked your idea of making this improvement more general. I'll definitely modify the PR accordingly once I figure out why the tokenization became that slow.

@honnibal (Member) commented:

Do you have a vocab file loaded in your version's virtualenv? There's a bit of a footgun in spaCy at the moment: if you start with no vocab, it doesn't cache any tokenization, and it ends up quite slow.

@oroszgy (Contributor, Author) commented Dec 23, 2016

Thanks, downloading the model helped a lot! Now my changes are on par with the master branch in terms of execution speed. :)

I will update the PR with the new mechanism soon.

@oroszgy (Contributor, Author) commented Dec 26, 2016

@honnibal What do you think about the changes now? Do you think spaCy can benefit from this new feature?

@honnibal (Member) commented:

Hey,

Thanks, looking good!

I added some tests for the trickier interactions with punctuation. My guess is that this currently fails, but I haven't had a chance to check yet.

I think you'll need to check the token_match function within the _split_affixes loop --- probably before we check find_prefix, because we do want whole-match to have precedence over the affix search.
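
In pseudo-code, the placement being suggested looks roughly like this. The real method lives in tokenizer.pyx and is written in Cython with a different signature; this Python sketch only illustrates the control flow, not the actual implementation.

```python
def _split_affixes(self, substring, prefixes, suffixes):
    # Illustration only: check the whole-chunk match before the affix search,
    # so a matching string such as 'www.spam.com' is never broken up.
    while substring:
        if self.token_match and self.token_match(substring):
            break
        pre_len = self.find_prefix(substring)
        if pre_len:
            prefixes.append(substring[:pre_len])
            substring = substring[pre_len:]
            continue
        suf_len = self.find_suffix(substring)
        if suf_len:
            suffixes.append(substring[-suf_len:])
            substring = substring[:-suf_len]
            continue
        break
    return substring
```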

Once we get these extra cases covered, we just need to update the docs and we're good to go! I'm happy to make the docs changes if that's easier for you.

@oroszgy (Contributor, Author) commented Dec 30, 2016

Hey,

Thanks for the reply and for the exhaustive test cases.

The implementation in this PR iteratively checks substrings with token_match, and only if that fails does it apply the old _split_affixes logic (see here). This method allows us to match whole tokens while still splitting off unnecessary prefixes and suffixes. However, if we go with your suggestion and move token_match out of the loop straight into the __call__ method, we lose the ability to remove affixes. In practice this means there won't be any straightforward way to make your tests pass.

@honnibal (Member) commented:

I need to fix Travis for pull requests, but I think this works — it's green on my local copy. What do you think?

@oroszgy (Contributor, Author) commented Jan 2, 2017

Looks good to me, tests are passing here as well. Thanks! (I misunderstood something while writing my previous comment...)

I reran my benchmark scripts; the results are:

  • master branch (9d39e78): 1174 files, 1311.26ms total, 1.12ms average, 0.04ms min, 35.77ms max
  • this feature branch (fde53be): 1174 files, 1300.80ms total, 1.11ms average, 0.04ms min, 36.19ms max

@honnibal merged commit 9b48bd1 into explosion:master on Jan 2, 2017
@honnibal (Member) commented Jan 2, 2017

🎉

Merging!
