Tokenization with exception patterns #700

Merged: 53 commits merged into explosion:master on Jan 2, 2017

Conversation

@oroszgy (Contributor) commented Dec 21, 2016

Using regular expression for exception handling during tokenization

Description

Modified the tokenizer algorithm, enabling users to incorporate regexp patterns for handling tokenization exceptions.

Motivation and Context

This PR fixes #344 and allows the tokenizer to use arbitrary patterns as exceptions.

How Has This Been Tested?

New tests are added in tokenizer/test_urls.py.
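
For reference, a minimal standalone sketch of the kind of check these tests make. The en_tokenizer fixture name follows spaCy's test-suite conventions but is only an assumption here; the URLs are illustrative, not copied from the actual test file.

```python
# Illustrative only; the real tests live in tokenizer/test_urls.py and use
# spaCy's shared test fixtures. The en_tokenizer fixture name is assumed.
import pytest

URLS = [
    "http://www.nytimes.com/2016/04/20/us/politics/new-york-primary-preview.html",
    "www.google.com?q=google",
]

@pytest.mark.parametrize("url", URLS)
def test_url_stays_one_token(en_tokenizer, url):
    # A URL chunk should not be split on its internal punctuation.
    tokens = en_tokenizer(url)
    assert len(tokens) == 1
```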

Screenshots (if appropriate):

NA

Types of changes

  • Bug fix (non-breaking change fixing an issue)
  • New feature (non-breaking change adding functionality to spaCy)
  • Breaking change (fix or feature causing change to spaCy's existing functionality)
  • Documentation (Addition to documentation of spaCy)

Checklist:

  • My code follows spaCy's code style.
  • My change requires a change to spaCy's documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@honnibal (Member) commented Dec 22, 2016

Thanks! Reactions, in stream-of-consciousness form:

Interesting approach. I've so far resisted introducing more regex-based logic into the tokenizer. My two concerns have been:

  1. Performance — It's easy to write unfortunate regexes, and to require multiple passes over the data
  2. Maintainability — It's easy to write tokenizer logic that's hard to change, because all the rules depend on what other rules are doing.

So, my kneejerk reaction was "Oh, this isn't how we want to do this". But, then again: it's currently difficult to express the necessary logic for the URL tokenization in the tokenizer. So maybe we do need a mechanism like this.

If you have a minute, it would be nice to benchmark this. The toolset I use is in the spacy-benchmarks repository.

I expect that the way you're doing this, there shouldn't be much, if any, additional performance problem. It's just like the prefix and suffix expressions: the question is only asked on chunks that can't be tokenized using vocabulary items, and the expression will only match deeply on strings that are actually URLs. So I think the benchmark will come out to be no problem.

I do have one suggested improvement, though.

Let's say we have the string:

Visit www.spam.com!

We want this to be tokenized into:

['Visit', 'www.spam.com', '!']

I suggest we rely on the prefix and suffix expressions to strip the attached tokens. This way, we only need to handle is_match('www.spam.com') -> True in the new rule-match logic. The '!' will be split off using the suffix rule. We can then make the match rule a boolean function over the entire string. This way, the user can actually supply an arbitrary boolean function to the tokenizer. We'll usually want to use the .match() method of a regex object, but the user will be free to do something else.
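
To make that concrete, here is a rough sketch of what supplying such a boolean function could look like. It uses the token_match hook this PR introduces, but it is written against the present-day spaCy API (import paths and the Tokenizer signature differed in the 1.x code base this PR targeted), and the URL regex is deliberately simplified.

```python
import re
import spacy
from spacy.tokenizer import Tokenizer

# Simplified URL pattern; it deliberately stops before trailing punctuation,
# so the '!' is left for the suffix rules to split off.
url_re = re.compile(r"^(?:https?://|www\.)[\w.-]+(?:/\S*)?$")

nlp = spacy.blank("en")
nlp.tokenizer = Tokenizer(
    nlp.vocab,
    rules=nlp.tokenizer.rules,
    prefix_search=nlp.tokenizer.prefix_search,
    suffix_search=nlp.tokenizer.suffix_search,
    infix_finditer=nlp.tokenizer.infix_finditer,
    token_match=url_re.match,  # any callable returning a truthy value will do
)

doc = nlp("Visit www.spam.com!")
print([t.text for t in doc])  # expected: ['Visit', 'www.spam.com', '!']
```

The point of the design is visible in the example: the suffix rules strip the '!', and the boolean match only has to answer "is the remaining chunk a single token?" for 'www.spam.com'.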

I think this will give a clearer division of labour between the different parts of the tokenizer. We'll have the following:

  • Exceptions: Literally match a whole chunk, and expand it into a pre-defined set of tokens.
  • Prefix match: Match N characters at the beginning of the chunk. The prefix becomes a token, the remainder of the chunk is tokenized further.
  • Suffix match: Match N characters from the end of the chunk. The suffix becomes a token, and the remainder of the chunk is tokenized further.
  • Infix match: Match N characters within the chunk, splitting the token into (head, infix, tail). Head and infix become tokens, and tail is tokenized further.
  • Entire match: Match an entire chunk, which becomes a single token.

What do you think?

@oroszgy (Contributor, Author) commented Dec 22, 2016

Hi @honnibal,

Thanks for the feedback! I tried to use your benchmark repo, but ran into several problems. :( The biggest obstacles were that the Gigaword corpus is not freely accessible and that conda does not play well with virtualenv. I ended up using 1174 docs from the UD_English corpus and writing my own benchmarking scripts.
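
The benchmark scripts themselves aren't part of this thread; the following is only a rough sketch of the kind of per-document timing being reported. The corpus path and model name are placeholders, not the actual setup.

```python
# Not the actual benchmark_spacy.py; a minimal per-document timing sketch.
import glob
import time
import spacy

nlp = spacy.load("en_core_web_sm")  # model name is a placeholder
texts = [open(p, encoding="utf8").read() for p in glob.glob("ud_english/*.txt")]

times = []
for text in texts:
    start = time.perf_counter()
    nlp.tokenizer(text)  # time the tokenizer only, no tagger/parser
    times.append((time.perf_counter() - start) * 1000.0)

print(f"{len(times)} files, {sum(times):.2f}ms total, "
      f"{sum(times) / len(times):.2f}ms average, "
      f"{min(times):.2f}ms min, {max(times):.2f}ms max")
```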

The results are a bit disappointing. Tokenizing the corpus with spaCy 1.4 and with my changes:

➜  python3 benchmark_spacy.py
1174 files, 412.20ms total, 0.35ms average, 0.04ms min, 9.90ms max
➜  source activate spacy-dev
(spacy-dev) ➜  python3 benchmark_spacy.py
1174 files, 4226.19ms total, 3.60ms average, 0.11ms min, 96.49ms max

What is strange is that when I explicitly set the matcher to None, the running time did not change much. Do you have any idea what could be going wrong?

Anyway, I really liked your idea of making this improvement more general. I'll definitely modify the PR accordingly once I figure out why the tokenization became that slow.

@honnibal (Member) commented:

Do you have a vocab file loaded in your version's virtualenv? There's a bit of a footgun in spaCy at the moment: if you start with no vocab, it doesn't cache any tokenization, and it ends up quite slow.

@oroszgy (Contributor, Author) commented Dec 23, 2016

Thanks, downloading the model helped a lot! Now my changes are on par with the master branch in terms of execution speed. :)

I will update the PR with the new mechanism soon.

@oroszgy (Contributor, Author) commented Dec 26, 2016

@honnibal What do you think about the changes now? Do you think spaCy can benefit from this new feature?

@honnibal (Member) commented:

Hey,

Thanks, looking good!

I added some tests for the trickier interactions with punctuation. My guess is that this currently fails, but I haven't had a chance to check yet.

I think you'll need to check the token_match function within the _split_affixes loop --- probably before we check find_prefix, because we do want whole-match to have precedence over the affix search.
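
In pseudo-code, the placement being suggested looks roughly like this. The real method lives in tokenizer.pyx and is written in Cython with a different signature; this Python sketch only illustrates the control flow, not the actual implementation.

```python
def _split_affixes(self, substring, prefixes, suffixes):
    # Illustration only: check the whole-chunk match before the affix search,
    # so a matching string such as 'www.spam.com' is never broken up.
    while substring:
        if self.token_match and self.token_match(substring):
            break
        pre_len = self.find_prefix(substring)
        if pre_len:
            prefixes.append(substring[:pre_len])
            substring = substring[pre_len:]
            continue
        suf_len = self.find_suffix(substring)
        if suf_len:
            suffixes.append(substring[-suf_len:])
            substring = substring[:-suf_len]
            continue
        break
    return substring
```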

Once we get these extra cases covered, we just need to update the docs and we're good to go! I'm happy to make the docs changes if that's easier for you.

@oroszgy (Contributor, Author) commented Dec 30, 2016

Hey,

Thanks for the reply and for the exhaustive test cases.

The implementation in this PR iteratively checks substrings with token_match, and only if that fails does it apply the old _split_affixes logic (see here). This method allows us to match whole tokens while still splitting off unnecessary prefixes and suffixes. However, if we go with your suggestion and move token_match out of the loop straight into the __call__ method, we lose the ability to remove affixes. In practice this means there won't be any straightforward way to make your tests pass.

@honnibal (Member) commented:

I need to fix Travis for pull requests, but I think this works — it's green on my local copy. What do you think?

@oroszgy (Contributor, Author) commented Jan 2, 2017

Looks good to me, tests are passing here as well. Thanks! (I misunderstood something while writing my previous comment...)

I reran my benchmark scripts; the results are:

  • master branch (9d39e78): 1174 files, 1311.26ms total, 1.12ms average, 0.04ms min, 35.77ms max
  • this feature branch (fde53be): 1174 files, 1300.80ms total, 1.11ms average, 0.04ms min, 36.19ms max

@honnibal merged commit 9b48bd1 into explosion:master on Jan 2, 2017
@honnibal (Member) commented Jan 2, 2017

🎉

Merging!
