
Stanford Dependencies / Universal Dependencies #2485

Closed
moshest opened this issue Jun 27, 2018 · 2 comments
Labels
feat / parser Feature: Dependency Parser models Issues related to the statistical models

Comments


moshest commented Jun 27, 2018

Feature description

I've read a few threads regarding the differences between dependency schemas, but I couldn't find an answer.

I ran a few tests with the CoreNLP parser. I find Stanford Dependencies a bit more informative and accurate, and I wonder why spaCy chose a dependency scheme different from Stanford's and SyntaxNet's.

Is there any plan to switch to Stanford Dependencies or Universal Dependencies in the future? (And if not, why not?)

@ines added the labels models (Issues related to the statistical models) and feat / parser (Feature: Dependency Parser) on Jun 28, 2018
@honnibal
Member

honnibal commented Jul 6, 2018

@moshest Ideally we'd like to be moving to the Universal Dependencies. Unfortunately this is all harder than it sounds.

The main difference between the ClearNLP converter (which we use) and the Stanford converter is that the ClearNLP converter makes use of the full treebank annotations, including the trace nodes and function tags. This makes the conversion better, as there's a lot of linguistic information in those tags. The Stanford converter is designed to work on the output of skeletal PCFG parsers, which don't typically produce these extra annotations. So it runs on a subset of the information, and outputs a lot more dep tags because of that.
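To give a concrete sense of how the label inventories differ, here's a minimal sketch. The mapping below is illustrative and deliberately non-exhaustive: it pairs a few ClearNLP-style labels (the scheme spaCy's English models use) with their closest Universal Dependencies v2 counterparts. It is an assumption for illustration, not an official conversion table.

```python
# Illustrative only: a few ClearNLP-style labels next to their closest
# UD v2 counterparts. NOT a complete or official mapping.
CLEARNLP_TO_UD = {
    "dobj": "obj",              # direct object (UD v2 renamed dobj -> obj)
    "nsubjpass": "nsubj:pass",  # passive nominal subject
    "auxpass": "aux:pass",      # passive auxiliary
}

def closest_ud_label(clearnlp_label: str) -> str:
    """Return the closest UD label, or the input unchanged if unmapped."""
    return CLEARNLP_TO_UD.get(clearnlp_label, clearnlp_label)

print(closest_ud_label("dobj"))   # -> obj
print(closest_ud_label("nsubj"))  # -> nsubj (shared by both schemes)
```

Many labels (like `nsubj`) are shared; the hard cases are the structural differences (e.g. how prepositions and copulas attach), which no flat label mapping can capture — that's part of why conversion is harder than it sounds.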

More recently, the Universal Dependencies format has become more stable, and overall it's very good. The issue is that the treebanks already provided in UD format are much smaller and much worse than the OntoNotes 5 corpus that we're using for training. OntoNotes 5 is 10x larger than the UD-formatted English Web Treebank, and accuracy on OntoNotes 5 is around 92%, while state-of-the-art accuracy on EWTB is around 85%.

What we'd like to do is have a conversion process that reads the OntoNotes 5 and converts it into UD. We'd like to be running this process over other treebanks as well.
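For reference, UD treebanks are distributed in the CoNLL-U format: one token per line, ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), with comment lines starting with `#` and blank lines separating sentences. Any conversion process would need to emit this format; a minimal reader sketch (not part of any existing tooling) looks like this:

```python
# Minimal CoNLL-U reader: yields one list of (form, head, deprel) triples
# per sentence. Multiword-token ranges (e.g. "3-4") and empty nodes
# (e.g. "5.1") are skipped for simplicity.
def read_conllu(text):
    sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # comment line, e.g. "# text = ..."
        cols = line.split("\t")
        tok_id = cols[0]
        if "-" in tok_id or "." in tok_id:
            continue  # multiword token or empty node
        form, head, deprel = cols[1], int(cols[6]), cols[7]
        sentence.append((form, head, deprel))
    if sentence:
        yield sentence

sample = (
    "# text = Dogs bark\n"
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\t_\n"
)
for sent in read_conllu(sample):
    print(sent)  # -> [('Dogs', 2, 'nsubj'), ('bark', 0, 'root')]
```

HEAD is a 1-based index into the sentence (0 marks the root), so a converter's main job is producing consistent HEAD/DEPREL pairs from the source treebank's trees.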

Finally, there are also some licensing considerations. We have a commercial license to OntoNotes 5 and a number of other resources, which lets us provide models trained on these corpora under commercial-friendly terms. However, there's no way to acquire a commercial license to the UD corpora, as there are a multitude of stakeholders involved, and it's a cooperative effort whose aims are fundamentally oriented towards research.

If you could figure out how to get the OntoNotes conversion running using the UD tools, that would be super helpful: https://github.com/universaldependencies/tools . I'm not sure whether this will be easy or difficult, though.


lock bot commented Mar 27, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Mar 27, 2019