
Stanford Dependencies / Universal Dependencies #2485

Closed
moshest opened this issue Jun 27, 2018 · 2 comments
Labels
feat / parser Feature: Dependency Parser models Issues related to the statistical models

Comments


moshest commented Jun 27, 2018

Feature description

I've read a few threads regarding the differences between dependency schemas, but I couldn't find an answer.

I ran a few tests with the CoreNLP parser. I find Stanford Dependencies a bit more informative and accurate, and I wonder why spaCy chose a dependency scheme different from Stanford's and SyntaxNet's.

Is there any plan to switch to Stanford Dependencies or Universal Dependencies in the future? (And if not, why not?)

@ines added the labels models (Issues related to the statistical models) and feat / parser (Feature: Dependency Parser) on Jun 28, 2018
@honnibal
Member

honnibal commented Jul 6, 2018

@moshest Ideally we'd like to be moving to the Universal Dependencies. Unfortunately this is all harder than it sounds.

The main difference between the ClearNLP converter (which we use) and the Stanford converter is that the ClearNLP converter makes use of the full treebank annotations, including the trace nodes and function tags. This makes the conversion better, as there's a lot of linguistic information in those tags. The Stanford converter is designed to work on the output of skeletal PCFG parsers, which don't typically produce these extra annotations. So it runs on a subset of the information, and outputs a lot more dep tags because of that.
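To give a concrete sense of how the label inventories differ, here's a minimal sketch. The mapping below is illustrative and deliberately non-exhaustive: it pairs a few ClearNLP-style labels (the scheme spaCy's English models use) with their closest Universal Dependencies v2 counterparts. It is an assumption for illustration, not an official conversion table.

```python
# Illustrative only: a few ClearNLP-style labels next to their closest
# UD v2 counterparts. NOT a complete or official mapping.
CLEARNLP_TO_UD = {
    "dobj": "obj",              # direct object (UD v2 renamed dobj -> obj)
    "nsubjpass": "nsubj:pass",  # passive nominal subject
    "auxpass": "aux:pass",      # passive auxiliary
}

def closest_ud_label(clearnlp_label: str) -> str:
    """Return the closest UD label, or the input unchanged if unmapped."""
    return CLEARNLP_TO_UD.get(clearnlp_label, clearnlp_label)

print(closest_ud_label("dobj"))   # -> obj
print(closest_ud_label("nsubj"))  # -> nsubj (shared by both schemes)
```

Many labels (like `nsubj`) are shared; the hard cases are the structural differences (e.g. how prepositions and copulas attach), which no flat label mapping can capture — that's part of why conversion is harder than it sounds.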

More recently, the Universal Dependencies format has become more stable, and overall it's very good. The issue is that the treebanks already provided in UD format are much smaller and much worse than the OntoNotes 5 corpus that we're using for training. OntoNotes 5 is 10x larger than the UD-formatted English Web Treebank, and accuracy on OntoNotes 5 is around 92%, while state-of-the-art accuracy on EWTB is around 85%.

What we'd like to do is have a conversion process that reads the OntoNotes 5 and converts it into UD. We'd like to be running this process over other treebanks as well.
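For reference, UD treebanks are distributed in the CoNLL-U format: one token per line, ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), with comment lines starting with `#` and blank lines separating sentences. Any conversion process would need to emit this format; a minimal reader sketch (not part of any existing tooling) looks like this:

```python
# Minimal CoNLL-U reader: yields one list of (form, head, deprel) triples
# per sentence. Multiword-token ranges (e.g. "3-4") and empty nodes
# (e.g. "5.1") are skipped for simplicity.
def read_conllu(text):
    sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # comment line, e.g. "# text = ..."
        cols = line.split("\t")
        tok_id = cols[0]
        if "-" in tok_id or "." in tok_id:
            continue  # multiword token or empty node
        form, head, deprel = cols[1], int(cols[6]), cols[7]
        sentence.append((form, head, deprel))
    if sentence:
        yield sentence

sample = (
    "# text = Dogs bark\n"
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\t_\n"
)
for sent in read_conllu(sample):
    print(sent)  # -> [('Dogs', 2, 'nsubj'), ('bark', 0, 'root')]
```

HEAD is a 1-based index into the sentence (0 marks the root), so a converter's main job is producing consistent HEAD/DEPREL pairs from the source treebank's trees.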

Finally, there are also some licensing considerations. We have a commercial license to OntoNotes 5 and a number of other resources, which lets us provide models trained on these corpora under commercial-friendly terms. However, there's no way to acquire a commercial license to the UD corpora, as there are a multitude of stakeholders involved, and it's a cooperative effort whose aims are fundamentally oriented towards research.

If you could figure out how to get the OntoNotes conversion running using the UD tools, that would be super helpful: https://github.com/universaldependencies/tools . I'm not sure whether this will be easy or difficult, though.


lock bot commented Mar 27, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Mar 27, 2019