I read a few threads regarding the differences between dependency schemes, but I couldn't find an answer.
I ran a few tests with the CoreNLP parser. I find Stanford Dependencies a bit more informative and accurate, and I wonder why spaCy chose a dependency scheme different from Stanford's and SyntaxNet's.
Is there any plan to switch to Stanford Dependencies or Universal Dependencies in the future? (And if not, why not?)
@moshest Ideally we'd like to be moving to the Universal Dependencies. Unfortunately this is all harder than it sounds.
The main difference between the ClearNLP converter (which we use) and the Stanford converter is that the ClearNLP converter makes use of the full treebank annotations, including the trace-nodes and function tags. This makes the conversion better, as there's a lot of linguistic information in those tags. The Stanford Converter is designed to work on the output of skeletal PCFG parsers, which don't typically produce these extra annotations. So it runs on a subset of the information, and outputs a lot more dep tags because of that.
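If you want to compare the two schemes concretely, you can print the labels spaCy assigns and set them next to CoreNLP's output for the same sentence. This is just a minimal sketch; the model name `en_core_web_sm` is assumed, and any English model will do:

```python
import spacy

# Sketch: print the dependency labels spaCy assigns (ClearNLP-style scheme,
# e.g. nsubj, dobj, prep/pobj) so they can be compared against CoreNLP's
# Stanford Dependencies output for the same sentence.
nlp = spacy.load("en_core_web_sm")
doc = nlp("The converter reads the full treebank annotations.")

for token in doc:
    # token.dep_ is the dependency label, token.head the governing token
    print(f"{token.text:<12} {token.dep_:<8} <- {token.head.text}")
```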
More recently, the Universal Dependencies format has become more stable, and overall it's very good. The issue is that the treebanks already provided in UD format are much smaller and much worse than the OntoNotes 5 corpus that we're using for training. OntoNotes 5 is 10x larger than the UD-formatted English Web Treebank, and accuracy on OntoNotes 5 is around 92%, while state-of-the-art accuracy on EWTB is around 85%.
What we'd like to do is have a conversion process that reads the OntoNotes 5 and converts it into UD. We'd like to be running this process over other treebanks as well.
Finally, there are also some licensing considerations. We have a commercial license to OntoNotes 5 and a number of other resources, which lets us provide models trained on these corpora under commercial-friendly terms. However, there's no way to acquire a commercial license to the UD corpora: there are a multitude of stakeholders involved, and it's a cooperative effort whose aims are fundamentally oriented towards research.
If you could figure out how to get the OntoNotes conversion running using the UD tools, that would be super helpful: https://github.com/universaldependencies/tools . I'm not sure whether this will be easy or difficult, though.
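For anyone who wants to take a crack at it, the overall shape of the pipeline would be roughly the sketch below. The converter entry point (`tools/convert_constituency.py`) and the OntoNotes directory layout are assumptions, not the actual script name in the UD tools repo; the point is just the input/output shape: Penn-Treebank-style `.parse` files in, one CoNLL-U file out.

```python
import subprocess
from pathlib import Path

# Hypothetical sketch of an OntoNotes 5 -> UD conversion loop.
# "convert_constituency.py" stands in for whatever entry point the UD tools
# actually provide; the OntoNotes paths below are also assumptions.
ONTONOTES_DIR = Path("ontonotes-release-5.0/data/files/data/english/annotations")
OUTPUT = Path("ontonotes-ud.conllu")

with OUTPUT.open("w", encoding="utf8") as out:
    for parse_file in sorted(ONTONOTES_DIR.rglob("*.parse")):
        # Each .parse file holds constituency trees with the function tags
        # and trace nodes mentioned above.
        result = subprocess.run(
            ["python", "tools/convert_constituency.py", str(parse_file)],
            capture_output=True, text=True, check=True,
        )
        out.write(result.stdout)
```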