Statistics reported in README don't match the actual statistics of the corpus #2
Comments
It is also rather unusual that the training data segment is 8.5 times smaller than the test set; what are the motivations for that? Thank you.
I agree the "6400 sentences (7.4K tokens)" is suspicious (it would mean 1.2 tokens per sentence) even without looking at the
See the data split guidelines. For treebanks with fewer than 20k words, they suggest either keeping everything as test data or setting aside 20-50 sentences as "train". Here we see 80 sentences in train with strange
Thank you for the prompt response and the pointers!
Dear colleagues, thank you for your fantastic work on the long-awaited treebank!
I decided that I should report this to you just in case: one can see from both the .conllu files and stats.xml that there is a total of 781 sentences in the corpus, while the README file states there are 6400 of them.

Best regards.
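For anyone who wants to reproduce the count, here is a minimal sketch of how the sentences and words in a .conllu file could be tallied. It is not the maintainers' tooling: `count_conllu` is a hypothetical helper name, and it assumes standard CoNLL-U layout (sentences separated by blank lines, comment lines starting with `#`, and the first tab-separated column holding the token ID).

```python
def count_conllu(text):
    """Return (sentences, words) for a string of CoNLL-U data.

    Assumption: only lines whose ID is a plain integer count as words;
    multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1")
    are skipped.
    """
    sentences = 0
    words = 0
    in_sentence = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            # A blank line closes the current sentence.
            in_sentence = False
            continue
        if line.startswith("#"):
            # Sentence-level comment (sent_id, text, ...).
            continue
        if not in_sentence:
            # The first token line after a blank line opens a new sentence.
            sentences += 1
            in_sentence = True
        token_id = line.split("\t", 1)[0]
        if token_id.isdigit():
            words += 1
    return sentences, words
```

Summing the first value over every .conllu file in the repository gives the figure that can be checked against the README and stats.xml.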