Statistics reported in README don't match the actual statistics of the corpus #2

Open
alexeyev opened this issue May 17, 2023 · 3 comments


@alexeyev

Dear colleagues, thank you for your fantastic work on the long-awaited treebank!

I decided to report this just in case: both the .conllu files and stats.xml show a total of 781 sentences in the corpus, while the README states there are 6400.

Best regards.

@alexeyev
Author

It is also rather unusual that the training data segment is 8.5 times smaller than the test set; what are the motivations for that? Thank you.

@martinpopel
Member

I agree the "6400 sentences (7.4K tokens)" is suspicious (it would mean 1.2 tokens per sentence) even without looking at the .conllu and stats.xml files. I guess it should be "7.4K words (6.4K words excluding punctuation)".
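Counts like these can be re-derived directly from the data. Below is a minimal sketch, assuming standard CoNLL-U input (blank line between sentences, `#` comment lines, tab-separated columns with the ID in column 1 and UPOS in column 4). The function name `conllu_stats` and the two-sentence `SAMPLE` are made up for illustration and are not taken from the treebank.

```python
def conllu_stats(text):
    """Count sentences, words, and non-punctuation words in CoNLL-U text."""
    sentences = words = punct = 0
    in_sentence = False
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            in_sentence = False
            continue
        if line.startswith("#"):
            # Comment lines (sent_id, text, ...) carry no tokens.
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        cols = line.split("\t")
        # Plain integer IDs are syntactic words; ranges such as "1-2"
        # (multiword tokens) and decimals such as "1.1" (empty nodes) are skipped.
        if cols[0].isdigit():
            words += 1
            if len(cols) > 3 and cols[3] == "PUNCT":
                punct += 1
    return {"sentences": sentences, "words": words, "words_no_punct": words - punct}


# Hypothetical sample, not from the corpus discussed here.
SAMPLE = (
    "# sent_id = 1\n"
    "# text = Hello world.\n"
    "1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\t_\n"
    "2\tworld\tworld\tNOUN\t_\t_\t1\tvocative\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_\n"
    "\n"
    "# sent_id = 2\n"
    "# text = Bye\n"
    "1\tBye\tbye\tINTJ\t_\t_\t0\troot\t_\t_\n"
    "\n"
)

stats = conllu_stats(SAMPLE)
print(stats)
print("words per sentence:", stats["words"] / stats["sentences"])

# The README figures taken at face value: 7.4K tokens over 6400 sentences.
print("README ratio:", round(7400 / 6400, 1))
```

Running the same function over the concatenated .conllu files would give numbers to compare against the README and stats.xml.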

> the training data segment is 8.5 times smaller than the test set; what are the motivations for that?

See the data split guidelines. For treebanks with fewer than 20k words, they suggest either keeping everything as test data or setting aside 20-50 sentences as "train". Here we see 80 sentences in train with a strange sent_id system (..., 797, 798, 789_, 790_, ... 799_, 800), but the fact that train is 8.5 times smaller than test is OK.
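One way to surface irregular sentence ids is sketched below, assuming the ids appear in standard `# sent_id = ...` comment lines. The helper names `sent_ids` and `irregular_ids` are hypothetical, and the sample reuses only the ids quoted above, with the elided ones left out.

```python
import re

# Matches CoNLL-U comment lines of the form "# sent_id = <id>".
SENT_ID = re.compile(r"^#\s*sent_id\s*=\s*(\S+)")


def sent_ids(text):
    """Return all sent_id values in document order."""
    ids = []
    for line in text.splitlines():
        match = SENT_ID.match(line)
        if match:
            ids.append(match.group(1))
    return ids


def irregular_ids(ids):
    """Return ids that are not plain non-negative integers."""
    return [s for s in ids if not s.isdigit()]


# Only the ids quoted in the comment above; token lines are omitted for brevity.
SAMPLE = "\n".join(
    f"# sent_id = {s}" for s in ["797", "798", "789_", "790_", "799_", "800"]
)

ids = sent_ids(SAMPLE)
print("all ids:", ids)
print("irregular:", irregular_ids(ids))
```

This only flags ids with non-digit characters. It does not judge whether such a scheme is intentional.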

@alexeyev
Author

> See the data split guidelines.

Thank you for the prompt response and the pointers!

dan-zeman added a commit that referenced this issue May 17, 2023