Statistics reported in README don't match the actual statistics of the corpus #2

alexeyev · 2023-05-17T04:41:10Z

Dear colleagues, thank you for your fantastic work on the long-awaited treebank!

I decided to report this just in case: both the .conllu files and stats.xml show a total of 781 sentences in the corpus, while the README states there are 6400.

Best regards.


alexeyev · 2023-05-17T05:24:48Z

It is also rather unusual that the training data segment is 8.5 times smaller than the test set; what are the motivations for that? Thank you.

martinpopel · 2023-05-17T06:51:47Z

I agree the "6400 sentences (7.4K tokens)" is suspicious (it would mean 1.2 tokens per sentence) even without looking at the .conllu and stats.xml files. I guess it should be "7.4K words (6.4K words excluding punctuation)".
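To check which reading of the README numbers is right, the counts can be recomputed directly from the .conllu files. The following is only a minimal sketch, not the script used to produce stats.xml: the function name `conllu_counts` is hypothetical, and it assumes standard CoNLL-U layout (tab-separated columns, `#` comment lines, blank lines between sentences, UPOS in the fourth column). Multiword-token ranges such as `1-2` and empty nodes such as `3.1` are not counted as words.

```python
def conllu_counts(text):
    """Return (sentences, words, words_without_punct) for CoNLL-U text."""
    sentences = words = non_punct = 0
    in_sentence = False
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            in_sentence = False
            continue
        if line.startswith("#"):
            # Comment lines (sent_id, text, ...) carry no tokens.
            continue
        cols = line.split("\t")
        if not in_sentence:
            sentences += 1
            in_sentence = True
        if not cols[0].isdigit():
            # Skip multiword-token ranges ("1-2") and empty nodes ("3.1").
            continue
        words += 1
        if len(cols) > 3 and cols[3] != "PUNCT":
            non_punct += 1
    return sentences, words, non_punct


if __name__ == "__main__":
    import sys
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as f:
            print(path, conllu_counts(f.read()))
```

If the guess above is right, the second and third values summed over all files should come out near 7.4K and 6.4K, and the first near 781.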

> the training data segment is 8.5 times smaller than the test set; what are the motivations for that?

See the data split guidelines. For treebanks with fewer than 20k words, they suggest either keeping everything as test data or setting aside 20-50 sentences as "train". Here we see 80 sentences in train with a strange sent_id system (..., 797, 798, 789_, 790_, ... 799_, 800), but the fact that train is 8.5 times smaller than test is OK.
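The odd sent_id values mentioned above can be listed mechanically. This is a rough sketch under one assumption that the thread does not confirm: that the ids in this treebank are meant to be plain integers, so anything else (such as `789_`) is worth a look. The helper name `irregular_sent_ids` is hypothetical.

```python
import re

# Matches CoNLL-U comment lines of the form "# sent_id = <value>".
SENT_ID = re.compile(r"^#\s*sent_id\s*=\s*(\S+)")


def irregular_sent_ids(text):
    """Return sent_id values that are not plain integers, in file order."""
    ids = []
    for line in text.splitlines():
        match = SENT_ID.match(line)
        # Assumption: a "regular" id consists of digits only.
        if match and not match.group(1).isdigit():
            ids.append(match.group(1))
    return ids
```

Running it over the train file would show whether the underscore-suffixed ids are limited to the range quoted above or occur elsewhere as well.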

alexeyev · 2023-05-17T07:02:28Z

> See the data split guidelines.

Thank you for the prompt response and the pointers!

dan-zeman added a commit that referenced this issue May 17, 2023

Fixed count in README (see also #2).

63e753b
