What is "scope", "num" columns in the corpus? #19

Open
karmalet opened this issue Feb 6, 2022 · 1 comment
karmalet commented Feb 6, 2022

Hi, may I ask what the "scope" and "num" columns stand for?

In "idioms_pretrain.json":

| idiom | num | explanation |
| --- | --- | --- |
| 偃武崇文 | 0 | 停息武备,崇尚文教。 |
| 洪乔捎书 | 0 | 指言而无信的人。 |
| 南郭先生 | 103 | 比喻无才而占据其位的人。 |

In "idioms_scopes.tsv":

| scope | idiom | id |
| --- | --- | --- |
| Scope I | 见义勇为 | 0 |
| Scope II | 偃武崇文 | 3848 |
| Scope III | 亏于一篑 | 33237 |

In "idiom_synonyms.tsv":

| query | synonym | query_id | synonym_id | overlapping |
| --- | --- | --- | --- | --- |
| 黯然销魂 | 六神无主 | 14726 | 1333 | 0 |
| 黯然销魂 | 丧魂失魄 | 14726 | 2704 | 1 |
| 塞翁失马,焉知非福 | 塞翁失马,安知非福 | 24524 | 32175 | 8 |

I thought "overlapping" was the number of Chinese characters that the two idioms share, but the last row shows 8, where I would expect 7.

Thanks!

Member

Vimos commented Feb 7, 2022

  1. num is the frequency of an idiom in the ChengyuCorpus, which was released in Two-stage.

  2. scope is defined in the screenshot below. The definition was removed from the camera-ready version because Scope III is not used, so we share it here.
    [Image: 企业微信截图_16442221571111 (WeCom screenshot of the scope definitions)]

  3. This is because the comma "," is also counted when computing overlapping.
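
The three rows quoted in the question are consistent with counting the distinct characters the two idioms share, with the comma counted like any other character. The repository's actual implementation is not shown in this thread, so the sketch below is only an assumption that reproduces those three values; the function name `overlapping` and the `count_punctuation` flag are hypothetical.

```python
def overlapping(query: str, synonym: str, count_punctuation: bool = True) -> int:
    """Count distinct characters shared by two idioms.

    Assumption: overlap is the size of the set intersection of characters.
    With count_punctuation=True the comma counts as a shared character.
    """
    shared = set(query) & set(synonym)
    if not count_punctuation:
        # Keep only letter-like characters (Chinese characters pass isalnum()).
        shared = {ch for ch in shared if ch.isalnum()}
    return len(shared)


pairs = [
    ("黯然销魂", "六神无主"),
    ("黯然销魂", "丧魂失魄"),
    ("塞翁失马,焉知非福", "塞翁失马,安知非福"),
]

for query, synonym in pairs:
    print(query, synonym, overlapping(query, synonym))
```

With `count_punctuation=False` the last pair drops from 8 to 7, which is the value the question expected.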

@Vimos Vimos pinned this issue Feb 13, 2022