Duplicate entries #13

Open
CallumMcMahon opened this issue Jun 1, 2022 · 1 comment

Comments

@CallumMcMahon

Many entries are exact duplicates. I count 175 in my parsed version of the dataset, and I manually confirmed the first few against the original text.

For example:
26316050 783 790 filling T052 C0441655
26316050 783 790 filling T052 C0441655

27259326 525 529 AHCS T061 C0010408
27259326 525 529 AHCS T061 C0010408

27262362 730 738 increase T169 C0442805
27262362 730 738 increase T169 C0442805

Removing these exact duplicates brings the mention count to 352,321 instead of the documented 352,496.
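
A minimal sketch of how such a count could be reproduced, assuming the annotations have already been parsed into tuples of (PMID, start, end, mention text, semantic types, concept ID). The `rows` sample and the helpers `count_exact_duplicates` and `deduplicate` are hypothetical names for illustration and are not part of the repo.

```python
from collections import Counter

# Hypothetical sample of parsed annotation rows:
# (pmid, start, end, mention text, semantic type(s), UMLS concept ID)
rows = [
    ("26316050", 783, 790, "filling", "T052", "C0441655"),
    ("26316050", 783, 790, "filling", "T052", "C0441655"),
    ("27259326", 525, 529, "AHCS", "T061", "C0010408"),
    ("27259326", 525, 529, "AHCS", "T061", "C0010408"),
    ("28548949", 1809, 1812, "XPA", "T028", "C1337030"),
    ("28548949", 1809, 1812, "XPA", "T116,T123", "C1506534"),
]


def count_exact_duplicates(rows):
    """Return the number of surplus rows: total rows minus distinct rows."""
    counts = Counter(rows)
    return sum(n - 1 for n in counts.values())


def deduplicate(rows):
    """Drop exact duplicates while keeping first-seen order."""
    return list(dict.fromkeys(rows))


print(count_exact_duplicates(rows))  # 2
print(len(deduplicate(rows)))        # 4
```

The two XPA rows differ in their concept ID, so they are not exact duplicates and both survive deduplication.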

There are also rows where the same span refers to different UMLS concepts, which I don't see documented in the repo:
28548949 1809 1812 XPA T028 C1337030
28548949 1809 1812 XPA T116,T123 C1506534
Presumably, if an entity linking model predicts either of these concepts, it should be marked as correct, rather than the span being scored twice and the model always being wrong in one of the two cases?
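
To illustrate the "either concept counts" reading, here is a rough sketch that groups gold rows by span and treats a prediction as correct if it matches any concept recorded for that span. This is only one possible scoring convention, not something the repo documents; `gold_rows`, `concepts_by_span`, and `is_correct_lenient` are hypothetical names.

```python
from collections import defaultdict

# Hypothetical gold rows: (pmid, start, end, UMLS concept ID)
gold_rows = [
    ("28548949", 1809, 1812, "C1337030"),
    ("28548949", 1809, 1812, "C1506534"),
    ("27262362", 730, 738, "C0442805"),
]


def concepts_by_span(rows):
    """Map each (pmid, start, end) span to the set of gold concept IDs."""
    by_span = defaultdict(set)
    for pmid, start, end, cui in rows:
        by_span[(pmid, start, end)].add(cui)
    return dict(by_span)


def is_correct_lenient(by_span, pmid, start, end, predicted_cui):
    """Assumed convention: a prediction is correct if it matches ANY gold concept for the span."""
    return predicted_cui in by_span.get((pmid, start, end), set())


gold = concepts_by_span(gold_rows)

# Spans annotated with more than one concept
ambiguous = {span: cuis for span, cuis in gold.items() if len(cuis) > 1}
print(len(ambiguous))  # 1

print(is_correct_lenient(gold, "28548949", 1809, 1812, "C1506534"))  # True
print(is_correct_lenient(gold, "27262362", 730, 738, "C1506534"))    # False
```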

Could this be caused by the annotation quality step?

@czi-sunil
Contributor

These duplicates are a result of the annotation quality. Exact duplicates can be accounted for in measurements by using sets as the basis, e.g. a set of (start, end, concept-ID) tuples. Having the same span refer to multiple entities will still cause a problem if the model assumes a single entity per span. From your analysis, it looks like the number of these is quite small.
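
A small sketch of the set-based measurement described above, assuming gold and predicted mentions are available as tuples. The document ID is added to each tuple as an assumption, since offsets are only meaningful within a document. The `prf` helper, the sample data, and the placeholder concept ID `C0000000` are hypothetical.

```python
def prf(gold, predicted):
    """Precision, recall and F1 over sets of (pmid, start, end, concept ID) tuples."""
    gold_set, pred_set = set(gold), set(predicted)  # sets collapse exact duplicates
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical gold annotations containing one exact duplicate
gold = [
    ("26316050", 783, 790, "C0441655"),
    ("26316050", 783, 790, "C0441655"),
    ("27262362", 730, 738, "C0442805"),
]

# Hypothetical predictions; "C0000000" is a placeholder for a wrong concept ID
predicted = [
    ("26316050", 783, 790, "C0441655"),
    ("27262362", 730, 738, "C0000000"),
]

print(prf(gold, predicted))  # (0.5, 0.5, 0.5)
```

Because the duplicate gold row collapses into a single set element, it does not inflate the recall denominator. A span with two different gold concepts still contributes two elements, so a model that predicts one concept per span can match at most one of them.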
