
Cannot replicate pre-computed syntactic distance #17

Open
letme-hj opened this issue Jan 7, 2025 · 0 comments

Hi, thank you for your work!

I have a question about computing the syntactic distance between languages.

If I understand correctly, the pre-computed syntactic distance obtained by

lang2vec.distance("syntactic", [l1, l2])

is the cosine distance between the two languages' feature vectors, so it should be reproducible with

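import lang2vec.lang2vec as lang2vec  # bind the module under the name used in this snippet
from scipy.spatial.distance import cosine

# Feature vectors for the two languages (l1 and l2 are language codes)
a = lang2vec.get_features(l1, "syntax_wals")[l1]
b = lang2vec.get_features(l2, "syntax_wals")[l2]
cosine(a, b)  # cosine distance, i.e. 1 - cosine similarity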

For features missing from a and b (those whose value is --), I followed the approach described here: #7 (comment).
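To make the missing-value handling concrete, here is a minimal sketch of one possible approach: compute the cosine distance only over the features that are present in both vectors. This is an assumption for illustration; it is not necessarily what #7 recommends, nor how the pre-computed distances were produced. The function name `masked_cosine_distance` is hypothetical and not part of lang2vec.

```python
import math

MISSING = "--"  # marker for unavailable feature values, as described above


def masked_cosine_distance(a, b, missing=MISSING):
    """Cosine distance over the positions where both vectors have a value.

    Hypothetical helper for illustration only; not a lang2vec function.
    Returns NaN when no feature is shared or when either masked vector
    has zero norm, since the distance is undefined in those cases.
    """
    pairs = [
        (float(x), float(y))
        for x, y in zip(a, b)
        if x != missing and y != missing
    ]
    if not pairs:
        return math.nan

    dot = sum(x * y for x, y in pairs)
    norm_a = math.sqrt(sum(x * x for x, _ in pairs))
    norm_b = math.sqrt(sum(y * y for _, y in pairs))
    if norm_a == 0.0 or norm_b == 0.0:
        return math.nan

    return 1.0 - dot / (norm_a * norm_b)
```

With this masking, a language whose features are all `--` yields NaN against every other language, which matches the observation that such a language cannot be compared from its raw feature vector alone.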

However, the results do not match. I also tried syntax_knn instead of syntax_wals, but they still do not match.
In addition, some languages that appear in the pre-computed distances have -- for every feature, so their distances to other languages cannot actually be computed. (For example, the syntactic distance between frr and dan is provided, as shown in the README example, but l2v.get_features("frr", "syntax_wals") returns a list of "--" values.)

Below are the average Pearson correlation coefficients and p-values between the pre-computed and manually computed distances for each language.

  • manually computed with syntax_wals vs. pre-computed: coef = 0.6325433738084123, p-value = 0.13051253893837714
  • manually computed with syntax_knn vs. pre-computed: coef = 0.6257979297204636, p-value = 0.1392174749552544
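For reference, a comparison of this kind could be sketched as follows. The exact averaging procedure is not spelled out here, so this assumes one Pearson coefficient is computed per language (its pre-computed distances to the other languages against the manually computed ones) and the coefficients are then averaged. The function names and the dictionary layout are hypothetical. The sketch computes coefficients only; `scipy.stats.pearsonr` would return a p-value as well.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0.0 or std_y == 0.0:
        return math.nan  # undefined when either sequence is constant
    return cov / (std_x * std_y)


def mean_per_language_pearson(precomputed, manual):
    """Average the per-language Pearson coefficients.

    Both arguments are assumed to be dicts mapping a language code to a
    list of distances from that language to the same ordered set of
    other languages (a hypothetical layout chosen for this sketch).
    """
    coefs = [pearson(precomputed[lang], manual[lang]) for lang in precomputed]
    return sum(coefs) / len(coefs)
```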

I would really appreciate it if you could provide more details on how the distance is computed, in case I missed something here!

Thank you so much :)
