
Sino-Vietnamese word list #7

Open
garfieldnate opened this issue Aug 11, 2022 · 1 comment

@garfieldnate (Owner)
We need to collect Sino-Vietnamese words with their associated han and roman spellings and frequency information. I've tried to avoid using Wiktionary throughout the project, but the data is actually pretty high quality, and if nothing else it's at least a good starting point.

I ran wiktextract to get Vietnamese entries. A lot more processing has to be done, but for now I'll just upload the resulting dump (and the log).

Archive.zip
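For anyone poking at the dump before the full processing script lands, a minimal sketch of reading wiktextract's JSONL output (one JSON object per line; the `word` and `lang` field names are the ones the script below relies on, but the sample records here are illustrative, not taken from the actual dump):

```python
import json

# Illustrative records in wiktextract's JSONL shape (one JSON object per line).
# Real dump lines carry many more fields (senses, forms, etymology_templates, ...).
sample_jsonl = "\n".join([
    json.dumps({"word": "quốc gia", "lang": "Vietnamese"}),
    json.dumps({"word": "国家", "lang": "Chinese"}),
    json.dumps({"word": "nhà nước", "lang": "Vietnamese"}),
])

# Count entries per language as a sanity check before heavier processing.
counts = {}
for line in sample_jsonl.splitlines():
    data = json.loads(line)
    counts[data["lang"]] = counts.get(data["lang"], 0) + 1

print(counts)  # → {'Vietnamese': 2, 'Chinese': 1}
```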

@garfieldnate (Owner)

Python script for finding Vietnamese words with Han character spellings:

# Usage: python get_hans.py [vn_data.jsonl]
# vn_data.jsonl should be created using wiktextract like so:
# $ ./wiktwords --all --language Vietnamese --pages-dir pages/en --cache /tmp/wikt-cache --out vn_out.jsonl enwiktionary-latest-pages-articles.xml.bz2 | tee log_vn_1.txt
# Note that the final output includes Vietnam-only characters; we could not sufficiently differentiate Sino-Vietnamese from Vietnam-only characters.

from collections import defaultdict
import json
import regex as re
import sys
import warnings

han_re = re.compile(r'[\u4e00-\u9fff]')
# Raw string so the \( escapes reach the regex engine; the trailing dot is escaped too.
chu_han_form_re = re.compile(r"chữ Hán form of (.+) \(“(.+)”\)\.")

vocab_count = defaultdict(int)
with open(sys.argv[1], 'r') as f:
    for line in f:
        data = json.loads(line)
        found_sino = False
        word = None
        chars = None
        senses = []
        char_glosses = {}

        if not data.get('word'):
            continue

        if data.get('lang', None) != "Vietnamese":
            continue

        if han_re.search(data['word']):
            found_sino = True
            vocab_count['word'] += 1
            chars = data['word']
            for s in data.get('senses', []):
                if glosses := s.get('glosses', None):
                    for g in glosses:
                        if m := chu_han_form_re.match(g):
                            word = m.group(1)
                            senses.append(m.group(2))

        if etymology_templates := data.get('etymology_templates', []):
            for et in etymology_templates:
                if et['name'] == "vi-etym-sino" and not et['expansion'].startswith("Non-Sino"):
                    if not found_sino:
                        vocab_count['etymology templates'] += 1
                    found_sino = True
                    acc_chars = False
                    if not chars:
                        chars = ""
                        acc_chars = True
                    for i in [1,3,5]:
                        if c := et['args'].get(str(i)):
                            char_glosses[c] = et['args'].get(str(i+1), None)
                            if acc_chars:
                                chars += c

        if not found_sino:
            if forms := data.get("forms", []):
                # Don't reuse `f` here: it would shadow the open file handle above.
                for form in forms:
                    if han_re.search(form['form']):
                        # TODO: Wiktextractor doesn't mark lots of these with the CJK tag
                        # for example https://en.wiktionary.org/wiki/公僕#Vietnamese
                        if "CJK" not in form.get('tags', []):
                            warnings.warn("CJK tag missing from {}".format(form['form']))
                        found_sino = True
                        word = data['word']
                        vocab_count['forms'] += 1
                        chars = form['form']

        if found_sino:
            if not senses:
                for s in data.get('senses', []):
                    if glosses := s.get('glosses', None):
                        # extend, not append: otherwise we'd nest lists inside `senses`
                        senses.extend(glosses)

            print(f"{data['word']}\t{chars}\t{senses}\t{char_glosses}")

print()
for k, v in vocab_count.items():
    print(f"{k}: {v}")
print(f"total: {sum(vocab_count.values())}")
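As a quick sanity check on the gloss pattern, here is how a regex of the same shape as `chu_han_form_re` behaves on a gloss in the format it targets (the example gloss is made up for illustration; the curly quotes are matched literally):

```python
import re

# Same shape as the script's chu_han_form_re, written as a raw string so the
# backslash escapes survive. The curly quotes “ ” are literal characters.
chu_han_form_re = re.compile(r"chữ Hán form of (.+) \(“(.+)”\)\.")

# Made-up gloss in the shape Wiktionary uses for chữ Hán entries.
gloss = "chữ Hán form of công bộc (“public servant”)."
m = chu_han_form_re.match(gloss)
print(m.group(1))  # → công bộc
print(m.group(2))  # → public servant
```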

And the resulting output:
vn_chars.tsv.zip

It's still a bit dirty, but I think we'll be able to filter it nicely using a vocab list. I'm waiting on someone to give me a frequency list.
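Once a frequency list materializes, the filtering step could look roughly like this. The TSV column layout matches the script's print line above, but the frequency-list format, the cutoff, and the sample rows are all assumptions for the sketch:

```python
# Sketch of filtering the word\tchars\tsenses\tchar_glosses TSV rows against a
# frequency list. In-memory data stands in for vn_chars.tsv and the
# (hypothetical) frequency list; both formats are assumptions.
tsv_rows = [
    "quốc gia\t國家\t['country; nation']\t{}",
    "xyzzy\t□\t[]\t{}",  # junk row we expect to drop
]

# Hypothetical frequency list: word -> occurrences in some corpus.
freq = {"quốc gia": 1520, "nhà nước": 980}

MIN_FREQ = 10  # assumed cutoff
kept = [
    row for row in tsv_rows
    if freq.get(row.split("\t")[0], 0) >= MIN_FREQ
]
for row in kept:
    print(row.split("\t")[0])  # prints only words that clear the cutoff
```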
