
Sino-Vietnamese word list #7

Open
garfieldnate opened this issue Aug 11, 2022 · 1 comment

@garfieldnate (Owner)
We need to collect Sino-Vietnamese words with their associated han and roman spellings and frequency information. I've tried to avoid using Wiktionary throughout the project, but the data is actually pretty high quality, and if nothing else it's at least a good starting point.

I ran wiktextract to get Vietnamese entries. A lot more processing has to be done, but for now I'll just upload the resulting dump (and the log).

Archive.zip
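For anyone poking at the dump before the full processing script lands, a minimal sketch of reading wiktextract's JSONL output (one JSON object per line; the `word` and `lang` field names are the ones the script below relies on, but the sample records here are illustrative, not taken from the actual dump):

```python
import json

# Illustrative records in wiktextract's JSONL shape (one JSON object per line).
# Real dump lines carry many more fields (senses, forms, etymology_templates, ...).
sample_jsonl = "\n".join([
    json.dumps({"word": "quốc gia", "lang": "Vietnamese"}),
    json.dumps({"word": "国家", "lang": "Chinese"}),
    json.dumps({"word": "nhà nước", "lang": "Vietnamese"}),
])

# Count entries per language as a sanity check before heavier processing.
counts = {}
for line in sample_jsonl.splitlines():
    data = json.loads(line)
    counts[data["lang"]] = counts.get(data["lang"], 0) + 1

print(counts)  # → {'Vietnamese': 2, 'Chinese': 1}
```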

@garfieldnate (Owner)

Python script for finding Vietnamese words with Han character spellings:

# Usage: python get_hans.py [vn_data.jsonl]
# vn_data.jsonl should be created using wiktextract like so:
# $ ./wiktwords --all --language Vietnamese --pages-dir pages/en --cache /tmp/wikt-cache --out vn_out.jsonl enwiktionary-latest-pages-articles.xml.bz2 | tee log_vn_1.txt
# Note that the final output includes Vietnam-only characters; we could not sufficiently differentiate Sino-Vietnamese from Vietnam-only characters.

from collections import defaultdict
import json
import regex as re
import sys
import warnings

han_re = re.compile(r'[\u4e00-\u9fff]')
# Raw string so the \( escapes reach the regex engine; the trailing dot is escaped too.
chu_han_form_re = re.compile(r"chữ Hán form of (.+) \(“(.+)”\)\.")

vocab_count = defaultdict(int)
with open(sys.argv[1], 'r') as f:
    for line in f:
        data = json.loads(line)
        found_sino = False
        word = None
        chars = None
        senses = []
        char_glosses = {}

        if not data.get('word'):
            continue

        if data.get('lang', None) != "Vietnamese":
            continue

        if han_re.search(data['word']):
            found_sino = True
            vocab_count['word'] += 1
            chars = data['word']
            for s in data.get('senses', []):
                if glosses := s.get('glosses', None):
                    for g in glosses:
                        if m := chu_han_form_re.match(g):
                            word = m.group(1)
                            senses.append(m.group(2))

        if etymology_templates := data.get('etymology_templates', []):
            for et in etymology_templates:
                if et['name'] == "vi-etym-sino" and not et['expansion'].startswith("Non-Sino"):
                    if not found_sino:
                        vocab_count['etymology templates'] += 1
                    found_sino = True
                    acc_chars = False
                    if not chars:
                        chars = ""
                        acc_chars = True
                    for i in [1,3,5]:
                        if c := et['args'].get(str(i)):
                            char_glosses[c] = et['args'].get(str(i+1), None)
                            if acc_chars:
                                chars += c

        if not found_sino:
            if forms := data.get("forms", []):
                # Don't reuse `f` here: it would shadow the open file handle above.
                for form in forms:
                    if han_re.search(form['form']):
                        # TODO: Wiktextractor doesn't mark lots of these with the CJK tag
                        # for example https://en.wiktionary.org/wiki/公僕#Vietnamese
                        if "CJK" not in form.get('tags', []):
                            warnings.warn("CJK tag missing from {}".format(form['form']))
                        found_sino = True
                        word = data['word']
                        vocab_count['forms'] += 1
                        chars = form['form']

        if found_sino:
            if not senses:
                for s in data.get('senses', []):
                    if glosses := s.get('glosses', None):
                        # extend, not append: otherwise we'd nest lists inside `senses`
                        senses.extend(glosses)

            print(f"{data['word']}\t{chars}\t{senses}\t{char_glosses}")

print()
for k, v in vocab_count.items():
    print(f"{k}: {v}")
print(f"total: {sum(vocab_count.values())}")
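As a quick sanity check on the gloss pattern, here is how a regex of the same shape as `chu_han_form_re` behaves on a gloss in the format it targets (the example gloss is made up for illustration; the curly quotes are matched literally):

```python
import re

# Same shape as the script's chu_han_form_re, written as a raw string so the
# backslash escapes survive. The curly quotes “ ” are literal characters.
chu_han_form_re = re.compile(r"chữ Hán form of (.+) \(“(.+)”\)\.")

# Made-up gloss in the shape Wiktionary uses for chữ Hán entries.
gloss = "chữ Hán form of công bộc (“public servant”)."
m = chu_han_form_re.match(gloss)
print(m.group(1))  # → công bộc
print(m.group(2))  # → public servant
```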

And the resulting output:
vn_chars.tsv.zip

It's still a bit dirty, but I think we'll be able to filter it nicely using a vocab list. I'm waiting on someone to give me a frequency list.
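Once a frequency list materializes, the filtering step could look roughly like this. The TSV column layout matches the script's print line above, but the frequency-list format, the cutoff, and the sample rows are all assumptions for the sketch:

```python
# Sketch of filtering the word\tchars\tsenses\tchar_glosses TSV rows against a
# frequency list. In-memory data stands in for vn_chars.tsv and the
# (hypothetical) frequency list; both formats are assumptions.
tsv_rows = [
    "quốc gia\t國家\t['country; nation']\t{}",
    "xyzzy\t□\t[]\t{}",  # junk row we expect to drop
]

# Hypothetical frequency list: word -> occurrences in some corpus.
freq = {"quốc gia": 1520, "nhà nước": 980}

MIN_FREQ = 10  # assumed cutoff
kept = [
    row for row in tsv_rows
    if freq.get(row.split("\t")[0], 0) >= MIN_FREQ
]
for row in kept:
    print(row.split("\t")[0])  # prints only words that clear the cutoff
```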
