Sino-Vietnamese word list #7
Comments
Python script for finding Vietnamese words with Han characters:

    # Usage: python get_hans.py [vn_data.jsonl]
    # vn_data.jsonl should be created using wiktextract like so:
    # $ ./wiktwords --all --language Vietnamese --pages-dir pages/en --cache /tmp/wikt-cache --out vn_out.jsonl enwiktionary-latest-pages-articles.xml.bz2 | tee log_vn_1.txt
    # Note that the final output includes Vietnam-only characters; we could not sufficiently differentiate sino-Vietnamese from Vietnam-only characters.
    from collections import defaultdict
    import json
    import regex as re  # third-party `regex` package
    import sys
    import warnings

    # Matches any character in the basic CJK Unified Ideographs block.
    han_re = re.compile(r'[\u4e00-\u9fff]')
    # Raw string, so that \( and \) reach the regex engine unchanged.
    chu_han_form_re = re.compile(r"chữ Hán form of (.+) \(“(.+)”\).")
    # Number of entries found by each detection route.
    vocab_count = defaultdict(int)

    with open(sys.argv[1], 'r', encoding='utf-8') as f:
        for line in f:
            data = json.loads(line)
            found_sino = False
            word = None
            chars = None
            senses = []
            char_glosses = {}
            if not data.get('word'):
                continue
            if data.get('lang', None) != "Vietnamese":
                continue

            # Route 1: the headword itself is written in Han characters.
            if han_re.search(data['word']):
                found_sino = True
                vocab_count['word'] += 1
                chars = data['word']
                for s in data.get('senses', []):
                    if glosses := s.get('glosses', None):
                        for g in glosses:
                            if m := chu_han_form_re.match(g):
                                word = m.group(1)
                                senses.append(m.group(2))

            # Route 2: the etymology uses the vi-etym-sino template.
            if etymology_templates := data.get('etymology_templates', []):
                for et in etymology_templates:
                    if et['name'] == "vi-etym-sino" and not et['expansion'].startswith("Non-Sino"):
                        if not found_sino:
                            vocab_count['etymology templates'] += 1
                        found_sino = True
                        # Only build `chars` from the template if nothing set it earlier.
                        acc_chars = False
                        if not chars:
                            chars = ""
                            acc_chars = True
                        # Args 1, 3, 5 hold characters; args 2, 4, 6 hold their glosses.
                        for i in [1, 3, 5]:
                            if c := et['args'].get(str(i)):
                                char_glosses[c] = et['args'].get(str(i + 1), None)
                                if acc_chars:
                                    chars += c

            # Route 3: one of the listed alternative forms contains Han characters.
            if not found_sino:
                if forms := data.get("forms", []):
                    # Named `form`, not `f`, to avoid shadowing the open file handle.
                    for form in forms:
                        if han_re.search(form['form']):
                            # TODO: Wiktextractor doesn't mark lots of these with the CJK tag
                            # for example https://en.wiktionary.org/wiki/公僕#Vietnamese
                            if "CJK" not in form.get('tags', []):
                                warnings.warn("CJK tag missing from {}".format(form['form']))
                            found_sino = True
                            word = data['word']
                            vocab_count['forms'] += 1
                            chars = form['form']

            if found_sino:
                if not senses:
                    for s in data.get('senses', []):
                        if glosses := s.get('glosses', None):
                            # Appends the whole gloss list, so `senses` becomes a list of lists here.
                            senses.append(s['glosses'])
                # The first column is always data['word']; `word` is not printed.
                print(f"{data['word']}\t{chars}\t{senses}\t{char_glosses}")

    print()
    for k, v in vocab_count.items():
        print(f"{k}: {v}")
    print(f"total: {sum(vocab_count.values())}")

The resulting output is still a bit dirty, but I think we'll be able to filter it nicely using a vocab list. I'm waiting on someone to give me a frequency list.
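The filtering step could look roughly like the sketch below. It assumes the script's tab-separated output has been split into columns and that the frequency list is plain text with one `word<TAB>count` pair per line. That layout is a guess, since the frequency list has not arrived yet, and the names `load_frequencies` and `filter_by_frequency` are hypothetical.

```python
def load_frequencies(lines):
    """Parse 'word<TAB>count' lines into a dict (assumed, unconfirmed format)."""
    freqs = {}
    for line in lines:
        word, _, count = line.rstrip("\n").partition("\t")
        if word and count.strip().isdigit():
            freqs[word] = int(count)
    return freqs


def filter_by_frequency(rows, freqs, min_count=1):
    """Keep rows whose word or chars column appears in the frequency list.

    Each row is the list of columns the script prints:
    word, chars, senses, char_glosses.
    Kept rows are sorted by descending count, with the count appended.
    """
    kept = []
    for row in rows:
        if len(row) < 2:
            continue  # blank lines and the trailing "key: count" summary
        # The script's first column can be either spelling, so check both.
        count = max(freqs.get(col, 0) for col in row[:2])
        if count >= min_count:
            kept.append((count, row))
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [row + [str(count)] for count, row in kept]
```

To use it on saved output, split each line on tabs first, for example `rows = [line.rstrip("\n").split("\t") for line in fh]`.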
We need to collect Sino-Vietnamese words with their associated Han and romanized spellings and frequency information. I've tried to avoid using Wiktionary throughout the project, but the data is actually pretty high quality, and if nothing else it's at least a good starting point.
I ran wiktextract to get Vietnamese entries. A lot more processing has to be done, but for now I'll just upload the resulting dump (and the log).
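As a first look at the dump before heavier processing, a sketch like the one below tallies Vietnamese entries by part of speech and counts headwords that contain Han characters. It assumes each line is one JSON object with `word`, `lang`, and `pos` keys, as wiktextract emits. The function name `summarize_entries` is made up for illustration, and the character range covers only the basic CJK Unified Ideographs block.

```python
import json
import re
from collections import Counter

# Basic CJK Unified Ideographs block only; extension blocks are not covered.
HAN_RE = re.compile(r"[\u4e00-\u9fff]")


def summarize_entries(lines, lang="Vietnamese"):
    """Return (Counter of pos values, number of headwords with Han characters)."""
    pos_counts = Counter()
    han_headwords = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if entry.get("lang") != lang or not entry.get("word"):
            continue
        pos_counts[entry.get("pos", "unknown")] += 1
        if HAN_RE.search(entry["word"]):
            han_headwords += 1
    return pos_counts, han_headwords
```

Since a file object iterates line by line, it can be passed in directly, for example `summarize_entries(open("vn_out.jsonl", encoding="utf-8"))`.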
Archive.zip