Frequency dictionary for yomichan based on the Corpus of Everyday Japanese Conversation dataset

forsakeninfinity/CEJC_yomichan_freq_dict

Corpus of Everyday Japanese Conversation Yomichan Frequency Dictionary

What is this?

This repository contains frequency dictionaries for Japanese terms that can be imported into Yomichan (or any software that supports Yomichan-style frequency dictionaries). Terms are ranked separately for different contexts / domains, so you can pick whichever ones interest you, although I suspect the overall one is the most useful for most people.

The dictionaries here are generated from the data source by running the Python script in this repository. You need Python 3.10 or newer, plus pandas and jaconv. To recreate the dictionaries from scratch, run the following (use a virtual environment if you really care, but the dependencies are common enough that you may want them in your system Python anyway):

pip install -r requirements.txt
python3 -i make_cejc_freq_dicts_from_tsv.py

Running the script produces everything in the repository except for the source data, git metadata, this README and the script itself. The produced JSON files for the dictionaries are inside dicts/ with a folder per domain. Zip files that you can actually install from yomichan are inside releases/.
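For reference, the general shape of a Yomichan-style frequency dictionary archive can be sketched as below. This is a minimal illustration and not the script in this repository: the function name `build_freq_dict_zip`, the `revision` value and the sample entry are assumptions made for the example. It relies only on the commonly used layout of an `index.json` file next to a `term_meta_bank_1.json` file whose entries look like `[term, "freq", {"reading": ..., "frequency": ...}]`.

```python
import io
import json
import zipfile


def build_freq_dict_zip(title, entries):
    """Return the bytes of a Yomichan-style frequency dictionary zip.

    entries: iterable of (term, reading, rank) tuples.
    Hypothetical helper for illustration only.
    """
    # Minimal index metadata; "revision" is an arbitrary placeholder string.
    index = {"title": title, "format": 3, "revision": "example"}

    # One metadata entry per (term, reading) pair, carrying its rank.
    bank = [
        [term, "freq", {"reading": reading, "frequency": rank}]
        for term, reading, rank in entries
    ]

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("index.json", json.dumps(index, ensure_ascii=False))
        archive.writestr(
            "term_meta_bank_1.json", json.dumps(bank, ensure_ascii=False)
        )
    return buffer.getvalue()
```

Writing the returned bytes to a `.zip` file should give something in the same format as the archives under releases/, assuming the layout described above.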

A whole bunch of different dictionaries are produced because the project ranked words separately by domain, by age and gender of speakers, and so on. I don’t know who may care for all of these, but knock yourself out if you want specific ones. I would recommend the overall one at least, then whichever domains interest you in particular. See the next section for a table of domains of interest, or just browse the releases yourself.

I doubt the script can be readily repurposed to parse another source, but if a future version of the CEJC project keeps the same data format, it may be reusable. Otherwise it may still be of interest for ideas or reference.

Potentially interesting domains

This table lists some of the more interesting domains, with direct links to the zip files to download and import into Yomichan.

Domain | Description | Download Link
Combined / Overall | Frequency without considering domains and other qualifiers | Corpus of Everyday Japanese Conversation.zip
男性 | Male conversations without considering age | Corpus of Everyday Japanese Conversation (男性).zip
女性 | Female conversations without considering age | Corpus of Everyday Japanese Conversation (女性).zip
交通機関 | Transportation | Corpus of Everyday Japanese Conversation (交通機関).zip
会議・会合 | Conferences & Meetings | Corpus of Everyday Japanese Conversation (会議・会合).zip
公共商業施設 | Public commercial facilities | Corpus of Everyday Japanese Conversation (公共商業施設).zip
学校 | School | Corpus of Everyday Japanese Conversation (学校).zip
室内 | Indoors | Corpus of Everyday Japanese Conversation (室内).zip
屋外 | Outdoors | Corpus of Everyday Japanese Conversation (屋外).zip
授業・レッスン | Class / Lesson | Corpus of Everyday Japanese Conversation (授業・レッスン).zip
用談・相談 | Chat / Consultation | Corpus of Everyday Japanese Conversation (用談・相談).zip
職場 | Workplace | Corpus of Everyday Japanese Conversation (職場).zip
自宅 | Inside one’s own home | Corpus of Everyday Japanese Conversation (自宅).zip
雑談 | Small talk | Corpus of Everyday Japanese Conversation (雑談).zip

Source

Project Website (in English)

https://www.ninjal.ac.jp/english/research/cr-project/project-3/institute/spoken-language/

Summary from website

The Corpus of Everyday Japanese Conversation (CEJC) is a vocabulary and word count table based on 200 hours of recorded data (approximately from April 2016 to 2020).

“Our project will develop a large-scale corpus of Japanese everyday conversation in a balanced manner. Since informants record their conversations in everyday situations by themselves, naturally occurring conversations can be collected. To build an empirical foundation for the corpus design, we conducted a survey of ordinary conversational behavior of about 250 adults.”

Since there were several ranks included in the file, the overall rank was chosen to generate this frequency dictionary.

Data download URL

Corpus of Everyday Japanese Conversation

The actual source file is 2_cejc_frequencylist_suw_token.tsv which is inside the 2nd zip file (CEJC短単位語彙表_語彙素のみ_語形別_ver202209.zip) listed in the page, i.e., the 3rd file from the top.
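As a rough idea of how a tab-separated frequency list like this can be turned into ranked entries, here is a small sketch using only the standard library. It is not the repository's script (which uses pandas), and the column names `term`, `reading` and `overall_rank` are placeholders invented for the example; check the real header row of the TSV before adapting it.

```python
import csv

# Hypothetical column names; the real TSV header will differ.
TERM_COLUMN = "term"
READING_COLUMN = "reading"
RANK_COLUMN = "overall_rank"


def read_ranked_terms(tsv_file, rank_column=RANK_COLUMN):
    """Read (term, reading, rank) tuples from an open TSV file.

    Rows with an empty rank in the chosen column are skipped, and the
    result is sorted by ascending rank.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    rows = []
    for row in reader:
        rank_text = (row[rank_column] or "").strip()
        if not rank_text:
            continue  # no rank recorded for this term in this column
        rows.append((row[TERM_COLUMN], row[READING_COLUMN], int(rank_text)))
    return sorted(rows, key=lambda item: item[2])
```

Passing a different `rank_column` would be one way to produce a per-domain ranking from the same file, assuming each domain has its own rank column.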

You can also find it committed to the repository here fwiw. I doubt the original authors have an issue with hosting the file elsewhere given that this came out of academic research, but if the original authors do have an issue with it, let me know and I will remove it.

About Yomichan

Yomichan is a pop-up dictionary for Japanese. It is no longer in active development, but the archived repository is still available at https://github.com/FooSoft/yomichan

Note that the extension released to the Firefox store is an older version. You should install it by sideloading the following file instead: https://github.com/FooSoft/yomichan/releases/download/22.10.23.0/a708116f79104891acbd-22.10.23.0.xpi

See https://github.com/themoeway/yomitan for a heavily WIP successor that isn’t yet ready for the public.

Credits

  • n-manas released a version of this earlier in 2023 but it doesn’t account for different readings and only contains a subset of the domains
  • MarvNC’s listing of dictionaries is how I actually found the dict and the data source and I was gently nudged by him to parse it again for readings I guess lol
  • Aquafina-water-bottle for making frequency sorting a thing that people think about. I was working on something tangentially related to that, got sidetracked into doing something more directly related to it, and then got further sidetracked into doing this… oh well
