This repository contains frequency dictionaries for Japanese terms that can be imported into Yomichan (or any software that supports Yomichan-style frequency dictionaries). The terms are ranked separately for different contexts and domains, so you can pick whichever ones interest you, although I suspect the overall one is the most useful for most people.
The dictionaries here are generated from the data source by running the Python script in the repository. You will need a recent version of Python 3 (3.10 or above) with pandas and jaconv installed. To recreate the dictionaries from scratch, run the following (do it in a virtual environment if you really care, but the dependencies are common enough that you may want them in your system Python anyway):
pip install -r requirements.txt
python3 -i make_cejc_freq_dicts_from_tsv.py
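If the script fails on startup, a quick environment check can narrow things down. The following is a minimal sketch, not part of the repository: the function names are made up for illustration, and it only checks the requirements stated above (Python 3.10+, pandas, jaconv).

```python
import importlib.util
import sys

# Requirements as stated in this README: Python 3.10+, pandas and jaconv.
MIN_PYTHON = (3, 10)
REQUIRED_MODULES = ("pandas", "jaconv")


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_new_enough(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True when the interpreter version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    if not python_is_new_enough():
        found = sys.version.split()[0]
        print(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ is needed, found {found}")
    for name in missing_modules():
        print(f"Missing dependency: {name} (try: pip install -r requirements.txt)")
```

Running it prints nothing when the environment looks fine.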
Running the script produces everything in the repository except for the source data, git metadata, this README and the script itself. The produced JSON files for the dictionaries are inside dicts/, with one folder per domain. Zip files that you can actually import into Yomichan are inside releases/.
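To sanity-check a generated zip before importing it, something like the sketch below may help. It assumes the zips follow the usual Yomichan dictionary layout, i.e. an index.json with the metadata plus one or more term_meta_bank_*.json files holding the frequency entries. I have not verified that against every zip here, and the function name is just illustrative.

```python
import json
import zipfile


def summarize_dictionary_zip(zip_path):
    """Return (index metadata, number of frequency entries) for a dictionary zip."""
    with zipfile.ZipFile(zip_path) as archive:
        # index.json carries the dictionary title, revision and format version.
        index = json.loads(archive.read("index.json"))
        entry_count = 0
        for name in archive.namelist():
            # Assumption: frequency data lives in term_meta_bank_*.json files,
            # each containing a JSON array with one element per entry.
            if name.startswith("term_meta_bank_") and name.endswith(".json"):
                entry_count += len(json.loads(archive.read(name)))
    return index, entry_count
```

Call it with the path to any zip under releases/ and print the result to see the title and entry count.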
There are a whole bunch of different dictionaries produced, as the project ranked words differently based on domain, age and gender of speakers, etc. I don’t know who may care for all of these, but knock yourself out if you do care for specific ones. I would recommend the overall one at least, then pick whichever domains interest you in particular if you want more. See the next section for a table of domains of interest or just browse the releases yourself.
I highly doubt the script can be readily repurposed to parse another source, but if there is a future version of the CEJC project and they keep the same data format, it may just be reusable. Otherwise it may still be of interest for ideas or reference purposes, I guess.
This table just lists some of the more interesting domains, with direct links to the zip files to download and import into Yomichan.
https://www.ninjal.ac.jp/english/research/cr-project/project-3/institute/spoken-language/
The Corpus of Everyday Japanese Conversation (CEJC) is a vocabulary and word count table based on 200 hours of recorded data (approximately from April 2016 to 2020).
“Our project will develop a large-scale corpus of Japanese everyday conversation in a balanced manner. Since informants record their conversations in everyday situations by themselves, naturally occurring conversations can be collected. To build an empirical foundation for the corpus design, we conducted a survey of ordinary conversational behavior of about 250 adults.”
Since there were several ranks included in the file, the overall rank was chosen to generate this frequency dictionary.
Corpus of Everyday Japanese Conversation
The actual source file is 2_cejc_frequencylist_suw_token.tsv, which is inside the 2nd zip file (CEJC短単位語彙表_語彙素のみ_語形別_ver202209.zip) listed on the page, i.e., the 3rd file from the top.
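If you just want to see what is in the source file before running the full script, a small standard-library sketch like this prints the header and the first few rows. It assumes the file is UTF-8 (with or without a BOM) and has a header row. The function name is illustrative, and the repository script itself uses pandas rather than this approach.

```python
import csv
from itertools import islice


def peek_tsv(path, rows=5, encoding="utf-8-sig"):
    """Return (header, first `rows` data rows) of a tab-separated file."""
    with open(path, newline="", encoding=encoding) as handle:
        # QUOTE_NONE keeps any quote characters in the data as literal text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader)
        return header, list(islice(reader, rows))


if __name__ == "__main__":
    header, sample = peek_tsv("2_cejc_frequencylist_suw_token.tsv")
    print(header)
    for row in sample:
        print(row)
```

No column names are assumed, so this is only for eyeballing the layout.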
You can also find it committed to the repository here fwiw. I doubt the original authors have an issue with hosting the file elsewhere given that this came out of academic research, but if the original authors do have an issue with it, let me know and I will remove it.
Yomichan is a pop-up dictionary for Japanese which is no longer in active development but the archived repository is still available at https://github.com/FooSoft/yomichan
Note that the extension released to the Firefox store is an older version. You should install it by sideloading the following file instead: https://github.com/FooSoft/yomichan/releases/download/22.10.23.0/a708116f79104891acbd-22.10.23.0.xpi
See https://github.com/themoeway/yomitan for a heavily WIP successor that isn’t yet ready for the public.
- n-manas released a version of this earlier in 2023 but it doesn’t account for different readings and only contains a subset of the domains
- MarvNC’s listing of dictionaries is how I actually found the dict and the data source and I was gently nudged by him to parse it again for readings I guess lol
- Aquafina-water-bottle for making frequency sorting a thing that people think about. I was working on something tangentially related to that wherefrom I got sidetracked into doing something more directly related to that and then further sidetracked into doing this… oh well