This repository has been archived by the owner on Feb 25, 2023. It is now read-only.

[Feature Request] add support for JMdict note, see, and ant #1165

Open
Thermospore opened this issue Dec 27, 2020 · 12 comments
Labels
dictionary format Issue is related to a dictionary formatting problem

Comments

@Thermospore
Contributor

These often contain critical information. I don't think JMdict should be considered complete without them.

Here are some entries that use these, for reference (though there are probably better entries to use as an example of their importance):
http://www.edrdg.org/jmdictdb/cgi-bin/entr.py?svc=jmdict&sid=&q=1465580.1
http://www.edrdg.org/jmdictdb/cgi-bin/entr.py?svc=jmdict&sid=&q=1456360.1
http://www.edrdg.org/jmdictdb/cgi-bin/entr.py?svc=jmdict&sid=&q=1339460.1

Screenshots of what I'm referring to specifically
image
image
image

(I think this should technically be in the yomichan import github, but this seems to be where the main activity happens)

@toasted-nutbread toasted-nutbread added the dictionary format Issue is related to a dictionary formatting problem label Dec 27, 2020
@Thermospore
Contributor Author

Looks like they store the notes as s_inf in the jmdict xml file. Judging from https://godoc.org/github.com/FooSoft/jmdict I think if you just slap term.addDefinitionTags(sense.Information...) in line 181 of https://github.com/FooSoft/yomichan-import/blob/master/edict.go then notes would show up in the definition tags in yomichan?
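
For reference, a minimal sketch of pulling the note, cross-reference, and antonym text out of one entry. It is written in Python rather than Go, and the sample entry is hand-written for illustration (it is not copied from the real file). The element names s_inf, xref, and ant are the ones JMdict uses; the helper name sense_extras is made up here.

```python
import xml.etree.ElementTree as ET

# A hand-written fragment in the shape of a JMdict <entry>; illustrative only.
SAMPLE = """
<entry>
  <ent_seq>0000001</ent_seq>
  <sense>
    <s_inf>now mostly used in idioms</s_inf>
    <xref>例・れい・1</xref>
    <ant>反対</ant>
    <gloss>example gloss</gloss>
  </sense>
</entry>
"""

def sense_extras(sense):
    """Collect the note, cross-reference, and antonym texts of one <sense>."""
    return {
        "notes": [e.text for e in sense.findall("s_inf")],
        "see": [e.text for e in sense.findall("xref")],
        "ant": [e.text for e in sense.findall("ant")],
    }

entry = ET.fromstring(SAMPLE)
extras = [sense_extras(s) for s in entry.findall("sense")]
print(extras)
```

The real file also declares XML entities for its tags, which this fragment deliberately avoids, so a full importer would need to handle those as well.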

@FooSoft
Owner

FooSoft commented Dec 31, 2020

@toasted-nutbread I am going to make a pass on yomichan-import to clean up some of the dictionary processing (especially for EPWING), let me know if you have a list of format annoyances to take care of while I am at it.

@toasted-nutbread
Collaborator

toasted-nutbread commented Dec 31, 2020

@FooSoft https://github.com/FooSoft/yomichan/labels/dictionary format lists all the issues I am aware of. Most of them relate to Kenkyusha, with some being somewhat cosmetic (due to the large size of definitions), whereas a few others impact functionality (terms not marked as verbs, or unexpected reading/expression form).

Related to this issue: if we want to support see/note/ant, it may be best to present them differently than standard definitions (as seen in the images, e.g. brackets, crosslinks). This may require some metadata updates to the dictionary format Yomichan uses, perhaps something similar to how image definitions work.

["画像", "がぞう", "tag1 tag2", "", 33, ["definition1a (画像, がぞう)", {"type": "image", "path": "image.gif", "width": 350, "height": 350, "description": "An image", "pixelated": true}], 5, "tag3 tag4 tag5"]

E.g.

{"type": "note", "content": "Note content"}
{"type": "see", "expression": "画像", "reading": "がぞう", /*...*/}
/*...*/
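
To make the idea concrete, here is a rough sketch of how a consumer might tell plain definitions apart from the proposed typed items. The "note" and "see" shapes are only the ones proposed above, not part of any released format, and render_item plus its output wording are hypothetical.

```python
def render_item(item):
    """Turn one glossary item into display text.

    Plain strings are ordinary definitions. The "note" and "see" shapes
    follow the proposal in this comment and are assumptions.
    """
    if isinstance(item, str):
        return item
    kind = item.get("type")
    if kind == "note":
        return "[" + item["content"] + "]"
    if kind == "see":
        return "see: " + item["expression"] + " (" + item["reading"] + ")"
    # Unknown types are skipped instead of failing the whole entry.
    return ""

glossary = [
    "definition1a",
    {"type": "note", "content": "Note content"},
    {"type": "see", "expression": "画像", "reading": "がぞう"},
]
rendered = [render_item(i) for i in glossary]
print(rendered)
```

Dispatching on the type field keeps plain-string definitions working unchanged, which is the same approach the image definitions take.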

@Thermospore
Contributor Author

Some examples of entries where you don't really get it without the notes/refs. I also notice the entries where notes/refs are important tend to also be entries where the monolingual dicts aren't very helpful. But luckily at this point I've developed a spidey sense for when I should go check jisho/jmdictdb for them haha

http://www.edrdg.org/jmdictdb/cgi-bin/entr.py?svc=jmdict&sid=&q=2120780.1
image

http://www.edrdg.org/jmdictdb/cgi-bin/entr.py?svc=jmdict&sid=&q=2430230.1
image

http://www.edrdg.org/jmdictdb/cgi-bin/entr.py?svc=jmdict&sid=&q=1599390.1
image

@35122

35122 commented Mar 1, 2021

I think it might be best to just import every tag from https://www.edrdg.org/jmdictdb/cgi-bin/edhelp.py?svc=jmdict&sid=#insyntax

@toasted-nutbread
Collaborator

#2089 adds link support, which can handle cross references to some extent.

@stephenmk
Contributor

stephenmk commented Mar 29, 2022

I've been working on a fork of yomichan-import which adds this information (usage notes, cross references, antonyms, and language-of-origin info) to the term glossaries within yomichan, and I think the results are very promising.

Images are linked below.

Here is a copy of the English dictionary file if you'd like to try it:
jmdict_english_info_glosses_2022_03_29.zip
jmdict_english_info_glosses_2022_03_30.zip (updated)
jmdict_english_info_glosses_2022_04_02.zip (updated)

Click descriptions to expand images


Cross references to other terms (indicated by a right arrow ➡) and references from other terms (left arrow ⬅). Terms that both reference and are referenced by the same sense will show 🔄.
空気が読めない

KY

閑古鳥 and 閑古鳥が鳴く

cuckoo


Antonyms (indicated by right-left arrows ⇄)

欠席 and 出席

kesseki


Loanwords (language of origin information)
アルバイト

arubaito


Words loaned from multiple languages
コンビナートキャンペーン

kombinat


Wasei words
スキンシップ

skinship


Usage notes (《now mostly used in idioms》, 《also written as 訓む》, etc.)
読む

yomu


Gloss types ("literally," "figuratively," "explanatory," and "trademark")
ちりも積もれば山となる

chiri

Note that @FooSoft's jmdict parsing library needs to be patched to get this to work FooSoft/jmdict#2 (Update: this patch has been merged)
This could resolve issue #2057


Cross reference information seems to be matching up well with the information displayed in JMdictDB.

和歌 in yomichan

waka

和歌 in JMdictDB

waka2

(Update) 和歌 after moving all the reference types into one gloss each. This is an extreme example; the vast majority of terms only have a few references at most.

waka update


Even for tricky cross references to kanji that belong to many different entries, everything seems to be showing up where it's supposed to be.

moto


Issues

  1. Most of this information probably doesn't belong in the term glossaries. Once the new data structures that are needed to contain this information are decided upon, it should be trivial to update the program to append the data to those lists rather than the glossary list.
  2. "Sense number" tags have been added to every term whose entry in JMdict contains more than one sense in the relevant language. References in JMdict specify the sense of the referenced term, so these tags provide essential information. We cannot rely upon the HTML ordered-list indices in yomichan to determine the sense number of a term, for a couple of reasons. First, if the user has multiple dictionaries installed and JMdict is not set to top priority, then the JMdict indices will all be dynamically offset by those different dictionary terms. Second, even if JMdict is the only dictionary installed, yomichan-import will shuffle the order of the senses around based upon whether or not the sense is archaic, contains kanji that are restricted to a certain reading, etc. For example, notice that sense #9 of 元・もと is underneath sense #10 in the screenshot of 元 above.

    I placed these sense number tags on the left because that's where you'd normally expect indices to be, but I can see how having them to the right of the "JMdict (English)" tag could be preferable too. However, I haven't looked into how to make definition tags show up to the right of dictionary name tags.
  3. I know structured content links have been added to yomichan recently (Structured content links #2089), but I haven't built a copy of yomichan using that code to try it out. I anticipated this in my code and it should be easy to rebuild the dictionary with query prefixes ("?query=") in the references when that functionality becomes available. It would be cool if we could use the sense number in the query somehow to find the term with the matching sense number tag, but that sounds like it could be complicated.
  4. Cross references in JMdict are currently provided as strings of kanji and/or kana. In the near future, these are planned to be replaced or supplemented with sequence ID values. In order to produce a list of references from other terms (indicated by left arrows ⬅ in the screenshots above), it is necessary to use heuristics to determine which JMdict entry these references are truly referring to. Without these heuristics, we would end up with false positives (e.g. 【ご本・ごほん】 contains a reference to 本, so we would see a reference from ご本 on both the entries for 【本・ほん】 and 【本・もと】).

    The process of converting these references into sequence ID values is extremely computationally expensive. For every reference in JMdict (30k of them), we basically have to check every kanji and kana pair on every entry in JMdict (over 195k) to determine which one is the best match. I enabled multi-threading on this part of the program, which reduced the computation time from about 10 minutes to 1 minute on my PC (an Intel i7-3770K). This still could be very slow on older hardware. At any rate, we can do away with this process entirely once we're able to pull sequence numbers from references in JMdict.

    (Update: I changed the program to use hashmaps instead of string comparisons and it's just as fast as it was originally now. Pretty sure I learned that in college once.)
  5. I developed these changes with an eye towards multilingual support, but it turns out that the English sense nodes in JMdict are the only nodes that contain any of this information. As of today, English-language nodes in JMdict contain 3,309 "gloss type" attributes; 5,360 usage notes; 979 antonym references; 30,868 cross references; and 3,523 language-of-origin notes. For every other set of language nodes in JMdict, each of those numbers is zero.

    Perhaps the language-of-origin and antonym information would be useful to transfer over to other language terms from English. I imagine a lot of yomichan users who use these multilingual dictionaries are also using the English dictionary, so duplicating this information may only add extra clutter.
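
The hash-map approach mentioned in the update to issue 4 can be sketched roughly as follows. This is a simplified illustration in Python, not the code from the fork: the sequence numbers are made up, the function names are hypothetical, and the parser assumes a reference looks like kanji・reading・sense with the last two parts optional. Headwords that themselves contain ・ would need extra handling.

```python
from collections import defaultdict

# Toy data: (sequence number, kanji form, reading). Sequence numbers are made up.
ENTRIES = [
    (1, "本", "ほん"),
    (2, "元", "もと"),
    (2, "本", "もと"),
    (3, "ご本", "ごほん"),
]

def build_index(entries):
    """Map each kanji form, reading, and (kanji, reading) pair to sequence numbers."""
    index = defaultdict(set)
    for seq, kanji, reading in entries:
        index[kanji].add(seq)
        index[reading].add(seq)
        index[(kanji, reading)].add(seq)
    return index

def parse_xref(text):
    """Split a reference into a lookup key and an optional sense number."""
    parts = text.split("・")
    sense = None
    if len(parts) > 1 and parts[-1].isdecimal():
        sense = int(parts.pop())
    key = tuple(parts) if len(parts) == 2 else parts[0]
    return key, sense

def resolve(index, text):
    """Return candidate sequence numbers and the referenced sense, if any."""
    key, sense = parse_xref(text)
    return sorted(index.get(key, ())), sense

index = build_index(ENTRIES)
print(resolve(index, "本"))            # two candidates: heuristics still needed
print(resolve(index, "本・ほん・1"))    # one candidate
```

Each lookup is a single dictionary access instead of a scan over every entry, which is why the cost no longer grows with the number of references times the number of entries. Ambiguous references such as a bare 本 still return several candidates and need a tie-breaking heuristic.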

I think that's everything of any importance that I have for now. Let me know what you think. I don't think it would be an exaggeration to say that I would not have learned Japanese if not for yomichan (at the very least, it would have been much more difficult), so it would make me happy to be able to contribute back to the project.

@FooSoft
Owner

FooSoft commented Mar 30, 2022

Awesome work, @stephenmk. I've merged in the changes to the jmdict library used by Yomichan-Import. I'm not too concerned about the performance implication you mention since the dictionary processing happens offline and normal users do not have to care about it. Feel free to shoot me a PR when you feel like things are in a good state for it :)

@stephenmk
Contributor

Thanks @FooSoft. I thought I was finished yesterday but I already thought of several things to polish that I implemented today. I'll let you know when things settle down.

I'll shoot a message to Marcusjmdict, who I suspect will be interested in this and could offer some valuable feedback.

@Marcusjmdict

Thank you very much @stephenmk, very interesting. I don't really have time to look into this at the moment but I'll try and remember to do it in a couple of weeks. Might be worth posting about this on the jmdict github too, or on the jmdict mailing list.

@birtles

birtles commented Mar 31, 2022

I'm not sure if it's helpful for comparing, but hikibiki also includes each of these except cross-references from other words. (10ten/Rikaichamp includes most of this metadata but currently hides cross-references/antonyms since they're a bit less useful when you can't click them to look them up.)

One minor detail we encountered is that when presenting the foreign language terms we try to mark them up with appropriate language tagging so a suitable font will be selected. To do that we end up translating JMdict's ISO-639-2 language codes into BCP-47 language codes that can be directly used in lang attributes.
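
As an illustration of that translation step, here is a small Python sketch. The table covers only a handful of languages and is not taken from any of the projects mentioned; the names ISO639_2_TO_BCP47 and to_bcp47 are hypothetical.

```python
# Three-letter ISO 639-2 codes mapped to the shorter subtags that BCP-47 prefers.
# Only a few languages are listed; a real table would need many more.
ISO639_2_TO_BCP47 = {
    "eng": "en", "ger": "de", "fre": "fr", "dut": "nl", "por": "pt",
    "spa": "es", "ita": "it", "rus": "ru", "chi": "zh", "kor": "ko",
}

def to_bcp47(code):
    """Return a value usable in a lang attribute, or None if the code is unmapped."""
    return ISO639_2_TO_BCP47.get(code)

print(to_bcp47("ger"))  # de
```

Returning None for unmapped codes lets the caller omit the lang attribute instead of emitting a guess.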

@stephenmk
Contributor

stephenmk commented Apr 4, 2022

I again have some noteworthy updates to share. I hope that I can now call this finished for the near future, but we'll see if I can go more than a day without finding something else to tweak and improve.

Here's the latest build of the dictionary file: jmdict_english_info_glosses_2022_04_04.zip
⚠ Note that this will only work with the development version of yomichan ⚠
The file will fail to pass validation during import with the current release version of yomichan, version 22.2.2.0, because the structured content link feature (#2089) has only recently been implemented.


Structured Content Links

Structured content links for internal queries are now included in reference notes. It seems that it is only possible to query one expression at a time (i.e., either kanji or kana, but not both). It would be cool if we could query by sequence number once the JMdict file begins to include that information in the references in the future. (I have converted these references into sequence numbers using heuristics, but it is impossible for this procedure to be 100% accurate).
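
A rough sketch of what building such a reference might look like, in Python for brevity. The object layout (a "structured-content" wrapper holding an "a" node with an href) is an assumption modeled on the image definitions; the helper names are hypothetical, and whether the query value should be percent-encoded is also an assumption.

```python
from urllib.parse import quote

def query_link(expression, label=None):
    """Build an anchor node whose href starts with the '?query=' prefix."""
    return {
        "tag": "a",
        "href": "?query=" + quote(expression),
        "content": label if label is not None else expression,
    }

def reference_gloss(arrow, expression):
    """Wrap an arrow marker and a link in one glossary item (assumed shape)."""
    return {
        "type": "structured-content",
        "content": [arrow + " ", query_link(expression)],
    }

gloss = reference_gloss("➡", "訓")
print(gloss)
```

Because only one expression fits in the query, the caller has to choose between the kanji form and the reading when building the link.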

Entry for 訓

⚠Note that I have compacted these references, so both of the external links (denoted by a left arrow ⬅) appear in the same gloss rather than two glosses.
kun

The addition of structured content has a noticeable impact on the amount of time it takes to load the dictionary file into yomichan. The validation stage now takes about as long as the data import stage, but this seems acceptable to me.

We can also add html lang attributes to the language-of-origin notes as mentioned by birtles above, but yomichan's structured content validator will need to be updated to accept it.


Rarely-used Kanji Forms

A "rarely used kanji" [rK] tag has recently been added to JMdict, which the editors are using to indicate kanji forms that appear in less than 3% of all word usages (based upon corpus n-gram counts). I updated the program to de-prioritize these forms in the same way that it already treats other irregularly tagged kanji (e.g., [iK], "word containing irregular kanji usage"; [io], "irregular okurigana usage"; etc.). This should fully resolve issue #2001.

Entry for この間

⚠Rarely-used kanji forms are grayed-out in grouped-related-terms mode.
konoaida

The current production version of yomichan-import creates dictionary terms for JMdict readings that are tagged with a [NoKanji] indicator. I have extended this functionality to also apply to readings that are only associated with rare kanji forms. So for example, a user who scans "それ" will now see "それ" as the headword of the top entry rather than "其れ".
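
The rule can be sketched as follows, in Python for brevity. This is a simplification and not the code from the fork: it ignores reading restrictions, and the headwords function and its input shapes are hypothetical.

```python
RARE = "rK"

def headwords(kanji_forms, readings):
    """Return (expression, reading) pairs for one entry.

    kanji_forms: list of (text, tags) tuples, where tags is a set of strings.
    readings: list of reading strings.
    """
    if not kanji_forms:
        # Entries without kanji forms only ever produce kana terms.
        return [(r, r) for r in readings]
    pairs = [(k, r) for k, _tags in kanji_forms for r in readings]
    if all(RARE in tags for _k, tags in kanji_forms):
        # Every kanji form is rare, so add kana-only terms ahead of them.
        pairs = [(r, r) for r in readings] + pairs
    return pairs

print(headwords([("其れ", {"rK"})], ["それ"]))
```

With this rule, an entry whose kanji forms are all tagged [rK] yields a kana headword first, while entries with at least one ordinary kanji form are left as they were.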

Entry for それ (grouped)

sore

Entry for それ (ungrouped)

⚠Note that the senses for それ (其れ) and それ (interjection) have been merged, as indicated by the sense numbers. It might be nice to make it more explicit to the user that this merging has occurred somehow, but I haven't thought of a clever solution. I could change the color of the sense tags on terms whose kanji/kana pairs also belong to a different entry, but this would look weird in grouped mode.
sore_ungrouped

There are about 500 entries affected by this change. For the curious, here is a complete list.

New kana-only terms

1000320【彼処・あそこ】あそこ あすこ かしこ あしこ あこ
1000420【彼の・あの】あの
1000580【彼・あれ】あれ あ
1001390【御田・おでん】おでん
1001480【お負けに・おまけに】おまけに
1003430【屹度・きっと】きっと
1003560【刻々・ぎざぎざ】ぎざぎざ
1003710【嚏・くしゃみ】くしゃみ くさめ くっさめ
1003730【擽ったい・くすぐったい】くすぐったい
1004040【愚図愚図・ぐずぐず】ぐずぐず
1004310【斯う・こう】こう
1004500【此方・こちら】こちら こっち こち
1004505【此方・こなた】こなた こんた
1004790【此れから・これから】これから
1004820【これ迄・これまで】これまで
1005200【颯と・さっと】さっと
1005600【仕舞った・しまった】しまった
1005650【吃逆・しゃっくり】しゃっくり
1005870【凝乎と・じっと】じっと じいっと
1006200【す可き・すべき】すべき
1006600【其奴・そいつ】そいつ そやつ すやつ
1006640【然う然う・そうそう】そうそう
1006670【其処・そこ】そこ
1006680【其処いら・そこいら】そこいら
1006720【其処ら・そこら】そこら
1006730【而して・そして】そして しかして
1006780【其方・そちら】そちら そっち そなた そち
1006830【其の・その】その
1006950【抑・そもそも】そもそも
1007000【其れで・それで】それで
1007010【其れとも・それとも】それとも
1007020【それ処か・それどころか】それどころか
1007040【其れに・それに】それに
1007560【些とも・ちっとも】ちっとも
1007900【丁髷・ちょんまげ】ちょんまげ
1008630【迚も・とても】とても とっても
1008910【如何・どう】どう
1009290【何れ・どれ】どれ
1009300【何れでも・どれでも】どれでも
1009310【泥々・どろどろ】どろどろ
1010190【含羞む・はにかむ】はにかむ
1010240【許り・ばかり】ばかり ばっかり ばっか
1010250【許りに・ばかりに】ばかりに
1010500【犇犇・ひしひし】ひしひし
1010530【只管・ひたすら】ひたすら
1011430【可き・べき】べき
1012360【鼯鼠・むささび】むささび むざさび
1012490【藻掻く・もがく】もがく
1012630【持て成し・もてなし】もてなし
1012680【鼯鼠・ももんが】ももんが ももんがあ
1012730【軈て・やがて】やがて
1012800【漸と・やっと】やっと
1012810【矢っ張り・やっぱり】やっぱり
1015840【亜細亜・アジア】アジア
1019210【亜爾加里・アルカリ】アルカリ
1021300【以色列・イスラエル】イスラエル
1021360【伊太利・イタリア】イタリア イタリヤ
1023330【吋・インチ】インチ
1023880【印度支那・インドシナ】インドシナ
1023890【印度尼西亜・インドネシア】インドネシア
1026740【烏克蘭・ウクライナ】ウクライナ
1026830【宇柳貝・ウルグアイ】ウルグアイ
1027900【厄瓜多・エクアドル】エクアドル
1028500【埃及・エジプト】エジプト
1031680【濠太剌利・オーストラリア】オーストラリア
1031700【墺太利・オーストリア】オーストリア
1034460【牛津・オックスフォード】オックスフォード オクスフォード
1035520【阿蘭陀・オランダ】オランダ
1037560【型録・かたろぐ】かたろぐ
1037840【加特力・カトリック】カトリック
1037870【加奈陀・カナダ】カナダ
1039600【寒武利亜・カンブリア】カンブリア
1039610【柬埔寨・カンボジア】カンボジア
1040060【瓦斯・ガス】ガス
1041300【沈菜・キムチ】キムチ
1042340【玖馬・キューバ】キューバ
1042520【基督・キリスト】キリスト クリスト
1042620【瓩・キログラム】キログラム
1042650【粁・キロメートル】キロメートル
1043190【希臘・ギリシャ】ギリシャ ギリシア ギリシヤ
1043550【科威都・クウェート】クウェート クウェイト
1045150【久留子・クルス】クルス
1046160【瓜姆・グアム】グアム グァム ガム
1046810【瓦・グラム】グラム
1047070【臥児狼徳・グリーンランド】グリーンランド
1048290【剣橋・ケンブリッジ】ケンブリッジ
1048730【日耳曼・ゲルマン】ゲルマン
1050390【洋杯・コップ】コップ コツフ
1051780【哥倫比亜・コロンビア】コロンビア
1051860【混凝土・コンクリート】コンクリート
1054400【蜚蠊・ごきぶり】ごきぶり
1054570【護謨・ゴム】ゴム
1055770【西貢・サイゴン】サイゴン
1056970【撒哈拉・サハラ】サハラ
1059020【朱欒・ざぼん】ざぼん
1059040【蝲蛄・ざりがに】ざりがに
1059750【舎路・シアトル】シアトル
1060780【雪特尼・シドニー】シドニー
1061180【西比利亜・シベリア】シベリア シベリヤ
1061910【三鞭酒・シャンパン】シャンパン シャンペン
1062920【叙利亜・シリア】シリア
1063410【新嘉坡・シンガポール】シンガポール
1065660【牙買加・ジャマイカ】ジャマイカ
1066140【寿府・ジュネーブ】ジュネーブ
1067190【瑞西・スイス】スイス
1067410【瑞典・スウェーデン】スウェーデン スエーデン
1069140【蘇格蘭・スコットランド】スコットランド
1071060【士篤恒・ストックホルム】ストックホルム
1072540【獅子女・スフィンクス】スフィンクス
1074820【塞爾維・セルビア】セルビア
1075090【仙・セント】セント
1075470【蘇維埃・ソビエト】ソビエト ソヴィエト
1077190【達頼喇嘛・ダライラマ】ダライラマ ダライ・ラマ
1077750【禿び・ちび】ちび
1077760【窒扶斯・チフス】チフス チブス チプス
1078220【突尼斯・チュニジア】チュニジア テュニジア
1078320【智利・チリ】チリ
1080090【第希蘭・テヘラン】テヘラン
1084410【丁抹・デンマーク】デンマーク
1086880【土耳古・トルコ】トルコ
1089090【弗・ドル】ドル
1091080【尼加拉瓦・ニカラグア】ニカラグア
1091460【新西蘭・ニュージーランド】ニュージーランド ニュージランド
1091940【紐育・ニューヨーク】ニューヨーク
1094140【諾威・ノルウェー】ノルウェー
1096400【布哇・ハワイ】ハワイ
1096470【洪牙利・ハンガリー】ハンガリー
1098270【巴格達・バグダッド】バグダッド バグダード
1100140【晩香波・バンクーバー】バンクーバー
1100160【盤谷・バンコク】バンコク バンコック
1101370【巴基斯担・パキスタン】パキスタン
1102040【巴奈馬・パナマ】パナマ
1102740【巴里・パリ】パリ
1109290【比律賓・フィリピン】フィリピン フィリッピン フイリピン
1109460【芬蘭・フィンランド】フィンランド
1113730【伯剌西爾・ブラジル】ブラジル
1114500【勃牙利・ブルガリア】ブルガリア
1120030【白耳義・ベルギー】ベルギー
1121320【秘露・ペルー】ペルー
1124770【波蘭・ポーランド】ポーランド
1125940【葡萄牙・ポルトガル】ポルトガル
1127070【哩・マイル】マイル
1128430【燐寸・マッチ】マッチ
1129530【馬克・マルク】マルク
1129900【馬来西亜・マレーシア】マレーシア
1130020【芒果・マンゴー】マンゴー マンゴ
1130560【弥撒・ミサ】ミサ
1131930【粍・ミリメートル】ミリメートル
1133090【墨西哥・メキシコ】メキシコ
1134920【莫斯科・モスクワ】モスクワ
1135990【摩洛哥・モロッコ】モロッコ
1136260【碼・ヤード】ヤード
1137080【猶太・ユダヤ】ユダヤ
1137550【沃度丁幾・ヨードチンキ】ヨードチンキ
1137570【欧羅巴・ヨーロッパ】ヨーロッパ
1138610【羅宇・ラオス】ラオス
1139300【拉丁・ラテン】ラテン
1142800【利比利亜・リベリア】リベリア
1145730【黎巴嫩・レバノン】レバノン
1146020【檸檬・れもん】れもん
1148230【倫敦・ロンドン】ロンドン
1149020【華盛頓・ワシントン】ワシントン
1149730【亜爾然丁・アルゼンチン】アルゼンチン
1149830【亜米利加・アメリカ】アメリカ
1149970【亜剌比亜・アラビア】アラビア
1157170【為る・する】する
1160770【磯巾着・いそぎんちゃく】いそぎんちゃく
1163940【一寸・ちょっと】ちょっと ちょと
1168660【依る・よる】よる
1186670【下萠・したもえ】したもえ
1188630【何れ何れ・どれどれ】どれどれ
1189000【何処か・どこか】どこか どっか
1189210【何奴・どいつ】どいつ どやつ
1191810【家鴨・あひる】あひる
1201670【海豚・いるか】いるか
1201840【海驢・あしか】あしか みち
1209040【蒲公英・たんぽぽ】たんぽぽ ほこうえい
1209260【鴨の嘴・かものはし】かものはし
1226360【吃驚・びっくり】びっくり
1237470【況して・まして】まして
1240700【玉蜀黍・とうもろこし】とうもろこし
1242000【襟巻蜥蜴・えりまきとかげ】えりまきとかげ
1269130【呉れる・くれる】くれる
1269140【呉れ呉れも・くれぐれも】くれぐれも
1270700【お目出度う・おめでとう】おめでとう
1270830【お襁褓・おむつ】おむつ
1287690【黒子・ほくろ】ほくろ こくし ははくそ ははくろ ほくそ
1288810【此処・ここ】ここ
1303410【散ける・ばらける】ばらける
1320570【疾っくに・とっくに】とっくに
1337660【縮緬紙・ちりめんし】ちりめんし ちりめんがみ
1337670【縮緬皺・ちりめんじわ】ちりめんじわ
1400550【燥ぐ・はしゃぐ】はしゃぐ
1406060【其れでも・それでも】それでも
1406070【其れなら・それなら】それなら
1406090【其処で・そこで】そこで
1444030【兎もあれ・ともあれ】ともあれ
1457320【屯・トン】トン
1459640【乍ら・ながら】ながら
1459790【馴鹿・となかい】となかい じゅんろく
1466940【如何して・どうして】どうして
1466950【如何しても・どうしても】どうしても
1483160【彼奴・あいつ】あいつ きゃつ あやつ かやつ
1483185【彼方・あちら】あちら あっち あち
1493240【不図・ふと】ふと
1498040【負んぶ・おんぶ】おんぶ
1505990【然し・しかし】しかし
1506050【然も・しかも】しかも
1533600【面皰・にきび】にきび めんぽう
1535810【尤も・もっとも】もっとも
1537780【矢鱈・やたら】やたら
1551940【略・ほぼ】ほぼ
1554440【海獺・らっこ】らっこ
1562790【捥ぐ・もぐ】もぐ
1564380【凭れる・もたれる】もたれる
1565410【喇嘛・らま】らま
1565620【嘸・さぞ】さぞ
1566450【巫山戯る・ふざける】ふざける
1567450【掏摸・すり】すり
1568840【潛心力・せんしんりょく】せんしんりょく
1570120【稍・やや】やや
1571010【膃肭臍・おっとせい】おっとせい
1572760【諄い・くどい】くどい
1573190【齎す・もたらす】もたらす
1574170【靨・えくぼ】えくぼ
1574220【鞦韆・ぶらんこ】ぶらんこ
1574470【饂飩・うどん】うどん うんどん
1575160【鸚哥・いんこ】いんこ
1575470【鼬ごっこ・いたちごっこ】いたちごっこ
1577660【嗽・うがい】うがい
1579070【此奴・こいつ】こいつ こやつ
1581210【嘗て・かつて】かつて かって
1582920【此の・この】この
1585010【螺子・ねじ】ねじ らし
1585410【儘・まま】まま まんま
1585460【偖・さて】さて
1585970【襁褓・おしめ】おしめ むつき
1586750【洗い熊・あらいぐま】あらいぐま
1586780【凡ゆる・あらゆる】あらゆる
1590560【可也・かなり】かなり
1594400【確り・しっかり】しっかり
1595400【儒艮・じゅごん】じゅごん
1598390【天爾遠波・てにをは】てにをは テニヲハ
1599200【独逸・ドイツ】ドイツ
1599940【大蒜・にんにく】にんにく
1604010【区々・まちまち】まちまち
1606120【倚る・よる】よる
1607170【閊える・つかえる】つかえる つっかえる
1610040【所為・せい】せい せえ
1610360【珍紛漢紛・ちんぷんかんぷん】ちんぷんかんぷん
1611190【端ない・はしたない】はしたない
1612190【善くも・よくも】よくも
1612620【彼方此方・あちこち】あちこち あちらこちら あっちこっち
1612650【彼是・あれこれ】あれこれ かれこれ ひし
1612860【然うして・そうして】そうして
1628530【此れ・これ】これ
1628820【蝮・まむし】まむし はみ くちばみ たじひ
1629420【錻力・ブリキ】ブリキ
1631880【此処等・ここら】ここら
1632410【許りでなく・ばかりでなく】ばかりでなく
1632610【可し・べし】べし
1632670【擤む・かむ】かむ
1633370【疾うに・とうに】とうに
1652680【仏蘭西・フランス】フランス
1659520【阿爾及・アルジェリア】アルジェリア
1659590【利比亜・リビア】リビア
1694410【お負け・おまけ】おまけ
1725290【此れ此れ・これこれ】これこれ
1725330【此れ許り・こればかり】こればかり
1725390【この儘・このまま】このまま
1747960【飽くまでも・あくまでも】あくまでも
1755340【山荒・やまあらし】やまあらし
1755510【山棟蛇・やまかがし】やまかがし
1766800【西班牙・スペイン】スペイン
1775880【露西亜・ロシア】ロシア ロシヤ
1777760【背黄青鸚哥・せきせいいんこ】せきせいいんこ
1831960【然う斯う・そうこう】そうこう
1841850【伯林・ベルリン】ベルリン
1842880【座頭鯨・ざとうくじら】ざとうくじら
1903160【貝独楽・べいごま】べいごま べえごま ばいごま
1918460【ぽん柑・ぽんかん】ぽんかん
1920240【何の・どの】どの
1924105【太枘・だぼ】だぼ
1929050【阿弗利加・アフリカ】アフリカ
1952310【樹懶・なまけもの】なまけもの
1968420【ボール螺子・ボールねじ】ボールねじ
1970500【阿亀鸚哥・おかめいんこ】おかめいんこ
1973900【黄色猩猩蠅・きいろしょうじょうばえ】きいろしょうじょうばえ
1973950【牡丹鸚哥・ぼたんいんこ】ぼたんいんこ
1983340【捥り・もぎり】もぎり
2003800【片・ペンス】ペンス
2004460【白鼻心・はくびしん】はくびしん
2004810【愛蘭・アイルランド】アイルランド
2004830【亜富汗斯坦・アフガニスタン】アフガニスタン
2005050【日巴拉太・ジブラルタル】ジブラルタル
2005210【委内瑞拉・ベネズエラ】ベネズエラ ヴェネズエラ
2005220【波斯・ペルシャ】ペルシャ ペルシア ハルシャ
2005230【毛里求斯・モーリシャス】モーリシャス マウリチウス
2006450【蝤蛑・がざみ】がざみ かざみ がさみ がざめ
2006900【捏巴爾・ネパール】ネパール
2007010【不丹・ブータン】ブータン
2007080【暮利比亜・ボリビア】ボリビア ボリヴィア
2007090【摩納哥・モナコ】モナコ
2008040【斯うして・こうして】こうして
2008160【此れ丈・これだけ】これだけ
2008170【此れ迄に・これまでに】これまでに
2008270【嘸かし・さぞかし】さぞかし
2008650【然うした・そうした】そうした
2010050【態とらしい・わざとらしい】わざとらしい
2010140【其処此処・そこここ】そこここ
2010410【海地・ハイチ】ハイチ
2010530【幾内亜・ギニア】ギニア
2012830【鸊鷉・かいつぶり】かいつぶり
2020120【別剌敦那・ベラドンナ】ベラドンナ
2027100【何れ丈・どれだけ】どれだけ
2033870【宇牟須牟骨牌・うんすんかるた】うんすんかるた ウンスンカルタ うんすんカルタ
2037020【温める・ぬるめる】ぬるめる
2055720【此処彼処・ここかしこ】ここかしこ
2057800【桃色鸚哥・ももいろいんこ】ももいろいんこ
2060150【其処彼処・そこかしこ】そこかしこ
2070310【哥・グロス】グロス グロース
2074460【希伯来・ヘブライ】ヘブライ
2078870【羅馬・ローマ】ローマ
2079550【鸛・こうのとり】こうのとり こう
2083330【此方人等・こちとら】こちとら こっちとら
2084030【然うすれば・そうすれば】そうすれば
2089370【そっち退け・そっちのけ】そっちのけ そちのけ
2089510【それ計り・そればかり】そればかり
2093730【松濤館流・しょうとうかんりゅう】しょうとうかんりゅう
2094190【十刹・じっさつ】じっさつ じっせつ
2094490【癩菌・らいきん】らいきん
2094510【救癩・きゅうらい】きゅうらい
2096000【脇寺・わきでら】わきでら
2096890【兜率天・とそつてん】とそつてん
2097160【螫す・さす】さす
2097210【素袷・すあわせ】すあわせ
2098180【俳諧の連歌・はいかいのれんが】はいかいのれんが
2100220【蕭索・しょうさく】しょうさく
2100230【喊声・かんせい】かんせい
2116050【お持て成し・おもてなし】おもてなし
2123310【卒なく・そつなく】そつなく
2129490【克鯨・こくくじら】こくくじら
2134460【此れっぽっち・これっぽっち】これっぽっち
2134860【如何にもならない・どうにもならない】どうにもならない
2135520【乍らも・ながらも】ながらも
2137720【然う・そう】そう
2138660【挵る・せせる】せせる
2142130【爪哇・ジャワ】ジャワ ジャバ ジャヴァ
2154660【虎列剌・コレラ】コレラ
2163780【鵟・のすり】のすり
2165450【金剛鸚哥・こんごういんこ】こんごういんこ
2165470【五色青海鸚哥・ごしきぜいがいいんこ】ごしきぜいがいいんこ ごしきせいがいいんこ
2166530【巨頭鯨・ごんどうくじら】ごんどうくじら
2166610【赤坊鯨・あかぼうくじら】あかぼうくじら
2167650【豹紋蛸・ひょうもんだこ】ひょうもんだこ
2173950【何処ぞ・どこぞ】どこぞ
2176280【此れは・これは】これは
2176440【此処に於て・ここにおいて】ここにおいて
2176690【磽确・こうかく】こうかく ぎょうかく
2177120【許りか・ばかりか】ばかりか
2182120【鯥・むつ】むつ
2185410【吠舎・バイシャ】バイシャ ヴァイシャ
2190130【尉鶲・じょうびたき】じょうびたき
2190220【黄脚鷸・きあししぎ】きあししぎ
2190230【米利堅・メリケン】メリケン
2194560【鱰・しいら】しいら
2199600【尼鷺・あまさぎ】あまさぎ
2207090【此処いら・ここいら】ここいら
2207580【疾っく・とっく】とっく
2212090【撒爾沙・さるさ】さるさ
2220770【迚もじゃないが・とてもじゃないが】とてもじゃないが
2220780【迚も迚も・とてもとても】とてもとても
2222620【襅・ちはや】ちはや
2230890【蜾蠃・すがる】すがる
2231240【銀蜻蜓・ぎんやんま】ぎんやんま
2231250【団扇蜻蜓・うちわやんま】うちわやんま
2232860【蠔油・ハオユー】ハオユー
2241510【蠅取蜘蛛・はえとりぐも】はえとりぐも
2241570【蒴・さく】さく
2245450【柃・ひさかき】ひさかき ひさぎ いちさかき
2252860【何奴も此奴も・どいつもこいつも】どいつもこいつも
2256090【鮞・はららご】はららご
2263590【都鱮・みやこたなご】みやこたなご
2265440【螠・ゆむし】ゆむし
2269820【為れる・される】される
2270280【菩提薩埵・ぼだいさった】ぼだいさった
2270290【薩埵・さった】さった
2273320【模様莧・もようびゆ】もようびゆ
2273330【莧・ひゆ】ひゆ ひょう
2273340【滑莧・すべりひゆ】すべりひゆ
2395630【嘸や・さぞや】さぞや
2397090【彼等・あれら】あれら
2398300【癤・せつ】せつ
2406480【為れつつある・されつつある】されつつある
2409100【儘に・ままに】ままに
2420350【篊・ひび】ひび
2433200【其れ処・それどころ】それどころ
2433280【胡簶・やなぐい】やなぐい ころく
2439120【螺子ポンプ・ねじポンプ】ねじポンプ
2439530【金蛇・かなへび】かなへび
2453030【駃騠・けってい】けってい
2453260【硨磲・しゃこ】しゃこ
2454770【捥る・もぎる】もぎる
2459820【加密列・カミツレ】カミツレ カミルレ
2464870【仁座鯛・にざだい】にざだい にざだひ
2476410【蝉魴鮄・せみほうぼう】せみほうぼう
2491150【猿麻桛・さるおがせ】さるおがせ
2507320【桫欏・へご】へご
2508920【吠陀・ヴェーダ】ヴェーダ ベーダ いだ
2514240【鈹・かわ】かわ
2514400【鍰・からみ】からみ
2526030【随に・まにまに】まにまに
2536960【然ばかり・さばかり】さばかり
2542110【巋然・きぜん】きぜん
2556900【亜皮西尼・アビシニア】アビシニア
2579490【笈多・グプタ】グプタ
2587700【枘・ほぞ】ほぞ
2622400【丁幾・チンキ】チンキ
2627770【日本蝮・にほんまむし】にほんまむし
2632430【礬水・どうさ】どうさ
2637760【腸香・わたか】わたか
2647810【砰・ずり】ずり
2654610【何処かしら・どこかしら】どこかしら
2656220【踠き・もがき】もがき
2662220【然うとも・そうとも】そうとも
2670830【此処ぞ・ここぞ】ここぞ
2679370【和地関・バチカン】バチカン ヴァチカン ヴァティカン バティカン
2686150【プラス螺子・プラスねじ】プラスねじ
2718320【仕舞うた・しもうた】しもうた
2718350【一寸も・ちょっとも】ちょっとも
2719110【仕舞た・しもた】しもた
2723070【咬𠺕吧・ジャガタラ】ジャガタラ
2724560【の所為で・のせいで】のせいで
2729610【此れこそ・これこそ】これこそ
2745680【蘇丹・スーダン】スーダン
2746710【巴拉圭・パラグアイ】パラグアイ パラグァイ パラグワイ
2746770【馬達加斯加・マダガスカル】マダガスカル
2746800【羅馬尼亜・ルーマニア】ルーマニア
2747040【錫蘭・セイロン】セイロン
2748610【普魯西・プロシア】プロシア プロシャ
2755620【此れっぽち・これっぽち】これっぽち
2755840【此れっ許り・これっぱかり】これっぱかり
2762810【然うしたら・そうしたら】そうしたら
2764430【巫山戯・ふざけ】ふざけ
2772770【矢張り・やはり】やはり
2786050【ヨーロッパ鰻・ヨーロッパうなぎ】ヨーロッパうなぎ
2789450【似鯨・にたりくじら】にたりくじら
2790210【加密列擬き・カミツレもどき】カミツレもどき
2793790【文字る・もじる】もじる
2810720【此れまでで・これまでで】これまでで
2826282【維納・ウィーン】ウィーン
2828101【退る・しさる】しさる しざる
2829785【嘗てない・かつてない】かつてない
2830535【馬徳里・マドリード】マドリード マドリッド
2830551【余っ程・よっぽど】よっぽど
2830947【波布水母・はぶくらげ】はぶくらげ
2833552【我利我利・がりがり】がりがり
2833728【蘇門答剌・スマトラ】スマトラ
2833966【耶路撒冷・エルサレム】エルサレム イェルサレム
2834456【蘭貢・ヤンゴン】ヤンゴン ラングーン
2834462【烏剌紐母・ウラニウム】ウラニウム
2834735【虎狼痢・コロリ】コロリ
2834896【嬪夫・ピンプ】ピンプ
2835297【此れは此れは・これはこれは】これはこれは
2836396【此れっぱかし・これっぱかし】これっぱかし
2837077【亡う・うしなう】うしなう
2839545【聖保羅・サンパウロ】サンパウロ サン・パウロ
2841785【ヨーロッパ山棟蛇・ヨーロッパやまかがし】ヨーロッパやまかがし
2844189【留・ルーブル】ルーブル ルーブリ
2845740【加農・カノン】カノン
2846528【不列顛・ブリテン】ブリテン
2848718【波宇・パオ】パオ パウ ハウ
2848914【危なかしい・あぶなかしい】あぶなかしい
2849914【東亰・とうきょう】とうきょう とうけい
2851622【彼方此方・かなたこなた】かなたこなた あなたこなた
2853158【蘇・ソ】ソ
2853563【疾うから・とうから】とうから 

I also considered applying this change to kana forms that contain glosses that are all tagged with [uk] "usually kana" indicators, but I think this would be a bad idea. Unlike the rare kanji tags, which indicate less than 3% of all usages, the [uk] tag can mean that a kanji form is only used around 50% of the time. So when one of these [uk] kana forms is scanned by the user, I think it's still good for the corresponding kanji form to appear in the top result.


Term prioritization

There is an abundance of good frequency dictionaries for yomichan now, such as @toasted-nutbread's BCCWJ dictionary. So adjustments to frequency data in JMdict probably won't matter to a lot of people, especially since the data it uses seems not to be held in high regard by many. At any rate, I have made adjustments which will at least make a better out-of-the-box experience for people getting started with yomichan and JMdict.

JMdict includes frequency/priority tags based upon three sources:

  1. A frequency analysis of the Mainichi Shimbun newspaper performed on several years of data from the 1990s.
  2. The "Ichimango goi bunruishuu" published by Senmon Kyouiku Publishing, Tokyo, 1998.
  3. A list of words which are regarded by the JMdict editors as common, but are not included in the above two sources.

The "news" and "gai" tags are derived from the first source; "ichi" tags from the second; and "spec" tags from the third. I have updated the tooltip texts on these tags to better convey their meanings.

news tags ("news1k" to "news24k")

⚠Note that the "news" tag has been split up into 24 different tags ("news1k" to "news24k") based upon the rankings indicated in the JMdict file. This ranking also now affects the order in which terms are displayed.
kan
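
One way to derive such tags, sketched in Python. JMdict marks newspaper frequency with codes from nf01 to nf48, each covering a block of 500 words; pairing two blocks per "newsNk" tag is my assumption about how the 24 tags could be produced, and news_tag is a hypothetical name.

```python
import re

def news_tag(priority):
    """Map an 'nfXX' code (500-word blocks) to a 'newsNk' tag (1000-word blocks)."""
    match = re.fullmatch(r"nf(\d{2})", priority)
    if match is None:
        return None
    block = int(match.group(1))
    if not 1 <= block <= 48:
        return None
    return "news{}k".format((block + 1) // 2)

print(news_tag("nf03"))  # news2k
```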

ichi tags

ichi

spec tags ⚠the previous wording was "common words not included in frequency lists," which never made any sense to me until I read more about JMdict during the past week. "spec" probably isn't a great tag name either, but I don't have any better ideas. I'd almost like to name it "common", but then that would imply that other terms without the "common" tag are not common.

spec

gai tags

gai

The order in which terms are extracted from JMdict is now also factored into their priority ranking (with a smaller weight than the priority tags). For example, 【本・ほん】 is the first term extracted from its JMdict entry, while 【本・もと】 is extracted second from its entry (after 【元・もと】). Both 【本・ほん】 and 【本・もと】 are tagged with "ichi" priority. Now that extraction order is taken into account, the term entries for【本・ほん】 will show up first when the user scans "本".
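
A minimal sketch of this kind of ranking, in Python. The weights are hypothetical (the actual values are not given in this thread); the only property that matters is that the extraction-order penalty is much smaller than any tag weight.

```python
# Hypothetical weights; the real values used by the fork are not stated here.
TAG_WEIGHT = {"ichi": 100, "news": 100, "spec": 100, "gai": 50}
ORDER_PENALTY = 1  # much smaller than any tag weight

def score(tags, extraction_order):
    """Higher scores are displayed first."""
    base = sum(TAG_WEIGHT.get(t, 0) for t in tags)
    return base - ORDER_PENALTY * extraction_order

terms = [
    {"expression": "本", "reading": "もと", "tags": ["ichi"], "order": 1},
    {"expression": "本", "reading": "ほん", "tags": ["ichi"], "order": 0},
]
ranked = sorted(terms, key=lambda t: score(t["tags"], t["order"]), reverse=True)
print([t["reading"] for t in ranked])
```

Both terms carry the same priority tag, so the extraction order alone decides that 【本・ほん】 comes first.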

本 term order

hon


I think that's all I have for now. Thanks for reading.


(Edit 2022/04/08)

I don't know about anyone else, but I think I might prefer having the JMdict glosses condensed into a single, semicolon-delimited list item (see images below). Especially now that there is other information in these sense glossaries, I think having the glosses themselves on one line will improve readability. This is also how jisho.org displays this information.
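
In code, the condensing step amounts to something like the following Python sketch. The compact function and the placeholder strings are hypothetical, and a gloss that itself contains a semicolon would need separate handling.

```python
def compact(glosses, extras=()):
    """Join the plain glosses into one item and keep other notes as separate items."""
    items = []
    if glosses:
        items.append("; ".join(glosses))
    items.extend(extras)
    return items

print(compact(["first gloss", "second gloss", "third gloss"], ["《usage note》"]))
```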

張る

haru_pull

張る in "compact glossaries" mode

haru_pull_compact

Here's a test build if anyone would like to try it: jmdict_english_info_glosses_2022_04_08.zip
Again, this won't work with yomichan version 22.2.2.0

(But regardless, my own personal preferences aren't an issue. I can always adjust my copy of yomichan-import to build the dictionary files that I want for myself. So I'm open to feedback on how things should be set up if and when my version of yomichan-import gets merged back into the main branch.)

7 participants