A project to generate 四字熟語 (yoji-jukugo, four-character Japanese idioms) using a sequential TensorFlow model.
The dataset for this project was scraped or pulled from the following sources:
- Yojijukugo for idioms and meanings/readings
- Jamdict for kanji readings, meanings, and other information
- Kanji Database for kanji classification, grade level, and misc characteristics
- The main report, compiled with Datapane and also available in HTML format
- The full yoji_df dataframe describing the idioms, their constituent kanji, and all additional characteristics from the data linked above
- List of generated idioms, sans definitions and readings
- The same list, expanded out to a dataframe including readings and meanings of constituent characters and bigrams
- After I shared the initial project with some coworkers, @DC and @JZ suggested retraining the model on the bigrams within each idiom, since this more closely matches how yoji-jukugo are semantically divided and understood. I've updated the report linked above with some additional thoughts on the new model and its results!
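
To illustrate the bigram idea, here is a minimal sketch of how a four-character idiom could be split into its two constituent bigrams before training. This is not the preprocessing code used in the project; the function names `split_into_bigrams` and `bigram_training_pairs` and the two sample idioms are assumptions made for illustration only.

```python
from typing import Iterable, Iterator, Tuple


def split_into_bigrams(idiom: str) -> Tuple[str, str]:
    """Split a four-character idiom into its leading and trailing bigrams.

    Hypothetical helper: the project's real preprocessing may differ.
    """
    if len(idiom) != 4:
        raise ValueError(f"expected 4 characters, got {len(idiom)}: {idiom!r}")
    return idiom[:2], idiom[2:]


def bigram_training_pairs(idioms: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (first_bigram, second_bigram) pairs, skipping malformed entries."""
    for idiom in idioms:
        idiom = idiom.strip()
        if len(idiom) == 4:
            yield split_into_bigrams(idiom)


# Sample input chosen for illustration, not taken from the project's dataset.
examples = ["一石二鳥", "四字熟語"]
pairs = list(bigram_training_pairs(examples))
print(pairs)
```

Treating each idiom as a pair of two-character units, rather than four independent characters, would let a model learn which bigrams tend to open an idiom and which tend to close one.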