Generating 四字熟語, a.k.a. 4-character Japanese idioms

ryancahildebrandt/yoji

Wisdom in 4 Characters Or Less


Training a neural network to generate 四字熟語 (as best it can!)


Purpose

A project to generate 四字熟語 (yoji-jukugo, 4-character Japanese idioms) using a sequential TensorFlow model.
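This README does not describe how the idioms are fed to the model. As a minimal sketch, assuming a character-level setup in which the model predicts the next kanji from the ones before it, the training pairs could be built as follows. The helper names `build_vocab` and `next_char_pairs` are hypothetical and do not come from this repository.

```python
from typing import Dict, List, Tuple


def build_vocab(idioms: List[str]) -> Dict[str, int]:
    """Map every distinct character in the idioms to an integer id (hypothetical helper)."""
    chars = sorted({ch for idiom in idioms for ch in idiom})
    return {ch: i for i, ch in enumerate(chars)}


def next_char_pairs(idiom: str) -> List[Tuple[str, str]]:
    """Return (prefix, next character) pairs for one idiom (assumed training format)."""
    return [(idiom[:i], idiom[i]) for i in range(1, len(idiom))]


idioms = ["一石二鳥", "自業自得"]
vocab = build_vocab(idioms)
pairs = next_char_pairs(idioms[0])
```

For a 4-character idiom this yields three pairs, one for each position after the first character.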


Dataset

The dataset for this project was scraped or pulled from the following sources:

  • Yojijukugo for idioms and meanings/readings
  • Jamdict for kanji readings, meanings, and other information
  • Kanji Database for kanji classification, grade level, and misc characteristics

Outputs

  • The main report, compiled with Datapane and also available in HTML format
  • The full yoji_df dataframe describing the idioms, their constituent kanji, and all additional characteristics from the data linked above
  • List of generated idioms, sans definitions and readings
  • The same list, expanded out to a dataframe including readings and meanings of constituent characters and bigrams

Update!

  • After I shared the initial project with some coworkers, @DC and @JZ suggested that I retrain the model on the bigrams within each idiom, since this more closely matches how yoji-jukugo are semantically divided and understood. I've updated the report linked above with some additional thoughts on the new model and its results!
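To illustrate the bigram idea, here is a minimal sketch that assumes each idiom divides into its first two and last two characters. The function name `split_into_bigrams` is hypothetical, and the project's actual preprocessing may differ.

```python
from typing import Tuple


def split_into_bigrams(idiom: str) -> Tuple[str, str]:
    """Split a 4-character idiom into its leading and trailing bigrams (assumed 2+2 split)."""
    if len(idiom) != 4:
        raise ValueError("expected a 4-character idiom, got %d characters" % len(idiom))
    return idiom[:2], idiom[2:]


idioms = ["一石二鳥", "自業自得"]
bigram_sequences = [split_into_bigrams(idiom) for idiom in idioms]
```

Under this assumption the model would see each idiom as a sequence of two bigram tokens rather than four single-character tokens.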