Releases: daac-tools/vaporetto-models

0.5.0

16 May 14:22

Model files

This software contains the results of joint research with the National Institute for Japanese Language and Linguistics (NINJAL).

We provide multiple model files for Vaporetto that you can download and use in your work.
These models have been trained using BCCWJ and UniDic.

All of these models are trained with L1 regularization.

See below for the license terms of each model.

(NOTE) Parts of BCCWJ are excluded from the training data for rights reasons.
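
The `.zst` suffix on the model files suggests that they are Zstandard-compressed. Assuming that is the case, the following minimal sketch checks that a downloaded file starts with the Zstandard frame magic number before it is passed on to Vaporetto. The function name `looks_like_zstd` is hypothetical and not part of Vaporetto.

```python
from pathlib import Path

# Zstandard frames start with the magic number 0xFD2FB528,
# which is stored on disk in little-endian byte order.
ZSTD_MAGIC = bytes([0x28, 0xB5, 0x2F, 0xFD])


def looks_like_zstd(path):
    """Return True if the file at `path` starts with the Zstandard magic number.

    Hypothetical helper: it only inspects the first four bytes and does not
    validate the rest of the file or the model inside it.
    """
    with Path(path).open("rb") as f:
        return f.read(4) == ZSTD_MAGIC
```

A truncated download or an HTML error page saved under a model file name would fail this check.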

Models with dictionary

We provide models that contain UniDic. These are the most accurate of the models we distribute.

  • bccwj-suw+unidic_pos+pron.model.zst: contains POS and pronunciation tags.
  • bccwj-suw+unidic_pos+kana.model.zst: contains POS and kana tags.

(NOTE) "Pronunciation tags" follow the actual pronunciation, whereas "kana tags" follow the kana spelling used in print. (For example, surface: 東京, pron: トーキョー (tōkyō), kana: トウキョウ (toukyou))

Models without dictionary

We also provide models that do not contain UniDic.
These models have been trained in four model sizes and two word units.

Model size        Short unit words (SUW)        Long unit words (LUW)
Tiny (C=0.003)    bccwj-suw_c0.003.model.zst    N/A
Small (C=0.1)     bccwj-suw_c0.1.model.zst      N/A
Middle (C=0.5)    bccwj-suw_c0.5.model.zst      N/A
Large (C=1.0)     bccwj-suw_c1.0.model.zst      bccwj-luw.model.zst
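
For scripts that need to pick a file by size and word unit, the table above can be written as a lookup. This is only a sketch: the file names are copied from the table, while the size labels used as keys and the `model_file` helper are hypothetical.

```python
# File names copied from the table above. Combinations listed as N/A
# are simply absent from the mapping.
MODELS = {
    ("tiny", "suw"): "bccwj-suw_c0.003.model.zst",
    ("small", "suw"): "bccwj-suw_c0.1.model.zst",
    ("middle", "suw"): "bccwj-suw_c0.5.model.zst",
    ("large", "suw"): "bccwj-suw_c1.0.model.zst",
    ("large", "luw"): "bccwj-luw.model.zst",
}


def model_file(size, unit="suw"):
    """Return the model file name for a size label and word unit.

    Hypothetical helper: raises KeyError for combinations that the
    release does not provide (the N/A cells in the table).
    """
    key = (size.lower(), unit.lower())
    if key not in MODELS:
        raise KeyError(f"no model for size={size!r}, unit={unit!r}")
    return MODELS[key]
```

For example, `model_file("large", "luw")` returns the only long-unit-word model, and `model_file("small", "luw")` raises `KeyError`.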

License

The following models are licensed under the 3-Clause BSD License.

  • bccwj-suw+unidic_pos+pron.model.zst
  • bccwj-suw+unidic_pos+kana.model.zst

The following models are licensed under either the Apache License (Version 2.0) or the MIT License, at your option.

  • bccwj-suw_c1.0.model.zst
  • bccwj-suw_c0.5.model.zst
  • bccwj-suw_c0.1.model.zst
  • bccwj-suw_c0.003.model.zst
  • bccwj-luw.model.zst
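
When models are bundled with other software, it can help to keep the license of each file next to its name. The sketch below records the two lists above as a mapping. The SPDX-style license strings and the `license_for` helper are assumptions made for illustration and do not come from the release itself.

```python
# SPDX-style expressions for the two license groups listed above.
BSD_3_CLAUSE = "BSD-3-Clause"
APACHE_OR_MIT = "Apache-2.0 OR MIT"

# File names copied from the two lists above.
MODEL_LICENSES = {
    "bccwj-suw+unidic_pos+pron.model.zst": BSD_3_CLAUSE,
    "bccwj-suw+unidic_pos+kana.model.zst": BSD_3_CLAUSE,
    "bccwj-suw_c1.0.model.zst": APACHE_OR_MIT,
    "bccwj-suw_c0.5.model.zst": APACHE_OR_MIT,
    "bccwj-suw_c0.1.model.zst": APACHE_OR_MIT,
    "bccwj-suw_c0.003.model.zst": APACHE_OR_MIT,
    "bccwj-luw.model.zst": APACHE_OR_MIT,
}


def license_for(filename):
    """Return the license expression for a model file in this release.

    Hypothetical helper: raises KeyError for file names that are not
    listed, rather than guessing a license.
    """
    return MODEL_LICENSES[filename]
```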