
A Lucene/Solr filter and filter factory that folds certain CJK characters to improve recall. For example, it converts some modern Japanese Kanji characters to their traditional equivalents (when the modern Kanji doesn't map to the simplified Han character). Used by SearchWorks at index and query time.


yalelibrary/CJKFoldingFilter


CJKFoldingFilter

YUL Customizations

YUL has added some mappings and moved all of the mappings out of the Java code into an XML file. The mappings are now loaded in a static block, so they are read once when the class is loaded and each filter instance doesn't re-add the items to the map.
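The static-block loading described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class name `FoldingMappings`, the `<mapping from="…" to="…"/>` layout, and the inline sample XML are all assumptions made for the example (the real filter reads its own XML file, whose element names may differ), and the two character pairs are illustrative modern-to-traditional Kanji pairs, not entries copied from the project's mapping file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class FoldingMappings {
    // Hypothetical XML layout, inlined so the sketch is self-contained.
    // The real mapping file may use different element and attribute names.
    private static final String SAMPLE_XML =
        "<mappings>"
      + "<mapping from=\"\u4ECF\" to=\"\u4F5B\"/>"   // modern Kanji -> traditional form
      + "<mapping from=\"\u5E83\" to=\"\u5EE3\"/>"
      + "</mappings>";

    // Shared by every instance; filled exactly once when the class is loaded.
    private static final Map<Integer, Integer> FOLDS;

    static {
        Map<Integer, Integer> m = new HashMap<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(SAMPLE_XML.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("mapping");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                m.put(e.getAttribute("from").codePointAt(0),
                      e.getAttribute("to").codePointAt(0));
            }
        } catch (Exception ex) {
            // Fail fast: a broken mapping file should not go unnoticed.
            throw new ExceptionInInitializerError(ex);
        }
        FOLDS = Collections.unmodifiableMap(m);
    }

    /** Returns the folded code point, or the input unchanged if it has no mapping. */
    static int fold(int codePoint) {
        return FOLDS.getOrDefault(codePoint, codePoint);
    }

    /** Number of mappings loaded from the XML. */
    static int size() {
        return FOLDS.size();
    }

    public static void main(String[] args) {
        System.out.println(new StringBuilder().appendCodePoint(fold(0x4ECF)));
        System.out.println("mappings loaded: " + size());
    }
}
```

Because the map is a static final field populated in the static initializer, constructing many filter instances costs nothing extra: they all read the same immutable table.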

<img src="https://secure.travis-ci.org/sul-dlss/CJKFoldingFilter.png?branch=master" alt="Build Status" />

This is a Lucene filter and filter factory (see lucene.apache.org) that folds certain CJK characters to improve recall. Put it in your analysis chain BEFORE the ICU transform from Traditional to Simplified Han: it converts modern Japanese Kanji to their traditional equivalents, so the folded characters need to pass through that transform afterwards.
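To see why the ordering matters, here is a small sketch that uses plain lookup tables instead of real Lucene filters. Everything in it is an assumption for illustration: the class `FoldOrderSketch`, its method names, and the one-entry tables. In particular, `TRADITIONAL_TO_SIMPLIFIED` is only a toy stand-in for the ICU Traditional-Simplified transform, and the Kanji pair is an illustrative example rather than an entry taken from the project's mapping file.

```java
import java.util.Map;

class FoldOrderSketch {
    // Illustrative pair: modern Japanese Kanji U+5E83 -> traditional form U+5EE3.
    static final Map<Integer, Integer> KANJI_TO_TRADITIONAL = Map.of(0x5E83, 0x5EE3);

    // Toy stand-in for the ICU Traditional-Simplified transform: U+5EE3 -> U+5E7F.
    static final Map<Integer, Integer> TRADITIONAL_TO_SIMPLIFIED = Map.of(0x5EE3, 0x5E7F);

    /** Replaces each code point that has an entry in the table; others pass through. */
    static String apply(String text, Map<Integer, Integer> table) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> out.appendCodePoint(table.getOrDefault(cp, cp)));
        return out.toString();
    }

    /** Recommended order: fold Kanji first, then Traditional -> Simplified. */
    static String foldThenSimplify(String text) {
        return apply(apply(text, KANJI_TO_TRADITIONAL), TRADITIONAL_TO_SIMPLIFIED);
    }

    /** Reversed order: the folded output never reaches the simplification step. */
    static String simplifyThenFold(String text) {
        return apply(apply(text, TRADITIONAL_TO_SIMPLIFIED), KANJI_TO_TRADITIONAL);
    }

    public static void main(String[] args) {
        String modernKanji = "\u5E83";
        String traditional = "\u5EE3";
        // In the recommended order, both spellings end up as the same term.
        System.out.println(foldThenSimplify(modernKanji).equals(foldThenSimplify(traditional)));
        // In the reversed order, they stay different and would not match each other.
        System.out.println(simplifyThenFold(modernKanji).equals(simplifyThenFold(traditional)));
    }
}
```

In the recommended order the modern Kanji, the traditional form, and the simplified form all normalize to the same character, which is what improves recall. In the reversed order the modern Kanji is left as the traditional form while the traditional input has already become simplified, so the two no longer match.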

Usage

  • clone the project

git clone https://github.com/solrmarc/CJKFoldingFilter.git
  • run the jar ant task

ant jar
  • put the CJKFoldingFilter.jar file found in the dist directory into your Solr lib directory

  • use the CJKFoldingFilterFactory in a field type in your Solr schema.xml file, for example:

<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="10000" autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory" />
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
    <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
  </analyzer>
</fieldType>

Contributing

  1. Fork it

  2. Create your feature branch (`git checkout -b my-new-feature`)

  3. Commit your changes (`git commit -am 'Added some feature'`)

  4. Push to the branch (`git push origin my-new-feature`)

  5. Create new Pull Request
