Please be aware that this application requires Java 16+!
You can download and install it for free from this link.
- affix file and dictionary linter
- rules reducer
- LibreOffice and Mozilla packager
- Part-of-Speech and dictionary FSA extractor for LanguageTool
- automatic font selection to render the custom language
- management of thesaurus, hyphenation, auto-correct, sentence exception, and word exception files
- minimal pairs extraction
- statistics
- … and many more!
- Motivation
- What the application can do
- How to enhance its capabilities
- Recognized charsets
- Recognized flags
- How to
- Open a project
- Create an extension
- Linter dictionary
- Linter thesaurus
- Linter hyphenation
- Sort dictionary
- Reduce rules
- Word count
- Rule flags aid
- Dictionary statistics
- Dictionary duplicates
- Dictionary wordlist
- Create a Part-of-Speech FSA
- Minimal pairs
- Ordering table columns
- Copying text
- Rule/dictionary insertion
- Screenshots
- Changelog
I created this project to help me construct my Hunspell language files, particularly for the Venetan language (you can find some tools here, and the language pack here (for the LibreOffice tools) and here (for the Mozilla tools)). By that I mean the .aff and .dic files, along with the hyphenation and thesaurus files.
This application can perform many correctness checks on the structure and content of these files. It can tell you whether some rule is missing or redundant. You can test rules and compound rules. You can also test hyphenation and, if needed, add rules. It can also manage and build the thesaurus.
This application can also sort the dictionary, count words (unique and total), produce statistics, extract duplicates, extract wordlists, extract minimal pairs, and create a package in order to build an .oxt (LibreOffice) or .xpi (Mozilla) file for deployment.
You can customize the checks the application performs by simply adding another package alongside vec, named after the ISO 639-3 or ISO 639-2 code, and extending the DictionaryCorrectnessChecker, Orthography, and DictionaryBaseData classes (this last class is used to drive the Bloom filter).
Along with these classes you can add your rules.properties, a file that describes various constraints on the rules in the .dic file.
After that you have to tell the application that those files exist by editing the BaseBuilder class and adding a LanguageData entry to the DATAS hashmap.
The application automatically recognizes which checker to use based on the code in the LANG option of the .aff file.
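As a rough illustration only (the actual base-class contracts live in the project's sources, so the class bodies and the registration call below are assumptions; use the existing vec package as the reference), a new language package could be laid out like this:

```java
// Hypothetical layout for a new language with code "xyz"; names and signatures
// are illustrative, mirror the existing vec package for the real contracts.
class DictionaryCorrectnessCheckerXYZ extends DictionaryCorrectnessChecker{
	// language-specific checks applied to every generated inflection go here
}

class OrthographyXYZ extends Orthography{
	// language-specific orthography corrections go here
}

class DictionaryBaseDataXYZ extends DictionaryBaseData{
	// expected word count and false-positive probability used to size the Bloom filter
}

// Registration inside BaseBuilder (the LanguageData constructor shown here is assumed):
// DATAS.put("xyz", new LanguageData(DictionaryCorrectnessCheckerXYZ.class,
// 	OrthographyXYZ.getInstance(), new DictionaryBaseDataXYZ()));
```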
- UTF-8
- ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-10, ISO-8859-13, ISO-8859-14, ISO-8859-15
- KOI8-R, KOI8-U
- MICROSOFT-CP1251
- ISCII-DEVANAGARI
- TIS620-2533
SET, FLAG, COMPLEXPREFIXES, LANG, AF, AM
COMPOUNDRULE, COMPOUNDMIN, COMPOUNDFLAG, ONLYINCOMPOUND, COMPOUNDPERMITFLAG, COMPOUNDFORBIDFLAG, COMPOUNDMORESUFFIXES, COMPOUNDWORDMAX, CHECKCOMPOUNDDUP, CHECKCOMPOUNDREP, CHECKCOMPOUNDCASE, CHECKCOMPOUNDTRIPLE, SIMPLIFIEDTRIPLE, FORCEUCASE
CIRCUMFIX, FORBIDDENWORD, FULLSTRIP, KEEPCASE, ICONV, OCONV, NEEDAFFIX
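For reference, a minimal .aff header using a few of these options could look like the following (the values are illustrative, not taken from any real language file):

```
SET UTF-8
FLAG UTF-8
LANG vec-IT

COMPOUNDMIN 1
COMPOUNDRULE 1
COMPOUNDRULE AB*C

FORBIDDENWORD !
KEEPCASE _
```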
Select File|Open Project. A dialog will appear; select a blue folder (this marks a valid project).
A META-INF folder containing a manifest.xml file is loaded, and the location of each relevant file is retrieved from it.
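For instance, the manifest.xml of a typical LibreOffice dictionary extension looks roughly like this (the paths are illustrative; the referenced .xcu file is what in turn lists the actual .aff/.dic, thesaurus, and hyphenation files):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<manifest:manifest xmlns:manifest="http://openoffice.org/2001/manifest">
	<manifest:file-entry manifest:media-type="application/vnd.sun.star.configuration-data"
		manifest:full-path="dictionaries.xcu"/>
</manifest:manifest>
```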
Upon loading, a font that can render the content of the project is chosen. If you want another font, just select File|Select font and choose another one.
The font will be linked to the project, so the same font will be used when the project is opened again later.
In order to create an extension (e.g. for LibreOffice, or for Mozilla products) you have to use the option File|Create package. This will package the directory in which the .aff/.dic files reside into a zip file. All there is to do afterwards is to rename the extension to .oxt (LibreOffice) or .xpi (Mozilla).
Remember that the package will have the same name as the directory, but the directory itself is not included, only its content.
To lint a dictionary just select Dictionary tools|Correctness check or Dictionary tools|Correctness check using dictionary FSA.
Each line is then linted following the rules of a particular language (if the corresponding files are present in the project, e.g. for Venetan). If no such files are present, a general linter is applied.
To lint the thesaurus just select Thesaurus tools|Correctness check or Thesaurus tools|Correctness check using dictionary FSA.
Each thesaurus entry is linted by checking for the presence of each synonym as a definition (with the same Part-of-Speech).
In case of error it is suggested to copy all the synonyms of the indicated words (and all those that come out of filtering on those two words), remove each of them, and reinsert them.
To lint the hyphenation just select Hyphenation tools|Correctness check.
Each hyphenation code is then linted following certain rules (among them, that a breakpoint must not lie on the boundary, that a code must have at least one breakpoint, etc.).
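As an illustration (the patterns below are made up and only serve to show the stated rules), the linter would accept the first code and reject the other two:

```
ab1co    accepted: it contains a breakpoint, and the breakpoint is not on the boundary
abco     rejected: it contains no breakpoint at all
1abco    rejected: the breakpoint falls on the boundary
```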
By selecting Dictionary tools|Sort dictionary you can sort specific parts of a dictionary file, selecting one of the highlighted sections delimited by a comment or an empty line and the next one.
The sorting order is language-dependent.
Use Dictionary tools|Rules reducer to find the minimum set of rules that covers the current dictionary file.
E.g., if a dictionary file has the lines aa/b and bb/b, and the affix file contains the rules SFX b 0 A a, SFX b 0 B b, and SFX b 0 C c (where the last is not used), then this tool returns the minimum set SFX b 0 A a and SFX b 0 B b.
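In file form, the example above corresponds to something like this (the SFX header lines with the rule counts are added here for completeness; the reducer keeps only the rules that are actually used):

```
# dictionary (.dic)
2
aa/b
bb/b

# affix file (.aff), before reduction
SFX b Y 3
SFX b 0 A a
SFX b 0 B b
SFX b 0 C c

# affix file (.aff), after reduction
SFX b Y 2
SFX b 0 A a
SFX b 0 B b
```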
Use Dictionary tools|Word count to count all the words generated by the affix file, as well as the unique words (not considering part-of-speech).
Note: there is some uncertainty in the uniqueness count, but it should be small. Deal with it :p.
Use Dictionary tools|Statistics to produce some statistics (graphs and values are exportable with a right click!) about word and compound word count, mode of word length, mode of syllables per word, most common syllables, and the longest words (by letters and by syllables).
If you want to include hyphenation statistics, be sure to use Hyphenation tools|Statistics instead, but expect roughly a 3.6× increase in running time.
To obtain a list of word duplicates (same word, same part-of-speech), the tool you want to use is under Dictionary tools|Extract duplicates.
To obtain a list of all the words generated by a dictionary and affix file, use the menus Dictionary tools|Extract wordlist and Dictionary tools|Extract wordlist (plain words).
In order to create an FSA for Part-of-Speech, suitable for use in LanguageTool, you have to use the option File|Extract PoS FSA and select the output folder. This will create an FSA using a provided <language>.info file (or an automatically generated one).
Remember that the FSA file will have the same name as specified in the LANG option of the .aff file, with the extension .dict.
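The .info file is the metadata file used by the morfologik FSA tooling that LanguageTool relies on; a minimal hand-written one could contain just the following keys (this is an assumption about what is commonly needed, HunLinter may generate or require others):

```
# vec-IT.info (illustrative)
fsa.dict.separator=+
fsa.dict.encoding=UTF-8
```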
To obtain a list of minimal pairs use the menu Dictionary tools|Extract minimal pairs.
An external text file can be put into the directory aids (at the same level as the executable jar) whose content will be displayed in the drop-down element of the Dictionary tab (blank lines are ignored).
This file can be used as a reminder of all the flags that can be added to a word and their meaning.
The filename has to be the language (as specified in the LANG option inside the .aff file), with the extension aid (e.g. for Venetan: vec-IT.aid).
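For example, a vec-IT.aid file could contain lines like these (the flags and descriptions are entirely made up; each non-blank line simply becomes an entry of the drop-down):

```
A: forms the plural
B: forms the feminine
C: forms the diminutive
```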
It is possible to sort certain columns of the tables: just click on the column header. The sort order will cycle between ascending, descending, and unsorted.
It is possible to copy the content of tables and the words in the statistics section. Also, the graphs in the statistics section can be exported as images.
Use Ctrl+C after selecting the row, or right-click to access the popup menu.
This is NOT an editor tool¹! If you want to add affix rules, add words to the dictionary, or change them, you have plenty of tools around. For Windows, I suggest Notepad++ (for example, you will see immediately while typing whether a word is already present in the dictionary).
¹: Even if, for the hyphenation file, a new rule can actually be added…
Entries can be a single word followed by a slash and all the flags that have to be applied, followed optionally by one or more morphological fields.
Entries can be inserted in two ways (see the example after this list):
- (pos)|word1|word2|word3
- pos:word1,word2,word3
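For example (the words, flags, and Part-of-Speech below are made up), a dictionary entry and the two equivalent thesaurus input forms could look like this:

```
# dictionary entry: word, flags after the slash, optional morphological fields (e.g. po: for part-of-speech)
chair/AB po:noun

# thesaurus entry, first form
(noun)|chair|seat|stool

# thesaurus entry, second form
noun:chair,seat,stool
```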
Once something is written, an automatic filtering is executed to find all the words (and part-of-speech, if given) that are already contained in the thesaurus.
It is possible to right-click on a row to bring up the popup menu and select whether to copy it, remove it (and all the other rows in which the selected definition appears), or merge it with the current synonyms.
- rules reducer fixes and enhancements
- considered different formats for part-of-speech in thesaurus file
- fixed early creation of thesaurus parser (language was not available yet)
- fixed circumfix inflections
- enhanced duplication worker capabilities
- delayed creation of file chooser (faster startup)
- eliminated double reloading of dictionary in sort dialog when something changes
- understood how ICONV and OCONV work
- supported ISO-8859-10, ISO-8859-14, and ISCII-DEVANAGARI charsets
- added a check on declared charset and real charset of a file
- adjusted scroll to the bottom of the log text area while changing font
- decreased the loading time of sorting dialog
- increased speed by 57% (for dictionary linter: from 2m 13s to 57s)
- decreased start-up time
- corrected some typos
- fixed a bug on the initial font size
- automatically unzip .dat and .bau files (in autocorr and autotext folders)
- startup time reduced
- added linter for auto-correct
- corrected the size of the font
- corrected the executable
- added a warning for unused rules after the dictionary linter
- added the possibility to hide selected columns from dictionary table
- (finally) added a Windows installer
- some minor improvements on speed and linting capabilities
- made update process stoppable
- added a linter for thesaurus
- added a menu to generate Dictionary FSA (used in LanguageTool, for example)
- added a section to see the PoS FSA execution
- fixed a bug on hyphenation: when the same rule was added (with different breakpoints), the old one was lost
- substituted charting library
- added undo/redo capabilities on input fields
- completely revised thread management
- fixed a nasty memory leak
- now the sort dialog remains open after a sort
- categorized the errors into (true) errors and warnings; the warnings are no longer blocking
- reduced compiled size by 52% (from 6 201 344 B to 3 002 671 B)
- reduced memory footprint by 13% (for dictionary linter: from 728 MB to 630 MB)
- increased speed by 53% (for dictionary linter: from 4m 44s to 2m 13s)
- various minor bugfixes and code revisions
- (finally) given a decent name to the project: HunLinter
- fixed a bug while selecting the font once a project is loaded
- fixed a bug while storing thesaurus information (only lowercase words are allowed)
- added update capability (the new jar will be copied into the directory of the old jar and started)
- added buttons to open relevant files
- added management of SentenceExceptList.xml and WordExceptList.xml
- added a menu to generate Part-of-Speech FSA (used in LanguageTool, for example)
- made tables look more standard (copy and edit operations)
- improved thesaurus merging
- completely revised how the loading of a project works: now it is possible to load and manage all the languages in an extension (or package), and all the relevant files are read from manifest.xml and the linked .xcu files
- the way a project is loaded in the application has changed: now the project folder (marked by a blue icon) has to be selected instead of an .aff file
- added the possibility to change the options for hyphenation
- added the parsing and management of auto-correct files (only DocumentList.xml can be edited for now; SentenceExceptList.xml and WordExceptList.xml are currently read-only)
- now all the relevant files are loaded by reading the META-INF\manifest.xml file, no assumptions are made
- enhancement for the hyphenation section: now it is also possible to insert custom hyphenations
- bug fix on duplicate extraction
- some simplifications were made in the main menu (removed thesaurus validation on request because it will be done anyway at loading)
- improvements on thesaurus table filtering
- prevented the insertion of a new thesaurus entry if it is already present
- revised the dictionary sort dialog from scratch to better handle sections between comments
- minor GUI adjustments and corrections
- added the link to the online help
- corrected the font size on the dictionary sorter dialog
- bugfix: scroll on dictionary sorter dialog
- introduced the possibility to choose the font (you can select it whenever you've loaded an .aff file; it will give you a list of all the fonts that can render the loaded language; once a font is selected, it will be used for all the .aff files in that language)