This project allows you to build dictionaries based on Wiktionary entries.
Dictionary builder started as a demonstration of advanced JAXB techniques for unmarshalling very large XML documents with a very low memory footprint.
The Java/JAXB implementation has been archived in the java-jaxb branch.
It was then rewritten with Scala and Akka Streams.
The Scala/akka-stream implementation has been archived in the scala-akka-streams branch.
It has now been rewritten in Rust.
The resulting dictionary is exactly the same with all three implementations. None of these implementations was designed to be used as a benchmark, but the Rust results are nevertheless breathtaking. See below.
dictionary-builder is an EDLA project.
The purpose of edla.org is to promote the state of the art in various domains.
Don't expect too much from this dictionary builder.
After running this program, you will find in the root folder configured in `Settings.toml`:
- a file, named by the `words_file` setting, that contains all the words found (and expressions if `expression = true` is configured)
- a file, named by the `excluded_words_file` setting, that contains all pages in the dump that were filtered out
- and, if `with_definition = true` is configured, a set of folders (two levels deep) holding gzip-compressed files. Each file contains a definition in the raw Wikimedia format, which is probably not what you are expecting.
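As a rough illustration of the layout described above, the following sketch counts the files stored two folder levels below the root folder. It uses only the Rust standard library. The function name `count_definition_files` and the assumed layout `root/<level 1>/<level 2>/<file>` are illustrative assumptions, not taken from the project's code.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Counts the files stored exactly two folder levels below `root`.
/// Assumption: the layout is root/<level 1>/<level 2>/<file>.
/// Files located anywhere else (such as the word lists directly under
/// `root`) are ignored.
pub fn count_definition_files(root: &Path) -> io::Result<usize> {
    let mut count = 0;
    for level1 in fs::read_dir(root)? {
        let level1 = level1?.path();
        if !level1.is_dir() {
            continue; // skip plain files in the root folder
        }
        for level2 in fs::read_dir(&level1)? {
            let level2 = level2?.path();
            if !level2.is_dir() {
                continue; // skip files at the first level
            }
            for file in fs::read_dir(&level2)? {
                if file?.path().is_file() {
                    count += 1;
                }
            }
        }
    }
    Ok(count)
}
```

Reading the content of one of these files would additionally require a gzip decoder, which the standard library does not provide.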
- Rust needs to be installed to build the executable.
- Get a fresh Wiktionary backup.
  Choose your favorite language and download the dump containing the current versions of article content here.
  Example for the English dump: http://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles-multistream.xml.bz2
- Uncompress the freshly downloaded dump somewhere (be aware that you need up to 7 GB of free disk space).
- Edit `Settings.toml` to indicate the language you chose, where the dump is located and, last but not least, where the dictionary should be generated.
  (On Windows systems, paths need to be escaped, for example `C:\\dico\\words`, and you need at least 4 GB of free disk space to store your dictionary if you set `with_definition = true`.)
  The root folder must already exist; create it yourself if it does not.
- Build the executable: `cargo build --release`
- Launch the program: `./target/release/dictionary-builder`
- Some results:
  From the English dictionary, 918612 entries are generated in less than 2 minutes, and 3.5 GB of disk space is required for the dictionary.
  Note: on some systems, antivirus software can slow down the generation considerably if `with_definition = true` is configured.
That's it.
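The two pitfalls of the settings step (escaped Windows paths and a root folder that must already exist) can be sketched as follows. This is a minimal sketch using only the standard library; `escape_for_toml` and `ensure_root_exists` are hypothetical helper names and are not part of dictionary-builder.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Doubles every backslash so that a Windows path can be pasted into a
/// double-quoted TOML string, e.g. C:\dico\words becomes C:\\dico\\words.
pub fn escape_for_toml(path: &str) -> String {
    path.replace('\\', "\\\\")
}

/// Creates the root folder (and any missing parents) if it does not exist.
/// Hypothetical helper: the program itself expects the folder to be there.
pub fn ensure_root_exists(root: &Path) -> io::Result<()> {
    if root.is_dir() {
        return Ok(());
    }
    fs::create_dir_all(root)
}
```

TOML also offers single-quoted literal strings, in which backslashes are not escape characters, so `'C:\dico\words'` may work as an alternative, assuming the settings loader accepts standard TOML.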
Tests were run on a modest i7-4600U CPU @ 2.10GHz with an SSD.
The results sound like a joke:
| | Rust | Scala/akka streams | Java/JAXB |
|---|---|---|---|
| without definition | 37s | 4min 47s | 7min 36s |
| with definitions | 1min 53s | 5min 46s | 9min 1s |
The Rust implementation outperforms the other implementations by far and, the icing on the cake, Rust uses ten times less memory. 🚀
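To put the table in perspective, the ratios can be worked out from the timings it lists. The sketch below is plain arithmetic on those figures; the helper names `secs`, `slowdown` and `ratios` are illustrative.

```rust
/// Converts a "Xmin Ys" timing into seconds.
pub fn secs(minutes: u32, seconds: u32) -> u32 {
    minutes * 60 + seconds
}

/// How many times longer `other` took compared with `rust`.
pub fn slowdown(rust: u32, other: u32) -> f64 {
    f64::from(other) / f64::from(rust)
}

/// Ratios for the two table rows, as (label, Scala vs Rust, Java vs Rust).
pub fn ratios() -> [(&'static str, f64, f64); 2] {
    let without = secs(0, 37);
    let with = secs(1, 53);
    [
        (
            "without definition",
            slowdown(without, secs(4, 47)),
            slowdown(without, secs(7, 36)),
        ),
        (
            "with definitions",
            slowdown(with, secs(5, 46)),
            slowdown(with, secs(9, 1)),
        ),
    ]
}
```

With these figures, the Rust version is roughly 7.8 times faster than the Scala version and 12.3 times faster than the Java version without definitions, and roughly 3.1 and 4.8 times faster with definitions.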
Some words, for example `con`, are reserved names on Windows systems, but:
File::create("con").expect("Unable to create file");
will not trigger any error. (This is not specific to Rust; Java will not throw an exception either.)
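Because the operating system does not report the problem, a program that names files after words has to detect reserved names itself. The sketch below shows one possible check. `is_windows_reserved` and `safe_file_name` are hypothetical helpers, the underscore prefix is an arbitrary choice, and the list is the commonly documented set of Windows device names, so verify it against your target systems. None of this is taken from dictionary-builder's code.

```rust
/// Device names commonly documented as reserved on Windows.
/// Assumption: this list is sufficient for your targets; extend it if not.
const RESERVED: [&str; 22] = [
    "CON", "PRN", "AUX", "NUL",
    "COM1", "COM2", "COM3", "COM4", "COM5", "COM6", "COM7", "COM8", "COM9",
    "LPT1", "LPT2", "LPT3", "LPT4", "LPT5", "LPT6", "LPT7", "LPT8", "LPT9",
];

/// Returns true if `word` would collide with a reserved device name.
pub fn is_windows_reserved(word: &str) -> bool {
    // Only the part before the first dot matters, so "con.txt" is
    // treated like "con". Trailing spaces are ignored as well.
    let stem = word.split('.').next().unwrap_or(word);
    let stem = stem.trim_end_matches(' ');
    // The comparison is case-insensitive: "con", "Con" and "CON" all match.
    RESERVED.iter().any(|name| name.eq_ignore_ascii_case(stem))
}

/// Returns a file name that avoids the reserved names by prefixing an
/// underscore (an arbitrary convention chosen for this sketch).
pub fn safe_file_name(word: &str) -> String {
    if is_windows_reserved(word) {
        format!("_{}", word)
    } else {
        word.to_string()
    }
}
```

For instance, `safe_file_name("con")` yields `_con`, while `safe_file_name("cone")` returns the word unchanged.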
© 2009-2023 Olivier ROLAND. Distributed under the GPLv3 License.