A port of https://github.com/ffreemt/readmdict
A Rust implementation for reading MDict dictionary files (.mdx format).
To open an MDX file and display basic information:
cargo run example_resources/webster.mdx
Output:
Successfully opened MDX file: example_resources/webster.mdx
Number of entries: 109353
To list the first 10 keys from the dictionary:
cargo run example_resources/webster.mdx --list-keys
Output:
Successfully opened MDX file: example_resources/webster.mdx
Number of entries: 109353
Keys:
1: 12 a.m
2: 12 midnight
3: 12 p.m.
4: 20/20
5: 20/20 hindsight
6: 20 hindsight
7: .22
8: .22s
9: 24-7
10: 24/7
... and 109343 more
To list keys that are alphabetically equal to or greater than a specific word:
cargo run example_resources/webster.mdx --list-keys-since "apple"
Output:
Successfully opened MDX file: example_resources/webster.mdx
Number of entries: 109353
Keys since 'apple':
1: apple
2: apple cheeked
3: apple of someone's eye
4: apple pie
5: apple pies
6: apple polisher
7: apple polishers
8: apple-cheeked
9: apples
10: applesauce
... and 99401 more
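The key filtering shown above could be sketched as follows. This is a minimal illustration, not the tool's actual code: `keys_since` is a hypothetical helper, and it assumes a plain byte-wise comparison (so uppercase keys sort before lowercase ones) and keeps the keys in file order.

```rust
/// Hypothetical helper (not part of the documented API): returns up to
/// `limit` keys that compare byte-wise >= `start`, plus the number of
/// matching keys that were left out.
fn keys_since<'a>(keys: &'a [Vec<u8>], start: &str, limit: usize) -> (Vec<&'a [u8]>, usize) {
    let start = start.as_bytes();
    // Keep every key that is equal to or greater than the starting word.
    let matching: Vec<&[u8]> = keys
        .iter()
        .map(|k| k.as_slice())
        .filter(|k| *k >= start)
        .collect();
    // Show only the first `limit` matches and report how many remain.
    let shown: Vec<&[u8]> = matching.iter().copied().take(limit).collect();
    let remaining = matching.len() - shown.len();
    (shown, remaining)
}
```

The second tuple element corresponds to the "... and N more" line in the output above.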
# Look up the definition of "apple"
cargo run example_resources/webster.mdx --lookup apple
# Look up a resource file in MDD
cargo run resources.mdd --lookup "image.png"
Example output:
Successfully opened MDX file: example_resources/webster.mdx
Number of entries: 109353
Looking up 'apple':
Definition:
<div class="entry">...[HTML content with definition]...</div>
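A lookup can be thought of as a scan over the (key, value) pairs for the first exact match. The sketch below is only an illustration under that assumption: `lookup` is a hypothetical helper, and it takes plain pairs, whereas the planned `items()` iterator yields `Result` values.

```rust
/// Hypothetical helper: scans (key, value) pairs and returns the first
/// value whose key equals `word` byte-for-byte.
fn lookup<I>(items: I, word: &str) -> Option<Vec<u8>>
where
    I: IntoIterator<Item = (Vec<u8>, Vec<u8>)>,
{
    items
        .into_iter()
        .find(|(key, _)| key.as_slice() == word.as_bytes())
        .map(|(_, value)| value)
}
```

A linear scan is the simplest option; an index built from the key list would avoid decoding every record.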
- Read MDX dictionary files
- Extract header information and metadata
- Parse and list dictionary keys
- List keys alphabetically from a specific starting word
- Look up words and display their content from MDX files
- Look up resources and display their content from MDD files
- Support for compressed key blocks (zlib)
- Handle different MDX versions (1.x and 2.x)
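Compressed blocks carry a small header before the payload. The sketch below assumes the layout used by the Python original: a 4-byte type tag, then a 4-byte big-endian Adler-32 of the decompressed data. `Compression`, `compression_type` and `stored_checksum` are illustrative names, and the decompression step itself is left out because it would need the `flate2` crate.

```rust
/// Compression schemes identified by the first four bytes of a block
/// (assumed tag values, taken from the Python original).
#[derive(Debug, PartialEq)]
enum Compression {
    None,
    Lzo,
    Zlib,
}

/// Reads the 4-byte type tag at the start of a block.
fn compression_type(block: &[u8]) -> Option<Compression> {
    match block.get(..4)? {
        [0, 0, 0, 0] => Some(Compression::None),
        [1, 0, 0, 0] => Some(Compression::Lzo),
        [2, 0, 0, 0] => Some(Compression::Zlib),
        _ => None,
    }
}

/// Reads the big-endian checksum stored in bytes 4..8 of a block.
fn stored_checksum(block: &[u8]) -> Option<u32> {
    let b = block.get(4..8)?;
    Some(u32::from_be_bytes([b[0], b[1], b[2], b[3]]))
}

/// Plain Adler-32, to compare against the stored checksum after the
/// payload has been decompressed.
fn adler32(data: &[u8]) -> u32 {
    const MOD: u32 = 65_521;
    let (mut a, mut b) = (1u32, 0u32);
    for &byte in data {
        a = (a + u32::from(byte)) % MOD;
        b = (b + a) % MOD;
    }
    (b << 16) | a
}
```

A mismatch between `stored_checksum` and `adler32` of the decompressed payload would map naturally onto a checksum error.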
cargo build --release
This is a Rust port of the Python readmdict library. The implementation follows a simplified file structure that closely mirrors the original Python codebase.
| Python File | Rust File | Purpose |
|---|---|---|
| `readmdict/__main__.py` | `src/main.rs` | CLI entry point and argument parsing |
| `readmdict/readmdict.py` | `src/readmdict.rs` | Core library with all classes (MDict, MDX, MDD) |
| `readmdict/pureSalsa20.py` | Use `salsa20` crate | Salsa20 encryption (external crate) |
| `readmdict/ripemd128.py` | Use `ripemd` crate | RIPEMD128 hashing (external crate) |
| N/A | `src/lib.rs` | Library entry point (re-exports from `readmdict.rs`) |
| Python Class/Function | Rust Equivalent | Location |
|---|---|---|
| `MDict` (base class) | `struct MDict` | `src/readmdict.rs` |
| `MDX` (inherits `MDict`) | `struct Mdx` | `src/readmdict.rs` |
| `MDD` (inherits `MDict`) | `struct Mdd` | `src/readmdict.rs` |
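Rust has no class inheritance, so the Python subclasses map onto structs that hold an `MDict` and forward calls to it. The following is a reduced sketch of that pattern only: the `MDict` here is a stand-in with a single field, and `from_keys` is a hypothetical constructor used to keep the example self-contained.

```rust
/// Stand-in for the shared base type (the planned struct has many more fields).
struct MDict {
    key_list: Vec<(u64, Vec<u8>)>,
}

impl MDict {
    fn len(&self) -> usize {
        self.key_list.len()
    }

    fn keys(&self) -> impl Iterator<Item = &[u8]> {
        self.key_list.iter().map(|(_, key)| key.as_slice())
    }
}

/// `Mdx` owns an `MDict` instead of inheriting from it.
struct Mdx {
    mdict: MDict,
    substyle: bool,
}

impl Mdx {
    /// Hypothetical constructor for this sketch; the planned `new` reads a file.
    fn from_keys(keys: Vec<Vec<u8>>, substyle: bool) -> Self {
        let key_list = keys
            .into_iter()
            .enumerate()
            .map(|(i, k)| (i as u64, k))
            .collect();
        Mdx { mdict: MDict { key_list }, substyle }
    }

    // Methods the Python subclass would inherit are forwarded explicitly.
    fn len(&self) -> usize {
        self.mdict.len()
    }

    fn keys(&self) -> impl Iterator<Item = &[u8]> {
        self.mdict.keys()
    }

    fn substyle(&self) -> bool {
        self.substyle
    }
}
```

A shared trait would be another way to express the common interface; plain delegation keeps the mapping to the Python classes easy to follow.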
Utility Functions:
| Python Function | Rust Function | Location |
|---|---|---|
| `_unescape_entities(text)` | `unescape_entities(text: &[u8]) -> Vec<u8>` | `src/readmdict.rs` |
| `_fast_decrypt(data, key)` | `fast_decrypt(data: &[u8], key: &[u8]) -> Vec<u8>` | `src/readmdict.rs` |
| `_mdx_decrypt(comp_block)` | `mdx_decrypt(comp_block: &[u8]) -> Result<Vec<u8>>` | `src/readmdict.rs` |
| `_salsa_decrypt(ciphertext, key)` | `salsa_decrypt(ciphertext: &[u8], key: &[u8]) -> Result<Vec<u8>>` | `src/readmdict.rs` |
| `_decrypt_regcode_by_deviceid(regcode, deviceid)` | `decrypt_regcode_by_deviceid(regcode: &[u8], deviceid: &[u8]) -> Result<Vec<u8>>` | `src/readmdict.rs` |
| `_decrypt_regcode_by_email(regcode, email)` | `decrypt_regcode_by_email(regcode: &[u8], email: &[u8]) -> Result<Vec<u8>>` | `src/readmdict.rs` |
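As an illustration of how one of these utilities might be ported, here is a sketch of `fast_decrypt`. It assumes the Python `_fast_decrypt` works as follows: each byte has its nibbles swapped, then is XORed with the previous ciphertext byte (initially `0x36`), the low byte of its index, and the cycling key. `fast_encrypt` is a hypothetical inverse added only so the round trip can be checked.

```rust
/// Sketch of the XOR-based block decryption (assumed algorithm, see above).
/// Panics if `key` is empty while `data` is not.
fn fast_decrypt(data: &[u8], key: &[u8]) -> Vec<u8> {
    let mut previous: u8 = 0x36;
    let mut out = Vec::with_capacity(data.len());
    for (i, &byte) in data.iter().enumerate() {
        // Swap the high and low nibbles of the ciphertext byte.
        let swapped = byte.rotate_left(4);
        // `i as u8` keeps only the low byte of the index.
        let plain = swapped ^ previous ^ (i as u8) ^ key[i % key.len()];
        previous = byte;
        out.push(plain);
    }
    out
}

/// Hypothetical inverse of `fast_decrypt`, used here for testing only.
fn fast_encrypt(plain: &[u8], key: &[u8]) -> Vec<u8> {
    let mut previous: u8 = 0x36;
    let mut out = Vec::with_capacity(plain.len());
    for (i, &byte) in plain.iter().enumerate() {
        let cipher = (byte ^ previous ^ (i as u8) ^ key[i % key.len()]).rotate_left(4);
        previous = cipher;
        out.push(cipher);
    }
    out
}
```

Because each output byte depends on the previous ciphertext byte, the loop has to run sequentially.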
MDict Class Methods:
| Python Method | Rust Method | Purpose |
|---|---|---|
| `__init__(fname, encoding, passcode)` | `new(fname: &str, encoding: Option<String>, passcode: Option<Passcode>) -> Result<Self>` | Constructor |
| `__len__()` | `len(&self) -> usize` | Get number of entries |
| `__iter__()` | `keys(&self) -> impl Iterator<Item = &[u8]>` | Iterator over keys |
| `keys()` | `keys(&self) -> impl Iterator<Item = &[u8]>` | Get dictionary keys |
| `_read_number(f)` | `read_number<R: Read>(&self, reader: &mut R) -> Result<u64>` | Read number from file |
| `_parse_header(header)` | `parse_header(header: &[u8]) -> Result<HashMap<String, String>>` | Parse header attributes |
| `_decode_key_block_info(data)` | `decode_key_block_info(&self, data: &[u8]) -> Result<Vec<(u64, u64)>>` | Decode key block info |
| `_decode_key_block(data, info)` | `decode_key_block(&self, data: &[u8], info: &[(u64, u64)]) -> Result<Vec<(u64, Vec<u8>)>>` | Decode key block |
| `_split_key_block(data)` | `split_key_block(&self, data: &[u8]) -> Result<Vec<(u64, Vec<u8>)>>` | Split key block into entries |
| `_read_header()` | `read_header(&mut self) -> Result<HashMap<String, String>>` | Read and parse file header |
| `_read_keys()` | `read_keys(&mut self) -> Result<Vec<(u64, Vec<u8>)>>` | Read key blocks |
| `_read_keys_brutal()` | `read_keys_brutal(&mut self) -> Result<Vec<(u64, Vec<u8>)>>` | Fallback key reading method |
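The version-dependent number width behind `read_number` can be sketched with the standard library alone. The sketch assumes that versions below 2.0 store 4-byte big-endian numbers and that 2.0 and later store 8-byte ones. It also takes the version as a parameter instead of `&self`, which is a simplification for this example.

```rust
use std::io::{self, Read};

/// Number of bytes used for stored numbers (assumed threshold at 2.0).
fn number_width(version: f32) -> usize {
    if version >= 2.0 { 8 } else { 4 }
}

/// Reads one big-endian number whose width depends on the format version.
fn read_number<R: Read>(reader: &mut R, version: f32) -> io::Result<u64> {
    if number_width(version) == 8 {
        let mut buf = [0u8; 8];
        reader.read_exact(&mut buf)?;
        Ok(u64::from_be_bytes(buf))
    } else {
        let mut buf = [0u8; 4];
        reader.read_exact(&mut buf)?;
        // Widen the 32-bit value so callers always get a u64.
        Ok(u64::from(u32::from_be_bytes(buf)))
    }
}
```

Returning `u64` in both cases lets the rest of the parser ignore the version once a number has been read.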
MDX Class Methods:
| Python Method | Rust Method | Purpose |
|---|---|---|
| `__init__(fname, encoding, substyle, passcode)` | `new(fname: &str, encoding: Option<String>, substyle: bool, passcode: Option<Passcode>) -> Result<Self>` | Constructor |
| `items()` | `items(&self) -> impl Iterator<Item = Result<(Vec<u8>, Vec<u8>)>>` | Iterator over key-value pairs |
| `_substitute_stylesheet(txt)` | `substitute_stylesheet(&self, txt: &str) -> String` | Apply stylesheet substitution |
| `_decode_record_block()` | `decode_record_block(&self) -> impl Iterator<Item = Result<(Vec<u8>, Vec<u8>)>>` | Decode record blocks |
MDD Class Methods:
| Python Method | Rust Method | Purpose |
|---|---|---|
| `__init__(fname, passcode)` | `new(fname: &str, passcode: Option<Passcode>) -> Result<Self>` | Constructor |
| `items()` | `items(&self) -> impl Iterator<Item = Result<(Vec<u8>, Vec<u8>)>>` | Iterator over filename-content pairs |
| `_decode_record_block()` | `decode_record_block(&self) -> impl Iterator<Item = Result<(Vec<u8>, Vec<u8>)>>` | Decode record blocks |
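When MDD entries are written to disk, each filename key has to become a path under the output folder. The sketch below shows one way to do that, assuming keys use backslash separators and may begin with one. `mdd_key_to_path` is a hypothetical helper, and it does not guard against `..` components.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: maps an MDD key such as `\images\a.png` to a path
/// inside `data_folder`.
fn mdd_key_to_path(data_folder: &Path, key: &str) -> PathBuf {
    // Normalize Windows-style separators.
    let normalized = key.replace('\\', "/");
    // `Path::join` discards the base when given an absolute path,
    // so leading separators are stripped first.
    let relative = normalized.trim_start_matches('/');
    data_folder.join(relative)
}
```

Without the trimming step, a key starting with a separator would be written relative to the filesystem root instead of the data folder.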
- 1. Create basic project structure (`src/lib.rs`, `src/main.rs`)
- 2. Implement core readmdict module (`src/readmdict.rs`) containing:
  - 2.1. Utility functions (`unescape_entities`, etc.)
  - 2.2. Crypto functions (`fast_decrypt`, `mdx_decrypt`, `salsa_decrypt`, etc.)
  - 2.3. Base `MDict` struct with all methods
  - 2.4. `Mdx` struct wrapping `MDict` (composition in place of Python inheritance)
  - 2.5. `Mdd` struct wrapping `MDict` (composition in place of Python inheritance)
- 3. Implement CLI interface (`src/main.rs`) matching `__main__.py`
- 4. Update `src/lib.rs` to re-export from `readmdict.rs`
- 5. Add error handling and comprehensive tests
- 6. Add documentation and usage examples
- 7. Performance optimization and benchmarking
src/readmdict.rs (single file containing everything from readmdict.py):
// Imports and dependencies
use std::collections::HashMap;
use std::fs::File;
use std::io::{Read, Seek, SeekFrom, BufReader, Cursor};
use std::path::Path;
use byteorder::{BigEndian, LittleEndian, ReadBytesExt};
use flate2::read::ZlibDecoder;
use regex::bytes::Regex;
use encoding_rs::Encoding;
use salsa20::{Salsa20, StreamCipher};
use ripemd::{Ripemd128, Digest};
use sha2::Sha256;
use adler::adler32;
// Error types
#[derive(Debug, thiserror::Error)]
pub enum Error {
#[error("IO error: {0}")]
Io(#[from] std::io::Error),
#[error("Invalid file format: {0}")]
InvalidFormat(String),
#[error("Unsupported compression type")]
UnsupportedCompression,
#[error("Encryption error: {0}")]
Encryption(String),
#[error("Invalid passcode")]
InvalidPasscode,
#[error("Checksum mismatch")]
ChecksumMismatch,
#[error("Encoding error: {0}")]
Encoding(String),
#[error("Parse error: {0}")]
Parse(String),
}
pub type Result<T> = std::result::Result<T, Error>;
// Utility functions (direct ports from Python)
fn unescape_entities(text: &[u8]) -> Vec<u8> {
// Convert HTML entities like &lt; &gt; &amp; &quot; back to < > & "
// Implementation matches Python _unescape_entities
}
fn fast_decrypt(data: &[u8], key: &[u8]) -> Vec<u8> {
// Simple XOR decryption with key cycling
// Direct port of Python _fast_decrypt
}
fn mdx_decrypt(comp_block: &[u8]) -> Result<Vec<u8>> {
// MDX-specific decryption algorithm
// Direct port of Python _mdx_decrypt
}
fn salsa_decrypt(ciphertext: &[u8], key: &[u8]) -> Result<Vec<u8>> {
// Salsa20 decryption using external crate
// Direct port of Python _salsa_decrypt
}
fn decrypt_regcode_by_deviceid(regcode: &[u8], deviceid: &[u8]) -> Result<Vec<u8>> {
// Device ID based decryption
// Direct port of Python _decrypt_regcode_by_deviceid
}
fn decrypt_regcode_by_email(regcode: &[u8], email: &[u8]) -> Result<Vec<u8>> {
// Email based decryption
// Direct port of Python _decrypt_regcode_by_email
}
// Passcode struct
#[derive(Debug, Clone)]
pub struct Passcode {
pub regcode: Vec<u8>,
pub userid: String,
}
// Base MDict struct (equivalent to Python MDict class)
#[derive(Debug)]
pub struct MDict {
fname: String,
encoding: String,
passcode: Option<Passcode>,
header: HashMap<String, String>,
key_list: Vec<(u64, Vec<u8>)>,
num_entries: usize,
version: f32,
encrypt: u8,
number_width: usize,
key_block_offset: u64,
record_block_offset: u64,
stylesheet: HashMap<String, (String, String)>,
}
impl MDict {
// Constructor - direct port of Python MDict.__init__
pub fn new(fname: &str, encoding: Option<String>, passcode: Option<Passcode>) -> Result<Self> {
// Initialize struct, read header, read keys
// Handle encoding detection and passcode validation
}
// Length - direct port of Python MDict.__len__
pub fn len(&self) -> usize { self.num_entries }
// Keys iterator - direct port of Python MDict.keys
pub fn keys(&self) -> impl Iterator<Item = &[u8]> {
self.key_list.iter().map(|(_, key)| key.as_slice())
}
// Private methods - direct ports from Python
fn read_number<R: Read>(&self, reader: &mut R) -> Result<u64> {
// Read number based on version (4 or 8 bytes)
}
fn parse_header(header: &[u8]) -> Result<HashMap<String, String>> {
// Parse XML-like header attributes
}
fn decode_key_block_info(&self, data: &[u8]) -> Result<Vec<(u64, u64)>> {
// Decode key block compression info
}
fn decode_key_block(&self, data: &[u8], info: &[(u64, u64)]) -> Result<Vec<(u64, Vec<u8>)>> {
// Decompress and decode key blocks
}
fn split_key_block(&self, data: &[u8]) -> Result<Vec<(u64, Vec<u8>)>> {
// Split key block into individual entries
}
fn read_header(&mut self) -> Result<HashMap<String, String>> {
// Read and parse file header
}
fn read_keys(&mut self) -> Result<Vec<(u64, Vec<u8>)>> {
// Read key blocks with encryption support
}
fn read_keys_brutal(&mut self) -> Result<Vec<(u64, Vec<u8>)>> {
// Fallback key reading for problematic files
}
}
// MDX struct (equivalent to Python MDX class)
#[derive(Debug)]
pub struct Mdx {
mdict: MDict,
substyle: bool,
}
impl Mdx {
// Constructor - direct port of Python MDX.__init__
pub fn new(fname: &str, encoding: Option<String>, substyle: bool, passcode: Option<Passcode>) -> Result<Self> {
let mdict = MDict::new(fname, encoding, passcode)?;
Ok(Self { mdict, substyle })
}
// Items iterator - direct port of Python MDX.items
pub fn items(&self) -> impl Iterator<Item = Result<(Vec<u8>, Vec<u8>)>> {
self.decode_record_block()
}
// Stylesheet substitution - direct port of Python MDX._substitute_stylesheet
fn substitute_stylesheet(&self, txt: &str) -> String {
// Apply stylesheet definitions to text
}
// Record block decoder - direct port of Python MDX._decode_record_block
fn decode_record_block(&self) -> impl Iterator<Item = Result<(Vec<u8>, Vec<u8>)>> {
// Decode and decompress record blocks, apply encoding and stylesheet
}
// Delegate methods to MDict
pub fn len(&self) -> usize { self.mdict.len() }
pub fn keys(&self) -> impl Iterator<Item = &[u8]> { self.mdict.keys() }
pub fn header(&self) -> &HashMap<String, String> { &self.mdict.header }
}
// MDD struct (equivalent to Python MDD class)
#[derive(Debug)]
pub struct Mdd {
mdict: MDict,
}
impl Mdd {
// Constructor - direct port of Python MDD.__init__
pub fn new(fname: &str, passcode: Option<Passcode>) -> Result<Self> {
let mdict = MDict::new(fname, Some("UTF-16".to_string()), passcode)?;
Ok(Self { mdict })
}
// Items iterator - direct port of Python MDD.items
pub fn items(&self) -> impl Iterator<Item = Result<(Vec<u8>, Vec<u8>)>> {
self.decode_record_block()
}
// Record block decoder - direct port of Python MDD._decode_record_block
fn decode_record_block(&self) -> impl Iterator<Item = Result<(Vec<u8>, Vec<u8>)>> {
// Decode and decompress record blocks for binary data
}
// Delegate methods to MDict
pub fn len(&self) -> usize { self.mdict.len() }
pub fn keys(&self) -> impl Iterator<Item = &[u8]> { self.mdict.keys() }
pub fn header(&self) -> &HashMap<String, String> { &self.mdict.header }
}
src/lib.rs (simple re-export):
mod readmdict;
pub use readmdict::*;
src/main.rs (direct port of `__main__.py`):
use clap::Parser;
use rust_readmdict::*;
use std::path::Path;
use std::fs;
use std::io::Write;
#[derive(Parser)]
#[command(name = "readmdict")]
#[command(about = "A Rust implementation of readmdict for reading MDict dictionary files")]
struct Args {
#[arg(short = 'x', long, help = "extract mdx to source format and extract files from mdd")]
extract: bool,
#[arg(short = 's', long, help = "substitute style definition if present")]
substyle: bool,
#[arg(short = 'd', long, default_value = "data", help = "folder to extract data files from mdd")]
datafolder: String,
#[arg(short = 'e', long, default_value = "", help = "encoding for the dictionary")]
encoding: String,
#[arg(short = 'p', long, help = "passcode in format: register_code,email_or_deviceid")]
passcode: Option<String>,
#[arg(help = "mdx file name")]
filename: Option<String>,
}
fn parse_passcode(s: &str) -> Result<Passcode> {
// Parse passcode string in format "regcode,userid"
let parts: Vec<&str> = s.split(',').collect();
if parts.len() != 2 {
return Err(Error::InvalidPasscode);
}
Ok(Passcode {
regcode: hex::decode(parts[0]).map_err(|_| Error::InvalidPasscode)?,
userid: parts[1].to_string(),
})
}
fn main() -> Result<()> {
let args = Args::parse();
// Handle file selection (GUI fallback would require additional crate)
let filename = match args.filename {
Some(f) => f,
None => {
eprintln!("Please specify a valid MDX/MDD file");
std::process::exit(1);
}
};
if !Path::new(&filename).exists() {
eprintln!("Please specify a valid MDX/MDD file");
std::process::exit(1);
}
let base = Path::new(&filename).file_stem().unwrap().to_str().unwrap();
let ext = Path::new(&filename).extension().unwrap_or_default().to_str().unwrap();
// Parse passcode if provided
let passcode = args.passcode.as_ref()
.map(|s| parse_passcode(s))
.transpose()?;
// Handle MDX files
let mdx = if ext.to_lowercase() == "mdx" {
let encoding = if args.encoding.is_empty() { None } else { Some(args.encoding.clone()) };
let mdx = Mdx::new(&filename, encoding, args.substyle, passcode.clone())?;
println!("======== {} ========", filename);
println!(" Number of Entries : {}", mdx.len());
for (key, value) in mdx.header() {
println!(" {} : {}", key, value);
}
Some(mdx)
} else {
None
};
// Handle MDD files
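// Look for the companion .mdd next to the input file (keeps the directory part of the path)
let mdd_filename = Path::new(&filename).with_extension("mdd").to_string_lossy().into_owned();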
let mdd = if Path::new(&mdd_filename).exists() {
let mdd = Mdd::new(&mdd_filename, passcode)?;
println!("======== {} ========", mdd_filename);
println!(" Number of Entries : {}", mdd.len());
for (key, value) in mdd.header() {
println!(" {} : {}", key, value);
}
Some(mdd)
} else {
None
};
// Extract files if requested
if args.extract {
// Extract MDX to text file
if let Some(mdx) = &mdx {
let output_filename = format!("{}.txt", base);
let mut file = fs::File::create(&output_filename)?;
for item in mdx.items() {
let (key, value) = item?;
file.write_all(&key)?;
file.write_all(b"\r\n")?;
file.write_all(&value)?;
if !value.ends_with(b"\n") {
file.write_all(b"\r\n")?;
}
file.write_all(b"</>\r\n")?;
}
// Extract stylesheet if present
if let Some(stylesheet) = mdx.header().get("StyleSheet") {
let style_filename = format!("{}_style.txt", base);
fs::write(&style_filename, stylesheet.replace('\n', "\r\n"))?;
}
}
// Extract MDD data files
if let Some(mdd) = &mdd {
let data_folder = Path::new(&filename).parent().unwrap().join(&args.datafolder);
fs::create_dir_all(&data_folder)?;
for item in mdd.items() {
let (key, value) = item?;
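let filename = String::from_utf8_lossy(&key).replace('\\', "/");
// Strip leading separators: joining an absolute path would discard data_folder
let file_path = data_folder.join(filename.trim_start_matches('/'));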
if let Some(parent) = file_path.parent() {
fs::create_dir_all(parent)?;
}
fs::write(&file_path, &value)?;
}
}
}
Ok(())
}
Key Differences from Python:
- Error Handling: Rust uses `Result<T, E>` instead of exceptions
- Memory Management: No garbage collection, explicit ownership
- String Handling: Distinction between `String`, `&str`, and `Vec<u8>`
- Iterator Patterns: Rust iterators are lazy and zero-cost
- File I/O: More explicit error handling required
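The error-handling difference can be made concrete with a standard-library-only sketch. It hand-writes what `thiserror` would derive, and it assumes the file starts with a 4-byte big-endian header length. `read_header_len` is a hypothetical function, and the `Error` enum is a cut-down version of the planned one.

```rust
use std::fmt;
use std::io::{self, Read};

/// Cut-down error type; the planned enum has more variants.
#[derive(Debug)]
enum Error {
    Io(io::Error),
    InvalidFormat(String),
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self {
            Error::Io(e) => write!(f, "IO error: {}", e),
            Error::InvalidFormat(msg) => write!(f, "Invalid file format: {}", msg),
        }
    }
}

impl std::error::Error for Error {}

// Lets `?` convert an `io::Error` into `Error::Io` automatically.
impl From<io::Error> for Error {
    fn from(e: io::Error) -> Self {
        Error::Io(e)
    }
}

/// Hypothetical function: reads the header length, where Python would
/// raise an exception on a short read or a bad value.
fn read_header_len<R: Read>(reader: &mut R) -> Result<u32, Error> {
    let mut buf = [0u8; 4];
    reader.read_exact(&mut buf)?;
    let len = u32::from_be_bytes(buf);
    if len == 0 {
        return Err(Error::InvalidFormat("empty header".to_string()));
    }
    Ok(len)
}
```

Every failure path is visible in the return type, so callers decide whether to propagate with `?` or handle the variant directly.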
External Crate Dependencies:
- `clap`: Command-line argument parsing (replaces `argparse`)
- `flate2`: Zlib compression (replaces `zlib`)
- `salsa20`: Salsa20 encryption (replaces `pureSalsa20.py`)
- `ripemd`: RIPEMD128 hashing (replaces `ripemd128.py`)
- `encoding_rs`: Text encoding support
- `regex`: Regular expressions for parsing
- `byteorder`: Binary data reading
- `thiserror`: Error type derivation
- `hex`: Hexadecimal encoding/decoding
- `adler`: Adler32 checksums
Performance Optimizations:
- Zero-copy where possible: Use `&[u8]` slices instead of `Vec<u8>` when data doesn't need to be owned
- Streaming iterators: Process records on-demand instead of loading everything into memory
- Efficient string handling: Use `Cow<str>` for strings that might not need allocation
- Memory mapping: Consider using `memmap2` for large files
- Parallel processing: Use `rayon` for CPU-intensive operations like decompression
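The `Cow<str>` idea can be illustrated with entity unescaping, which only needs to allocate when an entity is actually present. This is a sketch under assumptions: it works on `&str`, unlike the byte-slice signature planned for `unescape_entities`, and it handles only the four entities `&lt;`, `&gt;`, `&quot;` and `&amp;`.

```rust
use std::borrow::Cow;

/// Returns the input unchanged (borrowed) when it contains no `&`,
/// and an owned, unescaped copy otherwise.
fn unescape_entities(text: &str) -> Cow<'_, str> {
    if !text.contains('&') {
        // Fast path: nothing to replace, so no allocation is needed.
        return Cow::Borrowed(text);
    }
    // `&amp;` is replaced last so that `&amp;lt;` becomes `&lt;`, not `<`.
    Cow::Owned(
        text.replace("&lt;", "<")
            .replace("&gt;", ">")
            .replace("&quot;", "\"")
            .replace("&amp;", "&"),
    )
}
```

Callers can treat the result as a `&str` through deref and only pay for a `String` when one was actually built.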
Testing Strategy:
- Unit tests: Test each utility function and method individually
- Integration tests: Test with real MDX/MDD files
- Property-based tests: Use `proptest` for edge cases
- Benchmark tests: Compare performance with Python implementation
- Compatibility tests: Ensure output matches Python version exactly