Renkai/rust-readmdict

rust-readmdict

A port of https://github.com/ffreemt/readmdict

A Rust implementation for reading MDict files: .mdx dictionaries and .mdd resource archives.

Usage

Basic Usage

To open an MDX file and display basic information:

cargo run example_resources/webster.mdx

Output:

Successfully opened MDX file: example_resources/webster.mdx
Number of entries: 109353

List Keys

To list the first 10 keys from the dictionary:

cargo run example_resources/webster.mdx --list-keys

Output:

Successfully opened MDX file: example_resources/webster.mdx
Number of entries: 109353

Keys:
1: 12 a.m
2: 12 midnight
3: 12 p.m.
4: 20/20
5: 20/20 hindsight
6: 20 hindsight
7: .22
8: .22s
9: 24-7
10: 24/7
... and 109343 more

List Keys Since a Word

To list the keys that sort at or after a given word (the first 10 matches are printed):

cargo run example_resources/webster.mdx --list-keys-since "apple"

Output:

Successfully opened MDX file: example_resources/webster.mdx
Number of entries: 109353

Keys since 'apple':
1: apple
2: apple cheeked
3: apple of someone's eye
4: apple pie
5: apple pies
6: apple polisher
7: apple polishers
8: apple-cheeked
9: apples
10: applesauce
... and 99401 more
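One way to get this behaviour is to filter the key list and sort what remains. The sketch below is illustrative only: `keys_since` is a hypothetical helper, not a function exported by this crate, and it assumes plain byte-order comparison, which may differ from the ordering the CLI actually applies.

```rust
/// Hypothetical helper: return every key that compares greater than or
/// equal to `start` in byte order, sorted ascending.
fn keys_since<'a>(keys: &[&'a [u8]], start: &[u8]) -> Vec<&'a [u8]> {
    let mut matches: Vec<&'a [u8]> = keys
        .iter()
        .copied()
        // Slices compare lexicographically, byte by byte.
        .filter(|key| *key >= start)
        .collect();
    // Sort so the result is ordered even if the input list is not.
    matches.sort_unstable();
    matches
}
```

Filtering before sorting keeps the sort limited to the matching keys. If the key list is already known to be sorted in byte order, a binary search for the starting position would avoid the full scan.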

Look up a word and show its content

# Look up the definition of "apple"
cargo run example_resources/webster.mdx --lookup apple

# Look up a resource file in MDD
cargo run resources.mdd --lookup "image.png"

Example output:

Successfully opened MDX file: example_resources/webster.mdx
Number of entries: 109353

Looking up 'apple':
Definition:
<div class="entry">...[HTML content with definition]...</div>

Features

  • Read MDX dictionary files
  • Extract header information and metadata
  • Parse and list dictionary keys
  • List keys alphabetically from a specific starting word
  • Look up words and display their content from MDX files
  • Look up resources and display their content from MDD files
  • Support for compressed key blocks (zlib)
  • Handle different MDX versions (1.x and 2.x)

Building

cargo build --release

Implementation Details

This is a Rust port of the Python readmdict library. The implementation follows a simplified file structure that closely mirrors the original Python codebase.

File Structure Mapping

| Python File | Rust File | Purpose |
|---|---|---|
| readmdict/__main__.py | src/main.rs | CLI entry point and argument parsing |
| readmdict/readmdict.py | src/readmdict.rs | Core library with all classes (MDict, MDX, MDD) |
| readmdict/pureSalsa20.py | Use salsa20 crate | Salsa20 encryption (external crate) |
| readmdict/ripemd128.py | Use ripemd crate | RIPEMD-128 hashing (external crate) |
| N/A | src/lib.rs | Library entry point (re-exports from readmdict.rs) |

Core Classes

| Python Class/Function | Rust Equivalent | Location |
|---|---|---|
| MDict (base class) | struct MDict | src/readmdict.rs |
| MDX (inherits MDict) | struct Mdx | src/readmdict.rs |
| MDD (inherits MDict) | struct Mdd | src/readmdict.rs |

Method-to-Method Mapping

Utility Functions:

| Python Function | Rust Function | Location |
|---|---|---|
| _unescape_entities(text) | unescape_entities(text: &[u8]) -> Vec<u8> | src/readmdict.rs |
| _fast_decrypt(data, key) | fast_decrypt(data: &[u8], key: &[u8]) -> Vec<u8> | src/readmdict.rs |
| _mdx_decrypt(comp_block) | mdx_decrypt(comp_block: &[u8]) -> Result<Vec<u8>> | src/readmdict.rs |
| _salsa_decrypt(ciphertext, key) | salsa_decrypt(ciphertext: &[u8], key: &[u8]) -> Result<Vec<u8>> | src/readmdict.rs |
| _decrypt_regcode_by_deviceid(regcode, deviceid) | decrypt_regcode_by_deviceid(regcode: &[u8], deviceid: &[u8]) -> Result<Vec<u8>> | src/readmdict.rs |
| _decrypt_regcode_by_email(regcode, email) | decrypt_regcode_by_email(regcode: &[u8], email: &[u8]) -> Result<Vec<u8>> | src/readmdict.rs |
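As an illustration of what porting `fast_decrypt` involves, here is a minimal sketch. The per-byte steps (nibble swap, chaining on the previous ciphertext byte, initial value 0x36) are an assumption based on the Python `_fast_decrypt` and should be checked against it; the empty-key guard is an addition that the Python version does not have.

```rust
/// Sketch of the byte-wise transform, assumed to mirror Python's `_fast_decrypt`.
fn fast_decrypt(data: &[u8], key: &[u8]) -> Vec<u8> {
    // An empty key would make `i % key.len()` panic, so return the input unchanged.
    if key.is_empty() {
        return data.to_vec();
    }
    // Assumed initial chaining value, taken from the Python original.
    let mut previous: u8 = 0x36;
    let mut out = Vec::with_capacity(data.len());
    for (i, &byte) in data.iter().enumerate() {
        // Rotating by 4 bits swaps the high and low nibbles.
        let swapped = byte.rotate_left(4);
        // XOR with the previous ciphertext byte, the low 8 bits of the
        // index, and the key byte for this position (the key repeats).
        out.push(swapped ^ previous ^ (i as u8) ^ key[i % key.len()]);
        previous = byte;
    }
    out
}
```

Because each output byte depends on the previous input byte, the loop cannot be replaced by a plain element-wise XOR over the key.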

MDict Class Methods:

| Python Method | Rust Method | Purpose |
|---|---|---|
| __init__(fname, encoding, passcode) | new(fname: &str, encoding: Option<String>, passcode: Option<Passcode>) -> Result<Self> | Constructor |
| __len__() | len(&self) -> usize | Get number of entries |
| __iter__() | keys(&self) -> impl Iterator<Item = &[u8]> | Iterator over keys |
| keys() | keys(&self) -> impl Iterator<Item = &[u8]> | Get dictionary keys |
| _read_number(f) | read_number<R: Read>(&self, reader: &mut R) -> Result<u64> | Read number from file |
| _parse_header(header) | parse_header(header: &[u8]) -> Result<HashMap<String, String>> | Parse header attributes |
| _decode_key_block_info(data) | decode_key_block_info(&self, data: &[u8]) -> Result<Vec<(u64, u64)>> | Decode key block info |
| _decode_key_block(data, info) | decode_key_block(&self, data: &[u8], info: &[(u64, u64)]) -> Result<Vec<(u64, Vec<u8>)>> | Decode key block |
| _split_key_block(data) | split_key_block(&self, data: &[u8]) -> Result<Vec<(u64, Vec<u8>)>> | Split key block into entries |
| _read_header() | read_header(&mut self) -> Result<HashMap<String, String>> | Read and parse file header |
| _read_keys() | read_keys(&mut self) -> Result<Vec<(u64, Vec<u8>)>> | Read key blocks |
| _read_keys_brutal() | read_keys_brutal(&mut self) -> Result<Vec<(u64, Vec<u8>)>> | Fallback key reading method |
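The `read_number` entry depends on the dictionary version: numbers are 4 bytes wide in 1.x files and 8 bytes wide in 2.x files. A rough sketch follows, written as a hypothetical free function that takes the version as a parameter instead of reading it from `self`. Big-endian byte order is assumed from the Python original.

```rust
use std::io::{self, Read};

/// Hypothetical stand-alone variant of `read_number`: reads one length or
/// offset field whose width depends on the MDict format version.
fn read_number<R: Read>(reader: &mut R, version: f32) -> io::Result<u64> {
    if version >= 2.0 {
        // Version 2.x: 8-byte field (big-endian assumed).
        let mut buf = [0u8; 8];
        reader.read_exact(&mut buf)?;
        Ok(u64::from_be_bytes(buf))
    } else {
        // Version 1.x: 4-byte field, widened to u64 for a uniform return type.
        let mut buf = [0u8; 4];
        reader.read_exact(&mut buf)?;
        Ok(u64::from(u32::from_be_bytes(buf)))
    }
}
```

Returning `u64` in both cases lets the rest of the parser ignore the width difference once the value has been read.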

MDX Class Methods:

| Python Method | Rust Method | Purpose |
|---|---|---|
| __init__(fname, encoding, substyle, passcode) | new(fname: &str, encoding: Option<String>, substyle: bool, passcode: Option<Passcode>) -> Result<Self> | Constructor |
| items() | items(&self) -> impl Iterator<Item = Result<(Vec<u8>, Vec<u8>)>> | Iterator over key-value pairs |
| _substitute_stylesheet(txt) | substitute_stylesheet(&self, txt: &str) -> String | Apply stylesheet substitution |
| _decode_record_block() | decode_record_block(&self) -> impl Iterator<Item = Result<(Vec<u8>, Vec<u8>)>> | Decode record blocks |

MDD Class Methods:

| Python Method | Rust Method | Purpose |
|---|---|---|
| __init__(fname, passcode) | new(fname: &str, passcode: Option<Passcode>) -> Result<Self> | Constructor |
| items() | items(&self) -> impl Iterator<Item = Result<(Vec<u8>, Vec<u8>)>> | Iterator over filename-content pairs |
| _decode_record_block() | decode_record_block(&self) -> impl Iterator<Item = Result<(Vec<u8>, Vec<u8>)>> | Decode record blocks |

Implementation Checklist

  • 1. Create basic project structure (src/lib.rs, src/main.rs)
  • 2. Implement core readmdict module (src/readmdict.rs) containing:
    • 2.1. Utility functions (unescape_entities, etc.)
    • 2.2. Crypto functions (fast_decrypt, mdx_decrypt, salsa_decrypt, etc.)
    • 2.3. Base MDict struct with all methods
    • 2.4. Mdx struct wrapping MDict (composition, since Rust has no inheritance)
    • 2.5. Mdd struct wrapping MDict (composition, since Rust has no inheritance)
  • 3. Implement CLI interface (src/main.rs) matching __main__.py
  • 4. Update src/lib.rs to re-export from readmdict.rs
  • 5. Add error handling and comprehensive tests
  • 6. Add documentation and usage examples
  • 7. Performance optimization and benchmarking

Detailed Structure Plan

src/readmdict.rs (single file containing everything from readmdict.py):

// Imports and dependencies
use std::collections::HashMap;
use std::fs::File;
use std::io::{Read, Seek, SeekFrom, BufReader, Cursor};
use std::path::Path;
use byteorder::{BigEndian, LittleEndian, ReadBytesExt};
use flate2::read::ZlibDecoder;
use regex::bytes::Regex;
use encoding_rs::Encoding;
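// Assumes salsa20 0.10, where the cipher traits live under `salsa20::cipher`.
// Salsa8 is the 8-round variant; the Python original passes rounds=8.
use salsa20::cipher::{KeyIvInit, StreamCipher};
use salsa20::Salsa8;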
use ripemd::{Ripemd128, Digest};
use sha2::Sha256;
use adler::adler32;

// Error types
#[derive(Debug, thiserror::Error)]
pub enum Error {
    #[error("IO error: {0}")]
    Io(#[from] std::io::Error),
    #[error("Invalid file format: {0}")]
    InvalidFormat(String),
    #[error("Unsupported compression type")]
    UnsupportedCompression,
    #[error("Encryption error: {0}")]
    Encryption(String),
    #[error("Invalid passcode")]
    InvalidPasscode,
    #[error("Checksum mismatch")]
    ChecksumMismatch,
    #[error("Encoding error: {0}")]
    Encoding(String),
    #[error("Parse error: {0}")]
    Parse(String),
}

pub type Result<T> = std::result::Result<T, Error>;

// Utility functions (direct ports from Python); bodies are stubs until ported
fn unescape_entities(text: &[u8]) -> Vec<u8> {
    // Convert the entities &lt; &gt; &quot; &amp; back to < > " &
    // Implementation matches Python _unescape_entities
    todo!()
}

fn fast_decrypt(data: &[u8], key: &[u8]) -> Vec<u8> {
    // Per byte: swap nibbles, then XOR with the previous ciphertext byte,
    // the byte index, and the key byte (the key repeats)
    // Direct port of Python _fast_decrypt
    todo!()
}

fn mdx_decrypt(comp_block: &[u8]) -> Result<Vec<u8>> {
    // Derive a RIPEMD-128 key from bytes 4..8 of the block, then run
    // fast_decrypt over everything after the first 8 bytes
    // Direct port of Python _mdx_decrypt
    todo!()
}

fn salsa_decrypt(ciphertext: &[u8], key: &[u8]) -> Result<Vec<u8>> {
    // Salsa20 decryption using external crate
    // Direct port of Python _salsa_decrypt
    todo!()
}

fn decrypt_regcode_by_deviceid(regcode: &[u8], deviceid: &[u8]) -> Result<Vec<u8>> {
    // Device ID based decryption of the registration code
    // Direct port of Python _decrypt_regcode_by_deviceid
    todo!()
}

fn decrypt_regcode_by_email(regcode: &[u8], email: &[u8]) -> Result<Vec<u8>> {
    // Email based decryption of the registration code
    // Direct port of Python _decrypt_regcode_by_email
    todo!()
}

// Passcode struct
#[derive(Debug, Clone)]
pub struct Passcode {
    pub regcode: Vec<u8>,
    pub userid: String,
}

// Base MDict struct (equivalent to Python MDict class)
#[derive(Debug)]
pub struct MDict {
    fname: String,
    encoding: String,
    passcode: Option<Passcode>,
    header: HashMap<String, String>,
    key_list: Vec<(u64, Vec<u8>)>,
    num_entries: usize,
    version: f32,
    encrypt: u8,
    number_width: usize,
    key_block_offset: u64,
    record_block_offset: u64,
    stylesheet: HashMap<String, (String, String)>,
}

impl MDict {
    // Constructor - direct port of Python MDict.__init__
    pub fn new(fname: &str, encoding: Option<String>, passcode: Option<Passcode>) -> Result<Self> {
        // Initialize struct, read header, read keys
        // Handle encoding detection and passcode validation
        todo!()
    }
    
    // Length - direct port of Python MDict.__len__
    pub fn len(&self) -> usize { self.num_entries }
    
    // Keys iterator - direct port of Python MDict.keys
    pub fn keys(&self) -> impl Iterator<Item = &[u8]> {
        self.key_list.iter().map(|(_, key)| key.as_slice())
    }
    
    // Private methods - direct ports from Python; bodies are stubs until ported
    fn read_number<R: Read>(&self, reader: &mut R) -> Result<u64> {
        // Read a number whose width depends on the version:
        // 4 bytes before 2.0, 8 bytes from 2.0 onward
        todo!()
    }
    
    fn parse_header(header: &[u8]) -> Result<HashMap<String, String>> {
        // Parse XML-like header attributes
        todo!()
    }
    
    fn decode_key_block_info(&self, data: &[u8]) -> Result<Vec<(u64, u64)>> {
        // Decode key block compression info
        todo!()
    }
    
    fn decode_key_block(&self, data: &[u8], info: &[(u64, u64)]) -> Result<Vec<(u64, Vec<u8>)>> {
        // Decompress and decode key blocks
        todo!()
    }
    
    fn split_key_block(&self, data: &[u8]) -> Result<Vec<(u64, Vec<u8>)>> {
        // Split key block into individual entries
        todo!()
    }
    
    fn read_header(&mut self) -> Result<HashMap<String, String>> {
        // Read and parse file header
        todo!()
    }
    
    fn read_keys(&mut self) -> Result<Vec<(u64, Vec<u8>)>> {
        // Read key blocks with encryption support
        todo!()
    }
    
    fn read_keys_brutal(&mut self) -> Result<Vec<(u64, Vec<u8>)>> {
        // Fallback key reading for problematic files
        todo!()
    }
}

// MDX struct (equivalent to Python MDX class)
#[derive(Debug)]
pub struct Mdx {
    mdict: MDict,
    substyle: bool,
}

impl Mdx {
    // Constructor - direct port of Python MDX.__init__
    pub fn new(fname: &str, encoding: Option<String>, substyle: bool, passcode: Option<Passcode>) -> Result<Self> {
        let mdict = MDict::new(fname, encoding, passcode)?;
        Ok(Self { mdict, substyle })
    }
    
    // Items iterator - direct port of Python MDX.items
    pub fn items(&self) -> impl Iterator<Item = Result<(Vec<u8>, Vec<u8>)>> {
        self.decode_record_block()
    }
    
    // Stylesheet substitution - direct port of Python MDX._substitute_stylesheet
    fn substitute_stylesheet(&self, txt: &str) -> String {
        // Apply stylesheet definitions to text
        todo!()
    }
    
    // Record block decoder - direct port of Python MDX._decode_record_block
    fn decode_record_block(&self) -> impl Iterator<Item = Result<(Vec<u8>, Vec<u8>)>> {
        // Decode and decompress record blocks, apply encoding and stylesheet
        // Placeholder that type-checks until the real decoder is ported
        std::iter::empty()
    }
    
    // Delegate methods to MDict
    pub fn len(&self) -> usize { self.mdict.len() }
    pub fn keys(&self) -> impl Iterator<Item = &[u8]> { self.mdict.keys() }
    pub fn header(&self) -> &HashMap<String, String> { &self.mdict.header }
}

// MDD struct (equivalent to Python MDD class)
#[derive(Debug)]
pub struct Mdd {
    mdict: MDict,
}

impl Mdd {
    // Constructor - direct port of Python MDD.__init__
    pub fn new(fname: &str, passcode: Option<Passcode>) -> Result<Self> {
        let mdict = MDict::new(fname, Some("UTF-16".to_string()), passcode)?;
        Ok(Self { mdict })
    }
    
    // Items iterator - direct port of Python MDD.items
    pub fn items(&self) -> impl Iterator<Item = Result<(Vec<u8>, Vec<u8>)>> {
        self.decode_record_block()
    }
    
    // Record block decoder - direct port of Python MDD._decode_record_block
    fn decode_record_block(&self) -> impl Iterator<Item = Result<(Vec<u8>, Vec<u8>)>> {
        // Decode and decompress record blocks for binary data
        // Placeholder that type-checks until the real decoder is ported
        std::iter::empty()
    }
    
    // Delegate methods to MDict
    pub fn len(&self) -> usize { self.mdict.len() }
    pub fn keys(&self) -> impl Iterator<Item = &[u8]> { self.mdict.keys() }
    pub fn header(&self) -> &HashMap<String, String> { &self.mdict.header }
}
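The `unescape_entities` stub in the plan above could be filled in along these lines. This is a sketch using only the standard library: `replace_bytes` is a hypothetical helper introduced here, and the replacement order (with `&amp;` last) is assumed to match the Python original.

```rust
/// Hypothetical helper: replace every non-overlapping occurrence of `from`
/// with `to`, scanning left to right. Byte slices have no built-in `replace`.
fn replace_bytes(input: &[u8], from: &[u8], to: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(input.len());
    let mut i = 0;
    while i < input.len() {
        // The emptiness check prevents an infinite loop on an empty pattern.
        if !from.is_empty() && input[i..].starts_with(from) {
            out.extend_from_slice(to);
            i += from.len();
        } else {
            out.push(input[i]);
            i += 1;
        }
    }
    out
}

/// Sketch of `unescape_entities`: undo the four entities used in MDX headers.
fn unescape_entities(text: &[u8]) -> Vec<u8> {
    let text = replace_bytes(text, b"&lt;", b"<");
    let text = replace_bytes(&text, b"&gt;", b">");
    let text = replace_bytes(&text, b"&quot;", b"\"");
    // `&amp;` goes last so that `&amp;lt;` ends up as the literal text `&lt;`.
    replace_bytes(&text, b"&amp;", b"&")
}
```

Four passes over a short header are cheap. A single-pass version would avoid the intermediate allocations if this ever shows up in a profile.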

src/lib.rs (simple re-export):

mod readmdict;
pub use readmdict::*;

src/main.rs (direct port of __main__.py):

use clap::Parser;
use rust_readmdict::*;
use std::path::Path;
use std::fs;
use std::io::Write;

#[derive(Parser)]
#[command(name = "readmdict")]
#[command(about = "A Rust implementation of readmdict for reading MDict dictionary files")]
struct Args {
    #[arg(short = 'x', long, help = "extract mdx to source format and extract files from mdd")]
    extract: bool,
    
    #[arg(short = 's', long, help = "substitute style definition if present")]
    substyle: bool,
    
    #[arg(short = 'd', long, default_value = "data", help = "folder to extract data files from mdd")]
    datafolder: String,
    
    #[arg(short = 'e', long, default_value = "", help = "encoding for the dictionary")]
    encoding: String,
    
    #[arg(short = 'p', long, help = "passcode in format: register_code,email_or_deviceid")]
    passcode: Option<String>,
    
    #[arg(help = "mdx file name")]
    filename: Option<String>,
}

fn parse_passcode(s: &str) -> Result<Passcode> {
    // Parse passcode string in format "regcode,userid"
    let parts: Vec<&str> = s.split(',').collect();
    if parts.len() != 2 {
        return Err(Error::InvalidPasscode);
    }
    Ok(Passcode {
        regcode: hex::decode(parts[0]).map_err(|_| Error::InvalidPasscode)?,
        userid: parts[1].to_string(),
    })
}

fn main() -> Result<()> {
    let args = Args::parse();
    
    // Handle file selection (GUI fallback would require additional crate)
    let filename = match args.filename {
        Some(f) => f,
        None => {
            eprintln!("Please specify a valid MDX/MDD file");
            std::process::exit(1);
        }
    };
    
    if !Path::new(&filename).exists() {
        eprintln!("Please specify a valid MDX/MDD file");
        std::process::exit(1);
    }
    
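    // Like Python's os.path.splitext: keep the directory part so the companion
    // .mdd and the extracted files sit next to the input file
    let base = Path::new(&filename).with_extension("");
    let base = base.to_string_lossy();
    let ext = Path::new(&filename).extension().and_then(|e| e.to_str()).unwrap_or("");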
    
    // Parse passcode if provided
    let passcode = args.passcode.as_ref()
        .map(|s| parse_passcode(s))
        .transpose()?;
    
    // Handle MDX files
    let mdx = if ext.to_lowercase() == "mdx" {
        let encoding = if args.encoding.is_empty() { None } else { Some(args.encoding.clone()) };
        let mdx = Mdx::new(&filename, encoding, args.substyle, passcode.clone())?;
        
        println!("======== {} ========", filename);
        println!("  Number of Entries : {}", mdx.len());
        for (key, value) in mdx.header() {
            println!("  {} : {}", key, value);
        }
        Some(mdx)
    } else {
        None
    };
    
    // Handle MDD files
    let mdd_filename = format!("{}.mdd", base);
    let mdd = if Path::new(&mdd_filename).exists() {
        let mdd = Mdd::new(&mdd_filename, passcode)?;
        
        println!("======== {} ========", mdd_filename);
        println!("  Number of Entries : {}", mdd.len());
        for (key, value) in mdd.header() {
            println!("  {} : {}", key, value);
        }
        Some(mdd)
    } else {
        None
    };
    
    // Extract files if requested
    if args.extract {
        // Extract MDX to text file
        if let Some(mdx) = &mdx {
            let output_filename = format!("{}.txt", base);
            let mut file = fs::File::create(&output_filename)?;
            
            for item in mdx.items() {
                let (key, value) = item?;
                file.write_all(&key)?;
                file.write_all(b"\r\n")?;
                file.write_all(&value)?;
                if !value.ends_with(b"\n") {
                    file.write_all(b"\r\n")?;
                }
                file.write_all(b"</>\r\n")?;
            }
            
            // Extract stylesheet if present
            if let Some(stylesheet) = mdx.header().get("StyleSheet") {
                let style_filename = format!("{}_style.txt", base);
                fs::write(&style_filename, stylesheet.replace('\n', "\r\n"))?;
            }
        }
        
        // Extract MDD data files
        if let Some(mdd) = &mdd {
            let data_folder = Path::new(&filename).parent().unwrap().join(&args.datafolder);
            fs::create_dir_all(&data_folder)?;
            
            for item in mdd.items() {
                let (key, value) = item?;
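                // Strip leading separators: `join` would otherwise treat the key
                // as an absolute path and discard `data_folder`
                let normalized = String::from_utf8_lossy(&key).replace('\\', "/");
                let file_path = data_folder.join(normalized.trim_start_matches('/'));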
                
                if let Some(parent) = file_path.parent() {
                    fs::create_dir_all(parent)?;
                }
                
                fs::write(&file_path, &value)?;
            }
        }
    }
    
    Ok(())
}
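One detail in the MDD extraction step deserves care: `Path::join` replaces the base path entirely when its argument is absolute. The sketch below shows a way to map a resource key to a path inside the data folder. `resource_path` is a hypothetical helper, and the assumption that keys use backslashes and may start with one comes from how the Python original builds file names.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: map an MDD resource key such as `\img\a.png`
/// to a path underneath `data_folder`.
fn resource_path(data_folder: &Path, key: &str) -> PathBuf {
    // Normalize Windows-style separators.
    let normalized = key.replace('\\', "/");
    // Drop leading separators so the key is treated as relative.
    data_folder.join(normalized.trim_start_matches('/'))
}
```

This sketch does not reject `..` components, so keys from untrusted files would need an extra check before anything is written to disk.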

Implementation Considerations

Key Differences from Python:

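  1. Error Handling: Rust returns Result<T, E> and propagates failures with ?, where Python raises exceptions
  2. Memory Management: No garbage collector; ownership and borrowing decide when buffers are freed
  3. String Handling: Keys and records stay as raw bytes (Vec<u8> or &[u8]) until they are decoded into String or &str
  4. Iterator Patterns: Python generators map to Rust iterators, which are also lazy and compile down to plain loops
  5. File I/O: Every read and seek returns a Result that must be handled or propagated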
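To make the first and last points concrete, here is a small sketch of `Result`-based error propagation. `ReadError` and `read_prefixed` are hypothetical names for illustration (the planned `Error` enum uses thiserror instead of a hand-written `From`), and the length-prefixed layout is only an example, not a statement about the MDX header format.

```rust
use std::io::{self, Read};

/// Hypothetical error type with one I/O variant and one format variant.
#[derive(Debug)]
enum ReadError {
    Io(io::Error),
    InvalidFormat(String),
}

// Lets `?` convert an `io::Error` into `ReadError` automatically.
impl From<io::Error> for ReadError {
    fn from(err: io::Error) -> Self {
        ReadError::Io(err)
    }
}

/// Reads a big-endian u32 length followed by that many payload bytes.
fn read_prefixed<R: Read>(reader: &mut R, max_len: usize) -> Result<Vec<u8>, ReadError> {
    let mut len_buf = [0u8; 4];
    reader.read_exact(&mut len_buf)?; // I/O failure returns early as ReadError::Io
    let len = u32::from_be_bytes(len_buf) as usize;
    if len > max_len {
        // Format problems are reported as values, not raised as exceptions.
        return Err(ReadError::InvalidFormat(format!(
            "length {} exceeds limit {}",
            len, max_len
        )));
    }
    let mut payload = vec![0u8; len];
    reader.read_exact(&mut payload)?;
    Ok(payload)
}
```

The caller can match on the variant to tell a truncated file from a malformed one, which in Python would take separate exception classes.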

External Crate Dependencies:

  • clap: Command-line argument parsing (replaces argparse)
  • flate2: zlib decompression (replaces Python's zlib module)
  • salsa20: Salsa20 decryption (replaces pureSalsa20.py)
  • ripemd: RIPEMD-128 hashing (replaces ripemd128.py)
  • encoding_rs: Text encoding support
  • regex: Regular expressions for parsing
  • byteorder: Binary data reading
  • thiserror: Error type derivation
  • hex: Hexadecimal encoding/decoding
  • adler: Adler32 checksums

Performance Optimizations:

  1. Zero-copy where possible: Use &[u8] slices instead of Vec<u8> when data doesn't need to be owned
  2. Streaming iterators: Process records on-demand instead of loading everything into memory
  3. Efficient string handling: Use Cow<str> for strings that might not need allocation
  4. Memory mapping: Consider using memmap2 for large files
  5. Parallel processing: Use rayon for CPU-intensive operations like decompression
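As a sketch of the `Cow<str>` idea from point 3, the function below only allocates when the input actually needs to change. `normalize_newlines` is a hypothetical example, not part of the planned API.

```rust
use std::borrow::Cow;

/// Hypothetical helper: convert CRLF line endings to LF, borrowing the
/// input unchanged when it contains no CRLF.
fn normalize_newlines(text: &str) -> Cow<'_, str> {
    if text.contains("\r\n") {
        // Only this branch allocates a new String.
        Cow::Owned(text.replace("\r\n", "\n"))
    } else {
        Cow::Borrowed(text)
    }
}
```

Callers can treat the result as a `&str` through deref and call `into_owned()` only when they need to keep it.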

Testing Strategy:

  1. Unit tests: Test each utility function and method individually
  2. Integration tests: Test with real MDX/MDD files
  3. Property-based tests: Use proptest for edge cases
  4. Benchmark tests: Compare performance with Python implementation
  5. Compatibility tests: Ensure output matches Python version exactly
