Bulk-translates everything in a reference file using a local translation AI.
Built using https://github.com/huggingface/candle. By default it uses this model: https://huggingface.co/jbochi/madlad400-7b-mt-bt in GGUF q4, which is derived from https://huggingface.co/google/madlad400-7b-mt-bt
- Install rust: https://www.rust-lang.org/tools/install
- Expect silly Rust errors that you have to google (e.g. requiring Visual Studio on Windows for some reason)
cargo build --release
./target/release/translate-tool.exe --in-json "data/test-in.json" --out-json "data/test-out.json" --language de
Language should be a standard language code - if in doubt, see the list in Appendix A.1 of https://arxiv.org/pdf/2309.04662.pdf
Tack `--verbose` onto the end to get some live debug output as it goes.
Use `--model-id jbochi/madlad400-3b-mt` if you're impatient and want a smaller model.
Add `--max-ratio 10` to automatically stop the model if its output gets 10x longer than the input (defaults to 5). Runaway output like that usually indicates the model has broken down.
Add `--add-json "data/other-file.json"` to also append new keys in a secondary key file.
Add `--max-tokens 60` to set the split length. The right value depends on when/how the model breaks down. Set it BELOW the model's sequence length.
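For reference, here's a hypothetical invocation combining the flags documented above (the file paths and values are just placeholders, not recommendations):

```
./target/release/translate-tool.exe --in-json "data/test-in.json" --out-json "data/test-out.json" --add-json "data/other-file.json" --language de --model-id jbochi/madlad400-3b-mt --max-ratio 10 --max-tokens 60 --verbose
```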
Speed comparison (extremely variable with prompt):
CPU | 7B-MT-BT | 3B-MT |
---|---|---|
Intel i7-12700KF (12 p-core) | 7 tok/s | 15 tok/s |
AMD Ryzen 5 3600 (6 core) | 4 tok/s | 8 tok/s |
Example input file:

{
    "keys": {
        "This key needs translation": "",
        "This key doesn't": "cause it has a value"
    }
}
This will translate the keys and store the results in the values, skipping any key that already has a value.
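As an illustration only (the actual model output will vary), translating the example above with `--language de` might produce something like:

```json
{
    "keys": {
        "This key needs translation": "Dieser Schlüssel muss übersetzt werden",
        "This key doesn't": "cause it has a value"
    }
}
```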
The first run will automatically download the model; subsequent runs will load it from the Hugging Face cache (in your user directory under `.cache/huggingface/hub`).
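For example, with the standard Hugging Face hub cache layout (an assumption about your setup, not something this tool controls), the default model's files end up in a directory named something like:

```
~/.cache/huggingface/hub/models--jbochi--madlad400-7b-mt-bt/
```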
Note that this runs entirely on CPU: the Transformers GPU version needs too much VRAM to work, and the GGUF version doesn't seem to want to run on GPU within candle. "Oh but why not use regular GGML to run it then?" Because GGML doesn't support T5. Idk why candle supports GGML-formatted T5 but GGML itself doesn't. AI tech is a mess. If you're reading this after 2024, when this was made, there are hopefully less dumb ways to do what is currently cutting-edge AI stuff.
This will burn your CPU and take forever.
Note that I'm not experienced in Rust and the lifetime syntax is painful, so I might've screwed something up.
This project depends on Candle, which is licensed under either MIT or Apache-2.0. Both licenses are in their repo; don't ask me what that means, idek.
Sections of source code are copied from Candle examples.
This project depends on the MADLAD models that Google Research released under Apache-2.0. I'm not entirely clear on why a software license applies to model weights, but again, idek.
Anything unique to this project is yeeted out freely under the MIT license.
I have no idea whether any legal restrictions apply to the resulting translated text, but you're probably fine, probably (if you have the rights to use the source text, at least).
The MIT License (MIT)
Copyright (c) 2024 Alex "mcmonkey" Goodwin
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.