Update README

ZJaume · Oct 3, 2024 · e6c8292
1 parent c7f14df
commit e6c8292
Showing 1 changed file with 37 additions and 15 deletions.
diff --git a/README.md b/README.md
@@ -1,16 +1,17 @@
 # heliport
-A language identification tool that aims to be both fast and accurate.
-Originally started as a [HeLI-OTS](https://aclanthology.org/2022.lrec-1.416/) port to Rust.
+A language identification tool which aims for both speed and accuracy.
+Mostly an efficient [HeLI-OTS](https://aclanthology.org/2022.lrec-1.416/) port to Rust,
+achieving 25x speedups while maintaining the same accuracy levels.
 
 ## Installation
 ### From PyPI
 Install it in your environment
 ```
 pip install heliport
 ```
-then download the model
+then download the binarized model
 ```
-heliport-download
+heliport download
 ```
 
 ### From source
@@ -20,19 +21,19 @@ Install the requirements:
  - [Rust](https://rustup.rs)
  - [OpenSSL](https://docs.rs/openssl/latest/openssl/#automatic)
 
-Clone the repo, build the package and compile the model
+Clone the repo, build the package and binarize the model
 ```
 git clone https://github.com/ZJaume/heliport
 cd heliport
 pip install .
-heliport-convert
+heliport binarize
 ```
 
 ## Usage
 ### CLI
-Just run the `heliport` command that reads lines from stdin
+Just run the `heliport identify` command, which reads lines from stdin
 ```
-cat sentences.txt | heliport
+cat sentences.txt | heliport identify
 ```
 ```
 eng_latn
@@ -41,28 +42,49 @@ rus_cyrl
 ...
 ```
 
+```
+Identify languages of input text
+
+Usage: heliport identify [OPTIONS] [INPUT_FILE] [OUTPUT_FILE]
+
+Arguments:
+  [INPUT_FILE]   Input file, default: stdin
+  [OUTPUT_FILE]  Output file, default: stdout
+
+Options:
+  -j, --threads <THREADS>                Number of parallel threads to use.
+                                         0 means no multi-threading
+                                         1 means running the identification in a separated thread
+                                         >1 run multithreading [default: 0]
+  -b, --batch-size <BATCH_SIZE>          Number of text segments to pre-load for parallel processing [default: 100000]
+  -c, --ignore-confidence                Ignore confidence thresholds. Predictions under the thresholds will not be labeled as 'und'
+  -s, --print-scores                     Print confidence score (higher is better) or raw score (higher is better) in case '-c' is provided
+  -m, --model-dir <MODEL_DIR>            Model directory containing binarized model or plain text model. Default is Python module path or './LanguageModels' if relevant languages are requested
+  -l, --relevant-langs <RELEVANT_LANGS>  Load only relevant languages. Specify a comma-separated list of language codes. Needs plain text model directory
+  -h, --help                             Print help
+```
+
 ### Python package
 ```python
 >>> from heliport import Identifier
 >>> i = Identifier()
 >>> i.identify("L'aigua clara")
 'cat_latn'
 ```
+Remember to download or binarize the model first!
 
 ### Rust crate
 ```rust
-use std::sync::Arc;
+use std::path::PathBuf;
 use heliport::identifier::Identifier;
 use heliport::lang::Lang;
-use heliport::load_models;
 
-let (charmodel, wordmodel) = load_models("/dir/to/models")
-let identifier = Identifier::new(
-    Arc::new(charmodel),
-    Arc::new(wordmodel),
+let identifier = Identifier::load(
+    PathBuf::from("/path/to/model_dir"),
+    None,
     );
 let (lang, score) = identifier.identify("L'aigua clara");
-assert_eq!(lang, Lang::cat_Latn);
+assert_eq!(lang, Lang::cat);
 ```
 
 ## Differences with HeLI-OTS
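
The Python snippet in this README identifies a single string. As a minimal sketch of how the same call could be applied to a whole file, mirroring what `cat sentences.txt | heliport identify` does on the CLI, the helper below tallies predicted labels per line. The function name `count_languages` is hypothetical and not part of heliport; the only assumption about the package is that `Identifier.identify(text)` returns a label string, as the README example shows.

```python
from collections import Counter
from typing import Iterable


def count_languages(identifier, lines: Iterable[str]) -> Counter:
    """Tally the label returned by identifier.identify() for each non-empty line.

    `identifier` is any object with an `identify(text) -> str` method
    (assumed to match heliport's `Identifier`, per the README example).
    """
    counts: Counter = Counter()
    for line in lines:
        text = line.strip()
        if not text:
            # Skip blank lines instead of sending them to the identifier.
            continue
        counts[identifier.identify(text)] += 1
    return counts


# Usage with the real package (assumes the model was downloaded or binarized):
#   from heliport import Identifier
#   with open("sentences.txt", encoding="utf-8") as f:
#       for label, n in count_languages(Identifier(), f).most_common():
#           print(f"{label}\t{n}")
```

Taking the identifier as a parameter keeps the helper independent of how the model is loaded, so it can be exercised with a stub when the model is not installed.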