Update README
ZJaume committed Oct 3, 2024
1 parent c7f14df commit e6c8292
Showing 1 changed file with 37 additions and 15 deletions.
52 changes: 37 additions & 15 deletions README.md
@@ -1,16 +1,17 @@
# heliport
-A language identification tool that aims to be both fast and accurate.
-Originally started as a [HeLI-OTS](https://aclanthology.org/2022.lrec-1.416/) port to Rust.
+A language identification tool which aims for both speed and accuracy.
+Mostly an efficient [HeLI-OTS](https://aclanthology.org/2022.lrec-1.416/) port to Rust,
+achieving 25x speedups while maintaining the same accuracy levels.

## Installation
### From PyPi
Install it in your environment
```
pip install heliport
```
-then download the model
+then download the binarized model
```
-heliport-download
+heliport download
```

### From source
@@ -20,19 +21,19 @@ Install the requirements:
- [Rust](https://rustup.rs)
- [OpenSSL](https://docs.rs/openssl/latest/openssl/#automatic)

-Clone the repo, build the package and compile the model
+Clone the repo, build the package and binarize the model
```
git clone https://github.com/ZJaume/heliport
cd heliport
pip install .
-heliport-convert
+heliport binarize
```

## Usage
### CLI
-Just run the `heliport` command that reads lines from stdin
+Just run the `heliport identify` command that reads lines from stdin
```
-cat sentences.txt | heliport
+cat sentences.txt | heliport identify
```
```
eng_latn
@@ -41,28 +42,49 @@ rus_cyrl
...
```

+```
+Identify languages of input text
+
+Usage: heliport identify [OPTIONS] [INPUT_FILE] [OUTPUT_FILE]
+
+Arguments:
+  [INPUT_FILE]   Input file, default: stdin
+  [OUTPUT_FILE]  Output file, default: stdout
+
+Options:
+  -j, --threads <THREADS>                Number of parallel threads to use.
+                                         0 means no multi-threading
+                                         1 means running the identification in a separated thread
+                                         >1 run multithreading [default: 0]
+  -b, --batch-size <BATCH_SIZE>          Number of text segments to pre-load for parallel processing [default: 100000]
+  -c, --ignore-confidence                Ignore confidence thresholds. Predictions under the thresholds will not be labeled as 'und'
+  -s, --print-scores                     Print confidence score (higher is better) or raw score (higher is better) in case '-c' is provided
+  -m, --model-dir <MODEL_DIR>            Model directory containing binarized model or plain text model. Default is Python module path or './LanguageModels' if relevant languages are requested
+  -l, --relevant-langs <RELEVANT_LANGS>  Load only relevant languages. Specify a comma-separated list of language codes. Needs plain text model directory
+  -h, --help                             Print help
+```
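
The options in the help text above can also be assembled from a script. The following is a minimal Python sketch, not part of heliport: `build_identify_command` is a hypothetical helper that maps a few of the flags shown in the help output onto an argument list. It only uses flags that appear in that output and does not run anything by itself.

```python
from typing import List, Optional


def build_identify_command(
    input_file: Optional[str] = None,
    output_file: Optional[str] = None,
    threads: int = 0,
    print_scores: bool = False,
    ignore_confidence: bool = False,
) -> List[str]:
    """Hypothetical helper: build an argument list for `heliport identify`."""
    if output_file is not None and input_file is None:
        # OUTPUT_FILE is the second positional argument, so it needs INPUT_FILE.
        raise ValueError("output_file requires input_file")

    cmd = ["heliport", "identify"]
    if threads:
        cmd += ["--threads", str(threads)]
    if print_scores:
        cmd.append("--print-scores")
    if ignore_confidence:
        cmd.append("--ignore-confidence")

    # Positional arguments go last: [INPUT_FILE] [OUTPUT_FILE].
    if input_file is not None:
        cmd.append(input_file)
    if output_file is not None:
        cmd.append(output_file)
    return cmd
```

Assuming heliport is installed and the model has been downloaded or binarized, the resulting list could be passed to `subprocess.run(..., check=True)`; with no file arguments the command reads stdin and writes stdout, as the help text states.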

### Python package
```python
>>> from heliport import Identifier
>>> i = Identifier()
>>> i.identify("L'aigua clara")
'cat_latn'
```
+Remember to download or binarize the model first!
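
For more than one sentence, the same call can be applied line by line. The sketch below assumes only what the example above shows: that `identify` takes one string and returns a label such as `'cat_latn'`. `count_labels` is a hypothetical helper, not part of heliport, and it receives the identify function as a parameter so it can be tried without a model.

```python
from collections import Counter
from typing import Callable, Iterable


def count_labels(lines: Iterable[str], identify: Callable[[str], str]) -> Counter:
    """Hypothetical helper: count predicted labels over non-empty lines."""
    counts: Counter = Counter()
    for line in lines:
        text = line.strip()
        if text:  # skip blank lines instead of sending them to the identifier
            counts[identify(text)] += 1
    return counts
```

With heliport installed and the model in place, `count_labels(open("sentences.txt"), Identifier().identify)` would tally the labels for a file, one sentence per line.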

### Rust crate
```rust
-use std::sync::Arc;
+use std::path::PathBuf;
use heliport::identifier::Identifier;
use heliport::lang::Lang;
-use heliport::load_models;

-let (charmodel, wordmodel) = load_models("/dir/to/models")
-let identifier = Identifier::new(
-    Arc::new(charmodel),
-    Arc::new(wordmodel),
+let identifier = Identifier::load(
+    PathBuf::from("/path/to/model_dir"),
+    None,
);
let (lang, score) = identifier.identify("L'aigua clara");
-assert_eq!(lang, Lang::cat_Latn);
+assert_eq!(lang, Lang::cat);
```

## Differences with HeLI-OTS