Update README with differences with HeLI
ZJaume committed Oct 3, 2024
1 parent a0b3dfe commit 00ed743
14 changes: 14 additions & 0 deletions README.md
@@ -65,6 +65,20 @@ let (lang, score) = identifier.identify("L'aigua clara");
assert_eq!(lang, Lang::cat_Latn);
```

## Differences with HeLI-OTS
Although `heliport` currently uses the same models as HeLI-OTS 2.0 and the
identification algorithm is almost the same, there are a few differences
(mainly during pre-processing) that may cause different results.
However, these should not affect accuracy and should not happen frequently.

**Note**: Both tools have a pre-processing step for each identified text to
remove all non-alphabetic characters.
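
As a rough illustration of that shared step, the sketch below replaces every non-alphabetic character with a space and collapses the resulting whitespace. The function name `strip_non_alphabetic` is hypothetical and is not part of the `heliport` API, and substituting a space (rather than deleting the character outright) is an assumption; the actual implementations may differ in details.

```rust
/// Hypothetical helper for illustration only, not heliport's actual code.
/// Replaces each non-alphabetic character with a space (an assumption),
/// then collapses runs of whitespace into single spaces.
fn strip_non_alphabetic(text: &str) -> String {
    let replaced: String = text
        .chars()
        // `char::is_alphabetic` is Unicode-aware, so letters such as 'é' are kept.
        .map(|c| if c.is_alphabetic() { c } else { ' ' })
        .collect();
    replaced.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

With this sketch, `strip_non_alphabetic("L'aigua clara, 2024!")` yields `"L aigua clara"`.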

The implementation differences that can change results are:
- During pre-processing, HeLI removes URLs and words beginning with `@`, while `heliport` does not.
- Since version 1.5, HeLI repeats every word that does not start with a capital letter during pre-processing, probably to penalize proper nouns. In our tests, we have not found a significant improvement from this, so it has not been implemented, to avoid almost doubling the cost of prediction. It might be implemented in the future if there is a need for the feature and it can be done efficiently.
- Rust and Java sometimes differ slightly in the least significant digits of a float, so the stored n-gram probabilities are not exactly the same. This is very unlikely to affect the predicted labels.
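
To make the first two differences concrete, here is a minimal sketch of the two HeLI-only pre-processing steps as described above. Both function names are hypothetical, the URL check is a simple prefix test chosen for illustration, and neither function reflects the real HeLI or `heliport` code.

```rust
/// Hypothetical sketch: drop tokens that look like URLs or start with '@'.
/// The prefix-based URL check is an assumption made for this example.
fn drop_urls_and_mentions(text: &str) -> String {
    text.split_whitespace()
        .filter(|w| {
            !(w.starts_with('@')
                || w.starts_with("http://")
                || w.starts_with("https://")
                || w.starts_with("www."))
        })
        .collect::<Vec<_>>()
        .join(" ")
}

/// Hypothetical sketch: emit every word once, and a second time when it
/// does not start with an uppercase letter.
fn repeat_non_capitalized(text: &str) -> String {
    let mut out: Vec<&str> = Vec::new();
    for word in text.split_whitespace() {
        out.push(word);
        let starts_upper = word.chars().next().map_or(false, |c| c.is_uppercase());
        if !starts_upper {
            out.push(word);
        }
    }
    out.join(" ")
}
```

For example, `repeat_non_capitalized("Barcelona és gran")` gives `"Barcelona és és gran gran"`. Since most words in running text are lowercase, the number of words to score nearly doubles, which is consistent with the cost concern mentioned above.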

## Benchmarks
Speed benchmarks with 100k random sentences from [OpenLID](https://github.com/laurieburchell/open-lid-dataset), all the tools running single-threaded:
| tool | time (s) |