Releases: londogard/londogard-nlp-toolkit
1.2.0-BETA2
🚀 MultiClass Logistic Regression
🚀 OneHotEncoder
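For readers unfamiliar with one-hot encoding, here is a minimal sketch of the idea in plain Kotlin. It is not the toolkit's OneHotEncoder API: the class name SimpleOneHotEncoder, its members, and the choice to map unseen values to all-zero rows are assumptions made for illustration.

```kotlin
// Hypothetical helper for illustration only; not the toolkit's OneHotEncoder.
class SimpleOneHotEncoder(values: List<String>) {
    // Sorted distinct categories give a stable column order.
    val categories: List<String> = values.distinct().sorted()

    // Maps each category to its column index.
    private val index: Map<String, Int> =
        categories.withIndex().associate { (i, c) -> c to i }

    /** Encodes each value as a row with a single 1; unseen values become all-zero rows. */
    fun transform(values: List<String>): List<IntArray> =
        values.map { value ->
            val row = IntArray(categories.size)
            val i = index[value]
            if (i != null) row[i] = 1
            row
        }
}
```

Fitting on "red", "green", "red", "blue" yields the columns blue, green, red, so "green" is encoded as the row 0, 1, 0.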
1.2.0-BETA
🚀 TransformersPipeline
✅ PyTorch through JIT-saved TorchScript models
✅ ONNX models directly through the hub, e.g. TokenClassificationPipeline.create("optimum/bert-base-NER"), where optimum/bert-base-NER is a model on the Hugging Face Hub
✅ Load both PyTorch (TorchScript) & ONNX models from a local path
ClassificationPipeline and TokenClassificationPipeline exist.
See the following test for some examples of how to use them.
1.1.1
1.1.0
What's Changed
- feat: BagOfWords, TfIdf & BM-25 by @Lundez in #42
- feat: Adding new CLF by @Lundez in #46
- feat: Cooccurrence keywords by @Lundez in #77
- perf: LightWordEmbeddings with more efficient caching by @Lundez in #81
- docs: Adding doc-generation and deployment by @Lundez in #82
- chore: Bump multiple dependencies by @Lundez
Full Changelog: v1.0.0...v1.1.0
1.1.0-BETA
This is the initial BETA for 1.1.0
🚀 Vectorizers (BagOfWord through CountVectorizer)
🚀 Transformers (TF-IDF and BM25, which also exist as Vectorizers that use BagOfWord as input)
🚀 Regression (SimpleLinearRegression)
🚀 Classifier (LogisticRegression without intercept & Naïve Bayes)
🚀 Sequence Classifier (Hidden Markov Model)
✅ Moved the majority of the code to multik
✅ Started adding DJL PyTorch Tensor support, ramping up for neural networks
✅ Added some Metrics
🙄 ...And some extra!
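The relationship noted above between bag-of-words counts and TF-IDF can be sketched in plain Kotlin as follows. This does not use the toolkit's classes: the function names buildVocabulary, countVectorize, and tfIdf are hypothetical, and the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 is one common variant chosen for illustration, not necessarily the one the toolkit implements.

```kotlin
import kotlin.math.ln

// Hypothetical helpers for illustration only; not the toolkit's API.

/** Builds a vocabulary (token -> column index) in order of first appearance. */
fun buildVocabulary(docs: List<List<String>>): Map<String, Int> {
    val vocab = LinkedHashMap<String, Int>()
    for (doc in docs) for (token in doc) vocab.getOrPut(token) { vocab.size }
    return vocab
}

/** Bag-of-words: one row of raw term counts per document. */
fun countVectorize(docs: List<List<String>>, vocab: Map<String, Int>): List<DoubleArray> =
    docs.map { doc ->
        val row = DoubleArray(vocab.size)
        for (token in doc) {
            val idx = vocab[token] ?: continue // skip out-of-vocabulary tokens
            row[idx] += 1.0
        }
        row
    }

/** TF-IDF over count rows, assuming idf = ln((1 + n) / (1 + df)) + 1. */
fun tfIdf(counts: List<DoubleArray>): List<DoubleArray> {
    if (counts.isEmpty()) return emptyList()
    val n = counts.size
    val width = counts[0].size
    // Document frequency: number of documents in which each term occurs.
    val df = IntArray(width)
    for (row in counts) for (j in 0 until width) if (row[j] > 0.0) df[j]++
    val idf = DoubleArray(width) { j -> ln((1.0 + n) / (1.0 + df[j])) + 1.0 }
    return counts.map { row -> DoubleArray(width) { j -> row[j] * idf[j] } }
}
```

The transform step takes only the count matrix as input, which is why the same weighting can be offered both as a standalone transformer and as a vectorizer that counts first and reweights afterwards.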
1.0.0
1.0-beta
First beta 🎉
API is stabilizing.
🚀 BytePieceEmbeddings (https://nlp.h-its.org/bpemb/) -- Supporting 275 (!) languages out of the box with a lot of customizability of sizes.
🚀 SentencePiece Tokenizer -- Supporting 275 (!) languages out of the box with a lot of customizability of sizes. (OBS: JNI-based)
🚀 FastText (non-ngram) support -- Supporting 175 languages out of the box.
🎉 Documentation is now in a Kotlin Notebook (README.ipynb), so you can run the code yourself locally.
... And some minor bug fixes in DownloadHelper, where it would redownload some WordFrequencies etc.
1.0-SNAPSHOT
This is a SNAPSHOT of the 1.0 release.
The API should be stable for "completed" segments.
✅ Stopwords
✅ WordFrequencies
✅ Tokenizer + CharTokenizer & SimpleTokenizer
✅ Basic Trie Structure (no 'merge node' function yet)
✅ Stemmer
✅ Embeddings (Word Embeddings & Light Word Embeddings)
❓ Sentence Embeddings (AvgSentence & USif should be good to go!)