Bling Fire - high-speed text tokenization - for Ruby
Add this line to your application’s Gemfile:
gem "blingfire"
Create a model
model = BlingFire::Model.new
Tokenize words
model.text_to_words(text)
Tokenize sentences
model.text_to_sentences(text)
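The two calls above can be combined, for example to get the words of each sentence separately. This is a minimal sketch, assuming `text_to_sentences` and `text_to_words` each return an array of strings. `words_by_sentence` is a hypothetical helper, not part of the gem.

```ruby
# Hypothetical helper (not part of blingfire): splits text into sentences,
# then splits each sentence into words.
# Assumes both model methods return arrays of strings.
def words_by_sentence(model, text)
  model.text_to_sentences(text).map do |sentence|
    model.text_to_words(sentence)
  end
end

# Usage with the default model:
#   require "blingfire"
#   model = BlingFire::Model.new
#   words_by_sentence(model, "This is one sentence. This is another.")
```

The helper takes the model as an argument, so one model can be created once and reused across calls.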
Get offsets for words
words, start_offsets, end_offsets = model.text_to_words_with_offsets(text)
Get offsets for sentences
sentences, start_offsets, end_offsets = model.text_to_sentences_with_offsets(text)
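The `*_with_offsets` methods return three parallel arrays, which can be awkward to pass around. The sketch below zips them into one record per token. `token_spans` is a hypothetical helper, not part of the gem, and it makes no assumption about whether the offsets count bytes or characters or whether the end offset is inclusive; check that against your own text before slicing with them.

```ruby
# Hypothetical helper (not part of blingfire): zips the three parallel arrays
# returned by a *_with_offsets method into one hash per token.
def token_spans(tokens, start_offsets, end_offsets)
  unless tokens.size == start_offsets.size && tokens.size == end_offsets.size
    raise ArgumentError, "tokens and offsets must have the same length"
  end

  tokens.zip(start_offsets, end_offsets).map do |token, first, last|
    { token: token, start_offset: first, end_offset: last }
  end
end

# Usage:
#   require "blingfire"
#   model = BlingFire::Model.new
#   spans = token_spans(*model.text_to_words_with_offsets("Hello world"))
```

The same helper should work for any method that returns the same three-array shape.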
Bling Fire comes with a default model that follows the tokenization logic of NLTK with a few changes. You can also download other models:
- BERT Base, BERT Base Cased, BERT Chinese, BERT Multilingual Cased
- GPT-2
- Laser 100k, Laser 250k, Laser 500k
- RoBERTa
- Syllab
- URI 100k, URI 250k, URI 500k
- XLM-RoBERTa
- XLNet, XLNet No Norm
- WBD
Load a model
model = BlingFire.load_model("bert_base_tok.bin")
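`load_model` takes the path of a model file that has already been downloaded. The sketch below resolves a file name inside a directory and fails early with a clear message when the file is missing. `model_path` and the `models` directory are hypothetical names used for illustration, not part of the gem.

```ruby
# Hypothetical helper (not part of blingfire): resolves a model file name
# inside a directory and raises if the file does not exist.
def model_path(name, dir: Dir.pwd)
  path = File.expand_path(name, dir)
  raise ArgumentError, "model file not found: #{path}" unless File.file?(path)
  path
end

# Usage, assuming the file was downloaded into a local "models" directory:
#   require "blingfire"
#   model = BlingFire.load_model(model_path("bert_base_tok.bin", dir: "models"))
```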
Convert text to ids
model.text_to_ids(text)
Get offsets for ids
ids, start_offsets, end_offsets = model.text_to_ids_with_offsets(text)
Disable prefix space
model = BlingFire.load_model("roberta.bin", prefix: false)
Load a model for converting ids back to text
model = BlingFire.load_model("bert_base_tok.i2w")
Convert ids to text
model.ids_to_text(ids)
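Encoding and decoding use different model files: a `.bin` model for `text_to_ids` and an `.i2w` model for `ids_to_text`. The sketch below chains the two. `round_trip` is a hypothetical helper, not part of the gem, and the decoded text may not be identical to the input, since tokenization can normalize the text.

```ruby
# Hypothetical helper (not part of blingfire): encodes text to ids with one
# model and decodes those ids back to text with another.
# tokenizer must respond to text_to_ids; detokenizer must respond to ids_to_text.
def round_trip(tokenizer, detokenizer, text)
  ids = tokenizer.text_to_ids(text)
  [ids, detokenizer.ids_to_text(ids)]
end

# Usage, assuming both files have been downloaded to the working directory:
#   require "blingfire"
#   tokenizer = BlingFire.load_model("bert_base_tok.bin")
#   detokenizer = BlingFire.load_model("bert_base_tok.i2w")
#   ids, decoded = round_trip(tokenizer, detokenizer, "hello world")
```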
View the changelog
Everyone is encouraged to help improve this project. Here are a few ways you can help:
- Report bugs
- Fix bugs and submit pull requests
- Write, clarify, or fix documentation
- Suggest or add new features
To get started with development:
git clone https://github.com/ankane/blingfire-ruby.git
cd blingfire-ruby
bundle install
bundle exec rake vendor:all download:models
bundle exec rake test