Bling Fire Ruby

Bling Fire - high speed text tokenization - for Ruby

Installation

Add this line to your application’s Gemfile:

gem "blingfire"

Getting Started

Create a model

model = BlingFire::Model.new

Tokenize words

model.text_to_words(text)

Tokenize sentences

model.text_to_sentences(text)
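
As a rough sketch of how the two calls above fit together, the helper below counts the words and sentences in a text. The name `token_summary` is hypothetical and not part of the gem, and the sketch assumes both methods return arrays of strings. The live part runs only when the gem is installed.

```ruby
# Hypothetical helper (not part of blingfire): works with any object that
# responds to text_to_words and text_to_sentences, such as BlingFire::Model.
# ASSUMPTION: both methods return arrays of strings.
def token_summary(model, text)
  words = model.text_to_words(text)
  sentences = model.text_to_sentences(text)
  {
    word_count: words.length,
    sentence_count: sentences.length,
    words: words,
    sentences: sentences
  }
end

begin
  require "blingfire"

  model = BlingFire::Model.new
  p token_summary(model, "This is a sentence. This is another one.")
rescue LoadError
  warn "blingfire is not installed; skipping the live example"
end
```

Keeping the helper duck-typed makes it easy to swap in a stub object when testing code that depends on tokenization.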

Get offsets for words

words, start_offsets, end_offsets = model.text_to_words_with_offsets(text)

Get offsets for sentences

sentences, start_offsets, end_offsets = model.text_to_sentences_with_offsets(text)
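
One way to use the offsets is to map each token back to the span of the original text it came from. The sketch below is illustrative only: `spans_with_text` is a hypothetical helper, the arrays are hand-written stand-ins for what the methods above return, and it assumes offsets are character indexes with exclusive end offsets. Check that assumption against the version of the gem you have installed.

```ruby
# Hypothetical helper (not part of blingfire): pairs each token with its span
# and the matching slice of the original text.
# ASSUMPTION: offsets are character indexes and end offsets are exclusive.
def spans_with_text(text, tokens, start_offsets, end_offsets)
  tokens.zip(start_offsets, end_offsets).map do |token, first, last|
    { token: token, start: first, end: last, source: text[first...last] }
  end
end

text = "Hello world!"

# Hand-written stand-ins for the three arrays returned by
# text_to_words_with_offsets; they are not output captured from the gem.
words = ["Hello", "world", "!"]
start_offsets = [0, 6, 11]
end_offsets = [5, 11, 12]

spans_with_text(text, words, start_offsets, end_offsets).each do |span|
  puts "#{span[:token].inspect} covers #{span[:start]}...#{span[:end]}: #{span[:source].inspect}"
end
```

If the offsets turn out to be byte positions or inclusive ends, only the slicing expression inside the helper needs to change.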

Pre-trained Models

Bling Fire comes with a default model that follows the tokenization logic of NLTK with a few changes. You can also download other models.

Load a model

model = BlingFire.load_model("bert_base_tok.bin")

Convert text to ids

model.text_to_ids(text)

Get offsets for ids

ids, start_offsets, end_offsets = model.text_to_ids_with_offsets(text)

Disable prefix space

model = BlingFire.load_model("roberta.bin", prefix: false)

Ids to Text

Load a model

model = BlingFire.load_model("bert_base_tok.i2w")

Convert ids to text

model.ids_to_text(ids)
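
The two model types can be combined into a round trip from text to ids and back. This is a minimal sketch: `round_trip` is a hypothetical helper, and it assumes both model files have already been downloaded into the current directory. The text that comes back may not match the input exactly, depending on how the model normalizes it.

```ruby
# Hypothetical helper (not part of blingfire): encodes text with one model and
# decodes the ids with another. Works with any objects that respond to
# text_to_ids and ids_to_text respectively.
def round_trip(tokenizer, detokenizer, text)
  ids = tokenizer.text_to_ids(text)
  { ids: ids, text: detokenizer.ids_to_text(ids) }
end

# ASSUMPTION: both files were downloaded to the current directory beforehand.
tok_path = "bert_base_tok.bin"
i2w_path = "bert_base_tok.i2w"

begin
  require "blingfire"

  if File.exist?(tok_path) && File.exist?(i2w_path)
    tokenizer = BlingFire.load_model(tok_path)
    detokenizer = BlingFire.load_model(i2w_path)
    p round_trip(tokenizer, detokenizer, "hello world")
  else
    warn "model files not found; skipping the live example"
  end
rescue LoadError
  warn "blingfire is not installed; skipping the live example"
end
```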

History

View the changelog

Contributing

Everyone is encouraged to help improve this project.

To get started with development:

git clone https://github.com/ankane/blingfire-ruby.git
cd blingfire-ruby
bundle install
bundle exec rake vendor:all download:models
bundle exec rake test
