ismogroup/sentencepiece-js

 
 

JavaScript wrapper for the SentencePiece library

Browser demo

A demo hosted on GitHub Pages uses SentencePiece to count tokens with the Meta Llama 3.1 tokenizer model: https://sctg-development.github.io/sentencepiece-js/. No data is sent to the server.
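
Counting tokens comes down to encoding the text and measuring the resulting list of ids. A minimal sketch of that idea, assuming a processor whose model is already loaded and whose `encodeIds` returns an array-like of ids; `countTokens` is a hypothetical helper, not part of this package, and the demo itself may be implemented differently:

```javascript
// Hypothetical helper (not part of the package): counts the tokens in `text`.
// `spp` is assumed to be a SentencePieceProcessor with a model already loaded,
// or any object exposing a compatible `encodeIds(text)` method.
function countTokens(spp, text) {
    const ids = spp.encodeIds(text) // text -> array-like of token ids
    return ids.length
}
```

Taking the processor as a parameter keeps the helper independent of how the model was loaded (from a file or from a base64 string).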

Build

SentencePiece is compiled to WebAssembly using Emscripten.

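To rebuild this project, clone the repository with its submodules, install the dependencies, then build:

git clone --recurse-submodules https://github.com/sctg-development/sentencepiece-js.git

cd sentencepiece-js

yarn

yarn build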

Use

To use this tool in Node.js, you can use the following code:

const { SentencePieceProcessor, cleanText } = require("../dist");
const ROOT = require('app-root-path')

async function main() {

    let text = "I am still waiting on my card?"
    let cleaned = cleanText(text)

    let spp = new SentencePieceProcessor()
    await spp.load(`${ROOT}/test/30k-clean.model`)
    let ids = spp.encodeIds(cleaned) // text -> list of token ids (numbers)
    console.log(ids)
    let str = spp.decodeIds(ids) // list of token ids -> text (string)
    console.log(str)

    let pieces = spp.encodePieces(cleaned) // text -> list of tokens (strings)
    console.log(pieces)
}
main().catch(console.error)
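
Because loading a model is asynchronous, an application that tokenizes repeatedly may prefer to load once and reuse the processor. A minimal sketch of that pattern; `memoizeProcessor` and `createProcessor` are hypothetical names introduced here for illustration, not part of this package:

```javascript
// Hypothetical helper (not part of the package): wraps an async factory so the
// processor is created at most once and every caller shares the same promise.
// `createProcessor` is assumed to be an async function that resolves to a ready
// processor, e.g. one that calls `new SentencePieceProcessor()` and `spp.load(...)`.
function memoizeProcessor(createProcessor) {
    let pending = null
    return function getProcessor() {
        if (pending === null) {
            pending = createProcessor().catch((err) => {
                pending = null // forget the failure so a later call can retry
                throw err
            })
        }
        return pending
    }
}
```

A caller would then write `const spp = await getProcessor()` wherever a processor is needed, instead of constructing and loading a new one each time.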

In the browser, you can use the following code:

import { Buffer } from "buffer";
import { SentencePieceProcessor, cleanText, llama_3_1_tokeniser_b64 } from "@sctg/sentencepiece-js";

// eslint-disable-next-line no-undef
globalThis.Buffer = Buffer;
// Built-in models: llama_3_1_tokeniser_b64, clean_30k_b64, smart_b64
async function main() {

    let text = "I am still waiting on my card?"
    let cleaned = cleanText(text)

    let spp = new SentencePieceProcessor()
    await spp.loadFromB64StringModel(llama_3_1_tokeniser_b64);
    let ids = spp.encodeIds(cleaned) // text -> list of token ids (numbers)
    console.log(ids)
    let str = spp.decodeIds(ids) // list of token ids -> text (string)
    console.log(str)

    let pieces = spp.encodePieces(cleaned) // text -> list of tokens (strings)
    console.log(pieces)
}
main().catch(console.error)
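
The built-in models are exported as base64 strings. To load a different model in the browser, one option is to encode its file the same way. A minimal sketch, assuming `loadFromB64StringModel` accepts the standard base64 encoding of the raw model bytes (this README does not confirm the exact format); `modelBytesToB64` is a hypothetical helper, not part of this package:

```javascript
// Hypothetical helper (not part of the package): encodes raw model bytes as a
// standard base64 string. Relies on a global `Buffer`, which Node.js provides
// and which the browser example above installs from the "buffer" package.
function modelBytesToB64(bytes) {
    // `bytes` may be a Buffer or a Uint8Array holding the model file contents.
    return Buffer.from(bytes).toString("base64")
}
```

In Node.js the bytes could come from `fs.readFileSync(modelPath)`. Under the assumption above, the resulting string would then be passed to `spp.loadFromB64StringModel`.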

See https://github.com/sctg-development/ai-outlook/blob/HEAD/src/aipane/aipane.ts#L11-L23 for an example of how to use this in a React app.
See also webpack.config.js for the webpack bundler configuration.

