gbenson/dom-tokenizers

DOM tokenizers

DOM-aware tokenization for Hugging Face language models.

TL;DR

Input:

<html>
  <head>
    <meta http-equiv="content-type" content="text/html; charset=utf-8">
    <meta name="viewport" content="width=device-width">
    <title>Hello world</title>
    <script>
    document.getElementById("demo").innerHTML = "Hello JavaScript!";
    </script>
    ...

Output:

<html><head><meta_httpequiv=contenttype_content=texthtmlcharsetUTF8><meta_name=viewport_content=widthdevicewidth><title>helloworld</title><script>documentgetElementByIddemoinnerHTMLHelloJavaScript</script>...
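The input/output pair above suggests that markup is flattened into a stream of tag, attribute, and text tokens. As a rough illustration of that idea only, here is a standard-library sketch. It is not the dom-tokenizers implementation: the class name `DomTokenSketch`, the function `sketch_tokenize`, and the exact token shapes are invented for this example.

```python
import re
from html.parser import HTMLParser

# Runs of ASCII letters and digits; everything else acts as a separator.
WORD_RE = re.compile(r"[A-Za-z0-9]+")


class DomTokenSketch(HTMLParser):
    """Hypothetical helper: collects a flat token list from an HTML string."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if not attrs:
            self.tokens.append(f"<{tag}>")
            return
        # Tag name, then each attribute as: name, "=", value words.
        self.tokens.append(f"<{tag}")
        for name, value in attrs:
            self.tokens.append(name)
            if value is not None:
                self.tokens.append("=")
                self.tokens.extend(WORD_RE.findall(value))
        self.tokens.append(">")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        # Text nodes and script bodies are split into alphanumeric words.
        self.tokens.extend(WORD_RE.findall(data))


def sketch_tokenize(html):
    """Return the token list for one HTML string (illustrative only)."""
    parser = DomTokenSketch()
    parser.feed(html)
    parser.close()
    return parser.tokens
```

For example, `sketch_tokenize("<title>Hello world</title>")` returns `['<title>', 'Hello', 'world', '</title>']`. The real tokenizer normalizes and splits text differently, so treat this only as a mental model of DOM-aware tokenization.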

Installation

With pip

pip install "dom-tokenizers[train]"

From source

git clone https://github.com/gbenson/dom-tokenizers.git
cd dom-tokenizers
python3 -m venv .venv
. .venv/bin/activate
pip install --upgrade pip
pip install -e ".[dev,train]"

Train a tokenizer

On the command line

Check everything's working using a small dataset of around 300 examples:

train-tokenizer gbenson/interesting-dom-snapshots

Train a tokenizer with a 10,000-token vocabulary using a dataset of 4,536 examples and upload it to the Hub:

train-tokenizer gbenson/webui-dom-snapshots -n 10000 -N 4536
huggingface-cli login
huggingface-cli upload dom-tokenizer-10k
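
If you would rather drive training from a Python script or notebook, the commands above can be wrapped with `subprocess`. This is a minimal sketch, not part of dom-tokenizers: the helper names `build_train_command` and `run_training` are hypothetical, and the reading of `-n` as vocabulary size and `-N` as number of examples is inferred from the description above rather than from the tool's documentation.

```python
import shutil
import subprocess


def build_train_command(dataset, vocab_size=None, num_examples=None):
    """Build the argument list for the train-tokenizer CLI (hypothetical helper)."""
    cmd = ["train-tokenizer", dataset]
    if vocab_size is not None:
        # Assumption: -n sets the vocabulary size.
        cmd += ["-n", str(vocab_size)]
    if num_examples is not None:
        # Assumption: -N sets how many dataset examples to use.
        cmd += ["-N", str(num_examples)]
    return cmd


def run_training(dataset, vocab_size=None, num_examples=None):
    """Run train-tokenizer if it is on PATH; fail with a clear message otherwise."""
    if shutil.which("train-tokenizer") is None:
        raise RuntimeError(
            "train-tokenizer not found; install dom-tokenizers[train] first"
        )
    cmd = build_train_command(dataset, vocab_size, num_examples)
    # check=True raises CalledProcessError if training exits with an error.
    subprocess.run(cmd, check=True)
```

Calling `run_training("gbenson/webui-dom-snapshots", vocab_size=10000, num_examples=4536)` would then run the same command as the second example above.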