A simple Byte Pair Encoding (BPE) tokenizer, written purely in C.


bpe.c

bpe.c is a lightweight, minimal implementation of Byte Pair Encoding (BPE) in C, inspired by Andrej Karpathy's minbpe.

Features

  • Implements the Byte Pair Encoding algorithm
  • Trains on input text to learn token merges
  • Customizable vocabulary size
  • Minimal dependencies (standard C libraries only)

How It Works

  • Initialization: The tokenizer starts with a basic vocabulary of 256 byte values.
  • Training: It analyzes the input text, finding the most frequent pairs of tokens and merging them iteratively until the desired vocabulary size is reached.
  • Encoding: Using the learned merges, it converts input text into a sequence of token IDs.
  • Decoding: It can reconstruct the original text from a sequence of token IDs.
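The four steps above can be sketched in a few small functions. This is an illustrative sketch only, not the code in bpe.c: every name prefixed with sketch_ or Sketch is hypothetical, the pair search is a simple quadratic scan chosen for readability, ties go to the pair seen first, and encoding simply replays the merges in the order they were learned.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_BASE_VOCAB 256  /* one token per possible byte value */
#define SKETCH_MAX_MERGES 64   /* hypothetical cap used only by this sketch */

/* Merge number k turns the pair (left, right) into token 256 + k. */
typedef struct { int left, right; } SketchMerge;

/* Find the most frequent adjacent pair in ids[0..n); returns its count (0 if n < 2). */
static size_t sketch_most_frequent_pair(const int *ids, size_t n, int *left, int *right) {
    size_t best = 0;
    for (size_t i = 0; i + 1 < n; ++i) {
        size_t count = 0;
        for (size_t j = 0; j + 1 < n; ++j)
            if (ids[j] == ids[i] && ids[j + 1] == ids[i + 1]) ++count;
        if (count > best) { best = count; *left = ids[i]; *right = ids[i + 1]; }
    }
    return best;
}

/* Replace each non-overlapping (left, right) with new_id in place; returns the new length. */
static size_t sketch_merge_pair(int *ids, size_t n, int left, int right, int new_id) {
    size_t out = 0;
    for (size_t i = 0; i < n; ) {
        if (i + 1 < n && ids[i] == left && ids[i + 1] == right) { ids[out++] = new_id; i += 2; }
        else ids[out++] = ids[i++];
    }
    return out;
}

/* Training: start from raw bytes and merge until vocab_size is reached or no pair is left.
   ids must have room for strlen(text) ints. Returns the number of merges learned. */
static size_t sketch_train(const char *text, size_t vocab_size, int *ids, size_t *n_ids,
                           SketchMerge *merges) {
    assert(vocab_size >= SKETCH_BASE_VOCAB);
    size_t n = strlen(text), learned = 0;
    for (size_t i = 0; i < n; ++i) ids[i] = (unsigned char)text[i];
    while (learned < vocab_size - SKETCH_BASE_VOCAB && learned < SKETCH_MAX_MERGES) {
        int l = 0, r = 0;
        if (sketch_most_frequent_pair(ids, n, &l, &r) == 0) break;
        merges[learned].left = l;
        merges[learned].right = r;
        n = sketch_merge_pair(ids, n, l, r, SKETCH_BASE_VOCAB + (int)learned);
        ++learned;
    }
    *n_ids = n;
    return learned;
}

/* Encoding: convert text to bytes, then replay the learned merges in order. */
static size_t sketch_encode(const char *text, const SketchMerge *merges, size_t learned, int *ids) {
    size_t n = strlen(text);
    for (size_t i = 0; i < n; ++i) ids[i] = (unsigned char)text[i];
    for (size_t m = 0; m < learned; ++m)
        n = sketch_merge_pair(ids, n, merges[m].left, merges[m].right, SKETCH_BASE_VOCAB + (int)m);
    return n;
}

/* Expand one token into bytes by undoing merges recursively; returns bytes written. */
static size_t sketch_expand(int id, const SketchMerge *merges, char *out) {
    if (id < SKETCH_BASE_VOCAB) { out[0] = (char)id; return 1; }
    const SketchMerge *m = &merges[id - SKETCH_BASE_VOCAB];
    size_t k = sketch_expand(m->left, merges, out);
    return k + sketch_expand(m->right, merges, out + k);
}

/* Decoding: expand every token and NUL-terminate; out must be large enough for the text. */
static void sketch_decode(const int *ids, size_t n, const SketchMerge *merges, char *out) {
    size_t k = 0;
    for (size_t i = 0; i < n; ++i) k += sketch_expand(ids[i], merges, out + k);
    out[k] = '\0';
}
```

For the input "aaabdaaabac" and a vocabulary size of 259, this sketch learns three merges, (97, 97), (256, 97) and (257, 98), and leaves the five tokens 258 100 258 97 99, which decode back to the original string.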

Usage

You can easily customize the tokenizer by modifying the following constants in the file:

  • INITIAL_VOCAB_SIZE: The starting vocabulary size (default is 256, one token per possible byte value)
  • MAX_TEXT_SIZE: The maximum length of text that can be processed
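Because training starts from INITIAL_VOCAB_SIZE tokens and each merge adds exactly one new token, the requested vocabulary size determines how many merges can be learned at most. The helper below is a hypothetical illustration of that arithmetic, not a function from bpe.c; the macro is redefined here with the default value so the sketch is self-contained.

```c
#include <assert.h>
#include <stddef.h>

#define INITIAL_VOCAB_SIZE 256  /* default starting vocabulary: one token per byte value */

/* Hypothetical helper: upper bound on merges for a requested vocabulary size. */
static size_t planned_merges(size_t vocab_size) {
    assert(vocab_size >= INITIAL_VOCAB_SIZE);  /* cannot shrink below the byte tokens */
    return vocab_size - INITIAL_VOCAB_SIZE;
}
```

With a vocabulary size of 300, planned_merges returns 44. A short training text may run out of pairs to merge before that bound is reached.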

Modify the main function to experiment with different texts and vocabulary sizes.

int main(void) {
    BasicTokenizer *tokenizer = create_tokenizer();

    const char *text = "hello world the sky is blue";
    size_t vocab_size = 300;  // 256 byte tokens plus up to 44 learned merges

    // Learn merges from the text
    train(tokenizer, text, vocab_size, 1);

    // Encode the text
    int ids[MAX_TEXT_SIZE];
    size_t ids_size = 0;
    encode(tokenizer, text, ids, &ids_size);
    
    // Decode the ids
    char decoded_text[MAX_TEXT_SIZE];
    decode(tokenizer, ids, ids_size, decoded_text);
    
    printf("Encoded IDs:\n");
    for (size_t i = 0; i < ids_size; ++i) {
        printf("%d ", ids[i]);
    }
    printf("\nDecoded text: %s\n", decoded_text);
    
    clean_tokenizer(tokenizer);

    return 0;
}

The result:

[Image: Implementation Result]

Citation

If you use bpe.c in your research, please cite it as follows:

@misc{bpe.c2024,
  author = {Ashvanth.S},
  title = {bpe.c: Minimal implementation of Byte Pair Encoding (BPE) in C},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ash-01xor/bpe.c}},
}
