GPT-2 From Scratch

An implementation of GPT-2 from scratch, using only the relevant research papers.

tl;dr

  • Pure Python/PyTorch implementation: Built from the ground up for deeper understanding.
  • C extensions for optimized tokenization: Enhanced training and encoding performance.
  • Faithful recreation based on research papers: No external implementations or resources.
  • Clear and well-documented code: Emphasis on readability and comprehension.

Overview

This project is a pure Python/PyTorch implementation of GPT-2, with custom C extensions for the tokenizer to improve performance. The goal is to recreate GPT-2 solely based on core research papers, enhancing my understanding of the transformer architecture and my ability to produce functional code from academic literature.
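
To make this concrete, the sketch below shows a pre-norm decoder block of the kind GPT-2 stacks; it is an illustration, not the repository's code. The class name Block, the default d_model=768 / n_heads=12 sizes, and the use of nn.MultiheadAttention in place of a hand-written attention module are assumptions made for brevity.

```python
import torch
import torch.nn as nn


class Block(nn.Module):
    """Pre-norm decoder block: LayerNorm -> masked self-attention -> LayerNorm -> GELU MLP."""

    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len = x.size(1)
        # Causal mask: True marks positions each token is NOT allowed to attend to.
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out               # residual connection around attention
        x = x + self.mlp(self.ln2(x))  # residual connection around the MLP
        return x
```

A full model stacks several of these blocks on top of token and positional embeddings, with a final LayerNorm and a projection back to the vocabulary for next-token prediction.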

Approach

The implementation is built from the ground up using only the following research papers:

  1. Three papers on transformer architectures
  2. One paper on byte pair encoding
  3. Two papers on the Adam optimizer
  4. One paper on the GeLU activation function (a sketch of its tanh approximation follows below)
  5. One paper on Layer Normalization

Relying solely on these papers forces a first-principles understanding of GPT-2's architecture, rather than borrowing from existing implementations.
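
As one concrete example of coding directly from the papers above, here is the tanh approximation of GeLU described in the activation-function paper; the function name and the PyTorch implementation are mine, shown only as a sketch.

```python
import math

import torch


def gelu(x: torch.Tensor) -> torch.Tensor:
    """Tanh approximation of GeLU (Gaussian Error Linear Unit)."""
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x.pow(3))))
```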

Challenges and Solutions

  1. Tokenization Speed

    • Challenge: The pure-Python tokenizer was an early bottleneck for both training and encoding.
    • Solution: Rewrote the tokenizer's hot paths as C extensions, significantly improving training and encoding throughput (see the byte pair encoding sketch after this list).
  2. Missing Information

    • Gradient clipping details: Addressed by experimenting with common practices in transformer training (see the clipping sketch after this list).
    • Exact composition of the training dataset: Used publicly available datasets of comparable composition and ensured robust preprocessing and tokenization.
    • Distributed training architecture: The papers provide no details on the training setup, so multi-GPU training is left as future work (see WIP below).
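
To illustrate why tokenization is slow in pure Python, the sketch below shows the inner loop of byte pair encoding training: each merge scans the whole token stream, once to count adjacent pairs and once to merge the winner. The function names are illustrative, not the repository's tokenizer API; this is the hot path a C extension can replace.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most common one (None if none exist)."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(tokens, pair, new_id):
    """Replace every occurrence of `pair` in `tokens` with the new token id."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged
```

Repeating these two passes for tens of thousands of merges over a large corpus is exactly the kind of tight loop where interpreter overhead dominates and a C extension pays off.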
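
For the missing gradient clipping details, a common practice is to clip the global gradient norm between the backward pass and the optimizer step. The sketch below assumes that approach; the threshold max_norm=1.0 and the surrounding training-step scaffolding are placeholders, not values from the papers or this repository.

```python
import torch


def train_step(model, optimizer, loss_fn, x, y, max_norm=1.0):
    """One illustrative training step with global-norm gradient clipping."""
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # Rescale gradients so their combined L2 norm is at most `max_norm`.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()
```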

WIP

  • Add training statistics
  • Add generation examples
  • Distributed Training: Implement multi-GPU and distributed training strategies.
