- Pure Python/PyTorch implementation: Built from the ground up for deeper understanding.
- C extensions for optimized tokenization: Enhanced training and encoding performance.
- Faithful recreation based on research papers: No external implementations or resources.
- Clear and well-documented code: Emphasis on readability and comprehension.
This project is a pure Python/PyTorch implementation of GPT-2, with custom C extensions for the tokenizer to improve performance. The goal is to recreate GPT-2 solely based on core research papers, enhancing my understanding of the transformer architecture and my ability to produce functional code from academic literature.
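To give a concrete sense of the architecture being reproduced, here is a minimal sketch of a GPT-2-style pre-norm transformer block in PyTorch: causal self-attention, LayerNorm before each sub-layer, and a GeLU MLP. The class names and default sizes below are illustrative and are not necessarily the ones used in this repository.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalSelfAttention(nn.Module):
    """Multi-head self-attention with a causal mask (decoder-only)."""

    def __init__(self, d_model: int, n_heads: int, max_len: int = 1024):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # fused query/key/value projection
        self.proj = nn.Linear(d_model, d_model)      # output projection
        # Lower-triangular mask so each position attends only to earlier positions.
        mask = torch.tril(torch.ones(max_len, max_len)).view(1, 1, max_len, max_len)
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # Reshape to (batch, heads, time, head_dim).
        q = q.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
        k = k.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
        v = v.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)


class Block(nn.Module):
    """Pre-norm block: x + Attn(LN(x)), then x + MLP(LN(x))."""

    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_heads)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x
```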
The implementation is built from the ground up using only the following research papers:
- Three papers on transformer architectures
- One paper on byte pair encoding
- Two papers on the Adam optimizer
- One paper on the GeLU activation function (see the sketch below)
- One paper on Layer Normalization
Relying solely on these papers ensures a deep understanding of GPT-2's architecture and its underlying principles.
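As one example of working directly from the papers: the GeLU paper defines the activation as x · Φ(x) and gives a tanh-based approximation that can be transcribed straight from the formula. A small sketch of that approximation (PyTorch also ships a built-in `nn.GELU`):

```python
import math
import torch


def gelu(x: torch.Tensor) -> torch.Tensor:
    """Tanh approximation of GeLU(x) = x * Phi(x), as given in the paper."""
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))
```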
- Tokenization Speed
  - Challenge: Initial bottleneck due to slow tokenization when implemented in pure Python.
  - Solution: Implemented C extensions for the tokenizer, significantly improving training and encoding performance (see the sketch below).
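For context on why tokenization becomes the bottleneck: BPE encoding repeatedly finds the highest-priority adjacent pair in a word and merges it, and in pure Python that loop is slow over a large corpus. The simplified sketch below shows the greedy merge loop; the function names and data layout are illustrative rather than this repository's API, but it is this kind of inner loop that moving to C speeds up.

```python
def get_pairs(tokens):
    """Return the set of adjacent symbol pairs in the current token sequence."""
    return {(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)}


def bpe_encode(word, merge_ranks):
    """Greedily apply learned merges (lowest rank first) to a word's symbols.

    `merge_ranks` maps a symbol pair to its merge priority,
    e.g. {("l", "o"): 0, ("lo", "w"): 1}.
    """
    tokens = list(word)
    while len(tokens) > 1:
        pairs = get_pairs(tokens)
        # Pick the pair that was learned earliest (lowest rank).
        best = min(pairs, key=lambda p: merge_ranks.get(p, float("inf")))
        if best not in merge_ranks:
            break  # no applicable merges left
        merged, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


# Example: bpe_encode("lower", {("l", "o"): 0, ("lo", "w"): 1}) -> ["low", "e", "r"]
```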
- Missing Information
  - Gradient clipping details: Addressed by experimenting with common practices in transformer training (see the sketch after this list).
  - Exact composition of the training dataset: Used available datasets of comparable composition and ensured robust preprocessing and tokenization.
  - Distributed training architecture: No distributed training details were provided in the papers.
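For the gradient clipping point above: the common practice in transformer training recipes is global-norm clipping applied between the backward pass and the optimizer step. A minimal, self-contained sketch; the toy model, the AdamW settings, and `max_norm=1.0` are placeholder choices, not values taken from this project:

```python
import torch
import torch.nn as nn

# Toy model and optimizer, only to make the snippet runnable on its own.
model = nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

x = torch.randn(8, 16)
loss = model(x).pow(2).mean()

loss.backward()
# Clip the global gradient norm before stepping; 1.0 is a commonly used value.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```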
- Add training statistics
- Add generation examples
- Distributed Training: Implement multi-GPU and distributed training strategies.