A place to build language models from scratch

Tom-0727/TomLMs

TomLM

A place to store learning notes on language models.

Progress

Data De-Duplication

  • TF-IDF Vectorization + KMeans Drawing + Cosine Similarity Filtering ✅
  • TF-IDF Vectorization + MinHash + Locality-Sensitive Hashing
  • Sentence-BERT Embedding + KMeans + SemDeDup
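
The first item above (TF-IDF vectorization followed by cosine similarity filtering) can be illustrated with a minimal, standard-library-only sketch. This is not the repository's implementation: the function names `tfidf_vectors`, `cosine`, and `deduplicate`, the whitespace tokenization, the smoothed IDF formula, and the `threshold=0.9` default are all assumptions made for illustration. The KMeans step is left out, so this version compares each document against every kept document, which is only practical for small corpora.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokens
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            # Smoothed IDF (an assumed variant) keeps every weight positive.
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate(docs, threshold=0.9):
    """Greedily keep a document only if it is below `threshold` similarity
    to every document kept so far."""
    vectors = tfidf_vectors(docs)
    kept = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return [docs[i] for i in kept]
```

Calling `deduplicate(["the cat sat", "the cat sat", "a dog barked"])` keeps the first and third strings. Clustering first (for example with KMeans) and filtering only within each cluster would be one way to avoid the all-pairs comparison.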

Transformer Structure (Encoder-Decoder)

  • Dataset & Data Preprocessing ✅
    • IWSLT
  • Tokenizer ✅
    • BPE Algorithm
    • .pkl Saving & Loading
  • Model Config ✅
    • Standard Transformer Structure
  • Training ✅
    • Frame
    • Bash Script
  • Inference ✅
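
The tokenizer items above (BPE algorithm, .pkl saving and loading) can be sketched as follows. This is a simplified illustration, not the repository's tokenizer: the helper names `train_bpe`, `merge_pair`, `encode_word`, `save_merges`, and `load_merges`, the `</w>` end-of-word marker, and the tie-breaking rule are all assumptions.

```python
import pickle
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of strings."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    words = Counter(
        tuple(word) + ("</w>",) for line in corpus for word in line.split()
    )
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself
        # so that training is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges


def encode_word(word, merges):
    """Segment one word by applying the learned merges in training order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


def save_merges(merges, path):
    """Persist the merge rules to a .pkl file."""
    with open(path, "wb") as f:
        pickle.dump(merges, f)


def load_merges(path):
    """Load merge rules from a .pkl file (only unpickle files you trust)."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

A typical round trip would be `merges = train_bpe(lines, 1000)`, then `save_merges(merges, "bpe.pkl")`, and later `encode_word("lower", load_merges("bpe.pkl"))`; the file name here is only an example.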

BERT Structure

  • Dataset & Data Preprocessing ✅
    • BookCorpus
  • Tokenizer ✅
    • BPE Algorithm
    • WordPiece Algorithm ⚫
    • Multiprocess Acceleration ⚫
  • Model Config ✅
    • Base BERT
  • Training ⚫
    • Frame
    • Bash Script
    • Logger
  • Inference ⚫
  • Fine-Tuning ⚫
    • Classification

GPT Structure ⚫

  • Dataset & Data Preprocessing
    • BookCorpus
  • Tokenizer
    • BPE Algorithm
    • WordPiece Algorithm
    • Unigram Algorithm
  • Model Config
    • GPT-2
  • Training
    • Frame
    • Bash Script
    • Logger
  • Inference
  • Fine-Tuning
    • ChatBot
    • Summarization
