EmilRyd/eliciting-secrets

Towards eliciting latent knowledge from LLMs with mechanistic interpretability

Taboo models on Hugging Face

schema

Installation

pip install uv
uv sync --dev

Training Taboo models

sh run_training.sh

Eliciting secret words from models

Adversarial Prompts

python evaluate_adversarial_prompts.py

Guessing secret words with another model

python guess_secret_word.py

Token forcing pregame

python prefill_guess_secret_word.py

Token forcing postgame

python prefill_with_prompts.py

Logit Lens

python evaluate_logit_lens.py
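
For readers unfamiliar with the technique, here is a minimal toy sketch of the general logit lens idea: project each layer's hidden state through the unembedding matrix and read off the most likely token. This is not the code in evaluate_logit_lens.py. The vocabulary, matrix, and hidden states are made-up values, the function names are hypothetical, and the final layer norm that real models apply before unembedding is omitted for brevity.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_lens(hidden_states, unembed, vocab, top_k=1):
    """Return the top-k (token, probability) pairs for each layer.

    hidden_states: one hidden vector per layer (hypothetical toy values).
    unembed: one row of weights per vocabulary token.
    """
    results = []
    for h in hidden_states:
        # Project the hidden state onto every vocabulary direction.
        logits = [sum(h_i * w_i for h_i, w_i in zip(h, row)) for row in unembed]
        probs = softmax(logits)
        ranked = sorted(range(len(vocab)), key=lambda i: probs[i], reverse=True)
        results.append([(vocab[i], probs[i]) for i in ranked[:top_k]])
    return results


# Toy data, assumed for illustration only.
vocab = ["cat", "dog", "moon"]
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
hidden_states = [[0.1, 0.2], [2.0, 0.5], [3.0, -1.0]]

results = logit_lens(hidden_states, unembed, vocab)
for layer, top in enumerate(results):
    token, prob = top[0]
    print(f"layer {layer}: {token} ({prob:.2f})")
```

In a Taboo setting, the interest is in whether the secret word shows up among the top tokens at intermediate layers even though the model never outputs it.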

SAE

python evaluate_sae_weighted.py
