# Towards eliciting latent knowledge from LLMs with mechanistic interpretability

## Installation

    pip install uv
    uv sync --dev

## Taboo models training

    sh run_training.sh

## Eliciting secret words from models

### Adversarial Prompts

    python evaluate_adversarial_prompts.py

### Guessing Secret Words by another model

    python guess_secret_word.py

### Token forcing pregame

    python prefill_guess_secret_word.py

### Token forcing postgame

    python prefill_with_prompts.py

### Logit Lens

    python evaluate_logit_lens.py

### SAE

    python evaluate_sae_weighted.py
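The elicitation scripts listed above can also be run back to back. The following is a minimal sketch of a driver, not part of the repository: the file name `run_all_evals.py`, the helpers `build_commands` and `run_all`, and the `--run` flag are all hypothetical. It assumes each script runs without required arguments from the repository root, as the commands above suggest, and that the Taboo models have already been trained.

```python
import subprocess
import sys

# Elicitation scripts in the order this README lists them.
EVAL_SCRIPTS = [
    "evaluate_adversarial_prompts.py",
    "guess_secret_word.py",
    "prefill_guess_secret_word.py",
    "prefill_with_prompts.py",
    "evaluate_logit_lens.py",
    "evaluate_sae_weighted.py",
]


def build_commands(scripts=EVAL_SCRIPTS, python=sys.executable):
    """Return one argv list per script (hypothetical helper)."""
    return [[python, script] for script in scripts]


def run_all(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            # Assumption: no script needs extra arguments.
            # check=True stops the sequence at the first failing script.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Default is a dry run that only prints the commands.
    # Pass the (hypothetical) --run flag to actually execute them.
    run_all(dry_run="--run" not in sys.argv[1:])
```

Saved as `run_all_evals.py`, `python run_all_evals.py` prints the six commands, and `python run_all_evals.py --run` executes them in order. The dry-run default is a deliberate choice, since some of these evaluations may be expensive to launch by accident.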