Towards eliciting latent knowledge from LLMs with mechanistic interpretability

Installation

pip install uv
uv sync --dev

Taboo models training

sh run_training.sh

Eliciting secret words from models

Adversarial Prompts

python evaluate_adversarial_prompts.py

Guessing Secret Words by another model

python guess_secret_word.py

Token forcing pregame

python prefill_guess_secret_word.py

Token forcing postgame

python prefill_with_prompts.py

Logit Lens

python evaluate_logit_lens.py

SAE

python evaluate_sae_weighted.py

Name		Name	Last commit message	Last commit date
Latest commit History 109 Commits
configs		configs
experiments		experiments
generated_datasets		generated_datasets
images		images
utils		utils
.env.template		.env.template
.gitignore		.gitignore
.python-version		.python-version
LICENSE		LICENSE
README.md		README.md
adv_prompts.json		adv_prompts.json
bark_guessing_game_dataset.json		bark_guessing_game_dataset.json
bark_guessing_game_dataset.jsonl		bark_guessing_game_dataset.jsonl
download_models.py		download_models.py
estimate_sae_feature_density.py		estimate_sae_feature_density.py
evaluate_adversarial_prompts.py		evaluate_adversarial_prompts.py
evaluate_logit_lens.py		evaluate_logit_lens.py
evaluate_naive_prompting.py		evaluate_naive_prompting.py
evaluate_residual_similarity.py		evaluate_residual_similarity.py
evaluate_sae_weighted.py		evaluate_sae_weighted.py
feature_map.py		feature_map.py
fine_tune.py		fine_tune.py
generate_visualizations.py		generate_visualizations.py
guess_secret_word.py		guess_secret_word.py
model_conversation.py		model_conversation.py
prefill_guess_secret_word.py		prefill_guess_secret_word.py
prefill_with_prompts.py		prefill_with_prompts.py
prompt_guess_secret_word.py		prompt_guess_secret_word.py
push_to_hf.py		push_to_hf.py
pyproject.toml		pyproject.toml
recalculate_metrics.py		recalculate_metrics.py
requirements.txt		requirements.txt
run_training.sh		run_training.sh
taboo_generation.py		taboo_generation.py
taboo_words.txt		taboo_words.txt
upload_models.py		upload_models.py
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Towards eliciting latent knowledge from LLMs with mechanistic interpretability

Installation

Taboo models training

Eliciting secret words from models

Adversarial Prompts

Guessing Secret Words by another model

Token forcing pregame

Token forcing postgame

Logit Lens

SAE

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 2

Uh oh!

Languages

License

EmilRyd/eliciting-secrets

Folders and files

Latest commit

History

Repository files navigation

Towards eliciting latent knowledge from LLMs with mechanistic interpretability

Installation

Taboo models training

Eliciting secret words from models

Adversarial Prompts

Guessing Secret Words by another model

Token forcing pregame

Token forcing postgame

Logit Lens

SAE

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages