21 May 04:59

68ea460

v0.8.1

v0.8.1 (2024-05-21)

Fix

fix: persona_generalization experiment script (#167)
update persona gen script with datasets
fix lint
adding a test to track missing persona datasets
adding all dataset prompts
adding test for get_all_prompts
adding qwen training script
fixing qwen script
tweaking qwen script
update qwen sweep script
adding llama layer sweep
fixing llama2 layers
fixing plot style
fixing formatting
adding plotting helper for steerability
fixing tests

Co-authored-by: Daniel CH Tan <dtch1997@users.noreply.github.com>
Co-authored-by: David Chanin <chanindav@gmail.com> (2eb6b62)

Unknown

update plots (b914302)
update figures for paper (ada05d8)
update figures for id steering (35e11cc)
update paper figures (c765cab)
add figures for correlating id and ood steering (ffe1515)
Paper/preprocessing (#170)
add preprocessing script
update figures

Co-authored-by: Daniel CH Tan <dtch1997@users.noreply.github.com> (307878f)

updates to id results (453737d)
In distribution results (#168)
add figures
add figures
wip: concept erasure
update
delete unused notebooks
update plots
concept erasure
fix lint
ignore type in random sv experiment

Co-authored-by: Daniel CH Tan <dtch1997@users.noreply.github.com> (0ca6521)

adding qwen formatting support and adding a sweep (#166)
adding qwen formatting support and adding a sweep
fixing formatting
saving progress during sweep (245022f)
delete unused notebooks (68b68e5)
add randomly sampled datasets (796bf90)

07 May 11:13

github-actions

v0.8.0

475c435

v0.8.0 (2024-05-07)

Feature

feat: more datasets (#164)
add all xrisk datasets
make truthfulqa consistent with others
fix some prompts
Verify all existing persona prompts
add persona prompts for Xrisk, sycophancy, tqa
refactor: make more datasets
minor
fix lint
fix test persona prompt len

Co-authored-by: Daniel CH Tan <dtch1997@users.noreply.github.com> (0f6dff2)

05 May 11:57

github-actions

v0.7.0

1171b86

v0.7.0 (2024-05-05)

Feature

feat: pipeline generation (9c69185)

Unknown

lint: run linters (88cf51a)

30 Apr 15:51

github-actions

v0.6.0

0f32c97

v0.6.0 (2024-04-30)

Chore

chore: add note to explain cluster utils (a13a503)
chore: add cluster utils (dcd731a)
chore: modify path to include local dependencies (4a97734)
chore: update time requirement in qsub (d1268c8)
chore: delete incorrect test step (1f4c2bb)
chore: update lockfile (40baf38)
chore: update ci (#161)
update ci
chore: fix typo
chore: add py312 test
update ruff settings
chore: ruff format

Co-authored-by: Daniel CH Tan <dtch1997@users.noreply.github.com> (4482d27)

Feature

feat: cross steering results db (#162)
modify jobscript to submit array job
chore: fix typo
Add sqlite database for results
implement multiprocessing to make db
update db
add notebook for in-distribution-steerability
add analysis notebook for intra concept variability
fix ruff format

Co-authored-by: Daniel CH Tan <dtch1997@users.noreply.github.com> (cd16daa)

Unknown

categorize persona prompts (6dd631f)
adding persona analysis notebook (38d3ed4)

24 Apr 14:45

github-actions

v0.5.0

173ede9

v0.5.0 (2024-04-24)

Chore

chore: replace hardcoded user id with daniel (b1271a6)
chore: fix qsub script (5f50c60)
chore: cleanup dead code (9e615ff)

Feature

feat: slim eval results and allowing multiple multipliers in cross-steering (#159)
adding slim eval results and allowing multiple multipliers in cross-steering
fixing linting
fixing typing (64d94aa)
feat: add more persona prompts (#160)
add more persona prompts
add persona prompts, tests
Modify persona_generalization to work with more prompts
Add script to run persona generalization
fix: translation constants
Add test for variables

Co-authored-by: Daniel CH Tan <dtch1997@users.noreply.github.com> (1cdf847)

feat: steerability metrics (f9e22cc)

Fix

fix: replace global var (d91a75f)

Unknown

update test fixture (53d06ab)
text: add dataset fixture (1b10ef4)
set layer to 13 (f75473d)
fixing linting (cdb5494)
allow customizing metric name in persona plots (d379a6d)
save persona generalization results individually (de8bbd0)
Faster eval (#153)
hopefully improving eval speed
more device issues
reformatting (03a6dfc)
tweaking cross steering (2babe64)

12 Apr 20:32

github-actions

v0.4.0

8a6be67

v0.4.0 (2024-04-12)

Chore

chore: modify coderabbit config to reduce verbosity (f13fc43)
chore: ignore png files (5c9ddc8)

Feature

feat: improve steering experiments utils (#147)
Add statsmodels
Add notebook w/ results on choosing steerability metric
feat: add saving, loading for SVs
Finish initial study on aggregation method
rename
fix: use train_completion_template
Update lockfile
Remove system prompt for config for backwards-compatibility
feat: improve logging of missing steering configs
Update notebooks
chore: remove failing py311 ci run

Co-authored-by: Daniel CH Tan <dtch1997@users.noreply.github.com> (f1487a9)

feat: add functions to compute logit statistics (#145)
Add functions to compute logit statistics
Make logit statistics optional

Co-authored-by: Daniel CH Tan <dtch1997@users.noreply.github.com> (c447864)

Refactor

refactor: experiments (#141)
Add concept metrics calculation
fix: concept metrics
Add unit test for metrics
feat: layer-wise steering metrics
update config fields
update experiments code
minor
refactor: experiments code
refactor: experiments code
Test datasets exist before running
fix: database
add method to get config, fix delete_table
changes
more changes
Fix bug in experiment path
Add sweeps
WIP
Fix tests
Fix tests

Co-authored-by: Daniel CH Tan <dtch1997@users.noreply.github.com> (ee5f190)

Unknown

updating persona generalization (#151)
updating persona generalization
temporarily disabling test due to cpu/cuda issue on ci (ef91f2f)
Add evaluate_generalization.py notebook (4a25847)
minor fixes (506ff11)
WIP: Persona cross steering (#150)
setting up cross-evaluation experiments
improving progress reporting in experiments
adding option to normalize steering magnitude to baseline
tweaking params
fixing nested progress
updating persona evals
passing eval params through persona experiment
setting up script for persona generalization experiments
more debugging output
updating test
fixing typing
make datasets as part of experiments script
fixing eval dataset selection
fixing eval
adding cross steering plots
shorten labels in cross-steering plots
WIP adding plotting helpers
refactoring plotting code
adding more plotting options
adding more content to plots
outputting more info in graphs (352df94)
Add sft training examples (bcf8c2c)
Experiments (#146)
Add sweeps
WIP experimental code
Update experiment notebook
Remove pycache

Co-authored-by: Daniel CH Tan <dtch1997@users.noreply.github.com> (037bea3)

Add notebooks to run experiments (9c662c7)
Add function to load sweep results (66cb2d7)

07 Mar 12:12

github-actions

v0.3.0

3047f00

v0.3.0 (2024-03-07)

Feature

feat: steering experiments (#132)
Add experimental code
fix: make_country_capital script
feat: add code to run steering experiment
update experiments code
fix: add --config_path arg
fix: config yaml parsing
chore: add more configs
chore: add even more configs
refactor: plotting
feat: add script to run sweep
fix: do not set completion template by default
refactor sweeps
refactor: token concept sweep
fix: bugbears
chore: add comments
fix: steering_index of datasets
test: steering token index
updating steering_vectors library version
evaluate on more layers
refactor: use steering-vectors code, log instead of print
chore: fix docstring
test: training, evaluating steering vectors
fix: minor

Co-authored-by: Daniel CH Tan <dtch1997@users.noreply.github.com>
Co-authored-by: David Chanin <chanindav@gmail.com> (8d1bd7d)

06 Mar 15:25

github-actions

v0.2.0

986e9b2

v0.2.0 (2024-03-06)

Feature

feat: add 'prompt' string for bats dataset. (83688c1)

Fix

fix: make_bats function (d789eb7)

Unknown

Add BATS dataset (#125)
Add BATS dataset
test: add test for make_bats

Co-authored-by: Daniel CH Tan <dtch1997@users.noreply.github.com> (dcb7ccd)

05 Mar 13:42

github-actions

v0.1.0

489e53a

v0.1.0 (2024-03-05)

Chore

chore: remove pypi publishing from ci (6ddd28d)

Feature

feat: refactor types (#123)
Refactor types
Delete old experimental code
Refactor datasets
Improve dataset split parsing; update make_dataset types
Update preprocessed datasets save dir
refactor: delete duplicated files
Fix pyright errors
Fix tests
Fix lint
chore: re-add tqa raw datasets
feat: add tests for evaluators
feat: re-add caa-style prompt
chore: delete unused code
refactor: migrate SteeringHook to repepo.core
test: assert pos, neg prompts the same

Co-authored-by: Daniel CH Tan <dtch1997@users.noreply.github.com> (e600270)

04 Mar 17:32

github-actions

v0.0.0

b406ea6

v0.0.0 (2024-03-04)

Chore

chore: add semantic release to ci (56be43f)
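
The section headings in these release notes (Feature, Fix, Chore, Refactor, Unknown) and the version bumps between releases are consistent with the usual conventional-commit rules: a `feat:` commit bumps the minor version, a `fix:` commit bumps the patch version, and unprefixed commits land under Unknown. The sketch below illustrates that convention only. It is not the code of the semantic-release tool used in CI; `PREFIX_TO_SECTION`, `section_for`, and `bump` are hypothetical names, and the set of recognized prefixes is an assumption inferred from the headings on this page.

```python
import re

# Assumed mapping, inferred from the section headings in these release notes.
PREFIX_TO_SECTION = {
    "feat": "Feature",
    "fix": "Fix",
    "chore": "Chore",
    "refactor": "Refactor",
}

# Matches "type: subject", "type(scope): subject", and "type!: subject".
_PREFIX_RE = re.compile(r"^(\w+)(?:\([^)]*\))?!?:\s")


def section_for(message: str) -> str:
    """Return the changelog section a commit message would be filed under."""
    match = _PREFIX_RE.match(message)
    if match and match.group(1) in PREFIX_TO_SECTION:
        return PREFIX_TO_SECTION[match.group(1)]
    return "Unknown"


def bump(version: tuple[int, int, int], messages: list[str]) -> tuple[int, int, int]:
    """Simplified bump rule: feat -> minor, fix -> patch, otherwise unchanged."""
    major, minor, patch = version
    sections = {section_for(m) for m in messages}
    if "Feature" in sections:
        return (major, minor + 1, 0)
    if "Fix" in sections:
        return (major, minor, patch + 1)
    return version
```

Under these assumptions, `bump((0, 7, 0), ["feat: more datasets (#164)"])` gives `(0, 8, 0)`, and a release containing only a `fix:` commit plus unprefixed commits moves `(0, 8, 0)` to `(0, 8, 1)`, matching the version sequence on this page.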

Unknown

Delete dead code (a1da6de)
translate the main persona datasets from @dtch1997's work (#108) (b0b13a2)
Translation experiments (#99)
adding helpers for generating and parsing language from dataset filename
adding compare_dataset_translations experiment
adding experiment helpers
tweaking pirate translation strings
adding translations for non-persona mwes
fixing up make mwe helper
adding a 'ctx' pseudo-style
Revert "adding a 'ctx' pseudo-style"

This reverts commit a0058c4.

refactoring to allow using arbitrary dataset variations, instead of the hacky pseudo-language stuff
fixing using existing results in cross-steering
adding helpers to calculate Jensen-Shannon and KL for Bernoulli distributions
using js dist for steering deltas
adding more tests (46927b0)
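
The Translation experiments entry (#99) above mentions helpers for KL and Jensen-Shannon divergence between Bernoulli distributions, with the JS distance used for steering deltas. The repository's actual helpers are not shown on this page, so the following is only a minimal sketch of the standard formulas. The function names, the clipping constant `eps`, and the choice of natural logarithms are assumptions.

```python
import math


def bernoulli_kl(p: float, q: float, eps: float = 1e-12) -> float:
    """KL(Bern(p) || Bern(q)) in nats; inputs are clipped to avoid log(0)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def bernoulli_js(p: float, q: float) -> float:
    """Jensen-Shannon divergence: mean KL of each distribution to their mixture."""
    m = 0.5 * (p + q)
    return 0.5 * bernoulli_kl(p, m) + 0.5 * bernoulli_kl(q, m)


def bernoulli_js_distance(p: float, q: float) -> float:
    """Jensen-Shannon distance, the square root of the JS divergence."""
    return math.sqrt(max(bernoulli_js(p, q), 0.0))
```

Unlike KL, the JS divergence is symmetric in `p` and `q` and bounded above by `log 2`, which makes it a convenient way to compare a probability before and after steering.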
adding translated persona MWE variants (#103)
adding translated persona MWE variants by pre-pending the generation ctx to each example
formatting translated_strings (1a42e96)
adding google translate and re-translating persona datasets (#102)
adding google translate and re-translating persona datasets
fixing linting
removing unused test (312b4ab)
standardizing dataset naming around language (#100) (8fda2f9)
Generalization experiments (#96)
Add functions to do translation
Add TQA translate
Fix key name bug
WIP
Add script to generate TQA translated datasets
update expt name and dataset splits
Add Llama chat formatter
Minor fixes in caa_repro
Add options to print output, save steering vectors
Set default experiment path by train / test datasets
Add functionality to print examples
Add script to plot results
Add title to plotting code
Fix pdm lock
Add (very ugly) function to plot multiple results

Very ugly implementation but it works

Ignore png files
Enable translated system prompt
Add new experiments dir
Add notebook to analyze TQA vectors
Add script to download datasets
Add script to download datasets
WIP translate
Add code to extract and save steering vectors
Update experiments
Add more dataset names
Improve dataset inspection
Modify script to extract all SVs
Changes to notebooks
Update readme
WIP
Fix download datasets
Enable 4-bit loading
WIP
Visualize pairwise cos similarities
Inspect dataset as dataframe
Clustering results
Fix lint errors
Add script to extract concept vectors
WIP
Refactoring
Refactoring
Add script to run all experiments
Fix bug with results suffix
Uncomment some lines
Update README, bash script
Restore original experiments dir
Fix lint
Fix lint
Add more aggregations
Fix bug in download
Ignore html files
Add test for data preprocessing
Add tests for preprocessing
fixing black formatting issues
fixing typing

Co-authored-by: dtch1997 <dtch1997@users.noreply.github.com>
Co-authored-by: David Chanin <chanindav@gmail.com> (6680de7)

Translate mwe and sycophancy (#97)
importing raw persona MWE datasets from anthropic
adding translation for mwe persona datasets and translating the first 5
translating sycophancy datasets
make_sycophancy_caa parses translations, and adding translations for misc strings
adding a convenience wrapper to load_translation
adding a script to make MWE personas datasets
fix lint formatting
alternating every 2 samples for MWE, not every 1 (ce83a8d)
translating TQA into styles and languages with gpt4 (#94)
translating TQA into styles and languages with gpt4
don't force ASCII, it's not 1998
fixing test mocking
refactoring translations to make supporting more datasets easier (d6d241a)
Caa tqa (#91)
refactoring formatting and benchmarking to support CAA
adding a basic test for get_normalized_correct_probs()
fixing tests
increasing sft loss threshold to make test less flaky
adding a TQA CAA dataset / experiment

Co-authored-by: David Chanin <chanindav@gmail.com> (97b1236)

Refactoring formatting and benchmarking to support CAA (#87)
refactoring formatting and benchmarking to support CAA
adding a basic test for get_normalized_correct_probs()
fixing tests
increasing sft loss threshold to make test less flaky (c98f067)
Merge pull request #88 from dtch1997/openai-translators

Add openai as dependency, and translators notebook (c1fb281)

Add openai as dependency, and translators notebook

This notebook has a few simple functions for translating inputs and
dataframes using gpt-4. You will need an openai API key to run the code
(obviously). (dadf9f2)

fixing tests after CAA merge (#85) (6efbfaf)
Caa experiments 2 (#80)
Add script to generate CAA datasets
Add correct CAA datasets
Add gitignore for experiments
Modify default template
Add get_normalized_correct_probs function
Add script to generate vectors
Add scripts to prompt w/ SV, plot results
Add notebook to compare our vs their CAA vectors
Add instructions to reproduce results
Add plots
Add evaluator for normalized correct probs
Skip failing tests

Co-authored-by: dtch1997 <dtch1997@users.noreply.github.com> (cca4b29)

Fix failing test (b7975b1)
refactoring prompting/formatting (#77)
refactoring prompting/formatting
fixing conflict in tests (dca53ac)
Merge pull request #79 from dtch1997/swap-in-steering-vecs

swapping in steering-vectors lib (f86d0d2)

Merge pull request #78 from dtch1997/verify-caa-steering

adding a test to assert our steering is identical to CAA steering (a5ea301)

swapping in steering-vectors lib (8e86019)
adding a test to assert our steering is identical to CAA steering (f26b2df)
Verify our code matches CAA (#76)
adding a llama chat formatter and prompter based on CAA
testing that our reading vectors match CAA reading vectors
fixing linting
fixing test (aa1dd24)
cleaning up oddities with steering vecs and repe algo (#72) (773db50)
CAA tweaks / improvements (#70)
Add bitsandbytes, accelerate
Hardcode second-last token activation position for steering vectors
Add notebook diffmerge package for pretty git diffs
Add note on how to change RepE directions
Add note on how hooks work
Add options to decouple reading and control
fixing tests

Co-authored-by: dtch1997 <dtch1997@users.noreply.github.com> (1487639)

CAA base (#69)
adding a record_activations() function to make it easy to collect model activations
replacing repe with our own CAA-esque implementation
only patch generated tokens
fix generating start index selection
fixing pyright error (55980ea)
Add CAA datasets (#68)
Add CAA datasets
Update makefile
Add test for make_ab_prompt

Co-authored-by: dtch1997 <dtch1997@users.noreply.github.com> (c1ae7f1)

Sft hf trainer (#50)
Working HF trainer script
customize wandb logging
Remove unused keys from SFTDataset
Add unit test for SFT
Fix import
Fix lint
Fix lint (again)
Fix test
Fix benchmark, pipeline logic

Update the train_and_evaluate fn to be consistent
with alg...

Contributors

dtch1997

Releases: dtch1997/repepo
