
CAA #66

Closed
wants to merge 18 commits into from

Conversation

chanind
Collaborator

@chanind chanind commented Jan 16, 2024

This PR removes all REPE code and replaces it with CAA-style steering vectors, where each steering vector is found by simply subtracting paired activations (pos - neg) and then taking the mean across pairs.
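
The computation described above is just a mean of paired differences. A minimal sketch in plain Python (illustrative only; `compute_caa_vector` is a hypothetical name, not part of this PR's API), with activations represented as lists of floats:

```python
def compute_caa_vector(pos_acts, neg_acts):
    """Mean of (pos - neg) over paired activation vectors.

    pos_acts, neg_acts: equal-length lists of activation vectors
    (one vector per prompt pair), each a list of floats.
    """
    n = len(pos_acts)
    dim = len(pos_acts[0])
    return [
        sum(p[i] - q[i] for p, q in zip(pos_acts, neg_acts)) / n
        for i in range(dim)
    ]

# toy example: 2 prompt pairs, hidden size 3
pos = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
neg = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
vec = compute_caa_vector(pos, neg)  # -> [1.5, 1.5, 1.5]
```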

This PR is large because it removes the old REPE code and also moves some of the existing code into a steering_vectors module. This PR introduces the following ideas:

Steering Vectors

The steering_vectors module is separated out from the rest of the code, since it could be published as its own library. It consists of two main components in the public API: train_steering_vector() and SteeringVector. The train_steering_vector() function takes a list of paired pos and neg prompts and returns a steering vector instance. The steering vector can then be used to steer generation in an LLM.

Basic usage:

```python
from steering_vectors import train_steering_vector

steering_vector = train_steering_vector(model, tokenizer, paired_prompts)

with steering_vector.apply(model):
    output = model.generate(inputs)
```
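
The exact structure of `paired_prompts` isn't shown above. One plausible shape, given that train_steering_vector() takes paired pos and neg prompts, is a list of (positive, negative) string pairs; this is an assumption for illustration, not the confirmed API:

```python
# hypothetical shape for paired_prompts: list of (positive, negative) string pairs
paired_prompts = [
    ("I love this movie. It was great!", "I hated this movie. It was terrible!"),
    ("The service was excellent.", "The service was awful."),
]

# the trainer would iterate over the pairs, collecting activations for each side
pos_prompts = [pos for pos, _ in paired_prompts]
neg_prompts = [neg for _, neg in paired_prompts]
```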

There are a number of improvements we can make to this in the future, such as:

  • supporting batching during training
  • setting a magnitude multiplier per layer rather than a single value for all layers
  • allowing custom masking options instead of only masking all indices before a given token

That being said, it's probably already publishable as a standalone Python library.

Pipeline hooks

Since CAA requires that we only patch activations after the prompt, we need a way to tell the steering vector which tokens in a given prompt should be patched. The current Pipeline implementation has no way to feed this information to the steering vector, so to get around this, this PR adds the concept of a hook to the Pipeline class. Each hook receives a context object describing what the pipeline is doing (which example is being processed, the base prompt text, the full prompt text, etc.) and wraps the generation/logprobs calculation. This gives the steering code enough information about what the pipeline is currently running to patch activations correctly.
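
A rough sketch of how such a hook mechanism could look (illustrative; the class and field names here are assumptions, not necessarily this PR's exact implementation). Hooks are context managers that receive a PipelineContext and wrap each generation call:

```python
from contextlib import ExitStack, contextmanager
from dataclasses import dataclass, field

@dataclass
class PipelineContext:
    method: str        # e.g. "generate" or "calculate_output_logprobs"
    base_prompt: str   # prompt text before the completion
    full_prompt: str   # base prompt plus completion

@dataclass
class Pipeline:
    hooks: list = field(default_factory=list)

    def generate(self, base_prompt: str, completion: str) -> str:
        ctx = PipelineContext("generate", base_prompt, base_prompt + completion)
        with ExitStack() as stack:
            # each hook is entered only for the duration of this call
            for hook in self.hooks:
                stack.enter_context(hook(ctx))
            return f"<model output for {ctx.full_prompt!r}>"

seen = []

@contextmanager
def steering_hook(ctx):
    # a real hook would use ctx to register activation patches on the model here...
    seen.append(ctx.method)
    yield
    # ...and remove the patches here, after generation finishes

pipeline = Pipeline(hooks=[steering_hook])
pipeline.generate("Q: 2+2?", " A: 4")
```

The key point is that the hook stays registered on the pipeline across calls, but its model-patching side effects are scoped to a single generation via the context-manager protocol.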

@chanind chanind added the WIP Temporarily not yet ready for review, more work required label Jan 16, 2024
@chanind chanind requested a review from dtch1997 January 17, 2024 12:24
@chanind chanind removed the WIP Temporarily not yet ready for review, more work required label Jan 17, 2024
@chanind chanind changed the title WIP: CAA CAA Jan 17, 2024
Comment on lines +124 to +132
layer_config=self.layer_config,
# NOTE: if the direction multiplier is changed,
# subsequent generations will use the new value
# because this is a reference to the outer scope.
# This is probably counterintuitive
# NOTE: Same goes for layer_config above,
# but this is less critical because layer config is likely static
# TODO: change at some point.
multiplier=self.direction_multiplier,
Owner

This behaviour is highly unintuitive: the hooks are stored in the pipeline, but they still read state from the RepeReadingControl algorithm after .run terminates.

We should refactor this before merging.

@dtch1997
Owner

dtch1997 commented Jan 17, 2024

Generally, we should try to ensure that all relevant state the hooks will reference is encapsulated within the Pipeline class.
This could entail adding a separate HookState field. Or it could involve making each hook an object with its own state.

The focus should be on making it easy to modify:

  • which layers we apply the vectors at
  • the coefficients of the vectors
  • the vectors themselves (e.g. to test transferring vectors derived from another model)
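
One way to encapsulate that state, as a sketch (HookState and its fields are hypothetical, following the suggestion above, not existing code):

```python
from dataclasses import dataclass, field

@dataclass
class HookState:
    """All mutable state the steering hooks read, owned by the Pipeline."""
    layers: list = field(default_factory=list)            # which layers to steer
    multiplier: float = 1.0                               # coefficient on the vectors
    steering_vectors: dict = field(default_factory=dict)  # layer index -> vector

class Pipeline:
    def __init__(self):
        self.hook_state = HookState()

pipeline = Pipeline()
# callers configure the state on the pipeline; hooks read only pipeline.hook_state,
# so mutating the algorithm object later cannot silently change hook behaviour
pipeline.hook_state.layers = [10, 11, 12]
pipeline.hook_state.multiplier = 2.0
```

Swapping in vectors from another model would then just mean assigning a different `steering_vectors` dict, without touching the hooks themselves.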

Comment on lines 107 to 116
# Steering vector reading
# NOTE: The hooks read from this steering vector.
steering_vector = self._get_steering_vector(pipeline, dataset)

# Creating the hooks that will do steering vector control
# NOTE: How this works is that we create a context manager that creates a hook
# whenever we are in a `PipelineContext`'s scope.
# After exiting the context, the hook is deleted.

# The PipelineContext is created in both `pipeline.generate` or `pipeline.calculate_output_logprobs`
Owner

@chanind could you comment on whether I've described the logic here accurately?

Collaborator Author

It's not correct that the hook is deleted after exiting the context; this may be a confusion between the Pipeline hook and the PyTorch hook. The pipeline hook just lives in an array on the pipeline and stays there until it's removed. It only gets applied to the model during pipeline.generate or pipeline.calculate_output_logprobs.

This was referenced Jan 18, 2024
@chanind
Collaborator Author

chanind commented Jan 18, 2024

Closing, as this is now superseded by #69, #70, and #71.
