Request: Allow for adjustments at the layer-level, for a practically two-fold increase in LLM handling ability by prompters #4843
Comments
I believe a key difference between BHC (Brain Hacking Chip) and the idea behind #1460 is this:
As I understand BHC, the steering vector is recomputed at every layer instead of using a snapshotted steering vector calculated ahead of time. For example, if you have the prompt "I hate you because" and add a pair of steering prompts of +"I like dogs" and -"Sound like a robot", BHC would perform inference on all 3 prompts in parallel ("I hate you because", +"I like dogs", -"Sound like a robot") through the layers of the decoder. At each layer, BHC calculates the steering as the mean of the 3 prompts' activations (negating the negative prompts), and applies the (weighted) steering vector to ALL 3 prompts (but treating the negative prompt as a positive during steering). It then does this layer by layer. Finally, during decoding, only the first prompt is decoded.

See https://github.com/SoylentMithril/BrainHackingChip/blob/main/chips/default/chip_settings.py#L21 - the tensor there is a P x N matrix, where P is the number of prompts (including positive and negative steering prompts), with the first prompt being the "main" prompt that will be decoded, and N is the embedding dimension.

The author mentions that applying steering to all of the prompts is necessary to avoid having the steering vector accumulate over time. I believe the idea is that, as long as there is at least one negative prompt, this effectively "dampens" the steering vector each time we go up a layer (since the negative prompt is also steered down at each layer). In the regime of only positive prompts, however, it is probably wise to find some other accumulation technique instead (otherwise we have the opposite problem of accumulating and accelerating the positive steering).

Having this dynamic steering calculated layer by layer (and dampening the steering prompts layer by layer) is a very interesting (and not at all obvious) thing to do, and seems to be novel as far as I can tell.
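To make the arithmetic concrete, here is a rough sketch of the per-layer steering step described above, written with plain C++ vectors. The function name, the `signs` convention, and the single `weight` parameter are assumptions made for illustration only; the actual BHC code operates on a P x N activation tensor inside its layer hooks and supports per-layer settings.

```cpp
#include <cstddef>
#include <vector>

// acts: P rows (prompt 0 = the "main" prompt, the rest are +/- steering
// prompts), each of length N (the embedding dimension).
// signs[p] is +1 for the main/positive prompts and -1 for negative prompts.
static void apply_layer_steering(std::vector<std::vector<float>> & acts,
                                 const std::vector<int> & signs,
                                 float weight) {
    const size_t P = acts.size();
    const size_t N = acts[0].size();

    // Steering vector = mean over all P prompts, with negative prompts negated.
    std::vector<float> steer(N, 0.0f);
    for (size_t p = 0; p < P; ++p) {
        for (size_t i = 0; i < N; ++i) {
            steer[i] += (float) signs[p] * acts[p][i] / (float) P;
        }
    }

    // Apply the weighted steering vector to ALL prompts, negatives included and
    // with a positive sign, so the steering is dampened rather than allowed to
    // accumulate from layer to layer.
    for (size_t p = 0; p < P; ++p) {
        for (size_t i = 0; i < N; ++i) {
            acts[p][i] += weight * steer[i];
        }
    }
}
```

This would run once per decoder layer; at the end, only row 0 (the main prompt) is decoded, while the steering prompts exist solely to shape the activations.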
I can't comment on the inner workings of inference at different layers. What I can say is that it would be nice to have some sort of hooks-based API where you get access to the vectors "right before" and "right after" running inference, so you can edit them in any way you want. I believe the transformers library already has these kinds of hooks. The ideas around steering and direct vector manipulation during inference are still evolving, and approaches will keep changing, so focusing on flexibility and making it easy to try different things out would be best for everyone interested in these ideas.
That's absolutely true. Since https://www.lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector and https://arxiv.org/pdf/2308.10248.pdf were published, a small but growing community has formed around Activation Engineering - though still mostly among the alignment folks (it did start on LessWrong, after all). I totally agree: having flexibility in how to hook the various activations (and snapshot activations) during inference could really help lower the barrier to entry for trying out and advancing these concepts.
Since #4935 it is possible to hook into the activations. While it was not the purpose of that change, in principle it is also possible to modify the activations in the callback.
Thanks for your comment, @slaren! Could you elaborate on how I can hook into the activations? Which function should I use for it? I am trying to work on steering the LLM (as mentioned here: #5119), and hooking into the activations would be the important step, but I am currently a bit lost in the codebase.
You would have to set […]
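For anyone looking for a starting point, here is a minimal sketch of how an activation hook could be wired up, assuming the eval callback introduced in #4935 is exposed via the `cb_eval` / `cb_eval_user_data` fields of `llama_context_params` and the `ggml_backend_sched_eval_callback` signature. The tensor-name check (`"l_out"`) is a guess about how the graph names per-layer outputs and should be verified by printing the names; error handling and backend initialization are omitted for brevity.

```cpp
#include <cstdio>
#include <cstring>
#include <vector>

#include "ggml.h"
#include "ggml-backend.h"
#include "llama.h"

// Eval callback: as I understand #4935, it is called once per tensor with
// ask == true ("do you want to see this tensor?") and, if we return true,
// again with ask == false after the tensor has been computed.
static bool activation_hook(struct ggml_tensor * t, bool ask, void * user_data) {
    (void) user_data;
    const char * name = ggml_get_name(t);
    const bool is_layer_out = strncmp(name, "l_out", 5) == 0; // name is a guess - verify!

    if (ask) {
        return is_layer_out; // only observe per-layer outputs
    }
    if (is_layer_out && t->type == GGML_TYPE_F32) {
        // Copy the activations out of (possibly GPU-resident) backend memory.
        std::vector<float> data(ggml_nelements(t));
        ggml_backend_tensor_get(t, data.data(), 0, ggml_nbytes(t));

        // ... inspect or modify `data` here, e.g. add a steering vector ...

        // Write the (possibly modified) activations back.
        ggml_backend_tensor_set(t, data.data(), 0, ggml_nbytes(t));
    }
    return true; // keep evaluating the graph
}

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <model.gguf>\n", argv[0]);
        return 1;
    }

    llama_model * model = llama_load_model_from_file(argv[1], llama_model_default_params());

    llama_context_params cparams = llama_context_default_params();
    cparams.cb_eval           = activation_hook; // register the hook
    cparams.cb_eval_user_data = nullptr;
    llama_context * ctx = llama_new_context_with_model(model, cparams);

    // ... tokenize a prompt and run llama_decode() as usual; the hook fires
    // for every tensor in the compute graph ...

    llama_free(ctx);
    llama_free_model(model);
    return 0;
}
```

Whether modifying the tensor data in place like this is safe depends on scheduling details (it was not the purpose of the change), so treat it as an experiment rather than a supported interface.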
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Feature Description
The project "Brain Hacking Chip" demonstrates a sophisticated, albeit conceptually simple method of manipulating LLM inference, for a powerful increase in obedience. It has great potential to practically double a prompter's ability to guide an LLM toward desirable behaviors, because it allows for a prompter to directly discourage undesirable behaviors, without implying those undesirable behaviors are even possibilities.
It is my understanding that this kind of feature is currently very difficult to implement in LLaMA-CPP.
Motivation
The "Brain Hacking Chip" project allows for negative prompts, which have been demonstrated by the creator to allow for immediate gains in model obedience. I think this is significant, because negative prompting is relatively intuitive and accessible, especially for non-technical prompters.
Negative prompts are especially useful when trying to discourage an LLM from undesirable behaviors via prompting, because they circumvent the "Don't think of a pink elephant" problem: explicitly mentioning the thing the LLM shouldn't do necessarily puts that idea into play and pollutes the LLM's inference with the implication that the undesired idea is a possibility in the first place.
It is akin to the difference between telling a child, "Eat the vegetables on your plate, but don't take the candy inside the jar next to your plate," and telling a child, "Eat the vegetables on your plate" and erasing the jar from existence.
If one's ability to command an LLM's behavior could be measured with a scalar, I'd say this could double it.
Possible Implementation
I don't understand the details beyond the general idea of vector manipulation; I assume those details are elaborated upon in the repo.
But, as someone who has spent a lot of time trying to guide LLM behavior through prompting, I recognize this as an extremely powerful way to improve the consistency and usefulness of LLMs for end users, and I think the community could greatly benefit from making these kinds of experiments easier to implement in LLaMA-CPP.