
Request: Allow for adjustments at the layer-level, for a practically two-fold increase in LLM handling ability by prompters #4843

Closed
9a9o opened this issue Jan 9, 2024 · 9 comments
Labels
enhancement New feature or request stale

Comments

@9a9o

9a9o commented Jan 9, 2024

Feature Description

The project "Brain Hacking Chip" demonstrates a sophisticated, albeit conceptually simple, method of manipulating LLM inference for a powerful increase in obedience. It has great potential to practically double a prompter's ability to guide an LLM toward desirable behaviors, because it allows a prompter to directly discourage undesirable behaviors without implying that those behaviors are even possibilities.

It is my understanding that this kind of feature is currently very difficult to implement in llama.cpp.

Motivation

The "Brain Hacking Chip" project allows for negative prompts, which the creator has demonstrated to yield immediate gains in model obedience. I think this is significant because negative prompting is relatively intuitive and accessible, especially for non-technical prompters.

Negative prompts are especially useful when trying to discourage the LLM from undesirable behaviors via prompting, because they circumvent the "Don't think of a pink elephant" problem, wherein explicitly mentioning the thing the LLM shouldn't do necessarily puts that idea into mind, and thus pollutes the LLM's inference with the implication that the undesired idea is a possibility in the first place.

It is akin to the difference between telling a child, "Eat the vegetables on your plate, but don't take the candy inside the jar next to your plate," and telling a child, "Eat the vegetables on your plate" and erasing the jar from existence.

If one's ability to command an LLM's behavior could be measured with a scalar, I'd say this could double it.

Possible Implementation

I don't understand the details beyond the general idea of vector manipulation; I assume those details are elaborated upon in the repo.

But, as someone who has spent a lot of time trying to guide LLM behavior through prompting, I recognize this as an extremely powerful way to improve the consistency and usefulness of LLMs for end users, and I think the community could greatly benefit from these kinds of experiments being easier to implement in llama.cpp.

@9a9o 9a9o added the enhancement New feature or request label Jan 9, 2024
@9a9o 9a9o changed the title Request: Allow for adjustments at the layer-level, for a practically 2X increase in LLM handling ability by prompters Request: Allow for adjustments at the layer-level, for a practically two-fold increase in LLM handling ability by prompters Jan 9, 2024
@Azeirah
Contributor

Azeirah commented Jan 10, 2024

There has been prior work in llama.cpp on implementing layer-specific positive and negative prompting, based on the idea of steering vectors.

Here's the discussion: #1460

Here's an old implementation (which works, but it's CPU-only): #1472

I still think there is a lot of value to be found here.

@leegao

leegao commented Jan 22, 2024

I believe a key difference between BHC and the idea behind #1460 is in this idea:

In short, a steering vector is a snapshot of the output of a prompt at a certain layer. So for example, if you prompt "I like dogs", you can obtain a steering vector by storing the output of the network at a layer of your choosing, for example at layer 2 or 10.

As I understand BHC, the steering vector is recomputed at every layer instead of using a snapshotted steering vector calculated ahead of time.

For example, if you have the prompt "I hate you because" and add a pair of steering prompts of +"I like dogs" and -"Sound like a robot", BHC would perform inference on all 3 prompts in parallel ("I hate you because", + "I like dogs", - "Sound like a robot") through the layers of the decoder.

At each layer, BHC will calculate the steering as the mean of the 3 prompts (negating the negative prompts), and it will apply the (weighted) steering vector to ALL 3 prompts (but treating the negative prompt as a positive during steering). It will then do this layer by layer. Finally, during decoding, BHC will only decode the first prompt.

See https://github.com/SoylentMithril/BrainHackingChip/blob/main/chips/default/chip_settings.py#L21 - the tensor here is a P x N matrix where P is the number of prompts (including positive and negative steering prompts), with the first prompt being the "main" prompt that will be decoded, and N is the embedding dimension.

The author mentions that applying steering to all of the prompts is necessary to avoid having the steering vector accumulate over time. I believe the idea here is that, as long as there's at least one negative prompt, this will effectively "dampen" the steering vector each time we go up a layer (since the negative prompt is also steered down at each layer). In the regime of only positive prompts, however, it's probably wise to find some other accumulation technique instead (otherwise we have the opposite problem of accumulating and accelerating the positive steering).
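
To make the layer-by-layer update concrete, here is a minimal sketch of the computation as described above (not BHC's actual code, which lives in the linked Python repo); the function name, the flat row-major layout, and the weight parameter are illustrative assumptions:

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch of one per-layer steering step, as described above.
// activations: P x N hidden states at the current layer, row-major
//              (row 0 = the main prompt that will be decoded, the rest are steering prompts)
// signs:       +1 for the main/positive prompts, -1 for the negative prompts
// weight:      steering strength for this layer (hypothetical knob)
static void apply_layer_steering(std::vector<float> & activations,
                                 const std::vector<int> & signs,
                                 size_t n_prompts, size_t n_embd, float weight) {
    // steering vector = mean over prompts of sign * activation
    std::vector<float> steering(n_embd, 0.0f);
    for (size_t p = 0; p < n_prompts; ++p) {
        for (size_t i = 0; i < n_embd; ++i) {
            steering[i] += (float) signs[p] * activations[p * n_embd + i];
        }
    }
    for (size_t i = 0; i < n_embd; ++i) {
        steering[i] /= (float) n_prompts;
    }
    // apply the (weighted) steering vector to ALL rows, including the negative prompts,
    // which is what keeps the vector from accumulating as we go up the layers
    for (size_t p = 0; p < n_prompts; ++p) {
        for (size_t i = 0; i < n_embd; ++i) {
            activations[p * n_embd + i] += weight * steering[i];
        }
    }
}
```

Something like this would run once per decoder layer on the current hidden states; only row 0 is ultimately decoded.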

Having this dynamic steering calculated layer by layer (and dampening the steering prompts layer by layer) is a very interesting (and not at all obvious) thing to do, and seems to be novel as far as I can tell.

@Azeirah
Contributor

Azeirah commented Jan 22, 2024

I can't comment on the inner workings of inference at different layers.

What I can say is that it would be nice to have some sort of hooks-based API where you get access to the vectors "right before" and "right after" running inference, so you can edit them in any way you want. I believe the transformers library already has these kinds of hooks.
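
Purely as an illustration of that idea (none of this exists in llama.cpp today; every name here is invented for the sketch), such a hooks API could look roughly like this:

```cpp
#include <cstdint>

// Hypothetical hook interface, invented for illustration only -- not part of llama.cpp.
// The callbacks receive a layer's activations and may modify them in place.
struct llama_layer_hook {
    // called with the layer's input activations, "right before" the layer runs
    void (*pre )(int layer, float * data, int64_t n_tokens, int64_t n_embd, void * user_data);
    // called with the layer's output activations, "right after" the layer runs
    void (*post)(int layer, float * data, int64_t n_tokens, int64_t n_embd, void * user_data);
    void * user_data;
};
```

For reference, the hooks mentioned above are exposed in PyTorch (which transformers builds on) via register_forward_pre_hook and register_forward_hook on modules.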

I think the ideas around steering and direct vector manipulation during inference are still evolving, and approaches will keep changing, so I think focusing on flexibility and making it easy to just try different things out would be best for everyone interested in these ideas.

@leegao

leegao commented Jan 22, 2024

That's absolutely true. Since https://www.lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector and https://arxiv.org/pdf/2308.10248.pdf were published, a small but growing community has formed around Activation Engineering, though it is still mostly driven by the alignment folks (it did start on LessWrong, after all).

I totally agree, having flexibility in how to hook the various activations (and snapshot activations) during inference could really help lower the barrier to entry for trying out and advancing these concepts.

@slaren
Member

slaren commented Jan 23, 2024

Since #4935 it is possible to hook into the activations. While it was not the purpose of this change, in principle it is also possible to modify the activations in the callback.

@peerschuett

Thanks for your comment, @slaren! Could you maybe elaborate on how I can hook into the activations, i.e. which function should I use for it? I am trying to work on steering the LLM (as mentioned here: #5119), and hooking into the activations would be the important step, but I am currently a bit lost in the codebase.

@slaren
Member

slaren commented Feb 6, 2024

You would have to set cb_eval in llama_context_params to your callback. This callback will be called for each activation; you would have to check for the activations that you are interested in, for example by looking at the name or the operation of the tensor, and modify the tensor data in any way you want. You can use ggml_backend_tensor_get to obtain the tensor data and ggml_backend_tensor_set to change it.
Note that while this will work well as a proof of concept, it will be very slow. To do this efficiently, you would need to modify the computation graphs in llama.cpp directly.
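
A minimal sketch of such a callback, based on the description above; the tensor-name filter ("l_out") and the F32 check are assumptions and should be adapted to whichever activations you actually want to modify:

```cpp
#include <cstring>
#include <vector>

#include "ggml.h"
#include "ggml-backend.h"

// Called by the scheduler for graph nodes: first with ask == true to decide whether we
// want to observe a tensor, then with ask == false once its data is available.
static bool steer_cb(struct ggml_tensor * t, bool ask, void * user_data) {
    (void) user_data;
    // pick the activations of interest by name (the exact names depend on the llama.cpp graph)
    const bool interesting = strncmp(t->name, "l_out", 5) == 0;
    if (ask) {
        return interesting; // request this tensor's data on the follow-up call
    }
    if (interesting && t->type == GGML_TYPE_F32) {
        std::vector<float> data(ggml_nelements(t));
        ggml_backend_tensor_get(t, data.data(), 0, ggml_nbytes(t)); // copy to host
        // ... modify data here, e.g. add a steering vector ...
        ggml_backend_tensor_set(t, data.data(), 0, ggml_nbytes(t)); // write back
    }
    return true; // returning false would abort the graph computation
}

// usage: set the callback in the context params before creating the context
//   llama_context_params cparams = llama_context_default_params();
//   cparams.cb_eval           = steer_cb;
//   cparams.cb_eval_user_data = nullptr;
```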

@github-actions
Contributor

github-actions bot commented Mar 18, 2024

This issue is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale label Mar 18, 2024
@github-actions
Contributor

github-actions bot commented Apr 4, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Apr 4, 2024