Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mitigate prompt injection attacks by supporting "safe" encoding (encoding without special tokens) #1347

Closed
bilelomrani1 opened this issue Sep 24, 2023 · 11 comments · Fixed by #1437

Comments

@bilelomrani1
Copy link

bilelomrani1 commented Sep 24, 2023

There may already exist a way of accomplishing what I'm going to describe but I didn't find it by reading the documentation.

In certain applications, we should be careful about how special tokens are encoded as they can be used to trigger special capabilities in models, or give them special positional clues (system prompt, etc.). Hence, when serving a model to end users, we need to prevent injection attacks, in which the user sends the representation of a special token as plain text (eg. <SYSTEM>), and the tokenizer interprets this text as a special token. In this regard, OpenAI's tiktoken tokenizer has a very safe default of raising an exception if it encounters text that corresponds to a special token, see the corresponding docstring. This effectively forces the developer to be very intentional about how special tokens should be handled, thus preventing injection attacks.

Such a default behavior would break existing code, an alternative would be to have a .safe_encode method that would throw an exception if it encounters text that corresponds to a special token, mirroring what tiktoken is doing, and allowing/disallowing special tokens using a whitelist/blacklist argument. Disallowed special tokens should be treated as plain text and NOT as representation of special tokens, ie. <SYSTEM> should be tokenized as ["<", "SYSTEM", ">"] or otherwise depending on the vocabulary, but most importantly it should NOT be interpreted as the <SYSTEM> token unless explicitly enabled by the developer.

Is there an existing way of mirroring tiktoken behavior, and if not, would such a feature be useful to the library?

@ArthurZucker
Copy link
Collaborator

Hey! This is planned! The equivalent was merged for transformers in this PR. The changes for rust are a little bit more advanced, but definitely on my todo!
(if anyone wants to start and have this faster, feel free to open a PR and ping me for a review!)

@bilelomrani1
Copy link
Author

Great news, thank you for the update @ArthurZucker, I would have loved to help but I'm unfortunately not very knowledgeable in Rust. I'll gladly follow the topic and test the feature when it comes out!

@Narsil
Copy link
Collaborator

Narsil commented Sep 26, 2023

of raising an exception if it encounters text that corresponds to a special token

I feel like this is not a sane default. It has merits in certain context, possibly, but I wouldn't call that safe by any means. Prompt injection is by far not limited to injecting special tokens. Basically, any form of text can escape already. This is feels like a very weak form of safety, and defeats the purpose of having a very flexible input ground (where users can create arbitrary complex prompts, like for chat, without having to handle any special new API in this lib).

We can definitely add it, it should be quite easy, since we should only be skipping the added_vocabulary step I think (depends if the special tokens are also in the core vocab, this might vary from tokenizer to tokenizer).

@bilelomrani1
Copy link
Author

bilelomrani1 commented Sep 26, 2023

Hi @Narsil, I'm not sure I'm understanding your point, I also think that injecting special tokens is not the only way to do prompt injection, but it's at least one failure mode that should be addressable, the proposal was not meant to solve the entire issue.

As for the safety part of it, I see your point, I think the rationale behind having this default is that there is an inherent ambiguity with respect to how special tokens should be handled. Without an explicit intent from the developer, no sane default can be inferred because there are situations in which one way to handle special tokens is desired but not the other. In these sort of cases, I tend to think that throwing an exception is a sane default, at least a better one than silently making a wrong assumption.

I'm not particularly attached to the idea of throwing an exception though, having a warning, requiring a mandatory keyword argument or even just explicitly documenting the default behavior and providing an alternative behavior also accomplish roughly the same goal.

@imoneoi
Copy link

imoneoi commented Oct 1, 2023

@ArthurZucker Great idea. I also like "split_special_tokens" to handle special tokens huggingface/transformers#26468.

@imoneoi
Copy link

imoneoi commented Nov 10, 2023

Any updates?

Copy link

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Dec 11, 2023
@bilelomrani1
Copy link
Author

The issue is still relevant

@github-actions github-actions bot removed the Stale label Dec 12, 2023
@ArthurZucker
Copy link
Collaborator

Yep sorry, I'll finally have time to pick it up!

@bilelomrani1
Copy link
Author

Hi @ArthurZucker, great news, no need to be sorry, keep up the amazing work 🚀 If you need help during testing don't hesitate to reach out!

@ArthurZucker
Copy link
Collaborator

Sorry for the delay have a lot on my plate but prioritizing a release next week! Including this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants