Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strip trailing whitespace from prompt file #80

Conversation

matthew-mcallister
Copy link

Many/most text editors save with trailing whitespace.

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
Many/most text editors save with trailing whitespace.
@bengarney
Copy link
Contributor

Bystander question - does the trailing whitespace additional introduce tokens?

@j-f1
Copy link
Collaborator

j-f1 commented Mar 13, 2023

Maybe just remove a trailing \n and keep any spaces as-is?

@matthew-mcallister
Copy link
Author

matthew-mcallister commented Mar 13, 2023

@bengarney New lines do add tokens

@j-f1 Spaces do not add tokens

The one limitation I see here is that you cannot intentionally add trailing newlines at the end of a file (to introduce a new paragraph). I could solve that by only stripping a single trailing '\n' or '\r\n' (the one usually added by the editor) and leaving any additional whitespace untouched.

@prusnak
Copy link
Collaborator

prusnak commented Mar 13, 2023

I don't think this is a good change. Machine Learning models should not teach people how to use text editors.

What if I want to keep whitespace in a prompt? This change makes it impossible.

And it's trivial to create a text file with no line endings at the end: echo foo > file.txt

@leszekhanusz
Copy link

Usually single spaces do not add tokens because the space is inside a lot of tokens already.
If you have a few more spaces, then it will take only one token because you have different tokens for 3,4, 5 spaces.
But if you add a lot of spaces, it will add tokens.

If I add a lot of spaces after "Building", with make -j && ./main -m ./models/13B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -t 8 -n 512

I have:


  1 -> ''
  8893 -> 'Build'
   292 -> 'ing'
   462 -> '                '
   462 -> '                '
   462 -> '                '
   462 -> '                '
   462 -> '                '
   462 -> '                '
   462 -> '                '
  9651 -> '           '
  29874 -> 'a'
  4700 -> ' website'
...

@matthew-mcallister
Copy link
Author

matthew-mcallister commented Mar 13, 2023

I don't think this is a good change. Machine Learning models should not teach people how to use text editors.

What if I want to keep whitespace in a prompt? This change makes it impossible.

And it's trivial to create a text file with no line endings at the end: echo foo > file.txt

OK, I almost feel like you're being smart here. Once a prompt is long enough (several sentences or multiple lines), I'm not going to want to edit it straight on the CLI just to avoid the unintended line break. This isn't a polished consumer product, but a tiny quality of life feature seems justifiable.

@leszekhanusz Good point, hadn't thought of that (some tokenizers truly do ignore whitespace). Though trailing whitespace at the end of a text file is still usually not intended.

Maybe a --preserve-whitespace flag is the right call here? Or better yet, just allow taking input from stdin so that if you don't want the whitespace stripped you can just do cat blah.txt | main -f -. Having an unwanted newline added every single time you use -f makes the flag a lot less useful in my eyes.

@ggerganov
Copy link
Member

@bengarney New lines do add tokens

@j-f1 Spaces do not add tokens

The one limitation I see here is that you cannot intentionally add trailing newlines at the end of a file (to introduce a new paragraph). I could solve that by only stripping a single trailing '\n' or '\r\n' (the one usually added by the editor) and leaving any additional whitespace untouched.

I decided to drop the trailing new line from file prompts: 70f01cb

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants