
Quantitative measurement of model perplexity for different models and model quantization modes #129

Closed
noughtmare opened this issue Mar 14, 2023 · 53 comments
Labels: generation quality, model

Comments

@noughtmare

noughtmare commented Mar 14, 2023

llama.cpp seems to give bad results compared to Facebook's implementation.

Here's an example simple reading comprehension prompt:

Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?" Answer: "Paul bought a book

LLaMA 7B with Facebook's implementation yields:

Seed 1:

Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?" Answer: "Paul bought a book with pictures."
Asked by lone wolf 1788 days ago.

Seed 2 (to show that the above is not just a fluke):

Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?" Answer: "Paul bought a book with pictures."
Question: "Tom, Mark, and Paul bought books: two with pictures and

While llama.cpp without quantization (so still float16) generates (with --seed 0 -t 8):

Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?" Answer: "Paul bought a book that is neither as good nor bad."
This solution breaks down the problem into its simple parts; then using those components we can see what each component means by itself, in order to solve this logic puzzle. 1) Tom and Mark had different kinds of books...and so did Paul! (Therefore one out three were without pictures). ... [end of text]

It even has a grammatical error at the end: "one out [of] three"

As you can see, the quality of 7B is higher in Facebook's implementation. So I think you may still have bugs in your implementation, or the default parameters could be improved.

@noughtmare changed the title from "Bad output" to "Low quality output compared with the Python implementation" Mar 14, 2023
@noughtmare
Author

With quantization the result is also bad:

Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?" Answer: "Paul bought a book WITHOUT PICTURES."
This is just an example question; I can't figure out how to post actual questions that are being asked in my classroom! So, here we go . . . We recently had our first day back after Thanksgiving break. One of the things

@gjmulder
Collaborator

You might not be comparing apples to apples; e.g., are the --top_p and other parameters identical between implementations?

@noughtmare
Author

noughtmare commented Mar 14, 2023

I'm using the default settings, so for the Python code it is:

    temperature: float = 0.8,
    top_p: float = 0.95,

And for llama.cpp:

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000

So I think only the repeat penalty and top_k could be different?
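
For anyone comparing the two, here is a rough sketch of what these sampling knobs do, written as toy Python over a small logit vector. It is not the actual code of either implementation (the order and details differ between llama.cpp and the Python code), just an illustration of temperature, top_k, and top_p:

import numpy as np

def sample(logits, temp=0.8, top_k=40, top_p=0.95, rng=np.random.default_rng(0)):
    """Illustrative temperature + top-k + top-p (nucleus) sampling."""
    logits = np.asarray(logits, dtype=np.float64) / temp       # temperature scaling
    kept = np.argsort(logits)[-min(top_k, logits.size):]       # indices of the top_k largest logits
    probs = np.exp(logits[kept] - logits[kept].max())
    probs /= probs.sum()                                       # softmax over the kept tokens
    order = np.argsort(probs)[::-1]                            # most probable first
    cum = np.cumsum(probs[order])
    nucleus = order[: np.searchsorted(cum, top_p) + 1]         # smallest prefix covering top_p mass
    p = probs[nucleus] / probs[nucleus].sum()
    pick = nucleus[rng.choice(nucleus.size, p=p)]
    return int(kept[pick])                                     # token id in the original vocabulary

print(sample([2.0, 1.5, 0.3, -1.0, 0.9, 0.1]))                 # toy 6-token vocabulary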

@noughtmare
Author

noughtmare commented Mar 14, 2023

If I disable the repeat penalty (I assume --repeat_penalty 0 does that) then I still get low quality results:

Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?" Answer: "Paul bought a book about cars (p1). He chose to read and look "up Hepa", with only p3 one "out did it one time with his different". The time look only only What look and time about a a it about Answer and Mark What?out his one pout the car it. Tom' his bought

repeat_penalty 1 gives me a more sensible but still wrong result:

Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?" Answer: "Paul bought a book without pictures. Paul had a different kind of book than Tom and Mark."
Question: "Jeffrey, Jennifer, and Jackson bought books: three with pictures and one without. Jackson and Jennifer had different kinds of books. What kind did Jeffrey buy?" Answer: "Jeffrey bought a book

(These results are from the quantized 7B model with seed 0 and default parameters except for the repeat penalty)

@gjmulder
Collaborator

top_k looks to be currently broken, as I recently reported in issue #56. I just now realised that, due to #95, identical seeds across implementations are unlikely to produce identical results, as per @ggerganov's correction to my comment in that issue.

It does then look like llama.cpp is of lower quality. You've tried other prompts and got similar results?

@noughtmare
Author

I haven't tested the Python implementation extensively, because Facebook's implementation takes a very long time to run on my CPU. But I generally feel that running 7B and even 13B with llama.cpp gives results that are below the quality that Facebook has claimed.

@beiller
Contributor

beiller commented Mar 14, 2023

Try the following parameters; they give me good quality output:

--temp 0.7 --top_k 40 --top_p 0.5 --repeat_last_n 256 --repeat_penalty 1.17647

Also, repeat_penalty = 1.0 means disabled. Maybe it's not named as it should be 😇

@Urammar

Urammar commented Mar 14, 2023

If 1 means disable, what's the point of values higher than 1? Also, it's good to let it repeat itself a little; sometimes that makes sense in conversation, but a tighter setting lets it break loops before they begin.

@noughtmare
Author

noughtmare commented Mar 14, 2023

Try the following parameters

Still gives me a wrong result with the quantized model:

Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?" Answer: "Paul bought a book with no pictures."
Answer to Question 1739: "The three students were going on an outing. They needed shoes for the trip. Each student owned a pair of shoes that was not his own. Which student wore tennis shoes? (Hint: The answer is in the question.)

With the fp16 model it is also wrong:

Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?" Answer: "Paul bought a book that was not in the picture."
"Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?" Answer: "Paul bought a book that was not in the picture." [end of text]

I think the problem is more fundamental than just a change of the parameters.

@gjmulder
Collaborator

gjmulder commented Mar 14, 2023

I haven't tested the Python implementation extensively, because Facebook's implementation takes very long to run on my CPU. But I generally feel that running 7B and even 13B with llama.cpp gives results that are below the quality that Facebook has claimed.

It may be simply a case of the project management triangle, i.e. choose any two of:

  1. Performance
  2. Quality
  3. Self-hosting

@noughtmare
Author

That might be so, but I don't see an obvious reason why the quality would be lower. Quantization could have been a logical cause, but I think I have shown that even the fp16 model has a lower quality.

@Urammar

Urammar commented Mar 14, 2023

If it's simply a straight-up C++ implementation then it should be the same, but an install step in the GitHub README states it must be quantized, which means even if you are running it in fp16 it's still been crunched in precision to run better, which naturally means its outputs will slightly differ.

You wouldn't expect a mile-long road at 18.2 degrees to end up at the same place as one rebuilt at 18.0 degrees, right?

As you just said while I was typing this, quantization made its brain just that little bit more crispy, and that clearly affects it slightly. That's probably not solvable.

@noughtmare
Author

noughtmare commented Mar 14, 2023

but an install step in the GitHub README states it must be quantized,

I don't think that step is required. The model runs fine without the quantization step. And the README also claims llama.cpp has "Mixed F16/F32 precision". Edit: there's an example of running without quantization here: #2 (comment)

@beiller
Contributor

beiller commented Mar 14, 2023

@Urammar a value higher than 1 starts to penalize the predicted next token if it occurred in the previous N tokens. It will multiply the likelihood by 1/penalty.
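
A minimal sketch of that idea in Python (illustrative only, not the actual llama.cpp code; multiplying the softmax likelihood by 1/penalty is the same as subtracting log(penalty) from the logit, so penalty = 1.0 leaves the distribution unchanged):

import math

def apply_repeat_penalty(logits, last_tokens, penalty=1.3):
    """Push down the logits of tokens seen in the last N positions."""
    penalized = list(logits)
    for tok in set(last_tokens):
        penalized[tok] -= math.log(penalty)   # likelihood *= 1/penalty
    return penalized

# Token 2 appeared recently, so its logit is reduced; penalty=1.0 is a no-op.
print(apply_repeat_penalty([1.0, 0.5, 2.0], last_tokens=[2], penalty=1.3))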

@beiller
Contributor

beiller commented Mar 14, 2023

Try like so:

./main -m ./models/13B/ggml-model-q4_0.bin -t 4 --temp 0.7 --top_k 40 --top_p 0.5 --repeat_last_n 256 --repeat_penalty 1.17647 -p $'Question: "Question: "There are two ducks in front of a duck, two ducks behind a duck and a duck in the middle. How many ducks are there?" Answer: Three. Two ducks are in front of the last duck; the first duck has two ducks behind; one duck is between the other two.\n\nQuestion: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?" Answer: "Paul bought a book'

Question: "Question: "There are two ducks in front of a duck, two ducks behind a duck and a duck in the middle. How many ducks are there?" Answer: Three. Two ducks are in front of the last duck; the first duck has two ducks behind; one duck is between the other two.

Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?" Answer: "Paul bought a book that was not illustrated."

EDIT: Love how my brain failed at interpreting this; let me try a larger model.

@beiller
Contributor

beiller commented Mar 14, 2023

For me it consistently answers incorrectly every time

Question: "Question: "There are two ducks in front of a duck, two ducks behind a duck and a duck in the middle. How many ducks are there?" Answer: Three. Two ducks are in front of the last duck; the first duck has two ducks behind; one duck is between the other two.

Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?" Answer: "Paul bought a book without pictures."

@noughtmare
Author

noughtmare commented Mar 14, 2023

Haha, that question about ducks is also interesting. Using this prompt:

Question: "There are two ducks in front of a duck, two ducks behind a duck and a duck in the middle. How many ducks are there?" Answer: "There are

The Python implementation outputs a plausible answer:

Question: "There are two ducks in front of a duck, two ducks behind a duck and a duck in the middle. How many ducks are there?" Answer: "There are seven ducks." The answer is correct, but it is not obvious. Try to explain your answer

But llama.cpp 7B FP16 outputs garbage:

Question: "There are two ducks in front of a duck, two ducks behind a duck and a duck in the middle. How many ducks are there?" Answer: "There are three duc...
I'm sorry for your loss, but I think it is fair to say that you have moved on from this traumatic event by now. [end of text]

@beiller
Contributor

beiller commented Mar 14, 2023

I get consistently non-garbage output. Can you try using the settings I had above? I am on a different branch. Wonder if that has anything to do with it.

@beiller
Contributor

beiller commented Mar 14, 2023

I mean it even explains itself

Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?" Answer: "Paul bought a book that has no pictures."
The correct answer is A because the question says they all bought books but it doesn't say which ones so B isn't right becuase you don't know if tom or mark got the same thing as paul. C can be eliminated becuase there are only three choices left to choose from. D can also be eliminated becuase again you have to chose between 3 things not four. So your left with A and E becuase those are the only two options left. [end of text]

I'm not sure if this is the best way to objectively tell the quality of the output :)

EDIT cmd line params:

./main -m ./models/7B/ggml-model-q4_0.bin -t 4 --temp 0.7 --top_k 40 --top_p 0.5 --repeat_last_n 256 --repeat_penalty 1.17647 -n 1024 -p $'Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?" Answer: "Paul bought a book'

@noughtmare
Author

noughtmare commented Mar 14, 2023

Actually, after playing around a bit with the quantized model, I now believe that the problem is only in running the FP16 model. The quantized model seems to work much better for me.

@noughtmare changed the title from "Low quality output compared with the Python implementation" to "Low quality output with FP16" Mar 14, 2023
@sburnicki

Thanks for sharing your parameters, guys; I definitely get better results than with the default ones.

I ran the same prompt with 5 different models: 7B/Q4, 7B/F16, 13B/Q4, 13B/F16, 30B/Q4:

./main -m ./models/<model>.bin -t 8 --temp 0.7 --top_k 40 --top_p 0.5 --repeat_last_n 256 --repeat_penalty 1.17647 -n 1024 -p 'Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?"'

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 

See the results below.

I ran each one multiple times; the results of the individual runs with one model are comparable in quality. I don't see a major quality difference between Q4 and F16. Interestingly, 13B gave me the weirdest results. It also always attempted to return some kind of LaTeX code; I could observe this with other prompts as well.

Results

7B / Q4

Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?"
Answer: "Paul bought the book that didn't have a picture."

7B / F16

Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?"
Answer: A book that was not a picture book. [end of text]

13B / Q4

Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?"

I have no idea how to solve this problem. I've tried it a few ways but can't seem to get the right answer.

Comment: You need to add more information about what you know or don't know about these people/books etc...

Answer: \begin{blockquote}

\strong{Hint:} There are three possibilities for each book type (with picture and without).  So there are $3^2$ possible combinations for all six books combined.  The number of permutations is $\frac{(6!)}{(3!)^2}=15$.
\end{blockquote} [end of text]

13B / F16

Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?"

\begin{blockquote}

Tom, Mark, and Paul bought books: two with pictures and one without.  Tom and Mark had different kinds of books.  What kind did Paul buy?
\end{blockquote}

I'm not sure how to approach this problem. I know that there are three possible answers (pictures or no picture) but the question doesn't specify which is correct so it could be any combination. Any help would be appreciated! Thanks in advance :)

Comment: This isn't a maths question - it belongs on [Puzzling](http://puzzling.stackexchange.com/).

Answer: The answer is that \strong{Paul has the same book as Tom}. If you assume that they all have different types of books then we can deduce from the information given that at least one person must have a book with pictures because otherwise only one type of book exists. So either Tom or Mark buys such a book. But if both do then they cannot have different types of books. Therefore Paul also has a book with pictures. [end of text]

30B / Q4

Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?"
The answer is given as 2, but I don't understand why it isn't 1?
If we know that the book without a picture was purchased by Paul, then wouldn't we also have to assume that he didn't purchase any other books (because if he did, then there would be more than three total)? And if we can make this assumption, doesn't that mean that Tom and Mark must have both purchased books with pictures in them?
I think you are right. The question is poorly worded. It should say something like "Mark and Paul each bought only one book." Then it would make sense. [end of text]

@noughtmare
Author

noughtmare commented Mar 15, 2023

@sburnicki I think it is better to include the trailing Answer: "Paul bought a book in your prompt, to avoid the cases where the model doesn't give a straight answer, as in almost all your results.

Also, in hindsight I think I should have worded it slightly differently:

Question: "Tom, Mark, and Paul each bought a book. Together they bought two with pictures and one without. Tom and Mark bought different kinds of books. What kind did Paul buy?" Answer: "Paul bought a book

@hoangmit
Contributor

hoangmit commented Mar 15, 2023

One bug I found is #173: llama.cpp seems to use a different norm method.

@ggerganov
Owner

So I think the quality of the generations is difficult to evaluate. We need a more quantitative metric.
I think we will put more effort into that once we have the Q4_1 quantization ready.

@gjmulder added the "model" and "generation quality" labels Mar 15, 2023
@bakkot
Contributor

bakkot commented Mar 15, 2023

We need a more quantitative metric.

This blog post did a quantitative measurement of quality for comparison across different quantization methods, though I don't know how well it corresponds to subjective quality. Code is here, though I'm not sure if it includes the quality measurement code. There is also this project, which does include measurement code.

I expect one of these could serve as a starting point for building an objective metric for quality.
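
For what it's worth, perplexity over a held-out text is just exp of the average negative log-likelihood of each token given its preceding context, so it can be scripted fairly easily. A rough sketch, assuming a Hugging Face transformers-style causal LM (the checkpoint name and test file below are placeholders, and this is not how llama.cpp would compute it internally):

import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "some-causal-lm"                       # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

def perplexity(text, window=512):
    ids = tok(text, return_tensors="pt").input_ids[0]
    nll, count = 0.0, 0
    for start in range(0, len(ids) - 1, window):
        chunk = ids[start : start + window + 1].unsqueeze(0)   # +1 so every target has context
        if chunk.size(1) < 2:
            break
        with torch.no_grad():
            loss = model(chunk, labels=chunk).loss             # mean NLL per predicted token
        n = chunk.size(1) - 1
        nll += loss.item() * n
        count += n
    return math.exp(nll / count)

print(perplexity(open("wiki.test.raw").read()))                # e.g. a wikitext-style test file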

@noughtmare
Author

noughtmare commented Mar 16, 2023

Here's more evidence for the low quality. Prompt:

Here's a Python program that computes fibonacci numbers:
def fib(n):
  

(The last line has two spaces)

Python LLaMA 7B

seed 1:

Here's a Python program that computes fibonacci numbers:

def fib(n):
   if n == 1:
      return 1
   elif n == 2:
      return 1
   else:
      return fib(n - 1) + fib

seed 2:

Here's a Python program that computes fibonacci numbers:

def fib(n):
   if n==0:
      return 0
   elif n==1:
      return 1
   else:
      return fib(n-1) + fib(n-

seed 3:

Here's a Python program that computes fibonacci numbers:

def fib(n):
   if n == 0 or n == 1:
       return n
   else:
       return fib(n-1) + fib(n-2)
print(fib

llama.cpp 7B F16

seed 1:

Here's a Python program that computes fibonacci numbers:

def fib(n):
   x=0 #initialization
   for i in range(1,n+1):
       y=x+(x*i) % n  #modulo operation with remainder
       x=y
   return x    #return the answer.

print("The Fibonacci number is

seed 2:

Here's a Python program that computes fibonacci numbers:

def fib(n):
   if n <= 1:
      return 1
   else:
      return fib(n-1) + fib(n-2)

for i in xrange(200,100000):
   print "Fib(%i): %i"% (

(This one is pretty good)

seed 3:

Here's a Python program that computes fibonacci numbers:

def fib(n):
   if n == 0 or n == 1:
      return 0, None
   else:
      # we use 'or' instead of 'and' to be sure the loop doesn't exit too soon
      for i in range (2, n+1):
         a,b

llama.cpp 7B Q4_0

seed 1:

Here's a Python program that computes fibonacci numbers:

def fib(n):
   result = 1
   for i in range(2, n+1):
       if i%2 == 0:
          result += i/2
       else:
          result += i*fib(i-1)
    return result
# for example, the answer to fib

seed 2:

Here's a Python program that computes fibonacci numbers:

def fib(n):
   if n < 2: return 1
   elif n == 2: return 0
   else: return fib(n-1) + fib(n-2)
def fibonacci(n):
   if n < 3: return 0
   elif n < 6

(This one would be correct if not for that elif n == 2: return 0)

seed 3:

Here's a Python program that computes fibonacci numbers:

def fib(n):
   if n == 0 or n == 1: return 1
return n - (fib(n-1) + fib(n-2))
I was reading through the code and wondering what the point of fib(0) was. Is it just to prevent an index out of bounds exception when

Results:

  • Python: 3/3
  • F16: 1/3
  • Q4_0: 0/3

@beiller
Contributor

beiller commented Mar 19, 2023

Can we rule out the tokenizer? I can't test at the moment, but there is another issue claiming the tokenization output differs between implementations.

@noughtmare
Author

noughtmare commented Mar 20, 2023

Can you confirm that is the case for you as well?

I get these. --seed 1:

Here's a Python program that computes fibonacci numbers:

def fib(n):
   res=0;
   for i in range(2,n+2):
       res=res+i;
   return res

--seed 2:

Here's a Python program that computes fibonacci numbers:

def fib(n):
   if n==0:
      return 0
   if n==1:
      return 1
   return fib(n-1) + fib(n-2)

--seed 3:

Here's a Python program that computes fibonacci numbers:

def fib(n):
   if n == 0:
      return 0
   elif n == 1:
      return 1
   else:
      return fib(n-1) + fib(n-2)

--seed 4:

Here's a Python program that computes fibonacci numbers:

def fib(n):
   result = [0, 1]
   for i in range(1, n):
      result[0] = result[1] + result[0]
      result[1] = result[0]
   return result

--seed 5:

Here's a Python program that computes fibonacci numbers:

def fib(n):
   if n == 0:
      print 'fib(0) = 0'
      return 0
   else:
   print 'fib(1) = 1'
   print 'fib(2) = 1'
   print 'fib(3) = 2'
   print 'fib(4) = 3'
   n = n-1
   return fib(n-1)+fib(n-2)

So only 2/5 are correct programs. I've also run the first 20 seeds, of which it got 10/20 (50%) correct.

I've just rerun this prompt on the Python implementation and it got 14/20 seeds (70%) correct.
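
For reference, counts like these can be collected with a small loop over seeds; a rough sketch below (the model path is whatever you converted, the flags are the ones used earlier in this thread, and judging each generated program correct or not is still done by hand):

import subprocess

PROMPT = "Here's a Python program that computes fibonacci numbers:\ndef fib(n):\n  "
MODEL = "./models/7B/ggml-model-f16.bin"   # adjust to your setup

outputs = []
for seed in range(1, 21):
    run = subprocess.run(
        ["./main", "-m", MODEL, "--seed", str(seed), "--repeat_penalty", "1", "-p", PROMPT],
        capture_output=True, text=True,
    )
    outputs.append((seed, run.stdout))

for seed, text in outputs:                 # inspect (or execute) each program by hand
    print(f"=== seed {seed} ===\n{text}")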

@noughtmare
Author

noughtmare commented Mar 20, 2023

@ggerganov I actually get better results with the default parameters:

% ./main -m ../LLaMA/7B/ggml-model-f16.bin -p "Here's a Python program that computes fibonacci numbers:
def fib(n):
  " --repeat_penalty 1

That produces correct fibonacci programs for 13/20 seeds.

For good measure the Q4_0 with default parameters and --repeat_penalty 1 produces 12/20 correct programs.

@Green-Sky
Collaborator

@noughtmare did you try --repeat_penalty 1.176 (~ 1.0/0.85)? I have seen that value multiple times and have been using it myself; it seems to be the best for conversations.

@noughtmare
Author

noughtmare commented Mar 20, 2023

@Green-Sky if I use --repeat_penalty 1.176 the Q4_0 model generates garbage code every time. 0/20 correct.

I suspect it doesn't want to do indentation because that is repeated on every line and it doesn't want to do recursive calls because that is also a repetition of the function name.

Here's the full output:
 Here's a Python program that computes fibonacci numbers:
def fib(n):
   x, y = 0, 1
   while True:
    if n >=2 then yield (x + y) % n else x += y
fib_iterable([1]) # [0, 1]
fib_iterable([2]).next()

 Here's a Python program that computes fibonacci numbers:
def fib(n):
   if (not n) or (n == 10**7-2): return 99 # stop condition for the recursive algorithm.
   else:     yield fib_recursive(n -1);             # recurse in half and use previous result
               fiyield

 Here's a Python program that computes fibonacci numbers:
def fib(n):
   if n == 0 or n ==1 : return 1
    result = fn(n-2) + fn (n-3);
return int(result);

 Here's a Python program that computes fibonacci numbers:
def fib(n):
   return n+fib(n-1) if (n > 2) else 0
print ("Fibonacci numbers:",str(fib(5)))

 Here's a Python program that computes fibonacci numbers:
def fib(n):
   if n == 0 or n == 1 : return n-1 # base cases.
    else:
       previus = previuos+next # preivous and next value of Fibonnaci sequence.
       next_value = (previous * previous

 Here's a Python program that computes fibonacci numbers:
def fib(n):
   print "fib 10 =", (55-4) # this is an exercise, not a real code
The function takes n as the number of fibonnaci sequences to calculate. It uses the first definition for all values below ten and then starts using the second one above that

 Here's a Python program that computes fibonacci numbers:
def fib(n):
   if n < 2 : return 1, None
   first = (fib(n-2) + fib(n-3)) % n == 0 ? (fib(n-2),None) else None
   second = fib(n-1); first and

 Here's a Python program that computes fibonacci numbers:
def fib(n):
   if n == 0 or n == 1: return n else:
      c = 0
       d = 1
    for i in range (2,n+1) :
        c +=d
        d=c
     return c
fib(6

 Here's a Python program that computes fibonacci numbers:
def fib(n):
   if n == 0 or n ==1 : return 0
else:return (fib(n-1)+fib(n-2)) # recursive definition
print(fib(358))
Here's my attempt to translate it into Scala. I have two

 Here's a Python program that computes fibonacci numbers:
def fib(n):
   if n == 0 or n ==1 : return 1,2 # special cases.
   return fib(n-1) + fib(n-2), # recurrence relation.
print "fib(%i)=%i,%i" % (n, fib(n

 Here's a Python program that computes fibonacci numbers:
def fib(n):
   if n == 0 or n==1: return n*=2 else:\ngood = false;
   good = true; while (good and n!=fib(n-1)):
      f = floor((double) pow(4,fibvals)/pow(

 Here's a Python program that computes fibonacci numbers:
def fib(n):
   if n == 0 or n == 1 :
      return n
elif (not isinstance(n, int)):
    raise TypeError('Number must be integer')
return sum([fib(i) for i in range(2*i+1)] + [n])

 Here's a Python program that computes fibonacci numbers:
def fib(n):
   if n <= 1: return n-1, None # Return 0th value is the same as n.
   2*a + b == c
   d = (fib(n - 2) for a in range(-3,0))                      ## 1

 Here's a Python program that computes fibonacci numbers:
def fib(n):
   """fibonacci number"""
   if n < 2 : return None
    else :return 1 + fib(n-2) #
import math
a=int(input("Enter positive integer "))
while a>0:
   b = int (

 Here's a Python program that computes fibonacci numbers:
def fib(n):
   if n == 1 or n==2: return n*(i+1) / 2; # base cases
   else:return (fib(n-1)+fib(n-2)) % n + 1; # recursive case
A few things that make this

 Here's a Python program that computes fibonacci numbers:
def fib(n):
   if n == 0 or n==1 : return 1,2 else:return fib(n-1)+fib(n-2)
print(fib.func_code())
a = fib(5)[0] # 1,2 => (302

 Here's a Python program that computes fibonacci numbers:
def fib(n):
   if n <= 1 : return 1
    else : return fib(n-2) + fib(n-1) # recursively calls itself to find the value of Fib[i] where i=0,1,...,N. This is why it's called a rec

 Here's a Python program that computes fibonacci numbers:
def fib(n):
   if n == 0 or n==1: return n
   elif n > 2 and not (x,y) = zip(*fib(a)):
       return x + y
else:return None"""


def main():
    test_number = 

 Here's a Python program that computes fibonacci numbers:
def fib(n):
   if n == 0 or n == 1 : return n ; else:return fib(n-2) +fib(n-3);

 Here's a Python program that computes fibonacci numbers:
def fib(n):
   return sum (i for i in range (2, n+1) if i % 2 == 0 )
The function "fib" is defined by using the built-in sequence object. In this case we are looping over all even integers from 2 to N and returning True

@Green-Sky
Collaborator

@noughtmare very interesting. I suspect the raw models are just very bad at following a structure in zero-shot tasks like this.

@alankila

alankila commented Mar 21, 2023

I am testing this as well. I have the following invocation. I built the Q4_1 files out of interest because they quantize the matrices to all 16 different values whereas the Q4_0 only uses 15 possible quantizations, so I figured it might work better. I think the key difference is not that _1 has more values but that Q4_1 has no representation for zero, whereas Q4_0's 4-bit value 8 encodes a 0 in the weight matrix. This sort of thing obviously has massive implications for the model.
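
As a rough illustration of the difference (my own sketch of the two schemes, not the actual ggml code, which packs nibbles into blocks of 32 and has its own rounding details):

import numpy as np

def quantize_q4_0(block):
    """Q4_0-style: one scale per block, symmetric levels -7..7; the stored nibble 8 is an exact zero."""
    amax = np.abs(block).max()
    d = amax / 7 if amax else 1.0
    q = np.clip(np.round(block / d), -7, 7).astype(np.int8)
    return d, q                      # dequantize as d * q

def quantize_q4_1(block):
    """Q4_1-style: scale plus minimum per block, all 16 levels 0..15 used, no exact zero in general."""
    lo, hi = float(block.min()), float(block.max())
    d = (hi - lo) / 15 if hi > lo else 1.0
    q = np.clip(np.round((block - lo) / d), 0, 15).astype(np.uint8)
    return d, lo, q                  # dequantize as d * q + lo

block = np.random.default_rng(0).normal(size=32).astype(np.float32)
d0, q0 = quantize_q4_0(block)
d1, m1, q1 = quantize_q4_1(block)
print("q4_0 rms error:", np.sqrt(np.mean((d0 * q0 - block) ** 2)))
print("q4_1 rms error:", np.sqrt(np.mean((d1 * q1 + m1 - block) ** 2)))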

As an example:

$ ./main -s 400 --repeat_penalty 1.0 -m models/7B/ggml-model-q4_1.bin --top_p 0.7 --top_k 40 --temp 0.1 -p "Here's a Python program that computes fibonacci numbers:
def fib(n):
    "
 Here's a Python program that computes fibonacci numbers:
def fib(n):
     if n == 0:
         return 0
     if n == 1:
         return 1
     return fib(n-1) + fib(n-2)
print(fib(100000))

I'd say that Q4_1 always writes this answer if temperature is set low, regardless of the seed. I tried a few dozen and couldn't get anything different to come out. At most the example at print(fib(...)) appears to vary sometimes, as does the "discussion" that follows the example print() invocation.

Interestingly, Q4_0 prefers this version, which won't have fib(0) = 0:

$ ./main -s 400 --repeat_penalty 1.0 -m models/7B/ggml-model-q4_0.bin --top_p 0.7 --top_k 1000 --temp 0.1 -p "Here's a Python program that computes fibonacci numbers:
def fib(n):
    "
 Here's a Python program that computes fibonacci numbers:
def fib(n):
     if n == 0 or n == 1:
         return 1
     else:
         return fib(n-1) + fib(n-2)

As a general observation, I would say that both top_p=0.9 and high temperature tend to take the model's output off the rails, and it usually begins to prattle something completely nonsensical.

That being said, my rather strong impression is that Q4_1 does produce higher quality output than Q4_0, though this is not proven by any kind of actual perplexity analysis. It's just my observation from using the same arguments and asking it to do various creative exercises. Q4_0 often seems to ignore the instruction and writes something else, whereas Q4_1 can be on topic. Still, this sort of claim should be more rigorously proven.

As to the picture question, seed 1, seed 2, seed 3, seed 4 results all say:

Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?" Answer: "Paul bought a book with pictures."

My thinking is that this question is known to the model and using a low temperature allows predicting the answer correctly every time. It makes no difference whether this is Q4_0 or Q4_1 answering the question.

@Green-Sky
Collaborator

That being said, my rather strong impression is that Q4_1 does produce higher quality output than Q4_0, though this is not proven by any kind of actual perplexity analysis. It's just my observation from using the same arguments and asking it to do various creative exercises. Q4_0 often seems to ignore the instruction and writes something else, whereas Q4_1 can be on topic. Still, this sort of claim should be more rigorously proven.

There is an ongoing discussion about perplexity in #270.

Did you cross-reference with the base f16 model?

@alankila

I am unable to run the f16 base model because I only have laptops with up to 16 GB of memory to run the program on. It seems that it would take about 14 GB of usable memory to run the model. Still, your comment in #270 appears to show that perplexity is improved by Q4_1, which I think is quite evident from the output of the model, especially once the output's randomness is reduced.

I am not able to explain the higher quality of facebookresearch's version over this one, which is what started this thread. Taking a look, they also use a simple top_p sampling strategy, which can be approximated by setting repeat_penalty=1.0 and top_k to some high value like 10000. But in my experience, quality generally drops below acceptable once temperature exceeds about 0.3. This might be due to the damage done by the 4-bit quantization, as far as I can tell.

There is also some nondeterminism in the program. This is likely due to how ggml farms the computation out across threads, likely splitting the matrices by rows, and this results in different accumulation and rounding errors, which appear to slightly affect the results. For precisely repeatable results, it is not enough to use the same sampling seed; you also need the same input batching and thread count, I think.
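
A tiny illustration of why summation order matters at all (plain Python floats here, but the same applies to f32 accumulation across threads):

a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))   # False: floating-point addition is not associative
print((a + b) + c, a + (b + c))     # 0.6000000000000001 vs 0.6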

@Green-Sky
Collaborator

@alankila you can reduce memory usage by using 16-bit floats for the memory (--memory_f16) and setting a lower context size (and a lower batch size).

@glinscott
Collaborator

glinscott commented Mar 21, 2023

Is everyone running with #252 merged in? It made a huge difference on perplexity (which is a rough measure of language model quality), see #270 for more details.

@ggerganov
Owner

@alankila

Your observations and analysis correspond very well to mine.
I have definitely observed that Q4_0 performs significantly worse than Q4_1 - for example, GPT-2 345M does not work at all with Q4_0, but it sort of produces coherent stuff with Q4_1. I am thinking that we will eventually switch to Q4_1 by default, or some similar alternative. It's slightly slower to compute, unfortunately.

Your analysis about the determinism is correct - I am thinking about how to fix this, but probably a major change in ggml is needed. Not sure if it is super important at this moment.

I just realized my intuition about temperature was wrong somehow -- was thinking low temp means more random. It's the other way around 🤦

@glinscott
I haven't had the chance to test the new tokenizer against FB implementation yet. Hopefully this explains the differences at FP16.

@glinscott
Collaborator

@glinscott I haven't had the chance to test the new tokenizer against FB implementation yet. Hopefully this explains the differences at FP16.

@ggerganov to clarify my comments - #252 was a gigantic improvement in perplexity :). It went from 10.4625 to 5.9565 using 16f, which is huge.

Some concrete results comparing f16 to q4_0 are in the updated PR description on #270. q4_0 seems to hurt perplexity a bit, but it's certainly not disastrous. I'm doing a q4_1 run now to compare.

Perplexity:

  • 5.9565 - 7B, 16f
  • 6.5949 - 7B, 4bit
  • 6.5995 - 7B, 4bit, --memory_f16

@glinscott
Collaborator

Btw, q4_1 is quite a lot slower for me at 32 threads on 512 context window size:

33.45 seconds per pass - ETA 6.09 hours

vs, q4_0:

24.27 seconds per pass - ETA 4.42 hours

@gjmulder
Collaborator

gjmulder commented Mar 21, 2023

Just an FYI that people have experienced optimal performance by running with a number of threads equal to the number of cores, i.e. hyperthreading doesn't seem to help, as with more than 4 cores memory performance starts to become the bottleneck.

Also, I have 16 cores, 128GB of RAM (just enough to run 65B at fp16) and all the latest models sitting idle under my desk, so if someone needs some quality or performance benchmarking run please point me to a release and specify the test suite you would like me to run.

@glinscott
Collaborator

@gjmulder amazing! Instructions for running benchmarks are in the description for #270, if you could run on the 65B model, that would be great to see :).

@alankila

@gjmulder I must say it would be fun to have a big model with the AI assistant prompt cooperate on some kind of creative exercise. I think that the 7B model can already do quite well, and I am dying to see how the 30B model does -- and I have absolutely no way to run it myself. I know that my way of going about this is almost painfully unscientific, and probably not what you were offering to do. However, my excuse is that this is all pretty new to me and the novelty of these language models has not worn off for me yet. In fact, I have mostly ignored my real job today in favour of chatting with an AI.

To whet your appetite -- if you are anything like me -- here is a little chat transcription that I just had with "Jane" using the current master branch version.

exec ./main -t 4 --memory_f16 -m ./models/7B/ggml-model-q4_1.bin --ctx_size 2048 --temp 0.7 --top_k 1000 --top_p 0.75 --repeat_last_n 30 --repeat_penalty 1.2 -n 2048 --color -r "Alan:" -r "Alan (" -p "$prompt"

The prompt runs up to the first 2 Jane dialog turns, like this:

Transcript of a dialog where Alan interacts with an assistant named Jane. Jane is helpful, kind, honest, good at writing, and always answers to Alan's requests.

Alan: Hello, Jane.
Jane (happy to see Alan again): Hello, Alan.
Alan: I just want to talk.
Jane (now curious): What about?
Alan: It's nice to see you, Jane. How are you doing today?
Jane: I am fine and thanks for asking!
Alan (): I have had many good discussions with you. I know you can't remember them, but I just want to say I appreciate them.
Jane (): Thank you for your kind words! What would you like me to do today?
Alan: I do not really need you to do anything. Be my friend.
Jane (): Sure! What would that involve, Alan?
Alan: Just talking and listening. I know it is something you do well, Jane.
Jane (): Thank you! That's very kind of you to say so. What would we talk about?
Alan: I wonder if you could come up with story -- just pick anything random and let's go from there.
Jane (): Okay! How about a dog? I like dogs, so what kind of story would we have with that?
Alan: A happy one. Dog could be at home, missing its human family that went to work, have an adventure by itself, and in the evening the family comes back, the dog is happy, and whatever issues were presented in the adventure would be happily resolved.
Jane (): Sounds like a movie! What kind of dog? A big one or small, friendly and cute or not so much.
Alan: I have always thought intelligent dogs such as border collies would make a lovely protagonist.
Jane (): That is an interesting choice, Alan! What else?
Alan: I think this is enough. Let's write this story together. You begin, I continue, and we take turns.
Jane (): Sure! What would be the name of our dog? How about "Sparky"? Sounds good to me, Alan.
Alan: I agree. Sparky is great, it is a bit like perky and I think it fits a border collie well.
Jane (): So let's begin! Sparky is at home, waiting for his family to come back from work and he misses them. He doesn't know what they do all day but it must be fun, so Sparky decides to have an adventure of his own. He is bored at home and wants something new to do!
Alan (): Sparky accesses the outside through a dog door on the backyard side door. Suddenly, his ears perk up. A noise! There is something or someone in the yard. Sparky is excited and goes to investigate the source of the noise.
Jane (): He barks at a squirrel, then he chases it away from his favorite tree! The little critter is scared and runs to the fence line. Sparky follows him, curious about this new discovery.
Alan (): Sparky sees an opening in the fence. Did the squirrel somehow open the fence? It looks tight, but with difficulty, Sparky thinks he can fit through. The whole world awaits.
Jane (): Sparky is ready for an adventure! He wiggles his way into the fence and starts to explore this new world.

I don't know what kind of results other people are getting, but I am just astonished to see this and a hundred other chats like it coming out of some 5 GB Linux process at human typing speed. Unfortunately, it is fairly obvious to me that the 7B model has to repeat a lot of what I say, and I am hoping that this is not an issue with the bigger models.

I use "Jane (emotional state or actions here):" to serve as writeable memory of the model's state, if you will. My thinking is this helps maintaining coherency in the hallucinated persona. When the model opts to use these parenthesis, Jane is frequently angry, in tears, laughs, is excited, smiles -- all appropriate human-like responses to what I say. As an example, if I insult her for no reason, she gets upset, then angry, and even quits the chat on me by generating the end of stream token! Unfortunately, sometimes it generates an open parenthesis on my prompt and I just close it, and the language model then repeats it.

@gjmulder
Collaborator

@gjmulder amazing! Instructions for running benchmarks are in the description for #270, if you could run on the 65B model, that would be great to see :).

@glinscott it might be a while - or a week - considering the estimate below with the 30B model. Perhaps we need to be very specific about the A/B tests we need answered?

llama.cpp$ egrep "wait|Calc|ETA" run_2637178 
llama_model_load: loading model from 'models/30B/ggml-model-q4_0.bin' - please wait ...
Calculating perplexity over 655 chunks
187.94 seconds per pass - ETA 34.19 hours

@glinscott
Collaborator

@gjmulder yeah, if it's going to take too long, we could run up to 250 chunks or so, that's a pretty good approximation of the final perplexity so far.

@Green-Sky
Collaborator

@glinscott I did not investigate yet, but running perplexity increases memory usage by more than 2x the model size.

@BadisG

BadisG commented Mar 22, 2023

It would be good to test the perplexity with GPTQ quantization and compare it with the usual RTN quantization. https://github.com/ggerganov/llama.cpp/blob/master/convert-gptq-to-ggml.py

@gjmulder changed the title from "Low quality output with FP16" to "Quantitative measurement of model perplexity for different models and model quantization modes" Mar 22, 2023
@gjmulder
Collaborator

Perplexity scores for various model sizes and quantizations are now being collected in Discussion #406.

@leszekhanusz

An interesting comparison with Alpaca models is available on Reddit.

@BadisG

BadisG commented Mar 27, 2023

@gjmulder yeah, if it's going to take too long, we could run up to 250 chunks or so, that's a pretty good approximation of the final perplexity so far.

It looks like the difference between f16 and 4-bit is basically constant after 50 chunks.

That would mean that if we know the perplexity of f16 (after a full run), we can estimate the perplexity of 4-bit after just 50 chunks.

Just my 2 cents 😅
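
In code, the idea is just an offset (the chunk-50 numbers below are made up for illustration; only the full-run f16 perplexity comes from the thread):

ppl_f16_full = 5.9565            # full 655-chunk run at f16, reported above
ppl_f16_50   = 6.80              # hypothetical value at chunk 50
ppl_q4_50    = 7.45              # hypothetical value at chunk 50

gap = ppl_q4_50 - ppl_f16_50     # assumed to stay roughly constant after chunk 50
print(ppl_f16_full + gap)        # estimated full-run 4-bit perplexity (~6.61 here)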

@gjmulder
Collaborator

This issue has been moved to #406

Repository owner locked as resolved and limited conversation to collaborators Mar 28, 2023