IQ1_XS FTYPE quant strategy #6310


Contributor

@Nexesenex Nexesenex commented Mar 26, 2024

@ikawrakow's IQ quants brought us the SOTA quantization available today, yet the IQ1_S model quant remains of limited use below 70b (and even at 70b). So I wondered whether a better "mixed" strategy could further improve the quality/size trade-off of the sub-2bpw model quants and bring them in line with the other IQ LLAMA_FTYPE strategies.

I tested many combinations, starting from the known quantization-mix patterns, a very basic understanding of which tensor does what, and some sense of proportions. The result is a slightly different model quant strategy built on the current IQ1_S GGML_TYPE, which can easily be scaled upward from this IQ1_XS LLAMA_FTYPE. I have already scaled it up into an IQ1_S replacement candidate, to follow soon if this approach is approved, and the results are very satisfactory; the incoming IQ1_M GGML_TYPE should further improve the IQ1_S FTYPE and allow a scaled IQ1_M FTYPE after that.

The IQ1_XS strategy is as follows (a code sketch of this mapping follows the list):

  • The token embedding weight goes from Q2_K to IQ2_S, except for MoEs like Mixtral, which do not seem to like 2-bit IQ quants for this tensor.
  • The quite influential attn.v.weight scales up with the GQA factor, and further with the number of experts, because the relative size of these small tensors shrinks quickly compared to the rest of a GQA/MoE model. The same idea applies to attn.k.weight, but at a lower quantization quality.
  • In this IQ1_XS FTYPE, which would be the smallest FTYPE available for now if accepted, I do not touch the FFN weights, which all remain in IQ1_S.
  • The output weight goes from Q5_K to IQ4_XS, except for the MoEs (for a somewhat similar reason as the embedding weight, although the difference is smaller).
  • For the MoEs, attn.q.weight is pushed to IQ2 quants (XXS, XS or S according to the number of experts): having Q4_K (now IQ4_XS/Q5_K) K/V tensors and only an IQ1_S Q tensor was not optimal on a MoE, as the integrity of the query/key/value triplet seems particularly important there. And since attn.q.weight is shared between the 8 experts, the size increase is minimal.
  • I numbered IQ1_XS as "32" in the llama.h enumeration, since "31" will likely be taken by the incoming IQ1_M FTYPE prepared by @ikawrakow.
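
For readers who want to see what this mapping could look like in code, here is a minimal sketch in the spirit of llama.cpp's per-tensor type selection. It is illustrative only and not this PR's actual diff: the helper name and the exact GQA/expert thresholds are my shorthand for the list above.

```cpp
#include <string>
#include "ggml.h" // for the GGML_TYPE_* enum values

// Illustrative sketch only, NOT this PR's actual diff: one possible shape of the
// IQ1_XS per-tensor mapping described above, in the style of llama.cpp's
// quantization type selection. Thresholds are my reading of the list above.
static ggml_type iq1_xs_tensor_type(const std::string & name, int n_expert, int n_gqa) {
    const bool is_moe = n_expert > 1;
    if (name.find("token_embd.weight") != std::string::npos) {
        return is_moe ? GGML_TYPE_Q2_K : GGML_TYPE_IQ2_S;       // MoEs keep Q2_K here
    }
    if (name.find("output.weight") != std::string::npos) {
        return is_moe ? GGML_TYPE_Q5_K : GGML_TYPE_IQ4_XS;      // later bumped toward Q4_K
    }
    if (name.find("attn_v.weight") != std::string::npos) {
        if (n_expert >= 8 || n_gqa >= 8) return GGML_TYPE_Q4_K; // scales with GQA/experts
        if (n_gqa >= 4)                  return GGML_TYPE_IQ3_S;
        return GGML_TYPE_IQ2_S;
    }
    if (name.find("attn_k.weight") != std::string::npos) {
        if (n_expert >= 8 || n_gqa >= 8) return GGML_TYPE_IQ3_S; // one notch below attn_v
        return GGML_TYPE_IQ2_XS;
    }
    if (is_moe && name.find("attn_q.weight") != std::string::npos) {
        return GGML_TYPE_IQ2_XS;  // IQ2_XXS/XS/S depending on expert count in the PR
    }
    // ffn_up / ffn_gate / ffn_down and everything else stay at the base type
    return GGML_TYPE_IQ1_S;
}
```

The real selection logic in llama.cpp also depends on layer index and on whether an imatrix is present; the point of the sketch is only the tensor-level hierarchy the list above describes.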

IQ1_XS (this PR) vs IQ1_S "Even Better" (master):

Perplexity at 512 ctx:

  • Llama 2 7b : 14.6916 vs 13.8991, at 1.76bpw instead of 1.81bpw
  • Llama 2 7b RMS 1.875e-5 : 14.7964 vs 13.8402, at 1.76bpw instead of 1.81bpw
  • Mistral Instruct 7b v0.2 : 12.1735 vs 11.8538, at 1.72bpw instead of 1.78bpw
  • Yi 34b (Kyllene 1.1) : 9.5278 vs 9.8761, at 1.69bpw instead of 1.74bpw
  • Mixtral Instruct 0.1 : 7.267 vs 7.3085, at 1.63bpw instead of 1.68bpw

Perplexity at 4096 ctx:

  • Llama 2 7b : 12.6731 vs 12.6402 at 1.76bpw instead of 1.81bpw
  • Llama 2 7b RMS 1.875e-5 : 12.4292 vs 11.8383 at 1.76bpw instead of 1.81bpw
  • Mistral Instruct 7b v0.2 : 9.6538 vs 9.2874, at 1.72bpw instead of 1.78bpw
  • Yi 34b (Kyllene 1.1) : 7.6289 vs 7.8954, at 1.69bpw instead of 1.74bpw
  • Mixtral Instruct 0.1 : 6.0612 vs 6.0789, at 1.63bpw instead of 1.68bpw

I didn't work much on Mistral Instruct 7b v0.2; there's a small quality regression on this model, roughly in line with the reduced size.

Llama 2 70b IQ1_XS is also likely to be very close to the current IQ1_S (I bumped attn.k.weight from IQ2_XS to IQ2_S since my last test, and I had a 1.5% perplexity bump vs the current IQ1_S "Even Better"), at 1.65-1.66bpw instead of 1.69bpw.

This strategy already scales well in the interval between the IQ1_S and IQ2_XXS FTYPEs, and there's a more elusive but real margin of progress beyond that.

The new IQ1_M proposed by @ikawrakow (thanks again!) will help a lot in the quest for a truly usable 2.0/sub-2bpw quant strategy, and IQ quants at 4.5+bpw and 5+bpw, if made available, could help refine the small tensors and the output tensor further!

At the end of the day, the IQ1_S GGML_TYPE is VERY useful for quantizing the FFN tensors, especially ffn_up and ffn_gate (there are also experiments to run on these two by varying their ratio around an ffn_down "pillar"): they represent most of a model's size and are the least sensitive to low-bpw quantization, while the smaller tensors are much more sensitive and can be beefed up without much size increase.
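
To make those proportions concrete, here is a tiny back-of-the-envelope sketch using Llama-2-70B's published dimensions and the nominal per-weight sizes of the ggml types involved (IQ1_S ~1.5625 bpw, Q4_K = 4.5 bpw); it is an order-of-magnitude illustration, not a measurement from this PR:

```cpp
#include <cstdio>

int main() {
    // Llama-2-70B published shape: 80 layers, n_embd = 8192, GQA-8 (K/V projections
    // are 8192 x 1024), FFN hidden size 28672.
    const double n_layer = 80;
    const double attn_v  = 8192.0 * 1024;          // params in one attn_v.weight
    const double ffn     = 3.0 * 8192.0 * 28672;   // ffn_up + ffn_gate + ffn_down per layer

    // Cost of bumping only attn_v from IQ1_S (~1.5625 bpw) to Q4_K (4.5 bpw):
    const double extra_bits = (4.5 - 1.5625) * attn_v * n_layer;

    printf("attn_v is %.2f%% of the FFN block per layer\n", 100.0 * attn_v / ffn);
    printf("bumping attn_v from IQ1_S to Q4_K adds ~%.0f MiB over the whole model\n",
           extra_bits / 8.0 / (1024.0 * 1024.0));
    // Prints roughly 1.19% and ~235 MiB, i.e. about +0.03 bpw on a ~69B-parameter
    // model: this is why the small attention tensors can be beefed up cheaply.
    return 0;
}
```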

Tests and feedback will be appreciated!

Footnote: this is my first "real" PR. I don't know much about code, so sorry for the bulky formatting. I had to choose between the two possible approaches (per quant strategy / per tensor), and I chose the first because a "small tree" is the most obvious logical shape for me!

Edit: Llama 2 7b scores are corrected.

Edit: IQ4_XS tensors pushed to Q4_K to focus on quality, with a minor size increase.

@ikawrakow
Contributor

Perplexity at 4096 ctx:

  • Llama 2 7b : 12.7689 vs 13.3220 at 1.76bpw instead of 1.81bpw

I confirmed your other values, but this one is wrong: I get PPL = 11.86 for LLaMA-v2-7B for IQ1_S on master.

@Nexesenex Nexesenex closed this Mar 26, 2024
@Nexesenex Nexesenex reopened this Mar 26, 2024
@Nexesenex
Contributor Author

Nexesenex commented Mar 26, 2024

Perplexity at 4096 ctx:

  • Llama 2 7b : 12.7689 vs 13.3220 at 1.76bpw instead of 1.81bpw

I confirmed your other values, but this one is wrong: I get PPL = 11.86 for LLaMA-v2-7B for IQ1_S on master.

I didn't change the rms_norm_epsilon value when testing. I will download a fresh Llama_2_7B and remake an fp16 to retest; I need to test the Q4_K output anyway.

Note: I closed/reopened the PR by mistake. :X

Edit: Now my results for IQ1_S are in line with yours for Llama 2 7b. I'm retesting IQ1_XS now.

Edit 2: Llama 2 scores corrected. Now I move on to the output tensor.

@Nindaleth
Contributor

Nindaleth commented Mar 26, 2024

Please update the Python constants too (used e.g. by gguf-dump.py).

@Nexesenex
Contributor Author

Please update the Python constants too (used e.g. by gguf-dump.py).

I might be wrong, but looking at the code, I think it applies only to GGML_TYPE (tensor quantization), not to LLAMA_FTYPE (quantization mix strategy); see the abridged enum excerpts at the end of this comment.

- There's indeed a slight bonus with Q4_K compared to IQ4_XS, worth taking for such a cheap cost, especially on the K & V attention tensors.
- Obsessing over size doesn't matter much for the smallest models, which are small anyway and logically deserve an offset toward quality, while the bigger models that are actually usable will barely grow in size yet will appreciate the slight quality bump of Q4_K over IQ4_XS.
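
To illustrate the GGML_TYPE vs LLAMA_FTYPE distinction mentioned above, here are abridged excerpts of the two enums. The numeric values are omitted except the "32" proposed in this PR, and the exact identifier for the new FTYPE is assumed to follow the usual LLAMA_FTYPE_MOSTLY_* pattern; ggml.h and llama.h remain the authoritative lists.

```cpp
// ggml.h: GGML_TYPE describes how a single tensor is encoded.
// gguf-py/gguf/constants.py mirrors this enum, so it only changes when a new
// tensor encoding is added (abridged excerpt, real values omitted).
enum ggml_type {
    GGML_TYPE_Q2_K,
    GGML_TYPE_IQ2_S,
    GGML_TYPE_IQ1_S,
    // ...
};

// llama.h: LLAMA_FTYPE names a whole-model mix of the tensor types above.
// IQ1_XS only adds an entry here, which is why constants.py needs no change.
enum llama_ftype {
    LLAMA_FTYPE_MOSTLY_IQ1_S,            // real value omitted
    LLAMA_FTYPE_MOSTLY_IQ1_XS = 32,      // value proposed in this PR, name assumed
    // ...
};
```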
@Nindaleth
Contributor

It looks like you're right: since there's no change to ggml.h, no change to gguf constants.py is necessary either. Sorry for the noise.

Contributor Author

@Nexesenex Nexesenex left a comment


The quantization failure that occurred when the token embeddings tensor is quantized to IQ2_S, caused by the missing-iMatrix check, is now solved by adding an exception (a sketch of that kind of guard follows below).

As for the IQ4_XS vs Q4_K question for some tensors/cases, Q4_K is chosen, in line with @ikawrakow's remarks and with my own concurring recollection of past testing done while preparing this PR, results I had initially dismissed in an overly size-focused approach.

On my side, I think this PR is ready for pre-merge review.
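
For reference, here is a sketch of the kind of guard described above. It is not this PR's literal change, just one plausible shape for such an exception; the helper name check_imatrix and its signature are mine.

```cpp
#include <stdexcept>
#include <string>
#include "ggml.h" // for ggml_type / GGML_TYPE_*

// Sketch only, not this PR's literal code: low-bit IQ types normally require an
// importance matrix, but token_embd.weight has no imatrix data, so it is
// exempted from the hard failure instead of aborting the whole quantization.
static void check_imatrix(const std::string & name, ggml_type type, bool has_imatrix) {
    const bool needs_imatrix =
        type == GGML_TYPE_IQ1_S  || type == GGML_TYPE_IQ2_XXS ||
        type == GGML_TYPE_IQ2_XS || type == GGML_TYPE_IQ2_S;
    const bool is_token_embd = name.find("token_embd.weight") != std::string::npos;

    if (needs_imatrix && !has_imatrix && !is_token_embd) {
        throw std::runtime_error("imatrix is required to quantize " + name + " to this type");
    }
    // token_embd.weight falls through and is quantized without imatrix data.
}
```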

@Nexesenex Nexesenex requested a review from ikawrakow March 26, 2024 17:33
@Nexesenex Nexesenex changed the title IQ1_XS FTYPE quant strategy attempt IQ1_XS FTYPE quant strategy Mar 26, 2024
@Nexesenex Nexesenex marked this pull request as draft March 27, 2024 12:48
@Nexesenex
Contributor Author

Nexesenex commented Mar 27, 2024

After further testing, this PR can still be improved (ARC and Winogrande results, maybe using IQ1_M for some tensors), so I'm converting it to a draft; time to dig a bit more.

@DesperateZero

I'm sorry, I'm quite new to quantization technology, and I'm curious whether these techniques offer improvements for other bit widths as well. I'm aware that low-bit quantization of larger models generally performs better than high-precision quantization of smaller models. However, I'm uncertain which approach has more potential from a cost-effectiveness perspective in the future: digging below 2 bits or optimizing 2-3 bits. I've personally run a subjective test with a 72b model (qwen1.5-72b-chat), and I found that differences in model performance above iq3_s precision are imperceptible. But how do we assess the performance loss, from a human perspective, between iq1_xs and iq2_m?

@Nexesenex
Contributor Author

Nexesenex commented Mar 28, 2024

I'm sorry, I'm quite new to quantization technology, and I'm curious whether these techniques offer improvements for other bit widths as well. I'm aware that low-bit quantization of larger models generally performs better than high-precision quantization of smaller models. However, I'm uncertain which approach has more potential from a cost-effectiveness perspective in the future: digging below 2 bits or optimizing 2-3 bits. I've personally run a subjective test with a 72b model (qwen1.5-72b-chat), and I found that differences in model performance above iq3_s precision are imperceptible. But how do we assess the performance loss, from a human perspective, between iq1_xs and iq2_m?

During my tests, I found the sweet spot for quality/size to be between 2.3 and 2.5bpw. Quality/speed is something I didn't test, and it is of course very impactful.

At 2.5bpw, and aside from coding, which usually requires more precision from what I read, we can start to really use a model (MoE, 70b, even 34b to some extent), even if the quality is of course lower than at 3bpw+. All my tests aim to find the best quality/size spots for quantization strategies with the GGML quants provided by @ikawrakow.

As for the future, it depends on the model and hardware you run, but 2.3-2.5bpw in particular, and 2-3bpw in general, is in my uneducated opinion the place to dig until new SOTA 1.58bpw quants (@ikawrakow mentioned that recently) appear and maybe shift the best quality/size game toward lower-bpw strategies.

As for testing, take a look at LocalLLaMA on Reddit; plenty of people run tests there. You can also simply try the models with different quants and the same prompt, or use the benchmarks included in llama.cpp (ARC, Winogrande, HellaSwag, etc.) to get some measurements across quants.

- IQ4_XS output for models with fewer than 8 experts or GQA 8
- granularity for the QKV tensor when it exists
- also, drop attn.k.weight for Mistral & Yi from IQ2_XS to IQ2_XXS
@Nexesenex Nexesenex marked this pull request as ready for review March 29, 2024 12:15
attn.v.weight in Q4_K for all MoEs & models with GQA-4, Mistral (PPL at 4096 ctx benefits quite a lot), and incidentally CodeLlama 34b (which is for coding anyway and isn't exploitable in IQ1 quants).
Yi 34b gets IQ3_S for now; more tests are needed due to huge perplexity increases with IQ4_XS and Q4_K for attn.v.weight on my test model (Kyllene 1.1).
@mofosyne mofosyne added generation quality Quality of model output Review Complexity : High Generally require indepth knowledge of LLMs or GPUs labels May 10, 2024
Contributor

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 539 iterations 🚀

Details (for performance-related PRs only):
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8633.01ms p(95)=21269.17ms fails=, finish reason: stop=475 truncated=64
  • Prompt processing (pp): avg=103.51tk/s p(95)=470.11tk/s
  • Token generation (tg): avg=32.18tk/s p(95)=45.58tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=Nexesenex-IQ1_XS-IQ1_S-quant-strategies commit=e4ac8ae720847b674c681e3b6218fc1c67683725

Charts omitted (llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 539 iterations): prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing.

@Nexesenex Nexesenex marked this pull request as draft August 10, 2024 18:30
@Nexesenex Nexesenex closed this Aug 11, 2024