
change default temperature of OAI compat API from 0 to 1 #7226

Merged
merged 2 commits into ggerganov:master from oai-temp on May 13, 2024

Conversation

Kartoffelsaft
Contributor

This should make the API more similar to OpenAI's actual API.

@mofosyne mofosyne added the "Review Complexity : Low" label (trivial changes to code that most beginner devs, or those who want a break, can tackle; e.g. a UI fix) on May 12, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 527 iterations 🚀

Expand details (for performance-related PRs only)
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8893.87ms p(95)=21900.26ms fails=, finish reason: stop=474 truncated=53
  • Prompt processing (pp): avg=102.31tk/s p(95)=434.58tk/s
  • Token generation (tg): avg=47.46tk/s p(95)=49.31tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=oai-temp commit=540d9b5970644896c1281bad56b2ae6ebeae5bd7

[chart: llamacpp:prompt_tokens_seconds, llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 527 iterations]
[chart: llamacpp:predicted_tokens_seconds, llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 527 iterations]

Details

[chart: llamacpp:kv_cache_usage_ratio, llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 527 iterations]
[chart: llamacpp:requests_processing, llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 527 iterations]

Collaborator

@mofosyne mofosyne left a comment


Double-checking your assertion, I can confirm that, at least for chat completion mode (which is what this PR deals with), the default is indeed temperature=1.0.

Source: https://platform.openai.com/docs/api-reference/chat/create#chat-create-temperature

temperature
number or null
The sampling temperature used for this run. If not set, defaults to 1.

Just a quick note that this is example code, not the actual llama.cpp endpoint itself, but it is still useful to maintain consistency.


Note that in transcription mode, creativity/temperature defaults to 0, so temperature defaults can differ between API endpoints.
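
For reference, a minimal sketch of what this means for clients (assuming a llama.cpp server running locally on its default port 8080; the model name is a placeholder): a request that omits temperature now falls back to 1.0 on the server side, so anyone who relied on the old default of 0 should send it explicitly.

```python
# Sketch only: assumes a llama.cpp server at localhost:8080 exposing the
# OAI-compatible /v1/chat/completions endpoint.
import requests

url = "http://localhost:8080/v1/chat/completions"

# No "temperature" field: after this PR the server defaults to 1.0,
# matching OpenAI's documented default for chat completions.
payload_default = {
    "model": "phi-2-q4_0",  # placeholder; many servers ignore or echo this
    "messages": [{"role": "user", "content": "Say hello."}],
}

# Explicit temperature: clients that depended on the old default of 0
# should now request it themselves.
payload_explicit = dict(payload_default, temperature=0.0)

for payload in (payload_default, payload_explicit):
    resp = requests.post(url, json=payload, timeout=60)
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])
```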

@mofosyne mofosyne merged commit e586ee4 into ggerganov:master May 13, 2024
64 checks passed
@shibe2
Contributor

shibe2 commented May 13, 2024

Different models can tolerate different temperatures. What if 1 is too high for most models that people run locally? Default in main is 0.8.

teleprint-me pushed a commit to teleprint-me/llama.cpp that referenced this pull request May 17, 2024

* change default temperature of OAI compat API from 0 to 1

* make tests explicitly send temperature to OAI API
@jukofyork
Contributor

jukofyork commented Jul 3, 2024

Different models can tolerate different temperatures. What if 1 is too high for most models that people run locally? Default in main is 0.8.

The value of 1 should work for any model assuming the logits weren't scaled whilst training.

A value of 1 actually corresponds to the model outputting "well calibrated" probability estimates. That is, if you were to plot the post-softmax probability estimates against the empirical fraction of times the next token fell in the respective "bin" (or, more likely, the log of these values), then, assuming log-loss (aka "cross entropy" loss) was used in training, you'd find that temperature=1 makes the plots line up best.

(The inverse of this is even used to calibrate the outputs of non-probabilistic models, e.g. SVMs trained with "maximum margin" loss: https://en.m.wikipedia.org/wiki/Platt_scaling)

This doesn't necessarily mean that temperature=1 will be optimal for every use case, but it should definitely not be broken, and it is likely the best default IMO.
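
To make the point concrete, here is a small sketch (toy logits, not taken from any real model) of temperature-scaled sampling: at temperature=1 the softmax is applied to the logits exactly as the model produced them, which is the calibrated case described above; lower values sharpen the distribution and higher values flatten it.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Temperature-scaled softmax over raw next-token logits."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()           # subtract max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

logits = np.array([2.0, 1.0, 0.1])   # toy logits for three candidate tokens

print(softmax_with_temperature(logits, 1.0))   # distribution "as trained" (calibrated)
print(softmax_with_temperature(logits, 0.5))   # sharper: favors the top token more
print(softmax_with_temperature(logits, 2.0))   # flatter: closer to uniform
```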

Labels
examples · Review Complexity : Low · server/api