added implementation of DRY sampler #6896

Closed
wants to merge 3 commits

Conversation

@l3utterfly (Contributor)

Accidentally broke the branch by merging, so I'm creating a new PR.

Original PR: #6839


📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 217 iterations 🚀

Details (performance-related PRs only):
  • Concurrent users: 8, duration: 10m
  • HTTP request: avg=22305.59ms p(95)=39668.19ms fails=, finish reason: stop=102 truncated=115
  • Prompt processing (pp): avg=272.24tk/s p(95)=792.89tk/s
  • Token generation (tg): avg=18.7tk/s p(95)=25.13tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=dry-sampler-2 commit=4d603e3520b8cba69b99a27c9f9b0d77e0e36439

[Chart: llamacpp:prompt_tokens_seconds, bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 217 iterations]

[Chart: llamacpp:predicted_tokens_seconds, bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 217 iterations]

[Chart: llamacpp:kv_cache_usage_ratio, bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 217 iterations]

[Chart: llamacpp:requests_processing, bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 217 iterations]

@bviksoe (Contributor) commented Apr 25, 2024

@p-e-w Forwarding your comment from the retired PR here:

I might have written the original implementation for llama.cpp, but my two previous sampler-related PRs (#5561, #5675) have received very little maintainer feedback with no way forward, and I don't enjoy putting effort into...

I hope you don't get too discouraged by this. This project seems to be teeming with great ideas and experiments. It's also very busy, and perhaps samplers don't get as much attention as they deserve. They are certainly a hot topic with several of the popular web UIs. That's because samplers are not just about avoiding "repetition loops", but about maintaining a fluent narrative over a long context.

Your ideas seem novel. Perhaps if we nudge @ggerganov to be more lenient about adding new samplers under an "Experimental" tag (in a separate implementation file, with a wiki page entry that explains the algorithm, etc.), it might be easier to get more people cooperating and eventually adding them here.

@p-e-w commented Apr 26, 2024

There was no need to close the first PR. Fragmenting the discussion is bad, and people doing git blame in the future shouldn't have to link-hop through a chain of PRs to understand everything.

Just force-push to the previous branch. Something like this should do the trick (from the dry-sampler-2 local branch):

git push --force --set-upstream origin dry-sampler
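
A minimal sketch, assuming the remote is origin and the branch names used in this thread, of how one might inspect the divergence between the two histories before forcing the push (these inspection commands are illustrative, not part of the suggestion above):

    # Fetch the remote state, then list the commits unique to each side:
    # "<" marks commits only on the local branch, ">" only on the remote one
    git fetch origin
    git log --oneline --left-right dry-sampler-2...origin/dry-sampler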

@l3utterfly (Contributor, Author)

Sorry, I'm bad with git. I tried your suggestion, but it didn't seem to work.

I tried a "reset + force push" before closing the other PR. It deleted the new changes in the dry-sampler branch, but I had accidentally merged the flash attention changes into the branch, and those commits were earlier than the DRY sampler changes, so I don't believe they got deleted.

Obligatory xkcd: https://xkcd.com/1597/

@p-e-w commented Apr 26, 2024

I don't get it. git push --force should replace the contents of the remote branch with those of the local one, including the entire history. I've used it many times like this and that's what happened. Are you sure you were pushing from the new branch (dry-sampler-2)?
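
As a sketch of how to confirm which local branch a push would come from (illustrative commands, assuming a reasonably recent git):

    git branch --show-current   # prints the branch currently checked out
    git status -sb              # first line shows the branch and its upstream,
                                # e.g. "## dry-sampler-2...origin/dry-sampler-2"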

@l3utterfly (Contributor, Author)

Hmm... locally I'm on the dry-sampler-2 branch; on GitHub it's origin/dry-sampler.

I used this command:
[screenshot of the push command and its output]

It just says everything is up to date. But as you can see on GitHub, the dry-sampler branch still contains the flash attention changes.

@p-e-w commented Apr 26, 2024

Ah, I see. You're still pushing from your local dry-sampler branch rather than from dry-sampler-2. See the output of the push command. The name of the local branch is the problem.

Try this command:

git push --force --set-upstream origin dry-sampler-2:dry-sampler
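
The <src>:<dst> refspec is the key detail: it pushes the local branch named on the left to the remote branch named on the right, so the two names no longer need to match. A small sketch of how the result could be verified afterwards, using the branch names from this thread:

    # After the push, the remote branch should be identical to the local one
    git fetch origin
    git diff dry-sampler-2 origin/dry-sampler   # empty output means they match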

@l3utterfly (Contributor, Author)

Thank you for taking the time to help me with this. I ran that command, and the dry-sampler branch seems to be correct now: https://github.com/l3utterfly/llama.cpp/commits/dry-sampler/

It seems the previous PR has not updated automatically: https://github.com/ggerganov/llama.cpp/pull/6839/files

Is there anything manual I need to do? Or do I just need to wait until GitHub picks it up?

@p-e-w commented Apr 26, 2024

Just a random guess: maybe the PR needs to be open in order to reflect changes?

@l3utterfly (Contributor, Author)

I think you are right... but that means we are out of luck:

[screenshot]

@l3utterfly (Contributor, Author)

Did that already:

[screenshot]

@p-e-w commented Apr 26, 2024

That was an hour ago, and you have force-pushed since then. You may have to delete and restore the branch again.

@l3utterfly (Contributor, Author)

Fixed! Thank you so much for helping me!

And sorry for the hassle; I didn't know how force-push worked before.

@l3utterfly l3utterfly closed this Apr 26, 2024
@l3utterfly l3utterfly deleted the dry-sampler-2 branch June 5, 2024 07:23