Windows - CUDA GPU - Performance Difference - 1429 vs 1430+ #3884

Closed
@young-developer

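CUDA GPU inference is slower on the latest build (1449) than on 1336: token generation (eval) drops from 65.07 to 53.90 tokens per second, about 17% slower, even though prompt eval and sampling are faster on 1449: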

1449

llama_print_timings:        load time =    3111.53 ms
llama_print_timings:      sample time =      99.26 ms /   617 runs   (    0.16 ms per token,  6215.81 tokens per second)
llama_print_timings: prompt eval time =      73.73 ms /    22 tokens (    3.35 ms per token,   298.37 tokens per second)
llama_print_timings:        eval time =   11428.62 ms /   616 runs   (   18.55 ms per token,    53.90 tokens per second)
llama_print_timings:       total time =   11679.26 ms

1336

llama_print_timings:        load time =    3150.73 ms
llama_print_timings:      sample time =     149.46 ms /   623 runs   (    0.24 ms per token,  4168.31 tokens per second)
llama_print_timings: prompt eval time =     115.86 ms /    23 tokens (    5.04 ms per token,   198.52 tokens per second)
llama_print_timings:        eval time =    9558.68 ms /   622 runs   (   15.37 ms per token,    65.07 tokens per second)
llama_print_timings:       total time =   10518.21 ms
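
For anyone who wants to quantify the regression from their own logs, here is a minimal Python sketch that pulls the `eval time` line out of `llama_print_timings` output and computes the relative slowdown. The function names and the regular expression are my own illustration, not part of llama.cpp, and the pattern assumes the log format shown above.

```python
import re

# Matches the "eval time" line only; the "prompt eval time" line has
# "prompt" right after the prefix, so it does not match.
EVAL_RE = re.compile(
    r"llama_print_timings:\s+eval time\s*=\s*([\d.]+) ms\s*/\s*(\d+) runs"
)

def eval_tokens_per_second(log_text):
    """Return generation speed (tokens/s) from one run's timing output."""
    m = EVAL_RE.search(log_text)
    if m is None:
        raise ValueError("no eval time line found")
    total_ms, runs = float(m.group(1)), int(m.group(2))
    return runs / (total_ms / 1000.0)

def relative_slowdown(old_tps, new_tps):
    """Fraction of throughput lost going from old_tps to new_tps."""
    return 1.0 - new_tps / old_tps

# Timing lines copied from the two runs above.
log_1449 = """\
llama_print_timings: prompt eval time =      73.73 ms /    22 tokens (    3.35 ms per token,   298.37 tokens per second)
llama_print_timings:        eval time =   11428.62 ms /   616 runs   (   18.55 ms per token,    53.90 tokens per second)
"""
log_1336 = """\
llama_print_timings: prompt eval time =     115.86 ms /    23 tokens (    5.04 ms per token,   198.52 tokens per second)
llama_print_timings:        eval time =    9558.68 ms /   622 runs   (   15.37 ms per token,    65.07 tokens per second)
"""

tps_1449 = eval_tokens_per_second(log_1449)
tps_1336 = eval_tokens_per_second(log_1336)
slowdown = relative_slowdown(tps_1336, tps_1449)

print(f"1336: {tps_1336:.2f} t/s, 1449: {tps_1449:.2f} t/s, "
      f"slowdown: {slowdown:.1%}")
```

The comparison uses only the eval (generation) phase, since that is where the two builds differ in the wrong direction.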

Logs:

logs-fast-1336.txt
logs-slow-1449.txt
