Threadpool: take 2 #8672

Merged Aug 29, 2024

Commits (48 total; diff shown from 42 commits)
130adf8
Introduce ggml_compute_threadpool
fmz Jul 31, 2024
a0aae52
Minor fixes
fmz Jul 31, 2024
d5c9c14
fixed use after release bug
fmz Jul 31, 2024
82224f8
fixed a harmless race condition
fmz Jul 31, 2024
817eaf0
Fix Android build issue
fmz Jul 31, 2024
5763732
fix more race conditions
fmz Jul 31, 2024
3008b31
fix deadlock for cases where cgraph.n_nodes == 1
fmz Jul 31, 2024
96d6603
threadpool: use cpu_get_num_math to set the default number of threadp…
max-krasnyansky Aug 3, 2024
2953441
bench: create fresh threadpool for each test
max-krasnyansky Aug 4, 2024
6fcc780
atomics: always use stdatomics with clang and use relaxed memory orde…
max-krasnyansky Aug 5, 2024
3b62f7c
threadpool: make polling the default to match openmp behavior
max-krasnyansky Aug 7, 2024
dfa6377
threadpool: do not wakeup threads in already paused threadpool
max-krasnyansky Aug 8, 2024
2e18f0d
fix potential race condition in check_for_work
Aug 8, 2024
48aa8ee
threadpool: do not create two threadpools if their params are identical
max-krasnyansky Aug 8, 2024
494e27c
threadpool: reduce pause/resume/wakeup overhead in common cases
max-krasnyansky Aug 10, 2024
b630acd
threadpool: add support for hybrid polling
max-krasnyansky Aug 11, 2024
9d3e78c
threadpool: reduce the number of barriers required
max-krasnyansky Aug 13, 2024
538bd9f
threadpool: remove special-casing for disposable threadpools
max-krasnyansky Aug 13, 2024
db45b6d
threadpool: do not clear barrier counters between graphs computes (fi…
max-krasnyansky Aug 15, 2024
307fece
threadpool: use relaxed order for chunk sync
max-krasnyansky Aug 21, 2024
63a0dad
threadpool: remove abort_callback from threadpool state
max-krasnyansky Aug 24, 2024
2358bb3
threadpool: better naming for thread/cpumask related functions
max-krasnyansky Aug 24, 2024
4a4d715
threadpool: consistent use of int type for n_threads params
max-krasnyansky Aug 24, 2024
c4452ed
threadpool: add support for ggml_threadpool_params_default/init
max-krasnyansky Aug 24, 2024
31541d7
threadpool: move typedef into ggml.h
max-krasnyansky Aug 24, 2024
4064860
threadpool: fix apply_priority() function name
max-krasnyansky Aug 24, 2024
f64c975
threadpool: fix swift wrapper errors due to n_threads int type cleanup
max-krasnyansky Aug 24, 2024
c506d7f
threadpool: enable --cpu-mask and other threadpool related options on…
max-krasnyansky Aug 24, 2024
8008463
threadpool: replace checks for compute_thread ret code with proper st…
max-krasnyansky Aug 24, 2024
49ac51f
threadpool: simplify threadpool init logic and fix main thread affini…
max-krasnyansky Aug 25, 2024
204377a
threadpool: update threadpool resume/pause function names
max-krasnyansky Aug 25, 2024
93f170d
threadpool: enable openmp by default for now
max-krasnyansky Aug 25, 2024
a7496bf
threadpool: don't forget to free workers state when omp is enabled
max-krasnyansky Aug 25, 2024
8186e96
threadpool: avoid updating process priority on the platforms that do …
max-krasnyansky Aug 26, 2024
658f16c
threadpool: update calling thread prio and affinity only at start/resume
max-krasnyansky Aug 26, 2024
8d5ab9a
llama-bench: turn threadpool params into vectors, add output headers,…
max-krasnyansky Aug 27, 2024
3bcc4de
llama-bench: add support for cool off between tests --delay
max-krasnyansky Aug 27, 2024
5d4c0a1
threadpool: move process priority setting into the apps (bench and cli)
max-krasnyansky Aug 27, 2024
e3c2202
threadpool: move all pause/resume logic into ggml
max-krasnyansky Aug 27, 2024
c6328bc
threadpool: further api cleanup and prep for future refactoring
max-krasnyansky Aug 28, 2024
bead7d4
threadpool: minor indent fixes
max-krasnyansky Aug 28, 2024
8e8f8ce
threadpool: improve setpriority error message
max-krasnyansky Aug 28, 2024
c6c27b1
Update examples/llama-bench/llama-bench.cpp
max-krasnyansky Aug 29, 2024
b97bd67
threadpool: fix indent in set_threadpool call
max-krasnyansky Aug 29, 2024
cae35b9
use int32_t for n_thread type in public llama.cpp API
max-krasnyansky Aug 29, 2024
c49d634
threadpool: use _new and _free instead of _create and _release
max-krasnyansky Aug 29, 2024
3b5f7c2
fix two more public APIs to use int32_t for n_threads
max-krasnyansky Aug 29, 2024
52aa677
build: set _GNU_SOURCE for Android
max-krasnyansky Aug 29, 2024
350 changes: 327 additions & 23 deletions common/common.cpp

Large diffs are not rendered by default.

30 changes: 23 additions & 7 deletions common/common.h
@@ -67,13 +67,18 @@ enum dimre_method {
    DIMRE_METHOD_MEAN,
};

struct cpu_params {
    int n_threads = -1;
    bool cpumask[GGML_MAX_N_THREADS] = {false}; // CPU affinity mask.
    bool mask_valid = false; // Default: any CPU
    enum ggml_sched_priority priority = GGML_SCHED_PRIO_NORMAL; // Scheduling prio : (0 - normal, 1 - medium, 2 - high, 3 - realtime)
    bool strict_cpu = false; // Use strict CPU placement
    uint32_t poll = 50; // Polling (busywait) level (0 - no polling, 100 - mostly polling)
};

struct gpt_params {
    uint32_t seed = LLAMA_DEFAULT_SEED; // RNG seed

    int32_t n_threads = cpu_get_num_math();
    int32_t n_threads_draft = -1;
    int32_t n_threads_batch = -1; // number of threads to use for batch processing (-1 = use n_threads)
    int32_t n_threads_batch_draft = -1;
    int32_t n_predict = -1; // new tokens to predict
    int32_t n_ctx = 0; // context size
    int32_t n_batch = 2048; // logical batch size for prompt processing (must be >=32 to use BLAS)
@@ -100,6 +105,11 @@ struct gpt_params {
    int32_t yarn_orig_ctx = 0; // YaRN original context length
    float defrag_thold = -1.0f; // KV cache defragmentation threshold

    struct cpu_params cpuparams;
    struct cpu_params cpuparams_batch;
    struct cpu_params draft_cpuparams;
    struct cpu_params draft_cpuparams_batch;

    ggml_backend_sched_eval_callback cb_eval = nullptr;
    void * cb_eval_user_data = nullptr;

@@ -204,7 +214,7 @@ struct gpt_params {
    int32_t port = 8080; // server listens on this network port
    int32_t timeout_read = 600; // http read timeout in seconds
    int32_t timeout_write = timeout_read; // http write timeout in seconds
    int32_t n_threads_http = -1; // number of threads to process HTTP requests
    int n_threads_http = -1; // number of threads to process HTTP requests (TODO: support threadpool)

    std::string hostname = "127.0.0.1";
    std::string public_path = "";
@@ -277,6 +287,11 @@ void gpt_params_print_usage(int argc, char ** argv, const gpt_params & params);

std::string gpt_params_get_system_info(const gpt_params & params);

bool parse_cpu_range(const std::string& range, bool(&boolmask)[GGML_MAX_N_THREADS]);
bool parse_cpu_mask(const std::string& mask, bool(&boolmask)[GGML_MAX_N_THREADS]);
void postprocess_cpu_params(cpu_params& cpuparams, const cpu_params* role_model = nullptr);
bool set_process_priority(enum ggml_sched_priority prio);

//
// String utils
//
@@ -327,8 +342,9 @@ struct llama_init_result {

struct llama_init_result llama_init_from_gpt_params(gpt_params & params);

struct llama_model_params llama_model_params_from_gpt_params (const gpt_params & params);
struct llama_context_params llama_context_params_from_gpt_params(const gpt_params & params);
struct llama_model_params llama_model_params_from_gpt_params (const gpt_params & params);
struct llama_context_params llama_context_params_from_gpt_params (const gpt_params & params);
struct ggml_threadpool_params ggml_threadpool_params_from_cpu_params(const cpu_params & params);

struct llama_model * llama_load_model_from_url(const char * model_url, const char * path_model, const char * hf_token, const struct llama_model_params & params);
struct llama_model * llama_load_model_from_hf(const char * repo, const char * file, const char * path_model, const char * hf_token, const struct llama_model_params & params);
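
The common.h hunks above replace the old flat n_threads fields with per-role cpu_params blocks (regular, batch, draft, draft batch) and declare helpers to post-process them and convert them into ggml threadpool parameters. A minimal sketch of how an application might wire these together follows; ggml_threadpool_new/ggml_threadpool_free and llama_attach_threadpool are assumed from other commits in this PR and do not appear in this hunk, so treat the exact signatures as illustrative.

// Sketch only: turning gpt_params.cpuparams into an explicit ggml threadpool.
// ggml_threadpool_new/_free and llama_attach_threadpool are assumptions based
// on other commits in this PR; exact signatures may differ.
#include "common.h"
#include "ggml.h"
#include "llama.h"

static bool setup_threadpools(gpt_params & params, llama_context * ctx) {
    // Resolve n_threads == -1 and let the batch role inherit from the main role.
    postprocess_cpu_params(params.cpuparams);
    postprocess_cpu_params(params.cpuparams_batch, &params.cpuparams);

    // Process-wide scheduling priority is applied by the app (per commit 5d4c0a1).
    set_process_priority(params.cpuparams.priority);

    // Translate parsed CPU settings (threads, mask, polling) into ggml params.
    struct ggml_threadpool_params tpp       = ggml_threadpool_params_from_cpu_params(params.cpuparams);
    struct ggml_threadpool_params tpp_batch = ggml_threadpool_params_from_cpu_params(params.cpuparams_batch);

    struct ggml_threadpool * tp       = ggml_threadpool_new(&tpp);        // assumed API
    struct ggml_threadpool * tp_batch = ggml_threadpool_new(&tpp_batch);  // assumed API
    if (tp == nullptr || tp_batch == nullptr) {
        return false;
    }

    // Hand both pools to the llama context: one for generation, one for batch processing.
    llama_attach_threadpool(ctx, tp, tp_batch);                           // assumed API
    return true;
}

On shutdown the pools would be released with ggml_threadpool_free (renamed from _release in commit c49d634).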
2 changes: 1 addition & 1 deletion examples/baby-llama/baby-llama.cpp
@@ -18,7 +18,7 @@ constexpr float rms_norm_eps = 5e-6f;
#endif

static void ggml_graph_compute_helper(std::vector<uint8_t> & buf, ggml_cgraph * graph, int n_threads) {
    struct ggml_cplan plan = ggml_graph_plan(graph, n_threads);
    struct ggml_cplan plan = ggml_graph_plan(graph, n_threads, nullptr);

    if (plan.work_size > 0) {
        buf.resize(plan.work_size);
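
As the baby-llama change shows, ggml_graph_plan now takes a third argument naming the threadpool to schedule on, with nullptr keeping the previous default behavior. Below is a short sketch of the same helper extended to accept an explicit pool; the ggml_cplan fields and ggml_graph_compute call are the existing ggml API, while the threadpool parameter type follows this PR's ggml.h typedef, so treat the details as illustrative.

// Sketch: same helper, but forwarding an explicit threadpool to the planner.
// Passing nullptr falls back to the default threading behavior, as in the diff above.
static void graph_compute_with_threadpool(std::vector<uint8_t> & buf, ggml_cgraph * graph,
                                          int n_threads, struct ggml_threadpool * threadpool) {
    struct ggml_cplan plan = ggml_graph_plan(graph, n_threads, threadpool);

    if (plan.work_size > 0) {
        buf.resize(plan.work_size);
        plan.work_data = buf.data();
    }

    ggml_graph_compute(graph, &plan);
}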
4 changes: 2 additions & 2 deletions examples/benchmark/benchmark-matmult.cpp
@@ -21,7 +21,7 @@
#endif

static void ggml_graph_compute_helper(std::vector<uint8_t> & buf, ggml_cgraph * graph, int n_threads) {
    struct ggml_cplan plan = ggml_graph_plan(graph, n_threads);
    struct ggml_cplan plan = ggml_graph_plan(graph, n_threads, nullptr);

    if (plan.work_size > 0) {
        buf.resize(plan.work_size);
@@ -54,7 +54,7 @@ static void tensor_dump(const ggml_tensor * tensor, const char * name) {
#define TENSOR_DUMP(tensor) tensor_dump(tensor, #tensor)

struct benchmark_params_struct {
    int32_t n_threads = 1;
    int n_threads = 1;
    int32_t n_iterations = 10;
};

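
The example programs above only consume cpuparams.n_threads, but the parsing helpers declared in common/common.h (parse_cpu_range, parse_cpu_mask) can also populate the affinity mask behind options such as --cpu-mask (see commit c506d7f). A rough sketch of how they might be used follows; the accepted string formats ("0-7" range, hex bitmask) are assumptions for illustration, not taken from this diff.

// Sketch: populating a cpu_params affinity mask from user-supplied strings.
// The "0-7" range and "0xff" hex-mask formats are assumed for illustration.
#include "common.h"
#include <string>

static bool cpu_params_from_strings(cpu_params & cp, const std::string & range, const std::string & mask) {
    if (!range.empty()) {
        if (!parse_cpu_range(range, cp.cpumask)) {   // e.g. "0-7" -> CPUs 0..7 (assumed format)
            return false;
        }
        cp.mask_valid = true;
    } else if (!mask.empty()) {
        if (!parse_cpu_mask(mask, cp.cpumask)) {     // e.g. "0xff" -> lowest 8 CPUs (assumed format)
            return false;
        }
        cp.mask_valid = true;
    }
    cp.strict_cpu = true;  // pin worker threads to the selected CPUs
    return true;
}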
4 changes: 2 additions & 2 deletions examples/cvector-generator/cvector-generator.cpp
@@ -486,8 +486,8 @@ int main(int argc, char ** argv) {
    if (use_pca) {
        // run PCA
        PCA::pca_params pca_params;
        pca_params.n_threads = params.n_threads;
        pca_params.n_batch = params.n_pca_batch;
        pca_params.n_threads = params.cpuparams.n_threads;
        pca_params.n_batch = params.n_pca_batch;
        pca_params.n_iterations = params.n_pca_iterations;
        PCA::run_pca(pca_params, ctx_train.v_diff, ctx_train.v_final);
    } else {
2 changes: 1 addition & 1 deletion examples/export-lora/export-lora.cpp
@@ -410,7 +410,7 @@ int main(int argc, char ** argv) {

    g_verbose = (params.verbosity == 1);
    try {
        lora_merge_ctx ctx(params.model, params.lora_adapters, params.lora_outfile, params.n_threads);
        lora_merge_ctx ctx(params.model, params.lora_adapters, params.lora_outfile, params.cpuparams.n_threads);
        ctx.run_merge();
    } catch (const std::exception & err) {
        fprintf(stderr, "%s\n", err.what());