Merge branch 'ggerganov:master' into master
apicalshark authored Oct 31, 2024
2 parents 2ac815f + 61408e7 commit afdd774
Showing 7 changed files with 541 additions and 77 deletions.
6 changes: 3 additions & 3 deletions convert_lora_to_gguf.py
@@ -230,7 +230,7 @@ def get_base_tensor_name(lora_tensor_name: str) -> str:

def parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(
description="Convert a huggingface PEFT LoRA adapter to a GGML compatible file")
description="Convert a Hugging Face PEFT LoRA adapter to a GGUF file")
parser.add_argument(
"--outfile", type=Path,
help="path to write to; default: based on input. {ftype} will be replaced by the outtype.",
@@ -257,11 +257,11 @@ def parse_args() -> argparse.Namespace:
)
parser.add_argument(
"--base", type=Path, required=True,
help="directory containing base model file",
help="directory containing Hugging Face model config files (config.json, tokenizer.json) for the base model that the adapter is based on - only config is needed, actual model weights are not required",
)
parser.add_argument(
"lora_path", type=Path,
help="directory containing LoRA adapter file",
help="directory containing Hugging Face PEFT LoRA config (adapter_model.json) and weights (adapter_model.safetensors or adapter_model.bin)",
)

return parser.parse_args()
11 changes: 9 additions & 2 deletions examples/main/README.md
@@ -333,6 +333,15 @@ These options help improve the performance and memory usage of the LLaMA models.

For information about 4-bit quantization, which can significantly improve performance and reduce memory usage, please refer to llama.cpp's primary [README](../../README.md#prepare-and-quantize).

+## LoRA (Low-Rank Adaptation) adapters
+
+- `--lora FNAME`: Optional path to a LoRA adapter to use with scaling of 1.0. Can be mixed with `--lora-scaled` and can be repeated to use multiple adapters.
+- `--lora-scaled FNAME`: Optional path to a LoRA adapter with user-defined scaling. Can be mixed with `--lora` and can be repeated to use multiple adapters.
+
+You can add LoRA adapters using `--lora` or `--lora-scaled`. For example: `--lora my_adapter_1.gguf --lora my_adapter_2.gguf ...` or `--lora-scaled lora_task_A.gguf 0.5 --lora-scaled lora_task_B.gguf 0.5`.
+
+LoRA adapters should be in GGUF format. To convert from Hugging Face format use the `convert_lora_to_gguf.py` script. LoRA adapters are loaded separately and applied during inference - they are not merged with the main model. This means that mmap model loading is fully supported when using LoRA adapters. The old `--lora-base` flag has been removed now that merging is no longer performed.
+
## Additional Options

These options provide extra functionality and customization when running the LLaMA models:
@@ -341,6 +350,4 @@ These options provide extra functionality and customization when running the LLaMA models:
- `--verbose-prompt`: Print the prompt before generating text.
- `-mg i, --main-gpu i`: When using multiple GPUs this option controls which GPU is used for small tensors for which the overhead of splitting the computation across all GPUs is not worthwhile. The GPU in question will use slightly more VRAM to store a scratch buffer for temporary results. By default GPU 0 is used.
- `-ts SPLIT, --tensor-split SPLIT`: When using multiple GPUs this option controls how large tensors should be split across all GPUs. `SPLIT` is a comma-separated list of non-negative values that assigns the proportion of data that each GPU should get in order. For example, "3,2" will assign 60% of the data to GPU 0 and 40% to GPU 1. By default the data is split in proportion to VRAM but this may not be optimal for performance.
-- `--lora FNAME`: Apply a LoRA (Low-Rank Adaptation) adapter to the model (implies --no-mmap). This allows you to adapt the pretrained model to specific tasks or domains.
-- `--lora-base FNAME`: Optional model to use as a base for the layers modified by the LoRA adapter. This flag is used in conjunction with the `--lora` flag, and specifies the base model for the adaptation.
- `-hfr URL --hf-repo URL`: The url to the Hugging Face model repository. Used in conjunction with `--hf-file` or `-hff`. The model is downloaded and stored in the file provided by `-m` or `--model`. If `-m` is not provided, the model is auto-stored in the path specified by the `LLAMA_CACHE` environment variable or in an OS-specific local cache.
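
As a companion to the new README section above, here is a rough C++ sketch of what `--lora-scaled lora_task_A.gguf 0.5` amounts to through llama.cpp's C API: the adapter is loaded as a separate set of tensors and attached to the context with a scale, so the base model weights are never merged. The function names (`llama_lora_adapter_init`, `llama_lora_adapter_set`, `llama_load_model_from_file`, `llama_new_context_with_model`) and file paths are assumptions based on `llama.h` around the time of this commit and may differ in other revisions; treat this as an illustration, not a verified snippet.

```cpp
// Hedged sketch of the runtime equivalent of `--lora-scaled lora_task_A.gguf 0.5`.
// API names follow llama.h as of roughly this commit (assumption); paths are placeholders.
#include <cstdio>
#include "llama.h"

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (model == nullptr) {
        fprintf(stderr, "failed to load base model\n");
        return 1;
    }

    llama_context_params cparams = llama_context_default_params();
    llama_context * ctx = llama_new_context_with_model(model, cparams);

    // The adapter stays a separate set of tensors applied during inference,
    // so the (possibly mmap'd) base weights are never modified or merged.
    llama_lora_adapter * adapter = llama_lora_adapter_init(model, "lora_task_A.gguf");
    if (adapter != nullptr) {
        llama_lora_adapter_set(ctx, adapter, 0.5f);  // scale matches the CLI argument
    }

    // ... run tokenization / decoding as usual ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```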
4 changes: 4 additions & 0 deletions ggml/include/ggml-kompute.h
@@ -11,6 +11,8 @@
extern "C" {
#endif

+#define GGML_KOMPUTE_MAX_DEVICES 16
+
struct ggml_vk_device {
int index;
int type; // same as VkPhysicalDeviceType
@@ -41,6 +43,8 @@ GGML_API bool ggml_backend_is_kompute(ggml_backend_t backend);

GGML_API ggml_backend_buffer_type_t ggml_backend_kompute_buffer_type(int device);

+GGML_API ggml_backend_reg_t ggml_backend_kompute_reg(void);
+
#ifdef __cplusplus
}
#endif
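
For context on the header additions above, a minimal sketch of exercising the Kompute entry points declared in `ggml-kompute.h` follows. It assumes `ggml_backend_kompute_init(int device)` is declared in the collapsed part of this header and that a Vulkan-capable device 0 exists; the handle returned by the new `ggml_backend_kompute_reg()` is normally consumed by ggml's backend enumeration rather than used directly.

```cpp
// Minimal sketch, not part of the commit: assumes ggml_backend_kompute_init(int)
// is declared in the collapsed portion of ggml-kompute.h and device 0 exists.
#include <cstdio>
#include "ggml-backend.h"
#include "ggml-kompute.h"

int main() {
    // New in this commit: a registry handle used for backend enumeration.
    ggml_backend_reg_t reg = ggml_backend_kompute_reg();
    (void) reg;

    // Existing API: instantiate the backend on a device index, bounded by
    // the new GGML_KOMPUTE_MAX_DEVICES constant.
    ggml_backend_t backend = ggml_backend_kompute_init(0);
    if (backend == nullptr || !ggml_backend_is_kompute(backend)) {
        fprintf(stderr, "Kompute backend unavailable\n");
        return 1;
    }
    printf("initialized backend: %s\n", ggml_backend_name(backend));

    // Buffer type used to allocate device-side tensors for this device.
    ggml_backend_buffer_type_t buft = ggml_backend_kompute_buffer_type(0);
    (void) buft;

    ggml_backend_free(backend);
    return 0;
}
```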