This repository was archived by the owner on Oct 31, 2025. It is now read-only.

Rework config to be more native / allow customization #135

@owtaylor

Description

[ Comment moved from https://github.com//issues/119#issuecomment-2902226353 ]

Use cases

Reasons that a user might want to customize the models:

  • Using a different Granite autocomplete model. Right now, there are multiple reasonable options:

    granite-3.3-8b-base: this is the best Granite model for autocomplete. If someone has a fast GPU, they probably want to use this.
    granite-3.3-2b-base: worse, but faster. Best model for mid-range Macs.
    granite-3.3-8b-instruct: quality is similar to 2b-base, speed similar to 8b-base. If you have a GPU with limited memory and only have room for one Granite model, this can make sense.

    In the future, it's also possible there will be multiple reasonable choices for chat models.

  • Using Ollama on a different port

  • Using hosted models. If you have access to an instance of vLLM running granite3.3:8b, use that instead of Ollama.

  • Using old models - I don't care about this one. The Granite models have a track record of improving over time, and I don't want users torturing themselves trying to figure out whether granite3.2:8b is better than granite3.3:2b - because model outputs are inherently random (even at temperature 0), you just can't tell based on a small number of prompts. If someone really wants to investigate, they can always configure the models themselves. (See below)

  • Using third-party models. Not our emphasis, but users (or at least Granite.Code developers...) will want to compare.

Proposed way it looks

A basic principle should be that selecting models and customizing models feels like an extension of the upstream UI rather than something alien to it.

To change your autocomplete model, you edit ~/.granite-code/models/autocomplete.yaml to change:

  name: Granite.Code autocomplete model
  version: 1.0.0
  schema: v1
  models:
-    - uses: granite.code/autocomplete@default
+    - uses: granite.code/granite-3.3:8b-base

To use a different Ollama port, you edit ~/.granite-code/models/{autocomplete,chat,embed}.yaml and add:

  name: Granite.Code chat model
  version: 1.0.0
  schema: v1
  models:
    - uses: granite.code/autocomplete@default
      override:
        apiBase: ollama.local:11434

(OR we add a setting for this, OR we simply honor the OLLAMA_HOST environment variable - but this would be the general mechanism for overrides)
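
To illustrate that override: is meant as the general escape hatch rather than a port-only feature, here is a minimal sketch that also swaps the underlying model. The field names follow the config.yaml schema used above; the granite.code/chat@default reference is assumed by analogy with the autocomplete example, and the host and model tag are placeholders:

  name: Granite.Code chat model
  version: 1.0.0
  schema: v1
  models:
    - uses: granite.code/chat@default
      override:
        apiBase: ollama.local:11434
        model: granite3.3:2b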

To use a hosted model, you replace ~/.granite-code/models/{autocomplete,chat,embed}.yaml with your own content.
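
For example, a replacement ~/.granite-code/models/chat.yaml pointing at a hosted, OpenAI-compatible vLLM endpoint might look roughly like this - the provider/apiBase/roles field names are assumed from the upstream config.yaml model schema, and the URL and model id are placeholders:

  name: My hosted Granite chat model
  version: 1.0.0
  schema: v1
  models:
    - name: Granite 3.3 8b Instruct (hosted)
      provider: openai
      model: ibm-granite/granite-3.3-8b-instruct
      apiBase: https://vllm.example.com/v1
      roles:
        - chat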

To stop using a hosted model, you delete those files, and they will be recreated with the default content.

Notes:

  • This requires a pretty simple code change to provide our own RegistryClient when unrolling the YAML file
  • An alternative would be to use uses: ./default-models/granite-autocomplete.yaml, which avoids the code change and lets people actually open that file and see what is in it (see the sketch after these notes). Might be better.
  • When you are using a hosted model, you want to completely replace the file rather than use override: to repoint it; otherwise, changes to the default model would break users' configurations.
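
As a rough sketch of that file-path alternative, applied to the autocomplete default from above (assuming a relative uses: reference resolves against the shipped defaults directory, which is exactly the part that would need verifying):

  name: Granite.Code autocomplete model
  version: 1.0.0
  schema: v1
  models:
    - uses: ./default-models/granite-autocomplete.yaml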

Metadata

Assignees: No one assigned
Labels: enhancement (New feature or request)
Status: In Progress
Milestone: No milestone
Development: No branches or pull requests