This project provides Python scripts to interact with the Ollama and vLLM APIs. The `ollama_app.py` script lets you call models such as `llama3:8b`, `gemma2:9b`, and `phi-3` through the Ollama API, while vLLM models can be hosted and accessed through Docker. For Ollama and vLLM support, see `ollama_app.py` and `vllm_app.py` respectively.
The `ollama_app.py` script generates text completions using the models served by the Ollama API. `codestral:22b` is recommended for use as a coding assistant; a minimal example follows the model list below. The following models are configured in the script:

- `codestral:22b`
- `llama3.1:8b`
- `gemma2:9b`
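As a quick illustration of the coding-assistant use case, here is a minimal, non-streaming sketch that sends a single prompt to `codestral:22b` through Ollama's `/api/generate` endpoint. It assumes an Ollama instance reachable on the default port 11434 at `localhost`; adjust the host to match your setup (the full script later in this README uses a different host).

```python
import requests

# Minimal, non-streaming request to Ollama's generate endpoint.
# Assumes Ollama is running locally on its default port (11434).
OLLAMA_URL = "http://localhost:11434/api/generate"

payload = {
    "model": "codestral:22b",
    "prompt": "Write a Python function that reverses a string.",
    "stream": False,  # return a single JSON object instead of NDJSON chunks
}

resp = requests.post(OLLAMA_URL, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["response"])
```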
Prerequisites:

- Python 3.6+
- The `requests` library
You can install the required library using pip:

```bash
pip install requests
```
- Edit the Script: Ensure that the `model_names` list contains the correct models you want to use.
- Run the Script:
```python
import requests
import json


def query_ollama(prompt, model_name, temperature=0):
    """Send a prompt to the Ollama API and return the generated text."""
    url = "http://140.119.164.212:11434/api/generate"
    headers = {"Content-Type": "application/json", "Accept": "application/json"}
    payload = {
        "model": model_name,
        "prompt": prompt,
        "options": {"temperature": temperature},  # generation options go under "options"
        "guided_json": "TEST_SCHEMA",  # placeholder: replace with your JSON schema string, or remove if your server does not accept this field
    }
    response = requests.post(url, headers=headers, data=json.dumps(payload))
    response.raise_for_status()

    content_type = response.headers.get("Content-Type", "")
    if content_type.startswith("application/x-ndjson"):
        # Streamed responses arrive as newline-delimited JSON; stitch the chunks together.
        ndjson_content = response.text.strip().split("\n")
        json_responses = [json.loads(line) for line in ndjson_content]
        return "".join(item["response"] for item in json_responses if "response" in item)
    elif content_type.startswith("application/json"):
        return response.json()
    else:
        raise Exception(f"Unexpected content type: {content_type}")


model_names = ["codestral:22b", "llama3.1:8b", "gemma2:9b"]
prompt = "Translate English to French: 'Hello, how are you?'"
response = query_ollama(prompt, model_names[2])
print(response)
```
- Execute the Script:

```bash
python ollama_app.py
```
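If you prefer to consume tokens as they are produced instead of waiting for the full completion, the sketch below streams the NDJSON response line by line. It reuses the same endpoint as `query_ollama` above; the `stream=True` flag on `requests.post` and the incremental printing are illustrative choices, not part of the original script.

```python
import requests
import json


def stream_ollama(prompt, model_name):
    """Stream a completion from Ollama, printing tokens as they arrive."""
    url = "http://140.119.164.212:11434/api/generate"  # same endpoint as query_ollama above
    payload = {"model": model_name, "prompt": prompt, "stream": True}

    with requests.post(url, json=payload, stream=True) as response:
        response.raise_for_status()
        pieces = []
        # Ollama streams newline-delimited JSON objects; each carries a partial "response".
        for line in response.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            piece = chunk.get("response", "")
            print(piece, end="", flush=True)
            pieces.append(piece)
            if chunk.get("done"):
                break
        print()
        return "".join(pieces)


stream_ollama("Explain list comprehensions in one sentence.", "llama3.1:8b")
```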
To host models with vLLM, you will need:

- Docker
- NVIDIA Docker Runtime (for GPU support)
- A Hugging Face API token
- Pull the vLLM Docker Image:

```bash
docker pull vllm/vllm-openai:latest
```
- Run the Docker Container:

Ensure you have your Hugging Face API token and replace `YOUR_HUGGING_FACE_HUB_TOKEN` with your actual token.

```bash
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -v /path/to/your/model_dir:/root/model_config \
    --env "HUGGING_FACE_HUB_TOKEN=YOUR_HUGGING_FACE_HUB_TOKEN" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model /root/model_config
```
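Once the container is running, you can confirm that the OpenAI-compatible server is reachable before sending real prompts. The sketch below simply lists the models the server exposes via `/v1/models`; the use of `localhost:8000` assumes the port mapping from the `docker run` command above.

```python
import requests

# Quick health check: ask the vLLM OpenAI-compatible server which models it serves.
# Assumes the container maps port 8000 to localhost as in the docker run command above.
resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()

for model in resp.json().get("data", []):
    print(model["id"])  # this id is the value to pass as "model" in completion requests
```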
You can interact with the vLLM API using HTTP requests. Below is an example Python script for querying the vLLM API:
```python
import requests


def query_vllm(prompt, model_name, max_tokens=50):
    """Send a completion request to the vLLM OpenAI-compatible server."""
    url = "http://localhost:8000/v1/completions"
    headers = {
        "Content-Type": "application/json",
        # Only needed if the server enforces an API key.
        "Authorization": "Bearer YOUR_HUGGING_FACE_HUB_TOKEN",
    }
    payload = {
        "model": model_name,
        "prompt": prompt,
        "max_tokens": max_tokens,
    }
    response = requests.post(url, headers=headers, json=payload)
    response.raise_for_status()
    return response.json()


prompt = "Once upon a time"
# The model name must match what the server reports via /v1/models
# (with the docker command above, that is the --model path unless --served-model-name is set).
model_name = "llama3.1:8b"
response = query_vllm(prompt, model_name)
print(response)
```
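vLLM's OpenAI-compatible server also exposes a chat endpoint. The sketch below sends a chat-style request to `/v1/chat/completions`; as with `query_vllm`, the `localhost:8000` address and the model name are assumptions that must match your running container.

```python
import requests


def chat_vllm(messages, model_name, max_tokens=100):
    """Send a chat-style request to the vLLM OpenAI-compatible chat endpoint."""
    url = "http://localhost:8000/v1/chat/completions"
    payload = {
        "model": model_name,    # must match the id reported by /v1/models
        "messages": messages,   # OpenAI-style list of {"role", "content"} dicts
        "max_tokens": max_tokens,
    }
    response = requests.post(url, json=payload)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize what vLLM is in one sentence."},
]
print(chat_vllm(messages, "llama3.1:8b"))
```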
This project provides a straightforward way to interact with text generation models from Ollama and vLLM. Ensure you have the required dependencies installed and configured correctly to use the provided scripts effectively.
For any issues or contributions, feel free to open an issue or a pull request on the project's repository.