Gauge - LLM Evaluation


Gauge is a Python library for evaluating and comparing language models (LLMs). Compare models based on their performance on complex and custom tasks, alongside numeric measurements like latency and cost.

How does it work?

Gauge uses a model-on-model approach to evaluate LLMs qualitatively. An advanced arbiter model (GPT-4) evaluates the performance of smaller LLMs on specific tasks, providing a numeric score based on their output. This allows users to create custom benchmarks for their tasks and obtain qualitative evaluations of different LLMs. Gauge is useful for evaluating and ranking LLMs on a wide range of complex and subjective tasks, such as creative writing, staying in character, formatting outputs, extracting information, and translating text.
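
For intuition, the sketch below illustrates the arbiter idea; it is not Gauge's internal code. It assumes the legacy openai.ChatCompletion interface (matching the openai.api_key style used later in this README), and the prompt wording and 0-10 scale are illustrative assumptions.

import openai

# Illustrative arbiter call (an assumption, not Gauge's implementation):
# ask a stronger model to grade a candidate model's answer to a task.
def arbiter_score(query, candidate_output, arbiter_model="gpt-4"):
    prompt = (
        "You are judging how well a language model answered a task.\n"
        f"Task: {query}\n"
        f"Answer: {candidate_output}\n"
        "Reply with a score from 0 to 10 and a one-sentence explanation."
    )
    response = openai.ChatCompletion.create(
        model=arbiter_model,
        messages=[{"role": "user", "content": prompt}],
    )
    # The arbiter's reply contains the score and its explanation.
    return response["choices"][0]["message"]["content"]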

Features

  • Evaluate and compare multiple LLMs using custom benchmarks
  • Straightforward API for running and evaluating LLMs
  • Extensible architecture for including additional models

Installation


To install Gauge, run the following command:

pip install gauge-llm

Before using Gauge, set the HUGGINGFACE_TOKEN and REPLICATE_API_TOKEN environment variables, and set openai.api_key after importing the openai library:

import os
import openai

os.environ["HUGGINGFACE_TOKEN"] = "your_huggingface_token"
os.environ["REPLICATE_API_TOKEN"] = "your_replicate_api_token"
openai.api_key = "your_openai_api_key"

Examples

Information Extraction: Historical Event

import gauge

query = "Extract the main points from the following paragraph: On July 20, 1969, American astronauts Neil Armstrong and Buzz Aldrin became the first humans to land on the Moon. Armstrong stepped onto the lunar surface and described the event as 'one small step for man, one giant leap for mankind.'"
gauge.evaluate(query)

Staying in Character: Detective's Monologue

import gauge

query = "Write a monologue for a detective character in a film noir setting."
gauge.evaluate(query)

Translation: English to Spanish

import gauge

query = "Translate the following English text to Spanish: 'The quick brown fox jumps over the lazy dog.'"
gauge.evaluate(query)

Formatting Output: Recipe Conversion

import gauge

query = "Convert the following recipe into a shopping list: 2 cups flour, 1 cup sugar, 3 eggs, 1/2 cup milk, 1/4 cup butter."
gauge.evaluate(query)

Each example displays a table of results for every model, including its name, response, score, explanation, latency, and cost.

API

gauge.run(model, query)

Runs the specified model with the given query and returns the output, latency, and cost.

Parameters:

  • model: A dictionary containing the model's information (type, name, id, and price_per_second).
  • query: The input query for the model.

Returns:

  • output: The generated output from the model.
  • latency: The time taken to run the model.
  • cost: The cost of running the model.
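
A hypothetical call might look like the following. The dictionary values are made up for illustration, and the three return values are assumed to come back as a tuple.

import gauge

# Hypothetical model entry -- the fields follow the description above,
# but these particular values are illustrative only.
model = {
    "type": "replicate",
    "name": "vicuna-13b",
    "id": "replicate/vicuna-13b",
    "price_per_second": 0.0023,
}

output, latency, cost = gauge.run(model, "Summarize Hamlet in two sentences.")
print(output, latency, cost)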

gauge.evaluate(query)

Evaluates multiple LLMs using the given query and displays a table with the results, including the model's name, response, score, explanation, latency, and cost.

Parameters:

  • query: The input query for the models.

Contributing

Contributions to Gauge are welcome! If you'd like to add a new model or improve the existing code, please submit a pull request. If you encounter issues or have suggestions, open an issue on GitHub.

License

Gauge is released under the MIT License.

Acknowledgements

This project was created by Killian Lucas and Roger Hu during the AI Tinkerers Summer Hackathon, which took place on June 10th, 2023 in Seattle at Create 33. The event was sponsored by AWS Startups, Cohere, Madrona Venture Group, and supported by Pinecone, Weaviate, and Blueprint AI. Gauge made it to the semi-finals.
