llm-benchmarking

Here are 18 public repositories matching this topic...

robertvacareanu / llm4regression

Examining how large language models (LLMs) perform across various synthetic regression tasks when given (input, output) examples in their context, without any parameter update

linear-regression sklearn regression regression-models large-language-models llm llms llm-inference llm-benchmarking

Updated Sep 10, 2024
Python

lakeraai / pint-benchmark

Star

A benchmark for prompt injection detection systems.

benchmark llm prompt-injection llm-security llm-benchmarking

Updated Sep 10, 2024
Jupyter Notebook

lechmazur / confabulations

Star

Hallucinations (Confabulations) Document-Based Benchmark for RAG

benchmark leaderboard gemini llama language-model claude rag hallucinations ai-evaluation llm llm-benchmarking gpt-4o o1-mini o1-preview confabulations

Updated Nov 5, 2024
HTML

asimsinan / LLM-Research

Star

A collection of LLM related papers, thesis, tools, datasets, courses, open source models, benchmarks

arxiv-papers large-language-models llm llms llm-datasets llm-tools buyuk-dil-modelleri llm-research llm-theses llm-benchmarking llm-frameworks

Updated Oct 8, 2024
Python

MJ-Bench / MJ-Bench

Star

Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"

reward-models multimodal-foundation-model llm-benchmarking llm-as-a-judge multimodal-judge

Updated Nov 19, 2024
Jupyter Notebook

AKSW / LLM-KG-Bench

Star

LLM-KG-Bench is a Framework and task collection for automated benchmarking of Large Language Models (LLMs) on Knowledge Graph (KG) related tasks.

sparql rdf knowledge-graph large-language-models llm llm-benchmarking

Updated Dec 3, 2024
Python

aws-samples / fm-leaderboarder

Star

FM-Leaderboard-er allows you to create leaderboard to find the best LLM/prompt for your own business use case based on your data, task, prompts

llm-evaluation llm-evaluation-framework llm-benchmarking

Updated Oct 31, 2024
Python

An app and set of methodologies designed to evaluate the performance of various Large Language Models (LLMs) on the text-to-SQL task. Our goal is to offer a standardized way to measure how well these models can generate SQL queries from natural language descriptions

benchmark text-to-sql llm-benchmarking

Updated Aug 29, 2024
Jupyter Notebook

lechmazur / deception

Star

Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation metrics.

nlp machine-learning gemini llama language-model model-evaluation ai-safety mistral claude disinformation ai-security ai-benchmarks ai-evaluation llm llm-benchmarking gpt4o

Updated Oct 22, 2024

AUCOHL / RTL-Repo

Star

RTL-Repo: A Benchmark for Evaluating LLMs on Large-Scale RTL Design Projects - IEEE LAD'24

verilog rtl-design llm llm-benchmarking

Updated Jun 5, 2024
Python

Cristian-Curaba / CryptoFormalEval

Star

We introduce a benchmark for testing how well LLMs can find vulnerabilities in cryptographic protocols. By combining LLMs with symbolic reasoning tools like Tamarin, we aim to improve the efficiency and thoroughness of protocol analysis, paving the way for future AI-powered cybersecurity defenses.

cryptography communication-protocol evaluation vulnerability-detection llm-reasoning llm-benchmarking llm-based-agents

Updated Nov 4, 2024
Haskell

redblock-ai / parrot-python

Star

PARROT (Performance Assessment of Reasoning and Responses On Trivia) is a novel benchmarking framework designed to evaluate Large Language Models (LLMs) on real-world, complex, and ambiguous QA tasks.

benchmarking-framework llm-inference llm-datasets llm-qa-document llm-benchmarking