chroma/README.md at main · chroma-core/chroma #678

irthomasthomas · 2024-03-04T12:24:06Z

chroma/README.md at main · chroma-core/chroma

chroma/README.md at main · chroma-core/chroma

Chroma - the open-source embedding database.
The fastest way to build Python or JavaScript LLM apps with memory!

|

pip install chromadb # python client
# for javascript, npm install chromadb!
# for client-server mode, chroma run --path /chroma_db_path

The core API is only 4 functions (run our 💡 Google Colab or Replit template):

import chromadb
# setup Chroma in-memory, for easy prototyping. Can add persistence easily!
client = chromadb.Client()

# Create collection. get_collection, get_or_create_collection, delete_collection also available!
collection = client.create_collection("all-my-documents")

# Add docs to the collection. Can also update and delete. Row-based API coming soon!
collection.add(
    documents=["This is document1", "This is document2"], # we handle tokenization, embedding, and indexing automatically. You can skip that and add your own embeddings as well
    metadatas=[{"source": "notion"}, {"source": "google-docs"}], # filter on these!
    ids=["doc1", "doc2"], # unique for each doc
)

# Query/search 2 most similar results. You can also .get by id
results = collection.query(
    query_texts=["This is a query document"],
    n_results=2,
    # where={"metadata_field": "is_equal_to_this"}, # optional filter
    # where_document={"$contains":"search_string"}  # optional filter
)

Features

Simple: Fully-typed, fully-tested, fully-documented == happiness
Integrations: 🦜️🔗 LangChain (python and js), 🦙 LlamaIndex and more soon
Dev, Test, Prod: the same API that runs in your python notebook, scales to your cluster
Feature-rich: Queries, filtering, density estimation and more
Free & Open Source: Apache 2.0 Licensed

Use case: ChatGPT for ______

For example, the "Chat your data" use case:

Add documents to your database. You can pass in your own embeddings, embedding function, or let Chroma embed them for you.
Query relevant documents with natural language.
Compose documents into the context window of an LLM like GPT3 for additional summarization or analysis.

Embeddings?

What are embeddings?

Read the guide from OpenAI
Literal: Embedding something turns it from image/text/audio into a list of numbers. 🖼️ or 📄 => [1.2, 2.1, ....]. This process makes documents "understandable" to a machine learning model.
By analogy: An embedding represents the essence of a document. This enables documents and queries with the same essence to be "near" each other and therefore easy to find.
Technical: An embedding is the latent-space position of a document at a layer of a deep neural network. For models trained specifically to embed data, this is the last layer.
A small example: If you search your photos for "famous bridge in San Francisco". By embedding this query and comparing it to the embeddings of your photos and their metadata - it should return photos of the Golden Gate Bridge.

Embeddings databases (also known as vector databases) store embeddings and allow you to search by nearest neighbors rather than by substrings like a traditional database. By default, Chroma uses Sentence Transformers to embed for you but you can also use OpenAI embeddings, Cohere (multilingual) embeddings, or your own.

View on GitHub

Suggested labels

The text was updated successfully, but these errors were encountered:

irthomasthomas · 2024-03-04T12:24:08Z

Related issues

#325: simonw/llm pattern to migrate llm cli embeddings to chromadb

### Details

Similarity score: 0.93 > **Note:** > > - [ ] [Support for plugins that implement vector indexes · Issue #216 · simonw/llm](https://github.com/simonw/llm/issues/216) > > @simonw happy to help here > > Here is how Chroma's roadmap aligns with your goals. > > The internals of Chroma are set up to have pluggable indexes on top of collections - we haven't yet exposed this to end users. But will fairly soon. We also plan to have a "smart index" that does KNN brute force and then a cutover to ANN. > > Indexes over multiple collections - while I do understand the use case - we've chosen to not prioritize this as it adds a lot of DX complexity. We instead encourage users to eat the "read amplification" and query multiple collections/indexes and then cull/rerank themselves client side. > > You may also enjoy reading this chroma proposal where we have put a lot of thought into the pipelines to support index/collection creation and access - [chroma-core/chroma#1110](https://github.com/chroma-core/chroma/issues/1110) > > @dave1010 mentioned this issue > > Add option for RAG-style augmentation [dave1010/clipea#1](https://github.com/dave1010/clipea/issues/1) > > Open > > @IvanVas commented > > If someone ever need to move data from llm to Chroma, below is a simple script to do so. Needs a little more work to productise it though. > > @simonw, hope it would help if you ever need to create smth like llm embed-multi migrate --chroma > > ```python > import sqlite3 > import struct > import chromadb > > def decode(binary): > if not isinstance(binary, bytes): > raise ValueError("Binary data must be of bytes type") > return struct.unpack("<" + "f" * (len(binary) // 4), binary) > > client = chromadb.PersistentClient(path="chroma.db") # FIXME > > collectionName = "collection" # FIXME > collection = client.get_or_create_collection(collectionName) > > # Path to your SQLite database > db_path = "/Users/username/Library/Application Support/io.datasette.llm/embeddings.db" # FIXME > > # Query to retrieve embeddings > llm_collection = "fixme" # FIXME > query = f""" > SELECT id, embedding, content FROM embeddings WHERE collection_id = ( > SELECT id FROM collections WHERE name = "{llm_collection}" > ) > """ > > def parse_id(id_str): > # FIXME parse metadata > > return { > "meta1": "meta1", > } > > def main(db_path, batch_size=1000): > # Connect to the SQLite database > conn = sqlite3.connect(db_path) > cursor = conn.cursor() > > # Query the embeddings table > cursor.execute(query) > rows = cursor.fetchall() # FIXME assumes there is not MUCH data, works fine for x100k records > > # Initialize batch variables > batch_embeddings = [] > batch_documents = [] # You'll need to adjust how you handle documents > batch_metadatas = [] # You'll need to adjust how you handle metadatas > batch_ids = [] > > for i, (id, embedding, content) in enumerate(rows): > idParsed = parse_id(id) > > # Decode the binary data > decoded_embedding = decode(embedding) > > # Append to the batch > batch_embeddings.append(list(decoded_embedding)) > batch_documents.append(content) > batch_metadatas.append({"meta1": idParsed['meta1']}) # FIXME > batch_ids.append(id) > > # When batch size is reached or end of rows > if len(batch_embeddings) == batch_size or i == len(rows) - 1: > collection.add( > embeddings=batch_embeddings, > documents=batch_documents, > metadatas=batch_metadatas, > ids=batch_ids > ) > > print(f"Added {len(batch_embeddings)} rows to chromaDb (Total: {i + 1})") > > # Reset batch variables > batch_embeddings = [] > batch_documents = [] > batch_metadatas = [] > batch_ids = [] > > # Close the database connection > conn.close() > > if __name__ == "__main__": > main(db_path) > ``` > > #### Suggested labels > > - { "key": "llm-inference-engines", "value": "Software and tools for running inference on Large Language Models (LLMs)" } > - { "key": "llm-quantization", "value": "All about quantized LLM models and their serving" }

#625: unsloth/README.md at main · unslothai/unsloth

### Details

Similarity score: 0.91 - [ ] [unsloth/README.md at main · unslothai/unsloth](https://github.com/unslothai/unsloth/blob/main/README.md?plain=1)

unsloth/README.md at main · unslothai/unsloth

Finetune Mistral, Gemma, Llama 2-5x faster with 70% less memory!

✨ Finetune for Free

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

Unsloth supports	Free Notebooks	Performance	Memory use
Gemma 7b	▶️ Start on Colab	2.4x faster	58% less
Mistral 7b	▶️ Start on Colab	2.2x faster	62% less
Llama-2 7b	▶️ Start on Colab	2.2x faster	43% less
TinyLlama	▶️ Start on Colab	3.9x faster	74% less
CodeLlama 34b A100	▶️ Start on Colab	1.9x faster	27% less
Mistral 7b 1xT4	▶️ Start on Kaggle	5x faster*	62% less
DPO - Zephyr	▶️ Start on Colab	1.9x faster	19% less

This conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.

🦥 Unsloth.ai News

📣 Gemma 7b on 6T tokens now works. And Gemma 2b notebook
📣 Added conversational notebooks and raw text notebooks
📣 2x faster inference added for all our models
📣 DPO support is now included. More info on DPO
📣 We did a blog with 🤗Hugging Face and are in their official docs! Check out the SFT docs and DPO docs
📣 Download models 4x faster from 🤗Hugging Face. Eg: unsloth/mistral-7b-bnb-4bit

🔗 Links and Resources

Type	Links
📚 Wiki & FAQ	Read Our Wiki
📜 Documentation	Read The Doc
💾 Installation	unsloth/README.md
Twitter (aka X)	Follow us on X
🥇 Benchmarking	Performance Tables
🌐 Released Models	Unsloth Releases
✍️ Blog	Read our Blogs

⭐ Key Features

All kernels written in OpenAI's Triton language. Manual backprop engine.
0% loss in accuracy - no approximation methods - all exact.
No change of hardware. Supports NVIDIA GPUs since 2018+. Minimum CUDA Capability 7.0 (V100, T4, Titan V, RTX 20, 30, 40x, A100, H100, L40 etc) Check your GPU! GTX 1070, 1080 works, but is slow.
Works on Linux and Windows via WSL.
Supports 4bit and 16bit QLoRA / LoRA finetuning via bitsandbytes.
Open source trains 5x faster - see Unsloth Pro for 30x faster training!
If you trained a model with 🦥Unsloth, you can use this cool sticker!

🥇 Performance Benchmarking

For the full list of reproducable benchmarking tables, go to our website

1 A100 40GB	🤗Hugging Face	Flash Attention	🦥Unsloth Open Source	🦥Unsloth Pro
Alpaca	1x	1.04x	1.98x	15.64x
LAION Chip2	1x	0.92x	1.61x	20.73x
OASST	1x	1.19x	2.17x	14.83x
Slim Orca	1x	1.18x	2.22x	14.82x

Benchmarking table below was conducted by 🤗Hugging Face.

Free Colab T4	Dataset	🤗Hugging Face	Pytorch 2.1.1	🦥Unsloth	🦥 VRAM reduction
Llama-2 7b	OASST	1x	1.19x	1.95x	-43.3%
Mistral 7b	Alpaca	1x	1.07x	1.56x	-13.7%
Tiny Llama 1.1b	Alpaca	1x	2.06x	3.87x	-73.8%
DPO with Zephyr	Ultra Chat	1x	1.09x	1.55x	-18.6%

View on GitHub

Suggested labels

#396: astra-assistants-api: A backend implementation of the OpenAI beta Assistants API

### Details

Similarity score: 0.89 - [ ] [datastax/astra-assistants-api: A backend implementation of the OpenAI beta Assistants API](https://github.com/datastax/astra-assistants-api)

Astra Assistant API Service

A drop-in compatible service for the OpenAI beta Assistants API with support for persistent threads, files, assistants, messages, retrieval, function calling and more using AstraDB (DataStax's db as a service offering powered by Apache Cassandra and jvector).

Compatible with existing OpenAI apps via the OpenAI SDKs by changing a single line of code.

Getting Started

Create an Astra DB Vector database
Replace the following code:

client = OpenAI(
    api_key=OPENAI_API_KEY,
)

with:

client = OpenAI(
    base_url="https://open-assistant-ai.astra.datastax.com/v1", 
    api_key=OPENAI_API_KEY,
    default_headers={
        "astra-api-token": ASTRA_DB_APPLICATION_TOKEN,
    }
)

Or, if you have an existing astra db, you can pass your db_id in a second header:

client = OpenAI(
    base_url="https://open-assistant-ai.astra.datastax.com/v1", 
    api_key=OPENAI_API_KEY,
    default_headers={
        "astra-api-token": ASTRA_DB_APPLICATION_TOKEN,
        "astra-db-id": ASTRA_DB_ID
    }
)

Create an assistant

assistant = client.beta.assistants.create(
  instructions="You are a personal math tutor. When asked a math question, write and run code to answer the question.",
  model="gpt-4-1106-preview",
  tools=[{"type": "retrieval"}]
)

By default, the service uses AstraDB as the database/vector store and OpenAI for embeddings and chat completion.

Third party LLM Support

We now support many third party models for both embeddings and completion thanks to litellm. Pass the api key of your service using api-key and embedding-model headers.

For AWS Bedrock, you can pass additional custom headers:

client = OpenAI(
    base_url="https://open-assistant-ai.astra.datastax.com/v1", 
    api_key="NONE",
    default_headers={
        "astra-api-token": ASTRA_DB_APPLICATION_TOKEN,
        "embedding-model": "amazon.titan-embed-text-v1",
        "LLM-PARAM-aws-access-key-id": BEDROCK_AWS_ACCESS_KEY_ID,
        "LLM-PARAM-aws-secret-access-key": BEDROCK_AWS_SECRET_ACCESS_KEY,
        "LLM-PARAM-aws-region-name": BEDROCK_AWS_REGION,
    }
)

and again, specify the custom model for the assistant.

assistant = client.beta.assistants.create(
    name="Math Tutor",
    instructions="You are a personal math tutor. Answer questions briefly, in a sentence or less.",
    model="meta.llama2-13b-chat-v1",
)

Additional examples including third party LLMs (bedrock, cohere, perplexity, etc.) can be found under examples.

To run the examples using poetry:

Create a .env file in this directory with your secrets.
Run:

poetry install
poetry run python examples/completion/basic.py
poetry run python examples/retreival/basic.py
poetry run python examples/function-calling/basic.py

Coverage

See our coverage report here.

Roadmap

Support for other embedding models and LLMs
Function calling
Pluggable RAG strategies
Streaming support

Suggested labels

{ "key": "llm-function-calling", "value": "Integration of function calling with Large Language Models (LLMs)" }

#386: SciPhi/AgentSearch-V1 · Datasets at Hugging Face

### Details

Similarity score: 0.88 - [ ] [SciPhi/AgentSearch-V1 · Datasets at Hugging Face](https://huggingface.co/datasets/SciPhi/AgentSearch-V1)

Getting Started

The AgentSearch-V1 dataset is a comprehensive collection of over one billion embeddings, produced using jina-v2-base. It includes more than 50 million high-quality documents and over 1 billion passages, covering a vast range of content from sources such as Arxiv, Wikipedia, Project Gutenberg, and includes carefully filtered Creative Commons (CC) data. Our team is dedicated to continuously expanding and enhancing this corpus to improve the search experience. We welcome your thoughts and suggestions – please feel free to reach out with your ideas!

To access and utilize the AgentSearch-V1 dataset, you can stream it via HuggingFace with the following Python code:

from datasets import load_dataset
import json
import numpy as np

# To stream the entire dataset:
ds = load_dataset("SciPhi/AgentSearch-V1", data_files="**/*", split="train", streaming=True)

# Optional, stream just the "arxiv" dataset
# ds = load_dataset("SciPhi/AgentSearch-V1", data_files="**/*", split="train", data_files="arxiv/*", streaming=True)

# To process the entries:
for entry in ds:
    embeddings = np.frombuffer(
        entry['embeddings'], dtype=np.float32
    ).reshape(-1, 768)
    text_chunks = json.loads(entry['text_chunks'])
    metadata = json.loads(entry['metadata'])
    print(f'Embeddings:\n{embeddings}\n\nChunks:\n{text_chunks}\n\nMetadata:\n{metadata}')
    break

A full set of scripts to recreate the dataset from scratch can be found here. Further, you may check the docs for details on how to perform RAG over AgentSearch.

Languages

English.

Dataset Structure

The raw dataset structure is as follows:

{
    "url": ...,
    "title": ...,
    "metadata": {"url": "...", "timestamp": "...", "source": "...", "language": "..."},
    "text_chunks": ...,
    "embeddings": ...,
    "dataset": "book" | "arxiv" | "wikipedia" | "stack-exchange" | "open-math" | "RedPajama-Data-V2"
}

Dataset Creation

This dataset was created as a step towards making humanities most important knowledge openly searchable and LLM optimal. It was created by filtering, cleaning, and augmenting locally publicly available datasets.

To cite our work, please use the following:

@software{SciPhi2023AgentSearch,
author = {SciPhi},
title = {AgentSearch [ΨΦ]: A Comprehensive Agent-First Framework and Dataset for Webscale Search},
year = {2023},
url = {https://github.com/SciPhi-AI/agent-search}
}

Source Data

@online{wikidump,
author = "Wikimedia Foundation",
title = "Wikimedia Downloads",
url = "https://dumps.wikimedia.org"
}

@misc{paster2023openwebmath,
title={OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text},
author={Keiran Paster and Marco Dos Santos and Zhangir Azerbayev and Jimmy Ba},
year={2023},
eprint={2310.06786},
archivePrefix={arXiv},
primaryClass={cs.AI}
}

@software{together2023redpajama,
author = {Together Computer},
title = {RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset},
month = April,
year = 2023,
url = {https://github.com/togethercomputer/RedPajama-Data}
}

License

Please refer to the licenses of the data subsets you use.

Open-Web (Common Crawl Foundation Terms of Use)
Books: the_pile_books3 license and pg19 license
ArXiv Terms of Use
Wikipedia License
StackExchange license on the Internet Archive

Suggested labels

{ "key": "knowledge-dataset", "value": "A dataset with one billion embeddings from various sources, such as Arxiv, Wikipedia, Project Gutenberg, and carefully filtered Creative Commons data" }

#74: Sqlite WASM and Github Pages

### Details

Similarity score: 0.88 - [ ] [merge sqlite - Kagi Search](https://kagi.com/search?q=merge+sqlite&from_date=2021-01-01) - [ ] [sql - Fastest Way merge two SQLITE Databases - Stack Overflow](https://stackoverflow.com/questions/9349659/fastest-way-merge-two-sqlite-databases) - [ ] [simonw/sqlite-utils: Python CLI utility and library for manipulating SQLite databases](https://github.com/simonw/sqlite-utils) - [ ] [sqlite-utils](https://sqlite-utils.datasette.io/en/stable/) - [ ] [GitHub Pages | Websites for you and your projects, hosted directly from your GitHub repository. Just edit, push, and your changes are live.](https://pages.github.com/) - [ ] [sqlite wasm sql.js github pages - Kagi Search](https://kagi.com/search?q=sqlite+wasm+sql.js+github+pages) - [ ] [phiresky/sql.js-httpvfs: Hosting read-only SQLite databases on static file hosters like Github Pages](https://github.com/phiresky/sql.js-httpvfs) - [ ] [sqlite3 WebAssembly & JavaScript Documentation Index](https://sqlite.org/wasm/doc/trunk/index.md) - [ ] [phiresky/youtube-sponsorship-stats/](https://github.com/phiresky/youtube-sponsorship-stats/) - [ ] [phiresky/world-development-indicators-sqlite/](https://github.com/phiresky/world-development-indicators-sqlite/) - [ ] [Show HN: Fully-searchable Library Genesis on IPFS | Hacker News](https://news.ycombinator.com/item?id=28585208) - [ ] [nalgeon/sqlime: Online SQLite playground](https://github.com/nalgeon/sqlime) - [ ] [nalgeon/sqlean.js: Browser-based SQLite with extensions](https://github.com/nalgeon/sqlean.js) - [ ] [Sqlime - SQLite Playground](https://sqlime.org/) - [ ] [sqlite-wasm-http - npm](https://www.npmjs.com/package/sqlite-wasm-http) - [ ] [Postgres WASM by Snaplet and Supabase](https://supabase.com/blog/postgres-wasm) - [ ] [rhashimoto/wa-sqlite: WebAssembly SQLite with experimental support for browser storage extensions](https://github.com/rhashimoto/wa-sqlite) - [ ] [jlongster/absurd-sql: sqlite3 in ur indexeddb (hopefully a better backend soon)](https://github.com/jlongster/absurd-sql) - [ ] [help: basic example for the web browser · Issue #23 · phiresky/sql.js-httpvfs](https://github.com/phiresky/sql.js-httpvfs/issues/23) - [ ] [phiresky/sql.js-httpvfs: Hosting read-only SQLite databases on static file hosters like Github Pages](https://github.com/phiresky/sql.js-httpvfs#minimal-example-from-scratch) - [ ] [sql.js-httpvfs/example at master · phiresky/sql.js-httpvfs](https://github.com/phiresky/sql.js-httpvfs/tree/master/example) - [ ] [Code search results](https://github.com/search?utf8=%E2%9C%93&q=sql.js-httpvfs&type=code) - [ ] [knowledge/docs/databases/sqlite.md at 7c4bbc755c64368a82ca22b76566e9153cd2e377 · nikitavoloboev/knowledge](https://github.com/nikitavoloboev/knowledge/blob/7c4bbc755c64368a82ca22b76566e9153cd2e377/docs/databases/sqlite.md?plain=1#L96) - [ ] [static-wiki/src/sqlite.js at b0ae18ed02ca64263075c3516c38a504f46e10c4 · segfall/static-wiki](https://github.com/segfall/static-wiki/blob/b0ae18ed02ca64263075c3516c38a504f46e10c4/src/sqlite.js#L2)

#134: marker: Convert PDF to markdown quickly with high accuracy

### Details

Similarity score: 0.88 - [ ] [https://github.com/VikParuchuri/marker#readme](https://github.com/VikParuchuri/marker#readme)

Marker

Marker converts PDF, EPUB, and MOBI to markdown. It's 10x faster than nougat, more accurate on most documents, and has low hallucination risk.

Support for a range of PDF documents (optimized for books and scientific papers)
Removes headers/footers/other artifacts
Converts most equations to latex
Formats code blocks and tables
Support for multiple languages (although most testing is done in English). See settings.py for a language list.
Works on GPU, CPU, or MPS

More Details

How it works

Marker is a pipeline of deep learning models:

Extract text, OCR if necessary (heuristics, tesseract)
Detect page layout (layout segmenter, column detector)
Clean and format each block (heuristics, nougat)
Combine blocks and postprocess complete text (heuristics, pdf_postprocessor)

Relying on autoregressive forward passes to generate text is slow and prone to hallucination/repetition. From the nougat paper: We observed [repetition] in 1.5% of pages in the test set, but the frequency increases for out-of-domain documents. In my anecdotal testing, repetitions happen on 5%+ of out-of-domain (non-arXiv) pages.

Nougat is an amazing model, but I wanted a faster and more general purpose solution. Marker is 10x faster and has low hallucination risk because it only passes equation blocks through an LLM forward pass.

Examples

PDF	Type	Marker	Nougat
Think Python	Textbook	View	View
Think OS	Textbook	View	View
Switch Transformers	arXiv paper	View	View
Multi-column CNN	arXiv paper	View	View

Performance

The above results are with marker and nougat setup so they each take ~3GB of VRAM on an A6000.

See below for detailed speed and accuracy benchmarks, and instructions on how to run your own benchmarks.

Limitations

PDF is a tricky format, so marker will not always work perfectly. Here are some known limitations that are on the roadmap to address:

Marker will convert fewer equations to latex than nougat. This is because it has to first detect equations, then convert them without hallucation.
Whitespace and indentations are not always respected.
Not all lines/spans will be joined properly.
Only languages similar to English (Spanish, French, German, Russian, etc) are supported. Languages with different character sets (Chinese, Japanese, Korean, etc) are not.
This works best on digital PDFs that won't require a lot of OCR. It's optimized for speed, and limited OCR is used to fix errors.

Installation

This has been tested on Mac and Linux (Ubuntu and Debian). You'll need python 3.9+ and poetry.

First, clone the repo:

git clone https://github.com/VikParuchuri/marker.git
cd marker

Linux

Install system requirements
- Optional: Install tesseract 5 by following these instructions or running scripts/install/tesseract_5_install.sh.
- Install ghostscript > 9.55 by following these instructions or running scripts/install/ghostscript_install.sh.
- Install other requirements with cat scripts/install/apt-requirements.txt | xargs sudo apt-get install -y
Set the tesseract data folder path
- Find the tesseract data folder tessdata with find / -name tessdata. Make sure to use the one corresponding to the latest tesseract version if you have multiple.
- Create a local.env file in the root marker folder with TESSDATA_PREFIX=/path/to/tessdata inside it
Install python requirements
- poetry install
- poetry shell to activate your poetry venv
Update pytorch since poetry doesn't play nicely with it
- GPU only: run pip install torch to install other torch dependencies.
- CPU only: Uninstall torch, then follow the CPU install instructions.

Mac

Install system requirements from scripts/install/brew-requirements.txt
Set the tesseract data folder path
- Find the tesseract data folder tessdata with brew list tesseract
- Create a local.env file in the root marker folder with TESSDATA_PREFIX=/path/to/tessdata inside it
Install python requirements
- poetry install
- poetry shell to activate your poetry venv

Usage

First, some configuration:

Set your torch device in the local.env file. For example, TORCH_DEVICE=cuda or TORCH_DEVICE=mps. cpu is the default.
- If using GPU, set INFERENCE_RAM to your GPU VRAM (per GPU). For example, if you have 16 GB of VRAM, set INFERENCE_RAM=16.
- Depending on your document types, marker's average memory usage per task can vary slightly. You can configure VRAM_PER_TASK to adjust this if you notice tasks failing with GPU out of memory errors.
Inspect the other settings in marker/settings.py. You can override any settings in the local.env file, or by setting environment variables.
- By default, the final editor model is off. Turn it on with ENABLE_EDITOR_MODEL.
- By default, marker will use ocrmypdf for OCR, which is slower than base tesseract, but higher quality. You can change this with the OCR_ENGINE setting.

Convert a single file

Run convert_single.py, like this:

python convert_single.py /path/to/file.pdf /path/to/output.md --parallel_factor 2 --max_pages 10

--parallel_factor is how much to increase batch size and parallel OCR workers by. Higher numbers will take more VRAM and CPU, but process faster. Set to 1 by default.
--max_pages is the maximum number of pages to process. Omit this to convert the entire document.

Make sure the DEFAULT_LANG setting is set appropriately for your document.

Convert multiple files

Run convert.py, like this:

python convert.py /path/to/input/folder /path/to/output/folder --workers 10 --max 10 --metadata_file /path/to/metadata.json --min_length 10000

--workers is the number of pdfs to convert at once. This is set to 1 by default, but you can increase it to increase throughput, at the cost of more CPU/GPU usage. Parallelism will not increase beyond INFERENCE_RAM / VRAM_PER_TASK if you're using GPU.
--max is the maximum number of pdfs to convert. Omit this to convert all pdfs in the folder.
--metadata_file is an optional path to a json file with metadata about the pdfs. If you provide it, it will be used to set the language for each pdf. If not, DEFAULT_LANG will be used. The format is:
--min_length is the minimum number of characters that need to be extracted from a pdf before it will be considered for processing. If you're processing a lot of pdfs, I recommend setting this to avoid OCRing pdfs that are mostly images. (slows everything down)

{
  "pdf1.pdf": {"language": "English"},
  "pdf2.pdf": {"language": "Spanish"},
  ...
}

Convert multiple files on multiple GPUs

Run chunk_convert.sh, like this:

MIN_LENGTH=10000 METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=15 bash chunk_convert.sh ../pdf_in ../md_out

METADATA_FILE is an optional path to a json file with metadata about the pdfs. See above for the format.
NUM_DEVICES is the number of GPUs to use. Should be 2 or greater.
NUM_WORKERS is the number of parallel processes to run on each GPU. Per-GPU parallelism will not increase beyond INFERENCE_RAM / VRAM_PER_TASK.
MIN_LENGTH is the minimum number of characters that need to be extracted from a pdf before it will be considered for processing. If you're processing a lot of pdfs, I recommend setting this to avoid OCRing pdfs that are mostly images. (slows everything down)

Benchmarks

Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I convert the latex to text, and compare the reference to the output of text extraction methods.

Benchmarks show that marker is 10x faster than nougat, and more accurate outside arXiv (nougat was trained on arXiv data). We show naive text extraction (pulling text out of the pdf with no processing) for comparison.

Speed

Method	Average Score	Time per page	Time per document
naive	0.350727	0.00152378	0.326524
marker	0.641062	0.360622	77.2762
nougat	0.629211	3.77259	808.413

Accuracy

First 3 are non-arXiv books, last 3 are arXiv papers.

Method	switch_trans.pdf	crowd.pdf	multicolcnn.pdf	thinkos.pdf	thinkdsp.pdf	thinkpython.pdf
naive	0.244114	0.140669	0.0868221	0.366856	0.412521	0.468281
marker	0.482091	0.466882	0.537062	0.754347	0.78825	0.779536
nougat	0.696458	0.552337	0.735099	0.655002	0.645704	0.650282

Peak GPU memory usage during the benchmark is 3.3GB for nougat, and 3.1GB for marker. Benchmarks were run on an A6000.

Throughput

Marker takes about 2GB of VRAM on average per task, so you can convert 24 documents in parallel on an A6000.

Running your own benchmarks

You can benchmark the performance of marker on your machine. First, download the benchmark data here and unzip.

Then run benchmark.py like this:

python benchmark.py data/pdfs data/references report.json --nougat

This will benchmark marker against other text extraction methods. It sets up batch sizes for nougat and marker to use a similar amount of GPU RAM for each.

Omit --nougat to exclude nougat from the benchmark. I don't recommend running nougat on CPU, since it is very slow.

Commercial usage

Due to the licensing of the underlying models like layoutlmv3 and nougat, this is only suitable for noncommercial usage.

I'm building a version that can be used commercially, by stripping out the dependencies below. If you would like to get early access, email me at marker@vikas.sh.

Here are the non-commercial/restrictive dependencies:

LayoutLMv3: CC BY-NC-SA 4.0 . Source
Nougat: CC-BY-NC . Source
PyMuPDF - GPL . Source

Other dependencies/datasets are openly licensed (doclaynet, byt5), or used in a way that is compatible with commercial usage (ghostscript).

Thanks

This work would not have been possible without amazing open source models and datasets, including (but not limited to):

Nougat from Meta
Layoutlmv3 from Microsoft
DocLayNet from IBM
ByT5 from Google

Thank you to the authors of these models and datasets for making them available to the community!

This was referenced Mar 5, 2024

gh actions sqlite wasm #691

Open

pattern: detect wasm availability #692

Open

My Knowledge Wiki: personal wiki pattern #705

Open

privateGPT/README.md at main · imartinez/privateGPT #707

Open

irthomasthomas mentioned this issue Mar 22, 2024

Bibliothecula: Using sqlite3 as a Notekeeping Document Graph with Automatic Reference Indexing #779

Open

1 task

irthomasthomas mentioned this issue Apr 3, 2024

Amazon bedrock plugin for llm cli #785

Open

1 task

ShellLM mentioned this issue Jun 19, 2024

jjleng/sensei: Yet another open source Perplexity #837

Open

1 task

ShellLM mentioned this issue Jul 3, 2024

simonw/llm-cluster: LLM plugin for clustering embeddings #839

Open

1 task

ShellLM mentioned this issue Jul 25, 2024

Visual Grep - Xautomation #846

Open

1 task

This was referenced Aug 2, 2024

lesser-known Python libraries | Hacker News #860

Open

Reader API #865

Open

boltons/README.md at master · mahmoud/boltons #868

Open

This was referenced Aug 13, 2024

Deno: the easiest, most secure JavaScript runtime | Deno Docs #880

Open

Vespa 🤝 ColPali: Efficient Document Retrieval with Vision Language Models — pyvespa documentation #892

Open

ShellLM mentioned this issue Nov 7, 2024

infinity/README.md at main · infiniflow/infinity #938

Open

1 task

ShellLM mentioned this issue Jan 29, 2025

CurateGPT is a prototype web application and framework for performing general purpose AI-guided curation and curation-related operations over collections of objects. #979

Open

1 task

chroma/README.md at main · chroma-core/chroma #678

chroma/README.md at main · chroma-core/chroma #678

Comments

irthomasthomas commented Mar 4, 2024

chroma/README.md at main · chroma-core/chroma

Features

Use case: ChatGPT for ______

Embeddings?

Suggested labels

irthomasthomas commented Mar 4, 2024

Related issues

#325: simonw/llm pattern to migrate llm cli embeddings to chromadb

#625: unsloth/README.md at main · unslothai/unsloth

unsloth/README.md at main · unslothai/unsloth

Finetune Mistral, Gemma, Llama 2-5x faster with 70% less memory!

✨ Finetune for Free

🦥 Unsloth.ai News

🔗 Links and Resources

⭐ Key Features

🥇 Performance Benchmarking

Suggested labels

#396: astra-assistants-api: A backend implementation of the OpenAI beta Assistants API

Astra Assistant API Service

Getting Started

Third party LLM Support

Coverage

Roadmap

Suggested labels

{ "key": "llm-function-calling", "value": "Integration of function calling with Large Language Models (LLMs)" }

#386: SciPhi/AgentSearch-V1 · Datasets at Hugging Face

Getting Started

Languages

Dataset Structure

Dataset Creation

Source Data

License

Suggested labels

{ "key": "knowledge-dataset", "value": "A dataset with one billion embeddings from various sources, such as Arxiv, Wikipedia, Project Gutenberg, and carefully filtered Creative Commons data" }

#74: Sqlite WASM and Github Pages

#134: marker: Convert PDF to markdown quickly with high accuracy

Marker

How it works

Examples

Performance

Limitations

Installation

Linux

Mac

Usage

Convert a single file

Convert multiple files

Convert multiple files on multiple GPUs

Benchmarks

Running your own benchmarks

Commercial usage

Thanks