unsloth/README.md at main · unslothai/unsloth #625

irthomasthomas · 2024-02-27T18:30:58Z

unsloth/README.md at main · unslothai/unsloth

Finetune Mistral, Gemma, Llama 2-5x faster with 70% less memory!

✨ Finetune for Free

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

Unsloth supports	Free Notebooks	Performance	Memory use
Gemma 7b	▶️ Start on Colab	2.4x faster	58% less
Mistral 7b	▶️ Start on Colab	2.2x faster	62% less
Llama-2 7b	▶️ Start on Colab	2.2x faster	43% less
TinyLlama	▶️ Start on Colab	3.9x faster	74% less
CodeLlama 34b A100	▶️ Start on Colab	1.9x faster	27% less
Mistral 7b 1xT4	▶️ Start on Kaggle	5x faster*	62% less
DPO - Zephyr	▶️ Start on Colab	1.9x faster	19% less

This conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.

🦥 Unsloth.ai News

📣 Gemma 7b on 6T tokens now works. And Gemma 2b notebook
📣 Added conversational notebooks and raw text notebooks
📣 2x faster inference added for all our models
📣 DPO support is now included. More info on DPO
📣 We did a blog with 🤗Hugging Face and are in their official docs! Check out the SFT docs and DPO docs
📣 Download models 4x faster from 🤗Hugging Face. Eg: unsloth/mistral-7b-bnb-4bit

🔗 Links and Resources

Type	Links
📚 Wiki & FAQ	Read Our Wiki
📜 Documentation	Read The Doc
💾 Installation	unsloth/README.md
Twitter (aka X)	Follow us on X
🥇 Benchmarking	Performance Tables
🌐 Released Models	Unsloth Releases
✍️ Blog	Read our Blogs

⭐ Key Features

All kernels written in OpenAI's Triton language. Manual backprop engine.
0% loss in accuracy - no approximation methods - all exact.
No change of hardware. Supports NVIDIA GPUs from 2018 onward. Minimum CUDA Capability 7.0 (V100, T4, Titan V, RTX 20, 30, 40x, A100, H100, L40, etc.). Check your GPU! GTX 1070 and 1080 work, but are slow.
Works on Linux and Windows via WSL.
Supports 4bit and 16bit QLoRA / LoRA finetuning via bitsandbytes.
Open source trains 5x faster - see Unsloth Pro for 30x faster training!
If you trained a model with 🦥Unsloth, you can use this cool sticker!
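
As a rough illustration of the capability requirement above, the following sketch compares a GPU's compute capability against the stated 7.0 minimum. The `KNOWN_GPUS` table and the `meets_minimum` helper are illustrative names that are not part of Unsloth; the capability values are the published ones for those cards.

```python
# Minimum CUDA compute capability stated in the feature list above.
MIN_CAPABILITY = (7, 0)

# Compute capabilities for a few of the GPUs named above (illustrative table).
KNOWN_GPUS = {
    "V100": (7, 0),
    "T4": (7, 5),
    "A100": (8, 0),
    "H100": (9, 0),
    "GTX 1080": (6, 1),
}

def meets_minimum(capability, minimum=MIN_CAPABILITY):
    """Return True when a (major, minor) capability is at or above the minimum."""
    return tuple(capability) >= tuple(minimum)

for name, capability in KNOWN_GPUS.items():
    status = "meets 7.0" if meets_minimum(capability) else "below 7.0 (expect slow training)"
    print(f"{name}: {capability[0]}.{capability[1]} -> {status}")
```

With PyTorch installed, `torch.cuda.get_device_capability()` returns the same kind of `(major, minor)` tuple for the current GPU, so its result can be passed straight to `meets_minimum`.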

🥇 Performance Benchmarking

For the full list of reproducible benchmarking tables, go to our website

1 A100 40GB	🤗Hugging Face	Flash Attention	🦥Unsloth Open Source	🦥Unsloth Pro
Alpaca	1x	1.04x	1.98x	15.64x
LAION Chip2	1x	0.92x	1.61x	20.73x
OASST	1x	1.19x	2.17x	14.83x
Slim Orca	1x	1.18x	2.22x	14.82x

Benchmarking table below was conducted by 🤗Hugging Face.

Free Colab T4	Dataset	🤗Hugging Face	Pytorch 2.1.1	🦥Unsloth	🦥 VRAM reduction
Llama-2 7b	OASST	1x	1.19x	1.95x	-43.3%
Mistral 7b	Alpaca	1x	1.07x	1.56x	-13.7%
Tiny Llama 1.1b	Alpaca	1x	2.06x	3.87x	-73.8%
DPO with Zephyr	Ultra Chat	1x	1.09x	1.55x	-18.6%
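
To read the speedup column in terms of wall-clock time, a factor of `s` means a run takes `1/s` of the baseline time. The sketch below applies that to the Llama-2 7b row; the 60-minute baseline is a made-up figure used only for the arithmetic.

```python
def runtime_fraction(speedup):
    """Fraction of the baseline runtime that remains after a given speedup."""
    return 1.0 / speedup

# Speedup taken from the Llama-2 7b / OASST row above.
llama2_speedup = 1.95

# Hypothetical baseline duration, chosen only to make the arithmetic concrete.
baseline_minutes = 60.0

unsloth_minutes = baseline_minutes * runtime_fraction(llama2_speedup)
print(f"{baseline_minutes:.0f} min baseline -> {unsloth_minutes:.1f} min at {llama2_speedup}x")
```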


irthomasthomas · 2024-02-27T18:31:00Z

Related issues

#134: marker: Convert PDF to markdown quickly with high accuracy

### Details

Similarity score: 0.9 - [ ] [https://github.com/VikParuchuri/marker#readme](https://github.com/VikParuchuri/marker#readme)

Marker

Marker converts PDF, EPUB, and MOBI to markdown. It's 10x faster than nougat, more accurate on most documents, and has low hallucination risk.

Support for a range of PDF documents (optimized for books and scientific papers)
Removes headers/footers/other artifacts
Converts most equations to latex
Formats code blocks and tables
Support for multiple languages (although most testing is done in English). See settings.py for a language list.
Works on GPU, CPU, or MPS

More Details

How it works

Marker is a pipeline of deep learning models:

Extract text, OCR if necessary (heuristics, tesseract)
Detect page layout (layout segmenter, column detector)
Clean and format each block (heuristics, nougat)
Combine blocks and postprocess complete text (heuristics, pdf_postprocessor)

Relying on autoregressive forward passes to generate text is slow and prone to hallucination/repetition. From the nougat paper: We observed [repetition] in 1.5% of pages in the test set, but the frequency increases for out-of-domain documents. In my anecdotal testing, repetitions happen on 5%+ of out-of-domain (non-arXiv) pages.

Nougat is an amazing model, but I wanted a faster and more general purpose solution. Marker is 10x faster and has low hallucination risk because it only passes equation blocks through an LLM forward pass.

Examples

PDF	Type	Marker	Nougat
Think Python	Textbook	View	View
Think OS	Textbook	View	View
Switch Transformers	arXiv paper	View	View
Multi-column CNN	arXiv paper	View	View

Performance

The above results are with marker and nougat setup so they each take ~3GB of VRAM on an A6000.

See below for detailed speed and accuracy benchmarks, and instructions on how to run your own benchmarks.

Limitations

PDF is a tricky format, so marker will not always work perfectly. Here are some known limitations that are on the roadmap to address:

Marker will convert fewer equations to latex than nougat. This is because it has to first detect equations, then convert them without hallucination.
Whitespace and indentations are not always respected.
Not all lines/spans will be joined properly.
Only languages similar to English (Spanish, French, German, Russian, etc) are supported. Languages with different character sets (Chinese, Japanese, Korean, etc) are not.
This works best on digital PDFs that won't require a lot of OCR. It's optimized for speed, and limited OCR is used to fix errors.

Installation

This has been tested on Mac and Linux (Ubuntu and Debian). You'll need python 3.9+ and poetry.

First, clone the repo:

git clone https://github.com/VikParuchuri/marker.git
cd marker

Linux

Install system requirements
- Optional: Install tesseract 5 by following these instructions or running scripts/install/tesseract_5_install.sh.
- Install ghostscript > 9.55 by following these instructions or running scripts/install/ghostscript_install.sh.
- Install other requirements with cat scripts/install/apt-requirements.txt | xargs sudo apt-get install -y
Set the tesseract data folder path
- Find the tesseract data folder tessdata with find / -name tessdata. Make sure to use the one corresponding to the latest tesseract version if you have multiple.
- Create a local.env file in the root marker folder with TESSDATA_PREFIX=/path/to/tessdata inside it
Install python requirements
- poetry install
- poetry shell to activate your poetry venv
Update pytorch since poetry doesn't play nicely with it
- GPU only: run pip install torch to install other torch dependencies.
- CPU only: Uninstall torch, then follow the CPU install instructions.

Mac

Install system requirements from scripts/install/brew-requirements.txt
Set the tesseract data folder path
- Find the tesseract data folder tessdata with brew list tesseract
- Create a local.env file in the root marker folder with TESSDATA_PREFIX=/path/to/tessdata inside it
Install python requirements
- poetry install
- poetry shell to activate your poetry venv

Usage

First, some configuration:

Set your torch device in the local.env file. For example, TORCH_DEVICE=cuda or TORCH_DEVICE=mps. cpu is the default.
- If using GPU, set INFERENCE_RAM to your GPU VRAM (per GPU). For example, if you have 16 GB of VRAM, set INFERENCE_RAM=16.
- Depending on your document types, marker's average memory usage per task can vary slightly. You can configure VRAM_PER_TASK to adjust this if you notice tasks failing with GPU out of memory errors.
Inspect the other settings in marker/settings.py. You can override any settings in the local.env file, or by setting environment variables.
- By default, the final editor model is off. Turn it on with ENABLE_EDITOR_MODEL.
- By default, marker will use ocrmypdf for OCR, which is slower than base tesseract, but higher quality. You can change this with the OCR_ENGINE setting.
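
The settings above live in a `local.env` file as `KEY=VALUE` lines. As a minimal sketch, the helper below renders such a file from a dictionary; `render_env` is a hypothetical helper, and the values are the example values quoted in this README rather than recommendations.

```python
def render_env(settings):
    """Render a mapping as KEY=VALUE lines, one setting per line."""
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"

# Example values taken from the text above; adjust them for your machine.
settings = {
    "TORCH_DEVICE": "cuda",
    "INFERENCE_RAM": 16,
    "TESSDATA_PREFIX": "/path/to/tessdata",
}

env_text = render_env(settings)
print(env_text)

# To write the file into the marker root folder, something like:
# with open("local.env", "w") as handle:
#     handle.write(env_text)
```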

Convert a single file

Run convert_single.py, like this:

python convert_single.py /path/to/file.pdf /path/to/output.md --parallel_factor 2 --max_pages 10

--parallel_factor is how much to increase batch size and parallel OCR workers by. Higher numbers will take more VRAM and CPU, but process faster. Set to 1 by default.
--max_pages is the maximum number of pages to process. Omit this to convert the entire document.

Make sure the DEFAULT_LANG setting is set appropriately for your document.

Convert multiple files

Run convert.py, like this:

python convert.py /path/to/input/folder /path/to/output/folder --workers 10 --max 10 --metadata_file /path/to/metadata.json --min_length 10000

--workers is the number of pdfs to convert at once. This is set to 1 by default, but you can increase it to increase throughput, at the cost of more CPU/GPU usage. Parallelism will not increase beyond INFERENCE_RAM / VRAM_PER_TASK if you're using GPU.
--max is the maximum number of pdfs to convert. Omit this to convert all pdfs in the folder.
--min_length is the minimum number of characters that need to be extracted from a pdf before it will be considered for processing. If you're processing a lot of pdfs, I recommend setting this to avoid OCRing pdfs that are mostly images. (slows everything down)
--metadata_file is an optional path to a json file with metadata about the pdfs. If you provide it, it will be used to set the language for each pdf. If not, DEFAULT_LANG will be used. The format is:

{
  "pdf1.pdf": {"language": "English"},
  "pdf2.pdf": {"language": "Spanish"},
  ...
}
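
A metadata file in this shape can be generated rather than written by hand. The sketch below builds the mapping from a plain filename-to-language dictionary; `build_metadata` is an illustrative helper, not part of marker.

```python
import json

def build_metadata(languages_by_file):
    """Wrap each language in the {"language": ...} object the format above uses."""
    return {name: {"language": language} for name, language in languages_by_file.items()}

metadata = build_metadata({"pdf1.pdf": "English", "pdf2.pdf": "Spanish"})
metadata_json = json.dumps(metadata, indent=2)
print(metadata_json)

# To save it for --metadata_file, something like:
# with open("metadata.json", "w") as handle:
#     handle.write(metadata_json)
```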

Convert multiple files on multiple GPUs

Run chunk_convert.sh, like this:

MIN_LENGTH=10000 METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=15 bash chunk_convert.sh ../pdf_in ../md_out

METADATA_FILE is an optional path to a json file with metadata about the pdfs. See above for the format.
NUM_DEVICES is the number of GPUs to use. Should be 2 or greater.
NUM_WORKERS is the number of parallel processes to run on each GPU. Per-GPU parallelism will not increase beyond INFERENCE_RAM / VRAM_PER_TASK.
MIN_LENGTH is the minimum number of characters that need to be extracted from a pdf before it will be considered for processing. If you're processing a lot of pdfs, I recommend setting this to avoid OCRing pdfs that are mostly images. (slows everything down)

Benchmarks

Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I convert the latex to text, and compare the reference to the output of text extraction methods.

Benchmarks show that marker is 10x faster than nougat, and more accurate outside arXiv (nougat was trained on arXiv data). We show naive text extraction (pulling text out of the pdf with no processing) for comparison.

Speed

Method	Average Score	Time per page	Time per document
naive	0.350727	0.00152378	0.326524
marker	0.641062	0.360622	77.2762
nougat	0.629211	3.77259	808.413
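
The "10x faster" claim can be checked directly against the table. This short sketch divides the nougat timings by the marker timings, using only the numbers listed above.

```python
# Timings copied from the speed table above.
time_per_page = {"naive": 0.00152378, "marker": 0.360622, "nougat": 3.77259}
time_per_document = {"naive": 0.326524, "marker": 77.2762, "nougat": 808.413}

page_speedup = time_per_page["nougat"] / time_per_page["marker"]
document_speedup = time_per_document["nougat"] / time_per_document["marker"]

print(f"per page: marker is {page_speedup:.2f}x faster than nougat")
print(f"per document: marker is {document_speedup:.2f}x faster than nougat")
```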

Accuracy

First 3 are arXiv papers, last 3 are non-arXiv books.

Method	switch_trans.pdf	crowd.pdf	multicolcnn.pdf	thinkos.pdf	thinkdsp.pdf	thinkpython.pdf
naive	0.244114	0.140669	0.0868221	0.366856	0.412521	0.468281
marker	0.482091	0.466882	0.537062	0.754347	0.78825	0.779536
nougat	0.696458	0.552337	0.735099	0.655002	0.645704	0.650282

Peak GPU memory usage during the benchmark is 3.3GB for nougat, and 3.1GB for marker. Benchmarks were run on an A6000.

Throughput

Marker takes about 2GB of VRAM on average per task, so you can convert 24 documents in parallel on an A6000.
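
The figure of 24 follows from the A6000's 48 GB of VRAM divided by roughly 2 GB per task. The sketch below reproduces that arithmetic together with the `INFERENCE_RAM / VRAM_PER_TASK` cap described in the usage section; `max_parallel_tasks` is an illustrative helper and not marker's own code.

```python
def max_parallel_tasks(inference_ram_gb, vram_per_task_gb, requested_workers):
    """Cap the requested worker count at INFERENCE_RAM / VRAM_PER_TASK."""
    gpu_limit = int(inference_ram_gb // vram_per_task_gb)
    return min(requested_workers, gpu_limit)

# A6000: 48 GB of VRAM, about 2 GB per task.
a6000_tasks = max_parallel_tasks(48, 2, requested_workers=30)
print(a6000_tasks)

# 16 GB GPU from the configuration example, asking for 10 workers.
small_gpu_tasks = max_parallel_tasks(16, 2, requested_workers=10)
print(small_gpu_tasks)
```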

Running your own benchmarks

You can benchmark the performance of marker on your machine. First, download the benchmark data here and unzip.

Then run benchmark.py like this:

python benchmark.py data/pdfs data/references report.json --nougat

This will benchmark marker against other text extraction methods. It sets up batch sizes for nougat and marker to use a similar amount of GPU RAM for each.

Omit --nougat to exclude nougat from the benchmark. I don't recommend running nougat on CPU, since it is very slow.

Commercial usage

Due to the licensing of the underlying models like layoutlmv3 and nougat, this is only suitable for noncommercial usage.

I'm building a version that can be used commercially, by stripping out the dependencies below. If you would like to get early access, email me at marker@vikas.sh.

Here are the non-commercial/restrictive dependencies:

LayoutLMv3: CC BY-NC-SA 4.0 . Source
Nougat: CC-BY-NC . Source
PyMuPDF - GPL . Source

Other dependencies/datasets are openly licensed (doclaynet, byt5), or used in a way that is compatible with commercial usage (ghostscript).

Thanks

This work would not have been possible without amazing open source models and datasets, including (but not limited to):

Nougat from Meta
Layoutlmv3 from Microsoft
DocLayNet from IBM
ByT5 from Google

Thank you to the authors of these models and datasets for making them available to the community!

#456: Baseline benchmark for 17 coding models : r/LocalLLaMA

### Details

Similarity score: 0.89 - [ ] [Baseline benchmark for 17 coding models : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/19fc4uf/baseline_benchmark_for_17_coding_models/)

Baseline Benchmark for 17 Coding Models

Discussion

I am currently working on implementing some ideas for coding models inference strategies (prompting, control, context exploration, CoT, ToT, etc) and I needed a baseline benchmark on a bunch of models. Since I work on a 3060 12GB, I was limited in what I can test so I went for every model that is 7/13B and has an AWQ quant available, since that is what the inference library that I use supports. I thought I'd share some numbers.

Notes:

This is a benchmark for getting a local baseline. I'm interested in improvement from here, so the absolute values are less important for me. Don't take the absolute values too seriously. (well, maybe except deepseek-coder-1.3b, that is a bit suspect).
I used the HumanEval dataset. This is superseded by HumanEval+ and other more recent benchmarks. I chose this because it was the first one I tried. Again, with my tests I'm looking for improvements over the baseline, so this is mostly fine.
AWQ quant is not the best out there, but all my tests will be done with this quant, so for me it is OK.
Temp tests were done in only one generation. In general you'd want to average the score over many generations at a given temp.
Each model was prompted according to the model card template. Here's an example for the codellama series -

f"""<s>You are a helpful and respectful assistant. Answer the following question: {question}"""

Results

I've plotted the results (with horrendous contrasting colors, but alas) to look for any interesting patterns in problem solving. You can find the plots here.

Model	Temp	Correct / 164	Percentage
TheBloke/Mistral-7B-Instruct-v0.2-AWQ	0.0	67	0.40853658536585363
TheBloke/Mistral-7B-Instruct-v0.2-AWQ	0.1	63	0.38414634146341464
TheBloke/Mistral-7B-Instruct-v0.2-AWQ	0.2	68	0.4146341463414634
TheBloke/Mistral-7B-Instruct-v0.2-AWQ	0.3	61	0.3719512195121951
TheBloke/Mistral-7B-Instruct-v0.2-AWQ	0.4	61	0.3719512195121951
TheBloke/Mistral-7B-Instruct-v0.2-AWQ	0.5	63	0.38414634146341464
TheBloke/Mistral-7B-Instruct-v0.2-AWQ	0.6	54	0.32926829268292684
TheBloke/Mistral-7B-Instruct-v0.2-AWQ	0.7	61	0.3719512195121951
TheBloke/Mistral-7B-Instruct-v0.2-AWQ	0.8	60	0.36585365853658536
TheBloke/Mistral-7B-Instruct-v0.2-AWQ	0.9	59	0.3597560975609756
TheBloke/Mistral-7B-Instruct-v0.2-AWQ	1.0	65	0.39634146341463417
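
The percentage column is simply the number of correct solutions divided by the 164 HumanEval problems. As a rough summary, the sketch below recomputes it and averages over the eleven temperature runs. Note that this averages across temperatures; it does not replace averaging several generations at one temperature, which the notes above recommend.

```python
TOTAL_PROBLEMS = 164

# Correct counts per temperature, copied from the table above.
correct_by_temp = {
    0.0: 67, 0.1: 63, 0.2: 68, 0.3: 61, 0.4: 61, 0.5: 63,
    0.6: 54, 0.7: 61, 0.8: 60, 0.9: 59, 1.0: 65,
}

pass_rate = {temp: correct / TOTAL_PROBLEMS for temp, correct in correct_by_temp.items()}
mean_correct = sum(correct_by_temp.values()) / len(correct_by_temp)
mean_rate = mean_correct / TOTAL_PROBLEMS

for temp, rate in pass_rate.items():
    print(f"temp {temp:.1f}: {rate:.1%}")
print(f"mean over runs: {mean_correct:.1f} / {TOTAL_PROBLEMS} = {mean_rate:.1%}")
```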

Suggested labels

{ "label-name": "coding-models", "description": "Discussion and benchmark of coding models implementation strategies.", "confidence": 96.82 }

#160: sid321axn/tinyllama-text2sql-finetuned at main

### Details

Similarity score: 0.89 ## tiny-llama-text2sql ## safetensors - [ ] [sid321axn/tinyllama-text2sql-finetuned at main](https://huggingface.co/sid321axn/tinyllama-text2sql-finetuned/tree/main)

adapter

https://huggingface.co/sid321axn/tiny-llama-text2sql

This model is a fine-tuned version of PY007/TinyLlama-1.1B-Chat-v0.3 on the None dataset.

{
 "_name_or_path": "PY007/TinyLlama-1.1B-Chat-v0.3",
 "architectures": [
   "LlamaForCausalLM"
 ],
 "attention_bias": false,
 "attention_dropout": 0.0,
 "bos_token_id": 1,
 "eos_token_id": 2,
 "hidden_act": "silu",
 "hidden_size": 2048,
 "initializer_range": 0.02,
 "intermediate_size": 5632,
 "max_position_embeddings": 2048,
 "model_type": "llama",
 "num_attention_heads": 32,
 "num_hidden_layers": 22,
 "num_key_value_heads": 4,
 "pretraining_tp": 1,
 "rms_norm_eps": 1e-05,
 "rope_scaling": null,
 "rope_theta": 10000.0,
 "tie_word_embeddings": false,
 "torch_dtype": "float16",
 "transformers_version": "4.37.0.dev0",
 "use_cache": false,
 "vocab_size": 32003
}
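
A few useful quantities can be derived from this config. The sketch below computes the per-head dimension, the number of query heads that share each key/value head (grouped-query attention), and an estimate of the float16 KV-cache size if caching is enabled at inference. The cache formula is the standard back-of-the-envelope estimate, not something stated in the model card.

```python
# Values copied from the config above.
config = {
    "hidden_size": 2048,
    "num_attention_heads": 32,
    "num_key_value_heads": 4,
    "num_hidden_layers": 22,
    "max_position_embeddings": 2048,
}

head_dim = config["hidden_size"] // config["num_attention_heads"]
queries_per_kv_head = config["num_attention_heads"] // config["num_key_value_heads"]

# Keys and values (factor 2), per layer, per KV head, 2 bytes per float16 value.
BYTES_PER_VALUE = 2
kv_bytes_per_token = (
    2 * config["num_hidden_layers"] * config["num_key_value_heads"] * head_dim * BYTES_PER_VALUE
)
kv_mib_full_context = kv_bytes_per_token * config["max_position_embeddings"] / 2**20

print(f"head_dim={head_dim}, query heads per KV head={queries_per_kv_head}")
print(f"KV cache: {kv_bytes_per_token} bytes/token, {kv_mib_full_context:.0f} MiB at full context")
```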


#499: marella/ctransformers: Python bindings for the Transformer models implemented in C/C++ using GGML library.

### Details

Similarity score: 0.89 - [ ] [marella/ctransformers: Python bindings for the Transformer models implemented in C/C++ using GGML library.](https://github.com/marella/ctransformers?tab=readme-ov-file#gptq)

# CTransformers

[![PyPI version](https://badge.fury.io/py/ctransformers.svg)](https://badge.fury.io/py/ctransformers)
[![Documentation](https://readthedocs.org/images/button/readthedocs-ci.svg)](https://ctransformers.readthedocs.io/)
[![Build and Test](https://github.com/marella/ctransformers/actions/workflows/build.yml/badge.svg)](https://github.com/marella/ctransformers/actions/workflows/build.yml)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

Python bindings for the Transformer models implemented in C/C++ using GGML library. Also see [ChatDocs](https://github.com/marella/chatdocs)

## Supported Models

| Model  | Model Type | CUDA | Metal |
| ------ | --------- | :--: | :--:  |
| GPT-2   | gpt2      |      |       |
| GPT-J, GPT4All-J | gptj      |      |       |
| GPT-NeoX, StableLM | gpt_neox  |      |       |
| Falcon   | falcon    |   ✅  |       |
| LLaMA, LLaMA 2 | llama     |   ✅   |   ✅   |
| MPT      | mpt       |   ✅  |       |
| StarCoder, StarChat | gpt_bigcode |   ✅   |       |
| Dolly V2  | dolly-v2   |      |       |
| Replit   | replit    |      |       |

## Installation

To install via `pip`, simply run:

pip install ctransformers


## Usage

It provides a unified interface for all models:

```python
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained("/path/to/ggml-model.bin", model_type="gpt2")

print(llm("AI is going to"))
```

To stream the output:

for text in llm("AI is going to", stream=True):
    print(text, end="", flush=True)

You can load models from Hugging Face Hub directly:

llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml")

If a model repo has multiple model files (.bin or .gguf files), specify a model file using:

llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", model_file="ggml-model.bin")

🤗 Transformers

Note: This is an experimental feature and may change in the future.

To use with 🤗 Transformers, create the model and tokenizer using:

from ctransformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True)
tokenizer = AutoTokenizer.from_pretrained(model)

You can use 🤗 Transformers text generation pipeline:

from transformers import pipeline

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("AI is going to", max_new_tokens=256))

You can use 🤗 Transformers generation parameters:

pipe("AI is going to", max_new_tokens=256, do_sample=True, temperature=0.8, repetition_penalty=1.1)

You can use 🤗 Transformers tokenizers:

from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True)  # Load model from GGML model repo.
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # Load tokenizer from original model repo.

LangChain

It is integrated into LangChain. See LangChain docs.

GPU

To run some of the model layers on GPU, set the gpu_layers parameter:

llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GGML", gpu_layers=50)

CUDA

Install CUDA libraries using:

pip install ctransformers[cuda]

ROCm

To enable ROCm support, install the ctransformers package using:

CT_HIPBLAS=1 pip install ctransformers --no-binary ctransformers

Metal

To enable Metal support, install the ctransformers package using:

CT_METAL=1 pip install ctransformers --no-binary ctransformers

GPTQ

Note: This is an experimental feature and only LLaMA models are supported using [ExLlama](https://github.com/turboderp/exllama).

Install additional dependencies using:

pip install ctransformers[gptq]

Load a GPTQ model using:

llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GPTQ")

If the model name or path doesn't contain the word gptq, specify model_type="gptq".

It can also be used with LangChain. Low-level APIs are not fully supported.

Documentation

Find the documentation on Read the Docs.

Config

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| `top_k` | `int` | The top-k value to use for sampling | `40` |
| `top_p` | `float` | The top-p value to use for sampling | `0.95` |
| `temperature` | `float` | The temperature to use for sampling | `0.8` |
| `repetition_penalty` | `float` | The repetition penalty to use for sampling | `1.1` |
| `last_n_tokens` | `int` | The number of last tokens to use for repetition penalty | `64` |
| `seed` | `int` | The seed value to use for sampling tokens | `-1` |
| `max_new_tokens` | `int` | The maximum number of new tokens to generate | `256` |
| `stop` | `List` | A list of sequences to stop generation when encountered | `None` |
| `stream` | `bool` | Whether to stream the generated text | `False` |
| `reset` | `bool` | Whether to reset the model state before generating text | `True` |
| `batch_size` | `int` | The batch size to use for evaluating tokens in a single prompt | `8` |
| `threads` | `int` | The number of threads to use for evaluating tokens | `-1` |
| `context_length` | `int` | The maximum context length to use | `-1` |
| `gpu_layers` | `int` | The number of layers to run on GPU | `0` |
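
To give a feel for how `temperature`, `top_k` and `top_p` interact, here is a small pure-Python sketch of the usual sampling recipe. It illustrates the general technique under the defaults listed above and is not ctransformers' own implementation; `sample_token` and its `{token: logit}` input are hypothetical.

```python
import math
import random

def sample_token(logits, top_k=40, top_p=0.95, temperature=0.8, rng=None):
    """Pick one token from a {token: logit} mapping (temperature must be > 0)."""
    rng = rng or random.Random()
    # Temperature scaling, then a numerically stable softmax.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(
        ((tok, weight / total) for tok, weight in weights.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    # Top-k: keep only the k most likely tokens.
    ranked = ranked[:top_k]
    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    tokens = [tok for tok, _ in kept]
    probs = [prob for _, prob in kept]
    # random.choices renormalises the remaining weights.
    return rng.choices(tokens, weights=probs, k=1)[0]

print(sample_token({"going": 2.0, "able": 1.0, "here": 0.0}, rng=random.Random(0)))
```

Lower `temperature` sharpens the distribution before the cut-offs apply, so fewer tokens survive the `top_p` filter.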

Find the URL for the model card for GPTQ here.

Made with ❤️ by marella

Suggested labels

null

#386: SciPhi/AgentSearch-V1 · Datasets at Hugging Face

### Details

Similarity score: 0.89 - [ ] [SciPhi/AgentSearch-V1 · Datasets at Hugging Face](https://huggingface.co/datasets/SciPhi/AgentSearch-V1)

Getting Started

The AgentSearch-V1 dataset is a comprehensive collection of over one billion embeddings, produced using jina-v2-base. It includes more than 50 million high-quality documents and over 1 billion passages, covering a vast range of content from sources such as Arxiv, Wikipedia, and Project Gutenberg, plus carefully filtered Creative Commons (CC) data. Our team is dedicated to continuously expanding and enhancing this corpus to improve the search experience. We welcome your thoughts and suggestions – please feel free to reach out with your ideas!

To access and utilize the AgentSearch-V1 dataset, you can stream it via HuggingFace with the following Python code:

from datasets import load_dataset
import json
import numpy as np

# To stream the entire dataset:
ds = load_dataset("SciPhi/AgentSearch-V1", data_files="**/*", split="train", streaming=True)

# Optional, stream just the "arxiv" dataset
# ds = load_dataset("SciPhi/AgentSearch-V1", data_files="arxiv/*", split="train", streaming=True)

# To process the entries:
for entry in ds:
    embeddings = np.frombuffer(
        entry['embeddings'], dtype=np.float32
    ).reshape(-1, 768)
    text_chunks = json.loads(entry['text_chunks'])
    metadata = json.loads(entry['metadata'])
    print(f'Embeddings:\n{embeddings}\n\nChunks:\n{text_chunks}\n\nMetadata:\n{metadata}')
    break
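
The `np.frombuffer(...).reshape(-1, 768)` step treats the stored bytes as a flat run of float32 values and cuts it into 768-wide rows. The standard-library sketch below shows the same idea on a tiny synthetic buffer; `decode_embeddings` is an illustrative helper, and little-endian byte order is an assumption (it matches the native order on common hardware).

```python
import struct

def decode_embeddings(raw, dim=768):
    """Split a float32 byte buffer into rows of `dim` values each."""
    bytes_per_row = 4 * dim  # float32 is 4 bytes
    if len(raw) % bytes_per_row:
        raise ValueError("buffer length is not a whole number of embeddings")
    count = len(raw) // 4
    flat = struct.unpack(f"<{count}f", raw)  # little-endian float32 (assumption)
    return [list(flat[start:start + dim]) for start in range(0, count, dim)]

# Synthetic buffer: two 3-dimensional vectors instead of real 768-dimensional ones.
demo = struct.pack("<6f", 1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
rows = decode_embeddings(demo, dim=3)
print(rows)
```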

A full set of scripts to recreate the dataset from scratch can be found here. Further, you may check the docs for details on how to perform RAG over AgentSearch.

Languages

English.

Dataset Structure

The raw dataset structure is as follows:

{
    "url": ...,
    "title": ...,
    "metadata": {"url": "...", "timestamp": "...", "source": "...", "language": "..."},
    "text_chunks": ...,
    "embeddings": ...,
    "dataset": "book" | "arxiv" | "wikipedia" | "stack-exchange" | "open-math" | "RedPajama-Data-V2"
}

Dataset Creation

This dataset was created as a step towards making humanity's most important knowledge openly searchable and LLM optimal. It was created by filtering, cleaning, and augmenting locally publicly available datasets.

To cite our work, please use the following:

@software{SciPhi2023AgentSearch,
author = {SciPhi},
title = {AgentSearch [ΨΦ]: A Comprehensive Agent-First Framework and Dataset for Webscale Search},
year = {2023},
url = {https://github.com/SciPhi-AI/agent-search}
}

Source Data

@online{wikidump,
author = "Wikimedia Foundation",
title = "Wikimedia Downloads",
url = "https://dumps.wikimedia.org"
}

@misc{paster2023openwebmath,
title={OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text},
author={Keiran Paster and Marco Dos Santos and Zhangir Azerbayev and Jimmy Ba},
year={2023},
eprint={2310.06786},
archivePrefix={arXiv},
primaryClass={cs.AI}
}

@software{together2023redpajama,
author = {Together Computer},
title = {RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset},
month = April,
year = 2023,
url = {https://github.com/togethercomputer/RedPajama-Data}
}

License

Please refer to the licenses of the data subsets you use.

Open-Web (Common Crawl Foundation Terms of Use)
Books: the_pile_books3 license and pg19 license
ArXiv Terms of Use
Wikipedia License
StackExchange license on the Internet Archive

Suggested labels

{ "key": "knowledge-dataset", "value": "A dataset with one billion embeddings from various sources, such as Arxiv, Wikipedia, Project Gutenberg, and carefully filtered Creative Commons data" }

#383: deepseek-ai/deepseek-coder-5.7bmqa-base · Hugging Face

### Details

Similarity score: 0.88 - [ ] [deepseek-ai/deepseek-coder-5.7bmqa-base · Hugging Face](https://huggingface.co/deepseek-ai/deepseek-coder-5.7bmqa-base)

Deepseek Coder Introduction

Deepseek Coder is a series of code language models, each trained from scratch on 2T tokens with a composition of 87% code and 13% natural language in both English and Chinese. We provide various sizes of the code model, ranging from 1B to 33B versions. Each model is pre-trained on a project-level code corpus with a window size of 16K and an extra fill-in-the-blank task, supporting project-level code completion and infilling. Deepseek Coder achieves state-of-the-art performance among open-source code models on multiple programming languages and various benchmarks.

Key Features

Massive Training Data: Trained from scratch on 2T tokens, including 87% code and 13% linguistic data in both English and Chinese languages.
Highly Flexible & Scalable: Offered in model sizes of 1.3B, 5.7B, 6.7B, and 33B, enabling users to choose the setup most suitable for their requirements.
Superior Model Performance: State-of-the-art performance among publicly available code models on HumanEval, MultiPL-E, MBPP, DS-1000, and APPS benchmarks.
Advanced Code Completion Capabilities: A window size of 16K and a fill-in-the-blank task, supporting project-level code completion and infilling tasks.

Model Summary

deepseek-coder-5.7bmqa-base: A 5.7B parameter model with Multi Query Attention, trained on 2 trillion tokens.
Home Page: DeepSeek
Repository: deepseek-ai/deepseek-coder
Chat With DeepSeek Coder: DeepSeek-Coder

How to Use

This section provides examples of how to use the Deepseek Coder model for code completion, code insertion, and repository-level code completion tasks.

Code Completion

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda()

input_text = "#write a quick sort algorithm"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Code Insertion

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda()

input_text = """<｜fim▁begin｜>def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[0]
    left = []
    right = []
<｜fim▁hole｜>
        if arr[i] < pivot:
            left.append(arr[i])
        else:
            right.append(arr[i])
    return quick_sort(left) + [pivot] + quick_sort(right)<｜fim▁end｜>"""

inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True)[len(input_text):])
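
The insertion prompt above has a fixed shape: a begin marker, the code before the gap, a hole marker, the code after the gap, and an end marker. As a minimal sketch, the helper below assembles that shape from its parts. `build_fim_prompt` is a hypothetical helper, and the marker strings in the demo are placeholders; pass the sentinel strings the model's tokenizer actually defines.

```python
def build_fim_prompt(prefix, suffix, begin, hole, end):
    """Assemble a fill-in-the-middle prompt: begin + prefix + hole + suffix + end."""
    return f"{begin}{prefix}{hole}{suffix}{end}"

# Code before and after the gap the model should fill.
prefix = "def add(a, b):\n"
suffix = "\n    return result"

# Placeholder markers for illustration only; substitute the tokenizer's real sentinels.
prompt = build_fim_prompt(prefix, suffix, begin="<BEGIN>", hole="<HOLE>", end="<END>")
print(prompt)
```

Keeping the markers as arguments avoids hard-coding sentinel spellings, which differ between model families.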

Repository Level Code Completion

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda()

input_text = """#utils.py
import torch
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

def load_data():
    iris = datasets.load_iris()
    X = iris.data
    y = iris.target

    # Standardize the data
    scaler = StandardScaler()
    X = scaler.fit_transform(X)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Convert numpy data to PyTorch tensors
    X_train = torch.tensor(X_train, dtype=torch.float32)
    X_test = torch.tensor(X_test, dtype=torch.float32)
    y_train = torch.tensor(y_train, dtype=torch.int64)
    y_test = torch.tensor(y_test, dtype=torch.int64)

    return X_train, X_test, y_train, y_test

def evaluate_predictions(y_test, y_pred):
    return accuracy_score(y_test, y_pred)
#model.py
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

class IrisClassifier(nn.Module):
    def __init__(self):
        super(IrisClassifier, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(4, 16),
            nn.ReLU(),
            nn.Linear(16, 3)
        )

    def forward(self, x):
        return self.fc(x)

    def train_model(self, X_train, y_train, epochs, lr, batch_size):
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.Adam(self.parameters(), lr=lr)

        # Create DataLoader for batches
        dataset = TensorDataset(X_train, y_train)
        dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

        for epoch in range(epochs):
            for batch_X, batch_y in dataloader:
                optimizer.zero_grad()
                outputs = self(batch_X)
                loss = criterion(outputs, batch_y)
                loss.backward()
                optimizer.step()

    def predict(self, X_test):
        with torch.no_grad():
            outputs = self(X_test)
            _, predicted = outputs.max(1)
        return predicted.numpy()
#main.py
from utils import load_data, evaluate_predictions
from model import IrisClassifier as Classifier

def main():
    # Model training and evaluation
"""

inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=140)
print(tokenizer.decode(outputs[0]))
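
The repository-level prompt above is a plain concatenation of files, each introduced by a `#filename` comment line, ending with the unfinished file the model should continue. The sketch below builds such a prompt from a dictionary; `build_repo_prompt` is an illustrative helper, not part of the DeepSeek Coder tooling.

```python
def build_repo_prompt(context_files, unfinished_name, unfinished_body):
    """Join files as '#name' + body blocks, ending with the file to be completed."""
    parts = [f"#{name}\n{body.rstrip()}\n" for name, body in context_files.items()]
    parts.append(f"#{unfinished_name}\n{unfinished_body}")
    return "".join(parts)

# Tiny stand-ins for the utils.py / model.py / main.py files used above.
context_files = {
    "utils.py": "def load_data():\n    return [1, 2, 3]\n",
    "model.py": "class Classifier:\n    pass\n",
}
prompt = build_repo_prompt(
    context_files,
    unfinished_name="main.py",
    unfinished_body="from utils import load_data\n\ndef main():\n",
)
print(prompt)
```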

License

This code repository is licensed under the MIT License. The use of Deepseek Coder models is subject to the Model License. DeepSeek Coder supports commercial use.

See the LICENSE-MODEL for more details.

Contact

If you have any questions, please raise an issue or contact us at agi_code@deepseek.com.

Suggested labels

{ "key": "llm-experiments", "value": "Experiments and results related to Large Language Models" } { "key": "AI-Chatbots", "value": "Topics related to advanced chatbot platforms integrating multiple AI models" }

irthomasthomas added the finetuning Tools for finetuning of LLMs e.g. SFT or RLHF label Mar 5, 2024

