Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unsloth/README.md at main · unslothai/unsloth #625

Open
1 task
irthomasthomas opened this issue Feb 27, 2024 · 1 comment
Open
1 task

unsloth/README.md at main · unslothai/unsloth #625

irthomasthomas opened this issue Feb 27, 2024 · 1 comment
Labels
Algorithms Sorting, Learning or Classifying. All algorithms go here. finetuning Tools for finetuning of LLMs e.g. SFT or RLHF MachineLearning ML Models, Training and Inference Models LLM and ML model repos and links Papers Research papers Research personal research notes for a topic Software2.0 Software development driven by AI and neural networks.

Comments

@irthomasthomas
Copy link
Owner

unsloth/README.md at main · unslothai/unsloth

unsloth logo



Finetune Mistral, Gemma, Llama 2-5x faster with 70% less memory!

✨ Finetune for Free

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

Unsloth supports Free Notebooks Performance Memory use
Gemma 7b ▶️ Start on Colab 2.4x faster 58% less
Mistral 7b ▶️ Start on Colab 2.2x faster 62% less
Llama-2 7b ▶️ Start on Colab 2.2x faster 43% less
TinyLlama ▶️ Start on Colab 3.9x faster 74% less
CodeLlama 34b A100 ▶️ Start on Colab 1.9x faster 27% less
Mistral 7b 1xT4 ▶️ Start on Kaggle 5x faster* 62% less
DPO - Zephyr ▶️ Start on Colab 1.9x faster 19% less

🦥 Unsloth.ai News

🔗 Links and Resources

Type Links
📚 Wiki & FAQ Read Our Wiki
📜 Documentation Read The Doc
💾 Installation unsloth/README.md
  Twitter (aka X) Follow us on X
🥇 Benchmarking Performance Tables
🌐 Released Models Unsloth Releases
✍️ Blog Read our Blogs

⭐ Key Features

  • All kernels written in OpenAI's Triton language. Manual backprop engine.
  • 0% loss in accuracy - no approximation methods - all exact.
  • No change of hardware. Supports NVIDIA GPUs since 2018+. Minimum CUDA Capability 7.0 (V100, T4, Titan V, RTX 20, 30, 40x, A100, H100, L40 etc) Check your GPU! GTX 1070, 1080 works, but is slow.
  • Works on Linux and Windows via WSL.
  • Supports 4bit and 16bit QLoRA / LoRA finetuning via bitsandbytes.
  • Open source trains 5x faster - see Unsloth Pro for 30x faster training!
  • If you trained a model with 🦥Unsloth, you can use this cool sticker!  

🥇 Performance Benchmarking

1 A100 40GB 🤗Hugging Face Flash Attention 🦥Unsloth Open Source 🦥Unsloth Pro
Alpaca 1x 1.04x 1.98x 15.64x
LAION Chip2 1x 0.92x 1.61x 20.73x
OASST 1x 1.19x 2.17x 14.83x
Slim Orca 1x 1.18x 2.22x 14.82x
Free Colab T4 Dataset 🤗Hugging Face Pytorch 2.1.1 🦥Unsloth 🦥 VRAM reduction
Llama-2 7b OASST 1x 1.19x 1.95x -43.3%
Mistral 7b Alpaca 1x 1.07x 1.56x -13.7%
Tiny Llama 1.1b Alpaca 1x 2.06x 3.87x -73.8%
DPO with Zephyr Ultra Chat 1x 1.09x 1.55x -18.6%

View on GitHub

Suggested labels

@irthomasthomas irthomasthomas added Algorithms Sorting, Learning or Classifying. All algorithms go here. MachineLearning ML Models, Training and Inference Models LLM and ML model repos and links Papers Research papers Research personal research notes for a topic Software2.0 Software development driven by AI and neural networks. labels Feb 27, 2024
@irthomasthomas
Copy link
Owner Author

Related issues

#134: marker: Convert PDF to markdown quickly with high accuracy

### DetailsSimilarity score: 0.9 - [ ] [https://github.com/VikParuchuri/marker#readme](https://github.com/VikParuchuri/marker#readme)

Marker

Marker converts PDF, EPUB, and MOBI to markdown. It's 10x faster than nougat, more accurate on most documents, and has low hallucination risk.

  • Support for a range of PDF documents (optimized for books and scientific papers)
  • Removes headers/footers/other artifacts
  • Converts most equations to latex
  • Formats code blocks and tables
  • Support for multiple languages (although most testing is done in English). See settings.py for a language list.
  • Works on GPU, CPU, or MPS
More Details

How it works

Marker is a pipeline of deep learning models:

Relying on autoregressive forward passes to generate text is slow and prone to hallucination/repetition. From the nougat paper: We observed [repetition] in 1.5% of pages in the test set, but the frequency increases for out-of-domain documents. In my anecdotal testing, repetitions happen on 5%+ of out-of-domain (non-arXiv) pages.

Nougat is an amazing model, but I wanted a faster and more general purpose solution. Marker is 10x faster and has low hallucination risk because it only passes equation blocks through an LLM forward pass.

Examples

PDF Type Marker Nougat
Think Python Textbook View View
Think OS Textbook View View
Switch Transformers arXiv paper View View
Multi-column CNN arXiv paper View View

Performance

Benchmark overall

The above results are with marker and nougat setup so they each take ~3GB of VRAM on an A6000.

See below for detailed speed and accuracy benchmarks, and instructions on how to run your own benchmarks.

Limitations

PDF is a tricky format, so marker will not always work perfectly. Here are some known limitations that are on the roadmap to address:

  • Marker will convert fewer equations to latex than nougat. This is because it has to first detect equations, then convert them without hallucation.
  • Whitespace and indentations are not always respected.
  • Not all lines/spans will be joined properly.
  • Only languages similar to English (Spanish, French, German, Russian, etc) are supported. Languages with different character sets (Chinese, Japanese, Korean, etc) are not.
  • This works best on digital PDFs that won't require a lot of OCR. It's optimized for speed, and limited OCR is used to fix errors.

Installation

This has been tested on Mac and Linux (Ubuntu and Debian). You'll need python 3.9+ and poetry.

First, clone the repo:

  • git clone https://github.com/VikParuchuri/marker.git
  • cd marker

Linux

  • Install system requirements
    • Optional: Install tesseract 5 by following these instructions or running scripts/install/tesseract_5_install.sh.
    • Install ghostscript > 9.55 by following these instructions or running scripts/install/ghostscript_install.sh.
    • Install other requirements with cat scripts/install/apt-requirements.txt | xargs sudo apt-get install -y
  • Set the tesseract data folder path
    • Find the tesseract data folder tessdata with find / -name tessdata. Make sure to use the one corresponding to the latest tesseract version if you have multiple.
    • Create a local.env file in the root marker folder with TESSDATA_PREFIX=/path/to/tessdata inside it
  • Install python requirements
    • poetry install
    • poetry shell to activate your poetry venv
  • Update pytorch since poetry doesn't play nicely with it
    • GPU only: run pip install torch to install other torch dependencies.
    • CPU only: Uninstall torch, then follow the CPU install instructions.

Mac

  • Install system requirements from scripts/install/brew-requirements.txt
  • Set the tesseract data folder path
    • Find the tesseract data folder tessdata with brew list tesseract
    • Create a local.env file in the root marker folder with TESSDATA_PREFIX=/path/to/tessdata inside it
  • Install python requirements
    • poetry install
    • poetry shell to activate your poetry venv

Usage

First, some configuration:

  • Set your torch device in the local.env file. For example, TORCH_DEVICE=cuda or TORCH_DEVICE=mps. cpu is the default.
    • If using GPU, set INFERENCE_RAM to your GPU VRAM (per GPU). For example, if you have 16 GB of VRAM, set INFERENCE_RAM=16.
    • Depending on your document types, marker's average memory usage per task can vary slightly. You can configure VRAM_PER_TASK to adjust this if you notice tasks failing with GPU out of memory errors.
  • Inspect the other settings in marker/settings.py. You can override any settings in the local.env file, or by setting environment variables.
    • By default, the final editor model is off. Turn it on with ENABLE_EDITOR_MODEL.
    • By default, marker will use ocrmypdf for OCR, which is slower than base tesseract, but higher quality. You can change this with the OCR_ENGINE setting.

Convert a single file

Run convert_single.py, like this:

python convert_single.py /path/to/file.pdf /path/to/output.md --parallel_factor 2 --max_pages 10
  • --parallel_factor is how much to increase batch size and parallel OCR workers by. Higher numbers will take more VRAM and CPU, but process faster. Set to 1 by default.
  • --max_pages is the maximum number of pages to process. Omit this to convert the entire document.

Make sure the DEFAULT_LANG setting is set appropriately for your document.

Convert multiple files

Run convert.py, like this:

python convert.py /path/to/input/folder /path/to/output/folder --workers 10 --max 10 --metadata_file /path/to/metadata.json --min_length 10000
  • --workers is the number of pdfs to convert at once. This is set to 1 by default, but you can increase it to increase throughput, at the cost of more CPU/GPU usage. Parallelism will not increase beyond INFERENCE_RAM / VRAM_PER_TASK if you're using GPU.
  • --max is the maximum number of pdfs to convert. Omit this to convert all pdfs in the folder.
  • --metadata_file is an optional path to a json file with metadata about the pdfs. If you provide it, it will be used to set the language for each pdf. If not, DEFAULT_LANG will be used. The format is:
  • --min_length is the minimum number of characters that need to be extracted from a pdf before it will be considered for processing. If you're processing a lot of pdfs, I recommend setting this to avoid OCRing pdfs that are mostly images. (slows everything down)
{
  "pdf1.pdf": {"language": "English"},
  "pdf2.pdf": {"language": "Spanish"},
  ...
}

Convert multiple files on multiple GPUs

Run chunk_convert.sh, like this:

MIN_LENGTH=10000 METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=15 bash chunk_convert.sh ../pdf_in ../md_out
  • METADATA_FILE is an optional path to a json file with metadata about the pdfs. See above for the format.
  • NUM_DEVICES is the number of GPUs to use. Should be 2 or greater.
  • NUM_WORKERS is the number of parallel processes to run on each GPU. Per-GPU parallelism will not increase beyond INFERENCE_RAM / VRAM_PER_TASK.
  • MIN_LENGTH is the minimum number of characters that need to be extracted from a pdf before it will be considered for processing. If you're processing a lot of pdfs, I recommend setting this to avoid OCRing pdfs that are mostly images. (slows everything down)

Benchmarks

Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I convert the latex to text, and compare the reference to the output of text extraction methods.

Benchmarks show that marker is 10x faster than nougat, and more accurate outside arXiv (nougat was trained on arXiv data). We show naive text extraction (pulling text out of the pdf with no processing) for comparison.

Speed

Method Average Score Time per page Time per document
naive 0.350727 0.00152378 0.326524
marker 0.641062 0.360622 77.2762
nougat 0.629211 3.77259 808.413

Accuracy

First 3 are non-arXiv books, last 3 are arXiv papers.

Method switch_trans.pdf crowd.pdf multicolcnn.pdf thinkos.pdf thinkdsp.pdf thinkpython.pdf
naive 0.244114 0.140669 0.0868221 0.366856 0.412521 0.468281
marker 0.482091 0.466882 0.537062 0.754347 0.78825 0.779536
nougat 0.696458 0.552337 0.735099 0.655002 0.645704 0.650282

Peak GPU memory usage during the benchmark is 3.3GB for nougat, and 3.1GB for marker. Benchmarks were run on an A6000.

Throughput

Marker takes about 2GB of VRAM on average per task, so you can convert 24 documents in parallel on an A6000.

Benchmark results

Running your own benchmarks

You can benchmark the performance of marker on your machine. First, download the benchmark data here and unzip.

Then run benchmark.py like this:

python benchmark.py data/pdfs data/references report.json --nougat

This will benchmark marker against other text extraction methods. It sets up batch sizes for nougat and marker to use a similar amount of GPU RAM for each.

Omit --nougat to exclude nougat from the benchmark. I don't recommend running nougat on CPU, since it is very slow.

Commercial usage

Due to the licensing of the underlying models like layoutlmv3 and nougat, this is only suitable for noncommercial usage.

I'm building a version that can be used commercially, by stripping out the dependencies below. If you would like to get early access, email me at marker@vikas.sh.

Here are the non-commercial/restrictive dependencies:

Other dependencies/datasets are openly licensed (doclaynet, byt5), or used in a way that is compatible with commercial usage (ghostscript).

Thanks

This work would not have been possible without amazing open source models and datasets, including (but not limited to):

  • Nougat from Meta
  • Layoutlmv3 from Microsoft
  • DocLayNet from IBM
  • ByT5 from Google

Thank you to the authors of these models and datasets for making them available to the community!

#456: Baseline benchmark for 17 coding models : r/LocalLLaMA

### DetailsSimilarity score: 0.89 - [ ] [Baseline benchmark for 17 coding models : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/19fc4uf/baseline_benchmark_for_17_coding_models/)

Baseline Benchmark for 17 Coding Models

Discussion

I am currently working on implementing some ideas for coding models inference strategies (prompting, control, context exploration, CoT, ToT, etc) and I needed a baseline benchmark on a bunch of models. Since I work on a 3060 12GB, I was limited in what I can test so I went for every model that is 7/13B and has an AWQ quant available, since that is what the inference library that I use supports. I thought I'd share some numbers.

Notes:

  • This is a benchmark for getting a local baseline. I'm interested in improvement from here, so the absolute values are less important for me. Don't take the absolute values too seriously. (well, maybe except deepseek-coder-1.3b, that is a bit suspect).
  • I used the HumanEval dataset. This is superseded by HumanEval+ and other more recent benchmarks. I chose this because it was the first one I tried. Again, with my tests I'm looking for improvements over the baseline, so this is mostly fine.
  • AWQ quant is not the best out there, but all my tests will be done with this quant, so for me it is OK.
  • Temp tests were done in only one generation. In general you'd want to average the score over many generations at a given temp.
  • Each model was prompted according to the model card template. Here's an example for the codellama series -
f"""<s>You are a helpful and respectful assistant. Answer the following question: {question}"""

Results

I've plotted the results (with horrendous contrasting colors, but alas) to look for any interesting patterns in problem solving. You can find the plots here.

Model Temp Correct / 164 Percentage
TheBloke/Mistral-7B-Instruct-v0.2-AWQ 0.0 67 0.40853658536585363
TheBloke/Mistral-7B-Instruct-v0.2-AWQ 0.1 63 0.38414634146341464
TheBloke/Mistral-7B-Instruct-v0.2-AWQ 0.2 68 0.4146341463414634
TheBloke/Mistral-7B-Instruct-v0.2-AWQ 0.3 61 0.3719512195121951
TheBloke/Mistral-7B-Instruct-v0.2-AWQ 0.4 61 0.3719512195121951
TheBloke/Mistral-7B-Instruct-v0.2-AWQ 0.5 63 0.38414634146341464
TheBloke/Mistral-7B-Instruct-v0.2-AWQ 0.6 54 0.32926829268292684
TheBloke/Mistral-7B-Instruct-v0.2-AWQ 0.7 61 0.3719512195121951
TheBloke/Mistral-7B-Instruct-v0.2-AWQ 0.8 60 0.36585365853658536
TheBloke/Mistral-7B-Instruct-v0.2-AWQ 0.9 59 0.3597560975609756
TheBloke/Mistral-7B-Instruct-v0.2-AWQ 1.0 65 0.39634146341463417

Suggested labels

{ "label-name": "coding-models", "description": "Discussion and benchmark of coding models implementation strategies.", "confidence": 96.82 }

#160: sid321axn/tinyllama-text2sql-finetuned at main

### DetailsSimilarity score: 0.89 ## tiny-llama-text2sql ## safetensors - [ ] [sid321axn/tinyllama-text2sql-finetuned at main](https://huggingface.co/sid321axn/tinyllama-text2sql-finetuned/tree/main)

adapter

https://huggingface.co/sid321axn/tiny-llama-text2sql

This model is a fine-tuned version of PY007/TinyLlama-1.1B-Chat-v0.3 on the None dataset.

{
 "_name_or_path": "PY007/TinyLlama-1.1B-Chat-v0.3",
 "architectures": [
   "LlamaForCausalLM"
 ],
 "attention_bias": false,
 "attention_dropout": 0.0,
 "bos_token_id": 1,
 "eos_token_id": 2,
 "hidden_act": "silu",
 "hidden_size": 2048,
 "initializer_range": 0.02,
 "intermediate_size": 5632,
 "max_position_embeddings": 2048,
 "model_type": "llama",
 "num_attention_heads": 32,
 "num_hidden_layers": 22,
 "num_key_value_heads": 4,
 "pretraining_tp": 1,
 "rms_norm_eps": 1e-05,
 "rope_scaling": null,
 "rope_theta": 10000.0,
 "tie_word_embeddings": false,
 "torch_dtype": "float16",
 "transformers_version": "4.37.0.dev0",
 "use_cache": false,
 "vocab_size": 32003
}
```</details>


### #499: marella/ctransformers: Python bindings for the Transformer models implemented in C/C++ using GGML library.
<details><summary>### Details</summary>Similarity score: 0.89
- [ ] [marella/ctransformers: Python bindings for the Transformer models implemented in C/C++ using GGML library.](https://github.com/marella/ctransformers?tab=readme-ov-file#gptq)

# CTransformers

[![PyPI version](https://badge.fury.io/py/ctransformers.svg)](https://badge.fury.io/py/ctransformers)
[![Documentation](https://readthedocs.org/images/button/readthedocs-ci.svg)](https://ctransformers.readthedocs.io/)
[![Build and Test](https://github.com/ marella / ctransformers / actions / workflows / build.yml / badge.svg)](https://github.com/marella/ctransformers/actions/workflows/build.yml)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

Python bindings for the Transformer models implemented in C/C++ using GGML library. Also see [ChatDocs](https://github.com/marella/chatdocs)

## Supported Models

| Model  | Model Type | CUDA | Metal |
| ------ | --------- | :--: | :--:  |
| GPT-2   | gpt2      |      |       |
| GPT-J, GPT4All-J | gptj      |      |       |
| GPT-NeoX, StableLM | gpt_neox  |      |       |
| Falcon   | falcon    |   ✅  |       |
| LLaMA, LLaMA 2 | llamai    |   ✅   |   ✅   |
| MPT      | mpt       |   ✅  |       |
| StarCoder, StarChat | gpt_bigcode |   ✅   |       |
| Dolly V2  | dolly-v2   |      |       |
| Replit   | replit    |      |       |

## Installation

To install via `pip`, simply run:

pip install ctransformers


## Usage

It provides a unified interface for all models:

```python
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained("/path/to/ggml-model.bin", model_type="gpt2")

print(llm("AI is going to"))

Run in Google Colab

To stream the output:

for text in llm("AI is going to", stream=True):
    print(text, end="", flush=True)

You can load models from Hugging Face Hub directly:

llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml")

If a model repo has multiple model files (.bin or .gguf files), specify a model file using:

llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", model_file="ggml-model.bin")

🤗 Transformers

Note: This is an experimental feature and may change in the future.

To use with 🤗 Transformers, create the model and tokenizer using:

from ctransformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True)
tokenizer = AutoTokenizer.from_pretrained(model)

Run in Google Colab

You can use 🤗 Transformers text generation pipeline:

from transformers import pipeline

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("AI is going to", max_new_tokens=256))

You can use 🤗 Transformers generation parameters:

pipe("AI is going to", max_new_tokens=256, do_sample=True, temperature=0.8, repetition_penalty=1.1)

You can use 🤗 Transformers tokenizers:

from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True)  # Load model from GGML model repo.
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # Load tokenizer from original model repo.

LangChain

It is integrated into LangChain. See LangChain docs.

GPU

To run some of the model layers on GPU, set the gpu_layers parameter:

llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GGML", gpu_layers=50)

Run in Google Colab

CUDA

Install CUDA libraries using:

pip install ctransformers[cuda]

ROCm

To enable ROCm support, install the ctransformers package using:

CT_HIPBLAS=1 pip install ctransformers --no-binary ctransformers

Metal

To enable Metal support, install the ctransformers package using:

CT_METAL=1 pip install ctransformers --no-binary ctransformers

GPTQ

Note: This is an experimental feature and only LLaMA models are supported using [ExLlama](https
://github.com/TheLastBen/exllama).

Install additional dependencies using:

pip install ctransformers[gptq]

Load a GPTQ model using:

llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GPTQ")

Run in Google Colab

If the model name or path doesn't contain the word gptq, specify model_type="gptq".

It can also be used with LangChain. Low-level APIs are not fully supported.

Documentation

Find the documentation on Read the Docs.

Config

Parameter Type Description Default
top_k int The top-k value to use for sampling 40
top_p float The top-p value to use for sampling 0.95
temperature float The temperature to use for sampling 0.8
repetition_penalty float The repetition penalty to use for sampling 1.1
last_n_tokens int The number of last tokens to use for repetition penalty 64
seed int The seed value to use for sampling tokens -1
max_new_tokens int The maximum number of new tokens to generate 256
stop List A list of sequences to stop generation when encountered None
stream bool Whether to stream the generated text False
reset bool Whether to reset the model state before generating text True
batch_size int The batch size to use for evaluating tokens in a single prompt 8
threads int The number of threads to use for evaluating tokens -1
context_length int The maximum context length to use -1
gpu_layers int The number of layers to run on GPU 0

Find the URL for the model card for GPTQ here.


Made with ❤️ by marella

Suggested labels

null

#386: SciPhi/AgentSearch-V1 · Datasets at Hugging Face

### DetailsSimilarity score: 0.89 - [ ] [SciPhi/AgentSearch-V1 · Datasets at Hugging Face](https://huggingface.co/datasets/SciPhi/AgentSearch-V1)

Getting Started

The AgentSearch-V1 dataset is a comprehensive collection of over one billion embeddings, produced using jina-v2-base. It includes more than 50 million high-quality documents and over 1 billion passages, covering a vast range of content from sources such as Arxiv, Wikipedia, Project Gutenberg, and includes carefully filtered Creative Commons (CC) data. Our team is dedicated to continuously expanding and enhancing this corpus to improve the search experience. We welcome your thoughts and suggestions – please feel free to reach out with your ideas!

To access and utilize the AgentSearch-V1 dataset, you can stream it via HuggingFace with the following Python code:

from datasets import load_dataset
import json
import numpy as np

# To stream the entire dataset:
ds = load_dataset("SciPhi/AgentSearch-V1", data_files="**/*", split="train", streaming=True)

# Optional, stream just the "arxiv" dataset
# ds = load_dataset("SciPhi/AgentSearch-V1", data_files="**/*", split="train", data_files="arxiv/*", streaming=True)

# To process the entries:
for entry in ds:
    embeddings = np.frombuffer(
        entry['embeddings'], dtype=np.float32
    ).reshape(-1, 768)
    text_chunks = json.loads(entry['text_chunks'])
    metadata = json.loads(entry['metadata'])
    print(f'Embeddings:\n{embeddings}\n\nChunks:\n{text_chunks}\n\nMetadata:\n{metadata}')
    break

A full set of scripts to recreate the dataset from scratch can be found here. Further, you may check the docs for details on how to perform RAG over AgentSearch.

Languages

English.

Dataset Structure

The raw dataset structure is as follows:

{
    "url": ...,
    "title": ...,
    "metadata": {"url": "...", "timestamp": "...", "source": "...", "language": "..."},
    "text_chunks": ...,
    "embeddings": ...,
    "dataset": "book" | "arxiv" | "wikipedia" | "stack-exchange" | "open-math" | "RedPajama-Data-V2"
}

Dataset Creation

This dataset was created as a step towards making humanities most important knowledge openly searchable and LLM optimal. It was created by filtering, cleaning, and augmenting locally publicly available datasets.

To cite our work, please use the following:

@software{SciPhi2023AgentSearch,
author = {SciPhi},
title = {AgentSearch [ΨΦ]: A Comprehensive Agent-First Framework and Dataset for Webscale Search},
year = {2023},
url = {https://github.com/SciPhi-AI/agent-search}
}

Source Data

@online{wikidump,
author = "Wikimedia Foundation",
title = "Wikimedia Downloads",
url = "https://dumps.wikimedia.org"
}

@misc{paster2023openwebmath,
title={OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text},
author={Keiran Paster and Marco Dos Santos and Zhangir Azerbayev and Jimmy Ba},
year={2023},
eprint={2310.06786},
archivePrefix={arXiv},
primaryClass={cs.AI}
}

@software{together2023redpajama,
author = {Together Computer},
title = {RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset},
month = April,
year = 2023,
url = {https://github.com/togethercomputer/RedPajama-Data}
}

License

Please refer to the licenses of the data subsets you use.

  • Open-Web (Common Crawl Foundation Terms of Use)
  • Books: the_pile_books3 license and pg19 license
  • ArXiv Terms of Use
  • Wikipedia License
  • StackExchange license on the Internet Archive

Suggested labels

{ "key": "knowledge-dataset", "value": "A dataset with one billion embeddings from various sources, such as Arxiv, Wikipedia, Project Gutenberg, and carefully filtered Creative Commons data" }

#383: deepseek-ai/deepseek-coder-5.7bmqa-base · Hugging Face

### DetailsSimilarity score: 0.88 - [ ] [deepseek-ai/deepseek-coder-5.7bmqa-base · Hugging Face](https://huggingface.co/deepseek-ai/deepseek-coder-5.7bmqa-base)

Deepseek Coder Introduction

Deepseek Coder is a series of code language models, each trained from scratch on 2T tokens with a composition of 87% code and 13% natural language in both English and Chinese. We provide various sizes of the code model, ranging from 1B to 33B versions. Each model is pre-trained on a project-level code corpus with a window size of 16K and an extra fill-in-the-blank task, supporting project-level code completion and infilling. Deepseek Coder achieves state-of-the-art performance among open-source code models on multiple programming languages and various benchmarks.

Key Features

  • Massive Training Data: Trained from scratch on 2T tokens, including 87% code and 13% linguistic data in both English and Chinese languages.
  • Highly Flexible & Scalable: Offered in model sizes of 1.3B, 5.7B, 6.7B, and 33B, enabling users to choose the setup most suitable for their requirements.
  • Superior Model Performance: State-of-the-art performance among publicly available code models on HumanEval, MultiPL-E, MBPP, DS-1000, and APPS benchmarks.
  • Advanced Code Completion Capabilities: A window size of 16K and a fill-in-the-blank task, supporting project-level code completion and infilling tasks.

Model Summary

How to Use

This section provides examples of how to use the Deepseek Coder model for code completion, code insertion, and repository-level code completion tasks.

Code Completion

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda()

input_text = "#write a quick sort algorithm"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Code Insertion

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda()

input_text = """<|begin|>def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[0]
    left = []
    right = []
<|hole|>
    if arr[i] < pivot:
        left.append(arr[i])
    else:
        right.append(arr[i])
return quick_sort(left) + [pivot] + quick_sort(right)<|end|>"""

inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True)[len(input_text):])

Repository Level Code Completion

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda()

input_text = """#utils.py
import torch
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

def load_data():
    iris = datasets.load_iris()
    X = iris.data
    y = iris.target

    # Standardize the data
    scaler = StandardScaler()
    X = scaler.fit_transform(X)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Convert numpy data to PyTorch tensors
    X_train = torch.tensor(X_train, dtype=torch.float32)
    X_test = torch.tensor(X_test, dtype=torch.float32)
    y_train = torch.tensor(y_train, dtype=torch.int64)
    y_test = torch.tensor(y_test, dtype=torch.int64)

     return X_train, X_test, y_train, y_test

def evaluate_predictions(y_test, y_pred):
    return accuracy_score(y_test, y_pred)
#model.py
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

class IrisClassifier(nn.Module):
    def __init__(self):
        super(IrisClassifier, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(4, 16),
            nn.ReLU(),
            nn.Linear(16, 3)
        )

    def forward(self, x):
        return self.fc(x)

    def train_model(self, X_train, y_train, epochs, lr, batch_size):
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.Adam(self.parameters(), lr=lr)

        # Create DataLoader for batches
        dataset = TensorDataset(X_train, y_train)
        dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

        for epoch in range(epochs):
            for batch_X, batch_y in dataloader:
                optimizer.zero_grad()
                outputs = self(batch_X)
                loss = criterion(outputs, batch_y)
                loss.backward()
                optimizer.step()

    def predict(self, X_test):
        with torch.no_grad():
            outputs = self(X_test)
            _, predicted = outputs.max(1)
        return predicted.numpy()
#main.py
from utils import load_data, evaluate_predictions
from model import IrisClassifier as Classifier

def main():
    # Model training and evaluation
"""

inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=140)
print(tokenizer.decode(outputs[0]))

License

This code repository is licensed under the MIT License. The use of Deepseek Coder models is subject to the Model License. DeepSeek Coder supports commercial use.

See the LICENSE-MODEL for more details.

Contact

If you have any questions, please raise an issue or contact us at agi_code@deepseek.com.

Suggested labels

{ "key": "llm-experiments", "value": "Experiments and results related to Large Language Models" } { "key": "AI-Chatbots", "value": "Topics related to advanced chatbot platforms integrating multiple AI models" }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algorithms Sorting, Learning or Classifying. All algorithms go here. finetuning Tools for finetuning of LLMs e.g. SFT or RLHF MachineLearning ML Models, Training and Inference Models LLM and ML model repos and links Papers Research papers Research personal research notes for a topic Software2.0 Software development driven by AI and neural networks.
Projects
None yet
Development

No branches or pull requests

1 participant