GitHub - datadreamer-dev/DataDreamer: DataDreamer: Prompt. Generate Synthetic Data. Train & Align Models. 🤖💤

Prompt. Generate Synthetic Data. Train & Align Models.

DataDreamer is a powerful open-source Python library for prompting, synthetic data generation, and training workflows. It is designed to be simple, extremely efficient, and research-grade.

Installation: `pip3 install datadreamer.dev`

Demo: `demo.py` and the result of `demo.py`. See the full demo script, and see the synthetic dataset and the trained model.
🚀 For more demonstrations and recipes, see the Quick Tour page.

With DataDreamer you can:

💬 Create Prompting Workflows: Create and run multi-step, complex, prompting workflows easily with major open source or API-based LLMs.
📊 Generate Synthetic Datasets: Generate synthetic datasets for novel tasks or augment existing datasets with LLMs.
⚙️ Train Models: Align models. Fine-tune models. Instruction-tune models. Distill models. Train on existing data or synthetic data.
... learn more about what's possible in the Overview Guide
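
As a rough illustration of what a multi-step prompting workflow looks like conceptually, here is a minimal plain-Python sketch. It does not use DataDreamer's API: `stub_llm`, `generate_step`, and `transform_step` are hypothetical names invented for this example, and the stub simply echoes its prompt instead of calling a real model.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for an LLM call (an assumption for this sketch);
# a real workflow would query an open source or API-based model here.
def stub_llm(prompt: str) -> str:
    return f"[response to: {prompt}]"

def generate_step(llm: Callable[[str], str], instruction: str, n: int) -> List[Dict[str, str]]:
    """Step 1: produce n synthetic rows from a single instruction."""
    return [{"text": llm(f"{instruction} (sample {i})")} for i in range(n)]

def transform_step(
    llm: Callable[[str], str], rows: List[Dict[str, str]], instruction: str
) -> List[Dict[str, str]]:
    """Step 2: derive a new column from each row produced by step 1."""
    return [{**row, "summary": llm(f"{instruction}\n\n{row['text']}")} for row in rows]

# Chain the two steps: the output of the first is the input of the second.
rows = generate_step(stub_llm, "Write an abstract for an NLP paper.", n=3)
dataset = transform_step(stub_llm, rows, "Summarize this abstract in one sentence.")
```

Each step takes the previous step's rows as input, which is the property that lets a workflow grow from prompting into dataset generation and, later, training.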

DataDreamer is:

🧩 Simple: Simple and approachable to use with sensible defaults, yet powerful with support for bleeding-edge techniques.
🔬 Research-Grade: Built for researchers, by researchers, but accessible to all. A focus on correctness, best practices, and reproducibility.
🏎️ Efficient: Aggressive caching and resumability built-in. Support for techniques like quantization, parameter-efficient training (LoRA), and more.
🔄 Reproducible: Workflows built with DataDreamer are easily shareable, reproducible, and extendable.
🤝 Makes Sharing Easy: Publishing datasets and models is simple. Automatically generate data cards and model cards with metadata. Generate a list of any citations required.
... learn more about the motivation and design principles behind DataDreamer.
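
To make the idea of caching and resumability concrete, the sketch below shows one common way such behavior can be implemented: each step's result is stored on disk under a key derived from the step name and its arguments, so a rerun skips work that is already done. This is an assumption-laden illustration, not DataDreamer's actual mechanism; `cached_step` and `expensive_generation` are hypothetical names.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def cached_step(cache_dir: Path, name: str, args: dict, fn):
    """Run fn(**args) once; later calls with the same name and args load from disk.

    Returns (result, was_cache_hit).
    """
    key = hashlib.sha256(json.dumps([name, args], sort_keys=True).encode()).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text()), True
    result = fn(**args)
    path.write_text(json.dumps(result))
    return result, False

calls = []  # records how many times the expensive function really runs

def expensive_generation(n: int):
    # Hypothetical placeholder for a slow step such as LLM generation.
    calls.append(n)
    return [f"example {i}" for i in range(n)]

cache_dir = Path(tempfile.mkdtemp())
first, first_hit = cached_step(cache_dir, "generate", {"n": 3}, expensive_generation)
second, second_hit = cached_step(cache_dir, "generate", {"n": 3}, expensive_generation)
```

The second call returns the stored result without invoking `expensive_generation` again. Changing the arguments changes the key, so a modified step is recomputed rather than served stale data.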

Citation

Please cite the DataDreamer paper:

@misc{patel2024datadreamer,
      title={DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows}, 
      author={Ajay Patel and Colin Raffel and Chris Callison-Burch},
      year={2024},
      eprint={2402.10379},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contact

Please reach out to us via email (ajayp@upenn.edu) or on Discord if you have any questions, comments, or feedback.

Thank you to the maintainers at Hugging Face and LiteLLM for accepting contributions necessary for DataDreamer and providing upstream support.

Funding Acknowledgements

ODNI, IARPA: This research is supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via the HIATUS Program contract #2022-22072200005. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.
