Concedo's Dataset Explorer

Easily view and modify JSON and JSONL datasets for training large language models

Features

Easily view and modify JSON and JSONL datasets for training large language models
Supports Alpaca (Instruct), ShareGPT, and Text formats (and more)
Runs fully portable from your web browser, as a single file with zero other dependencies
Browse through your training datasets, with easy search and filter functions to segment your data
Supports searching and filtering with regex search or simple substrings search
Filter multiple samples by contents, length, matches, and number of turns. Allows combining multiple queries for composite results.
Includes an N-gram viewer to inspect selected examples for word frequency and repetition (word cloud)
Allows splitting and merging datasets by selecting desired subsets with different criteria.
Allows easy dataset deduplication
Includes a simple inline editor to modify individual samples or correct typos.
Pick individual samples or bulk-combine groups of them to curate your dataset, and save the results as a new JSON dataset
Fast and efficient, comfortably handles small to medium sized datasets of up to 400 MB. For larger datasets, it's recommended to split them first.
Fully open source, capable of running completely offline (just save the HTML file)

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
LICENSE		LICENSE
README.md		README.md
index.html		index.html