Easily view and modify JSON and JSONL datasets for training large language models
- Easily view and modify JSON and JSONL datasets for training large language models
- Supports Alpaca (Instruct), ShareGPT, and Text formats (and more)
- Runs fully portable from your web browser, as a single file with zero other dependencies
- Browse through your training datasets, with easy search and filter functions to segment your data
- Supports searching and filtering with regex search or simple substrings search
- Filter multiple samples by contents, length, matches, and number of turns. Allows combining multiple queries for composite results.
- Includes an N-gram viewer to inspect selected examples for word frequency and repetition (word cloud)
- Allows splitting and merging datasets by selecting desired subsets with different criteria.
- Allows easy dataset deduplication
- Includes a simple inline editor to modify individual samples or correct typos.
- Pick individual samples or bulk-combine groups of them to curate your dataset, and save the results as a new JSON dataset
- Fast and efficient, comfortably handles small to medium sized datasets of up to 400 MB. For larger datasets, it's recommended to split them first.
- Fully open source, capable of running completely offline (just save the HTML file)
Free and open source. Try now at https://lostruins.github.io/datasetexplorer
- JSON > Parquet
- Alpaca > ChatML
- Kobo > !Kobo