@podarok podarok commented Dec 28, 2025

Summary

Adds progress_format support to datasets, enabling machine-readable JSON progress output similar to huggingface/tokenizers#1921.

Motivation

When using datasets in automated pipelines or UI applications, it's useful to emit machine-readable progress instead of ANSI progress bars. This PR adds the same progress_format option that was implemented in tokenizers.

Changes

New Functions

  • set_progress_format(format: str): Set global progress format
  • get_progress_format() -> str: Get current progress format

Supported Formats

  1. "tqdm" (default): Interactive progress bars
  2. "json": Machine-readable JSON lines to stderr
  3. "silent": No output
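
To illustrate the setter and getter described above, here is a minimal sketch of how a global format setting with validation might look. It is not the PR's actual code: the module-level variable names and the choice to raise ValueError on an unknown format are assumptions made for illustration.

```python
# Hypothetical sketch of a module-level progress format setting.
# _VALID_FORMATS and _progress_format are illustrative names, not the PR's code.
_VALID_FORMATS = ("tqdm", "json", "silent")
_progress_format = "tqdm"  # default: interactive progress bars


def set_progress_format(format: str) -> None:
    """Set the global progress format, rejecting unknown values (assumed behavior)."""
    global _progress_format
    if format not in _VALID_FORMATS:
        raise ValueError(
            f"Invalid progress format {format!r}; expected one of {_VALID_FORMATS}"
        )
    _progress_format = format


def get_progress_format() -> str:
    """Return the currently selected progress format."""
    return _progress_format
```

Keeping the setting in one place means every progress bar created afterwards can consult get_progress_format() instead of taking a per-call argument.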

JSON Format

When progress_format="json", one JSON line is emitted each time progress advances by 5%, and again on completion:

{"stage":"Processing","current":50,"total":100,"percent":50.0}
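
On the consuming side, a pipeline or UI would typically read these lines from the child process's stderr. The sketch below shows one way to pick progress events out of a mixed stream; the function name parse_progress_lines is hypothetical, and the required keys are taken from the example line above.

```python
import json
from typing import Iterable, Iterator

# Keys taken from the example JSON line in this PR description.
_REQUIRED_KEYS = {"stage", "current", "total", "percent"}


def parse_progress_lines(lines: Iterable[str]) -> Iterator[dict]:
    """Yield progress events from text lines, skipping anything that is not one."""
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # warnings, log output, blank lines
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # looked like JSON but was not
        if isinstance(event, dict) and _REQUIRED_KEYS <= event.keys():
            yield event


sample = [
    '{"stage":"Processing","current":50,"total":100,"percent":50.0}',
    "some unrelated warning",
]
events = list(parse_progress_lines(sample))
```

Skipping non-matching lines matters because stderr usually also carries warnings and log messages alongside the progress events.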

Usage Example

from datasets import load_dataset
from datasets.utils import set_progress_format

# Enable JSON output
set_progress_format("json")

# Progress will now be emitted as JSON lines on stderr
dataset = load_dataset("Goader/kobza", split="train", streaming=True)
for sample in dataset:
    process(sample)  # process() is a placeholder for your own handling

Implementation Details

  • Suppresses visual output using io.StringIO() when format is "json"
  • Keeps progress tracking active (unlike disable=True)
  • Emits JSON to stderr every 5% progress change
  • Exports new functions from datasets.utils
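
The 5% throttling described in the list above can be sketched independently of tqdm. This is an illustration under stated assumptions rather than the PR's implementation: the class name JsonProgressEmitter and its parameters are hypothetical, and only the standard library is used so the logic is easy to follow.

```python
import io
import json
import sys
from typing import Optional, TextIO


class JsonProgressEmitter:
    """Hypothetical helper: write a JSON line every `step_percent` of progress."""

    def __init__(self, stage: str, total: int, step_percent: float = 5.0,
                 stream: Optional[TextIO] = None):
        self.stage = stage
        self.total = total
        self.step_percent = step_percent
        # Resolve stderr at construction time so redirection is respected.
        self.stream = stream if stream is not None else sys.stderr
        self._last_emitted = 0.0  # percent value of the last emitted line

    def update(self, current: int) -> None:
        if self.total <= 0:
            return  # unknown total: a percentage cannot be computed
        percent = 100.0 * current / self.total
        done = current >= self.total
        advanced = percent - self._last_emitted >= self.step_percent
        # Emit on every 5% step, and once more on completion if not yet reported.
        if advanced or (done and percent > self._last_emitted):
            self._last_emitted = percent
            payload = {"stage": self.stage, "current": current,
                       "total": self.total, "percent": round(percent, 1)}
            self.stream.write(json.dumps(payload, separators=(",", ":")) + "\n")


# Demo: capture output in a buffer instead of stderr.
buffer = io.StringIO()
emitter = JsonProgressEmitter("Test", total=100, stream=buffer)
for i in range(1, 101):
    emitter.update(i)
lines = buffer.getvalue().splitlines()
```

Throttling on percentage rather than on every update keeps the output small for large totals while still guaranteeing a final line at completion.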

Cross-Reference

This implementation mirrors the approach from huggingface/tokenizers#1921.

Testing

Tested with:

from datasets.utils import set_progress_format, tqdm

set_progress_format('json')
for i in tqdm(range(100), desc='Test'):
    process(i)  # process() is a placeholder for any per-item work
# Emits lines such as: {"stage":"Test","current":10,"total":100,"percent":10.0}

Checklist

  • New functions added to datasets.utils.tqdm
  • Functions exported from datasets.utils.__init__
  • JSON format emits to stderr
  • Visual output suppressed when format="json"
  • Progress tracking remains active
  • Cross-referenced with tokenizers#1921

Similar to huggingface/tokenizers#1921, adds machine-readable JSON progress output.

- Add set_progress_format() and get_progress_format() functions
- Support 'tqdm' (default), 'json', and 'silent' formats
- Emit JSON progress every 5% when format='json'
- Export new functions from datasets.utils

Cross-reference: huggingface/tokenizers#1921

When disable=True, tqdm doesn't update internal state (self.n).
Use file=StringIO() to suppress visual output while keeping tracking active.

podarok added a commit to podarok/huggingface_hub that referenced this pull request Dec 30, 2025
Add set_progress_format() and get_progress_format() functions to control
progress output format:
- "tqdm" (default): Interactive progress bars
- "json": Machine-readable JSON lines to stderr
- "silent": No progress output

When format is "json", emits progress every 5% as:
{"stage":"Downloading file","current":1024,"total":4096,"percent":25.0}

Similar to huggingface/tokenizers#1921 and huggingface/datasets#7920