Add Nemotron Terminal Corpus dataset #165

Open
neubig wants to merge 4 commits into main from openhands/add-nemotron-terminal-corpus-dataset

neubig commented Feb 25, 2026

Description

This PR adds support for the NVIDIA Nemotron Terminal Corpus dataset, a large-scale Supervised Fine-Tuning (SFT) dataset designed to scale terminal interaction capabilities of Large Language Models (LLMs).

About the Dataset

Terminal-Corpus was developed by NVIDIA using the Terminal-Task-Gen pipeline, which combines dataset adaptation with synthetic task generation across diverse domains. The dataset contains approximately 366k high-quality execution trajectories for terminal agents.

Source: https://huggingface.co/datasets/nvidia/Nemotron-Terminal-Corpus

Files Added

  • README.md - Dataset documentation including description and citation
  • schema_raw.py - Pydantic schema for raw data validation
  • extract_raw.py - Script for downloading data from HuggingFace
  • raw_to_standardized.py - Script for converting raw data to standardized format
  • sample_raw.json - 3 example trajectories in raw format
  • sample_std.json - Standardized data format
  • sample_sft.json - OpenHands SFT format

Testing

All existing tests pass with the new dataset:

  • test_dataset_structure.py
  • test_raw_schemas.py
  • test_standardized_schemas.py
  • test_std_to_sft_conversion.py

Fixes #164

This PR adds support for the NVIDIA Nemotron Terminal Corpus dataset,
a large-scale SFT dataset designed to scale terminal interaction
capabilities of LLMs.

Dataset includes:
- README.md with dataset documentation and citation
- schema_raw.py defining the raw data schema
- extract_raw.py for downloading data from HuggingFace
- raw_to_standardized.py for converting to standardized format
- sample_raw.json with 3 example trajectories
- sample_std.json with standardized data
- sample_sft.json in OpenHands SFT format

Closes #164

Co-authored-by: openhands <openhands@all-hands.dev>
This change adds proper support for chain-of-thought reasoning and think tags
following conventions from Harbor ATIF and Agent Client Protocol:

Schema changes:
- Add reasoning_content field to base Action class
- Update SCHEMA.md documentation to explain the new field

Dataset updates:
- nemotron_terminal_corpus: Use reasoning_content for <think> blocks
- toucan_1_5m: Map the raw data's reasoning_content field to the new reasoning_content field

SFT conversion updates:
- openhands/std_to_sft.py: Handle reasoning_content in CodeAction, ApiAction, MessageAction
- sweagent/std_to_sft.py: Handle reasoning_content with helper function

The reasoning_content field is separate from the description field:
- reasoning_content: Extended chain-of-thought reasoning (<think> blocks)
- description: Brief action description/summary
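
As a rough illustration of the split described above, the sketch below pulls <think> blocks out of a raw assistant turn and stores them in reasoning_content, leaving description untouched. It is not the code from this PR: the Action dataclass, the content field, and the split_think() helper are hypothetical stand-ins (the real base Action class in the repository may be defined differently), and joining multiple <think> blocks with a blank line is an assumption.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Matches <think>...</think>, including blocks that span several lines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


@dataclass
class Action:
    """Hypothetical stand-in for the base Action class (not the real schema)."""

    content: str                              # what the agent actually does or says
    description: Optional[str] = None         # brief action description/summary
    reasoning_content: Optional[str] = None   # extended chain-of-thought reasoning


def split_think(raw_text: str) -> Action:
    """Move <think> blocks into reasoning_content and keep the rest as content."""
    blocks = [block.strip() for block in THINK_RE.findall(raw_text)]
    content = THINK_RE.sub("", raw_text).strip()
    # Assumption: several <think> blocks are joined with a blank line.
    reasoning = "\n\n".join(block for block in blocks if block) or None
    return Action(content=content, reasoning_content=reasoning)
```

With this sketch, split_think("<think>list files first</think>\nls -la") yields an Action whose content is "ls -la" and whose reasoning_content is "list files first"; input without <think> tags leaves reasoning_content as None.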

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
…ample files

Changes:
- agents/openhands/std_to_sft.py: Add _build_thought_text() helper that wraps
  reasoning_content in <think> tags while keeping description as plain text
- agents/sweagent/std_to_sft.py: Same helper function for consistency
- datasets/nemotron_terminal_corpus/sample_sft.json: Regenerated with proper
  function_call and observation labels (not converted to gpt/human)
- datasets/nemotron_terminal_corpus/sample_sft/: Agent-specific SFT samples

The <think> tags preserve the training signal so models learn to produce
chain-of-thought reasoning in the expected format.
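
The behaviour described for _build_thought_text() could look roughly like the sketch below. This is an illustration under stated assumptions, not the helper from this PR: the function name build_thought_text_sketch, its signature, the placement of the <think> block before the description, and the newline layout inside the tags are all guesses.

```python
from typing import Optional


def build_thought_text_sketch(
    description: Optional[str],
    reasoning_content: Optional[str],
) -> str:
    """Wrap reasoning_content in <think> tags; keep description as plain text."""
    parts = []
    if reasoning_content:
        # Assumption: the reasoning sits on its own lines inside the tags.
        parts.append(f"<think>\n{reasoning_content.strip()}\n</think>")
    if description:
        # The description is emitted unchanged apart from surrounding whitespace.
        parts.append(description.strip())
    # Assumption: reasoning comes first, separated from the description by a blank line.
    return "\n\n".join(parts)
```

Called with only a description, the sketch returns that description unchanged; called with neither argument set, it returns an empty string.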

Co-authored-by: openhands <openhands@all-hands.dev>
neubig marked this pull request as ready for review February 26, 2026 19:37
neubig requested a review from yueqis February 26, 2026 19:44
Development

Successfully merging this pull request may close these issues.

Nemotron Terminal Corpus Dataset
