Add Nemotron Terminal Corpus dataset by neubig · Pull Request #165 · neulab/agent-data-protocol

neubig · 2026-02-25T15:10:42Z

Description

This PR adds support for the NVIDIA Nemotron Terminal Corpus dataset, a large-scale Supervised Fine-Tuning (SFT) dataset designed to scale terminal interaction capabilities of Large Language Models (LLMs).

About the Dataset

Terminal-Corpus was developed by NVIDIA using the Terminal-Task-Gen pipeline, which combines dataset adaptation with synthetic task generation across diverse domains. The dataset contains approximately 366k high-quality execution trajectories for terminal agents.

Source: https://huggingface.co/datasets/nvidia/Nemotron-Terminal-Corpus

Files Added

README.md - Dataset documentation including description and citation
schema_raw.py - Pydantic schema for raw data validation
extract_raw.py - Script for downloading data from HuggingFace
raw_to_standardized.py - Script for converting raw data to standardized format
sample_raw.json - 3 example trajectories in raw format
sample_std.json - Standardized data format
sample_sft.json - OpenHands SFT format

Testing

All existing tests pass with the new dataset:

✅ test_dataset_structure.py
✅ test_raw_schemas.py
✅ test_standardized_schemas.py
✅ test_std_to_sft_conversion.py

Fixes #164

This PR adds support for the NVIDIA Nemotron Terminal Corpus dataset, a large-scale SFT dataset designed to scale terminal interaction capabilities of LLMs.

Dataset includes:
- README.md with dataset documentation and citation
- schema_raw.py defining the raw data schema
- extract_raw.py for downloading data from HuggingFace
- raw_to_standardized.py for converting to standardized format
- sample_raw.json with 3 example trajectories
- sample_std.json with standardized data
- sample_sft.json in OpenHands SFT format

Closes #164

Co-authored-by: openhands <openhands@all-hands.dev>

This change adds proper support for chain-of-thought reasoning and think tags following conventions from Harbor ATIF and Agent Client Protocol:

Schema changes:
- Add reasoning_content field to base Action class
- Update SCHEMA.md documentation to explain the new field

Dataset updates:
- nemotron_terminal_corpus: Use reasoning_content for <think> blocks
- toucan_1_5m: Use reasoning_content for reasoning_content in raw data

SFT conversion updates:
- openhands/std_to_sft.py: Handle reasoning_content in CodeAction, ApiAction, MessageAction
- sweagent/std_to_sft.py: Handle reasoning_content with helper function

The reasoning_content field is separate from the description field:
- reasoning_content: Extended chain-of-thought reasoning (<think> blocks)
- description: Brief action description/summary

Co-authored-by: openhands <openhands@all-hands.dev>
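To illustrate the separation between reasoning_content and description, here is a minimal sketch of how a raw assistant turn containing a <think> block might be split into the two parts. The `Action` dataclass and the `split_think` helper are hypothetical stand-ins written for this example; the repository's actual base Action class and the conversion logic in raw_to_standardized.py may look different.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


# Hypothetical stand-in for the base Action class (assumption: the real
# schema has more fields and may not be a dataclass).
@dataclass
class Action:
    description: Optional[str] = None        # brief action summary
    reasoning_content: Optional[str] = None  # extended chain-of-thought


_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_think(text: str) -> Tuple[Optional[str], str]:
    """Return (reasoning, remainder) for one raw assistant turn.

    reasoning is the joined content of all <think> blocks, or None if
    there are none; remainder is the text with those blocks removed.
    """
    blocks = [block.strip() for block in _THINK_RE.findall(text)]
    reasoning = "\n\n".join(b for b in blocks if b) or None
    remainder = _THINK_RE.sub("", text).strip()
    return reasoning, remainder


reasoning, rest = split_think("<think>List files first.</think>\nls -la")
action = Action(description=None, reasoning_content=reasoning)
```

In this sketch the text outside the <think> block (`rest`) would become the action's content, while the reasoning is stored on its own field rather than being folded into description.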

Co-authored-by: openhands <openhands@all-hands.dev>

…ample files

Changes:
- agents/openhands/std_to_sft.py: Add _build_thought_text() helper that wraps reasoning_content in <think> tags while keeping description as plain text
- agents/sweagent/std_to_sft.py: Same helper function for consistency
- datasets/nemotron_terminal_corpus/sample_sft.json: Regenerated with proper function_call and observation labels (not converted to gpt/human)
- datasets/nemotron_terminal_corpus/sample_sft/: Agent-specific SFT samples

The <think> tags preserve the training signal so models learn to produce chain-of-thought reasoning in the expected format.

Co-authored-by: openhands <openhands@all-hands.dev>
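The behavior described for the thought-text helper can be sketched roughly as follows. This is an illustration only, not the code from this PR: the function name `build_thought_text`, its signature, and the exact whitespace between the <think> block and the description are assumptions.

```python
from typing import Optional


def build_thought_text(
    reasoning_content: Optional[str],
    description: Optional[str],
) -> str:
    """Combine reasoning and description into one thought string.

    reasoning_content is wrapped in <think> tags; description is kept
    as plain text. Missing or empty parts are skipped.
    """
    parts = []
    if reasoning_content:
        parts.append(f"<think>\n{reasoning_content.strip()}\n</think>")
    if description:
        parts.append(description.strip())
    # Assumption: parts are separated by a blank line.
    return "\n\n".join(parts)


thought = build_thought_text("Check the working directory.", "Run pwd")
```

Keeping the tags in the SFT output, rather than flattening the reasoning into plain text, is what lets a fine-tuned model reproduce the same <think> format at inference time.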

openhands-ai bot mentioned this pull request Feb 25, 2026

Nemotron Terminal Corpus Dataset #164

Open

openhands-agent added 3 commits February 25, 2026 15:39

Fix pre-commit issues: formatting, trailing whitespace, end-of-file

da7d0a8

Co-authored-by: openhands <openhands@all-hands.dev>

neubig marked this pull request as ready for review February 26, 2026 19:37

neubig requested a review from yueqis February 26, 2026 19:44

neubig wants to merge 4 commits into main from openhands/add-nemotron-terminal-corpus-dataset


2 participants
