The first analytically native Steam dataset built on a multi-modal database architecture for advanced data science workflows: 239,664 applications with semantic search, graph analysis, and comprehensive metadata.
Steam Dataset 2025 moves beyond traditional CSV exports to enable semantic search, graph analysis, and machine learning applications that are impractical with flat-file datasets. It was built exclusively from official Steam Web APIs using the systematic RAVGV methodology, with complete transparency and reproducibility.
- Download Dataset (Zenodo) - Complete dataset with DOI citation
- Explore on Kaggle - Interactive notebooks and community discussions (Coming Soon)
- Dataset Card - Complete methodology and academic documentation
- Data Dictionary - Comprehensive field specifications
- Jupyter Notebooks - 3 production-ready analysis examples with PDF exports
- Sample Visualizations - Production analytics gallery
- Python Scripts - Complete ETL pipeline and processing code
- Getting Started Guide - From download to first analysis
- Database Schema - PostgreSQL 16 with pgvector implementation
- Vector Embeddings - BGE-M3 semantic search setup
- Multi-Modal Architecture - Hybrid database design patterns
| Metric | Value | Coverage |
|---|---|---|
| Total Applications | 239,664 | Complete accessible Steam catalog |
| User Reviews | 1,048,148 | Full review corpus with metadata |
| Developers | 54,321 | Complete developer ecosystem |
| Publishers | 39,876 | Full publisher network |
| Temporal Range | 1997-2025 | 28 years of platform evolution |
| Success Rate | 56% | Transparent quality metrics |
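The 56% figure appears consistent with the "134,000+ successful retrievals" quoted elsewhere in this README, assuming the success rate is defined as successful metadata retrievals divided by total applications. That definition is an assumption, and 134,000 is a lower bound rather than an exact count. A quick arithmetic check under those assumptions:

```python
# Assumption: success rate = successful metadata retrievals / total applications.
TOTAL_APPLICATIONS = 239_664     # from the metrics table
SUCCESSFUL_RETRIEVALS = 134_000  # lower bound ("134,000+") quoted in this README

success_rate = SUCCESSFUL_RETRIEVALS / TOTAL_APPLICATIONS
print(f"Approximate success rate: {success_rate:.1%}")
```

The result rounds to the 56% reported in the table.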
- API-Pure Methodology - Exclusively official Steam Web APIs, no scraping or estimates
- Semantic Search - 1024-dimensional BGE-M3 embeddings for content-based discovery
- Multi-Modal Database - PostgreSQL + JSONB + vector embeddings in unified architecture
- Graph Analysis Ready - Publisher/developer relationships and collaboration networks
- Academic Standards - Complete transparency, reproducibility, and peer-review documentation
- Production Tested - 134,000+ successful retrievals with comprehensive validation
Visual insights from 239,664 Steam applications demonstrating analytical capabilities
View Complete Gallery - 12 production-scale visualizations
Steam Dataset 2025 employs a sophisticated multi-modal database architecture combining relational, document, and vector storage.
graph TD
A[Steam Web API] --> B[ETL Pipeline<br/>239,664 Apps]
B --> C[PostgreSQL 16<br/>Relational Tables]
B --> D[JSONB Storage<br/>Semi-Structured Data]
B --> E[pgvector<br/>BGE-M3 Embeddings]
C --> F[Analytics Layer]
D --> F
E --> F
F --> G[CSV/Parquet<br/>Exports]
F --> H[Semantic Search<br/>Applications]
F --> I[Graph Analysis<br/>Networks]
style A fill:#1b2838,color:#fff
style C fill:#336791,color:#fff
style E fill:#00d084,color:#fff
style F fill:#4ecdc4,color:#fff
- Relational Layer: Normalized entities with referential integrity for traditional SQL queries
- Document Layer: JSONB preserves complete API responses without information loss
- Vector Layer: Semantic embeddings enable content-based search and similarity analysis
- Analytics Layer: Materialized columns and optimized indexes for sub-second query performance
Read Full Architecture Guide
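To make the vector layer concrete, here is a minimal sketch of content-based search as cosine-similarity ranking. The 4-dimensional vectors and game names are invented stand-ins for the 1024-dimensional BGE-M3 embeddings, and `top_k` is a hypothetical helper, not part of the project's code. In the database itself this ranking would presumably be delegated to pgvector (for example via its cosine distance operator `<=>`) rather than computed in Python.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query, embeddings, k=2):
    """Return the k names whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in embeddings.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Toy data: hypothetical games with made-up 4-dimensional embeddings.
toy_embeddings = {
    "space_shooter": [0.9, 0.1, 0.0, 0.2],
    "farming_sim": [0.0, 0.8, 0.6, 0.1],
    "space_strategy": [0.8, 0.0, 0.3, 0.3],
}
query_vector = [1.0, 0.0, 0.1, 0.2]

nearest = top_k(query_vector, toy_embeddings, k=2)
print(nearest)
```

Both space-themed entries rank above the farming game even though they would belong to different genres, which is the kind of cross-genre discovery the vector layer is meant to support.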
steam-dataset-2025/
├── assets/  # Visualizations and sponsor materials
│   ├── sponsors/  # Sponsor logos and acknowledgments
│   └── steam-fulldataset-dataset-plots-initial/  # Production charts
├── data/  # Dataset access and samples
│   ├── 01_raw/  # Original API responses (5K sample)
│   ├── 02_processed/  # Cleaned and validated data
│   ├── 03_enriched/  # Vector embeddings and features
│   └── 04_analytics/  # Final export packages
├── docs/  # Technical documentation
│   ├── analytics/  # Analysis reports and methodologies
│   ├── methodologies/  # Data collection and processing guides
│   └── citation.md  # Citation guide
├── documentation-standards/  # Template and style guides
├── scripts/  # Complete ETL pipeline (12 phases)
│   ├── 01-dataset-foundations/
│   ├── 02-steam-data-sample/
│   ├── 03-analyze-steam-data-sample/
│   ├── 04-postgresql-schema-analysis/
│   ├── 05-5000-steam-game-dataset-analysis/
│   ├── 06-full-data-set-import/
│   ├── 07-vector-embeddings/
│   ├── 08-materialization-columns/
│   ├── 09-pc-requirements-materialization/
│   ├── 10-pc-requirements-validation/
│   ├── 11-packaging-the-release/
│   └── 12-notebook-generation/
├── steam-dataset-2025-v1/  # Dataset release package
│   ├── DATASET_CARD.md  # Academic documentation
│   ├── DATA_DICTIONARY.md  # Complete field specifications
│   ├── notebook-data/  # Pre-exported notebook datasets
│   │   ├── 01-platform-evolution/
│   │   ├── 02-semantic-game-discovery/
│   │   └── 03-the-semantic-fingerprint/
│   └── notebooks/  # Published Jupyter notebooks with PDFs
│       ├── 01-steam-platform-evolution-and-marketplace/
│       ├── 02-semantic-game-discovery/
│       └── 03-the-semantic-fingerprint/
└── work-logs/  # Complete development history
    ├── 01-dataset-foundations/
    ├── 02-steam-data-sample/
    ├── 03-analyze-steam-data-sample/
    ├── 04-postgresql-schema-analysis/
    ├── 05-5000-steam-game-dataset-analysis/
    ├── 06-full-data-set-import/
    ├── 07-vector-embeddings/
    ├── 08-materialization-columns/
    ├── 09-pc-requirements-materialization/
    └── 10-dataset-accessibility-packages/
- Data Access - Download datasets and access documentation
- Notebooks - Interactive analysis examples with full PDF exports
- Documentation - Complete technical specifications
- Scripts - Full ETL pipeline and processing code
- Work Logs - Development journey and methodology decisions
- Dataset Package - Official release with academic documentation
- Python 3.9+ - Core ETL pipeline and processing infrastructure
- Steam Web API - Official Valve endpoints with comprehensive rate limiting
- Systematic Validation - Multi-stage quality assurance and integrity checking
- Error Handling - Robust retry logic and comprehensive failure tracking
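As an illustration of the retry logic mentioned above, the sketch below retries a failing request with exponential backoff. It is not the project's actual collector: `TransientAPIError`, `fetch_with_retry`, and the fake `flaky_fetch` function are hypothetical names, and the delay schedule is an assumption chosen for readability.

```python
import time


class TransientAPIError(Exception):
    """Hypothetical error type for retryable failures (e.g. HTTP 429 or 5xx)."""


def fetch_with_retry(fetch, app_id, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(app_id), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(app_id)
        except TransientAPIError:
            if attempt == max_attempts:
                raise  # give up and let the caller record the failure
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...


# Demo with a fake fetcher that fails twice before succeeding.
calls = []


def flaky_fetch(app_id):
    calls.append(app_id)
    if len(calls) < 3:
        raise TransientAPIError("simulated rate limit")
    return {"appid": app_id, "name": "Example"}


delays = []  # capture the requested delays instead of actually sleeping
result = fetch_with_retry(flaky_fetch, 10, sleep=delays.append)
print(result, delays)
```

Injecting `sleep` keeps the sketch testable without real waiting; a real collector would also need to distinguish permanent failures (such as delisted applications) from transient ones.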
- PostgreSQL 16.10 - Production database with advanced features
- pgvector 0.5.0 - Vector similarity search with HNSW indexes
- JSONB - Flexible semi-structured data storage preserving API responses
- Materialized Views - Performance optimization for common query patterns
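The interplay between JSONB storage and materialized fields can be sketched in plain Python: keep the full raw response, and copy a few frequently queried values into flat columns. The response shape below is an assumption modeled loosely on a Steam store app-details payload, and the column names (`price_final`, `raw`) are hypothetical rather than taken from the project's schema.

```python
import json

# Hypothetical raw responses keyed by app ID; one succeeded, one did not.
raw_response = json.loads("""
{
  "10": {
    "success": true,
    "data": {
      "name": "Example Game",
      "is_free": false,
      "price_overview": {"currency": "USD", "final": 999}
    }
  },
  "20": {"success": false}
}
""")


def materialize(app_id, payload):
    """Flatten one raw payload into column values; None when retrieval failed."""
    if not payload.get("success"):
        return None
    data = payload.get("data", {})
    price = data.get("price_overview") or {}
    return {
        "appid": int(app_id),
        "name": data.get("name"),
        "is_free": bool(data.get("is_free", False)),
        "price_final": price.get("final"),  # minor currency units in this sketch
        "raw": payload,  # the full document is retained alongside the flat columns
    }


rows = [materialize(app_id, payload) for app_id, payload in raw_response.items()]
flat = [row for row in rows if row is not None]
print(len(flat), flat[0]["name"], flat[0]["price_final"])
```

Keeping the `raw` document means new fields can be extracted later without re-querying the API, while the flat columns serve routine SQL filters.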
- BGE-M3 Embeddings - 1024-dimensional semantic vectors for content analysis
- Sentence Transformers - High-performance embedding generation
- pandas & NumPy - Data manipulation and analytical processing
- matplotlib & seaborn - Publication-quality visualization generation
Three production-ready Jupyter notebooks demonstrate dataset capabilities with complete documentation and PDF exports:
Research Questions:
- How has Steam's catalog evolved across 28 years (1997-2025)?
- Which genres drive platform growth and pricing strategy changes?
- What patterns emerge in cross-platform support (Windows/Mac/Linux)?
View Notebook (.ipynb) | Download PDF
Research Questions:
- How do vector embeddings capture game concepts beyond keywords?
- Can semantic search discover similar games across genre boundaries?
- What representative games best exemplify each major genre?
View Notebook (.ipynb) | Download PDF
Research Questions:
- Can descriptions alone predict genres better than metadata?
- How does class imbalance affect ML performance at scale?
- Which genres are most predictable from text descriptions?
View Notebook (.ipynb) | Download PDF
All notebooks include:
- Pre-exported CSV/Parquet data files (no database required)
- Kaggle-ready with automatic environment detection
- Complete PDF exports for offline reading
- Reproducible code with fixed random seeds
- Publication-quality visualizations
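Environment detection and fixed seeds might look roughly like the sketch below. It assumes Kaggle kernels expose the `KAGGLE_KERNEL_RUN_TYPE` environment variable; the Kaggle input path, the helper names, and the seed value are hypothetical and not taken from the published notebooks.

```python
import os
import random
from pathlib import Path


def resolve_data_dir(environ=os.environ):
    """Pick a data directory depending on where the notebook appears to run."""
    # Assumption: Kaggle sets KAGGLE_KERNEL_RUN_TYPE; the dataset slug is hypothetical.
    if "KAGGLE_KERNEL_RUN_TYPE" in environ:
        return Path("/kaggle/input/steam-dataset-2025")
    # Local fallback: the pre-exported notebook data shipped in the release package.
    return Path("steam-dataset-2025-v1") / "notebook-data"


SEED = 42  # illustrative value; any fixed integer gives repeatable samples


def sample_rows(rows, n, seed=SEED):
    """Draw a reproducible random sample using a dedicated, seeded generator."""
    rng = random.Random(seed)
    return rng.sample(rows, n)


data_dir = resolve_data_dir()
first = sample_rows(list(range(100)), 5)
second = sample_rows(list(range(100)), 5)
print(data_dir, first == second)
```

Using a local `random.Random` instance rather than the module-level generator keeps the sample stable even if other cells consume random numbers in between.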
- Market Intelligence: Genre evolution, pricing strategies, platform dynamics
- Publisher Analytics: Portfolio analysis, collaboration networks, market positioning
- Trend Forecasting: Release patterns, hardware requirements, monetization shifts
- Competitive Analysis: Market segmentation, success patterns, niche identification
- Semantic Search: Content-based game discovery and recommendation systems
- NLP Research: Sentiment analysis, review classification, description clustering
- Graph Analysis: Developer networks, publisher relationships, ecosystem mapping
- Predictive Modeling: Success prediction, genre classification, requirement estimation
- Digital Economics: Platform economics, marketplace dynamics, pricing theory
- Data Science Education: Multi-modal databases, vector search, ETL pipelines
- HCI Studies: User engagement patterns, review behavior, interface effectiveness
- Methodology Research: API-based collection, reproducibility frameworks, RAVGV methodology
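For the graph analysis use case, a developer-publisher collaboration network can be derived from per-application records by counting shared titles. The records and field names below (`developers`, `publishers`) are invented for illustration and are not guaranteed to match the dataset's schema.

```python
from collections import defaultdict
from itertools import product

# Hypothetical application records with developer and publisher lists.
apps = [
    {"appid": 1, "developers": ["Studio A"], "publishers": ["Pub X"]},
    {"appid": 2, "developers": ["Studio A", "Studio B"], "publishers": ["Pub X"]},
    {"appid": 3, "developers": ["Studio B"], "publishers": ["Pub Y"]},
]


def build_edges(records):
    """Count shared titles for every (developer, publisher) pair."""
    weights = defaultdict(int)
    for app in records:
        for dev, pub in product(app["developers"], app["publishers"]):
            weights[(dev, pub)] += 1
    return dict(weights)


edges = build_edges(apps)

# Derive each publisher's set of partner developers from the weighted edges.
publisher_partners = defaultdict(set)
for dev, pub in edges:
    publisher_partners[pub].add(dev)

print(edges)
print({pub: sorted(devs) for pub, devs in publisher_partners.items()})
```

The weighted edge list can be passed to a graph library of choice for centrality or community analysis.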
# 1. Clone repository
git clone https://github.com/vintagedon/steam-dataset-2025.git
cd steam-dataset-2025
# 2. Set up Python environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
# 3. Download dataset from Zenodo or Kaggle
# See data/README.md for download instructions
# 4. Explore notebooks
jupyter notebook notebooks/
- Download Dataset: Access via Zenodo or Kaggle
- Read Documentation: Start with Dataset Card
- Explore Examples: Review Jupyter notebooks with PDF exports
- Cite Properly: Follow Citation Guide for academic use
- Review Architecture: Read Multi-Modal DB Guide
- Setup Database: Follow PostgreSQL Schema documentation
- Explore Worklogs: Review Worklogs Documentation for ETL implementation
- Implement Features: Use Vector Embeddings guide
- Dataset Card - Complete academic documentation
- Data Dictionary - Field specifications covering all 239,664 records
- Data Access Guide - Download and usage instructions
- Known Limitations - Transparent constraint documentation
- PostgreSQL Schema - Complete database implementation
- Performance Guide - Query optimization strategies
- Multi-Modal Architecture - Hybrid design patterns
- Vector Embeddings - BGE-M3 implementation
- Steam API Collection - Data acquisition methodology
- Data Validation - Quality assurance processes
- AI-Human Collaboration - RAVGV framework
- Analytics Studies - Analytical methodologies and findings
- Work Logs - Complete 10-phase development journey
- Scripts Documentation - Full ETL pipeline implementation
- Notebook Development - Analysis workflow documentation
Complete Dataset Package:
- Full PostgreSQL database dump (2.8GB compressed)
- CSV export package (85MB core data)
- Parquet files for big data workflows (45MB compressed)
- Vector embeddings (520MB)
- Complete documentation and notebooks
Advantages:
- Permanent DOI for citation
- Version controlled releases
- Academic hosting
- Long-term preservation
Interactive Dataset with Community:
- Pre-loaded notebook environment
- Discussion forums and competitions
- Version tracking and forking
- Collaborative analysis
Advantages:
- No download required
- Instant Jupyter environment
- Community engagement
- GPU acceleration available
Explore on Kaggle (Release Pending)
5K Game Sample - Available directly in repository:
- 5,000 games with complete metadata
- Representative sample across genres and eras
- Immediate download (no external hosting)
- Perfect for testing and development
Access Sample Data (102MB compressed)
We welcome contributions from the data science and gaming research communities.
- Analytics Development - New analytical frameworks and visualization approaches
- Documentation Enhancement - Improved guides and methodological documentation
- Performance Optimization - Database and query performance improvements
- Research Applications - Novel use cases and analytical discoveries
- Fork the repository and create a feature branch
- Follow existing code style and documentation standards
- Include tests and documentation for new features
- Submit pull request with clear description
- Participate in code review process
Code of Conduct - Community guidelines and expectations
MSP4 LLC - Leading managed IT services provider supporting innovative data science and research initiatives. MSP4's enterprise infrastructure expertise enables the robust technical foundation that makes large-scale projects like Steam Dataset 2025 possible.
Support open data science research and gain visibility in the analytics community.
Sponsorship Benefits:
- Logo placement in project documentation and README
- Early access to analytical findings and market intelligence
- Technical collaboration opportunities
- Industry leadership association with data innovation
Contact for Sponsorship Opportunities
@dataset{fountain_2025_steam,
author = {Fountain, Donald},
title = {{Steam Dataset 2025: Multi-Modal Gaming
Analytics Platform}},
month = jan,
year = 2025,
publisher = {Zenodo},
version = {1.0.0},
doi = {10.5281/zenodo.17286923},
url = {https://doi.org/10.5281/zenodo.17286923}
}
Fountain, D. (2025). Steam Dataset 2025: Multi-Modal Gaming Analytics
Platform (Version 1.0.0) [Data set]. Zenodo.
https://doi.org/10.5281/zenodo.17286923
Complete Citation Guide - Additional formats and attribution guidelines
This dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0)
You are free to:
- Share - Copy and redistribute in any medium or format
- Adapt - Remix, transform, and build upon the material
- Commercial Use - Use for any purpose, including commercial
Under these terms:
- Attribution - Give appropriate credit and link to license
- No Additional Restrictions - No legal/technical measures limiting permitted uses
Repository code and scripts are licensed under MIT License
Donald Fountain (VintageDon)
- GitHub: @vintagedon
- ORCID: 0009-0008-7695-4093
- Issues & Bugs: GitHub Issues
- Discussions: GitHub Discussions
- General Inquiries: Contact via GitHub
- Sponsorship: Partnership opportunities via GitHub
This dataset follows academic standards for transparency and reproducibility:
- Complete Methodology Documentation - Every collection and processing decision documented
- Reproducible Pipeline - All code and scripts provided for validation
- Quality Assurance - Systematic validation with transparent success metrics
- Limitation Disclosure - Known constraints and biases fully documented
- Version Control - Complete development history preserved
- Peer Review Ready - Structured documentation following Gebru et al. datasheet standards
| Phase | Milestone | Status | Documentation |
|---|---|---|---|
| Phase 1 | API Foundation & Validation | Complete | Work Log 01 |
| Phase 2 | Sample Collection (179 apps) | Complete | Work Log 02 |
| Phase 3 | Schema Analysis & Design | Complete | Work Log 03 |
| Phase 4 | Database Pipeline (5K apps) | Complete | Work Log 04 |
| Phase 5 | Analytics Framework | Complete | Work Log 05 |
| Phase 6 | Full Dataset (239,664 apps) | Complete | Work Log 06 |
| Phase 7 | Vector Embeddings | Complete | Work Log 07 |
| Phase 8 | Performance Optimization | Complete | Work Log 08 |
| Phase 9 | Hardware Extraction | Complete | Work Log 09 |
| Phase 10 | Dataset Packaging | Complete | Work Log 10 |
| Release | Zenodo Publication | Published | DOI: 10.5281/zenodo.17286923 |
| Release | Kaggle Publication | Pending | Coming Soon |
- 239,664 applications vs typical datasets of 6K-11K
- 1,048,148 reviews with complete metadata
- 28 years of platform evolution (1997-2025)
- 90.8% coverage of accessible Steam catalog
- PostgreSQL + JSONB + pgvector in unified database
- Semantic search using BGE-M3 embeddings
- Graph analysis ready with relationship networks
- Performance optimized with materialized columns
- Complete transparency in methodology and limitations
- Reproducible pipeline with all code provided
- Quality metrics documented for all processing stages
- Peer review ready following Gebru et al. datasheet standards
- 134,000+ successful game metadata retrievals
- Sub-second queries on 239K application database
- Validated across multiple analytical use cases
- Published notebooks demonstrating capabilities
Last Updated: January 6, 2025 | Project Status: Production Complete | Current Phase: Public Release
Dataset developed using systematic AI-human collaboration following RAVGV methodology



