
A modernized version of the 2019 Kaggle Steam Store Games dataset, built using current Steam APIs and multi-modal database architecture. Features complete catalog collection, vector embeddings for semantic search, and graph analysis of gaming industry relationships.

Steam Dataset 2025: Multi-Modal Gaming Analytics Platform

The first analytically-native Steam dataset employing multi-modal database architecture for advanced data science workflows. 239,664 applications with semantic search, graph analysis, and comprehensive metadata.

DOI | Dataset License: CC BY 4.0 | Python 3.9+ | PostgreSQL 16 | pgvector

Steam Dataset 2025 moves beyond traditional CSV exports to enable semantic search, graph analysis, and machine learning applications impossible with flat-file datasets. Built exclusively from official Steam Web APIs using systematic RAVGV methodology with complete transparency and reproducibility.


🎯 Quick Start

For Researchers

For Data Scientists

For Developers


🌟 What Makes This Different

Scale & Completeness

| Metric | Value | Coverage |
|---|---|---|
| Total Applications | 239,664 | Complete accessible Steam catalog |
| User Reviews | 1,048,148 | Full review corpus with metadata |
| Developers | 54,321 | Complete developer ecosystem |
| Publishers | 39,876 | Full publisher network |
| Temporal Range | 1997-2025 | 28 years of platform evolution |
| Success Rate | 56% | Transparent quality metrics |

Unique Capabilities

✅ API-Pure Methodology - Exclusively official Steam Web APIs, no scraping or estimates
✅ Semantic Search - 1024-dimensional BGE-M3 embeddings for content-based discovery
✅ Multi-Modal Database - PostgreSQL + JSONB + vector embeddings in unified architecture
✅ Graph Analysis Ready - Publisher/developer relationships and collaboration networks
✅ Academic Standards - Complete transparency, reproducibility, and peer-review documentation
✅ Production Tested - 134,000+ successful retrievals with comprehensive validation


📊 Production Analytics Showcase

Visual insights from 239,664 Steam applications demonstrating analytical capabilities

Genre Co-occurrence Analysis

Genre Co-occurrence Patterns
Multi-dimensional genre relationships reveal market structure
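A genre co-occurrence analysis of this kind can be sketched in a few lines. The example below is only an illustration: the function name `genre_cooccurrence` and the sample genre lists are hypothetical and are not taken from the repository's scripts or from the dataset.

```python
from collections import Counter
from itertools import combinations


def genre_cooccurrence(games):
    """Count how often each unordered pair of genres appears on the same game."""
    pair_counts = Counter()
    for genres in games:
        # Sort and de-duplicate so ("Action", "Indie") and ("Indie", "Action")
        # are counted as the same pair.
        for pair in combinations(sorted(set(genres)), 2):
            pair_counts[pair] += 1
    return pair_counts


# Hypothetical genre lists, one per game (illustrative only).
sample_games = [
    ["Action", "Indie"],
    ["Action", "Indie", "RPG"],
    ["Casual", "Indie"],
    ["Action"],
]

counts = genre_cooccurrence(sample_games)
for pair, count in counts.most_common():
    print(pair, count)
```

The resulting pair counts can be pivoted into a square matrix and drawn as a heatmap, which is one common way to present co-occurrence patterns.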

Free-to-Play Market Analysis

Free-to-Play Market Dynamics
Genre-specific monetization strategy analysis

Pricing Strategy Analysis

Pricing Distribution by Genre
Market positioning and pricing tier analysis

Developer Portfolio Analysis

Developer Portfolio Strategies
Quality vs quantity trade-offs in game development

📸 View Complete Gallery - 12 production-scale visualizations


πŸ—οΈ Architecture Overview

Steam Dataset 2025 employs a sophisticated multi-modal database architecture combining relational, document, and vector storage.

graph TD
    A[Steam Web API] --> B[ETL Pipeline<br/>239,664 Apps]
    B --> C[PostgreSQL 16<br/>Relational Tables]
    B --> D[JSONB Storage<br/>Semi-Structured Data]
    B --> E[pgvector<br/>BGE-M3 Embeddings]
    
    C --> F[Analytics Layer]
    D --> F
    E --> F
    
    F --> G[CSV/Parquet<br/>Exports]
    F --> H[Semantic Search<br/>Applications]
    F --> I[Graph Analysis<br/>Networks]
    
    style A fill:#1b2838,color:#fff
    style C fill:#336791,color:#fff
    style E fill:#00d084,color:#fff
    style F fill:#4ecdc4,color:#fff

Why Multi-Modal?

  • Relational Layer: Normalized entities with referential integrity for traditional SQL queries
  • Document Layer: JSONB preserves complete API responses without information loss
  • Vector Layer: Semantic embeddings enable content-based search and similarity analysis
  • Analytics Layer: Materialized columns and optimized indexes for sub-second query performance
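To make the vector layer concrete, here is a minimal pure-Python sketch of similarity search over embeddings. It is an assumption-laden illustration, not the project's implementation: the toy 3-dimensional vectors stand in for the 1024-dimensional BGE-M3 embeddings, and the names `cosine_similarity`, `top_k_similar`, and the game labels are hypothetical. In the database itself, a comparable ranking would typically be delegated to pgvector rather than computed in Python.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k_similar(query, embeddings, k=2):
    """Rank stored vectors by similarity to the query and keep the best k."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in embeddings.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors standing in for real embeddings (hypothetical).
toy_embeddings = {
    "space_shooter": [0.9, 0.1, 0.0],
    "space_strategy": [0.7, 0.6, 0.1],
    "farming_sim": [0.0, 0.2, 0.9],
}

results = top_k_similar([1.0, 0.0, 0.0], toy_embeddings, k=2)
print([name for name, _ in results])
```

A linear scan like this is fine for a handful of vectors; at catalog scale an approximate index such as HNSW is the usual choice.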

📖 Read Full Architecture Guide


πŸ“ Repository Structure

steam-dataset-2025/
├── 🎨 assets/                      # Visualizations and sponsor materials
│   ├── sponsors/                   # Sponsor logos and acknowledgments
│   └── steam-fulldataset-dataset-plots-initial/  # Production charts
├── 💾 data/                        # Dataset access and samples
│   ├── 01_raw/                     # Original API responses (5K sample)
│   ├── 02_processed/               # Cleaned and validated data
│   ├── 03_enriched/                # Vector embeddings and features
│   └── 04_analytics/               # Final export packages
├── 📚 docs/                        # Technical documentation
│   ├── analytics/                  # Analysis reports and methodologies
│   ├── methodologies/              # Data collection and processing guides
│   └── citation.md                 # Citation guide
├── 🗂️ documentation-standards/    # Template and style guides
├── 💻 scripts/                     # Complete ETL pipeline (12 phases)
│   ├── 01-dataset-foundations/
│   ├── 02-steam-data-sample/
│   ├── 03-analyze-steam-data-sample/
│   ├── 04-postgresql-schema-analysis/
│   ├── 05-5000-steam-game-dataset-analysis/
│   ├── 06-full-data-set-import/
│   ├── 07-vector-embeddings/
│   ├── 08-materialization-columns/
│   ├── 09-pc-requirements-materialization/
│   ├── 10-pc-requirements-validation/
│   ├── 11-packaging-the-release/
│   └── 12-notebook-generation/
├── 📦 steam-dataset-2025-v1/      # Dataset release package
│   ├── DATASET_CARD.md             # Academic documentation
│   ├── DATA_DICTIONARY.md          # Complete field specifications
│   ├── notebook-data/              # Pre-exported notebook datasets
│   │   ├── 01-platform-evolution/
│   │   ├── 02-semantic-game-discovery/
│   │   └── 03-the-semantic-fingerprint/
│   └── notebooks/                  # Published Jupyter notebooks with PDFs
│       ├── 01-steam-platform-evolution-and-marketplace/
│       ├── 02-semantic-game-discovery/
│       └── 03-the-semantic-fingerprint/
└── 📝 work-logs/                   # Complete development history
    ├── 01-dataset-foundations/
    ├── 02-steam-data-sample/
    ├── 03-analyze-steam-data-sample/
    ├── 04-postgresql-schema-analysis/
    ├── 05-5000-steam-game-dataset-analysis/
    ├── 06-full-data-set-import/
    ├── 07-vector-embeddings/
    ├── 08-materialization-columns/
    ├── 09-pc-requirements-materialization/
    └── 10-dataset-accessibility-packages/

Navigation Guide


🔬 Technology Stack

Data Collection & Processing

  • Python 3.9+ - Core ETL pipeline and processing infrastructure
  • Steam Web API - Official Valve endpoints with comprehensive rate limiting
  • Systematic Validation - Multi-stage quality assurance and integrity checking
  • Error Handling - Robust retry logic and comprehensive failure tracking
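The retry behaviour mentioned above could look roughly like the following. This is a generic exponential-backoff sketch, not the repository's collector: `fetch_with_retry`, `flaky_fetch`, the delay values, and the app ID are all hypothetical, and the fetch function is injected so the example runs without any network access.

```python
import time


def fetch_with_retry(fetch, app_id, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(app_id), retrying with exponential backoff on failure.

    Returns (result, attempts_used). Re-raises the last error if every
    attempt fails.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(app_id), attempt
        except Exception as error:  # a real pipeline would catch narrower errors
            last_error = error
            if attempt < max_attempts:
                # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
                sleep(base_delay * 2 ** (attempt - 1))
    raise last_error


# Hypothetical flaky fetcher: fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch(app_id):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return {"app_id": app_id, "success": True}


# Record the delays instead of actually sleeping.
delays = []
result, attempts = fetch_with_retry(flaky_fetch, 440, sleep=delays.append)
print(result, attempts, delays)
```

Passing `sleep` as a parameter keeps the backoff schedule testable; logging each failed app ID alongside the error would cover the failure-tracking side.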

Database & Storage

  • PostgreSQL 16.10 - Production database with advanced features
  • pgvector 0.5.0 - Vector similarity search with HNSW indexes
  • JSONB - Flexible semi-structured data storage preserving API responses
  • Materialized Views - Performance optimization for common query patterns
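As a rough picture of how a preserved API response relates to flat, query-friendly columns, the sketch below pulls a few fields out of a nested record. The field names (`data`, `is_free`, `price_overview`, `genres`) are assumptions about the response shape rather than a description of the project's schema, and `materialize_columns` and the sample record are hypothetical.

```python
def materialize_columns(app_record):
    """Extract a flat row from a nested, JSON-like application record."""
    data = app_record.get("data") or {}
    price = data.get("price_overview") or {}
    genres = data.get("genres") or []
    return {
        "name": data.get("name"),
        "is_free": bool(data.get("is_free", False)),
        # Assumed to be the price in minor currency units; None when absent.
        "price_cents": price.get("final"),
        "currency": price.get("currency"),
        "genres": [genre.get("description") for genre in genres],
    }


# Hypothetical record shaped like a stored API response (not real data).
sample_record = {
    "success": True,
    "data": {
        "name": "Example Game",
        "is_free": False,
        "price_overview": {"currency": "USD", "final": 999},
        "genres": [
            {"id": "1", "description": "Action"},
            {"id": "23", "description": "Indie"},
        ],
    },
}

row = materialize_columns(sample_record)
print(row)
```

The full nested record stays available for fields that were not anticipated, while the flat row serves the common filters and aggregations.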

Advanced Analytics

  • BGE-M3 Embeddings - 1024-dimensional semantic vectors for content analysis
  • Sentence Transformers - High-performance embedding generation
  • pandas & NumPy - Data manipulation and analytical processing
  • matplotlib & seaborn - Publication-quality visualization generation

📓 Available Notebooks

Three production-ready Jupyter notebooks demonstrate dataset capabilities with complete documentation and PDF exports:

Notebook 1 (01-steam-platform-evolution-and-marketplace) Research Questions:

  • How has Steam's catalog evolved across 28 years (1997-2025)?
  • Which genres drive platform growth and pricing strategy changes?
  • What patterns emerge in cross-platform support (Windows/Mac/Linux)?

📓 View Notebook (.ipynb) | 📄 Download PDF


Notebook 2 (02-semantic-game-discovery) Research Questions:

  • How do vector embeddings capture game concepts beyond keywords?
  • Can semantic search discover similar games across genre boundaries?
  • What representative games best exemplify each major genre?

📓 View Notebook (.ipynb) | 📄 Download PDF


Notebook 3 (03-the-semantic-fingerprint) Research Questions:

  • Can descriptions alone predict genres better than metadata?
  • How does class imbalance affect ML performance at scale?
  • Which genres are most predictable from text descriptions?

📓 View Notebook (.ipynb) | 📄 Download PDF


All notebooks include:

  • ✅ Pre-exported CSV/Parquet data files (no database required)
  • ✅ Kaggle-ready with automatic environment detection
  • ✅ Complete PDF exports for offline reading
  • ✅ Reproducible code with fixed random seeds
  • ✅ Publication-quality visualizations
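Environment detection of the kind listed above might be done along these lines. This is a hedged sketch rather than the notebooks' actual code: it assumes Kaggle exposes a `KAGGLE_KERNEL_RUN_TYPE` environment variable and mounts datasets under `/kaggle/input`, and the helper name `resolve_data_dir` is hypothetical. The local default points at the `notebook-data/` folder shown in the repository structure.

```python
import os
from pathlib import Path


def resolve_data_dir(environ=None, local_dir="steam-dataset-2025-v1/notebook-data"):
    """Pick the data directory depending on where the notebook is running."""
    environ = os.environ if environ is None else environ
    # Assumption: this variable is only present inside a Kaggle kernel.
    if "KAGGLE_KERNEL_RUN_TYPE" in environ:
        return Path("/kaggle/input")
    return Path(local_dir)


data_dir = resolve_data_dir()
print(f"Loading notebook data from: {data_dir}")
```

Accepting the environment as a parameter makes both branches easy to exercise without actually running on Kaggle.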

🚀 Explore All Notebooks


🎯 Use Cases & Applications

Gaming Industry Research

  • Market Intelligence: Genre evolution, pricing strategies, platform dynamics
  • Publisher Analytics: Portfolio analysis, collaboration networks, market positioning
  • Trend Forecasting: Release patterns, hardware requirements, monetization shifts
  • Competitive Analysis: Market segmentation, success patterns, niche identification

Machine Learning Applications

  • Semantic Search: Content-based game discovery and recommendation systems
  • NLP Research: Sentiment analysis, review classification, description clustering
  • Graph Analysis: Developer networks, publisher relationships, ecosystem mapping
  • Predictive Modeling: Success prediction, genre classification, requirement estimation
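For the graph-analysis use case, a developer-publisher network can be approximated with plain dictionaries before reaching for a graph library. The sketch below is illustrative only: the record keys `developers` and `publishers`, the helper names, and the studio names are hypothetical and do not come from the dataset.

```python
from collections import defaultdict


def build_collaboration_edges(games):
    """Map each (developer, publisher) pair to the number of games they share."""
    edges = defaultdict(int)
    for game in games:
        for developer in game.get("developers", []):
            for publisher in game.get("publishers", []):
                edges[(developer, publisher)] += 1
    return dict(edges)


def publisher_degree(edges):
    """Count the distinct developers linked to each publisher."""
    partners = defaultdict(set)
    for developer, publisher in edges:
        partners[publisher].add(developer)
    return {publisher: len(devs) for publisher, devs in partners.items()}


# Hypothetical game records (illustrative only).
sample_games = [
    {"developers": ["Studio A"], "publishers": ["Pub X"]},
    {"developers": ["Studio A", "Studio B"], "publishers": ["Pub X"]},
    {"developers": ["Studio C"], "publishers": ["Pub Y"]},
]

edges = build_collaboration_edges(sample_games)
print(edges)
print(publisher_degree(edges))
```

The weighted edge list produced here is the usual input format for graph tooling, so it can be handed to a dedicated library when centrality or community detection is needed.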

Academic Research

  • Digital Economics: Platform economics, marketplace dynamics, pricing theory
  • Data Science Education: Multi-modal databases, vector search, ETL pipelines
  • HCI Studies: User engagement patterns, review behavior, interface effectiveness
  • Methodology Research: API-based collection, reproducibility frameworks, RAVGV methodology

🚀 Getting Started

Quick Start for Data Scientists

# 1. Clone repository
git clone https://github.com/vintagedon/steam-dataset-2025.git
cd steam-dataset-2025

# 2. Set up Python environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt

# 3. Download dataset from Zenodo or Kaggle
# See data/README.md for download instructions

# 4. Explore notebooks
jupyter notebook steam-dataset-2025-v1/notebooks/

Quick Start for Researchers

  1. Download Dataset: Access via Zenodo or Kaggle
  2. Read Documentation: Start with Dataset Card
  3. Explore Examples: Review Jupyter notebooks with PDF exports
  4. Cite Properly: Follow Citation Guide for academic use

Quick Start for Developers

  1. Review Architecture: Read Multi-Modal DB Guide
  2. Setup Database: Follow PostgreSQL Schema documentation
  3. Explore Worklogs: Review Worklogs Documentation for ETL implementation
  4. Implement Features: Use Vector Embeddings guide

📖 Complete Documentation

Dataset Documentation

Technical Documentation

Methodology Documentation

Development History


🌐 Dataset Access

Official Releases

Zenodo (Recommended for Research)

DOI

Complete Dataset Package:

  • Full PostgreSQL database dump (2.8GB compressed)
  • CSV export package (85MB core data)
  • Parquet files for big data workflows (45MB compressed)
  • Vector embeddings (520MB)
  • Complete documentation and notebooks

Advantages:

  • Permanent DOI for citation
  • Version controlled releases
  • Academic hosting
  • Long-term preservation

📥 Download from Zenodo


Kaggle (Coming Soon)

Interactive Dataset with Community:

  • Pre-loaded notebook environment
  • Discussion forums and competitions
  • Version tracking and forking
  • Collaborative analysis

Advantages:

  • No download required
  • Instant Jupyter environment
  • Community engagement
  • GPU acceleration available

📊 Explore on Kaggle (Release Pending)


Sample Dataset (Immediate Access)

5K Game Sample - Available directly in repository:

  • 5,000 games with complete metadata
  • Representative sample across genres and eras
  • Immediate download (no external hosting)
  • Perfect for testing and development

📥 Access Sample Data (102MB compressed)


🤝 Contributing

We welcome contributions from the data science and gaming research communities.

Ways to Contribute

  • 📊 Analytics Development - New analytical frameworks and visualization approaches
  • 📖 Documentation Enhancement - Improved guides and methodological documentation
  • ⚡ Performance Optimization - Database and query performance improvements
  • 🔬 Research Applications - Novel use cases and analytical discoveries

Contribution Process

  1. Fork the repository and create a feature branch
  2. Follow existing code style and documentation standards
  3. Include tests and documentation for new features
  4. Submit pull request with clear description
  5. Participate in code review process

📋 Code of Conduct - Community guidelines and expectations


💼 Industry Support & Sponsorship

Platinum Sponsor

MSP4 LLC - Leading managed IT services provider supporting innovative data science and research initiatives. MSP4's enterprise infrastructure expertise enables the robust technical foundation that makes large-scale projects like Steam Dataset 2025 possible.


Become a Sponsor

Support open data science research and gain visibility in the analytics community.

Sponsorship Benefits:

  • πŸ† Logo placement in project documentation and README
  • πŸ“Š Early access to analytical findings and market intelligence
  • 🀝 Technical collaboration opportunities
  • 🎯 Industry leadership association with data innovation

Contact for Sponsorship Opportunities


📜 Citation & Academic Use

BibTeX Citation

@dataset{fountain_2025_steam,
  author       = {Fountain, Donald},
  title        = {{Steam Dataset 2025: Multi-Modal Gaming 
                   Analytics Platform}},
  month        = jan,
  year         = 2025,
  publisher    = {Zenodo},
  version      = {1.0.0},
  doi          = {10.5281/zenodo.17286923},
  url          = {https://doi.org/10.5281/zenodo.17286923}
}

APA Citation

Fountain, D. (2025). Steam Dataset 2025: Multi-Modal Gaming Analytics 
Platform (Version 1.0.0) [Data set]. Zenodo. 
https://doi.org/10.5281/zenodo.17286923

📖 Complete Citation Guide - Additional formats and attribution guidelines


βš–οΈ License & Legal

Dataset License

This dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0)

You are free to:

  • ✅ Share - Copy and redistribute in any medium or format
  • ✅ Adapt - Remix, transform, and build upon the material
  • ✅ Commercial Use - Use for any purpose, including commercial

Under these terms:

  • πŸ“ Attribution - Give appropriate credit and link to license
  • πŸ”“ No Additional Restrictions - No legal/technical measures limiting permitted uses

Code License

Repository code and scripts are licensed under MIT License

📄 View Full License


📞 Contact & Support

Project Maintainer

Donald Fountain (VintageDon)

Get Help

  • πŸ› Issues & Bugs: GitHub Issues
  • πŸ’¬ Discussions: GitHub Discussions
  • πŸ“§ General Inquiries: Contact via GitHub
  • πŸ’Ό Sponsorship: Partnership opportunities via GitHub

🎓 Academic Credentials

This dataset follows academic standards for transparency and reproducibility:

✅ Complete Methodology Documentation - Every collection and processing decision documented
✅ Reproducible Pipeline - All code and scripts provided for validation
✅ Quality Assurance - Systematic validation with transparent success metrics
✅ Limitation Disclosure - Known constraints and biases fully documented
✅ Version Control - Complete development history preserved
✅ Peer Review Ready - Structured documentation following Gebru et al. datasheet standards


πŸ† Project Milestones

| Phase | Milestone | Status | Documentation |
|---|---|---|---|
| Phase 1 | API Foundation & Validation | ✅ Complete | Work Log 01 |
| Phase 2 | Sample Collection (179 apps) | ✅ Complete | Work Log 02 |
| Phase 3 | Schema Analysis & Design | ✅ Complete | Work Log 03 |
| Phase 4 | Database Pipeline (5K apps) | ✅ Complete | Work Log 04 |
| Phase 5 | Analytics Framework | ✅ Complete | Work Log 05 |
| Phase 6 | Full Dataset (239,664 apps) | ✅ Complete | Work Log 06 |
| Phase 7 | Vector Embeddings | ✅ Complete | Work Log 07 |
| Phase 8 | Performance Optimization | ✅ Complete | Work Log 08 |
| Phase 9 | Hardware Extraction | ✅ Complete | Work Log 09 |
| Phase 10 | Dataset Packaging | ✅ Complete | Work Log 10 |
| Release | Zenodo Publication | ✅ Published | DOI: 10.5281/zenodo.17286923 |
| Release | Kaggle Publication | 🔄 Pending | Coming Soon |

🌟 Why Steam Dataset 2025?

Largest Public Steam Dataset

  • 239,664 applications vs typical datasets of 6K-11K
  • 1,048,148 reviews with complete metadata
  • 28 years of platform evolution (1997-2025)
  • 90.8% coverage of accessible Steam catalog

First Multi-Modal Architecture

  • PostgreSQL + JSONB + pgvector in unified database
  • Semantic search using BGE-M3 embeddings
  • Graph analysis ready with relationship networks
  • Performance optimized with materialized columns

Academic Standards

  • Complete transparency in methodology and limitations
  • Reproducible pipeline with all code provided
  • Quality metrics documented for all processing stages
  • Peer review ready following Gebru et al. datasheet standards

Production Tested

  • 134,000+ successful game metadata retrievals
  • Sub-second queries on 239K application database
  • Validated across multiple analytical use cases
  • Published notebooks demonstrating capabilities

Last Updated: January 6, 2025 | Project Status: Production Complete | Current Phase: Public Release

Dataset developed using systematic AI-human collaboration following RAVGV methodology


⭐ Star this repository if you find it useful for your research!
