Installation · Usage · Contributing · Code of Conduct
Datatrack is a lightweight and open-source command-line tool designed to help data engineers and platform teams track database schema changes across versions. It ensures that schema updates are transparent and auditable, helping prevent silent failures in downstream pipelines.
- Capture schema snapshots from SQL-compatible databases (PostgreSQL, SQLite, MySQL, etc.)
- Lint schemas for naming issues and structural smells
- Verify schema compliance against custom rules
- Compare schema versions and generate diffs
- Export snapshots and diffs to JSON or YAML formats
- Run the full schema audit pipeline with a single command
Managing schema changes in evolving environments is complex. Even a small change to a column's name, type, or structure can silently break dashboards or data pipelines. Datatrack helps prevent that by enabling:
- Git-like version control for database schemas
- Transparent collaboration and visibility within teams
- Faster issue detection with automatic diffs and rule checks (see the diff sketch below)
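As a concrete illustration of the diff idea, here is a minimal sketch of comparing two schema snapshots. This is not Datatrack's actual diff implementation; the snapshot format (table → column → type) is hypothetical.

```python
# Illustrative only: minimal diff between two schema snapshots,
# where each snapshot maps table -> {column_name: column_type}.
def diff_schemas(old: dict, new: dict) -> dict:
    """Return added/removed/changed columns per table."""
    changes = {}
    for table in set(old) | set(new):
        old_cols = old.get(table, {})
        new_cols = new.get(table, {})
        added = sorted(set(new_cols) - set(old_cols))
        removed = sorted(set(old_cols) - set(new_cols))
        changed = sorted(
            c for c in set(old_cols) & set(new_cols) if old_cols[c] != new_cols[c]
        )
        if added or removed or changed:
            changes[table] = {"added": added, "removed": removed, "changed": changed}
    return changes


old = {"users": {"id": "INTEGER", "name": "TEXT"}}
new = {"users": {"id": "INTEGER", "full_name": "TEXT"}}
print(diff_schemas(old, new))
# {'users': {'added': ['full_name'], 'removed': ['name'], 'changed': []}}
```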
Datatrack’s parallel and batched snapshot engine delivers significant performance improvements for real-world databases. Benchmarks were run in August 2025 on a MacBook Pro M2, Python 3.11, using SQLite and PostgreSQL; a simplified sketch of the parallel approach follows the benchmark notes below.
| Database Size | Tables | Serial Time (per snapshot) | Parallel Time (per snapshot) | Speedup | Time Saved (per 1k runs) | Time Saved (per 50k runs) |
|---|---|---|---|---|---|---|
| Small | 12 | 0.18 s | 0.09 s | 2× | 90 s | 75 min |
| Medium | 75 | 0.95 s | 0.32 s | 3× | 630 s (10.5 min) | 8.75 hrs |
| Large | 250 | 2.80 s | 0.80 s | 3.5× | 2,000 s (33 min) | 27 hrs |
- Snapshot time reduced by 65–75% for medium and large databases.
- Scales linearly: higher workloads → greater savings.
- Faster developer feedback: reduced CI/CD wait times, fewer timeouts.
- Lower infrastructure costs: less CPU time means direct savings on cloud compute.
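The numbers above come from fanning table introspection out across worker threads instead of reflecting tables one at a time. Below is a simplified sketch of that idea, assuming a SQLAlchemy-compatible connection URL (the URL is hypothetical, and the real datatrack/tracker.py additionally handles caching and batching):

```python
# Simplified sketch of parallel schema introspection -- not the actual
# datatrack/tracker.py implementation. The connection URL is a placeholder.
from concurrent.futures import ThreadPoolExecutor

from sqlalchemy import create_engine, inspect

engine = create_engine("postgresql://user:pass@localhost/mydb")  # hypothetical URL


def fetch_table(table_name: str) -> tuple[str, list[dict]]:
    # Each worker creates its own inspector so reflection calls
    # check out their own pooled connections.
    inspector = inspect(engine)
    return table_name, inspector.get_columns(table_name)


tables = inspect(engine).get_table_names()

# Fan table introspection out across a small thread pool; for large schemas
# this is where the serial-vs-parallel gap in the table above comes from.
with ThreadPoolExecutor(max_workers=8) as pool:
    snapshot = dict(pool.map(fetch_table, tables))
```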
```text
+-------------------+
|     User/CLI      |
+-------------------+
          |
          v
+-------------------+
|   Typer CLI App   |  (datatrack/cli.py)
+-------------------+
          |
          v
+-------------------+
|  Command Router   |  (CLI commands: snapshot, diff, lint, verify, export, pipeline)
+-------------------+
          |
          v
+-------------------+
|   Tracker Logic   |  (datatrack/tracker.py)
|-------------------|
| - Introspection   |
| - Caching         |
| - Parallel Fetch  |
| - Batched Fetch   |
+-------------------+
          |
          v
+-------------------+
|  SQLAlchemy ORM   |  (DB connection, inspection)
+-------------------+
          |
          v
+-------------------+
|  Database Layer   |  (PostgreSQL, SQLite, MySQL, etc.)
+-------------------+
          |
          v
+-------------------+
|  Export/History   |  (JSON/YAML, snapshot history)
+-------------------+
          |
          v
+-------------------+
|  CI/CD & Audits   |  (Integration, reporting)
+-------------------+
```
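The Typer layer in the diagram above is a thin routing shell over the tracker logic. The following is an illustrative sketch of that layer, not the actual datatrack/cli.py; the command bodies are placeholders, and the remaining commands (verify, export, pipeline) follow the same pattern.

```python
# Illustrative sketch of the Typer command-routing layer shown above.
import typer

app = typer.Typer(help="Track, lint, verify, diff, and export database schemas.")


@app.command()
def snapshot():
    """Capture the current schema and save it to the snapshot history."""
    typer.echo("snapshot: capturing schema...")


@app.command()
def diff():
    """Compare the two most recent snapshots and print the differences."""
    typer.echo("diff: comparing snapshots...")


@app.command()
def lint():
    """Check the schema for naming issues and structural smells."""
    typer.echo("lint: checking schema...")


if __name__ == "__main__":
    app()
```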
```mermaid
flowchart TD
    A[User/CLI] --> B[Typer CLI App]
    B --> C["Pipeline Command (pipeline run)"]
    C --> D1[SQLAlchemy DB Connection]
    D1 --> D2["Database (PostgreSQL, MySQL, SQLite, etc.)"]
    D2 --> D3["Snapshot: Save latest schema<br/>(via Tracker Logic: parallel/cached introspection)"]
    D3 --> D4["Linting: Check naming, types, ambiguity"]
    D4 --> D5["Verify: Apply schema rules (snake_case, reserved words)"]
    D5 --> D6[Diff: Compare with previous snapshot]
    D6 --> D7["Export: Save snapshot & diff as JSON"]
    D7 --> D8[Export/History/Reporting]
```
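For CI integration, the stages in the flowchart can also be driven one by one. The sketch below assumes the installed console script is named `datatrack` and exposes the command names shown above; exact flags and arguments may differ.

```python
# Sketch of driving the audit stages from a CI job, step by step.
# Command names come from the flowchart above; options are omitted.
import subprocess

STEPS = ["snapshot", "lint", "verify", "diff", "export"]

for step in STEPS:
    # check=True makes the CI job fail fast if any stage reports a problem.
    subprocess.run(["datatrack", step], check=True)
```

In practice, the single `pipeline run` command shown in the flowchart bundles these stages into one invocation.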
For a team running 50,000 large snapshots per month, Datatrack saves roughly 27 hours of CPU time each month (50,000 runs × 2.0 s saved per snapshot ≈ 100,000 s). At typical cloud compute rates, this translates into hundreds of dollars per year in savings. The bigger win, however, is developer productivity and reliability: faster pipelines, earlier error detection, and less risk of schema-related outages.
Please refer to the following docs for detailed guidance:
This project is licensed under the MIT License. See the LICENSE file for details.
Developed and maintained by N R Navaneet.