Installation · Usage · Contributing · Code of Conduct
Datatrack is a lightweight and open-source command-line tool designed to help data engineers and platform teams track database schema changes across versions. It ensures that schema updates are transparent and auditable, helping prevent silent failures in downstream pipelines.
- Capture schema snapshots from SQL-compatible databases (PostgreSQL, SQLite, MySQL, etc.)
- Lint schemas for naming issues and structural smells
- Verify schema compliance against custom rules
- Compare schema versions and generate diffs
- Export snapshots and diffs to JSON or YAML formats
- Run the full schema audit pipeline with a single command
Managing schema changes in evolving environments is complex. Even a small change to a column's name, type, or structure can silently break dashboards or data pipelines. Datatrack helps prevent that by enabling:
- Git-like version control for database schemas
- Transparent collaboration and visibility within teams
- Faster issue detection with automatic diffs and rule checks (see the diff sketch below)
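As a concrete illustration of the diff idea, here is a minimal sketch of comparing two schema snapshots. This is not Datatrack's actual diff implementation; the snapshot format (table → column → type) is hypothetical.

```python
# Illustrative only: minimal diff between two schema snapshots,
# where each snapshot maps table -> {column_name: column_type}.
def diff_schemas(old: dict, new: dict) -> dict:
    """Return added/removed/changed columns per table."""
    changes = {}
    for table in set(old) | set(new):
        old_cols = old.get(table, {})
        new_cols = new.get(table, {})
        added = sorted(set(new_cols) - set(old_cols))
        removed = sorted(set(old_cols) - set(new_cols))
        changed = sorted(
            c for c in set(old_cols) & set(new_cols) if old_cols[c] != new_cols[c]
        )
        if added or removed or changed:
            changes[table] = {"added": added, "removed": removed, "changed": changed}
    return changes


old = {"users": {"id": "INTEGER", "name": "TEXT"}}
new = {"users": {"id": "INTEGER", "full_name": "TEXT"}}
print(diff_schemas(old, new))
# {'users': {'added': ['full_name'], 'removed': ['name'], 'changed': []}}
```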
Datatrack’s parallel and batched snapshot engine delivers significant performance improvements for real-world databases. Benchmarks were run in August 2025 on a MacBook Pro M2, Python 3.11, using SQLite and PostgreSQL; a simplified sketch of the parallel approach follows the benchmark notes below.
| Database Size | Tables | Serial Time (per snapshot) | Parallel Time (per snapshot) | Speedup | Time Saved (per 1k runs) | Time Saved (per 50k runs) |
|---|---|---|---|---|---|---|
| Small | 12 | 0.18 s | 0.09 s | 2× | 90 s | 75 min |
| Medium | 75 | 0.95 s | 0.32 s | 3× | 630 s (10.5 min) | 8.75 hrs |
| Large | 250 | 2.80 s | 0.80 s | 3.5× | 2,000 s (33 min) | 27 hrs |
- Snapshot time reduced by 65–75% for medium and large databases.
- Scales linearly: higher workloads → greater savings.
- Faster developer feedback: reduced CI/CD wait times, fewer timeouts.
- Lower infrastructure costs: less CPU time means direct savings on cloud compute.
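The numbers above come from fanning table introspection out across worker threads instead of reflecting tables one at a time. Below is a simplified sketch of that idea, assuming a SQLAlchemy-compatible connection URL (the URL is hypothetical, and the real datatrack/tracker.py additionally handles caching and batching):

```python
# Simplified sketch of parallel schema introspection -- not the actual
# datatrack/tracker.py implementation. The connection URL is a placeholder.
from concurrent.futures import ThreadPoolExecutor

from sqlalchemy import create_engine, inspect

engine = create_engine("postgresql://user:pass@localhost/mydb")  # hypothetical URL


def fetch_table(table_name: str) -> tuple[str, list[dict]]:
    # Each worker creates its own inspector so reflection calls
    # check out their own pooled connections.
    inspector = inspect(engine)
    return table_name, inspector.get_columns(table_name)


tables = inspect(engine).get_table_names()

# Fan table introspection out across a small thread pool; for large schemas
# this is where the serial-vs-parallel gap in the table above comes from.
with ThreadPoolExecutor(max_workers=8) as pool:
    snapshot = dict(pool.map(fetch_table, tables))
```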
```text
+-------------------+
|     User/CLI      |
+-------------------+
          |
          v
+-------------------+
|   Typer CLI App   |  (datatrack/cli.py)
+-------------------+
          |
          v
+-------------------+
|  Command Router   |  (CLI commands: snapshot, diff, lint, verify, export, pipeline)
+-------------------+
          |
          v
+-------------------+
|   Tracker Logic   |  (datatrack/tracker.py)
|-------------------|
| - Introspection   |
| - Caching         |
| - Parallel Fetch  |
| - Batched Fetch   |
+-------------------+
          |
          v
+-------------------+
|  SQLAlchemy ORM   |  (DB connection, inspection)
+-------------------+
          |
          v
+-------------------+
|  Database Layer   |  (PostgreSQL, SQLite, MySQL, etc.)
+-------------------+
          |
          v
+-------------------+
|  Export/History   |  (JSON/YAML, snapshot history)
+-------------------+
          |
          v
+-------------------+
|  CI/CD & Audits   |  (Integration, reporting)
+-------------------+
```
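The Typer layer in the diagram above is a thin routing shell over the tracker logic. The following is an illustrative sketch of that layer, not the actual datatrack/cli.py; the command bodies are placeholders, and the remaining commands (verify, export, pipeline) follow the same pattern.

```python
# Illustrative sketch of the Typer command-routing layer shown above.
import typer

app = typer.Typer(help="Track, lint, verify, diff, and export database schemas.")


@app.command()
def snapshot():
    """Capture the current schema and save it to the snapshot history."""
    typer.echo("snapshot: capturing schema...")


@app.command()
def diff():
    """Compare the two most recent snapshots and print the differences."""
    typer.echo("diff: comparing snapshots...")


@app.command()
def lint():
    """Check the schema for naming issues and structural smells."""
    typer.echo("lint: checking schema...")


if __name__ == "__main__":
    app()
```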
```mermaid
flowchart TD
    A[User/CLI] --> B[Typer CLI App]
    B --> C["Pipeline Command (pipeline run)"]
    C --> D1[SQLAlchemy DB Connection]
    D1 --> D2["Database (PostgreSQL, MySQL, SQLite, etc.)"]
    D2 --> D3["Snapshot: Save latest schema<br/>(via Tracker Logic: parallel/cached introspection)"]
    D3 --> D4["Linting: Check naming, types, ambiguity"]
    D4 --> D5["Verify: Apply schema rules (snake_case, reserved words)"]
    D5 --> D6[Diff: Compare with previous snapshot]
    D6 --> D7["Export: Save snapshot & diff as JSON"]
    D7 --> D8[Export/History/Reporting]
```
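For CI integration, the stages in the flowchart can also be driven one by one. The sketch below assumes the installed console script is named `datatrack` and exposes the command names shown above; exact flags and arguments may differ.

```python
# Sketch of driving the audit stages from a CI job, step by step.
# Command names come from the flowchart above; options are omitted.
import subprocess

STEPS = ["snapshot", "lint", "verify", "diff", "export"]

for step in STEPS:
    # check=True makes the CI job fail fast if any stage reports a problem.
    subprocess.run(["datatrack", step], check=True)
```

In practice, the single `pipeline run` command shown in the flowchart bundles these stages into one invocation.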
For a team running 50,000 large snapshots per month, Datatrack saves roughly 27 hours of CPU time each month (50,000 runs × 2.0 s saved per snapshot ≈ 100,000 s). At typical cloud compute rates, this translates into hundreds of dollars per year in savings. The bigger win, however, is developer productivity and reliability: faster pipelines, earlier error detection, and less risk of schema-related outages.
Please refer to the following docs for detailed guidance:
This project is licensed under the MIT License. See the LICENSE file for details.
Developed and maintained by N R Navaneet.