db-diff is the most advanced and user-friendly Python CLI tool and library for comparing CSV, TSV, and JSON database dumps. Designed specifically for database workflows, it delivers precise, human-readable or machine-readable diffs, supports custom key columns, handles massive files efficiently with streaming mode, and offers flexible field selection.
- Compare CSV, TSV, and JSON files for differences
- Human-readable and machine-readable (JSON) output
- JSON output to file with filename and path support
- Detects added, removed, and changed rows and columns
- Supports custom key columns for row identity
- Field inclusion/exclusion for focused diffs
- Streaming mode for very large files (memory efficient)
- Can output to terminal or JSON file
- Automatic delimiter and encoding detection
- Python 3.6+ compatible
Install the latest version directly from GitHub:
pip install git+https://github.com/datsom1/db-diff.git
To upgrade to the latest version:
pip install --upgrade --force-reinstall git+https://github.com/datsom1/db-diff.git
Suppose you have two CSV files:
one.csv
Id,name,age
1,Cleo,4
2,Pancakes,2
two.csv
Id,name,age
1,Cleo,5
3,Bailey,1
Compare them using:
db-diff one.csv two.csv --key=Id
Sample output:
1 rows changed, 1 rows added, 1 rows removed
1 rows changed
Id: 1
age: "4" => "5"
1 rows added
Id: 3
name: Bailey
age: 1
1 rows removed
Id: 2
name: Pancakes
age: 2
db-diff
is a flexible CLI tool for comparing two data files (CSV, TSV, or JSON). It detects added, removed, and changed rows and columns, and can output results in a human-readable or machine-readable format.
Basic usage:
db-diff [OPTIONS] PREVIOUS CURRENT
PREVIOUS
andCURRENT
are the file paths to the two files you want to compare.- The tool auto-detects file format by extension, or you can specify with
--format
.
Key features:
- Custom Key Column: Use
--key
to specify which column uniquely identifies rows. - Output Formats: Choose between human-readable (
readable
), JSON (json
), or save JSON to a file (jsonfile
). - Field Selection: Use
--fields
to compare only specific columns, or--ignorefields
to exclude columns. - Streaming Mode: For very large files, use
--streaming
(CSV/TSV only, files must be sorted by key). - Encoding: Specify file encoding with
--encoding
. - Show Unchanged: Use
--showunchanged
to display unchanged fields for changed rows. - List Fields: Use
--listfields
to print available columns and exit. - Timing: Use
--time
to display how long the diff operation took.
See all options:
db-diff --help
You can use db-diff
as a Python library for advanced or automated workflows. The library provides functions to load data, compare datasets, and render results.
from db_diff import load_csv, load_json
# Load CSV file, using a specific column as the key
with open("one.csv", encoding="utf-8") as f:
prev = load_csv(f, key="Id")
# Load JSON file, using a specific key
with open("two.json", encoding="utf-8") as f:
curr = load_json(f, key="Id")
from db_diff import compare
# Compare two datasets (dictionaries keyed by your chosen column)
diff = compare(prev, curr, show_unchanged=False)
show_unchanged
: If True, includes unchanged fields for changed rows.fields
: Pass a set of field names to only compare those fields.ignorefields
: Pass a set of field names to ignore during comparison.
from db_diff import streaming_compare_csv
diff = streaming_compare_csv(
"one.csv",
"two.csv",
key="Id",
compare_columns={"Id", "name", "age"},
encoding="utf-8",
dialect="excel"
)
The result of compare
or streaming_compare_csv
is a dictionary:
{
"added": [ ... ], # List of added rows (dicts)
"removed": [ ... ], # List of removed rows (dicts)
"changed": [ # List of changed rows
{
"key": "row_id",
"changes": {
"field1": ["old", "new"],
...
},
"unchanged": { ... } # (optional) if show_unchanged=True
},
...
],
"columns_added": [ ... ], # List of columns added
"columns_removed": [ ... ] # List of columns removed
}
from db_diff import human_text
print(human_text(diff, key="Id", current=curr))
from db_diff import load_csv, compare, human_text
with open("one.csv") as f1, open("two.csv") as f2:
prev = load_csv(f1, key="Id")
curr = load_csv(f2, key="Id")
diff = compare(prev, curr, show_unchanged=True)
print(human_text(diff, key="Id", current=curr))
See all available options with:
db-diff --help
A summary of key options:
Option | Description |
---|---|
--key TEXT |
Column to use as a unique ID for each row (default: first column header) |
--output TEXT |
Output format: readable , json , or jsonfile (default: readable) |
--outfilename FILE |
File to write JSON output to (used with --output=jsonfile ) |
--outfilepath DIR |
Directory to save the output file (used with --output=jsonfile ) |
--fields TEXT |
Comma-separated list of fields to compare (all others ignored) |
--ignorefields TEXT |
Comma-separated list of fields to ignore during comparison |
--showunchanged |
Show all fields for changed records, not just changed fields |
--time |
Measure and display elapsed time for the diff operation |
--format TEXT |
Explicitly specify input format: csv , tsv , or json (default: auto-detect) |
--encoding TEXT |
Input file encoding (default: utf-8) |
--streaming |
Use streaming mode for very large CSV/TSV files (requires files to be sorted by key) |
--listfields |
List available fields/columns in the input files and exit |
--version |
Show the version and exit |
-h, --help |
Show help message and exit |
Show unchanged fields for changed rows:
db-diff one.csv two.csv --key=Id --showunchanged
Output as JSON:
db-diff one.csv two.csv --key=Id --output=json
Save JSON output to a file:
db-diff one.csv two.csv --key=Id --output=jsonfile --outfilename=diffs.json
Compare only specific fields:
db-diff one.csv two.csv --key=Id --fields=Id,name
Ignore specific fields:
db-diff one.csv two.csv --key=Id --ignorefields=LastModifiedDate
Streaming mode for large files:
db-diff large1.csv large2.csv --key=Id --streaming
- For very large files, use
--streaming
mode (CSV/TSV only, files must be sorted by key). - Supports efficient memory usage and fast comparison for millions of rows.
Clone the repository:
git clone https://github.com/datsom1/db-diff.git
cd db-diff
Install dependencies:
pip install -e .
This project is licensed under the Apache License 2.0.
Author: Thomas Coyle
Repository: https://github.com/datsom1/db-diff