`tbl` is a cli tool for reading and editing parquet files.

Goals of `tbl`:
- be a swiss army knife for reading/editing parquet (kind of like `jq` is for JSON)
- make it effortless to manage multi-file multi-schema parquet datasets
- use a cli-native version of polars syntax, so if you know python polars you already mostly know `tbl`
- quickly look up schemas, row counts, and per-column storage usage
- migrate from one schema to another, like add/remove/rename a column
- perform these operations on multiple files in parallel
To discuss `tbl`, check out the Paradigm Data Tools telegram group.
Install from crates.io:

```bash
cargo install tbl-cli
```

Or install from source:

```bash
git clone https://github.com/paradigmxyz/tbl
cd tbl
cargo install --path crates/tbl-cli
```
`tbl` can list files and display their statistics, similar to the `ls` cli command.

The command `tbl ls` produces output:

```
blocks__00000000_to_00000999.parquet
blocks__00001000_to_00001999.parquet
blocks__00002000_to_00002999.parquet
blocks__00003000_to_00003999.parquet
blocks__00004000_to_00004999.parquet
blocks__00005000_to_00005999.parquet
blocks__00006000_to_00006999.parquet
blocks__00007000_to_00007999.parquet
blocks__00008000_to_00008999.parquet
blocks__00009000_to_00009999.parquet
... 19,660 files not shown
19,041,325 rows stored in 1.05 GB across 19,708 tabular files
```
See full list of `tbl ls` options below.
`tbl` can display the schemas of parquet files.

The command `tbl schema` produces output:

```
1 unique schema, 19,041,325 rows, 19,708 files, 1.05 GB

column name      │ dtype  │ disk size │ full size │ disk %
─────────────────┼────────┼───────────┼───────────┼───────
block_hash       │ binary │ 649.97 MB │ 657.93 MB │ 63.78%
author           │ binary │  40.52 MB │  40.59 MB │  3.98%
block_number     │ u32    │  76.06 MB │  75.75 MB │  7.46%
gas_used         │ u64    │  84.23 MB │ 133.29 MB │  8.26%
extra_data       │ binary │  46.66 MB │  76.91 MB │  4.58%
timestamp        │ u32    │  76.06 MB │  75.75 MB │  7.46%
base_fee_per_gas │ u64    │  41.85 MB │  49.58 MB │  4.11%
chain_id         │ u64    │   3.74 MB │   3.70 MB │  0.37%
```
See full list of `tbl schema` options below.
`tbl` can operate on one file, or many files across multiple directories.

These input selection options can be used with each `tbl` subcommand:

| input selection | command |
| --- | --- |
| Select all tabular files in current directory | `tbl` (default behavior) |
| Select a single file | `tbl /path/to/file.parquet` |
| Select files using a glob | `tbl *.parquet` |
| Select files from multiple directories | `tbl /path/to/dir1 /path/to/dir2` |
| Select files recursively | `tbl /path/to/dir --tree` |
`tbl` can perform many different operations on the selected files:

| operation | command |
| --- | --- |
| Rename a column | `tbl --rename old_name=new_name` |
| Cast to a new type | `tbl --cast col1=u64 col2=String` |
| Add new columns | `tbl --with-columns name:String date:Date=2024-01-01` |
| Drop columns | `tbl --drop col1 col2 col3` |
| Filter rows | `tbl --filter col1=val1`<br>`tbl --filter col1!=val1`<br>`tbl --filter "col1>val1"`<br>`tbl --filter "col1<val1"`<br>`tbl --filter "col1>=val1"`<br>`tbl --filter "col1<=val1"` |
| Sort rows | `tbl --sort col1 col2:desc` |
| Select columns | `tbl --select col1 col2 col3` |
See full list of transformation operations below.
`tbl` can output its results in many different modes:

| output mode | description | command |
| --- | --- | --- |
| Single File | output all results to single file | `tbl --output-file /path/to/file.parquet` |
| Inplace | modify each file inplace | `tbl --inplace` |
| New Directory | create equivalent files in a new directory | `tbl --output-dir /path/to/dir` |
| Interactive | load dataframe in interactive python session | `tbl --df` |
| Stdout | output data to stdout | `tbl` (default behavior) |

See full list of output options below.
Output of `tbl -h`:

```
tbl is a tool for reading and editing tabular data files

Usage: tbl has two modes
1. Summary mode: tbl [ls | schema] [SUMMARY_OPTIONS]
2. Data mode:    tbl [DATA_OPTIONS]

Get help with SUMMARY_OPTIONS using tbl [ls | schema] -h

Data mode is the default mode. DATA_OPTIONS are documented below

Optional Subcommands:
  ls        Display list of tabular files, similar to the cli `ls` command
  schema    Display table representation of each schema in the selected files

General Options:
  -h, --help     display help message
  -V, --version  display version

Input Options:
  [PATHS]...  input path(s) to use
  -t, --tree  recursively use all files in tree as inputs

Transform Options:
  -c, --columns <COLUMN>...        select only these columns [alias --select]
      --drop <DROP>...             drop column(s)
      --with-columns <NEW_COL>...  insert columns, syntax NAME:TYPE [alias --with]
      --rename <RENAME>...         rename column(s), syntax OLD_NAME=NEW_NAME
      --cast <CAST>...             change column type(s), syntax COLUMN=TYPE
      --set <COLUMN>...            set column values, syntax COLUMN=VALUE
      --nullify <COLUMN>...        set column values to null
      --filter <FILTER>...         filter rows by values, syntax COLUMN=VALUE
      --sort <SORT>...             sort rows, syntax COLUMN[:desc]
      --head <HEAD>                keep only the first n rows [alias --limit]
      --tail <TAIL>                keep only the last n rows
      --offset <OFFSET>            skip the first n rows of table
      --value-counts <COLUMN>      compute value counts of column(s)

Output Options:
      --no-summary                 skip printing a summary
  -n, --n <N>                      number of rows to print in stdout, all for all
      --csv                        output data as csv
      --json                       output data as json
      --jsonl                      output data as json lines
      --hex                        encode binary columns as hex for output
      --inplace                    modify files in place
      --output-file <FILE_PATH>    write all data to a single new file
      --output-dir <DIR_PATH>      rewrite all files into this output directory
      --output-prefix <PRE-FIX>    prefix to add to output filenames
      --output-postfix <POST-FIX>  postfix to add to output filenames
      --df                         load as DataFrame in interactive python session
      --lf                         load as LazyFrame in interactive python session
      --executable <EXECUTABLE>    python executable to use with --df or --lf
      --confirm                    confirm that files should be edited
      --dry                        dry run without editing files

Output Modes:
1. output results in single file  --output-file /path/to/file.parquet
2. modify each file inplace       --inplace
3. copy files into a new dir      --output-dir /path/to/dir
4. load as interactive python     --df | --lf
5. output data to stdout          (default behavior)
```
Output of `tbl ls -h`:

```
Display list of tabular files, similar to the cli `ls` command

Usage: tbl ls [OPTIONS] [PATHS]...

Arguments:
  [PATHS]...  input path(s) to use

Options:
  -t, --tree       recursively list all files in tree
      --absolute   show absolute paths instead of relative
      --n <N>      number of file names to print
      --sort <SORT>  sort by number of rows, files, or bytes [default: bytes]

General Options:
  -h, --help  display help message
```
Output of `tbl schema -h`:

```
Display table representation of each schema in the selected files

Usage: tbl schema [OPTIONS] [PATHS]...

Arguments:
  [PATHS]...  input path(s) to use

Options:
  -t, --tree               recursively list all files in tree
      --columns <COLUMNS>  columns to print
      --n <N>              number of schemas to print
      --examples           show examples
      --absolute           show absolute paths in examples
      --sort <SORT>        sort by number of rows, files, or bytes [default: bytes]

General Options:
  -h, --help  display help message
```
Parquet is a file format for storing tabular datasets. In many cases parquet is a simpler and faster alternative to using an actual database. Parquet has become an industry standard and its ecosystem of tools is growing rapidly.
The most common tools are `duckdb`, `pqrs`, and `parquet-cli`.

`duckdb` is an incredible tool. We recommend checking it out, especially when you're running complex workloads. However, there are 3 reasons you might prefer `tbl` as a cli tool:
- CLI-Native: Compared to `duckdb`'s SQL, `tbl` has a cli-native syntax. This makes `tbl` simpler to use with fewer keystrokes:
  - `duckdb "DESCRIBE read_parquet('test.parquet')"` vs `tbl schema test.parquet`
  - `duckdb "SELECT * FROM read_parquet('test.parquet')"` vs `tbl test.parquet`
  - `duckdb "SELECT * FROM read_parquet('test.parquet') ORDER BY col1"` vs `tbl test.parquet --sort col1`
- High Level vs Low Level: Sometimes SQL can also be a very low-level language. `tbl` and `polars` let you operate on a higher level of abstraction, which reduces cognitive load:
  - `duckdb`: `duckdb "SELECT col1, COUNT(col1) FROM read_parquet('test.parquet') GROUP BY col1"`
  - `tbl`: `tbl test.parquet --value-counts col1`
- Operational QoL: `tbl` is built specifically for making it easy to manage large parquet archives. Features like `--tree`, `--inplace`, and multi-schema commands make life easier for archive management.
There are a few features that we are currently exploring:
- S3 and cloud buckets: ability to read and write cloud bucket parquet files using the same operations that can be performed on local files
- Re-partitioning: ability to change how a set of parquet files are partitioned, such as changing the partition key or partition size
- Direct python syntax: ability to directly use python polars syntax to perform complex operations like `group_by()`, `join()`, and more
- Idempotent workflows: ability to interrupt and re-run commands arbitrarily would make migrations more robust