Skip to content

Latest commit

 

History

History
285 lines (217 loc) · 11.8 KB

README.md

File metadata and controls

285 lines (217 loc) · 11.8 KB

tbl ┳━┳

tbl is a cli tool for reading and editing parquet files

Goals of tbl:

  • be a swiss army knife for reading/editing parquet (kind of like jq is for JSON)
  • make it effortless to manage multi-file multi-schema parquet datasets
  • use a cli-native version of polars syntax, so if you know python polars you already mostly know tbl

Example use cases:

  • quickly look up schemas, row counts, and per-column storage usage
  • migrate from one schema to another, like add/remove/rename a column
  • perform these operations on multiple files in parallel

To discuss tbl, check out the Paradigm Data Tools telegram group.

Contents

  1. Installation
  2. Example Usage
    1. Listing files
    2. Looking up schemas
    3. Selecting input files
    4. Performing edits
    5. Selecting output mode
  3. API Reference
    1. tbl
    2. tbl ls
    3. tbl schema
  4. FAQ
    1. What is parquet?
    2. What other parquet cli tools exist?
    3. Why use tbl when duckdb has a cli?
    4. What is the plan for tbl?

Installation

Install from crates.io
cargo install tbl-cli
Install from source
git clone https://github.com/paradigmxyz/tbl
cd tbl
cargo install --path crates/tbl-cli

Example Usage

Listing files

tbl can list files and display their statistics, similar to the ls cli command.

The command tbl ls produces output:

blocks__00000000_to_00000999.parquet
blocks__00001000_to_00001999.parquet
blocks__00002000_to_00002999.parquet
blocks__00003000_to_00003999.parquet
blocks__00004000_to_00004999.parquet
blocks__00005000_to_00005999.parquet
blocks__00006000_to_00006999.parquet
blocks__00007000_to_00007999.parquet
blocks__00008000_to_00008999.parquet
blocks__00009000_to_00009999.parquet
... 19,660 files not shown
19,041,325 rows stored in 1.05 GB across 19,708 tabular files

See full list of tbl ls options below.

Looking up schemas

tbl can display the schemas of parquet files.

The command tbl schema produces output:

1 unique schema, 19,041,325 rows, 19,708 files, 1.05 GB

     column name  │   dtype  │  disk size  │  full size  │  disk %
──────────────────┼──────────┼─────────────┼─────────────┼────────
      block_hash  │  binary  │  649.97 MB  │  657.93 MB  │  63.78%
          author  │  binary  │   40.52 MB  │   40.59 MB  │   3.98%
    block_number  │     u32  │   76.06 MB  │   75.75 MB  │   7.46%
        gas_used  │     u64  │   84.23 MB  │  133.29 MB  │   8.26%
      extra_data  │  binary  │   46.66 MB  │   76.91 MB  │   4.58%
       timestamp  │     u32  │   76.06 MB  │   75.75 MB  │   7.46%
base_fee_per_gas  │     u64  │   41.85 MB  │   49.58 MB  │   4.11%
        chain_id  │     u64  │    3.74 MB  │    3.70 MB  │   0.37%

See full list of tbl schema options below.

Selecting input files

tbl can operate on one file, or many files across multiple directories.

These input selection options can be used with each tbl subcommand:

input selection command
Select all tabular files in current directory tbl (default behavior)
Select a single file tbl /path/to/file.parquet
Select files using a glob tbl *.parquet
Select files from multiple directories tbl /path/to/dir1 /path/to/dir2
Select files recursively tbl /path/to/dir --tree

Performing edits

tbl can perform many different operations on the selected files:

operation command
Rename a column tbl --rename old_name=new_name
Cast to a new type tbl --cast col1=u64 col2=String
Add new columns tbl --with-columns name:String date:Date=2024-01-01
Drop columns tbl --drop col1 col2 col3
Filter rows tbl --filter col1=val1
tbl --filter col1!=val1
tbl --filter "col1>val1"
tbl --filter "col1<val1"
tbl --filter "col1>=val1"
tbl --filter "col1<=val1"
Sort rows tbl --sort col1 col2:desc
Select columns tbl --select col1 col2 col3

See full list of transformation operations below.

Selecting output mode

tbl can output its results in many different modes:

output mode description command
Single File output all results to single file tbl --output-file /path/to/file.parquet
Inplace modify each file inplace tbl --inplace
New Directory create equivalent files in a new directory tbl --output-dir /path/to/dir
Interactive load dataframe in interactive python session tbl --df
Stdout output data to stdout tbl (default behavior)

See full list of output options below.

API Reference

tbl

Output of tbl -h:
tbl is a tool for reading and editing tabular data files

Usage: tbl has two modes
1. Summary mode: tbl [ls | schema] [SUMMARY_OPTIONS]
2. Data mode:    tbl [DATA_OPTIONS]

Get help with SUMMARY_OPTIONS using tbl [ls | schema] -h

Data mode is the default mode. DATA_OPTIONS are documented below

Optional Subcommands:
  ls      Display list of tabular files, similar to the cli `ls` command
  schema  Display table representation of each schema in the selected files

General Options:
  -h, --help                       display help message
  -V, --version                    display version

Input Options:
  [PATHS]...                       input path(s) to use
  -t, --tree                       recursively use all files in tree as inputs

Transform Options:
  -c, --columns <COLUMN>...        select only these columns [alias --select]
      --drop <DROP>...             drop column(s)
      --with-columns <NEW_COL>...  insert columns, syntax NAME:TYPE [alias --with]
      --rename <RENAME>...         rename column(s), syntax OLD_NAME=NEW_NAME
      --cast <CAST>...             change column type(s), syntax COLUMN=TYPE
      --set <COLUMN>...            set column values, syntax COLUMN=VALUE
      --nullify <COLUMN>...        set column values to null
      --filter <FILTER>...         filter rows by values, syntax COLUMN=VALUE
      --sort <SORT>...             sort rows, syntax COLUMN[:desc]
      --head <HEAD>                keep only the first n rows [alias --limit]
      --tail <TAIL>                keep only the last n rows
      --offset <OFFSET>            skip the first n rows of table
      --value-counts <COLUMN>      compute value counts of column(s)

Output Options:
      --no-summary                 skip printing a summary
  -n, --n <N>                      number of rows to print in stdout, all for all
      --csv                        output data as csv
      --json                       output data as json
      --jsonl                      output data as json lines
      --hex                        encode binary columns as hex for output
      --inplace                    modify files in place
      --output-file <FILE_PATH>    write all data to a single new file
      --output-dir <DIR_PATH>      rewrite all files into this output directory
      --output-prefix <PRE-FIX>    prefix to add to output filenames
      --output-postfix <POST-FIX>  postfix to add to output filenames
      --df                         load as DataFrame in interactive python session
      --lf                         load as LazyFrame in interactive python session
      --executable <EXECUTABLE>    python executable to use with --df or --lf
      --confirm                    confirm that files should be edited
      --dry                        dry run without editing files

Output Modes:
1. output results in single file   --output-file /path/to/file.parquet
2. modify each file inplace        --inplace
3. copy files into a new dir       --output-dir /path/to/dir
4. load as interactive python      --df | --lf
5. output data to stdout           (default behavior)

tbl ls

Output of tbl ls -h:
Display list of tabular files, similar to the cli `ls` command

Usage: tbl ls [OPTIONS] [PATHS]...

Arguments:
  [PATHS]...  input path(s) to use

Options:
  -t, --tree         recursively list all files in tree
      --absolute     show absolute paths instead of relative
      --n <N>        number of file names to print
      --sort <SORT>  sort by number of rows, files, or bytes [default: bytes]

General Options:
  -h, --help  display help message

tbl schema

Output of tbl schema -h:
Display table representation of each schema in the selected files

Usage: tbl schema [OPTIONS] [PATHS]...

Arguments:
  [PATHS]...  input path(s) to use

Options:
  -t, --tree               recursively list all files in tree
      --columns <COLUMNS>  columns to print
      --n <N>              number of schemas to print
      --examples           show examples
      --absolute           show absolute paths in examples
      --sort <SORT>        sort by number of rows, files, or bytes [default: bytes]

General Options:
  -h, --help  display help message

FAQ

What is parquet?

Parquet is a file format for storing tabular datasets. In many cases parquet is a simpler and faster alternative to using an actual database. Parquet has become an industry standard and its ecosystem of tools is growing rapidly.

What other parquet cli tools exist?

The most common tools are duckdb, pqrs, and parquet-cli.

Why use tbl when duckdb has a cli?

duckdb is an incredible tool. We recommend checking it out, especially when you're running complex workloads. However there are 3 reasons you might prefer tbl as a cli tool:

  1. CLI-Native: Compared to duckdb's SQL, tbl has a cli-native syntax. This makes tbl simpler to use with fewer keystrokes:
    1. duckdb "DESCRIBE read_parquet('test.parquet')" vs tbl schema test.parquet
    2. duckdb "SELECT * FROM read_parquet('test.parquet')" vs tbl test.parquet
    3. duckdb "SELECT * FROM read_parquet('test.parquet') ORDER BY co1" vs tbl test.parquet --sort col1
  2. High Level vs Low Level: Sometimes SQL can also be a very low-level language. tbl and polars let you operate on a higher level of abstraction which reduces cognitive load:
    1. duckdb: duckdb "SELECT col1, COUNT(col1) FROM read_parquet('test.parquet') GROUP BY col1"
    2. tbl: tbl test.parquet --value-counts col1
  3. Operational QoL: tbl is built specifically for making it easy to manage large parquet archives. Features like --tree, --inplace, and multi-schema commands make life easier for archive management.

What is the plan for tbl?

There are a few features that we are currently exploring:

  1. S3 and cloud buckets: ability to read and write cloud bucket parquet files using the same operations that can be performed on local files
  2. Re-partitioning: ability to change how a set of parquet files are partitioned, such as changing the partition key or partition size
  3. Direct python syntax: ability to directly use python polars syntax to perform complex operations like group_by(), join(), and more
  4. Idempotent Workflows: ability to interrupt and re-run commands arbitrarily would make migrations more robust