You can build this project using maven:
./mvnw clean install -DskipTests
The build produces a shaded Jar that can be run using the hadoop
command:
hadoop jar parquet-cli-1.12.3-runtime.jar org.apache.parquet.cli.Main
For a shorter command-line invocation, add an alias to your shell like this:
alias parquet="hadoop jar /path/to/parquet-cli-1.12.3-runtime.jar org.apache.parquet.cli.Main --dollar-zero parquet"
To run from the target directory instead of using the hadoop
command, first copy the dependencies to a folder:
./mvnw dependency:copy-dependencies
Then, run the command-line and add target/dependencies/*
to the classpath:
java -cp 'target/parquet-cli-1.12.3.jar:target/dependency/*' org.apache.parquet.cli.Main
Note that you shouldn't include the runtime jar used above into the classpath in this case.
In that jar, the org.apache.avro package
is relocated for avoiding conflict with Hadoop's one.
That relocation changes method signatures, so it can cause NoSuchMethodError
depending on the class loading order.
See PARQUET-2142 for details.
The parquet
tool includes help for the included commands:
parquet help
Usage: parquet [options] [command] [command options]
Options:
-v, --verbose, --debug
Print extra debugging information
Commands:
help
Retrieves details on the functions of other commands
meta
Print a Parquet file's metadata
pages
Print page summaries for a Parquet file
dictionary
Print dictionaries for a Parquet column
check-stats
Check Parquet files for corrupt page and column stats (PARQUET-251)
schema
Print the Avro schema for a file
csv-schema
Build a schema from a CSV data sample
convert-csv
Create a file from CSV data
convert
Create a Parquet file from a data file
to-avro
Create an Avro file from a data file
cat
Print the first N records from a file
head
Print the first N records from a file
column-index
Prints the column and offset indexes of a Parquet file
column-size
Print the column sizes of a parquet file
prune
(Deprecated: will be removed in 2.0.0, use rewrite command instead) Prune column(s) in a Parquet file and save it to a new file. The columns left are not changed.
trans-compression
(Deprecated: will be removed in 2.0.0, use rewrite command instead) Translate the compression from one to another (It doesn't support bloom filter feature yet).
masking
(Deprecated: will be removed in 2.0.0, use rewrite command instead) Replace columns with masked values and write to a new Parquet file
footer
Print the Parquet file footer in json format
bloom-filter
Check bloom filters for a Parquet column
scan
Scan all records from a file
rewrite
Rewrite one or more Parquet files to a new Parquet file
Examples:
# print information for create
parquet help meta
See 'parquet help <command>' for more information on a specific command.