Support ZSTD compression in-process #1360

Merged
merged 4 commits into from
Aug 19, 2023
2 changes: 1 addition & 1 deletion docs/src/customization.md
@@ -50,7 +50,7 @@ and the `--csv` part will automatically be understood. If you do want to process

* You can include any command-line flags, except the "terminal" ones such as `--help`.

* The `--prepipe`, `--load`, and `--mload` flags aren't allowed in `.mlrrc` as they control code execution, and could result in your scripts running things you don't expect if you receive data from someone with a `./.mlrrc` in it. You can use `--prepipe-bz2`, `--prepipe-gunzip`, and `--prepipe-zcat` in `.mlrrc`, though.
* The `--prepipe`, `--load`, and `--mload` flags aren't allowed in `.mlrrc` as they control code execution, and could result in your scripts running things you don't expect if you receive data from someone with a `./.mlrrc` in it. You can use `--prepipe-bz2`, `--prepipe-gunzip`, `--prepipe-zcat`, and `--prepipe-zstdcat` in `.mlrrc`, though.

* The formatting rule is you need to put one flag beginning with `--` per line: for example, `--csv` on one line and `--nr-progress-mod 1000` on a separate line.

2 changes: 1 addition & 1 deletion docs/src/customization.md.in
@@ -34,7 +34,7 @@ and the `--csv` part will automatically be understood. If you do want to process

* You can include any command-line flags, except the "terminal" ones such as `--help`.

* The `--prepipe`, `--load`, and `--mload` flags aren't allowed in `.mlrrc` as they control code execution, and could result in your scripts running things you don't expect if you receive data from someone with a `./.mlrrc` in it. You can use `--prepipe-bz2`, `--prepipe-gunzip`, and `--prepipe-zcat` in `.mlrrc`, though.
* The `--prepipe`, `--load`, and `--mload` flags aren't allowed in `.mlrrc` as they control code execution, and could result in your scripts running things you don't expect if you receive data from someone with a `./.mlrrc` in it. You can use `--prepipe-bz2`, `--prepipe-gunzip`, `--prepipe-zcat`, and `--prepipe-zstdcat` in `.mlrrc`, though.

* The formatting rule is you need to put one flag beginning with `--` per line: for example, `--csv` on one line and `--nr-progress-mod 1000` on a separate line.

5 changes: 5 additions & 0 deletions docs/src/glossary.md
@@ -905,3 +905,8 @@ See also the [arrays page](reference-main-arrays.md), as well as the page on

A [data-compression format supported by Miller](reference-main-compressed-data.md).
Files compressed using ZLIB compression normally end in `.z`.

## ZSTD / .zst

A [data-compression format supported by Miller](reference-main-compressed-data.md).
Files compressed using ZSTD compression normally end in `.zst`.
5 changes: 5 additions & 0 deletions docs/src/glossary.md.in
@@ -889,3 +889,8 @@ See also the [arrays page](reference-main-arrays.md), as well as the page on

A [data-compression format supported by Miller](reference-main-compressed-data.md).
Files compressed using ZLIB compression normally end in `.z`.

## ZSTD / .zst

A [data-compression format supported by Miller](reference-main-compressed-data.md).
Files compressed using ZSTD compression normally end in `.zst`.
8 changes: 6 additions & 2 deletions docs/src/manpage.md
@@ -262,7 +262,7 @@ MILLER(1) MILLER(1)
Miller offers a few different ways to handle reading data files
which have been compressed.

* Decompression done within the Miller process itself: `--bz2in` `--gzin` `--zin`
* Decompression done within the Miller process itself: `--bz2in` `--gzin` `--zin` `--zstdin`
* Decompression done outside the Miller process: `--prepipe` `--prepipex`

Using `--prepipe` and `--prepipex` you can specify an action to be
@@ -285,7 +285,7 @@ MILLER(1) MILLER(1)

Lastly, note that if `--prepipe` or `--prepipex` is specified, it replaces any
decisions that might have been made based on the file suffix. Likewise,
`--gzin`/`--bz2in`/`--zin` are ignored if `--prepipe` is also specified.
`--gzin`/`--bz2in`/`--zin`/`--zstdin` are ignored if `--prepipe` is also specified.

--bz2in Uncompress bzip2 within the Miller process. Done by
default if file ends in `.bz2`.
@@ -302,6 +302,8 @@ MILLER(1) MILLER(1)
`.mlrrc`.
--prepipe-zcat Same as `--prepipe zcat`, except this is allowed in
`.mlrrc`.
--prepipe-zstdcat Same as `--prepipe zstdcat`, except this is allowed
in `.mlrrc`.
--prepipex {decompression command}
Like `--prepipe` with one exception: doesn't insert
`<` between command and filename at runtime. Useful
@@ -310,6 +312,8 @@ MILLER(1) MILLER(1)
in `.mlrrc` to avoid unexpected code execution.
--zin Uncompress zlib within the Miller process. Done by
default if file ends in `.z`.
--zstdin Uncompress zstd within the Miller process. Done by
default if file ends in `.zst`.

CSV/TSV-ONLY FLAGS
These are flags which are applicable to CSV format.
8 changes: 6 additions & 2 deletions docs/src/manpage.txt
@@ -241,7 +241,7 @@ MILLER(1) MILLER(1)
Miller offers a few different ways to handle reading data files
which have been compressed.

* Decompression done within the Miller process itself: `--bz2in` `--gzin` `--zin`
* Decompression done within the Miller process itself: `--bz2in` `--gzin` `--zin` `--zstdin`
* Decompression done outside the Miller process: `--prepipe` `--prepipex`

Using `--prepipe` and `--prepipex` you can specify an action to be
@@ -264,7 +264,7 @@ MILLER(1) MILLER(1)

Lastly, note that if `--prepipe` or `--prepipex` is specified, it replaces any
decisions that might have been made based on the file suffix. Likewise,
`--gzin`/`--bz2in`/`--zin` are ignored if `--prepipe` is also specified.
`--gzin`/`--bz2in`/`--zin`/`--zstdin` are ignored if `--prepipe` is also specified.

--bz2in Uncompress bzip2 within the Miller process. Done by
default if file ends in `.bz2`.
@@ -281,6 +281,8 @@ MILLER(1) MILLER(1)
`.mlrrc`.
--prepipe-zcat Same as `--prepipe zcat`, except this is allowed in
`.mlrrc`.
--prepipe-zstdcat Same as `--prepipe zstdcat`, except this is allowed
in `.mlrrc`.
--prepipex {decompression command}
Like `--prepipe` with one exception: doesn't insert
`<` between command and filename at runtime. Useful
@@ -289,6 +291,8 @@ MILLER(1) MILLER(1)
in `.mlrrc` to avoid unexpected code execution.
--zin Uncompress zlib within the Miller process. Done by
default if file ends in `.z`.
--zstdin Uncompress zstd within the Miller process. Done by
default if file ends in `.zst`.

CSV/TSV-ONLY FLAGS
These are flags which are applicable to CSV format.
2 changes: 1 addition & 1 deletion docs/src/new-in-miller-6.md
@@ -143,7 +143,7 @@ the `TZ` environment variable. Please see [DSL datetime/timezone functions](refe

### In-process support for compressed input

In addition to `--prepipe gunzip`, you can now use the `--gzin` flag. In fact, if your files end in `.gz` you don't even need to do that -- Miller will autodetect by file extension and automatically uncompress `mlr --csv cat foo.csv.gz`. Similarly for `.z` and `.bz2` files. Please see the page on [Compressed data](reference-main-compressed-data.md) for more information.
In addition to `--prepipe gunzip`, you can now use the `--gzin` flag. In fact, if your files end in `.gz` you don't even need to do that -- Miller will autodetect by file extension and automatically uncompress `mlr --csv cat foo.csv.gz`. Similarly for `.z`, `.bz2`, and `.zst` files. Please see the page on [Compressed data](reference-main-compressed-data.md) for more information.

### Support for reading web URLs

2 changes: 1 addition & 1 deletion docs/src/new-in-miller-6.md.in
@@ -125,7 +125,7 @@ the `TZ` environment variable. Please see [DSL datetime/timezone functions](refe

### In-process support for compressed input

In addition to `--prepipe gunzip`, you can now use the `--gzin` flag. In fact, if your files end in `.gz` you don't even need to do that -- Miller will autodetect by file extension and automatically uncompress `mlr --csv cat foo.csv.gz`. Similarly for `.z` and `.bz2` files. Please see the page on [Compressed data](reference-main-compressed-data.md) for more information.
In addition to `--prepipe gunzip`, you can now use the `--gzin` flag. In fact, if your files end in `.gz` you don't even need to do that -- Miller will autodetect by file extension and automatically uncompress `mlr --csv cat foo.csv.gz`. Similarly for `.z`, `.bz2`, and `.zst` files. Please see the page on [Compressed data](reference-main-compressed-data.md) for more information.

### Support for reading web URLs

12 changes: 6 additions & 6 deletions docs/src/reference-main-compressed-data.md
@@ -16,13 +16,13 @@ Quick links:
</div>
# Compressed data

As of [Miller 6](new-in-miller-6.md), Miller supports reading GZIP, BZIP2, and
ZLIB formats transparently, and in-process. And (as before Miller 6) you have a
As of [Miller 6](new-in-miller-6.md), Miller supports reading GZIP, BZIP2, ZLIB, and
ZSTD formats transparently, and in-process. And (as before Miller 6) you have a
more general `--prepipe` option to support other decompression programs.

## Automatic detection on input

If your files end in `.gz`, `.bz2`, or `.z` then Miller will autodetect by file extension:
If your files end in `.gz`, `.bz2`, `.z`, or `.zst` then Miller will autodetect by file extension:

<pre class="pre-highlight-in-pair">
<b>file gz-example.csv.gz</b>
@@ -52,7 +52,7 @@ This will decompress the input data on the fly, while leaving the disk file unmo

## Manual detection on input

If the filename doesn't in in `.gz`, `.bz2`, or `.z` then you can use the flags `--gzin`, `--bz2in`, or `--zin` to let Miller know:
If the filename doesn't end in `.gz`, `.bz2`, `.z`, or `.zst` then you can use the flags `--gzin`, `--bz2in`, `--zin`, or `--zstdin` to let Miller know:

<pre class="pre-highlight-non-pair">
<b>mlr --csv --gzin sort -f color myfile.bin # myfile.bin has gzip contents</b>
@@ -94,7 +94,7 @@ If the command has flags, quote them: e.g. `mlr --prepipe 'zcat -cf'`.

In your [.mlrrc file](customization.md), `--prepipe` and `--prepipex` are not
allowed as they could be used for unexpected code execution. You can use
`--prepipe-bz2`, `--prepipe-gunzip`, and `--prepipe-zcat` in `.mlrrc`, though.
`--prepipe-bz2`, `--prepipe-gunzip`, `--prepipe-zcat`, and `--prepipe-zstdcat` in `.mlrrc`, though.

Note that this feature is quite general and is not limited to decompression
utilities. You can use it to apply per-file filters of your choice: e.g. `mlr
@@ -107,7 +107,7 @@ There is a `--prepipe` and a `--prepipex`:

Lastly, note that if `--prepipe` or `--prepipex` is specified on the Miller
command line, it replaces any autodetect decisions that might have been made
based on the filename extension. Likewise, `--gzin`/`--bz2in`/`--zin` are ignored if
based on the filename extension. Likewise, `--gzin`/`--bz2in`/`--zin`/`--zstdin` are ignored if
`--prepipe` or `--prepipex` is also specified.
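
The precedence described in this hunk can be sketched in Go as follows. This is a minimal illustration, not Miller's actual implementation: `chooseDecoder`, `suffixEncodings`, and the `enc*` constants are hypothetical names, and checking an explicit flag before the filename suffix is an assumption (the text only states that `--prepipe`/`--prepipex` override both).

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical encoding names for this sketch; Miller's real identifiers
// (such as lib.FileInputEncodingZstd) live in its own packages.
const (
	encGzip = "gzip"
	encBz2  = "bzip2"
	encZlib = "zlib"
	encZstd = "zstd"
)

// suffixEncodings maps filename suffixes to encodings, following the
// extensions listed in the documentation text.
var suffixEncodings = []struct{ suffix, enc string }{
	{".gz", encGzip},
	{".bz2", encBz2},
	{".z", encZlib},
	{".zst", encZstd},
}

// chooseDecoder reports how a file would be decoded: a prepipe command
// wins, then an explicit flag (assumed order), then the filename suffix.
func chooseDecoder(filename, prepipe, flagEnc string) string {
	if prepipe != "" {
		return "prepipe:" + prepipe
	}
	if flagEnc != "" {
		return "in-process:" + flagEnc
	}
	for _, e := range suffixEncodings {
		if strings.HasSuffix(filename, e.suffix) {
			return "in-process:" + e.enc
		}
	}
	return "plain"
}

func main() {
	fmt.Println(chooseDecoder("example.csv.zst", "", ""))
	fmt.Println(chooseDecoder("example.csv.zst", "zstdcat", ""))
	fmt.Println(chooseDecoder("myfile.bin", "", encGzip))
}
```

Keeping the suffix table as data means a new format only needs one more row, which matches how this change adds `.zst` alongside the existing extensions.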

## Compressed output
12 changes: 6 additions & 6 deletions docs/src/reference-main-compressed-data.md.in
@@ -1,12 +1,12 @@
# Compressed data

As of [Miller 6](new-in-miller-6.md), Miller supports reading GZIP, BZIP2, and
ZLIB formats transparently, and in-process. And (as before Miller 6) you have a
As of [Miller 6](new-in-miller-6.md), Miller supports reading GZIP, BZIP2, ZLIB, and
ZSTD formats transparently, and in-process. And (as before Miller 6) you have a
more general `--prepipe` option to support other decompression programs.

## Automatic detection on input

If your files end in `.gz`, `.bz2`, or `.z` then Miller will autodetect by file extension:
If your files end in `.gz`, `.bz2`, `.z`, or `.zst` then Miller will autodetect by file extension:

GENMD-CARDIFY-HIGHLIGHT-ONE
file gz-example.csv.gz
@@ -21,7 +21,7 @@ This will decompress the input data on the fly, while leaving the disk file unmo

## Manual detection on input

If the filename doesn't in in `.gz`, `.bz2`, or `.z` then you can use the flags `--gzin`, `--bz2in`, or `--zin` to let Miller know:
If the filename doesn't end in `.gz`, `.bz2`, `.z`, or `.zst` then you can use the flags `--gzin`, `--bz2in`, `--zin`, or `--zstdin` to let Miller know:

GENMD-CARDIFY-HIGHLIGHT-ONE
mlr --csv --gzin sort -f color myfile.bin # myfile.bin has gzip contents
@@ -50,7 +50,7 @@ If the command has flags, quote them: e.g. `mlr --prepipe 'zcat -cf'`.

In your [.mlrrc file](customization.md), `--prepipe` and `--prepipex` are not
allowed as they could be used for unexpected code execution. You can use
`--prepipe-bz2`, `--prepipe-gunzip`, and `--prepipe-zcat` in `.mlrrc`, though.
`--prepipe-bz2`, `--prepipe-gunzip`, `--prepipe-zcat`, and `--prepipe-zstdcat` in `.mlrrc`, though.

Note that this feature is quite general and is not limited to decompression
utilities. You can use it to apply per-file filters of your choice: e.g. `mlr
@@ -63,7 +63,7 @@ There is a `--prepipe` and a `--prepipex`:

Lastly, note that if `--prepipe` or `--prepipex` is specified on the Miller
command line, it replaces any autodetect decisions that might have been made
based on the filename extension. Likewise, `--gzin`/`--bz2in`/`--zin` are ignored if
based on the filename extension. Likewise, `--gzin`/`--bz2in`/`--zin`/`--zstdin` are ignored if
`--prepipe` or `--prepipex` is also specified.

## Compressed output
6 changes: 4 additions & 2 deletions docs/src/reference-main-flag-list.md
@@ -72,7 +72,7 @@ Notes:
Miller offers a few different ways to handle reading data files
which have been compressed.

* Decompression done within the Miller process itself: `--bz2in` `--gzin` `--zin`
* Decompression done within the Miller process itself: `--bz2in` `--gzin` `--zin` `--zstdin`
* Decompression done outside the Miller process: `--prepipe` `--prepipex`

Using `--prepipe` and `--prepipex` you can specify an action to be
@@ -95,7 +95,7 @@ compression (or other) utilities, simply pipe the output:

Lastly, note that if `--prepipe` or `--prepipex` is specified, it replaces any
decisions that might have been made based on the file suffix. Likewise,
`--gzin`/`--bz2in`/`--zin` are ignored if `--prepipe` is also specified.
`--gzin`/`--bz2in`/`--zin`/`--zstdin` are ignored if `--prepipe` is also specified.


**Flags:**
@@ -106,8 +106,10 @@ decisions that might have been made based on the file suffix. Likewise,
* `--prepipe-bz2`: Same as `--prepipe bz2`, except this is allowed in `.mlrrc`.
* `--prepipe-gunzip`: Same as `--prepipe gunzip`, except this is allowed in `.mlrrc`.
* `--prepipe-zcat`: Same as `--prepipe zcat`, except this is allowed in `.mlrrc`.
* `--prepipe-zstdcat`: Same as `--prepipe zstdcat`, except this is allowed in `.mlrrc`.
* `--prepipex {decompression command}`: Like `--prepipe` with one exception: doesn't insert `<` between command and filename at runtime. Useful for some commands like `unzip -qc` which don't read standard input. Allowed at the command line, but not in `.mlrrc` to avoid unexpected code execution.
* `--zin`: Uncompress zlib within the Miller process. Done by default if file ends in `.z`.
* `--zstdin`: Uncompress zstd within the Miller process. Done by default if file ends in `.zst`.

## CSV/TSV-only flags

1 change: 1 addition & 0 deletions go.mod
@@ -34,6 +34,7 @@ require (
github.com/davecgh/go-spew v1.1.1 // indirect
github.com/felixge/fgprof v0.9.3 // indirect
github.com/google/pprof v0.0.0-20211214055906-6f57359322fd // indirect
github.com/klauspost/compress v1.16.7 // indirect
github.com/pkg/errors v0.9.1 // indirect
github.com/pmezard/go-difflib v1.0.0 // indirect
gopkg.in/yaml.v3 v3.0.1 // indirect
2 changes: 2 additions & 0 deletions go.sum
@@ -15,6 +15,8 @@ github.com/johnkerl/lumin v1.0.0 h1:CV34cHZOJ92Y02RbQ0rd4gA0C06Qck9q8blOyaPoWpU=
github.com/johnkerl/lumin v1.0.0/go.mod h1:eLf5AdQOaLvzZ2zVy4REr/DSeEwG+CZreHwNLICqv9E=
github.com/kballard/go-shellquote v0.0.0-20180428030007-95032a82bc51 h1:Z9n2FFNUXsshfwJMBgNA0RU6/i7WVaAegv3PtuIHPMs=
github.com/kballard/go-shellquote v0.0.0-20180428030007-95032a82bc51/go.mod h1:CzGEWj7cYgsdH8dAjBGEr58BoE7ScuLd+fwFZ44+/x8=
github.com/klauspost/compress v1.16.7 h1:2mk3MPGNzKyxErAw8YaohYh69+pa4sIQSC0fPGCFR9I=
github.com/klauspost/compress v1.16.7/go.mod h1:ntbaceVETuRiXiv4DpjP66DpAtAGkEQskQzEyD//IeE=
github.com/lestrrat-go/envload v0.0.0-20180220234015-a3eb8ddeffcc h1:RKf14vYWi2ttpEmkA4aQ3j4u9dStX2t4M8UM6qqNsG8=
github.com/lestrrat-go/envload v0.0.0-20180220234015-a3eb8ddeffcc/go.mod h1:kopuH9ugFRkIXf3YoqHKyrJ9YfUFsckUU9S7B+XP+is=
github.com/lestrrat-go/strftime v1.0.6 h1:CFGsDEt1pOpFNU+TJB0nhz9jl+K0hZSLE205AhTIGQQ=
24 changes: 22 additions & 2 deletions internal/pkg/cli/option_parse.go
@@ -2200,7 +2200,8 @@ func CompressedDataPrintInfo() {
fmt.Print(`Miller offers a few different ways to handle reading data files
which have been compressed.

* Decompression done within the Miller process itself: ` + "`--bz2in`" + ` ` + "`--gzin`" + ` ` + "`--zin`" + `
* Decompression done within the Miller process itself: ` + "`--bz2in`" + ` ` + "`--gzin`" + ` ` + "`--zin`" + ` ` + "`--zstdin`" + `
* Decompression done outside the Miller process: ` + "`--prepipe`" + ` ` + "`--prepipex`" + `

Using ` + "`--prepipe`" + ` and ` + "`--prepipex`" + ` you can specify an action to be
@@ -2223,7 +2224,7 @@ compression (or other) utilities, simply pipe the output:

Lastly, note that if ` + "`--prepipe`" + ` or ` + "`--prepipex`" + ` is specified, it replaces any
decisions that might have been made based on the file suffix. Likewise,
` + "`--gzin`" + `/` + "`--bz2in`" + `/` + "`--zin`" + ` are ignored if ` + "`--prepipe`" + ` is also specified.
` + "`--gzin`" + `/` + "`--bz2in`" + `/` + "`--zin`" + `/` + "`--zstdin`" + ` are ignored if ` + "`--prepipe`" + ` is also specified.
`)
}

@@ -2278,6 +2279,16 @@ var CompressedDataFlagSection = FlagSection{
},
},

{
name: "--prepipe-zstdcat",
help: "Same as `--prepipe zstdcat`, except this is allowed in `.mlrrc`.",
parser: func(args []string, argc int, pargi *int, options *TOptions) {
options.ReaderOptions.Prepipe = "zstdcat"
options.ReaderOptions.PrepipeIsRaw = false
*pargi += 1
},
},

{
name: "--prepipe-bz2",
help: "Same as `--prepipe bz2`, except this is allowed in `.mlrrc`.",
@@ -2314,6 +2325,15 @@ var CompressedDataFlagSection = FlagSection{
*pargi += 1
},
},

{
name: "--zstdin",
help: "Uncompress zstd within the Miller process. Done by default if file ends in `.zst`.",
parser: func(args []string, argc int, pargi *int, options *TOptions) {
options.ReaderOptions.FileInputEncoding = lib.FileInputEncodingZstd
*pargi += 1
},
},
},
}
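
The table-driven flag pattern used by the two new entries can be illustrated with a small standalone sketch. Everything here is a simplified assumption for illustration: `readerOptions`, `flagEntry`, `flagTable`, and `parseFlags` are hypothetical stand-ins, not Miller's `TOptions`, `FlagSection`, or its real parsing loop, and the parser signature omits the `argc` parameter seen in the diff.

```go
package main

import "fmt"

// readerOptions is a hypothetical stand-in for the reader options that the
// parsers in the diff mutate.
type readerOptions struct {
	Prepipe           string
	PrepipeIsRaw      bool
	FileInputEncoding string
}

// flagEntry mirrors the shape of the table entries: a flag name plus a
// parser that updates the options and advances the argument index.
type flagEntry struct {
	name   string
	parser func(args []string, pargi *int, opts *readerOptions)
}

var flagTable = []flagEntry{
	{
		name: "--prepipe-zstdcat",
		parser: func(args []string, pargi *int, opts *readerOptions) {
			opts.Prepipe = "zstdcat"
			opts.PrepipeIsRaw = false
			*pargi += 1
		},
	},
	{
		name: "--zstdin",
		parser: func(args []string, pargi *int, opts *readerOptions) {
			opts.FileInputEncoding = "zstd"
			*pargi += 1
		},
	},
}

// parseFlags consumes leading arguments that match flagTable and returns
// the index of the first argument it does not recognize.
func parseFlags(args []string, opts *readerOptions) int {
	argi := 0
	for argi < len(args) {
		matched := false
		for _, entry := range flagTable {
			if entry.name == args[argi] {
				entry.parser(args, &argi, opts)
				matched = true
				break
			}
		}
		if !matched {
			break
		}
	}
	return argi
}

func main() {
	var opts readerOptions
	rest := parseFlags([]string{"--zstdin", "cat", "data.bin"}, &opts)
	fmt.Println(opts.FileInputEncoding, rest)
}
```

Each parser advances the index itself, so a flag that takes a value could consume two arguments without any change to the loop.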
