Skip to content

Commit

Permalink
Merge pull request #32 from timbray/topfew-2.0
Browse files Browse the repository at this point in the history
kaizen: prepare for 2.0 release
  • Loading branch information
timbray authored Jun 4, 2024
2 parents d4f3f66 + f35f4e3 commit 9f6a6a4
Show file tree
Hide file tree
Showing 8 changed files with 48 additions and 46 deletions.
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ Topfew is hosted in this GitHub repository
at `github.com/timbray/topfew` and welcomes
contributions.

This is release 1.0 of Topfew, which is probably more
This is release 2.0 of Topfew, which is probably more
or less complete. It is well-tested. Its performance
at processing streams can keep up with most streams
and it is dramatically faster when processing files,
Expand Down
2 changes: 1 addition & 1 deletion INSTALLING.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
Each Topfew [release](https://github.com/timbray/topfew/releases) comes with binaries built for both the x86 and ARM
flavors of Linux, MacOS, and Windows.

Topfew comes with a Makefile which is uncomplicated. Typing `make` will create an executable named `tf`,
Topfew comes with a Makefile which is uncomplicated. Typing `make` will create an executable named `topfew`,
created by `go build` with no options, in the `./bin` directory.

## Arch Linux
Expand Down
18 changes: 9 additions & 9 deletions Makefile
Original file line number Diff line number Diff line change
@@ -1,18 +1,18 @@
.PHONY: test

all: test tf
all: test topfew

test: main.go internal/*.go
go test ./... && go vet ./...

# local version you can run
tf:
go build -o bin/tf
topfew:
go build -o bin/topfew

release: test
GOOS=darwin GOARCH=arm64 go build -o tf && gzip < tf > tf-macos-arm.gz
GOOS=darwin GOARCH=amd64 go build -o tf && gzip < tf > tf-macos-x86.gz
GOOS=linux GOARCH=amd64 go build -o tf && gzip < tf > tf-linux-x86.gz
GOOS=linux GOARCH=arm64 go build -o tf && gzip < tf > tf-linux-arm.gz
GOOS=windows GOARCH=amd64 go build -o tf && zip -mq tf-windows-x86.exe.zip tf
GOOS=windows GOARCH=arm64 go build -o tf && zip -mq tf-windows-arm.exe.zip tf
GOOS=darwin GOARCH=arm64 go build -o topfew && gzip < topfew>topfew-macos-arm.gz
GOOS=darwin GOARCH=amd64 go build -o topfew && gzip < topfew>topfew-macos-x86.gz
GOOS=linux GOARCH=amd64 go build -o topfew && gzip < topfew>topfew-linux-x86.gz
GOOS=linux GOARCH=arm64 go build -o topfew && gzip < topfew>topfew-linux-arm.gz
GOOS=windows GOARCH=amd64 go build -o topfew && zip -mq topfew-windows-x86.exe.zip topfew
GOOS=windows GOARCH=arm64 go build -o topfew && zip -mq topfew-windows-arm.exe.zip topfew
38 changes: 20 additions & 18 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,38 +8,40 @@

A program that finds and prints out the top few records in which a certain field or combination of fields occurs most frequently.

This is release 1.0 of Topfew.
This is release 2.0 of Topfew.

## Examples

To find the IP address that most commonly hits your web site, given an Apache logfile named `access_log`.

`tf --fields 1 access_log`
`topfew --fields 1 access_log`

The same effect could be achieved with

`awk '{print $1}' access_log | sort | uniq -c | sort -rn | head`

But **tf** is usually much faster.
But **topfew** is usually much faster.

Do the same, but exclude high-traffic bots (omitting the filename).

`tf --fields 1 --vgrep googlebot --vgrep bingbot`
`topfew --fields 1 --vgrep googlebot --vgrep bingbot`

Most popular IP addresses from May 2020.

`tf --fields 1 -grep '\[../May/2020'`
`topfew --fields 1 -grep '\[../May/2020'`

Most popular hour/minute of the day for retrievals.

`tf --fields 4 --sed "\\[" "" --sed '^[^:]*:' '' --sed ':..$' ''`
`topfew --fields 4 --sed "\\[" "" --sed '^[^:]*:' '' --sed ':..$' ''`

## Usage

```shell
tf
topfew
-n, --number (output line count) [default is 10]
-f, --fields (field list) [default is the whole record]
-q, --quotedfields [respect "-delimited space-separated fields]
-p, --fieldseparator (regexp) [use provided regexp to separate fields]
-g, --grep (regexp) [may repeat, default is accept all]
-v, --vgrep (regexp) [may repeat, default is reject none]
-s, --sed (regexp) (replacement) [may repeat, default is no changes]
Expand All @@ -48,7 +50,7 @@ tf
-h, -help, --help
filename [default is stdin]
All the arguments are optional; if none are provided, tf will read records
All the arguments are optional; if none are provided, topfew will read records
from the standard input and list the 10 which occur most often.
```
## Options
Expand All @@ -63,7 +65,7 @@ Specifies which fields should be extracted from incoming records and used in com
The fieldlist must be a comma‐separated list of integers identifying field numbers, which start at one, for example 3 and 2,5,6.
The fields must be provided in order, so 3,1,7 is an error.
If no fieldlist is provided, **tf** treats the whole input record as a single field.
If no fieldlist is provided, **topfew** treats the whole input record as a single field.
`-p separator, --fieldseparator separator`
Expand All @@ -74,13 +76,13 @@ This is likely to incur a significant performance cost.
Some files, for example Apache httpd logs, use space-separation but also
allow spaces within fields which are delimited by `"`. The -q/--quotedfields
argument allows **tf** to process these correctly. It is an error to specify both
argument allows **topfew** to process these correctly. It is an error to specify both
-p and -q.
`-g regexp`, `--grep regexp`
The initial **g** suggests `grep`.
This option applies the provided regular expression to each record as it is read and if the regexp does not match the record, **tf** bypasses it.
This option applies the provided regular expression to each record as it is read and if the regexp does not match the record, **topfew** bypasses it.
This option can be provided multiple times; the provided regular expressions will be applied in the order they appear on the command line.
Expand All @@ -101,19 +103,19 @@ This option can be provided many times, and the replacement operations are perf
`--sample`
It can be tricky to get the regular expressions in the `−g`, `−v`, and `−s` options right.
Specifying `-−sample` causes **tf** to print lines to the standard output that display the filtering and field‐editing logic.
Specifying `-−sample` causes **topfew** to print lines to the standard output that display the filtering and field‐editing logic.
It can only be used when processing standard input, not a file.
`-w integer`, `--width integer`
If a file name is specified then **tf**, rather than reading it from end to end, will divide it into segments and process it in multiple parallel threads.
If a file name is specified then **topfew**, rather than reading it from end to end, will divide it into segments and process it in multiple parallel threads.
The optimal number of threads depends in a complicated way on how many cores your CPU has what kind of cores they are, and the storage architecture.
The default is the result of the Go `runtime.NumCPU()` calls and often produces good results.
`-h`, `-help`, `--help`
Describes the function and options of **tf**.
Describes the function and options of **topfew**.
## Records and fields
Expand Down Expand Up @@ -142,10 +144,10 @@ summarizing the request and its result, is delimited by quote characters `"`.
The fetch of `picInfo.xml` signals that this is an actual browser request, likely signifying that
a human was involved; the URL following the `o=` is the resource the human looked at. Here is a
**tf** invocation that yields a list of the top 5 URLs that were fetched by a human:
**topfew** invocation that yields a list of the top 5 URLs that were fetched by a human:
```shell
tf -g picInfo.xml -f 6 -q -s '\?utm.*' '' -s " HTTP/..." "" -s "GET .*\/ongoing" ""
topfew -g picInfo.xml -f 6 -q -s '\?utm.*' '' -s " HTTP/..." "" -s "GET .*\/ongoing" ""
```
Note the `-g` to select only lines with `picInfo.xml`, the `-q` to request correct processing
Expand All @@ -160,8 +162,8 @@ Therefore, the observed effects of combinations of options can vary dramatically
For example, if I want to list the top records containing the string `example` from a file named `big-file` I could do either of the following:
```shell
tf -g example big-file
grep example big-file | tf
topfew -g example big-file
grep example big-file |topfew
```
When I benchmark topfew on a modern Apple-Silicon Mac and an elderly spinning-rust Linux VPS, I observe that the first option is faster on Mac, the second on Linux.
Expand Down
16 changes: 8 additions & 8 deletions doc/tf.1
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@ A program that finds and prints out the top few records in which a certain field
.PP
To find the IP address that most commonly hits your web site, given an Apache logfile named \fB\fCaccess_log\fR\&.
.PP
\fB\fCtf \-\-fields 1 access_log\fR
\fB\fCtopfew\-\-fields 1 access_log\fR
.PP
The same effect could be achieved with
.PP
Expand All @@ -15,20 +15,20 @@ But \fBtf\fP is usually much faster.
.PP
Do the same, but exclude high\-traffic bots (omitting the filename).
.PP
\fB\fCtf \-\-fields 1 \-\-vgrep googlebot \-\-vgrep bingbot\fR
\fB\fCtopfew\-\-fields 1 \-\-vgrep googlebot \-\-vgrep bingbot\fR
.PP
Most popular IP addresses from May 2020.
.PP
\fB\fCtf \-\-fields 1 \-grep '\\[../May/2020'\fR
\fB\fCtopfew\-\-fields 1 \-grep '\\[../May/2020'\fR
.PP
Most popular hour/minute of the day for retrievals.
.PP
\fB\fCtf \-\-fields 4 \-\-sed "\\\\[" "" \-\-sed '^[^:]*:' '' \-\-sed ':..$' ''\fR
\fB\fCtopfew\-\-fields 4 \-\-sed "\\\\[" "" \-\-sed '^[^:]*:' '' \-\-sed ':..$' ''\fR
.SH Usage
.PP
.RS
.nf
tf
topfew
\-n, \-\-number (output line count) [default is 10]
\-f, \-\-fields (field list) [default is the whole record]
\-g, \-\-grep (regexp) [may repeat, default is accept all]
Expand All @@ -39,7 +39,7 @@ tf
\-h, \-help, \-\-help
filename [default is stdin]

All the arguments are optional; if none are provided, tf will read records
All the arguments are optional; if none are provided, topfewwill read records
from the standard input and list the 10 which occur most often.
.fi
.RE
Expand Down Expand Up @@ -102,8 +102,8 @@ For example, if I want to list the top records containing the string \fB\fCexamp
.PP
.RS
.nf
tf \-g example big\-file
grep example big\-file | tf
topfew\-g example big\-file
grep example big\-file |topfew
.fi
.RE
.PP
Expand Down
12 changes: 6 additions & 6 deletions internal/config.go
Original file line number Diff line number Diff line change
Expand Up @@ -131,14 +131,14 @@ func parseFields(spec string) ([]uint, error) {
}

const instructions = `
tf (short for "topfew") finds the most common values in a line-structured input
topfew finds the most common values in a line-structured input
and prints the top few of them out, with their occurrence counts, in decreasing
order of occurrences.
Usage: tf
Usage:topfew
-n, --number (output line count) [default is 10]
-f, --fields (field list) [default is the whole record]
-p, --fieldseparator (field separator regex) [default is white space]
-p, --fieldseparator (field separator regex) [default is white space]
-q, --quotedfields [default is false]
-g, --grep (regexp) [may repeat, default is accept all]
-v, --vgrep (regexp) [may repeat, default is reject none]
Expand All @@ -148,7 +148,7 @@ Usage: tf
-h, -help, --help
filename [default is stdin]
All the arguments are optional; if none are provided, tf will read records
All the arguments are optional; if none are provided, topfew will read records
from the standard input and list the 10 which occur most often.
Field list is comma-separated integers, e.g. -f 3 or --fields 1,3,7. The fields
Expand All @@ -160,7 +160,7 @@ performance.
Some files, for example Apache httpd logs, use space-separation but also
allow spaces within fields which are quoted with ("). The -q/--quotedfields
allows tf to process these correctly. It is an error to specify both
allows topfew to process these correctly. It is an error to specify both
-p and -q.
The regexp-valued fields work as follows:
Expand All @@ -171,7 +171,7 @@ The regexp-valued fields work as follows:
The regexp-valued fields can be supplied multiple times; the filtering
and substitution will be performed in the order supplied.
If the input is a named file, tf will process it in multiple parallel
If the input is a named file, topfew will process it in multiple parallel
threads, which can dramatically improve performance. The --width argument
allows you to specify the number of threads. The default value is not always
optimal; experience with particular data on a particular computer may lead
Expand Down
4 changes: 2 additions & 2 deletions internal/segmenter_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -68,7 +68,7 @@ func TestReadSegmentFiltering(t *testing.T) {
t.Error("config!")
}

tmpName := fmt.Sprintf("/tmp/tf-%d", os.Getpid())
tmpName := fmt.Sprintf("/tmp/topfew-%d", os.Getpid())
tmpfile, err := os.Create(tmpName)
if err != nil {
t.Fatal("can't make tmpfile: " + err.Error())
Expand All @@ -90,7 +90,7 @@ func TestReadSegmentFiltering(t *testing.T) {
// ErrBufferFull condition, had to create lines 80k long to execute that, so rather than clutter
// up the filesystem with this junk, we create them synthetically
func TestVeryLongLines(t *testing.T) {
tmpName := fmt.Sprintf("/tmp/tf-%d", os.Getpid())
tmpName := fmt.Sprintf("/tmp/topfew-%d", os.Getpid())
tmpfile, err := os.Create(tmpName)
if err != nil {
t.Fatal("can't make tmpfile: " + err.Error())
Expand Down
2 changes: 1 addition & 1 deletion main.go
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ func main() {

config, err := topfew.Configure(os.Args[1:]) // skip whatever go puts in os.Args[0]
if err != nil {
fmt.Println("Problem (tf -h for help): " + err.Error())
fmt.Println("Problem (topfew -h for help): " + err.Error())
os.Exit(1)
}

Expand Down

0 comments on commit 9f6a6a4

Please sign in to comment.