Welcome! This is a "rich" tutorial that covers all of the basics of this cantankerous, odd, and ultimately incredibly fast, joyful, and productive tool called ni. We have tried to assume as little knowledge as possible in this tutorial, but if you find anything confusing, please contact the developers or the author.
ni and Perl both suffer from their sharp differences from other tools. This tutorial is structured in 6 parts:
- Intro to ni
- Perl for ni
- The real power of ni
- ni and Ruby, Perl, Python, Lisp & Bash
- ni odds and ends
- ni + Jupyter (TODO)
ni is write-anywhere, run-everywhere.
ni is a self-modifying quine.
ni is everything-from-the-command-line.
ni is concise.
ni is beautiful.
ni is two-fisted data science.
ni
works on any Unix-based OS. You should use a bash prompt when calling ni
.
curl -sSL https://spencertipping.com/install-ni | bash
# Alternatively, to install from source:
git clone git@github.com:spencertipping/ni.git
cd ni
ln -s $PWD/ni ~/bin/ni # or whatever to add it to your path
ni
is a stream-processing language. Most operations in ni
are done over a single line, which enables ni
to be fast and memory-efficient.
ni n
generates a stream of consecutive integers starting at 1. The number after determines how many numbers will be generated.
$ ni n5
1
2
3
4
5
In general, ni
will drop you into a less
pager after a command finishes. You can change the default pager by setting the NI_PAGER
environment variable.
ni n0
gives you consecutive integers starting from zero. For example:
$ ni n03
0
1
2
To generate a large number of integers, you can use scientific notation with n
. ni n3.2E5
will give you 320,000 consecutive integers, starting from 1.
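If you already know coreutils, it may help to compare n with seq. The sketch below is only an analogy built from standard tools; it assumes a seq command is available (as on Linux and macOS), the function names are made up for illustration, and it says nothing about how ni implements n.

```shell
# Rough analogues of ni's integer generators, using seq.
# The function names are hypothetical; they are not part of ni.
n_from_one()  { seq 1 "$1"; }              # comparable to: ni n5
n_from_zero() { seq 0 "$(( $1 - 1 ))"; }   # comparable to: ni n03

n_from_one 5    # prints 1 through 5, one per line
n_from_zero 3   # prints 0, 1, 2
```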
The i
operator puts literal text into the stream:
$ ni ihello
hello
You can use quotes with i
to include spaces within strings.
$ ni i"hello there"
hello there
ni
is optimized to work with tab-delimited text. If you want your text to be tab-delimited, put your text inside square brackets.
$ ni i[hello there new friend]
hello there new friend
ni
is deeply connected to bash, so easy access is provided to running bash commands from within ni
.
$ ni n500 e'grep 22'
22
122
220
221
222
223
224
225
226
227
228
229
322
422
ni
commands are composed of operators. Examples introduced in the last section include n, which generates a stream of integers; e, which executes a bash command; and i, which puts literal text to the stream. Another is z, which compresses a stream. Later, we'll introduce more complex operators like HS, which executes a Hadoop Streaming MapReduce job.
In the last section, you saw a ni
command linking two operators; n500
was used to generate a stream of integers from 1 to 500, and the e'grep 22'
was used to take the lines that had a 22
in them. If you're not used to working with streams, there's a slightly subtle point to notice.
In general, commands are written ni <op_1> <op_2> ... <op_n>. It is often helpful to think of such a command as piping the output of each operator to the input of the next: ni <op_1> | ni <op_2> | ... | ni <op_n>.
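The same mental model can be written out with ordinary Unix tools. This is a sketch of the idea, not of ni's internals: each stage reads the previous stage's output, one line at a time. It assumes seq and grep are installed, and the function name is hypothetical.

```shell
# One process per stage, joined by pipes.
# Comparable in spirit to: ni n500 e'grep 22'
numbers_containing_22() {
  seq 1 500 | grep 22
}

numbers_containing_22 | wc -l   # counts the matching lines
```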
More compressed syntax is favored, and very short commands are often compressed without spaces in between. A common example is sort (g) + unique (u); this is commonly written as gu rather than the more explicit g u. Because ni commands are highly compressed and difficult for the uninitiated to read, they are sometimes referred to as "spells."
ni
provides very flexible file input and output. In particular, compressing streams in ni
is very simple, and decompression is done by default.
ni
is a streaming tool, so you can redirect output using the standard redirect.
$ ni n5 > five.txt
While this works fine, it is in general not used, in favor of the literal angle-bracket operator \>
described in the next section.
To add a file to the stream in ni, pass the name of the file as an argument.
$ ni five.txt
1
2
3
4
5
$ ni n5 \>five.txt
five.txt
Note that there is no space between \> and five.txt. Because ni is concise, whitespace is frequently important.
\> is favored over > because \< inverts it by reading the data from filenames.
$ ni n5 \>five.txt \<
1
2
3
4
5
It's important to understand how this short spell works; first, a stream of 5 integers is generated; those integers sink to a file called five.txt
, and the text string five.txt
is put out to the stream. Finally, \<
instructs ni
to open the file named five.txt
and put its contents on the stream.
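To make the two steps concrete, here is a plain-shell sketch of the behavior described above. The helper names are hypothetical and the code is only an approximation of what \> and \< do; ni's own implementation may differ.

```shell
# Hypothetical helpers that mimic the described behavior of \>FILE and \<.

write_then_name() {    # like \>FILE: save stdin to FILE, then emit FILE's name
  cat > "$1"
  printf '%s\n' "$1"
}

read_named_files() {   # like \<: treat each input line as a filename, emit its contents
  while IFS= read -r name; do
    cat "$name"
  done
}

workdir=$(mktemp -d)
seq 1 5 | write_then_name "$workdir/five.txt" | read_named_files
rm -rf "$workdir"
```

The filename is only emitted after the file has been fully written, which is why the reading stage can safely open it.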
The contents of this section
Start by making some data:
$ rm -rf dir_test && mkdir dir_test
$ echo "hello" > dir_test/file1
$ echo "you" > dir_test/file2
$ echo "genius" > dir_test/file3
We can look at the contents of the directory with ni
:
$ ni dir_test
dir_test/file1
dir_test/file2
dir_test/file3
ni
has converted the folder into a stream of the names of files (and directories) inside it. You can thus access the files inside a directory using \<
.
$ ni dir_test \<
hello
you
genius
ni
will use bash expansion as well:
$ ni dir_test/*
hello
you
genius
$ ni dir_test/file{1,3}
hello
genius
ni
provides compression in a highly keystroke-efficient way, using the z
operator.
$ ni n10 z \>ten.gz
ten.gz
ni
decompresses its input by default.
$ cat ten.gz
2�SY3�2�2�2�2�2�2���24�뿊
$ cat ten.gz | ni
1
2
3
4
5
6
7
8
9
10
$ ni n10 z \>ten.gz \<
1
2
3
4
5
6
7
8
9
10
The default compression with z is gzip; however, there are options for bzip2 (zb), xz (zx), lzo (zo), and lz4 (z4). You can also pass a numeric compression level to gzip by using any digit other than 4 (which is taken by lz4); for example, z9 runs gzip -9 on the stream.
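For comparison, this is roughly what the round trip looks like when spelled out with gzip directly. It is a sketch that assumes gzip is installed; the function names are made up, and the explicit decompression step is the part ni performs for you on input.

```shell
# Round trip comparable to: ni n10 z9 \>ten.gz \<
# Function names are hypothetical.
compress_max() { gzip -9; }     # the command the text says z9 runs
decompress()   { gzip -dc; }    # explicit here; ni decompresses input by default

seq 1 10 | compress_max | decompress
```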
ni
works especially well on data formatted as tab-delimited rows of text; in this section we'll show how to filter, trim, and convert raw text into tab-delimited form.
r
is a powerful and flexible operation for filtering rows; it encompasses the functionality of Unix operators head
and tail
, as well as a large number of filtering operations.
rN
takes the first N rows.
$ ni n10 r3
1
2
3
r-N
drops the first N rows.
$ ni n10 r-3
4
5
6
7
8
9
10
r~N
and r+N
take the last N rows.
$ ni n10 r~3
8
9
10
$ ni n10 r+3
8
9
10
The rxN
option takes the first of every N
rows.
$ ni n10 rx3
1
4
7
10
Giving r a number between 0 and 1 makes ni select a (deterministic) pseudo-random sample of the stream data.
$ ni n20 r.15
1
3
5
14
These last examples show the value of r
in development; for example, if you are working with a large file or stream, you can check the correctness of your output using r10
, rx100
, r.001
etc. to view smaller samples of large datasets.
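If the r variants are hard to keep apart, it can help to line them up against the familiar tools they replace. The functions below are hypothetical names for a sketch using head, tail, and awk; they reproduce the outputs shown in the examples above but are not how ni implements r.

```shell
# Standard-tool analogues of the row operators (hypothetical helper names).
first_n()      { head -n "$1"; }                           # like rN
drop_first_n() { tail -n "+$(( $1 + 1 ))"; }               # like r-N
last_n()       { tail -n "$1"; }                           # like r~N and r+N
every_nth()    { awk -v n="$1" 'n == 1 || NR % n == 1'; }  # like rxN

seq 1 10 | first_n 3
seq 1 10 | drop_first_n 3
seq 1 10 | last_n 3
seq 1 10 | every_nth 3
```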
Like r
, F
also has many options. The most important is F/regex/
, which splits text into columns based on a regular expression.
$ ni ibubbles ibaubles ibarbaras F/[aeiou]+/
b bbl s
b bl s
b rb r s
The rest of the operations are syntactic sugar, but they're worth knowing.
FD
splits on forward slashes:
$ ni i~/bin/dependency/nightmare.jar FD
~ bin dependency nightmare.jar
FS
splits on runs of whitespace:
$ ni i"here is an example" FS
here is an example
FC
splits on commas, but doesn't handle CSV:
$ ni ibread,eggs,milk i'fruit gushers,index cards' FC
bread eggs milk
fruit gushers index cards
FV
splits CSV fields correctly (i.e. it doesn't split commas in quoted strings). However, ni
splits lines of input on newline characters, so this can't handle newlines in quoted fields.
$ ni i'"hello,there",one,two,three' FV
hello,there one two three
FW
splits on non-word characters (i.e. equivalent to splitting on the regex /\W+/
)
$ ni i'this@#$$gets&(*&^split' FW
this gets split
FP
splits on pipe characters:
$ ni i'need|quotes|around|pipes|because|of|bash' FP
need quotes around pipes because of bash
F:<char>
splits data on a particular character:
$ ni ibubbles ibaubles ibarbaras F:a
bubbles
b ubles
b rb r s
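Conceptually, each F variant replaces every match of a delimiter pattern with a tab. A minimal sketch of that idea with awk follows; the helper name is hypothetical, and edge cases (such as trailing empty fields or Perl-only regex syntax like \W) may behave differently in ni itself.

```shell
# Hypothetical helper: turn every match of a regex into a tab, line by line.
split_on_regex() {
  awk -v re="$1" '{ gsub(re, "\t"); print }'
}

printf 'bubbles\nbaubles\nbarbaras\n' | split_on_regex '[aeiou]+'   # like F/[aeiou]+/
printf 'bread,eggs,milk\n' | split_on_regex ','                     # like FC
printf 'here is an example\n' | split_on_regex '[ ]+'               # like FS, for spaces only
```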
Let's start by generating some text and converting it to columns using F
.
$ ni i"this is how we do it" i"it's friday night" i"and I feel all right" FS
this is how we do it
it's friday night
and I feel all right
In ni
, tab-delimited columns of data are referenced like a spreadsheet: the first column is A
, the second is B
, etc.
The f
operator gives you access to the columns of your data.
$ ni i"this is how we do it" i"it's friday night" \
i"and I feel all right" FS fC
how
night
feel
You can select multiple columns by providing multiple letters:
$ ni i"this is how we do it" i"it's friday night" \
i"and I feel all right" FS fAB
this is
it's friday
and I
You can duplicate a column by using its corresponding letter multiple times:
$ ni i"this is how we do it" i"it's friday night" \
i"and I feel all right" FS fAAC
this this how
it's it's night
and and feel
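To select a particular column and all columns after it, follow the column letter with a period, as in the fAD. example below (the trailing period is part of the operator).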
$ ni i"this is how we do it" i"it's friday night" \
i"and I feel all right" FS fAD.
this we do it
it's
and all right
To select all data between two columns, inclusive, use a dash. ni defaults to using cut to process column selections when possible, which normally produces identical output to its Perl-hosted column selector. On most platforms, the only difference arises when the input has fewer columns than you're selecting: in that case cut will not append blanks for the missing columns, resulting in a ragged right edge.
You can force ni to disallow cut for a specific operator by prepending a ^{} configuration block with the col/disallow-cut option. This option is enabled here because the code below is run as a cross-platform unit test.
$ ni i"this is how we do it" i"it's friday night" \
i"and I feel all right" FS ^{col/disallow-cut=1} fB-E
is how we do
friday night
I feel all right
You can also use f
to re-order selected columns:
$ ni i"this is how we do it" i"it's friday night" \
i"and I feel all right" FS fCBAD
how is this we
night friday it's
feel I and all
Columns can also be accessed by using #<number>. The same options are available by substituting A => #0, B => #1, C => #2, ... Z => #25. This syntax also allows you to access columns 26 and beyond.
$ ni i"this is how we do it" i"it's friday night" \
i"and I feel all right" FS f#2#1#0#3
how is this we
night friday it's
feel I and all
It is also possible to increase readability by inserting commas between specified columns.
$ ni i"this is how we do it" i"it's friday night" \
i"and I feel all right" FS fA,#3.
this we do it
it's
and all right
Under the hood, ni is using either cut or Perl to rearrange the columns. cut is much the faster of the two, but it's only used when the output columns are in the same relative order as the input columns.
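The restriction on cut is easy to see outside ni: cut always emits fields in their input order, no matter how you list them, so any reordering needs a different tool. The sketch below uses awk as a stand-in for ni's Perl path; the helper names are hypothetical.

```shell
# Order-preserving selection, comparable to fAD. (cut can handle this).
select_a_and_d_onward() { cut -f1,4-; }

# Reordering, comparable to fCBAD (cut cannot do this, so awk is used here).
reorder_c_b_a_d() { awk -F'\t' -v OFS='\t' '{ print $3, $2, $1, $4 }'; }

printf 'this\tis\thow\twe\tdo\tit\n' | select_a_and_d_onward
printf 'this\tis\thow\twe\tdo\tit\n' | reorder_c_b_a_d

# cut ignores the order of the field list: this still prints fields 1, 2, 3.
printf 'this\tis\thow\twe\tdo\tit\n' | cut -f3,2,1
```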
x
is a shorthand for certain column exchange operations. All of these operations can be written with f
, usually with a higher number of keystrokes.
$ ni i"Ain't nobody dope as me" \
i"I'm dressed so fresh, so clean" \
i"So fresh and so clean, clean" FS
Ain't nobody dope as me
I'm dressed so fresh, so clean
So fresh and so clean, clean
x
exchanges the first two columns (same as fBA.
)
$ ni i"Ain't nobody dope as me" \
i"I'm dressed so fresh, so clean" \
i"So fresh and so clean, clean" FS x
nobody Ain't dope as me
dressed I'm so fresh, so clean
fresh So and so clean, clean
x
with a single column exchanges that column with column 1.
$ ni i"Ain't nobody dope as me" \
i"I'm dressed so fresh, so clean" \
i"So fresh and so clean, clean" FS xD
as nobody dope Ain't me
fresh, dressed so I'm so clean
so fresh and So clean, clean
x
with multiple columns exchanges those into the first columns, in order.
$ ni i"Ain't nobody dope as me" \
i"I'm dressed so fresh, so clean" \
i"So fresh and so clean, clean" FS xEB
me nobody dope as Ain't
so dressed so fresh, I'm clean
clean, fresh and so So clean
With a single column of data, as in the example below, the simple command g will give you lexicographic sorting in ascending order.
$ ni ib ia ic g
a
b
c
To do more complicated sorts, you can give g
columns (A-Z
) and modifiers. For example, -
after a column reverses the sort.
$ ni ib ia ic gA-
c
b
a
n
after a column does a numeric sort.
$ ni i10 i5 i0.3 gAn
0.3
5
10
You can sort multiple columns in order:
$ ni i[b 6] i[b 3] i[a 2] i[a 1] i[c 4] i[c 5] i[a 0] gABn
a 0
a 1
a 2
b 3
b 6
c 4
c 5
The columns can be sorted in any order:
$ ni i[b 0] i[b 4] i[a 2] i[a 1] i[c 4] i[c 0] i[a 0] gBnA
a 0
b 0
c 0
a 1
a 2
b 4
c 4
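Readers who know the sort command may find it useful to see the two multi-column examples written with explicit sort keys. This is a sketch under the assumption that a POSIX sort is available; the helper names are hypothetical, and LC_ALL=C is set only to make the ordering predictable.

```shell
TAB=$(printf '\t')

# Comparable to gABn: column A lexicographically, then column B numerically.
sort_a_then_b_numeric() { LC_ALL=C sort -t "$TAB" -k1,1 -k2,2n; }

# Comparable to gBnA: column B numerically, then column A lexicographically.
sort_b_numeric_then_a() { LC_ALL=C sort -t "$TAB" -k2,2n -k1,1; }

printf 'b\t6\nb\t3\na\t2\na\t1\nc\t4\nc\t5\na\t0\n' | sort_a_then_b_numeric
printf 'b\t0\nb\t4\na\t2\na\t1\nc\t4\nc\t0\na\t0\n' | sort_b_numeric_then_a
```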
Often you will want numeric sorting in a more keystroke-efficient way than spelling out g with a column and the n and - modifiers. The o (sort rows ascending, numerically) and O (sort rows descending, numerically) operators are provided for this purpose.
$ ni i[b 6] i[b 3] i[a 2] i[a 1] i[c 4] i[c 5] i[a 0] oB
a 0
a 1
a 2
b 3
c 4
c 5
b 6
$ ni i[b 6] i[b 3] i[a 2] i[a 1] i[c 4] i[c 5] i[a 0] OB
b 6
c 5
c 4
b 3
a 2
a 1
a 0
u
is ni
syntax for uniq
, which takes sorted rows and returns the unique values.
$ ni i[b 6] i[b 3] i[a 2] i[a 1] i[c 4] i[c 5] i[a 0] fAgu
a
b
c
c
is ni
's version of uniq -c
, which counts the number of identical consecutive rows in a stream. The main difference is that ni
's c
tab-delimits the output, while uniq -c
space-delimits.
$ ni i[b 6] i[b 3] i[a 2] i[a 1] i[c 4] i[c 5] i[a 0] fAgc
3 a
2 b
2 c
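The difference in delimiters is easiest to see side by side. The sketch below builds a tab-delimited count from sort and uniq -c with a small awk step; the helper names are hypothetical and this is not ni's implementation.

```shell
# Comparable to gu: sort, then keep one copy of each row.
unique_sorted() { LC_ALL=C sort | uniq; }

# Comparable to gc: sort, count identical rows, and emit "count<TAB>row"
# instead of uniq's space-padded format.
count_sorted() {
  LC_ALL=C sort | uniq -c |
    awk '{ n = $1; sub(/^ *[0-9]+ /, ""); print n "\t" $0 }'
}

printf 'b\nb\na\na\nc\nc\na\n' | unique_sorted
printf 'b\nb\na\na\nc\nc\na\n' | count_sorted
```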
Sorts cannot be chained together using g. If you write a command like $ ni ... gA gBn, there is no guarantee that the output will have a sorted first column after the second sort. If you want to sort by the first column ascending lexicographically and the second column ascending numerically in the same sort, you should use a single, more explicit g operator: $ ni ... gABn.
If you have data that is already partially sorted, for example, when working with the input of the reduce step of a MapReduce job, you may want to perform an additional sort of the already partially-sorted data without sorting the entire stream.
Let's simulate this by sorting one column of the data and sinking the result to a tempfile.
$ ni i[b ba bar] i[b bi bif] i[b ba baz] \
i[q qa qat] i[q qu quux] i[b ba bake] \
i[u ub uber] gA \>tmp \<
b ba bake
b ba bar
b ba baz
b bi bif
q qa qat
q qu quux
u ub uber
If we want the data sorted as if we had done gAB-
, we cannot simply do gB-
to the data we have; this will blow away the sort on the first column.
$ ni i[b ba bar] i[b bi bif] i[b ba baz] \
i[q qa qat] i[q qu quux] i[b ba bake] \
i[u ub uber] gA \>tmp \< gB-
u ub uber
q qu quux
q qa qat
b bi bif
b ba bake
b ba bar
b ba baz
Instead, we use the grouped sort ggAB-.
$ ni i[b ba bar] i[b bi bif] i[b ba baz] \
i[q qa qat] i[q qu quux] i[b ba bake] \
i[u ub uber] gA \>tmp \< ggAB-
b bi bif
b ba bake
b ba bar
b ba baz
q qu quux
q qa qat
u ub uber
The first column indicates the key column (i.e. the sorted column we want to hold constant); anything after the first column is treated like the arguments to the sorting operator, g.
Sorting large amounts of data requires buffering to disk. Be careful about how much data you sort; large sorts are a source of headaches and slow performance.
$ ni nE7 F// fB gc \>first_numeral_counts.txt
Running this command, you may see the ni monitor for the first time; it shows the steps of the computation and which steps are taking time. The whole process takes about 2 minutes on my computer.
General sorting is not yet a point of strength for ni. If your data is larger than a gigabyte uncompressed, you may want to take advantage of massively distributing the workload through Hadoop operations.
ni has 2 join operators, j and J. j is an inner join with simplified functionality, similar to the Unix join command. Like the Unix join, j only works properly when both streams are sorted.
J is a left join. It does not impose any requirements on sorting, but it requires that the right side of the join fit into memory, and it only allows a single value for each key on the right side of the join.
Joining and filtering have a lot in common, since they both require a key. Later in this section, some more uses of r
to filter datasets are demonstrated.
You can use the j
operator to inner-join two streams.
$ ni i[foo bar] i[foo car] i[foo dar] i[that no] i[this yes] j[ i[foo mine] i[not here] i[this OK] i[this yipes] ]
foo bar mine
foo car mine
foo dar mine
this yes OK
this yes yipes
Without any options, j will join on the first tab-delimited column of both streams. However, j can also join on multiple columns by referencing the columns by letter:
$ ni i[M N foo] i[M N bar] i[M O qux] i[X Y cat] i[X Z dog] \
jAB[ i[M N hi] i[X Y bye] ]
M N foo hi
M N bar hi
X Y cat bye
In general, the streams you are joining should be pre-sorted (though j
will not fail if the streams aren't sorted).
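Since j is described as similar to the Unix join command, the first j example can be reproduced with join itself. This is a sketch for comparison only; it assumes a POSIX join, the helper name and file names are hypothetical, and both inputs are already sorted on the first column.

```shell
TAB=$(printf '\t')

# Hypothetical helper: inner-join two tab-delimited files on their first column.
# Both files must already be sorted on that column.
inner_join_first_column() {
  LC_ALL=C join -t "$TAB" "$1" "$2"
}

workdir=$(mktemp -d)
printf 'foo\tbar\nfoo\tcar\nfoo\tdar\nthat\tno\nthis\tyes\n' > "$workdir/left.tsv"
printf 'foo\tmine\nnot\there\nthis\tOK\nthis\tyipes\n' > "$workdir/right.tsv"
inner_join_first_column "$workdir/left.tsv" "$workdir/right.tsv"
rm -rf "$workdir"
```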
The join here is slightly asymmetric; the left side of the join is streamed, while the right side is buffered into memory. This is useful to keep in mind for joins in a Hadoop Streaming context; the left side of the join should be larger (i.e. have more records) than the right side.
$ ni i[foo bar] i[foo car] i[that no] i[this yes] i[foo dar] \
J[ i[this yipes] i[this OK] i[foo mine] i[not here] ]
foo bar mine
foo car mine
that no
this yes OK
foo dar mine
There are a number of things to notice here. First, neither the right nor the left side of the join needs to be sorted. Second, notice that this is a left join: the row i[that no] passes through even though there is no associated key on the right. Finally, on the right side of the join, notice that when several rows share a key, the value from the last of them is used.
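These three properties fall out naturally from a hash-based join, which the following awk sketch imitates. It is an illustration of the described behavior, not ni's code: the helper name is hypothetical, and for simplicity it keeps only the second column of each right-side row.

```shell
# Hypothetical helper: left join of stdin against a right-side file ($1),
# keyed on the first column. The right side is loaded into memory first.
left_join_in_memory() {
  awk -F'\t' -v OFS='\t' '
    FILENAME == ARGV[1] { value[$1] = $2; next }       # later rows overwrite earlier ones
    ($1 in value)       { print $0, value[$1]; next }  # key found: append the value
                        { print }                      # key missing: pass the row through
  ' "$1" -
}

workdir=$(mktemp -d)
printf 'this\tyipes\nthis\tOK\nfoo\tmine\nnot\there\n' > "$workdir/right.tsv"
printf 'foo\tbar\nfoo\tcar\nthat\tno\nthis\tyes\nfoo\tdar\n' |
  left_join_in_memory "$workdir/right.tsv"
rm -rf "$workdir"
```

Because the left side is read line by line and only the right side is held in an array, memory use depends on the size of the right side alone.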
r
can more generally be thought of as "take rows where the predicate that follows evaluates to true."
Because an empty column is falsey (it evaluates to false), we can filter it using r
.
$ ni i[one_column] i[two columns] i[three columns here] rB
two columns
three columns here
A very common motif in ni
(especially in the MapReduce context) is to filter a large dataset down to a much smaller one with a certain set of keys or values. You can also filter a column based on another dataset using ri
and the index of the column to filter.
$ ni i[one_column] i[two columns] i[three columns here] \
riA[ione_column ithree]
one_column
three columns here
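The same key-filtering motif is often written in awk as a two-input pass: load the allowed keys, then keep only the data rows whose first column appears among them. The sketch below follows that pattern; the helper name and file name are hypothetical, and it only approximates what riA does.

```shell
# Hypothetical helper: keep rows from stdin whose first column is listed in
# the key file ($1), one key per line.
keep_rows_with_key_in() {
  awk -F'\t' '
    FILENAME == ARGV[1] { allowed[$1]; next }   # remember each allowed key
    $1 in allowed                               # print data rows with a known key
  ' "$1" -
}

workdir=$(mktemp -d)
printf 'one_column\nthree\n' > "$workdir/keys.txt"
printf 'one_column\ntwo\tcolumns\nthree\tcolumns\there\n' |
  keep_rows_with_key_in "$workdir/keys.txt"
rm -rf "$workdir"
```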
In this context, r can take a regex as an option: $ ni <data> r/<regex>/ takes all rows where the regex has a match. We can rewrite our earlier example for e, $ ni n500 e'grep 22', as:
$ ni n500 r/22/
22
122
220
221
222
223
224
225
226
227
228
229
322
422
To use escaped characters in a regex, it's often more efficient to wrap in quotes:
$ ni n1000 r-500 r'/^(\d)\1+$/'
555
666
777
888
999
To write this same command without quotes requires a lot of escaping: $ ni n1000 r-500 r/^\(\\d\)\\1+$/
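The quoting issue is not specific to ni; any regex passed through the shell has the same problem, which is why single quotes are the usual habit. As a point of comparison, here is the same filter written for grep. It is a sketch assuming a POSIX grep, using basic-regex syntax (so the group and repetition are backslash-escaped) and [0-9] in place of \d; the function name is hypothetical.

```shell
# Keep lines made of a single digit repeated two or more times.
# Single quotes stop the shell from touching the backslashes and the $.
repeated_digit_only() {
  grep '^\([0-9]\)\1\{1,\}$'
}

seq 501 1000 | repeated_digit_only   # comparable to: ni n1000 r-500 r'/^(\d)\1+$/'
```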
The simplest way to build up a ni
spell is by writing one step of the spell, checking that step's output for correctness, then writing another step.
In general, ni
spells will start producing output very quickly (or can be coerced to produce output quickly). Once the output of one step in the spell looks good, you can move on to the next step.
As you advance through this tutorial, you'll want a quicker way to understand at a high level what a particular ni
spell is doing. For this, use ni --explain ...
. Using the example from the file output section:
$ ni --explain n10 \>ten.txt \<
["n",1,11]
["file_write","ten.txt"]
["file_read"]
Each line represents one step of the pipeline defined by the spell, and the explanation shows how the ni
parser interprets what's written. The explanations are usually concise, but they can help you make sure your code is doing what it's supposed to.
ni
is a bottom-up, ad hoc language; ni
spells can be developed efficiently from the command line, or from a command line-like environment, like a Jupyter notebook.
Congrats on making it to the end of the first part. Hopefully you're starting to see the power in ni
's conciseness. If you haven't gotten a chance to develop or play with ni
code yet, there will likely be some accompanying exercises for this tutorial in the near future, or you can write some yourself and contribute to the development of this fascinating language.
The next chapter covers all the Perl you need to be productive in ni
. You need some, but not too much.