In this section, we introduce ni
-specific Perl extensions.
ni
was developed at Factual, Inc., which works with
mobile location data; these geographically oriented operators are open source
and highly efficient. There's also a blog post if you're interested in
learning more.
The next set of operators work with multiple lines of input data; this allows reductions over multiple lines to take place. This section ends with a discussion of more I/O functions.
Geohashes are an efficient way of encoding a position on the globe, and are also useful for determining neighboring locations.
The geohashing algorithm works by splitting first on longitude, then on latitude. Thus, geohashes with an odd number of binary bits of precision will be (approximately) squares, and geohashes with an even number of bits will be (approximately) rectangles with their longer side parallel to the equator.
| base-32 characters | Approximate geohash size |
|---|---|
| 1 | 5,000km x 5,000km |
| 2 | 1,250km x 625km |
| 3 | 160km x 160km |
| 4 | 40km x 20km |
| 5 | 5km x 5km |
| 6 | 1.2km x 600m |
| 7 | 150m x 150m |
| 8 | 40m x 20m |
| 9 | 5m x 5m |
| 10 | 1.2m x 60cm |
| 11 | 15cm x 15cm |
| 12 | 4cm x 2cm |
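To make the bit interleaving concrete, here is a sketch of the standard geohash encoding algorithm in plain Python. This is not ni's implementation, just the textbook algorithm; the alphabet is the standard geohash base-32 alphabet.

```python
# Standard geohash base-32 alphabet (no a, i, l, o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_bits(lat, lng, nbits):
    """Interleave longitude/latitude bisections, longitude first."""
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    bits = 0
    for i in range(nbits):
        if i % 2 == 0:                      # even bit positions split longitude
            mid = (lng_lo + lng_hi) / 2
            bit = lng >= mid
            if bit: lng_lo = mid
            else:   lng_hi = mid
        else:                               # odd bit positions split latitude
            mid = (lat_lo + lat_hi) / 2
            bit = lat >= mid
            if bit: lat_lo = mid
            else:   lat_hi = mid
        bits = (bits << 1) | int(bit)
    return bits

def geohash(lat, lng, chars):
    """Base-32 geohash with `chars` characters (5 bits each)."""
    n = geohash_bits(lat, lng, 5 * chars)
    return "".join(BASE32[(n >> (5 * i)) & 31] for i in reversed(range(chars)))
```

With these definitions, `geohash(34.058566, -118.416526, 7)` reproduces the 7-character geohash and `geohash_bits(34.058566, -118.416526, 35)` the 35-bit integer shown in the llg examples below.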
llg($lat, $lng, $precision)
returns a geohash either in a special base-32 alphabet or as a long integer. ghe
is a synonym of llg
provided for backward compatibility.
If $precision > 0
, the geohash is specified with $precision
base-32 characters. When geohashes are specified in this way, they are referred to in short as gh<$precision>
, for example gh6, gh8, or gh12. If $precision < 0
, it returns an integer geohash with -$precision
bits of precision.
Examples:
$ ni i[34.058566 -118.416526] p'llg(a, b, 7)'
9q5cc25
$ ni i[34.058566 -118.416526] p'llg(a, b, -35)'
10407488581
The default is to encode with 12 base-32 characters, i.e. a gh12, or 60 bits of precision.
$ ni i[34.058566 -118.416526] p'llg(a, b)'
9q5cc25twby7
The parentheses are also often unnecessary because of Perl's subroutine prototyping:
$ ni i[34.058566 -118.416526] p'llg a, b, 9'
9q5cc25tw
ni
provides two prototypes for geohash decoding:
gll($gh_base32)
Returns the latitude and longitude (in that order) of the center point corresponding to that geohash. ghd
is a synonym of gll
provided for backward compatibility.
gll($gh_int, $precision)
decodes the input integer as a geohash with $precision
bits and returns the latitude and longitude (in that order) of the center point corresponding to that geohash. As with llg
, parentheses are not always necessary.
Examples:
$ ni i[34.058566 -118.416526] p'r gll llg a, b'
34.058565851301 -118.416526280344
$ ni i[34.058566 -118.416526] p'r gll llg(a, b, -41), 41'
34.0584754943848 -118.416652679443
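Decoding reverses the encoding: walk the bits, narrow the bounding box, and return its center. A Python sketch (a hypothetical helper, not part of ni):

```python
def geohash_center(bits, nbits):
    """Decode an integer geohash to the (lat, lng) center of its cell."""
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    for i in range(nbits):
        bit = (bits >> (nbits - 1 - i)) & 1
        if i % 2 == 0:                      # even bit positions refine longitude
            mid = (lng_lo + lng_hi) / 2
            if bit: lng_lo = mid
            else:   lng_hi = mid
        else:                               # odd bit positions refine latitude
            mid = (lat_lo + lat_hi) / 2
            if bit: lat_lo = mid
            else:   lat_hi = mid
    return ((lat_lo + lat_hi) / 2, (lng_lo + lng_hi) / 2)
```

As with gll, precision is lost: the center of a 35-bit cell is close to, but not exactly, the original point.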
g3b
(geohash base-32 to binary) and gb3
(geohash binary to base-32) transcode between base-32 and binary geohashes.
$ ni i9q5cc25tufw5 p'r g3b a'
349217367909022597
gb3
takes a binary geohash and its precision in binary bits, and returns a base-32 geohash.
$ ni i[349217367909022597 9q5cc25tufw5] p'r gb3 a, 60; r gb3 g3b b, 60;'
9q5cc25tufw5
9q5cc25tufw5
It is also useful to compute the distance between the center points of geohashes; this is implemented through gh_dist
.
$ ni i[95qcc25y 95qccdnv] p'gh_dist a, b, mi'
1.23981551084308
When the units are not specified, gh_dist
gives its answers in kilometers, which can be specified explicitly as "km". Other options include feet ("ft") and meters ("m"). To specify meters, you will need to use quotation marks, as the bareword m
is taken.
$ ni i[95qcc25y 95qccdnv] p'gh_dist a, b'
1.99516661267524
You can also pass in binary geohashes, along with a precision.
$ ni i[95qcc25y 95qccdnv] p'gh_dist g3b a, g3b b, 40'
1.99516661267524
$ ni i[95qcc25y 95qccdnv] p'gh_dist g3b a, g3b b, 40, "m"'
1995.16661267524
lat_lon_dist implements the haversine distance formula. Like gh_dist
, it also takes units ("mi", "km", "ft", "m"), and will return the distance in kilometers if no other information is given.
$ ni 1p'lat_lon_dist 31.21984, 121.41619, 34.058686, -118.416762'
10426.7380460312
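The haversine formula itself is short; here is a Python sketch. The Earth radius of 6371 km is an assumption, so results may differ from ni's by a few kilometers on long distances.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, r=6371.0):
    """Great-circle distance between two lat/lon points, in kilometers."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))
```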
For plotting, it is useful to get the latitude and longitude coordinates of the box that is mapped to a particular geohash. ghb
returns the coordinates of that box in order: northernmost point, southernmost point, easternmost point, westernmost point.
$ ni 1p'r ghb "95qc"'
18.6328123323619 18.45703125 -125.156250335276 -125.5078125
tpe
and tep
are the most flexible and powerful time functions in ni
's arsenal for converting between Unix timestamps and dates. Other important functions here are used for localization by lat/long or geohash, and several syntactic sugars are provided for common mathematical and semantic time operations.
tpe(@time_pieces)
: Returns the epoch time in GMT (see important note below)
and assumes that the pieces are year, month, day, hour, minute, and second,
in that order.
$ ni 1p'tpe(2017, 1, 22, 8, 5, 13)'
1485072313
You can also specify a format string and call the function as tpe($time_format, @time_pieces)
.
$ ni 1p'tpe("mdYHMS", 1, 22, 2017, 8, 5, 13)'
1485072313
tep($epoch_time)
: returns the year, month, day, hour, minute, and second in human-readable format from the epoch time.
$ ni 1p'r tep tpe 2017, 1, 22, 8, 5, 13'
2017 1 22 8 5 13
A specific format string can also be provided, in which case tep
is called as tep($time_format, $epoch_time)
.
tep($raw_timestamp + tsec($lat, $lng))
returns the approximate date and time at the location $lat, $lng
at a Unix timestamp of $raw_timestamp
.
For example, let's say you have the Unix timestamp and want to know what time it is at the coordinates: 34.0585660 N, 118.4165260 W.
$ ni i[34.058566 -118.416526] \
p'my $epoch_time = 1485079513; my $tz_offset = tsec(a, b);
my @local_time_parts = tep($epoch_time + $tz_offset);
r @local_time_parts'
2017 1 22 2 41 13
This correction cuts the globe into 4-minute strips by degree of longitude. It is meant for approximation of local time, not the actual time, which depends much more on politics and introduces many more factors to think about.
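The 4-minutes-per-degree idea can be sketched as follows. This is a naive approximation of my own, not ni's tsec, whose rounding may differ, so don't expect byte-identical output.

```python
def tz_offset_seconds(lng):
    """Approximate solar-time offset: 24 hours / 360 degrees = 4 minutes per degree."""
    return 240.0 * lng
```

For example, tz_offset_seconds(-120) is -28800 seconds, i.e. UTC-8.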
ghl
and gh6l
get local time from a geohash input in base-32 (ghl
) or from the base-10 representation of a 60-bit binary geohash (gh6l
).
$ ni i[34.058566 -118.416526] p'llg a, b' \
p'my $epoch_time = 1485079513;
my @local_time_parts = tep ghl($epoch_time, a);
r @local_time_parts'
2017 1 22 2 41 13
$ ni i[34.058566 -118.416526] p'llg a, b, -60' \
p'my $epoch_time = 1485079513;
my @local_time_parts = tep gh6l($epoch_time, a);
r @local_time_parts'
2017 1 22 2 41 13
ISO 8601 is a standardized time format defined by the International Organization
for Standardization. More information on formats is available
here. ni
implements decoding for
date-time-timezone formatted ISO data into Unix timestamps.
A similar non-ISO format (allowing for a space between date and time, and decimal precision to the time, which is preserved), is also supported.
$ ni i2017-06-24T18:23:47+00:00 i2017-06-24T19:23:47+01:00 \
i2017-06-24T15:23:47-03:00 i2017-06-24T13:08:47-05:15 \
i20170624T152347-0300 i20170624T182347Z \
i2017-06-24T18:23:47+00:00 i"2017-06-24 18:23:47.683" p'i2e a'
1498328627
1498328627
1498328627
1498328627
1498328627
1498328627
1498328627
1498328627.683
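For comparison, the extended-format variants can be parsed in plain Python with datetime, after normalizing the Z suffix (which fromisoformat only accepts directly in newer Pythons):

```python
from datetime import datetime

def iso_to_epoch(s):
    """Parse an extended-format ISO 8601 date-time-timezone string to a Unix timestamp."""
    return datetime.fromisoformat(s.replace("Z", "+00:00")).timestamp()
```

Compact forms like 20170624T152347-0300 need strptime instead on Python versions before 3.11.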
ni
can also format epoch timestamps in an ISO 8601-compatible form. To do this, input an epoch timestamp and either a floating-point number of hours, or a timezone format string, for example "+10:00"
.
$ ni i2017-06-24T18:23:47+00:00 p'i2e a' \
p'r e2i a; r e2i a, -1.5; r e2i a, "+3";
r e2i a, "-05:45"'
2017-06-24T18:23:47Z
2017-06-24T16:53:47-01:30
2017-06-24T21:23:47+03:00
2017-06-24T12:38:47-05:45
These all represent the same instant:
$ ni i2017-06-24T18:23:47+00:00 p'i2e a' \
p'r e2i a; r e2i a, -1.5; r e2i a, "+3";
r e2i a, "-05:45"' p'i2e a'
1498328627
1498328627
1498328627
1498328627
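The reverse direction, formatting an epoch at a fixed offset, looks like this in Python. This is a sketch: ni's e2i also accepts timezone format strings, which this helper does not.

```python
from datetime import datetime, timedelta, timezone

def epoch_to_iso(epoch, offset_hours=0.0):
    """Format a Unix timestamp as ISO 8601 at a fixed UTC offset, given in hours."""
    tz = timezone(timedelta(hours=offset_hours))
    return datetime.fromtimestamp(epoch, tz).isoformat()
```

One cosmetic difference: at offset zero, Python prints +00:00 where ni prints Z.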
tpi takes 7 mandatory arguments: year, month, day, hour, minute, second, and time zone. The time zone can be given either as a string, -08:00
, or a number, -8
.
$ ni 1p'r tpi tep(tpe(2018, 1, 14, 9, 12, 31)), "Z"'
2018-01-14T09:12:31Z
To verify correctness:
$ ni 1p'r i2e tpi tep(1515801233), "Z"'
1515801233
usfe takes a US-formatted time and converts it to epoch time.
$ ni i'2/14/18 0:46' p'r usfe a'
1518569160
dow, hod, how, and ym give the three-letter abbreviation for day of week, the hour of day, the hour of week, and the year + month, respectively.
$ ni i[34.058566 -118.416526] p'llg a, b, -60' \
p'my $epoch_time = 1485079513; dow gh6l($epoch_time, a)'
Sun
$ ni i[34.058566 -118.416526] p'llg a, b, -60' \
p'my $epoch_time = 1485079513; hod gh6l($epoch_time, a)'
2
$ ni i[34.058566 -118.416526] p'llg a, b, -60' \
p'my $epoch_time = 1485079513; how gh6l($epoch_time, a)'
Sun_02
$ ni i[34.058566 -118.416526] p'llg a, b, -60' \
p'my $epoch_time = 1485079513; ym gh6l($epoch_time, a)'
2017-01
These functions truncate timestamps, which is useful for bucketing times; they're much faster than calls to the POSIX library, which can make them more practical in performance-sensitive environments (HS
, for example).
$ ni i1494110651 p'r tep ttd(a); r tep tth(a);
r tep tt15(a); r tep ttm(a); r tep a'
2017 5 6 0 0 0
2017 5 6 22 0 0
2017 5 6 22 30 0
2017 5 6 22 44 0
2017 5 6 22 44 11
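The truncations are just modular arithmetic on the epoch value. The sketch below mirrors the ni function names but is a re-implementation, not the original; day truncation is relative to UTC midnight.

```python
def ttd(t):  return t - t % 86400   # truncate to the day (UTC midnight)
def tth(t):  return t - t % 3600    # truncate to the hour
def tt15(t): return t - t % 900     # truncate to the 15-minute mark
def ttm(t):  return t - t % 60      # truncate to the minute
```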
So far, we have seen many ways to operate on data, but only one way to reduce it: the counting operator c
. Buffered readahead allows us to perform operations on many lines at once. These operations are used to convert columnar data into arrays of lines.
In general, the types of reductions that can be done with buffered readahead and multiline reducers can also be done with streaming reduce (discussed in a later chapter); however, the syntax for buffered readahead is often much simpler. Generating arrays of lines using readahead operations is a common motif in ni
scripts:
- rl(n): read n lines. @lines = rl(n); n is optional, and if it is not given, only one line will be read.
- rw: read while. @lines = rw {condition}: read lines while a condition is met.
- ru: read until. @lines = ru {condition}: read lines until a condition is met.
- re: read equal. @lines = re {condition}: read lines while the value of the condition is equal.
- r1: read all. @lines = r1: read all the lines in the stream.
$ ni n10 p'r rl 3'
1 2 3
4 5 6
7 8 9
10
$ ni n10p'r rw {a < 7}'
1 2 3 4 5 6
7
8
9
10
$ ni n10p'r ru {a % 4 == 0}'
1 2 3
4 5 6 7
8 9 10
$ ni n10p'r re {int(a**2/30)}'
1 2 3 4 5
6 7
8 9
10
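The re {...} operator behaves like Python's itertools.groupby over consecutive rows; the last example, sketched in Python:

```python
from itertools import groupby

rows = range(1, 11)
# Group consecutive rows by the value of int(a**2 / 30).
groups = [list(g) for _, g in groupby(rows, key=lambda a: a * a // 30)]
```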
You have now seen two ways to generate arrays of lines: buffered readahead and data closures. In order to access data from a specific column of a line array, you will need to use multiline operators a_
through l_
, which are the multiline analogs to the line-based operators a/a()
through l/l()
.
$ ni i[j can] i[j you] i[j feel] \
i[k the] i[k love] i[l tonight] \
p'my @lines = re {a}; r @lines;'
j can j you j feel
k the k love
l tonight
re {a}
will give us all of the lines where a()
has the same value (i.e. the first column is the same). What gets stored in the variable my @lines
is the array of text lines where a()
has the same value: ("j\tcan", "j\tyou", "j\tfeel")
.
To get data from one particular column of this data, we can use the multiline selectors; let's say we only wanted the second column of that data--we can get that by applying b_
to @lines
.
$ ni i[j can] i[j you] i[j feel] \
i[k the] i[k love] i[l tonight] \
p'my @lines = re {a}; r b_(@lines)'
can you feel
the love
tonight
That gives us the second element of each line, grouped by the value of the first column.
As with most everything in ni
, we can shorten this up considerably; we can drop the parentheses around @lines
, and use reA
as a shorthand for re {a}:
$ ni i[j can] i[j you] i[j feel] \
i[k the] i[k love] i[l tonight] \
p'my @lines = reA; r b_ @lines'
can you feel
the love
tonight
And if we want to do this in a single line, we can combine multiline selection with buffered readahead:
$ ni i[j can] i[j you] i[j feel] \
i[k the] i[k love] i[l tonight] \
p'r b_ reA'
can you feel
the love
tonight
The last two examples are used commonly. The former, more explicit one is often used when we want more than one reduction:
$ ni i[j can] i[j you] i[j feel] \
i[k the] i[k love] i[l tonight] \
p'my @lines = reA; r b_ @lines; r a_ @lines'
can you feel
j j j
the love
k k
tonight
l
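In plain Python terms, reA plus the multiline selectors correspond to grouping consecutive rows by the first field and slicing out a column. This is a sketch of the idea, not ni's internals:

```python
from itertools import groupby

rows = [("j", "can"), ("j", "you"), ("j", "feel"),
        ("k", "the"), ("k", "love"), ("l", "tonight")]
by_key = [list(g) for _, g in groupby(rows, key=lambda r: r[0])]   # reA analog
seconds = [[b for _, b in g] for g in by_key]                      # b_ analog
firsts  = [[a for a, _ in g] for g in by_key]                      # a_ analog
```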
Another common motif is getting the values of a single column within each group.
In the previous section, we covered reA
as syntactic sugar for re {a}
$ ni i[a x first] i[a x second] \
i[a y third] i[b y fourth] p'r c_ reA'
first second third
fourth
This syntactic sugar is slightly different for reB, reC, ..., reL
.
Let's take a look at what happens with re {b}
:
$ ni i[a x first] i[a x second] \
i[a y third] i[b y fourth] p'r c_ re {b}'
first second
third fourth
In general, that's not quite what we want; when we do a reduction like this, we've probably already sorted our data appropriately, so we'll want to reduce over all of the columns; writing a function that reduces over columns a
and b
is a little clunky, so we use reB
as syntactic sugar for that function.
$ ni i[a x first] i[a x second] \
i[a y third] i[b y fourth] p'r c_ reB'
first second
third
fourth
This is our desired output.
Let's say we get some temperature data and we want to compute some statistics around it:
$ ni i[LA 75] i[LA 80] i[LA 79] i[CHI 62] i[CHI 27] i[CHI 88] \
p'my $city = a; my @temps = b_ rea; r $city, mean(@temps), std(@temps)'
LA 78 2.16024689946929
CHI 59 24.9933324442073
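Note that the std here is the population standard deviation, not the sample one. The equivalent computation with Python's statistics module:

```python
from statistics import mean, pstdev

la_temps = [75, 80, 79]
chi_temps = [62, 27, 88]
# pstdev (population std) matches ni's std output above; stdev (sample) would not.
la_stats = (mean(la_temps), pstdev(la_temps))
chi_stats = (mean(chi_temps), pstdev(chi_temps))
```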
To present this data, we'll want to add a header of City, Average Temperature, Temperature Standard Deviation
. To do this, we can prepend to the stream using the ^
operator.
$ ni i[LA 75] i[LA 80] i[LA 79] i[CHI 62] i[CHI 27] i[CHI 88] \
p'my $city = a; my @temps = b_ rea;
r $city, mean(@temps), std(@temps)' \
^[i[City "Average Temperature" "Temperature Standard Deviation"] ]
City Average Temperature Temperature Standard Deviation
LA 78 2.16024689946929
CHI 59 24.9933324442073
If we wanted to add data for another city, we could do it using append with the +
operator.
$ ni i[LA 75] i[LA 80] i[LA 79] i[CHI 62] i[CHI 27] i[CHI 88] \
p'my $city = a; my @temps = b_ rea;
r $city, mean(@temps), std(@temps)' \
^[i[City "Average Temperature" "Temperature Standard Deviation"] ] +[ i[ABQ 67 10] ]
City Average Temperature Temperature Standard Deviation
LA 78 2.16024689946929
CHI 59 24.9933324442073
ABQ 67 10
Now, let's display this data nicely; the mdtable
operation does this automatically:
$ ni i[LA 75] i[LA 80] i[LA 79] i[CHI 62] i[CHI 27] i[CHI 88] \
p'my $city = a; my @temps = b_ rea;
r $city, mean(@temps), std(@temps)' \
^[i[City "Average Temperature" "Temperature Standard Deviation"] ] +[ i[ABQ 67 10] ] mdtable
|City|Average Temperature|Temperature Standard Deviation|
|:----:|:----:|:----:|
|LA|78|2.16024689946929|
|CHI|59|24.9933324442073|
|ABQ|67|10|
In Markdown, the above output turns into a nicely-formatted table.
| City | Average Temperature | Temperature Standard Deviation |
|---|---|---|
| LA | 78 | 2.16024689946929 |
| CHI | 59 | 24.9933324442073 |
| ABQ | 67 | 10 |
$ ni ::data[n5] 1p'a(data)' and $ ni ::data[n5] 1p'a data' will raise syntax errors, since a/a() are not prepared to deal with multiline data in the closure. $ ni ::data[n5] 1p'a_ data' works, because a_ operates on each line.
$ ni ::data[n5] 1p'a_ data'
1
2
3
4
5
These are useful for collecting data with an unknown shape.
$ ni i[m 1 x] i[m 2 y s t] \
i[m 3 yo] i[n 5 who] i[n 6 let the dogs] p'r b__ reA'
1 x 2 y s t 3 yo
5 who 6 let the dogs
In particular, these operations can be used in conjunction with Hadoop streaming to join together data that have been computed separately.
These functions pair well with the multiline reducers introduced in the previous section.
Recall that text is numbers (and numbers are text) in Perl. Thus, there are two sets of methods for finding the minimum and maximum, depending on whether you want it in a numeric context (min
, max
) or in a string context (minstr
, maxstr
).
$ ni i[1 2 3] p'r min F_; r max F_'
1
3
$ ni i[c a b] p'r minstr F_; r maxstr F_'
a
c
Be careful using these functions with both numeric and string arguments in the same line:
# this code emits warnings and has unexpected output
$ ni i[1 2 3 a b c] p'r min F_; r max F_; r minstr F_; r maxstr F_'
c
3
1
c
These functions provide the sum, product, and average of a numeric array:
$ ni i[2 3 4] p'r sum(F_), prod(F_), mean(F_)'
9 24 3
uniq
returns an array of the unique elements of an array.
$ ni i[a c b c c a] p'my @uniqs = uniq F_; r sort @uniqs'
a b c
freqs
returns a reference to a hash that contains the count of each unique element in the input array.
$ ni i[a c b c c a] p'my %h = %{freqs F_};
r($_, $h{$_}) for sort keys %h'
a 2
b 1
c 3
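uniq and freqs map directly onto Python's set and collections.Counter, sketched here for comparison:

```python
from collections import Counter

xs = ["a", "c", "b", "c", "c", "a"]
uniqs = sorted(set(xs))   # uniq, then sorted
counts = Counter(xs)      # freqs analog: element -> count
```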
These functions take two arguments: the first is a code block or a function, and the second is an array.
any
and all
return the logical OR and logical AND, respectively, of the code block mapped over the array.
$ ni i[2 3 4] p'r any {$_ > 3} F_; r all {$_ > 3} F_'
1
0
argmax
returns the value in the array that achieves the maximum of the block passed in, and argmin
returns the value in the array that achieves the minimum of the function.
$ ni i[aa bbb c] p'r argmax {length} F_; r argmin {length} F_'
bbb
c
If there are any ties, the first element is picked.
$ ni i[aa bbb c ddd e] p'r argmax {length} F_; r argmin {length} F_'
bbb
c
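Python's built-in max/min with a key function behave the same way, including picking the first element on ties:

```python
xs = ["aa", "bbb", "c", "ddd", "e"]
amax = max(xs, key=len)   # first longest element wins the tie with "ddd"
amin = min(xs, key=len)   # first shortest element wins the tie with "e"
```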
Given a list of keys @ks
and a list of values @vs
, how would you construct a hash where $ks[$i]
is associated with $vs[$i]
for every index $i
? One way is to write a loop; zip
short-circuits this loop.
$ ni 1p'my @ks = ("u", "v"); my @vs = (1, 10); r zip \@ks, \@vs'
u 1 v 10
This output can be cast as a hash:
$ ni 1p'my @ks = ("u", "v");
my @vs = (1, 10); my %h = zip \@ks, \@vs; r $h{"v"}'
10
It is also possible to zip
more than 2 arrays:
$ ni 1p'my @ks = ("u", "v");
my @vs = (1, 10); my @ws =("foo", "bar");
r zip \@ks, \@vs, \@ws'
u 1 foo v 10 bar
If the arrays are of unequal length, the data will be output only up to the length of the shortest array.
$ ni 1p'my @ks = ("u", "v");
my @vs = (1, 10, "nope", 100, 1000,); r zip \@ks, \@vs'
u 1 v 10
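Python's built-in zip and dict give the same construction, including truncation to the shortest input:

```python
ks = ["u", "v"]
vs = [1, 10, "nope", 100, 1000]
pairs = list(zip(ks, vs))   # truncates to the shorter list
h = dict(zip(ks, vs))       # key/value pairing, like Perl's %h = zip ...
```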
To generate examples for our buffered readahead, we'll take a short detour through the builtin ni
operation cart
.
$ ni 1p'cart [10, 20], [1, 2, 3]'
10 1
10 2
10 3
20 1
20 2
20 3
The output of cart
will have the first column varying the least, the second column varying the second least, etc. so that the value of the last column will change for every row if its values are all distinct.
Note that cart
takes array references (in square brackets), and returns array references. ni
will interpret these array references as rows, and expand them. Thus r
, when applied to cart
, will likely not produce your desired results.
$ ni 1p'r cart [1], ["a", "b", "c"]'
ARRAY(0x7ff2bb109568) ARRAY(0x7ff2ba8c93c8) ARRAY(0x7ff2bb109b80)
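itertools.product follows the same ordering convention: the first column varies least, the last varies fastest.

```python
from itertools import product

rows = list(product([10, 20], [1, 2, 3]))   # cart analog
```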
take
and drop
get (or get rid of) the specified number of elements from an array.
$ ni i[1 2 3 4 5 6 7] p'r take 3, F_; r drop 3, F_'
1 2 3
4 5 6 7
take_while
and drop_while
take an additional block; they take
or drop
while the function is true.
$ ni i[1 2 3 4 5 6 7] p'r take_while {$_ < 3} F_;
r drop_while {$_ < 3} F_'
1 2
3 4 5 6 7
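The same four operations in Python, using slicing and itertools:

```python
from itertools import takewhile, dropwhile

xs = [1, 2, 3, 4, 5, 6, 7]
head = xs[:3]                                  # take 3
tail = xs[3:]                                  # drop 3
tw = list(takewhile(lambda x: x < 3, xs))      # take_while
dw = list(dropwhile(lambda x: x < 3, xs))      # drop_while
```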
Some day, ni
will get lazy sequences, and this will be useful. Until that day...
Hash constructors are useful for filtering or adding data. There are many hash constructor functions, all with the same syntax. For example, my %h = ab_ @lines
returns a hash where the keys are taken from column a
and the values are taken from column b
. my %h = fd_ @lines
will return a hash whose keys are taken from column f and whose values are taken from column d.
$ ni i[a 1] i[b 2] i[foo bar] \
p'my @lines = rw {1};
my %h = ab_ @lines; my @sorted_keys = sort keys %h;
r @sorted_keys; r map {$h{$_}} @sorted_keys'
a b foo
1 2 bar
There's a nice piece of Perl syntactic sugar for the last return statement that gets all of the values out of a hash associated with the elements of an array:
$ ni i[a 1] i[b 2] i[foo bar] \
p'my @lines = rw {1};
my %h = ab_ @lines; my @sorted_keys = sort keys %h;
r @sorted_keys; r @h{@sorted_keys}'
a b foo
1 2 bar
Hashes are often combined with closures to do filtering operations:
$ ni ::passwords[idragon i12345] i[try this] i[123 fails] \
i[dragon does work] i[12345 also works] \
i[other ones] i[also fail] \
rp'^{%h = ab_ passwords} exists($h{+a})'
dragon does work
12345 also works
Hashes constructed from closures can get very large; if you have more than 1 million keys in the hash, the hash will likely grow to several gigabytes in size. If you need to use a hash with millions of keys, you'll often be better off using a Bloom filter, described in Chapter 6.
This is useful for doing reduction on data you've already reduced; for example, you've counted the number of neighborhoods in each city in each country and now want to count the number of neighborhoods in each country.
$ ni i[x k 3] i[x j 2] i[y m 4] i[y p 8] i[y n 1] p'r acS reA'
x 5
y 13
If you have ragged data, where a value may not exist for a particular column, a convenience method SNN
gives you the non-null values. See the next section for the convenience method kbv_dsc
$ ni i[y m 4 foo] i[y p 8] i[y n 1 bar] \
p'%h = dcSNN reA;
@sorted_keys = kbv_dsc %h;
r($_, $h{$_}) for @sorted_keys'
foo 4
bar 1
kbv_asc and kbv_dsc are syntactic sugar for Perl's sort function applied to the keys of a hash, ordered by their values (ascending and descending, respectively).
$ ni i[x k 3] i[x j 2] i[y m 4] i[y p 8] i[y n 1] i[z u 0] \
p'r acS reA' p'r kbv_dsc(ab_ rl(3))'
y x z
$ ni i[x k 3] i[x j 2] i[y m 4] i[y p 8] i[y n 1] i[z u 0] \
p'r acS reA' p'r kbv_asc(ab_ rl(3))'
z x y
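Sorting keys by value is a one-liner in Python too:

```python
h = {"x": 5, "y": 13, "z": 0}
kbv_dsc = sorted(h, key=h.get, reverse=True)   # keys by value, descending
kbv_asc = sorted(h, key=h.get)                 # keys by value, ascending
```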
Escaping quotes is a pain; ni
provides squote
and dquote
to single- and double-quote strings; squo
and dquo
provide literal single and double quotes.
$ ni 1p'r squote a, dquote a, squo, dquo'
'1' "1" ' "
$ ni i[how are you] p'r jjc(F_), jjp(F_), jju(F_), jjw(F_)'
how,are,you how|are|you how_are_you how are you
$ ni i[how are you] p'r jjcc(F_), jjpp(F_), jjuu(F_), jjww(F_)'
how,,are,,you how||are||you how__are__you how are you
$ ni i[how are you] p'r jjc(F_), jjp(F_), jju(F_), jjw(F_)' p'r ssc a; r ssp b; r ssu c; r ssw d;'
how are you
how are you
how are you
how are you
$ ni i[how are you] p'r jjcc(F_), jjpp(F_), jjuu(F_), jjww(F_)' p'r sscc a; r sspp b; r ssuu c; r ssww d;'
how are you
how are you
how are you
how are you
ss
and jj
methods are also implemented for:
- forward slash: ssf, ssff, jjf, jjff
- single quote: ssq
- double quote: ssQ
- tab: sst
- newline: ssn
- colon: ssC
- dot (i.e. period, '.'): ssd
- hyphen ('-'): ssh
$ ni i[how are you] p'r jjCC(F_), jjff(F_), jjdd(F_), jjhh(F_)'
how::are::you how//are//you how..are..you how--are--you
$ ni ifoobar p'r startswith a, "fo"; r endswith a, "obar";'
1
1
ni
tends to work better with globs (*
) than with lists of file paths; this allows restricting the set of file paths coming from MapReduce jobs.
$ ni ihdfst:///user/bilow/tmp/test_ni_job/part-* \
ihdfst:///user/bilow/tmp/test_ni_job p'r restrict_hdfs_path a, 100'
hdfst:///user/bilow/tmp/test_ni_job/part-00*
hdfst:///user/bilow/tmp/test_ni_job/part-00*
wf $filename, @lines
: write @lines
to a file called $filename
$ ni i[file1 1] i[file1 2] i[file2 yo] p'wf a, b_ reA'
file1
file2
$ ni file1
1
2
$ ni file2
yo
$ ni i[file3 1] i[file3 2] i[file4 yo] i[file3 hi] p'af a, b_ reA'
file3
file4
file3
$ ni file3
1
2
hi
$ ni file4
yo
af
is not highly performant; in general, if you have to write many lines to a file, you should process and sort the data in such a way that all lines can be written to the same file at once with wf
. wf
will blow your files away though, so be careful.
We'll spend the rest of this chapter discussing more exotic types of I/O, including JSON
, zip
, xlsx
, and various web resources.
ni
implements a very fast JSON parser that is great at pulling out string and numeric fields. As of this writing (2018-07-30), the JSON destructurer does not support list-based fields in JSON.
$ ni i'{"a": 1, "foo":"bar", "c":3.14159}' D:foo,:c,:a
bar 3.14159 1
json_decode
will fully decode JSON into a Perl object (in this case, a hash reference). Decoding, as you might expect, is much slower than destructuring.
$ ni i'{"a": 1, "foo":"bar", "c":3.14159}' p'my $h = json_decode a; r $h->{"a"}, $h->{"foo"}, $h->{"c"};'
1 bar 3.14159
The syntax of the row to JSON instructions is similar to hash construction syntax in Ruby.
$ ni i'{"a": 1, "foo":"bar", "c":3.14159}' D:foo,:c,:a p'json_encode {foo => a, c => b, a => c, treasure=>"trove"}'
{"a":1,"c":3.14159,"foo":"bar","treasure":"trove"}
The output JSON will have its keys in sorted order.
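For comparison, Python's json module mirrors both directions; with sort_keys and compact separators, json.dumps produces the same bytes as the json_encode example above:

```python
import json

h = json.loads('{"a": 1, "foo": "bar", "c": 3.14159}')   # json_decode analog
out = json.dumps({"foo": h["foo"], "c": h["c"], "a": h["a"], "treasure": "trove"},
                 sort_keys=True, separators=(",", ":"))   # json_encode analog
```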
Some aspects of JSON parsing are not quite there yet, so you may want to use (also very fast) raw Perl to destructure your JSONs.
See core/pl/json.pm:

- get_scalar
- get_array
- get_flat_hash
- To list all the files in a zip archive: $ ni zip:///path/to/file.zip
- To list the files in a tar or tar.gz archive: $ ni tar:///path/to/file.tar
- To list the sheets in an Excel workbook: $ ni xlsx:///path/to/file.xlsx
- To get the data from a single file in a zip archive: $ ni zipentry:///path/to/file.zip:<sub_file_name>
- To get the data from a single file in a tar or tar.gz archive: $ ni tarentry:///path/to/file.tar:<sub_file_name>
- To get the data from a single sheet in an Excel workbook as TSV: $ ni xlsxsheet:///path/to/file.xlsx:<sheet-number>
- Read from web resources:
  - $ ni http://path/to/file.txt
  - $ ni https://path/to/file.txt
  - $ ni sftp://path/to/file.txt
  - $ ni s3cmd://path/to/file.txt
- Write to web resources (note this does not work for sftp):
  - $ ni data \>http://path/to/file.txt
  - $ ni data \>https://path/to/file.txt
  - $ ni data \>s3cmd://path/to/file.txt
If you've made it this far in the tutorial, you now have enough tools to be extremely productive in ni
. If you're ready to get off the crazy ride of this tutorial and get to work, here's a great point to stop. Before you go, though, it will help to take a few minutes to think through ni
's philosophy and style, and how those two intersect.
Many data science projects start with a request that you have never seen before; in this case, it's often easier to start from very little and be able to build rapidly, than it is to have a huge personal library of Python scripts and Jupyter notebooks that you can bend to fit the task at hand.
ni
is great at generating throwaway code, and this code can be made production-ready through ni
scripting, discussed in the next chapter.
Just because you can read a Python script and understand what it does at a basic level does not mean you can code in Python, and Python can trick very intelligent people into thinking they know what they're doing even when they don't. The opposite is true of ni
. It will be inherently obvious when you don't know something in ni
, because if you don't know something, it'll be likely that the part of the spell you're reading won't make sense.
This is a gruff approach to programming, but it's not unfriendly. ni
doesn't allow you to just get by--your only option is mastering ni
one piece at a time.
ni
is a domain-specific language; its domain is processing single lines and chunks of data that fit in memory.
- Because of this philosophy,
ni
is fantastic for data munging and cleaning. - Because of this philosophy, large-scale sorting is not a
ni
-ic operation, while gzip compression is. - Because of this philosophy,
ni
relies heavily on Hadoop for big data processing. Without Hadoop, most sufficiently complicated operations become infeasible from a runtime perspective once the amount of data exceeds a couple of gigabytes uncompressed.
Some jobs that are difficult for ni
, and some ways to resolve them:
- Sorting
- Challenge: Requires buffering of entire stream (possibly to disk, which is slow)
- Solution: Hadoop Streaming will do much of the heavy lifting for you in any sort, and will distribute the computation so it's easy.
- Matrix Multiplication
- Challenge: Difficult to implement in a streaming context
- Solution: Numpy operations
- SQL Joins
- Challenge: SQL joins can take more than one line
- Solution:
- Small Data: There are several options here. You can use N'...' and do the join in numpy, you can use N'...' with import pandas as pd to do the join in pandas, or you can pipe data in and out of join -t $'\t'
- Large Data: HDFS Joins
- Iterating a ni process over directories where the directory provides contextual information about its contents.
  - Challenge: This is something ni can probably do, but I'm not sure how to do it offhand.
  - Solution: Write out the ni spell for the critical part, embed the spell in a script written in Ruby/Python, and call it using bash -c.