-
Notifications
You must be signed in to change notification settings - Fork 6
/
Copy pathdata_wrangling.nim
454 lines (382 loc) · 19.4 KB
/
data_wrangling.nim
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
# cannot import `Value` because it clashes with `mustache.values.Value`. This is fixed for
# Nim version >= 1.5 (fully qualified type for formulas used) but important on < 1.5 (cannot
# fully qualify types there)
import nimib except Value
import nimibook
import datamancer
nbInit(theme = useNimibook)
nbText: """
# Data wrangling using the `DataFrame` from [Datamancer](https://github.com/SciNim/Datamancer)
The third major data type often encountered is a `DataFrame`.
Data frames can be thought of as multiple, named tensors of possibly different types
in one object. A data frame library then is supposed to make working with such data
as convenient and powerful as possible.
In the specific case of Datamancer, the data structure is essentially an
`OrderedTable[string, Column]`, where `Column` is a variant object storing one
of 5 different `Tensor[T]` types.
In order to use Datamancer, you must first import it. In addition to that, in this tutorial
we will also import the [Arraymancer](https://github.com/mratsim/Arraymancer) tensor library,
which we will use to demonstrate that you can create dataframes from Arraymancer tensors.
"""
nbCode:
import datamancer
import arraymancer
nbText: """
## Construction of a `DataFrame`
A `DataFrame` from the Datamancer library can be constructed in two different ways. Either
from an input CSV file or from existing sequences or tensors.
Construction from a CSV file is performed using the `readCsv` procedure. It provides multiple
different options (different separators, skipping lines, header symbols, ...), but for a
regular comma separated value file, the defaults are fine. For example:
```nim
let df1 = readCsv("foo.csv")
echo df1
```
Alternatively, if one already has a mix of sequences and tensors of the same length:
"""
nbCode:
let s1 = [1, 2, 3]
let s2 = @["hello", "foo", "bar"]
let s3 = @[1.5, 2.5, 3.5].toTensor
let df2 = toDf(s1, s2, s3)
echo df2
echo "Column names: ", df2.getKeys() ## getKeys only returns the column names
nbText: """
which creates a data frame with three columns named `"s1", "s2", "s3"`. We can see that
mixing different input types is not a problem. The supported types are
- `float`
- `int`
- `string`
- `bool`
and a mix of them in one column.
Printing a data frame by default prints the first 20 rows. This can be adjusted by calling
the `pretty` procedure manually and handing the number of rows (-1 for all).
In addition one can always view a data frame in the browser by doing `showBrowser(df)` where
`df` is the data frame to view.
If one wishes to name the columns differently from construction (they can be renamed later
as well), it is done by:
"""
nbCode:
let df3 = toDf({"Id" : s1, "Word" : s2, "Number" : s3})
echo df3
nbText: """
Finally, one can also create a `DataFrame` starting from an empty object and
assigning sequences, tensors or scalar values manually:
"""
nbCodeInBlock:
var df = newDataFrame()
df["x"] = @[1, 2, 3] ## assign a sequence. This sets the `DataFrame` length to 3
df["y"] = @[4.0, 5.0, 6.0].toTensor ## assign a tensor. Input now `must` match length 3
try:
df["z"] = @[5, 6] ## raises
except ValueError: discard ## type of exception might change in the future
df["z"] = constantColumn(1, df.len) ## assign a constant column of integers.
nbText: """
## Accessing data underlying a column
The data stored in a column of a data frame can always be accessed easily. Because the
data is stored in a variant object, the user needs to supply the data type to use to
read the data as. Nim does *not* allow return type overloading, which means we cannot
use the runtime information about the types to return the "correct" tensor. All we can
make sure is that accessing the data with the *correct* type is a no-op.
This has the downside that an invalid type will produce a runtime error. On the upside
it allows us to perform type conversions directly, for instance reading an integer column
as floats or any column as strings.
The syntax is as follows:
"""
nbCodeInBlock:
let df = toDf({"x" : @[1, 2, 3], "y" : @[4.0, 5.0, 6.0]})
let t1: Tensor[int] = df["x", int] ## this is a no-op
let t2: Tensor[float] = df["x", float] ## converts integers to floats
let t3: Tensor[float] = df["y", float] ## also a no-op
let t4: Tensor[string] = df["x", string] ## convert to string
try:
let t5: Tensor[bool] = df["x", bool] ## would produce a runtime error
except ValueError: discard ## type of exception might be changed in the future
nbText: """
where we indicate the types explicitly on the left hand side for clarity.
This means we can in principle always access individual elements of a data frame column
by getting the tensor and accessing elements from it. Of course this has some overhead,
but due to reference semantics it is relatively cheap (no data is copied, unless type
conversions need to be performed).
## Computing single column aggregations
As we saw in the previous section, accessing a tensor of a column is cheap. We can
use that to perform aggregations on full columns:
"""
nbCodeInBlock:
let df = toDf({"x" : @[1, 2, 3], "y" : @[4.0, 5.0, 6.0]})
echo df["x", int].sum
echo df["y", float].mean
nbText: """
and in that sense any operation acting on tensors can be used.
## Data frame operations
In the more general case (the reason one uses a data frame in the first place) we
don't want to only consider a single column.
Many different operations are supported, but can be grouped into a few general procedures.
Some of the procedures of Datamancer take so called `FormulaNodes`. They are essentially
a domain specific language to succinctly express operations on data frame columns
without the need to fully refer to them. Their basic construction and usage should become
clear in the code below. The Datamancer documentation contains a much deeper introduction
into the specifics here:
[Formula introduction](https://scinim.github.io/Datamancer/datamancer.html#formulas)
### `select` - Selecting a subset of columns
If we have a data frame with multiple columns we may want to keep only
a subset of these going forward. This can be achieved using `select`:
"""
nbCodeInBlock:
var df = newDataFrame()
for i in 0 ..< 100:
df["x" & $i] = @[1 + i, 2 + i, 3 + i]
echo df.select("x1", "x50", "x99")
nbText: """
which drops every column not selected.
The inverse is also possible using `drop`:
"""
nbCodeInBlock:
let df = toDf({"x" : @[1, 2, 3], "y" : @[4.0, 5.0, 6.0], "z" : @["a", "b", "c"]})
echo df.drop("x")
nbText: """
### `rename` - Renaming a column
`rename`, as the name implies, is used to rename columns. Usage is rather simple. We'll
get our first glance at the `f{}` macro to generate a `FormulaNode` here:
"""
nbCodeInBlock:
let df = toDf({"x" : @[1, 2, 3], "y" : @[4.0, 5.0, 6.0]})
echo df.rename(f{"foo" <- "x"})
nbText: """
So we can see that we simply assign `<-` the old name "x" to the new name "foo".
### `arrange` - Sorting a data frame
Often we wish to sort a data frame by one or more columns. This is done using `arrange`.
It can take one or more columns to sort by, where for multiple columns the order
of the inputs decides the precedence of what to sort by first, the later columns only
used to break ties between the former.
The sort order is handled in the same way as in Nim's standard library, i.e. using
an `order` argument that takes either `SortOrder.Ascending` or `SortOrder.Descending`.
The default order is ascending order.
"""
nbCodeInBlock:
let df = toDf({ "x" : @[4, 2, 7, 4], "y" : @[2.3, 7.1, 3.3, 1.0],
"z" : @["b", "c", "d", "a"]})
echo df.arrange("x") ## sort by `x` in ascending order (default)
echo df.arrange("x", order = SortOrder.Descending) ## sort in descending order
echo df.arrange(["x", "z"]) ## sort by two columns, first `x` then `z` to break ties
nbText: """
### `unique` - Removing duplicate rows
Another useful operation is removal of duplicate entries. `unique` is the procedure
to use. If no argument is given uniqueness is determined based on *all* existing
columns. This is not always the most desired option of course, which is why `unique`
accepts a variable number of columns. Then only uniqueness among these columns is
considered.
"""
nbCodeInBlock:
let df = toDf({ "x" : @[1, 2, 2, 2, 4], "y" : @[5.0, 6.0, 7.0, 8.0, 9.0],
"z" : @["a", "b", "b", "d", "e"]})
echo df.unique() ## consider uniqueness of all columns, nothing removed
echo df.unique("x") ## only consider `x`, only keeps keeps 1st, 2nd, last row
echo df.unique(["x", "z"]) ## considers `x` and `z`, one more unique (4th row)
nbText: """
### `mutate` - Creating new or modifying existing columns
`mutate` is the procedure to use to add new columns to a data frame or modify
existing ones. For this procedure we need to hand formulas using the `f{}` macro
again. Here it is advisable to name the formulas. Instead of the above assignment
operator `<-` we now use the "x depends on y" operator `~`.
Further, to refer to a column in the computation we perform we will use accented
quotes. This is all the complexity of that macro we will discuss in this introduction.
Let's compute the sum of two columns to get a feel:
"""
nbCodeInBlock:
let df = toDf({ "x" : @[1, 2, 3], "y" : @[10, 11, 12] })
echo df.mutate(f{"x+y" ~ `x` + `y`})
nbText: """
Of course we can use constants and local Nim symbols as well:
"""
nbCodeInBlock:
let df = toDf({ "x" : @[1, 2, 3]})
echo df.mutate(f{"x+5" ~ `x` + 5 })
let y = 2.0
echo df.mutate(f{"x + local y" ~ `x` + y})
nbText: """
Note: There is a slight subtlety at play here. If you look closely at the output of
these two `mutate` commands you see that in the first case the resulting column is
of type `int`, whereas in the second case it's `float`. That is because the type
of the column is deduced based on the types in the rest of the formula. `5` is an
`int` so `x` is read as integers in the first case, whereas `y` is a `float` and so
`x` is read as a `float`. See the Datamancer documentation on details and how to
specify the types manually.
And as stated we can also overwrite columns:
"""
nbCodeInBlock:
let df = toDf({ "x" : @[1, 2, 3] })
echo df.mutate(f{"x" ~ `x` + `x`})
nbText: """
Under the hood these formulas are converted into a closure that takes a data frame
as an input. The column references are extracted and converted into a preamble
that reads the corresponding tensors. Then we run over the relevant tensors and
perform the described operation for each element. The result is assigned to
a resulting tensor, which is assigned as the new column.
The only restriction on the body of the formula is that it's a valid Nim expression
(if one mentally replaces column references by tensor elements) that returns a
value of a valid data type for a data frame.
If one wishes the same behavior as `mutate` but does not require the columns anymore
that are not explicitly created / modified using a formula, there is `transmute` for
this purpose. Otherwise it is equivalent to `mutate`.
### `filter` - Removing rows based on a predicate
These mentioned formulas can of course also return boolean values. In combination
with the `filter` procedure this allows us to remove rows of a data frame that
fail to pass a condition (or a "predicate").
"""
nbCodeInBlock:
let df = toDf({ "x" : @[1, 2, 3, 4, 5], "y" : @["a", "b", "c", "d", "e"] })
echo df.filter(f{ `x` < 3 or `y` == "e" })
nbText: """
### `summarize` - Computing aggregations on a full data frame
The approach described in "Computing single column aggregations" can be useful for
simple single column operations, but does not scale well. That's what `summarize` is
for. Here we use the last operator used in the `f{}` macro, namely the reduction
`<<` operator:
"""
nbCodeInBlock:
let df = toDf({ "x" : @[1, 2, 3, 4, 5], "y" : @[5, 10, 15, 20, 25] })
echo df.summarize(f{float: mean(`x`) }) ## compute mean, auto creates a column name
echo df.summarize(f{float: "mean(x)" << mean(`x`) }) ## same but with a custom name
echo df.summarize(f{"mean(x)+sum(y)" << mean(`x`) + sum(`y`) })
nbText: """
Keen eyes will notice the `float:` at the beginning of the first two examples. This is
a "type hint" for the formula, because the symbol "mean" is overloaded in Nim. But not
by a few distinct procedures, but generically. At this moment there are no heuristics
involved to choose one type over another in a generic case. Therefore, we don't know
what type `x` should be read as. So we overwrite the input type manually and give the
macro a hint.
If we leave out the type information you will be greeted with a message of the type
information found of `mean` and to consider giving such a type hint.
The situation is slightly different for the last case, in which an addition is involved.
Due to some heuristic rules involving the most basic operators (maths and boolean) we can
determine here that the input is probably supposed to be float.
### `group_by`
`summarize` and the other procedures can be spiced up if used in combination with
`group_by`.
`group_by` by itself doesn't perform any operations. It simply returns a new data frame
with the exact same data that is now "grouped" by one or more columns. These columns should
be columns containing *discrete* data. This grouping can be used (manually or indirectly)
via the `groups` iterator. It yields all "sub data frames" contained in the grouped data
frame. These sub data frames are those of duplicate entries in the columns that we have
grouped by. It essentially yields everything as a sub data frame that would be reduced
to a single row if using `unique` on the same columns as grouped by.
This should become clearer with an example:
"""
nbCodeInBlock:
let df = toDf({ "Class" : @["A", "C", "B", "B", "A", "C", "C"],
"Num" : @[1, 5, 3, 4, 8, 7, 2] })
.group_by("Class")
for t, subDf in groups(df):
echo "Sub data frame: ", t
echo subDf
nbText: """
We can see we have 3 sub data frames. One for each discrete value found in column `Class`.
The actually interesting applications of `groub_by` though is its combination with one
of the other procedures shown above, in particular `summarize`, `filter` and `mutate`.
For a grouped data frame these operations will then performed *group wise*. Operations
that only use information of a single row are unaffected by this. But any formula that
includes a reference to a full column (`mean, sum, ...`) will compute this value per
group.
A few examples:
- `summarize`
"""
nbCodeInBlock:
let df = toDf({ "Class" : @["A", "C", "B", "B", "A", "C", "C", "A", "B"],
"Num" : @[1, 5, 3, 4, 8, 7, 2, 0, 0] })
echo df.group_by("Class").summarize(f{int: "sum(Num)" << sum(`Num`)})
nbText: """
We can see this computes the sum for each class now.
- `filter`:
"""
nbCodeInBlock:
let df = toDf({ "Class" : @["A", "C", "B", "B", "A", "C", "C", "A", "B"],
"Num" : @[1, 5, 3, 4, 8, 7, 2, 0, 0] })
echo df.group_by("Class").filter(f{ sum(`Num`) <= 9 })
nbText: """
and again, the filtering is done per group. In this sense a filtering operation
that uses a reducing formula as input would usually not make too much sense anyway.
- `mutate`:
"""
nbCodeInBlock:
let df = toDf({ "Class" : @["A", "C", "B", "B", "A", "C", "C", "A", "B"],
"Num" : @[1, 5, 3, 4, 8, 7, 2, 0, 0] })
echo df.group_by("Class").mutate(f{"Num - mean" ~ `Num` - mean(`Num`)})
nbText: """
where we subtract the mean (of each class!) from each observation.
If one uses multiple columns to group by, we get instead the sub data frame corresponding
to each unique combination of discrete values. Feel free to play around and try out
such an example!
### `gather` - Converting a wide format data frame to long format
As one of the last things to cover, we will quickly talk about data frames in wide and
long format. In a way the example data frame above with a column "Class" and a column
"Num" can be considered a data frame in "long" format. Long format in the sense that
we have one discrete column "Class" that maps to different "Num" values. Because the
column "Class" contains *discrete* values, We can imagine "transposing" the data frame
to columns "A", "B", "C" instead with the values for each of these *groups* as the values
in the corresponding columns. Let's look at:
- this data frame
- the output of grouping that data frame by "Class"
- the same data frame in wide format
for clarity:
"""
nbCodeInBlock:
let dfLong = toDf({ "Class" : @["A", "C", "B", "B", "A", "C", "C", "A", "B"],
"Num" : @[1, 5, 3, 4, 8, 7, 2, 0, 0] })
echo "Long format:\n", dfLong
echo "----------------------------------------"
echo "Grouping by `Class`:"
for _, subDf in groups(dfLong.group_by("Class")):
echo subDf
echo "----------------------------------------"
let dfWide = toDf({"A" : [1, 8, 0], "B" : [3, 4, 0], "C" : [5, 7, 2]})
echo "Wide format:\n", dfWide
nbText: """
As we can see, the difference between wide and long format is the way the `groub_by` results
are "assembled". As different columns for each group (wide format) or as two (key, value)
columns (long format).
The conversion from wide -> long format is always possible. But the the mapping of long -> wide
format requires there to be the the same number of entries for each class. If that condition
is not satisfied, there will be missing values in the columns of the separate classes.
Depending on circumstances one might have input data in either order. However, in particular
for plotting purposes the long format is often more convenient as it allows to classify the
discrete classes using different colors, shapes etc. automatically.
Therefore, there is the `gather` procedure to convert a wide format data frame into a
long format one. It takes the columns to be "gathered", the name of the column containing
the "keys" (the column from which a value came) and a name for the column of the "values"
that were "gathered". We can use it to recover the ("Class", "Num") data frame from
the last one:
"""
nbCodeInBlock:
let df = toDf({"A" : [1, 8, 0], "B" : [3, 4, 0], "C" : [5, 7, 2]})
echo df.gather(df.getKeys(), ## get all keys to gather
key = "Class", ## the name of the `key` column
value = "Num")
nbText: """
which is exactly the same data frame as in the examples before.
(Note: the inverse procedure to convert a long format data frame back into wide format
is currently still missing. It will be added soon)
### `innerJoin` - joining two data frames by a common column
As the last common example of data frame operations, we shall consider joining two
data frames by a common column.
"""
nbCodeInBlock:
let df1 = toDf({ "Class" : @["A", "B", "C", "D", "E"],
"Num" : @[1, 5, 3, 4, 6] })
let df2 = toDf({ "Class" : ["E", "B", "A", "D", "C"],
"Ids" : @[123, 124, 125, 126, 127] })
echo innerJoin(df1, df2, by = "Class")
nbText: """
where we joined two data frames by the "Class" column, resulting in a data frame with
3 columns. The matching rows for the classes were put together aligning corresponding
"Num" and "Ids" values.
Of course joining two data frames is only a sensible option for a column containing
discrete data so that equal elements in that column for both input data frames can
be found.
This already covers the *majority* of the API of Datamancer. There are more procedures,
but the presented ones should be all that is needed in the vast majority of use cases.
Check out the [Datamancer documentation](https://scinim.github.io/Datamancer/datamancer.html)
for a full picture and in particular for a better and more thorough introduction to the
formula syntax.
"""
nbSave