Merge pull request #1008 from JuliaStats/nl/nullable
Port to NullableArrays and CategoricalArrays
quinnj authored and nalimilan committed Jul 8, 2017
2 parents 2931693 + bba462f commit c03d516
Showing 52 changed files with 1,287 additions and 1,222 deletions.
2 changes: 0 additions & 2 deletions .travis.yml
@@ -1,7 +1,6 @@

language: julia
julia:
- 0.4
- 0.5
- nightly
os:
@@ -17,4 +16,3 @@ script:
after_success:
- julia -e 'cd(Pkg.dir("DataFrames")); Pkg.clone("https://github.com/MichaelHatherly/Documenter.jl"); include(joinpath("docs", "make.jl"))'
- julia -e 'cd(Pkg.dir("DataFrames")); Pkg.add("Coverage"); using Coverage; Coveralls.submit(Coveralls.process_folder())'

5 changes: 3 additions & 2 deletions REQUIRE
@@ -1,5 +1,6 @@
julia 0.4
DataArrays 0.3.4
julia 0.5
NullableArrays 0.0.8
CategoricalArrays 0.0.6
StatsBase 0.8.3
GZip
SortingAlgorithms
2 changes: 0 additions & 2 deletions appveyor.yml
@@ -1,7 +1,5 @@
environment:
matrix:
- JULIAVERSION: "julialang/bin/winnt/x86/0.4/julia-0.4-latest-win32.exe"
- JULIAVERSION: "julialang/bin/winnt/x64/0.4/julia-0.4-latest-win64.exe"
- JULIAVERSION: "julialang/bin/winnt/x86/0.5/julia-0.5-latest-win32.exe"
- JULIAVERSION: "julialang/bin/winnt/x64/0.5/julia-0.5-latest-win64.exe"
- JULIAVERSION: "julianightlies/bin/winnt/x86/julia-latest-win32.exe"
37 changes: 0 additions & 37 deletions benchmark/datamatrix.jl

This file was deleted.

56 changes: 0 additions & 56 deletions benchmark/datavector.jl

This file was deleted.

69 changes: 0 additions & 69 deletions benchmark/results.csv

Large diffs are not rendered by default.

4 changes: 1 addition & 3 deletions benchmark/runbenchmarks.jl
@@ -5,9 +5,7 @@
using DataFrames
using Benchmark

benchmarks = ["datavector.jl",
"datamatrix.jl",
"io.jl"]
benchmarks = [ "io.jl"]

# TODO: Print summary to stdout_stream, while printing results
# to file with appends.
2 changes: 1 addition & 1 deletion docs/make.jl
@@ -1,4 +1,4 @@
using Documenter, DataFrames, DataArrays
using Documenter, DataFrames

# Build documentation.
# ====================
1 change: 1 addition & 0 deletions docs/src/lib/utilities.md
@@ -12,6 +12,7 @@
{docs}
eltypes
head
categorical!
complete_cases
complete_cases!
describe
3 changes: 1 addition & 2 deletions docs/src/man/formulas.md
@@ -33,7 +33,7 @@ If you would like to specify both main effects and an interaction term at once,
mm = ModelMatrix(ModelFrame(Z ~ X*Y, df))
```

You can control how categorical variables (e.g., `PooledDataArray` columns) are converted to `ModelMatrix` columns by specifying _contrasts_ when you construct a `ModelFrame`:
You can control how categorical variables (e.g., `CategoricalArray` columns) are converted to `ModelMatrix` columns by specifying _contrasts_ when you construct a `ModelFrame`:

```julia
mm = ModelMatrix(ModelFrame(Z ~ X*Y, df, contrasts = Dict(:X => HelmertCoding())))
@@ -47,4 +47,3 @@ contrasts!(mf, X = HelmertCoding())
```

The construction of model matrices makes it easy to formulate complex statistical models. These are used to good effect by the [GLM package](https://github.com/JuliaStats/GLM.jl).

74 changes: 36 additions & 38 deletions docs/src/man/getting_started.md
@@ -2,75 +2,75 @@

## Installation

The DataFrames package is available through the Julia package system. Throughout the rest of this tutorial, we will assume that you have installed the DataFrames package and have already typed `using DataArrays, DataFrames` to bring all of the relevant variables into your current namespace. In addition, we will make use of the `RDatasets` package, which provides access to hundreds of classical data sets.
The DataFrames package is available through the Julia package system. Throughout the rest of this tutorial, we will assume that you have installed the DataFrames package and have already typed `using NullableArrays, DataFrames` to bring all of the relevant variables into your current namespace. In addition, we will make use of the `RDatasets` package, which provides access to hundreds of classical data sets.

## The `NA` Value
## The `Nullable` Type

To get started, let's examine the `NA` value. Type the following into the REPL:
To get started, let's examine the `Nullable` type. Objects of this type can either hold a value, or represent a missing value (`null`). For example, this is a `Nullable` holding the integer `1`:

```julia
NA
Nullable(1)
```

One of the essential properties of `NA` is that it poisons other items. To see this, try to add something like `1` to `NA`:

And this represents a missing value:
```julia
1 + NA
Nullable()
```

## The `DataArray` Type

Now that we see that `NA` is working, let's insert one into a `DataArray`. We'll create one now using the `@data` macro:
`Nullable` objects support all standard operators, which return another `Nullable`. One of the essential properties of `null` values is that they poison other items. To see this, try to add something like `Nullable(1)` to `Nullable()`:

```julia
dv = @data([NA, 3, 2, 5, 4])
Nullable(1) + Nullable()
```

To see how `NA` poisons even complex calculations, let's try to take the mean of the five numbers stored in `dv`:
Note that operations mixing `Nullable` and scalars (e.g. `1 + Nullable()`) are not supported.
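
The poisoning rule can be spelled out by hand. The following is a minimal sketch, assuming Julia 0.5 or 0.6 (where `Nullable`, `isnull`, and `get` live in Base); `lifted_add` is a hypothetical helper written for illustration, not the operator definition that NullableArrays actually uses:

```julia
# Hypothetical helper (not part of NullableArrays): add two Nullable{Int}
# values, returning a null result as soon as either operand is null.
function lifted_add(x::Nullable{Int}, y::Nullable{Int})
    if isnull(x) || isnull(y)
        return Nullable{Int}()          # missing in, missing out
    end
    return Nullable(get(x) + get(y))    # both present: wrap the plain sum
end

a = lifted_add(Nullable(1), Nullable(2))       # holds 3
b = lifted_add(Nullable(1), Nullable{Int}())   # null
```

Calling `get(a)` returns the wrapped sum, while `isnull(b)` is `true` because one operand was missing.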

## The `NullableArray` Type

`Nullable` objects can be stored in a standard `Array` just like any value:

```julia
mean(dv)
v = Nullable{Int}[1, 3, 4, 5, 4]
```

In many cases we're willing to just ignore `NA` values and remove them from our vector. We can do that using the `dropna` function:
But arrays of `Nullable` are inefficient, in both computation cost and memory use. `NullableArray` objects provide more efficient storage and behave like `Array{Nullable}` objects.

```julia
dropna(dv)
mean(dropna(dv))
nv = NullableArray(Nullable{Int}[Nullable(), 3, 2, 5, 4])
```

Instead of removing `NA` values, you can try to conver the `DataArray` into a normal Julia `Array` using `convert`:
In many cases we're willing to just ignore missing values and remove them from our vector. We can do that using the `dropnull` function:

```julia
convert(Array, dv)
dropnull(nv)
mean(dropnull(nv))
```

This fails in the presence of `NA` values, but will succeed if there are no `NA` values:
Instead of removing `null` values, you can try to convert the `NullableArray` into a normal Julia `Array` using `convert`:

```julia
dv[1] = 3
convert(Array, dv)
convert(Array, nv)
```

In addition to removing `NA` values and hoping they won't occur, you can also replace any `NA` values using the `convert` function, which takes a replacement value as an argument:
This fails in the presence of `null` values, but will succeed if there are no `null` values:

```julia
dv = @data([NA, 3, 2, 5, 4])
mean(convert(Array, dv, 11))
nv[1] = 3
convert(Array, nv)
```

Which strategy for dealing with `NA` values is most appropriate will typically depend on the specific details of your data analysis pathway.

Although the examples above employed only 1D `DataArray` objects, the `DataArray` type defines a completely generic N-dimensional array type. Operations on generic `DataArray` objects work in higher dimensions in the same way that they work on Julia's Base `Array` type:
In addition to removing `null` values and hoping they won't occur, you can also replace any `null` values using the `convert` function, which takes a replacement value as an argument:

```julia
dm = @data([NA 0.0; 0.0 1.0])
dm * dm
nv = NullableArray(Nullable{Int}[Nullable(), 3, 2, 5, 4])
mean(convert(Array, nv, 0))
```

Which strategy for dealing with `null` values is most appropriate will typically depend on the specific details of your data analysis pathway.
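
To make the trade-off concrete, here is a rough sketch of the two strategies written against a plain `Vector{Nullable{Int}}`, using only `isnull` and `get` from Base. It assumes Julia 0.5 or 0.6 and illustrates what dropping and replacing do; it is not a description of how `dropnull` or `convert` are implemented:

```julia
v = Nullable{Int}[Nullable{Int}(), 3, 2, 5, 4]

# Strategy 1: drop the missing entries, then summarize what is left.
present = [get(x) for x in v if !isnull(x)]
mean(present)    # mean of 3, 2, 5, 4

# Strategy 2: substitute a replacement value (here 0) for each missing entry.
filled = [get(x, 0) for x in v]
mean(filled)     # mean of 0, 3, 2, 5, 4
```

The two summaries differ (3.5 versus 2.8), which is why the choice of strategy matters.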

## The `DataFrame` Type

The `DataFrame` type can be used to represent data tables, each column of which is a `DataArray`. You can specify the columns using keyword arguments:
The `DataFrame` type can be used to represent data tables, each column of which is an array (by default, a `NullableArray`). You can specify the columns using keyword arguments:

```julia
df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
@@ -110,22 +110,22 @@ describe(df)
To focus our search, we start looking at just the means and medians of specific columns. In the example below, we use numeric indexing to access the columns of the `DataFrame`:

```julia
mean(df[1])
median(df[1])
mean(dropnull(df[1]))
median(dropnull(df[1]))
```

We could also have used column names to access individual columns:

```julia
mean(df[:A])
median(df[:A])
mean(dropnull(df[:A]))
median(dropnull(df[:A]))
```

We can also apply a function to each column of a `DataFrame` with the `colwise` function. For example:

```julia
df = DataFrame(A = 1:4, B = randn(4))
colwise(cumsum, df)
colwise(c->cumsum(dropnull(c)), df)
```

## Accessing Classic Data Sets
@@ -135,10 +135,8 @@ To see more of the functionality for working with `DataFrame` objects, we need a
For example, we can access Fisher's iris data set using the following functions:

```julia
using RDatasets
iris = dataset("datasets", "iris")
iris = DataFrames.loadiris()
head(iris)
```

In the next section, we'll discuss generic I/O strategies for reading and writing `DataFrame` objects that you can use to import and export your own data files.

2 changes: 1 addition & 1 deletion docs/src/man/joins.md
@@ -15,7 +15,7 @@ full = join(names, jobs, on = :ID)

Output:

| Row | ID | Name | Job |
| Row | ID | Name | Job |
|-----|----|------------|----------|
| 1 | 1 | "John Doe" | "Lawyer" |
| 2 | 1 | "Jane Doe" | "Doctor" |
35 changes: 20 additions & 15 deletions docs/src/man/pooling.md
@@ -1,44 +1,49 @@
# Pooling Data (Representing Factors)
# Categorical Data

Often, we have to deal with factors that take on a small number of levels:

```julia
dv = @data(["Group A", "Group A", "Group A",
"Group B", "Group B", "Group B"])
v = ["Group A", "Group A", "Group A",
"Group B", "Group B", "Group B"]
```

The naive encoding used in a `DataArray` represents every entry of this vector as a full string. In contrast, we can represent the data more efficiently by replacing the strings with indices into a small pool of levels. This is what the `PooledDataArray` does:
The naive encoding used in an `Array` or in a `NullableArray` represents every entry of this vector as a full string. In contrast, we can represent the data more efficiently by replacing the strings with indices into a small pool of levels. This is what the `CategoricalArray` type does:

```julia
pdv = @pdata(["Group A", "Group A", "Group A",
"Group B", "Group B", "Group B"])
cv = CategoricalArray(["Group A", "Group A", "Group A",
"Group B", "Group B", "Group B"])
```
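
The idea of indexing into a pool of levels can be sketched in plain Julia. The names `pool`, `index`, `refs`, and `decoded` below are illustrative only, and the sketch is not how `CategoricalArray` is implemented internally:

```julia
v = ["Group A", "Group A", "Group A",
     "Group B", "Group B", "Group B"]

pool = unique(v)              # the distinct levels, in order of first appearance

# Map each level to its position in the pool.
index = Dict{String,Int}()
for (i, level) in enumerate(pool)
    index[level] = i
end

# Store one small integer per entry instead of one full string.
refs = [index[x] for x in v]

# The original values can be recovered by looking the indices up in the pool.
decoded = [pool[r] for r in refs]
```

Here `refs` is `[1, 1, 1, 2, 2, 2]` and `pool` holds only two strings, however long the vector grows.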

In addition to representing repeated data efficiently, the `PooledDataArray` allows us to determine the levels of the factor at any time using the `levels` function:
A companion type, `NullableCategoricalArray`, allows storing missing values in the array: it is to `CategoricalArray` what `NullableArray` is to the standard `Array` type.

In addition to representing repeated data efficiently, the `CategoricalArray` type lets us efficiently look up the allowed levels of the variable at any time using the `levels` function (note that a level can be allowed without actually appearing in the data):

```julia
levels(pdv)
levels(cv)
```

By default, a `PooledDataArray` is able to represent 2<sup>32</sup> different levels. You can use less memory by calling the `compact` function:
The `levels!` function also allows changing the order of appearance of the levels, which can be useful for display purposes or when working with ordered variables.

By default, a `CategoricalArray` is able to represent 2<sup>32</sup> different levels. You can use less memory by calling the `compact` function:

```julia
pdv = compact(pdv)
cv = compact(cv)
```

Often, you will have factors encoded inside a DataFrame with `DataArray` columns instead of `PooledDataArray` columns. You can do conversion of a single column using the `pool` function:
Often, you will have factors encoded inside a DataFrame with `Array` or `NullableArray` columns instead of `CategoricalArray` or `NullableCategoricalArray` columns. You can do conversion of a single column using the `categorize` function:

```julia
pdv = pool(dv)
cv = categorize(v)
```

Or you can edit the columns of a `DataFrame` in-place using the `pool!` function:
Or you can edit the columns of a `DataFrame` in-place using the `categorical!` function:

```julia
df = DataFrame(A = [1, 1, 1, 2, 2, 2],
B = ["X", "X", "X", "Y", "Y", "Y"])
pool!(df, [:A, :B])
categorical!(df, [:A, :B])
```

Pooling columns is important for working with the [GLM package](https://github.com/JuliaStats/GLM.jl) When fitting regression models, `PooledDataArray` columns in the input are translated into 0/1 indicator columns in the `ModelMatrix` with one column for each of the levels of the `PooledDataArray`. This allows one to analyze categorical data efficiently.
Using categorical arrays is important for working with the [GLM package](https://github.com/JuliaStats/GLM.jl). When fitting regression models, `CategoricalArray` and `NullableCategoricalArray` columns in the input are translated into 0/1 indicator columns in the `ModelMatrix` with one column for each of the levels of the `CategoricalArray`/`NullableCategoricalArray`. This allows one to analyze categorical data efficiently.

See the [CategoricalArrays package](https://github.com/nalimilan/CategoricalArrays.jl) for more information regarding categorical arrays.