Merge pull request #1008 from JuliaStats/nl/nullable

Port to NullableArrays and CategoricalArrays
JuliaData · Jul 8, 2017 · c03d516
2 parents 2931693 + bba462f
commit c03d516
Showing 52 changed files with 1,287 additions and 1,222 deletions.
diff --git a/.travis.yml b/.travis.yml
@@ -1,7 +1,6 @@
 
 language: julia
 julia:
- - 0.4
  - 0.5
  - nightly
 os:
@@ -17,4 +16,3 @@ script:
 after_success:
  - julia -e 'cd(Pkg.dir("DataFrames")); Pkg.clone("https://github.com/MichaelHatherly/Documenter.jl"); include(joinpath("docs", "make.jl"))'
  - julia -e 'cd(Pkg.dir("DataFrames")); Pkg.add("Coverage"); using Coverage; Coveralls.submit(Coveralls.process_folder())'
-
diff --git a/REQUIRE b/REQUIRE
@@ -1,5 +1,6 @@
-julia 0.4
-DataArrays 0.3.4
+julia 0.5
+NullableArrays 0.0.8
+CategoricalArrays 0.0.6
 StatsBase 0.8.3
 GZip
 SortingAlgorithms

diff --git a/appveyor.yml b/appveyor.yml
@@ -1,7 +1,5 @@
 environment:
  matrix:
- - JULIAVERSION: "julialang/bin/winnt/x86/0.4/julia-0.4-latest-win32.exe"
- - JULIAVERSION: "julialang/bin/winnt/x64/0.4/julia-0.4-latest-win64.exe"
  - JULIAVERSION: "julialang/bin/winnt/x86/0.5/julia-0.5-latest-win32.exe"
  - JULIAVERSION: "julialang/bin/winnt/x64/0.5/julia-0.5-latest-win64.exe"
  - JULIAVERSION: "julianightlies/bin/winnt/x86/julia-latest-win32.exe"

diff --git a/benchmark/datamatrix.jl b/benchmark/datamatrix.jl
diff --git a/benchmark/datavector.jl b/benchmark/datavector.jl
diff --git a/benchmark/results.csv b/benchmark/results.csv
diff --git a/benchmark/runbenchmarks.jl b/benchmark/runbenchmarks.jl
@@ -5,9 +5,7 @@
 using DataFrames
 using Benchmark
 
-benchmarks = ["datavector.jl",
- "datamatrix.jl",
- "io.jl"]
+benchmarks = ["io.jl"]
 
 # TODO: Print summary to stdout_stream, while printing results
 # to file with appends.

diff --git a/docs/make.jl b/docs/make.jl
@@ -1,4 +1,4 @@
-using Documenter, DataFrames, DataArrays
+using Documenter, DataFrames
 
 # Build documentation.
 # ====================

diff --git a/docs/src/lib/utilities.md b/docs/src/lib/utilities.md
@@ -12,6 +12,7 @@
  {docs}
  eltypes
  head
+ categorical!
  complete_cases
  complete_cases!
  describe

diff --git a/docs/src/man/formulas.md b/docs/src/man/formulas.md
@@ -33,7 +33,7 @@ If you would like to specify both main effects and an interaction term at once,
 mm = ModelMatrix(ModelFrame(Z ~ X*Y, df))
 ```
 
-You can control how categorical variables (e.g., `PooledDataArray` columns) are converted to `ModelMatrix` columns by specifying _contrasts_ when you construct a `ModelFrame`:
+You can control how categorical variables (e.g., `CategoricalArray` columns) are converted to `ModelMatrix` columns by specifying _contrasts_ when you construct a `ModelFrame`:
 
 ```julia
 mm = ModelMatrix(ModelFrame(Z ~ X*Y, df, contrasts = Dict(:X => HelmertCoding())))
@@ -47,4 +47,3 @@ contrasts!(mf, X = HelmertCoding())
 ```
 
 The construction of model matrices makes it easy to formulate complex statistical models. These are used to good effect by the [GLM Package.](https://github.com/JuliaStats/GLM.jl)
-
diff --git a/docs/src/man/getting_started.md b/docs/src/man/getting_started.md
@@ -2,75 +2,75 @@
 
 ## Installation
 
-The DataFrames package is available through the Julia package system. Throughout the rest of this tutorial, we will assume that you have installed the DataFrames package and have already typed `using DataArrays, DataFrames` to bring all of the relevant variables into your current namespace. In addition, we will make use of the `RDatasets` package, which provides access to hundreds of classical data sets.
+The DataFrames package is available through the Julia package system. Throughout the rest of this tutorial, we will assume that you have installed the DataFrames package and have already typed `using NullableArrays, DataFrames` to bring all of the relevant variables into your current namespace. In addition, we will make use of the `RDatasets` package, which provides access to hundreds of classical data sets.
 
-## The `NA` Value
+## The `Nullable` Type
 
-To get started, let's examine the `NA` value. Type the following into the REPL:
+To get started, let's examine the `Nullable` type. Objects of this type can either hold a value, or represent a missing value (`null`). For example, this is a `Nullable` holding the integer `1`:
 
 ```julia
-NA
+Nullable(1)
 ```
 
-One of the essential properties of `NA` is that it poisons other items. To see this, try to add something like `1` to `NA`:
-
+And this represents a missing value:
 ```julia
-1 + NA
+Nullable()
 ```
 
-## The `DataArray` Type
-
-Now that we see that `NA` is working, let's insert one into a `DataArray`. We'll create one now using the `@data` macro:
+`Nullable` objects support all standard operators, which return another `Nullable`. One of the essential properties of `null` values is that they poison other items. To see this, try to add something like `Nullable(1)` to `Nullable()`:
 
 ```julia
-dv = @data([NA, 3, 2, 5, 4])
+Nullable(1) + Nullable()
 ```
 
-To see how `NA` poisons even complex calculations, let's try to take the mean of the five numbers stored in `dv`:
+Note that operations mixing `Nullable` and scalars (e.g. `1 + Nullable()`) are not supported.
+
+## The `NullableArray` Type
+
+`Nullable` objects can be stored in a standard `Array` just like any value:
 
 ```julia
-mean(dv)
+v = Nullable{Int}[1, 3, 4, 5, 4]
 ```
 
-In many cases we're willing to just ignore `NA` values and remove them from our vector. We can do that using the `dropna` function:
+But arrays of `Nullable` are inefficient, both in terms of computation costs and of memory use. `NullableArray` objects provide more efficient storage and behave like `Array{Nullable}` objects.
 
 ```julia
-dropna(dv)
-mean(dropna(dv))
+nv = NullableArray(Nullable{Int}[Nullable(), 3, 2, 5, 4])
 ```
 
-Instead of removing `NA` values, you can try to conver the `DataArray` into a normal Julia `Array` using `convert`:
+In many cases we're willing to just ignore missing values and remove them from our vector. We can do that using the `dropnull` function:
 
 ```julia
-convert(Array, dv)
+dropnull(nv)
+mean(dropnull(nv))
 ```
 
-This fails in the presence of `NA` values, but will succeed if there are no `NA` values:
+Instead of removing `null` values, you can try to convert the `NullableArray` into a normal Julia `Array` using `convert`:
 
 ```julia
-dv[1] = 3
-convert(Array, dv)
+convert(Array, nv)
 ```
 
-In addition to removing `NA` values and hoping they won't occur, you can also replace any `NA` values using the `convert` function, which takes a replacement value as an argument:
+This fails in the presence of `null` values, but will succeed if there are no `null` values:
 
 ```julia
-dv = @data([NA, 3, 2, 5, 4])
-mean(convert(Array, dv, 11))
+nv[1] = 3
+convert(Array, nv)
 ```
 
-Which strategy for dealing with `NA` values is most appropriate will typically depend on the specific details of your data analysis pathway.
-
-Although the examples above employed only 1D `DataArray` objects, the `DataArray` type defines a completely generic N-dimensional array type. Operations on generic `DataArray` objects work in higher dimensions in the same way that they work on Julia's Base `Array` type:
+In addition to removing `null` values and hoping they won't occur, you can also replace any `null` values using the `convert` function, which takes a replacement value as an argument:
 
 ```julia
-dm = @data([NA 0.0; 0.0 1.0])
-dm * dm
+nv = NullableArray(Nullable{Int}[Nullable(), 3, 2, 5, 4])
+mean(convert(Array, nv, 0))
 ```
 
+Which strategy for dealing with `null` values is most appropriate will typically depend on the specific details of your data analysis pathway.
+
 ## The `DataFrame` Type
 
-The `DataFrame` type can be used to represent data tables, each column of which is a `DataArray`. You can specify the columns using keyword arguments:
+The `DataFrame` type can be used to represent data tables, each column of which is an array (by default, a `NullableArray`). You can specify the columns using keyword arguments:
 
 ```julia
 df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
@@ -110,22 +110,22 @@ describe(df)
 To focus our search, we start looking at just the means and medians of specific columns. In the example below, we use numeric indexing to access the columns of the `DataFrame`:
 
 ```julia
-mean(df[1])
-median(df[1])
+mean(dropnull(df[1]))
+median(dropnull(df[1]))
 ```
 
 We could also have used column names to access individual columns:
 
 ```julia
-mean(df[:A])
-median(df[:A])
+mean(dropnull(df[:A]))
+median(dropnull(df[:A]))
 ```
 
 We can also apply a function to each column of a `DataFrame` with the `colwise` function. For example:
 
 ```julia
 df = DataFrame(A = 1:4, B = randn(4))
-colwise(cumsum, df)
+colwise(c->cumsum(dropnull(c)), df)
 ```
 
 ## Accessing Classic Data Sets
@@ -135,10 +135,8 @@ To see more of the functionality for working with `DataFrame` objects, we need a
 For example, we can access Fisher's iris data set using the following functions:
 
 ```julia
-using RDatasets
-iris = dataset("datasets", "iris")
+iris = DataFrames.loadiris()
 head(iris)
 ```
 
 In the next section, we'll discuss generic I/O strategy for reading and writing `DataFrame` objects that you can use to import and export your own data files.
-
diff --git a/docs/src/man/joins.md b/docs/src/man/joins.md
@@ -15,7 +15,7 @@ full = join(names, jobs, on = :ID)
 
 Output:
 
-| Row | ID | Name | Job | 
+| Row | ID | Name | Job |
 |-----|----|------------|----------|
 | 1 | 1 | "John Doe" | "Lawyer" |
 | 2 | 1 | "Jane Doe" | "Doctor" |

diff --git a/docs/src/man/pooling.md b/docs/src/man/pooling.md
@@ -1,44 +1,49 @@
-# Pooling Data (Representing Factors)
+# Categorical Data
 
 Often, we have to deal with factors that take on a small number of levels:
 
 ```julia
-dv = @data(["Group A", "Group A", "Group A",
-  "Group B", "Group B", "Group B"])
+v = ["Group A", "Group A", "Group A",
+ "Group B", "Group B", "Group B"]
 ```
 
-The naive encoding used in a `DataArray` represents every entry of this vector as a full string. In contrast, we can represent the data more efficiently by replacing the strings with indices into a small pool of levels. This is what the `PooledDataArray` does:
+The naive encoding used in an `Array` or in a `NullableArray` represents every entry of this vector as a full string. In contrast, we can represent the data more efficiently by replacing the strings with indices into a small pool of levels. This is what the `CategoricalArray` type does:
 
 ```julia
-pdv = @pdata(["Group A", "Group A", "Group A",
- "Group B", "Group B", "Group B"])
+cv = CategoricalArray(["Group A", "Group A", "Group A",
+  "Group B", "Group B", "Group B"])
 ```
 
-In addition to representing repeated data efficiently, the `PooledDataArray` allows us to determine the levels of the factor at any time using the `levels` function:
+A companion type, `NullableCategoricalArray`, allows storing missing values in the array: it is to `CategoricalArray` what `NullableArray` is to the standard `Array` type.
+
+In addition to representing repeated data efficiently, the `CategoricalArray` type allows us to efficiently determine the allowed levels of the variable at any time using the `levels` function (note that a level may be allowed without actually occurring in the data):
 
 ```julia
-levels(pdv)
+levels(cv)
 ```
 
-By default, a `PooledDataArray` is able to represent 2<sup>32</sup>differents levels. You can use less memory by calling the `compact` function:
+The `levels!` function also allows changing the order of appearance of the levels, which can be useful for display purposes or when working with ordered variables.
+
+By default, a `CategoricalArray` is able to represent 2<sup>32</sup> different levels. You can use less memory by calling the `compact` function:
 
 ```julia
-pdv = compact(pdv)
+cv = compact(cv)
 ```
 
-Often, you will have factors encoded inside a DataFrame with `DataArray` columns instead of `PooledDataArray` columns. You can do conversion of a single column using the `pool` function:
+Often, you will have factors encoded inside a `DataFrame` with `Array` or `NullableArray` columns instead of `CategoricalArray` or `NullableCategoricalArray` columns. You can convert a single column using the `categorize` function:
 
 ```julia
-pdv = pool(dv)
+cv = categorize(v)
 ```
 
-Or you can edit the columns of a `DataFrame` in-place using the `pool!` function:
+Or you can edit the columns of a `DataFrame` in-place using the `categorical!` function:
 
 ```julia
 df = DataFrame(A = [1, 1, 1, 2, 2, 2],
  B = ["X", "X", "X", "Y", "Y", "Y"])
-pool!(df, [:A, :B])
+categorical!(df, [:A, :B])
 ```
 
-Pooling columns is important for working with the [GLM package](https://github.com/JuliaStats/GLM.jl) When fitting regression models, `PooledDataArray` columns in the input are translated into 0/1 indicator columns in the `ModelMatrix` with one column for each of the levels of the `PooledDataArray`. This allows one to analyze categorical data efficiently.
+Using categorical arrays is important for working with the [GLM package](https://github.com/JuliaStats/GLM.jl). When fitting regression models, `CategoricalArray` and `NullableCategoricalArray` columns in the input are translated into 0/1 indicator columns in the `ModelMatrix` with one column for each of the levels of the `CategoricalArray`/`NullableCategoricalArray`. This allows one to analyze categorical data efficiently.
 
+See the [CategoricalArrays package](https://github.com/nalimilan/CategoricalArrays.jl) for more information regarding categorical arrays.