Merge pull request #1008 from JuliaStats/nl/nullable
Port to NullableArrays and CategoricalArrays
Showing 52 changed files with 1,287 additions and 1,222 deletions.
@@ -12,6 +12,7 @@
 {docs}
 eltypes
 head
+categorical!
 complete_cases
 complete_cases!
 describe
@@ -1,44 +1,49 @@
-# Pooling Data (Representing Factors)
+# Categorical Data
 
 Often, we have to deal with factors that take on a small number of levels:
 
 ```julia
-dv = @data(["Group A", "Group A", "Group A",
-            "Group B", "Group B", "Group B"])
+v = ["Group A", "Group A", "Group A",
+     "Group B", "Group B", "Group B"]
 ```
 
-The naive encoding used in a `DataArray` represents every entry of this vector as a full string. In contrast, we can represent the data more efficiently by replacing the strings with indices into a small pool of levels. This is what the `PooledDataArray` does:
+The naive encoding used in an `Array` or in a `NullableArray` represents every entry of this vector as a full string. In contrast, we can represent the data more efficiently by replacing the strings with indices into a small pool of levels. This is what the `CategoricalArray` type does:
 
 ```julia
-pdv = @pdata(["Group A", "Group A", "Group A",
-              "Group B", "Group B", "Group B"])
+cv = CategoricalArray(["Group A", "Group A", "Group A",
+                       "Group B", "Group B", "Group B"])
 ```
 
-In addition to representing repeated data efficiently, the `PooledDataArray` allows us to determine the levels of the factor at any time using the `levels` function:
+A companion type, `NullableCategoricalArray`, allows storing missing values in the array: it is to `CategoricalArray` what `NullableArray` is to the standard `Array` type.
+
+In addition to representing repeated data efficiently, the `CategoricalArray` type allows us to efficiently determine the allowed levels of the variable at any time using the `levels` function (note that an allowed level may or may not actually appear in the data):
 
 ```julia
-levels(pdv)
+levels(cv)
 ```
 
-By default, a `PooledDataArray` is able to represent 2<sup>32</sup>differents levels. You can use less memory by calling the `compact` function:
+The `levels!` function also allows changing the order of appearance of the levels, which can be useful for display purposes or when working with ordered variables.
+
+By default, a `CategoricalArray` is able to represent 2<sup>32</sup> different levels. You can use less memory by calling the `compact` function:
 
 ```julia
-pdv = compact(pdv)
+cv = compact(cv)
 ```
 
-Often, you will have factors encoded inside a DataFrame with `DataArray` columns instead of `PooledDataArray` columns. You can do conversion of a single column using the `pool` function:
+Often, you will have factors encoded inside a `DataFrame` with `Array` or `NullableArray` columns instead of `CategoricalArray` or `NullableCategoricalArray` columns. You can convert a single column using the `categorize` function:
 
 ```julia
-pdv = pool(dv)
+cv = categorize(v)
 ```
 
-Or you can edit the columns of a `DataFrame` in-place using the `pool!` function:
+Or you can edit the columns of a `DataFrame` in place using the `categorical!` function:
 
 ```julia
 df = DataFrame(A = [1, 1, 1, 2, 2, 2],
                B = ["X", "X", "X", "Y", "Y", "Y"])
-pool!(df, [:A, :B])
+categorical!(df, [:A, :B])
 ```
 
-Pooling columns is important for working with the [GLM package](https://github.com/JuliaStats/GLM.jl) When fitting regression models, `PooledDataArray` columns in the input are translated into 0/1 indicator columns in the `ModelMatrix` with one column for each of the levels of the `PooledDataArray`. This allows one to analyze categorical data efficiently.
+Using categorical arrays is important for working with the [GLM package](https://github.com/JuliaStats/GLM.jl). When fitting regression models, `CategoricalArray` and `NullableCategoricalArray` columns in the input are translated into 0/1 indicator columns in the `ModelMatrix` with one column for each of the levels of the `CategoricalArray`/`NullableCategoricalArray`. This allows one to analyze categorical data efficiently.
 
+See the [CategoricalArrays package](https://github.com/nalimilan/CategoricalArrays.jl) for more information regarding categorical arrays.