Restructure vector tutorial, use R example
asinghvi17 committed Sep 25, 2024
1 parent eee67bc commit f2bf141
Showing 1 changed file with 84 additions and 45 deletions.
129 changes: 84 additions & 45 deletions chapters/02-attribute-operations.qmd
@@ -24,6 +24,7 @@ using CategoricalArrays

## Introduction

<!-- ### Vector data attributes -->

Attribute data is non-spatial information associated with geographic (geometry) data.
A bus stop provides a simple example: its position would typically be represented by latitude and longitude coordinates (geometry data), in addition to its name.
@@ -48,6 +49,7 @@ point_df = data.frame(name = "London bus stop", point_vector)
point_sf = sf::st_as_sf(point_df, coords = c("X", "Y"))
```

<!-- ### Raster data attributes -->

Another example is the elevation value (attribute) for a specific grid cell in raster data.
Unlike the vector data model, the raster data model stores the coordinate of the grid cell indirectly, meaning the distinction between attribute and spatial information is less clear.
@@ -77,12 +79,12 @@ Geospatial data frames have a `geometry` column which can contain a range of geo

Data frames (and geospatial tables like geographic databases, shapefiles, GeoParquet, GeoJSON, etc.) have one column per attribute variable (such as "name") and one row per observation or *feature* (e.g., per bus station).

Many operations are available for attribute data, as shown in the wonderful [DataFrames.jl documentation](https://juliadata.org/stable/man/attributes/).
Many operations are available for attribute data, as shown in the wonderful [DataFrames.jl documentation](https://dataframes.juliadata.org/stable/).

::: {.callout-note}
## Geometry column names
## Geometry in geographic tables

The geometry column of geographic tables in Julia is typically called `geometry` or `geom`, but any name can be used.
The column of a geographic table that holds geometry is typically called `geometry` or `geom`, but any name can be used.

You can discover the names of the geometry columns in a geospatial table using `GI.geometrycolumns(table)` - typically, `first(GI.geometrycolumns(table))` is assumed to be the geometry column.

@@ -102,8 +104,9 @@ We also recommend the following resources for further reading:
- https://juliadatascience.io/
- https://github.com/bkamins/JuliaForDataAnalysis

### Basic `DataFrame` operations

Before using these capabilities it is worth re-capping how to discover the basic properties of vector data objects.
Before using these capabilities, it is worth recapping how to discover the basic properties of vector data objects.
Let's start by inspecting the `world.gpkg` dataset from `data/`:

```{julia}
@@ -112,6 +115,8 @@ world = GeoDataFrames.read("data/world.gpkg")

We can get a visual overview of the dataset by showing it (simply type the variable name in the REPL). From this we can see an abbreviated view of its contents.

<!-- #### Inspection and description -->

But what is it? We can check the type:

```{julia}
@@ -136,9 +141,11 @@ Notice that the first column, `:geom`, is composed of `IGeometry{wkbMultiPolygon

We can also get some geospatial information - `GI.geometrycolumns(world)` returns `{julia} GI.geometrycolumns(world)`, and `GI.crs(world)` returns `{julia} GI.crs(world)`.

### Dropping geometries
::: {.callout-note collapse="false"}

## Dropping geometries

We can drop the geometry column by subsetting the `DataFrame`:
We can drop the geometry column by subsetting the `DataFrame`, as you'll see in @sec-vec-attr-subsetting.

```{julia}
world_without_geom = world[:, Not(GI.geometrycolumns(world)...)]
@@ -149,14 +156,16 @@ Dropping the geometry column before working with attribute data can be sometimes
For most cases, however, it makes sense to **keep** the geometry column.
Becoming skilled at geographic attribute data manipulation means becoming skilled at manipulating data frames.

### Vector attribute subsetting
:::

### Vector attribute subsetting {#sec-vec-attr-subsetting}

There are multiple ways to subset data in Julia.
First, and probably most simply, we can index into the DataFrame object using a few kinds of selectors. This can select rows and columns.

Indices placed inside square brackets placed directly after a data frame object name specify the elements to keep.
Indices go inside square brackets directly after a data frame object name, and specify the elements to keep.

Rows are always selected first, and then columns go in the second position. We can select the first 5 rows of the `:pop_est` column like so:
Rows are referred to using integers, and columns may be referred to using integers or symbols (`:name`).

::: {.callout-note collapse="true"}

@@ -170,6 +179,9 @@ You can also pass vectors of indices or `bo`olean values to select specific elem
In DataFrames.jl, using `!` as the row selector, as in `world[!, :pop]`, returns the stored column itself without copying it, whereas `world[:, :pop]` returns a copy. The `!` form is also the usual way to replace an entire column or to create a new one.
:::
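
The difference between the two row selectors is easiest to see on a small table. The sketch below uses a made-up `toy` DataFrame rather than the `world` dataset, and assumes only that DataFrames.jl is installed; the column names and values are illustrative.

```julia
using DataFrames

# Hypothetical toy table, standing in for `world`
toy = DataFrame(name = ["A", "B", "C"], pop = [10, 20, 30])

copied = toy[:, :pop]   # `:` returns a fresh copy of the column
shared = toy[!, :pop]   # `!` returns the column stored in `toy`, without copying

copied[1] = 999         # changes only the copy
shared[2] = 555         # changes `toy` as well

# `!` on the left-hand side replaces or creates a whole column
toy[!, :pop_doubled] = 2 .* toy.pop
```

Because `shared` is the same array that `toy` holds, writing to it is visible in the table, which is convenient but easy to do by accident.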

Rows are always the first argument, and columns go in the second position. We can select the first 5 rows of the `:pop` column like so:


```{julia}
world[1:5, :pop]
```
@@ -182,24 +194,19 @@ world[5:end, [:pop, :continent]]

and note that this returns a new DataFrame with only the selected columns.

We can also drop all missing values in a column using the `dropmissing` function:
We can also select using negations via the `Not` function:

```{julia}
world_with_pop = dropmissing(world, :pop)
world[1:5, Not(:pop)]
```

There is also a mutating version of `dropmissing`, called `dropmissing!`, which modifies the input in place.

We can also subset by a boolean vector, computed on some predicate. Let's select all countries whose populations are greater than 30 million, but less than 1 billion.
```{julia}
countries_to_select = 30_000_000 .< world_with_pop.pop .< 1_000_000_000
```
or

```{julia}
world_with_pop[countries_to_select, :]
world[Not(1:150), :]
```

A more concise way to achieve the same result is `world_with_pop[30_000_000 .< world_with_pop.pop .< 1_000_000_000, :]`.
You can pass any collection of indices to `Not`; every row or column that is not in that collection is then selected.
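
As a minimal sketch of how `Not` behaves for both rows and columns, here it is applied to a small made-up table. The `toy` DataFrame and its values are assumptions for illustration and are not part of the `world` dataset.

```julia
using DataFrames

# Hypothetical toy table with made-up values
toy = DataFrame(name = ["A", "B", "C", "D"],
                continent = ["X", "X", "Y", "Y"],
                pop = [10, 20, 30, 40])

no_pop    = toy[:, Not(:pop)]                # every column except :pop
name_only = toy[:, Not([:continent, :pop])]  # drop several columns at once
last_rows = toy[Not(1:2), :]                 # every row except the first two
```

The same selector works in either position because `Not` only records what to exclude; the DataFrame decides whether that refers to rows or columns.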


Here's a small exercise: guess the number of rows and columns in the `DataFrame` objects returned by each of the following commands, then check your answer by executing the commands in Julia.
@@ -215,6 +222,35 @@ world[:, 888] # an index representing a non-existent column
```


We can also drop all missing values in a column using the `dropmissing` function:

```{julia}
world_with_area = dropmissing(world, :area_km2)
```

There is also a mutating version of `dropmissing`, called `dropmissing!`, which modifies the input in place.

<!-- #### Selecting via predicate -->

We can also subset by a boolean vector, computed on some predicate.
Earlier on, we saw that we could extract a column as a vector using `df.columnname`.

We can use this vector of values to create a _boolean vector_ (sometimes called a _logical_ vector in R) that we can use to index into the DataFrame.

Let's select all countries whose surface area is smaller than 10,000 km².

```{julia}
countries_to_select = world_with_area.area_km2 .< 10_000
```

This is a plain boolean vector with one element per row of the DataFrame.
We use it to select all rows in the DataFrame where its value is `true`.

```{julia}
world_with_area[countries_to_select, :]
```

A more concise way to achieve the same result, without the intermediate array, is `world_with_area[world_with_area.area_km2 .< 10_000, :]`.
This syntax is applicable to columns too!
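
The whole sequence (drop missing values, build a mask, index with it) can be sketched on a small made-up table. The `toy` DataFrame below is an assumption for illustration; the last line shows the column case, where the boolean vector needs one entry per column.

```julia
using DataFrames

# Hypothetical toy table; one area value is deliberately missing
toy = DataFrame(name = ["A", "B", "C", "D"],
                area_km2 = [500.0, missing, 25_000.0, 9_000.0])

# Comparing `missing` with a number yields `missing`, so drop those rows first
toy_with_area = dropmissing(toy, :area_km2)

# One boolean per row
mask = toy_with_area.area_km2 .< 10_000

small = toy_with_area[mask, :]

# Boolean selection on columns: one boolean per column
names_only = toy_with_area[:, [true, false]]
```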

There are ways to achieve this result using all of the DataFrame manipulation packages mentioned above.

@@ -226,7 +262,7 @@ There are ways to achieve this result using all of the DataFrame manipulation pa
DataFrames.jl also defines a `subset` function, which is another way to achieve this result:

```{julia}
subset(world_with_pop, :pop => ByRow(x -> !ismissing(x) && 30_000_000 < x < 1_000_000_000))
subset(world_with_area, :area_km2 => ByRow(x -> x < 10_000))
```

## DataFramesMeta.jl
@@ -237,9 +273,9 @@ DataFramesMeta.jl provides a convenient syntax for subsetting DataFrames using a
#| eval: false
using DataFramesMeta
@chain world_with_pop begin
@subset @byrow (!ismissing(:pop) && 30_000_000 < :pop < 1_000_000_000)
select(:name_long, :pop)
@chain world_with_area begin
    @subset @byrow (:area_km2 < 10_000)
    select(:name_long, :area_km2)
end
```

@@ -251,9 +287,9 @@ TidierData.jl provides a convenient syntax for subsetting DataFrames using a DSL
#| eval: false
using TidierData
@chain world_with_pop begin
@subset @byrow (!ismissing(:pop) && 30_000_000 < :pop < 1_000_000_000)
select(:name_long, :pop)
@chain world_with_area begin
    # TidierData.jl uses dplyr-style macros with bare column names
    @filter(area_km2 < 10_000)
    @select(name_long, area_km2)
end
```

@@ -265,38 +301,41 @@ Query.jl provides a convenient syntax for subsetting DataFrames using a DSL that
#| eval: false
using Query
@from row in world_with_pop |>
@where !ismissing(row.pop) && 30_000_000 < row.pop < 1_000_000_000 |>
@select {name_long = row.name_long, pop = row.pop} |>
# LINQ-style Query.jl queries are wrapped in a begin ... end block
@from row in world_with_area begin
    @where row.area_km2 < 10_000
    @select {row.name_long, row.area_km2}
    @collect DataFrame
end
```

:::

#### Subsetting by predicate

We saw how we could use a boolean vector to index into a DataFrame to select rows where the boolean is `true`.
However, this means we have to create the boolean vector ourselves; the approach is powerful, but it can be clunky.

Instead, DataFrames.jl offers several ways we can do this. First is the `subset` function, which we just saw in the tabset above:

```{julia}
small_countries = subset(world_with_area, :area_km2 => ByRow(<(10_000)))
```
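
To unpack the call above: `<(10_000)` is Julia shorthand for the function `x -> x < 10_000`, and `ByRow` applies it to each value of the column. The sketch below runs the same pattern on a made-up `toy` table (an assumption for illustration) and also uses the `skipmissing` keyword of `subset`, which drops rows where the predicate evaluates to `missing`.

```julia
using DataFrames

# Hypothetical toy table; one area value is deliberately missing
toy = DataFrame(name = ["A", "B", "C"],
                area_km2 = [500.0, missing, 25_000.0])

# `<(10_000)` builds a one-argument function equivalent to `x -> x < 10_000`
is_small = <(10_000)

# With `skipmissing = true`, rows whose predicate is `missing` are left out,
# so no separate `dropmissing` call is needed in this sketch
small = subset(toy, :area_km2 => ByRow(is_small); skipmissing = true)
```

Whether to call `dropmissing` first or rely on `skipmissing = true` is a matter of taste; the former makes the removal of incomplete rows explicit in the pipeline.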

### Operations with DataFramesMeta.jl

DataFrames.jl functions are mature, stable, and widely used, making them a rock-solid choice, especially in contexts where reproducibility and reliability are key.

Functions from the DataFrames manipulation packages mentioned earlier (DataFramesMeta.jl, TidierData.jl, and Query.jl) are also available, and quite stable at this point.
They offer "tidy" workflows which can sometimes be more intuitive and productive for interactive data analysis, as well as easier to reason about.

```{julia}
using DataFramesMeta
result = @chain world_with_area begin
    @subset @byrow (:area_km2 < 10_000)
end
```
















You can subset and



@@ -372,7 +411,7 @@ fig

## Color tables in rasters

Rasters.jl does not currently support color tables in rasters. This should come at some point, though.
Rasters.jl does not currently support color tables in rasters. This should come at some point, though. ArchGDAL, the backend, does support these.

:::
