Skip to content

Commit

Permalink
Merge pull request #103 from TidierOrg/agg_window_fix
Browse files Browse the repository at this point in the history
  • Loading branch information
drizk1 authored Feb 2, 2025
2 parents fa150d6 + f522ac3 commit d3a5a96
Show file tree
Hide file tree
Showing 20 changed files with 337 additions and 289 deletions.
7 changes: 7 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,11 @@
# TidierDB.jl updates
## v0.7.0 - 2025-01-30
- fixes bug when using `agg()` with window ordering and framing
- include default support for all of the following window functions
- `lead`, `lag`, `dense_rank`, `nth_value`, `ntile`, `rank_dense`, `row_number`, `first_value`, `last_value`, `cume_dist`
- add ability to change what functions are on this list to avoid the use of agg in the following manner
- `push!(TidierDB.window_agg_fxns, :kurtosis);`

## v0.7.0 - 2025-01-26
- `db_table` now supports viewing a dataframe directly - `db_table(db, df, "name4db")`
- `copy_to` will copy a table to the DuckDB db, instead of creating a view
Expand Down
2 changes: 1 addition & 1 deletion Project.toml
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
name = "TidierDB"
uuid = "86993f9b-bbba-4084-97c5-ee15961ad48b"
authors = ["Daniel Rizk <rizk.daniel.12@gmail.com> and contributors"]
version = "0.7.0"
version = "0.7.1"

[deps]
Arrow = "69666777-d1a9-59fb-9406-91d4454c9d45"
Expand Down
9 changes: 5 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -45,7 +45,8 @@ TidierDB.jl currently supports the following top-level macros:
| **Helper Functions** | `across`, `desc`, `if_else`, `case_when`, `n`, `starts_with`, `ends_with`, `contains`, `as_float`, `as_integer`, `as_string`, `is_missing`, `missing_if`, `replace_missing` |
| **TidierStrings.jl Functions** | `str_detect`, `str_replace`, `str_replace_all`, `str_remove_all`, `str_remove` |
| **TidierDates.jl Functions** | `year`, `month`, `day`, `hour`, `min`, `second`, `floor_date`, `difftime`, `mdy`, `ymd`, `dmy` |
| **Aggregate Functions** | `mean`, `minimum`, `maximum`, `std`, `sum`, `cumsum`, and nearly all aggregate sql fxns supported
| **Aggregate Functions** | `mean`, `minimum`, `maximum`, `std`, `sum`, `cumsum`, and nearly all aggregate sql fxns supported |
| **Window Functions** | `@window_order`, `@window_frame`, or `_by`, `_order`, and `_frame` within `@mutate` |

`@summarize` supports any SQL aggregate function in addition to the list above. Simply write the function as written in SQL syntax and it will work.
`@mutate` supports all builtin SQL functions as well.
Expand Down Expand Up @@ -159,7 +160,7 @@ end
```
WITH cte_1 AS (
SELECT *
FROM mtcars
FROM 'https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv' AS mtcars
WHERE NOT (starts_with(model, 'M'))),
cte_2 AS (
SELECT cyl, AVG(mpg) AS mpg
Expand All @@ -171,9 +172,9 @@ SELECT cyl, mpg, POWER(mpg, 2) AS mpg_squared, ROUND(mpg) AS mpg_rounded, CASE
cte_4 AS (
SELECT *
FROM cte_3
WHERE mpg_efficiency in ('moderate', 'efficient'))
WHERE mpg_efficiency in ('moderate', 'efficient'))
SELECT *
FROM cte_4
FROM cte_4
ORDER BY mpg_rounded DESC
```

Expand Down
39 changes: 37 additions & 2 deletions docs/examples/UserGuide/key_differences.jl
Original file line number Diff line number Diff line change
Expand Up @@ -27,10 +27,9 @@ dfv = db_table(db, df, "dfv"); # create a view (not a copy) of the dataframe on

# In TidierDB, when performing `@group_by` then `@mutate`, the table will be ungrouped after applying all of the mutations in the clause to the grouped data. To perform subsequent grouped operations, the user would have to regroup the data. This is demonstrated below.


@chain t(dfv) begin
@group_by(groups)
@summarize(mean_percent = mean(percent))
@mutate(mean_percent = mean(percent))
@collect
end

Expand All @@ -44,6 +43,42 @@ dfv = db_table(db, df, "dfv"); # create a view (not a copy) of the dataframe on
@collect
end

# TidierDB also supports `_by` for grouping directly within a mutate clause (a feature coming to TidierData in the the future)

@chain t(dfv) begin
@mutate(mean_percent = mean(percent),
_by = groups)
@collect
end

# ## Window Functions

# SQL and TidierDB allow for the use of window functions. When ordering a window function, `@arrange` should not be used. Rather, `@window_order` or, preferably, `_order` (and `_frame`) in `@mutate` should be used.
# The following window functions are included by default
# - `lead`, `lag`, `dense_rank`, `nth_value`, `ntile`, `rank_dense`, `row_number`, `first_value`, `last_value`, `cume_dist`
# The following aggregate functions are included by default
# - `maximum`, `minimum`, `mean`, `std`, `sum`, `cumsum`
# Window and aggregate functions not listed in the above can be either wrapped in `agg(kurtosis(column))` or added to an internal vector using
# - `push!(TidierDB.window_agg_fxns, :kurtosis);`
@chain t(dfv) begin
@mutate(row_id = row_number(),
_by = groups,
_order = value # _frame is an available argument as well.
)
@arrange(groups, value)
@aside @show_query _
@collect
end

# The above query could have alternatively been written as
@chain t(dfv) begin
@group_by groups
@window_order value
@mutate(row_id = row_number())
@arrange(groups, value)
@collect
end

# ## Differences in `case_when()`

# In TidierDB, after the clause is completed, the result for the new column should is separated by a comma `,`
Expand Down
9 changes: 5 additions & 4 deletions docs/src/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -32,14 +32,15 @@ TidierDB.jl currently supports:

| **Category** | **Supported Macros and Functions** |
|----------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Data Manipulation** | `@arrange`, `@group_by`, `@filter`, `@select`, `@mutate` (supports `across`), `@summarize`/`@summarise` (supports `across`), `@distinct`, `@relocate` |
| **Joining** | `@left_join`, `@right_join`, `@inner_join`, `@anti_join`, `@full_join`, `@semi_join`, `@union`, `@union_all`, `@intersect`, `@setdiff` |
| **Data Manipulation** | `@arrange`, `@group_by`, `@filter`, `@select`, `@mutate` (supports `across`), `@summarize`/`@summarise` (supports `across`), `@distinct`, `@relocate` |
| **Joining/Setting** | `@left_join`, `@right_join`, `@inner_join`, `@anti_join`, `@full_join`, `@semi_join`, `@union`, `@union_all`, `@intersect`, `@setdiff` |
| **Slice and Order** | `@slice_min`, `@slice_max`, `@slice_sample`, `@order`, `@window_order`, `@window_frame` |
| **Utility** | `@show_query`, `@collect`, `@head`, `@count`, `show_tables`, `@create_view` , `drop_view` |
| **Helper Functions** | `across`, `desc`, `if_else`, `case_when`, `n`, `starts_with`, `ends_with`, `contains`, `as_float`, `as_integer`, `as_string`, `is_missing`, `missing_if`, `replace_missing` |
| **TidierStrings.jl Functions** | `str_detect`, `str_replace`, `str_replace_all`, `str_remove_all`, `str_remove` |
| **TidierDates.jl Functions** | `year`, `month`, `day`, `hour`, `min`, `second`, `floor_date`, `difftime`, `mdy`, `ymd`, `dmy` |
| **Aggregate Functions** | `mean`, `minimum`, `maximum`, `std`, `sum`, `cumsum`, and nearly all aggregate sql fxns supported
| **Aggregate Functions** | `mean`, `minimum`, `maximum`, `std`, `sum`, `cumsum`, and nearly all aggregate sql fxns supported |
| **Window Functions** | `@window_order`, `@window_frame`, or `_by`, `_order`, and `_frame` within `@mutate` |

`@summarize` supports any SQL aggregate function in addition to the list above. Simply write the function as written in SQL syntax and it will work.
`@mutate` supports all builtin SQL functions as well.
Expand Down Expand Up @@ -153,7 +154,7 @@ end
```
WITH cte_1 AS (
SELECT *
FROM mtcars
FROM 'https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv' AS mtcars
WHERE NOT (starts_with(model, 'M'))),
cte_2 AS (
SELECT cyl, AVG(mpg) AS mpg
Expand Down
14 changes: 7 additions & 7 deletions src/TidierDB.jl
Original file line number Diff line number Diff line change
Expand Up @@ -19,24 +19,24 @@ using GZip
@slice_min, @slice_sample, @rename, copy_to, duckdb_open, duckdb_connect, @semi_join, @full_join,
@anti_join, connect, from_query, @interpolate, add_interp_parameter!, update_con, @head,
clickhouse, duckdb, sqlite, mysql, mssql, postgres, athena, snowflake, gbq, oracle, databricks, SQLQuery, show_tables,
t, @union, @create_view, drop_view, @compute, warnings, @relocate, @union_all, @setdiff, @intersect
t, @union, @create_view, drop_view, @compute, warnings, @relocate, @union_all, @setdiff, @intersect#, add_window_agg_fxn

abstract type SQLBackend end

struct clickhouse <: SQLBackend end
struct duckdb <: SQLBackend end
struct sqlite <: SQLBackend end
struct mysql <: SQLBackend end
struct mssql <: SQLBackend end
struct sqlite <: SQLBackend end # COV_EXCL_LINE
struct mysql <: SQLBackend end # COV_EXCL_LINE
struct mssql <: SQLBackend end # COV_EXCL_LINE
struct postgres <: SQLBackend end
struct athena <: SQLBackend end
struct athena <: SQLBackend end # COV_EXCL_LINE
struct snowflake <: SQLBackend end
struct gbq <: SQLBackend end
struct oracle <: SQLBackend end
struct oracle <: SQLBackend end # COV_EXCL_LINE
struct databricks <: SQLBackend end

const _warning_ = Ref(false)

const window_agg_fxns = [:lead, :lag, :dense_rank, :nth_value, :ntile, :rank_dense, :row_number, :first_value, :last_value, :cume_dist]
current_sql_mode = Ref{SQLBackend}(duckdb())

function set_sql_mode(mode::SQLBackend) current_sql_mode[] = mode end
Expand Down
38 changes: 34 additions & 4 deletions src/docstrings.jl
Original file line number Diff line number Diff line change
Expand Up @@ -601,7 +601,8 @@ const docstring_arrange =
"""
@arrange(sql_query, columns...)
Order SQL table rows based on specified column(s).
Order SQL table rows based on specified column(s). Of note, `@arrange` should not be used when performing ordered window functions,
`@window_order`, or preferably the `_order` argument in `@mutate` should be used instead
# Arguments
- `sql_query::SQLQuery`: The SQL query to arrange
Expand Down Expand Up @@ -1855,17 +1856,20 @@ julia> @chain db_table(db, df, "df_view") begin
"""


const docstring_aggregate_functions =
const docstring_aggregate_and_window_functions =
"""
Aggregate Functions
Aggregate and Window Functions
Nearly all aggregate functions from any database are supported both `@summarize` and `@mutate`.
With `@summarize`, an aggregate functions available on a SQL backend can be used as they are in sql with the same syntax (`'` should be replaced with `"`)
`@mutate` supports them as well, however, unless listed below, the function call must be wrapped with `agg()`
- `maximum`, `minimum`, `mean`, `std`, `sum`, `cumsum`
- Aggregate Functions:`maximum`, `minimum`, `mean`, `std`, `sum`, `cumsum`
- Window Functions: `lead`, `lag`, `dense_rank`, `nth_value`, `ntile`, `rank_dense`, `row_number`, `first_value`, `last_value`, `cume_dist`
If a function is needed regularly, instead of wrapping it in `agg`, it can also be added to `window_agg_fxns` with `push!` as demonstrated below
The list of DuckDB aggregate functions and their syntax can be found [here](https://duckdb.org/docs/sql/functions/aggregates.html#general-aggregate-functions)
Please refer to your backend documentation for a complete list with syntac, but open an issue on TidierDB if your run into roadblocks.
# Examples
Expand Down Expand Up @@ -1919,5 +1923,31 @@ julia> @chain db_table(db, df, "df_agg") begin
8 │ AE bb -4.5 37 0.0121342 47799.0 218.629
9 │ AG bb -2.5 135 0.0121342 47799.0 218.629
10 │ AI bb -0.5 521 0.0121342 47799.0 218.629
julia> push!(TidierDB.window_agg_fxns, :regr_slope);
julia> @chain db_table(db, df, "df_agg") begin
@mutate(
slope = regr_slope(value1, value2), # no longer wrapped in `agg` following the above
_by = groups
)
@select !percent
@arrange(groups)
@collect
end
10×5 DataFrame
Row │ id groups value1 value2 slope
│ String String Float64 Int64 Float64
─────┼─────────────────────────────────────────────
1 │ AB aa -7.5 6 0.00608835
2 │ AD aa -5.5 20 0.00608835
3 │ AF aa -3.5 70 0.00608835
4 │ AH aa -1.5 264 0.00608835
5 │ AJ aa 0.5 1034 0.00608835
6 │ AA bb -8.5 3 0.0121342
7 │ AC bb -6.5 11 0.0121342
8 │ AE bb -4.5 37 0.0121342
9 │ AG bb -2.5 135 0.0121342
10 │ AI bb -0.5 521 0.0121342
```
"""
6 changes: 3 additions & 3 deletions src/mutate_and_summ.jl
Original file line number Diff line number Diff line change
Expand Up @@ -91,7 +91,7 @@ function process_mutate_expression(expr, sq, select_expressions, cte_name)
push!(sq.metadata, Dict("name" => col_name, "type" => "UNKNOWN", "current_selxn" => 1, "table_name" => cte_name))
end
else
throw("Unsupported expression format in @mutate: $(expr)")
throw("Unsupported expression format in @mutate: $(expr)") # COV_EXCL_LINE
end
end

Expand Down Expand Up @@ -244,7 +244,7 @@ function process_summary_expression(expr, sq, summary_str)

push!(summary_str, summary_operation * " AS " * summary_column)
else
throw("Unsupported expression format in @summarize: $(expr)")
throw("Unsupported expression format in @summarize: $(expr)") # COV_EXCL_LINE
end
end

Expand Down Expand Up @@ -322,7 +322,7 @@ macro summarize(sqlquery, expressions...)
sq.is_aggregated = true # Mark the query as aggregated
sq.post_aggregation = true # Indicate ready for post-aggregation operations
else
error("Expected sqlquery to be an instance of SQLQuery")
error("Expected sqlquery to be an instance of SQLQuery") # COV_EXCL_LINE
end
sq
end
Expand Down
55 changes: 27 additions & 28 deletions src/parsing_athena.jl
Original file line number Diff line number Diff line change
Expand Up @@ -73,15 +73,7 @@ function expr_to_sql_trino(expr, sq; from_summarize::Bool)
str = "$(arg_str)"
return "$(str) $(window_clause)"
end
elseif !isempty(sq.window_order) && isa(x, Expr) && x.head == :call
function_name = x.args[1] # This will be `lead`
args = x.args[2:end] # Capture all arguments from the second position onward
window_clause = construct_window_clause(sq)

# Create the SQL string representation of the function call
arg_str = join(map(string, args), ", ") # Join arguments into a string
str = "$(function_name)($(arg_str))" # Construct the function call string
return "$(str) $(window_clause)"

#stringr functions, have to use function that removes _ so capture can capture name
elseif @capture(x, strreplaceall(str_, pattern_, replace_))
return :(REGEXP_REPLACE($str, $pattern, $replace, 'g'))
Expand Down Expand Up @@ -142,25 +134,32 @@ function expr_to_sql_trino(expr, sq; from_summarize::Bool)
return Expr(:call, Symbol("CAST"), column, Symbol("AS STRING"))
elseif x.args[1] == :case_when
return parse_case_when(x)
elseif isa(x, Expr) && x.head == :call && x.args[1] == :! && x.args[1] != :!= && length(x.args) == 2
inner_expr = expr_to_sql_duckdb(x.args[2], sq, from_summarize = false) # Recursively transform the inner expression
return string("NOT (", inner_expr, ")")
elseif x.args[1] == :str_detect && length(x.args) == 3
column, pattern = x.args[2], x.args[3]
if pattern isa String
return string(column, " LIKE \'%", pattern, "%'")
elseif pattern isa Expr
pattern_str = string(pattern)[2:end]
return string("REGEXP_LIKE", column, ", '", pattern_str, "')")
end
elseif isa(x, Expr) && x.head == :call && x.args[1] == :n && length(x.args) == 1
if from_summarize
return "COUNT(*)"
else
window_clause = construct_window_clause(sq)
return "COUNT(*) $(window_clause)"
end
end
elseif isa(x, Expr) && x.head == :call && x.args[1] == :! && x.args[1] != :!= && length(x.args) == 2
inner_expr = expr_to_sql_duckdb(x.args[2], sq, from_summarize = false) # Recursively transform the inner expression
return string("NOT (", inner_expr, ")")
elseif x.args[1] == :str_detect && length(x.args) == 3
column, pattern = x.args[2], x.args[3]
if pattern isa String
return string(column, " LIKE \'%", pattern, "%'")
elseif pattern isa Expr
pattern_str = string(pattern)[2:end]
return string("REGEXP_LIKE", column, ", '", pattern_str, "')")
end
elseif x.args[1] == :n && length(x.args) == 1
return from_summarize ? "COUNT(*)" : "COUNT(*) $(construct_window_clause(sq))"
elseif string(x.args[1]) in String.(window_agg_fxns)
if from_summarize
return x
else
args = x.args[2:end]
window_clause = construct_window_clause(sq)
arg_str = join(map(string, args), ", ")
str_representation = "$(string(x.args[1]))($(arg_str))"
return "$(str_representation) $(window_clause)"
end
end
elseif isa(x, SQLQuery)
return "(__(" * finalize_query(x) * ")__("
end
return x
end
Expand Down
Loading

0 comments on commit d3a5a96

Please sign in to comment.