Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(Bug) Window Frame and Filter create unexpected SQL query (wrong order) #1135

Closed
ghost opened this issue Feb 7, 2023 · 1 comment · Fixed by #1137
Closed

(Bug) Window Frame and Filter create unexpected SQL query (wrong order) #1135

ghost opened this issue Feb 7, 2023 · 1 comment · Fixed by #1137

Comments

@ghost
Copy link

ghost commented Feb 7, 2023

Context:
I am using dbplyr::window_order and dbplyr::window_frame to create windowed sums of a variable. For some rows I might not have enough data points for a full windowed sum, i.e., I am at the end of my data, but need data for the next few days. In this case the windowed sum would be NA or a partial sum, depending on what row you are (see example in reprex). Then, to remove these rows without a full window, I use dplyr::filter.

Problem:
Even though the windowed sum calculation comes before the use of dplyr::filter, the values are actually calculated as if the dplyr::filter was applied before the windowed calculations, yielding wrong summations. See the reprex below for a small example, and the expected vs actual SQL query generated.

Example:

# Create the following table 
data <- tibble::tibble(
                 identification = c(1,1,1),
                 date = c(1,2,3),
                 value = c(1,1,1)
               )
# Write the data to a remote server before continuing

Screenshot 2023-02-07 at 3 56 35 PM

# Calculate windowed sums
sums <- data %>%
    dplyr::group_by(identification) %>%
    dbplyr::window_order(identification, date) %>%
    dbplyr::window_frame(
              from = 1,
              to = 2
            ) %>%
    dplyr::mutate(
             summed = sum(value, na.rm = TRUE)
           ) %>%
    dplyr::ungroup()

Screenshot 2023-02-07 at 3 58 20 PM

The problem happens here:

# Remove the rows without a full window
dplyr::filter(sums, date <= 1)

Output:
Screenshot 2023-02-07 at 4 00 41 PM
Expected:
Screenshot 2023-02-07 at 4 00 41 PM copy

This is the query generated:

SELECT
  *,
  SUM("value") OVER (PARTITION BY "identification" ORDER BY "identification", "date" ROWS BETWEEN 1 FOLLOWING AND 2 FOLLOWING) AS "summed"
FROM "vrvkvwtftvhrekibsploqvthpfjmzt"
WHERE ("date" <= 1.0)

Notice that the filter is added directly to the select statement, instead of being added to a second select statement after the windowed sum is calculated. This is the query I expected (since filtering comes after the creation of the windowed sums):

SELECT
  *
FROM (
  SELECT
    *,
    SUM("value") OVER (PARTITION BY "identification" ORDER BY "identification", "date" ROWS BETWEEN 1 FOLLOWING AND 2 FOLLOWING) AS "summed"
  FROM "vrvkvwtftvhrekibsploqvthpfjmzt"
  WHERE ("date" <= 1.0)
  );
@mgirlich
Copy link
Collaborator

mgirlich commented Feb 8, 2023

Thanks. Fixed in the dev version (which you can install via devtools::install_github("tidyverse/dbplyr")).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant