Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update .proto files to the same version that Arrow is using #181

Merged
merged 23 commits into from
Sep 12, 2022

Conversation

paleolimbot
Copy link
Contributor

Currently fails spectacularly because:

  • Installing duckdb with remotes and DUCKDB_R_EXTENSIONS=substrait fails with ModuleNotFoundError: No module named 'substrait_config'; the INSTALL 'substrait'; LOAD 'substrait'; SQL command also fails (IO Error: Failed to download extension "substrait" at URL "http://extensions.duckdb.org/deadbeeff/osx_arm64/substrait.duckdb_extension.gz")
  • ARROW-16989: [C++] Substrait ProjectRel is interpreted incorrectly apache/arrow#13528 changed the ProjectRel to mean "add these columns" (i.e., mutate()) rather than "clear the column list and create new columns" (i.e., transmute()). I haven't looked into how to drop columns (something involving substrait::substrait$RelCommon$Emit$create())

@paleolimbot
Copy link
Contributor Author

It looks like it's on its way, but we need apache/arrow#13914 to merge before we can use Arrow's substrait compiler to pass any mutate() tests.

DuckDB's substrait compiler now requires strong typing (previously it was perfectly happy to accept substrait$Type$create(), which we used to signify "we have no clue what the type is").

@paleolimbot paleolimbot marked this pull request as ready for review September 8, 2022 15:54
@paleolimbot
Copy link
Contributor Author

Ok, the CMD check is clean! There are a lot of files in the diff because I updated the internal copy of substrait, but the main behaviour change is substrait_project(), which now appends (with the option to reorder or drop) instead of replaces. I kept the old version as substrait_select() (now implemented using substrait_project() because we used it in a lot of places although I could remove it entirely if that's confusing. There are also changes to the duckdb compiler because the substrait.Expression.ScalarFunction now uses the new list of substrait.FunctionArgument instead of a straight list of substrait.Expression. The function translations there are still a mess (I'm planning to clean that up in a PR for #194).

# needs install.packages("duckdb") to get the latest version (0.5.0)
library(substrait, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

ggplot2::mpg |> 
  duckdb_substrait_compiler() |> 
  filter(hwy > 30) |> 
  arrange(cty) |> 
  collect()
#> # A tibble: 22 × 11
#>    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
#>    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
#>  1 audi         a4           2    2008     4 manu… f        20    31 p     comp…
#>  2 hyundai      sonata       2.4  2008     4 manu… f        21    31 r     mids…
#>  3 toyota       camry        2.4  2008     4 manu… f        21    31 r     mids…
#>  4 toyota       camry        2.4  2008     4 auto… f        21    31 r     mids…
#>  5 toyota       camry sol…   2.4  2008     4 manu… f        21    31 r     comp…
#>  6 toyota       camry sol…   2.4  2008     4 auto… f        22    31 r     comp…
#>  7 nissan       altima       2.5  2008     4 auto… f        23    31 r     mids…
#>  8 nissan       altima       2.5  2008     4 manu… f        23    32 r     mids…
#>  9 honda        civic        1.6  1999     4 auto… f        24    32 r     subc…
#> 10 honda        civic        1.6  1999     4 auto… f        24    32 r     subc…
#> # … with 12 more rows

Created on 2022-09-08 by the reprex package (v2.0.1)

Copy link
Contributor

@thisisnic thisisnic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with your approach here of keeping both substrait_select() and substrait_project(). Just one small comment about variable naming.

R/project-rel.R Outdated
# fields created by earlier arguments are accessible from later arguments
quos <- rlang::enquos(..., .named = TRUE)
expressions <- list()
types <- list()

# Keep a list of columns that we need to drop as part of this relation
drop_columns <- character()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think having an argument named .drop_columns and then this variable names drop_columns is a little confusing - mind renaming this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's an excellent point! I updated the name and the comments to reflect why they're different variables in the way that I wrote it.

@paleolimbot paleolimbot deleted the update-protos branch September 12, 2022 15:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants