[R] Cannot read datasets partitioned by columns starting with dots #32061

@asfimport

Description

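A dataset partitioned by a column whose name starts with a dot is written correctly, but open_dataset() detects 0 files when reading it back.
This might be because files and directories whose names start with a dot are treated as hidden.
There is no issue if the dot appears elsewhere in the column name.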
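
To make the hypothesis concrete, here is a minimal base-R sketch of the kind of rule that would produce this behaviour. It assumes that dataset discovery skips any path whose file name or parent directory segment starts with "." or "_". That is my understanding of Arrow's default ignore prefixes, not something confirmed in this report. The helper name is_ignored() is hypothetical and is not part of arrow.

```r
# Hypothetical helper: mimics a discovery rule that ignores a path when any
# of its segments (directories or file name) starts with an ignored prefix.
# The default prefixes "." and "_" are an assumption about Arrow's behaviour.
is_ignored <- function(path, prefixes = c(".", "_")) {
  segments <- strsplit(path, "/", fixed = TRUE)[[1]]
  any(vapply(
    segments,
    function(seg) any(startsWith(seg, prefixes)),
    logical(1)
  ))
}

is_ignored("cyl=4/part-0.parquet")   # dot-free partition directory
is_ignored(".cyl=4/part-0.parquet")  # partition directory starts with a dot
is_ignored("c.yl=4/part-0.parquet")  # dot elsewhere in the name
```

Under this rule only the second path is skipped, which matches the symptoms described above: the dot matters only at the start of the partition directory name.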

Reprex:

library(dplyr)
library(arrow)

packageVersion("arrow")
#> [1] '8.0.0'

path_arrow_tmp <- tempfile()

mtcars %>% 
   dplyr::group_by(cyl) %>% 
   arrow::write_dataset(
      path = path_arrow_tmp
   )

base::list.files(path_arrow_tmp, recursive = TRUE, all.files = TRUE)
#> [1] "cyl=4/part-0.parquet" "cyl=6/part-0.parquet" "cyl=8/part-0.parquet"

mtcars_load <- path_arrow_tmp %>% 
   arrow::open_dataset() %>% 
   dplyr::collect()

setequal(mtcars$mpg, mtcars_load$mpg)
#> [1] TRUE

# Change grouping by ".cyl"

path_arrow_tmp_grp <- tempfile()

mtcars %>% 
   dplyr::mutate(.cyl = cyl) %>% 
   dplyr::group_by(.cyl) %>% 
   arrow::write_dataset(
      path = path_arrow_tmp_grp
   )

# the files are there
base::list.files(path_arrow_tmp_grp, recursive = TRUE, all.files = TRUE)
#> [1] ".cyl=4/part-0.parquet" ".cyl=6/part-0.parquet" ".cyl=8/part-0.parquet"

# 0 files detected
path_arrow_tmp_grp %>% 
   arrow::open_dataset()
#> FileSystemDataset with 0 Parquet files

# Specify partitioning manually
# still no files

path_arrow_tmp_grp %>% 
   arrow::open_dataset(
      partitioning = ".cyl",
      hive_style = TRUE
   )
#> FileSystemDataset with 0 Parquet files
#> .cyl: int32
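
A possible workaround, sketched under assumptions: open_dataset() has a factory_options argument, and its selector_ignore_prefixes entry controls which file name prefixes are skipped during discovery. If the installed arrow version exposes it, and if the dot prefix is indeed what hides the partition directories, overriding the prefixes so that only "_" is ignored should let the files be found. I have not verified this against arrow 8.0.0, so treat it as something to try rather than a confirmed fix.

```r
library(arrow)
library(dplyr)

# Write a dataset partitioned by a dot-prefixed column, as in the reprex
path <- tempfile()

mtcars %>%
  mutate(.cyl = cyl) %>%
  group_by(.cyl) %>%
  write_dataset(path = path)

# Assumption: replacing the default ignore prefixes with "_" only means that
# directories starting with "." are no longer skipped during discovery
ds <- open_dataset(
  path,
  factory_options = list(selector_ignore_prefixes = "_")
)

mtcars_load <- collect(ds)
nrow(mtcars_load)
```

If this works, nrow(mtcars_load) should equal nrow(mtcars). Note that relaxing the ignore prefixes also exposes any other dot-prefixed files in the directory to discovery, so it fits best with directories that contain only dataset files.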

Environment:

#> - Session info ---------------------------------------------------------------
#> setting value
#> version R version 4.1.1 (2021-08-10)
#> os Windows 10 x64 (build 19044)
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate English_Switzerland.1252
#> ctype C
#> tz Europe/Berlin
#> date 2022-06-02
#>
#> - Packages -------------------------------------------------------------------
#> package * version date (UTC) lib source
#> backports 1.4.1 2021-12-13 [1] CRAN (R 4.1.2)
#> cli 3.2.0 2022-02-14 [1] CRAN (R 4.1.3)
#> crayon 1.5.0 2022-02-14 [1] CRAN (R 4.1.1)
#> digest 0.6.29 2021-12-01 [1] CRAN (R 4.1.2)
#> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.1.1)
#> fansi 1.0.2 2022-01-14 [1] CRAN (R 4.1.2)
#> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.1)
#> fs 1.5.2 2021-12-08 [1] CRAN (R 4.1.2)
#> glue 1.6.1 2022-01-22 [1] CRAN (R 4.1.2)
#> highr 0.9 2021-04-16 [1] CRAN (R 4.1.1)
#> htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.1.1)
#> knitr 1.37 2021-12-16 [1] CRAN (R 4.1.2)
#> lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.1.1)
#> magrittr 2.0.2 2022-01-26 [1] CRAN (R 4.1.2)
#> pillar 1.7.0 2022-02-01 [1] CRAN (R 4.1.2)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.1)
#> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.2)
#> R.cache 0.15.0 2021-04-30 [1] CRAN (R 4.1.1)
#> R.methodsS3 1.8.1 2020-08-26 [1] CRAN (R 4.1.1)
#> R.oo 1.24.0 2020-08-26 [1] CRAN (R 4.1.1)
#> R.utils 2.11.0 2021-09-26 [1] CRAN (R 4.1.1)
#> reprex 2.0.1 2021-08-05 [1] CRAN (R 4.1.1)
#> rlang 1.0.2 2022-03-04 [1] CRAN (R 4.1.3)
#> rmarkdown 2.11 2021-09-14 [1] CRAN (R 4.1.0)
#> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.0)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.1.2)
#> stringi 1.7.6 2021-11-29 [1] CRAN (R 4.1.2)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.1.1)
#> styler 1.6.2 2021-09-23 [1] CRAN (R 4.1.1)
#> tibble 3.1.6 2021-11-07 [1] CRAN (R 4.1.2)
#> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.2)
#> vctrs 0.4.1 2022-04-13 [1] CRAN (R 4.1.1)
#> withr 2.5.0 2022-03-03 [1] CRAN (R 4.1.3)
#> xfun 0.29 2021-12-14 [1] CRAN (R 4.1.2)
#> yaml 2.2.2 2022-01-25 [1] CRAN (R 4.1.2)
Reporter: Lorenzo Gaborini

Related issues:

Note: This issue was originally created as ARROW-16720. Please see the migration documentation for further details.
