Support at least one binary open standard out of the box #315

chainsawriot · 2023-08-29T14:45:57Z

Apart from the plain text formats, all binary formats supported by this package out of the box are proprietary formats (Excel, SAS, Stata, SPSS), provided by openxlsx, haven, and readxl. These formats are popular and I agree that they should remain the default. However, my proposal is to support at least one open binary format out of the box, which would make it 3 vs 1. I believe that is fairer. It also allows one to convert proprietary formats to a fast but open binary format out of the box.

From our list, there are Apache Parquet, feather, fst, and OASIS ODS. I think Parquet is the ideal candidate for this because it is fast and popular. One drawback is that desktop applications for opening Parquet files are not ubiquitous. ODS, on the other hand, is much slower but has the edge that Excel, LibreOffice, and Google Sheets all support it.

Disclosures of Possible Conflicts of Interest: I am also the maintainer of readODS

chainsawriot · 2023-09-05T09:48:38Z

#340

schochastics · 2023-09-13T08:47:46Z

To understand this correctly: this is about moving arrow or readODS from Suggests to Imports?
Maybe the dependencies should be taken into consideration?

rang::resolve("readODS")
#> resolved: 1 package(s). Unresolved package(s): 0 
#> $`cran::readODS`
#> The latest version of `readODS` [cran] at 2023-08-14 was 2.0.0, which has 30 unique dependencies (18 with no dependencies.)
rang::resolve("arrow")
#> resolved: 1 package(s). Unresolved package(s): 0 
#> $`cran::arrow`
#> The latest version of `arrow` [cran] at 2023-08-14 was 12.0.1.1, which has 14 unique dependencies (9 with no dependencies.)

Created on 2023-09-13 with reprex v2.0.2

chainsawriot · 2023-09-13T09:12:13Z

@schochastics Yes, it is about moving either arrow or readODS from Suggests to Imports.

There was a time when all packages were in Imports. However, that increased the checking time (installing dependencies), and therefore formats that were considered of secondary importance were moved to Suggests. And yes, dependencies should also be taken into consideration.

chainsawriot · 2023-09-13T09:30:32Z

I think this is a better comparison: the additional packages that need to be installed when a package is introduced to Imports. readODS might have a lot of dependencies, but most of them overlap with the existing packages in Imports. Starting from 38 packages, arrow brings the total to 41 and readODS to 40, so the difference between arrow and readODS is just one package. The issue now probably has more to do with utility.

original_deps <- c("tools", "stats", "utils", "foreign", "haven", "curl", "data.table", "readxl", "tibble", "stringi", "writexl", "lifecycle", "R.utils")

ori <- rang::resolve(original_deps, snapshot_date = Sys.Date())
#> Warning: Some package(s) can't be resolved: cran::tools, cran::stats,
#> cran::utils
nrow(rang:::.generate_installation_order(ori))
#> [1] 38

arrow <- rang::resolve(c(original_deps, "arrow"), snapshot_date = Sys.Date())
#> Warning: Some package(s) can't be resolved: cran::tools, cran::stats,
#> cran::utils
nrow(rang:::.generate_installation_order(arrow))
#> [1] 41

readODS <- rang::resolve(c(original_deps, "readODS"), snapshot_date = Sys.Date())
#> Warning: Some package(s) can't be resolved: cran::tools, cran::stats,
#> cran::utils
nrow(rang:::.generate_installation_order(readODS))
#> [1] 40

Created on 2023-09-13 with reprex v2.0.2
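The comparison above boils down to a set difference between two dependency closures. Below is a minimal sketch of that idea on a made-up dependency graph. The package names in `toy_deps` and the helpers `dependency_closure()` and `extra_packages()` are hypothetical and are not part of rang or rio; the real numbers above come from `rang::resolve()`.

```r
# Toy dependency graph (made-up package names, not real CRAN data).
toy_deps <- list(
  base_pkg = character(0),
  helper   = "base_pkg",
  reader   = c("helper", "base_pkg"),
  heavy    = c("helper", "cpp_lib", "bit"),
  cpp_lib  = character(0),
  bit      = character(0)
)

# Collect the given packages plus everything they depend on, recursively.
dependency_closure <- function(pkgs, graph) {
  seen <- character(0)
  queue <- pkgs
  while (length(queue) > 0) {
    current <- queue[1]
    queue <- queue[-1]
    if (current %in% seen) next
    seen <- c(seen, current)
    queue <- c(queue, setdiff(graph[[current]], seen))
  }
  seen
}

# Packages that would be newly installed if `candidate` joined `existing`.
extra_packages <- function(existing, candidate, graph) {
  setdiff(
    dependency_closure(c(existing, candidate), graph),
    dependency_closure(existing, graph)
  )
}

extra_packages("reader", "heavy", toy_deps)
```

In this toy graph, `helper` is already pulled in by `reader`, so it does not count as an extra package for `heavy`. That is the same overlap effect described for readODS.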

chainsawriot · 2023-09-13T09:38:23Z

arrow has two advantages

It provides support for both parquet and feather.
setclass can be extended with one more class: arrow_table

The disadvantage is no desktop software support.

readODS has desktop software support: LibreOffice, Excel, and Google Sheets. Arguably more adoption. It also supports two formats: ods and fods. But it has no potential for improving rio. Also, the future of data science is more likely to be on arrow than ods.

schochastics · 2023-09-13T09:45:26Z

hmm tough decision, but i think my vote is on arrow given its importance for DS.

chainsawriot · 2023-09-13T10:04:03Z

Let's go with arrow then.

chainsawriot · 2023-09-13T18:03:51Z

TODOs

Move arrow to Imports
? add arrow_table as a possible class?

chainsawriot · 2023-09-13T20:25:17Z

Update Internal data

wlandau · 2023-09-18T18:48:15Z

Regarding #315 (comment), another disadvantage of arrow is that it is a really heavy and burdensome package dependency. It takes several minutes to compile, and it has platform-dependent compilation issues such as apache/arrow#30556. On top of that, popular Shiny-related packages like datamods and esquisse depend on rio but do not need arrow.

I noticed rio moved arrow to Imports recently, and this is making it hard for my team to containerize Shiny apps at work. Would it be possible to consider switching arrow back to Suggests? From https://github.com/dreamRs/datamods/blob/6a1331830f397f6fd5fdc742758f6901b690dadc/README.md?plain=1#L83-L84, it looks like rio is fundamental to how datamods works, but the latter only specifically mentions formats like Excel and SPSS.

chainsawriot · 2023-09-18T19:36:09Z

@wlandau Thank you for the input. Unfortunately, it came at a very bad time: rio 1.0.0 (which is supposed to be a stable release) is already on CRAN.

Of course, I don't want you to have a bad experience using rio. What I see is a conflict of visions here: we want to add an open format for computational reproducibility, but in real-world usage people use rio for importing Excel and SPSS.

I believe in Agile and we can make mistakes too. You are giving us feedback on behalf of the user community, so we will listen to it.

I am willing to remove arrow from Imports (although I made the mistake of blogging about rio 1.0.0 already). But please give me at least some days to think about how to mitigate the impact of this on-and-off arrow feature. At the very least, we will need some time to make setclass = 'arrow' optional. I promise you I will deliver this to CRAN before Friday.

In these few days, please, if you really don't want rio to have arrow, use remotes::install_version("rio", "0.5.30") until Friday.
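For readers unfamiliar with how a feature backed by a Suggests package can be made optional, the usual pattern is a runtime check with `requireNamespace()`. The following is only a minimal sketch of that pattern and not rio's actual implementation; the helper name `as_arrow_if_available()` is hypothetical.

```r
# Hypothetical helper (not rio's code): return an Arrow table when the
# optional 'arrow' package is installed, otherwise fall back to a data.frame.
as_arrow_if_available <- function(x) {
  if (!requireNamespace("arrow", quietly = TRUE)) {
    warning("Package 'arrow' is not installed; returning a data.frame.",
            call. = FALSE)
    return(as.data.frame(x))
  }
  arrow::arrow_table(x)
}

res <- suppressWarnings(as_arrow_if_available(head(mtcars)))
class(res)
```

With this kind of guard, arrow can stay in Suggests: users who never ask for an Arrow table never need to install it.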

chainsawriot · 2023-09-19T11:24:17Z

@wlandau I just wanted to let you know that rio 1.0.1 is on CRAN with no arrow dependency. Thank you very much again for your feedback.

https://cran.r-project.org/web/packages/rio/

cc @schochastics

wlandau · 2023-09-19T13:46:46Z

@chainsawriot, thank you very much for accommodating. This change really helps my team develop our infrastructure and tools.

jsonbecker · 2024-05-11T22:33:03Z

This decision has kind of sat wrong with me for a long time. I get the concern about arrow having a heftier compile time, but parquet support is of growing importance. I wonder if the right move is actually to go the other direction? Removing foreign, haven, readxl, and writexl would cut dependencies from 38 to 17 by my count. Removing data.table and tibble goes down to 9.

If rio can't come "batteries included" for some of the most important binary types because that would be too heavy, perhaps rio shouldn't come "batteries included" where possible to dramatically reduce its footprint. That would make it easier for others to depend on rio.

That said, I think depending on rio is generally a bad decision for package developers. I'm not familiar with either datamods or esquisse, but the vision of rio at the start, and its greatest power, is a single interface, especially for R beginners, that's consistent for reading all manner of data types. It's a convenience wrapper, which feels like a bad choice for a dependency.

So another option might be to work upstream with the reverse dependencies for which rio doesn't seem appropriate, so that they remove their reliance on rio, and then go the other way and bulk up the default install.

jsonbecker · 2024-05-11T22:35:57Z

In fact, as of this writing, esquisse does not have a rio dependency:

> tools::package_dependencies(package = "rio", reverse = TRUE)
$rio
 [1] "allMT"               "boxr"                "bruceR"
 [4] "childfree"           "cloudstoR"           "datamods"
 [7] "dataquieR"           "DistPlotter"         "dpmr"
[10] "editData"            "epiCleanr"           "estadistica"
[13] "ExPanDaR"            "framecleaner"        "genogeographer"
[16] "gesisdata"           "heterogen"           "IGoRRR"
[19] "importinegi"         "ISRaD"               "kibior"
[22] "metaConvert"         "mmstat4"             "NormalityAssessment"
[25] "normfluodbf"         "octopus"             "pewdata"
[28] "PRISMA2020"          "psData"              "ropercenter"
[31] "tfrmtbuilder"        "varsExplore"         "welo"

chainsawriot · 2024-05-12T15:20:00Z

@jsonbecker rio is in the Suggests. Soft dependency, maybe. But we still need to check for it when we run revdepcheck.

Thank you for the feedback. I agree with you that the package was meant to be an easy, unified wrapper for interactive usage. But as things naturally evolved, we also need to adapt to the (new) reality that R package developers also use rio, perhaps for the file format detection and the default collection of supported file formats such as Excel and SPSS. It is difficult to judge whether using rio as a dependency is a bad choice. For instance, I would say it makes sense for gesisdata (despite the name, not an official GESIS product) to use rio, because the file download depends on the file extension and most files in the GESIS Data Archive are SPSS, Stata, and Excel. datamods is perhaps a similar story.

This reality makes it more complex to adjust which formats belong in the "Default" and "Suggest" tiers. Increasing the default formats is nice (like @schochastics and I did for rio v1.0.0 to support parquet, 3d91cd5), but it also comes with "do not need x" concerns like the one from @wlandau ¹ . Decreasing the default formats is surely a breaking change. As I said in #307 , we should avoid breaking changes and prioritize stability ².

Having said that, please keep this discourse going. Maybe we can find a good solution to this.

¹ A hidden issue is the maintenance cost of making a format the default, e.g. dealing with the CRAN issues when perhaps arrow breaks. Let's assume my time is infinite to ease the discussion.
² And therefore, I am now prioritizing fixing features that were implemented in the v0.x series but are in a broken state, such as testing and fixing the compression mechanism, rather than adding new formats.

chainsawriot · 2024-05-14T09:32:27Z

Just a slight update: To understand which packages use rio, we should also look at recursive dependencies. rio is indeed a hard dependency of esquisse, via datamods.

tools::package_dependencies(packages = "rio", reverse = TRUE, recursive = TRUE)
#> $rio
#>  [1] "allMT"               "boxr"                "bruceR"             
#>  [4] "childfree"           "cloudstoR"           "datamods"           
#>  [7] "dataquieR"           "DistPlotter"         "dpmr"               
#> [10] "editData"            "epiCleanr"           "estadistica"        
#> [13] "ExPanDaR"            "framecleaner"        "genogeographer"     
#> [16] "gesisdata"           "heterogen"           "IGoRRR"             
#> [19] "importinegi"         "ISRaD"               "kibior"             
#> [22] "metaConvert"         "mmstat4"             "NormalityAssessment"
#> [25] "normfluodbf"         "octopus"             "pewdata"            
#> [28] "PRISMA2020"          "psData"              "ropercenter"        
#> [31] "tfrmtbuilder"        "varsExplore"         "welo"               
#> [34] "ChineseNames"        "PsychWordVec"        "TestAnaAPP"         
#> [37] "esquisse"            "moreparty"           "safetyGraphics"     
#> [40] "vvdoctor"            "ggplotAssist"        "rrtable"            
#> [43] "SemNetCleaner"       "presenter"           "tidybins"           
#> [46] "validata"            "shinyrecipes"        "FMAT"               
#> [49] "webr"                "scicomptools"

Created on 2024-05-14 with reprex v2.1.0

chainsawriot · 2024-05-25T21:16:20Z

Keep an eye on this

https://github.com/r-lib/nanoparquet/

chainsawriot · 2024-05-28T20:03:48Z

According to cransay, nanoparquet has been submitted to CRAN.

chainsawriot · 2024-06-02T09:27:25Z

nanoparquet is now on CRAN.

https://cran.r-project.org/web//packages/nanoparquet/index.html

Min R version is 4.0.0.

chainsawriot · 2024-06-05T19:05:51Z

@wlandau I am thinking about adding back parquet support. But this time, I would like to try it with nanoparquet. I tried it and the compilation took around 10s. Also, a binary package is available from P3M.

I don't want to repeat what happened with v1.0.0, i.e. you noticed the added arrow only after a new version of rio was on CRAN. I was wondering whether you could try and evaluate the development version with nanoparquet later and provide your feedback? If it is not practical, then I will put it in Suggests, like the last time.

Thank you very much!

cc. @jsonbecker @schochastics

chainsawriot · 2024-06-19T12:54:47Z

Maybe I should reach out to the datamods team, e.g. @pvictor .

pvictor · 2024-06-19T14:03:12Z

I'm not sure I understand the problem, is it because datamods (and esquisse) depends on rio?

chainsawriot · 2024-06-20T08:09:35Z

@pvictor datamods is usually dockerized and datamods depends on rio. Therefore, every time datamods users (e.g. @wlandau ) dockerize datamods, they also have to install rio during the image building phase. If rio adds a hard dependency (like arrow previously), it increases the installation time of datamods, especially for dependencies that require C/C++ compilation on Linux ¹. And previously, we reverted a decision to add arrow in order to support parquet.

Now, with the release of nanoparquet by the Posit Team (https://www.tidyverse.org/blog/2024/06/nanoparquet-0-3-0/), it seems to be possible again to add back the support of parquet, without the compilation problems of arrow. But of course, one dependency more is one more dependency to install. And nanoparquet also requires compilation, although the compilation is very fast. I would like to seek your view on how it would impact datamods, and perhaps also the users of datamods.

Adding one light-weight dependency for the support of parquet by default, would that be useful for the users of datamods?
It will also raise the R version requirement to 4.0.0, because nanoparquet asks for >=4.0.0. Currently, we check for 3.5 on CI. Would that also be a problem?

Thank you very much!

¹ Let's assume one doesn't know how to use (or is not able to use) Linux binary installation solutions such as P3M, r2u, or r-universe.

chainsawriot · 2024-07-15T11:00:41Z

@wlandau @pvictor I just wanted to let you know that I have produced a branch that adds back the default support for parquet using nanoparquet. #444

It would be super nice if you could give it a test and see if it has any impact on your use cases. From my testing, it increases the compilation time by 21 seconds on a blank-slate Rocker container. I was wondering if this level of increase in compilation time is acceptable. At least it is not "several minutes" as mentioned here.

I will consider your comments / evaluations before merging it to main. Thank you very much!

cc @jsonbecker @schochastics

wlandau · 2024-07-16T19:58:18Z

It's been a while since I looked at this thread, and the stuff I maintain no longer strongly depends on rio. From my perspective, feel free to do whatever works for you. It's encouraging that the compilation time went down so much.

pvictor · 2024-07-17T13:26:26Z

Great work @chainsawriot , it's great that {rio} supports parquet files!

chainsawriot · 2024-07-26T08:51:07Z

nanoparquet does not support big endian platforms at the moment, and it is not a priority for the developers of nanoparquet (r-lib/nanoparquet#21). It probably won't affect >99% of the users. One possibly affected platform is 32-bit PowerPC Darwin, as reported in #445 . But we have to take this into consideration.

barracuda156 · 2024-07-26T09:12:22Z

@chainsawriot To be explicit, of course I do not expect you to fix anything for big-endian in nanoparquet code. It is just unnecessary to break rio for big-endian platforms due to a dependency, which, however desirable, is not essential.
It is fine if related functionality is unavailable on some platforms; if moving nanoparquet to Suggests is not an acceptable solution, another way is to offer a configure option to disable it (that way default behavior won't change).

P. S. Otherwise any solution from my side will be ugly: either I need to peg rio to an earlier version and never update it, or I need to patch the code to revert nanoparquet, which is a pain to maintain, and MacPorts folks dislike extra patches, or I need to prohibit it for big-endian archs for no good reason, which is not acceptable for me and will potentially hurt some users, however few.
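As an illustration of how parquet functionality could simply be unavailable on some platforms without breaking the rest of a package, here is a minimal sketch. It assumes that nanoparquet sits in Suggests and that big-endian platforms are unsupported, as described in this thread. The helper `parquet_supported()` is hypothetical and is not part of rio or nanoparquet.

```r
# Hypothetical helper: report whether parquet support can be offered here.
# Assumes (per this thread) that nanoparquet does not support big-endian
# platforms, and that it is an optional (Suggests) dependency.
parquet_supported <- function() {
  # .Platform$endian is "little" or "big"
  .Platform$endian == "little" &&
    requireNamespace("nanoparquet", quietly = TRUE)
}

if (parquet_supported()) {
  message("parquet support is available")
} else {
  message("parquet support is not available on this installation")
}
```

Callers can check this before dispatching to a parquet reader and give a clear error otherwise, so other formats keep working on every platform.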

chainsawriot · 2024-07-26T10:07:46Z

@barracuda156 Thank you for chipping in. I actually don't mind moving nanoparquet to "Suggests". We did it previously with arrow anyway. Maybe the timing for supporting parquet is not right.

I will give you an update. I really hope that I can finish it before the CRAN summer break.

chainsawriot · 2024-07-26T16:23:07Z

@barracuda156 I rolled back for now and it should be on CRAN soon. But I really hope that you can help @gaborcsardi to make nanoparquet support big endian platforms because I think you have the expertise.

barracuda156 · 2024-07-27T20:06:42Z

> @barracuda156 I rolled back for now and it should be on CRAN soon. But I really hope that you can help @gaborcsardi to make nanoparquet support big endian platforms because I think you have the expertise.

@chainsawriot Thank you, update merged in macports/macports-ports@97b00a2

chainsawriot added the v1.0 label Aug 29, 2023

chainsawriot pinned this issue Sep 3, 2023

chainsawriot unpinned this issue Sep 6, 2023

chainsawriot changed the title ~~Support at least one binary open standard out of the box~~ Move arrow to Imports Sep 13, 2023

chainsawriot added a commit that referenced this issue Sep 13, 2023

Fix #315

3d91cd5

chainsawriot closed this as completed in 114a735 Sep 13, 2023

chainsawriot reopened this Sep 13, 2023

chainsawriot added a commit that referenced this issue Sep 13, 2023

Update doc fix #315 again

47759ac

chainsawriot closed this as completed in a0da5b9 Sep 13, 2023

chainsawriot reopened this Sep 18, 2023

chainsawriot changed the title ~~Move arrow to Imports~~ Support at least one binary open standard out of the box Sep 18, 2023

chainsawriot removed the v1.0 label Sep 18, 2023

chainsawriot mentioned this issue Sep 18, 2023

Emergency v1.0.1 Sorry guys #376

Closed

3 tasks

chainsawriot pinned this issue Sep 19, 2023

chainsawriot mentioned this issue Sep 19, 2023

Further reduce readODS dependencies ropensci/readODS#173

Open

chainsawriot mentioned this issue Sep 19, 2023

A message to the community #307

Open

chainsawriot mentioned this issue Sep 19, 2023

export .generate_installation_order gesistsa/rang#154

Closed

chainsawriot mentioned this issue Jun 2, 2024

arrow requires 4.0.0 #427

Closed

chainsawriot added a commit that referenced this issue Jul 15, 2024

Add back default support for parquet ref #315

fb3bc51

chainsawriot closed this as completed in 88aa095 Jul 17, 2024

chainsawriot reopened this Jul 26, 2024

chainsawriot mentioned this issue Jul 26, 2024

Make nanoparquet optional dependency #445

Closed

chainsawriot added a commit that referenced this issue Jul 26, 2024

Rollback nanoparquet ref #315

c9a4cf3

chainsawriot added a commit that referenced this issue Jul 26, 2024

Rollback nanoparquet ref #315 (#446)

44439eb

* Rollback nanoparquet ref #315 * Bump ver

chainsawriot unpinned this issue Aug 29, 2024

chainsawriot mentioned this issue Jan 16, 2025

use arrow::read_parquet instead of nanoparquet #462

Open
