
Parsing error for some large Stata files #80

Closed
kuriwaki opened this issue Jan 29, 2021 · 12 comments
Labels: bug, data-download (Functions that are about downloading, not uploading, data)
Milestone: CRAN 0.3.10

@kuriwaki
Member

The download and parsing for this large file errors out, though I'm not sure why.

The get_file stage succeeds; it only fails in the parsing/reading stage. It is a large Stata .dta file, and our code detects it as not ingested even though in the GUI it appears to be ingested.

out <- get_dataframe_by_doi("https://doi.org/10.7910/DVN/XFXJVY/GQMQLG")
# Error in get_dataframe_by_id(fileid = filedoi, .f = .f, original = original,  : 
#   read-in function was left NULL, but the target file is not ingested or you asked for 
# the original version. Please supply a .f argument.

out <- get_dataframe_by_doi("https://doi.org/10.7910/DVN/XFXJVY/GQMQLG",  original = TRUE, .f = haven::read_dta)
# Error in df_parse_dta_file(spec, encoding, cols_skip, n_max, 
# skip, name_repair = .name_repair) : 
#   Failed to parse /private/var/folders/58/bg0qcxrs7s7fgf3x_qlj8njw0000gn/T/Rtmpi5QTMl/foo128340268bdd: 
# This version of the file format is not supported.

# the raw version works.
out <- get_file_by_doi("https://doi.org/10.7910/DVN/XFXJVY/GQMQLG")
class(out)
# [1] "raw"
@wibeasley
Contributor

So is this a haven problem? Can haven convert it, independently from the dataverse package?
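
Something along these lines would take the dataverse package out of the loop entirely (a sketch; the persistentId/format=original query form is my assumption based on the Dataverse Data Access API docs):

# sketch: fetch the original .dta straight from the Dataverse Access API,
# bypassing the dataverse R package, then parse it from disk with haven
url <- paste0(
  "https://dataverse.harvard.edu/api/access/datafile/:persistentId",
  "?persistentId=doi:10.7910/DVN/XFXJVY/GQMQLG&format=original"
)
tmp <- tempfile(fileext = ".dta")
download.file(url, tmp, mode = "wb")  # "wb" so Windows doesn't mangle the bytes
class(haven::read_dta(tmp))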

@kuriwaki
Member Author

kuriwaki commented Feb 5, 2021

It's relevant for dataverse in the sense that this was not a problem before 0.3.0 (see the before and after as separate reprexes below). I think it might be something to do with the defaults in httr::GET -- maybe they were inadvertently changed.

# Version BEFORE 0.3.0, but after a patch fixed some get_file errors
# remotes::install_github("IQSS/dataverse-client-r", 
#                         ref = "0b3d67c70cb90dbaf499d65c59972512f4a8eb7a")

library(dataverse)
library(haven)

packageVersion("dataverse")
#> [1] '0.2.1.9001'

out <- get_file(file = "CCES14_Common_Content_Validated.tab", 
                dataset  = "10.7910/DVN/XFXJVY", 
                server = "dataverse.harvard.edu")

# render the raw binary -- should be a tibble
class(read_dta(out))
#> [1] "tbl_df"     "tbl"        "data.frame"

# Current CRAN version

library(dataverse)
library(haven)

packageVersion("dataverse")
#> [1] '0.3.0'

out <- get_file(file = "CCES14_Common_Content_Validated.tab", 
                dataset  = "10.7910/DVN/XFXJVY", 
                server = "dataverse.harvard.edu")

# render the raw binary -- should be a tibble
class(read_dta(out))
#> Error in df_parse_dta_raw(spec, encoding, cols_skip, n_max, skip, name_repair = .name_repair): Failed to parse file: This version of the file format is not supported.

Created on 2021-02-04 by the reprex package (v0.3.0)

@pdeshlab

pdeshlab commented Oct 27, 2021

Hey all! I wanted to document some issues I've been having with the dataverse package. They are also related to CES data, though I do not think they are unique to it. The package seems to have difficulty parsing large DTA files that are stored as TAB files on Dataverse:

ces2016_dataverse <- get_dataframe_by_name(
     filename = "CCES16_Common_OUTPUT_Feb2018_VV.tab",
     dataset = "10.7910/DVN/GDF6Z0",
     original = TRUE,
     .f = readr::read_tsv,
     server = "dataverse.harvard.edu"
)

# Warning: 5559 parsing failures.
# See problems(...) for more details.

For the 2012 file, we can use read_dta to obtain the original DTA file without issue:

ces2012_dataverse <- get_dataframe_by_name(
  filename = "CCES12_Common_VV.tab",
  dataset = "10.7910/DVN/HQEVPK",
  original = TRUE,
  .f = haven::read_dta,
  server = "dataverse.harvard.edu"
)

But when I try to use read_dta with the 2016 file mentioned above, my code errors out completely. Instead of getting parsing warnings, the file is not read in at all. Here is the error message I get:

ces2016_dataverse <- get_dataframe_by_name(
  filename = "CCES16_Common_OUTPUT_Feb2018_VV.tab",
  dataset = "10.7910/DVN/GDF6Z0",
  original = TRUE,
  .f = haven::read_dta,
  progress = TRUE,
  server = "dataverse.harvard.edu" )

# Error: Failed to parse /private/var/folders/xr/xb6wvj7d3yd29p9mf04syjkw0000gn/T/Rtmp1rMbOV/fooe2db5c47f2c4: This version of the file format is not supported.

Same for filename = "CCES14_Common_Content_Validated.tab", dataset = "doi:10.7910/DVN/XFXJVY".

If I download the original DTA file onto my local computer, I can then use read_dta on it with no problems. Let me know if I can provide any other documentation to help with this issue.

@kuriwaki changed the title from "Parsing error -- corner case" to "Parsing error for some large Stata files" on Oct 28, 2021
@kuriwaki
Member Author

This is consistent with the first post. Your post is helpful. One possible explanation: httr::GET is corrupting some of the more complex parsing/delimiter details of the data, so when read_dta tries to read the download from the tempfile, it cannot. The download itself does succeed.

The "Warning: 1581 parsing failures." messages are warnings, so they may not be relevant to this thread, which is about the error. Of course, it might be that the parsing failures are what is driving this; many of the parse failures here are tab delimiters not working in a few dozen places in the dataset. However, the CCES12_Common_VV.tab file that does work produces parsing failure warnings as well.

@pdeshlab

Shiro, when you write "the CCES12_Common_VV.tab that does work also has parsing failure warnings as well" -- when do you see those warnings? When I pull this data with the code I provided in my post, my R console gives me no parsing warnings. My apologies if this is taking us too far afield; I just want to make sure I follow your logic.

@kuriwaki
Member Author

To get the warning, I read in the 2012 file as a TSV with original = FALSE (that is, the ingested version).
I see that your read_tsv code for 2014 and 2016 had original = TRUE. I'm surprised that worked at all: original = TRUE for an ingested dataset should download the original format (Stata .dta) as binary, so read_tsv should not have been able to read it.

ces2012_dataverse <- get_dataframe_by_name(
    filename = "CCES12_Common_VV.tab",
    dataset = "10.7910/DVN/HQEVPK",
    original = FALSE,
    .f = read_tsv,
    server = "dataverse.harvard.edu")

# Warning message:
# One or more parsing issues, see `problems()` for details

For more on ingest, a Dataverse term, see here.

@kuriwaki kuriwaki added this to the CRAN 0.3.10 milestone Dec 21, 2021
@kuriwaki kuriwaki added bug data-download Functions that are about downloading, not uploading, data labels Dec 24, 2021
@kuriwaki
Member Author

I've done a bit more testing by putting up various versions of the CCES data on a separate demo.dataverse.org account: https://demo.dataverse.org/dataset.xhtml?persistentId=doi%3A10.70122%2FFK2%2FV14QR6

I haven't found why this fails, but it might be related to ingesting. Interestingly, subsets of the problematic data (2014 and 2016) do work: I made 51 state-by-state partitions of the data and uploaded them to the dataset linked above. They all ingested fine, and I could read them via get_dataframe_by_name with no problem. However, when I tried to upload the whole 2016 data again, demo.dataverse.org said it failed to ingest and pushed the file as an original .dta (see here). This is different from what happened with the original 2016 dataset.

I'll have to ask Dataverse folks for more details about their API in the new year.

@pdurbin
Member

pdurbin commented Jan 3, 2022

@kuriwaki it looks like CCES16_Common_OUTPUT_Feb2018_VV.dta failed ingest. It's possibly due to the large size (546.7 MB).

In IQSS/dataverse#8271 (not merged yet) we're proposing to give the researcher more information about upload limits and ingest limits. Right now it looks like this:

[Screenshot: proposed message showing upload and ingest limits]

(These limits are arbitrary and random, by the way, but I hope you get the idea.)

You are welcome to leave feedback on that pull request. Thanks!

@pdurbin
Member

pdurbin commented Jan 3, 2022

It's possibly due to the large size (546.7 MB).

Me again. I tried it on demo and the size is allowed, but I got a strange error:

Tabular data ingest failed. Ingest failed to produce Summary Statistics and/or UNF signatures; /tmp/tempTabfile.9853244723487760894.tab (No such file or directory)

Here it is in context:

[Screenshot: the ingest failure message in context on demo.dataverse.org]

@kuriwaki
Member Author

kuriwaki commented Jan 3, 2022

@pdurbin: great. If it's not failing because of size, that could be a hint as to why we are having trouble reading in the original version, uploaded in 2018 (here). For example, I suspect this is not a pure Stata file but rather a Stata file translated from an SPSS .sav file via StatTransfer or other software, which might explain the error. I'll look into it.

Does the Harvard Dataverse file that I linked to ^ show any traces of ingest issues -- especially ones that the similarly sized CCES 2012 file (here) does not?

And thanks for mentioning the PR -- the updated error message would certainly be useful. As you probably know, the message I got in December read "ingest process has finished with errors. ...Ingested files: CCES16_Common_OUTPUT_Feb2018_VV.dta (Error)"

@kuriwaki
Member Author

kuriwaki commented Jan 7, 2022

Ok, I finally have an answer for the narrow issue. I believe it fails because the detection in is_ingested is wrong. When is_ingested returns FALSE even though the file is ingested, we nullify format (instead of keeping format = "original"), which breaks the way we download the file.

If we can rewrite is_ingested, that would be great, but that may take a while -- I posed the question in #113.
Perhaps a quicker route for now is to change some defaults. For example, we might force format = "original" whenever we know that original = TRUE (the get_dataframe_* functions currently default to original = FALSE), so that we get the original regardless of the accuracy of is_ingested.
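
Roughly, the change I have in mind looks like this (a sketch with illustrative names, not the package's actual internals):

# sketch: trust the caller's original = TRUE instead of the is_ingested guess
resolve_format <- function(original, is_ingested_guess) {
  if (isTRUE(original)) {
    "original"  # always request the original file when the caller asks for it
  } else if (isTRUE(is_ingested_guess)) {
    NULL        # ingested file: take the server's default (.tab) export
  } else {
    "original"  # not ingested: only the original exists
  }
}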

kuriwaki added a commit that referenced this issue Jan 7, 2022
Will cause some unexpected reading issues in testthat/tests-get_dataframe-dataframe-basketball.R
@kuriwaki
Member Author

This should be fixed in the new CRAN version now. The short answer is that the particular CCES files we were dealing with were somewhat mal-ingested: the ingest did go through, but it did not generate a metadata file, and we had been relying on that file to detect ingestion. By switching to a less hacky detection method in #113, we can now avoid this issue.

Thanks @pdurbin. And @pdeshlab, thanks for the examples and let me know what you find with the new CRAN version.
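
For reference, after updating, the reprex from the top of the thread should run as posted:

# after installing the new CRAN release
out <- get_dataframe_by_doi(
  "https://doi.org/10.7910/DVN/XFXJVY/GQMQLG",
  original = TRUE,
  .f = haven::read_dta
)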
