Parsing error for some large Stata files #80
So is this a haven problem? Can haven convert it, independently of the dataverse package?
It's relevant for versions before 0.3.0 (but after a patch fixed some get_file errors):

```r
# Version BEFORE 0.3.0, but after a patch fixed some get_file errors
# remotes::install_github("IQSS/dataverse-client-r",
#                         ref = "0b3d67c70cb90dbaf499d65c59972512f4a8eb7a")
library(dataverse)
library(haven)
packageVersion("dataverse")
#> [1] '0.2.1.9001'

out <- get_file(file = "CCES14_Common_Content_Validated.tab",
                dataset = "10.7910/DVN/XFXJVY",
                server = "dataverse.harvard.edu")

# render the raw binary -- should be a tibble
class(read_dta(out))
#> [1] "tbl_df"     "tbl"        "data.frame"
```

Current CRAN version:

```r
library(dataverse)
library(haven)
packageVersion("dataverse")
#> [1] '0.3.0'

out <- get_file(file = "CCES14_Common_Content_Validated.tab",
                dataset = "10.7910/DVN/XFXJVY",
                server = "dataverse.harvard.edu")

# render the raw binary -- should be a tibble
class(read_dta(out))
#> Error in df_parse_dta_raw(spec, encoding, cols_skip, n_max, skip, name_repair = .name_repair): Failed to parse file: This version of the file format is not supported.
```

Created on 2021-02-04 by the reprex package (v0.3.0)
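For context on what `class(read_dta(out))` is exercising: haven's `read_dta()` accepts either a file path or a raw vector, so the raw bytes that `get_file()` returns can be parsed directly or written to disk first. A minimal local round-trip with toy data (no network, not the CCES file) shows both paths behaving the same:

```r
library(haven)

# Round-trip locally: write a small .dta file, then read it back both
# from the path and from a raw vector (the latter mirrors what
# get_file() hands back).
df  <- data.frame(x = 1:3, y = c("a", "b", "c"))
tmp <- tempfile(fileext = ".dta")
write_dta(df, tmp)

from_path <- read_dta(tmp)                        # read from disk
raw_bytes <- readBin(tmp, "raw", file.size(tmp))  # raw vector, as from get_file()
from_raw  <- read_dta(raw_bytes)                  # haven also accepts raw vectors

identical(dim(from_path), dim(from_raw))  # TRUE
```

When both paths parse a file written by haven itself but the downloaded bytes do not, the problem is in the bytes being delivered, not in how they are handed to haven.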
Hey all! I wanted to document some issues I've been having with these files. We use get_file to pull the data, but when I try to use read_dta on the result it fails. If I download the original DTA file onto my local computer, I can then use read_dta without problems.
This is consistent with the first post. Your post is helpful. One possible explanation: httr::GET is corrupting some complex parsing/delimiter details of the data. The "Warning: 1581 parsing failures." messages are warnings, so they may not be relevant to this thread, which is about the error; of course, it might be that the parsing failures are what is driving this. Many of the parse failures here are tab delimiters not working in a few dozen places in the dataset. However, the CCES12_Common_VV.tab that does work also has parsing failure warnings as well.
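One way to test the corruption hypothesis is to compare checksums of the same file downloaded two different ways, and to compare both against the MD5 that Dataverse displays on the file landing page. This is a sketch, not part of the package; the file id in the usage comment is illustrative:

```r
# Helper for checking the "httr::GET corrupts the bytes" hypothesis:
# compare MD5 checksums of two local copies of the same download.
same_md5 <- function(path_a, path_b) {
  sums <- unname(tools::md5sum(c(path_a, path_b)))
  sums[1] == sums[2]
}

# Usage sketch (network; "<id>" is a placeholder, not the actual file id):
# library(httr)
# url  <- "https://dataverse.harvard.edu/api/access/datafile/<id>"
# tmp1 <- tempfile(); tmp2 <- tempfile()
# writeBin(content(GET(url), as = "raw"), tmp1)  # buffered in memory first
# GET(url, write_disk(tmp2, overwrite = TRUE))   # streamed straight to disk
# same_md5(tmp1, tmp2)  # FALSE => the transfer itself mangles bytes
```

If the two local copies match each other but not the MD5 shown in the Dataverse UI, the served file differs from the deposited one, which points away from httr and toward the server side.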
Shiro, when you write "the CCES12_Common_VV.tab that does work also has parsing failure warnings as well" -- when do you see those warnings? When I pull this data with the code I provided in my post, my R console gives me no parsing warnings. My apologies if this is taking us too far afield; I just want to make sure I follow your logic.
To get the warning, I read the 2012 file in as a TSV. For more on ingest, a Dataverse term, see here.
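The parsing-failure warnings from reading a file as a TSV can be reproduced and inspected locally with readr's `problems()`. This toy example (hypothetical data, not the CCES file) has a stray tab that shifts one row's columns:

```r
library(readr)

# A stray tab inside a row yields more fields than the header declares;
# read_tsv() warns about parsing failures and problems() locates them.
bad_tsv <- I("id\tname\tage\n1\tAlice\t30\n2\tBob\textra\t40\n")
df <- read_tsv(bad_tsv, show_col_types = FALSE)
problems(df)  # flags the row that has 4 columns where 3 were expected
```

The `problems()` tibble gives the row, column, and expected/actual values, which is how one can count the "few dozen places" where tab delimiters misbehave in a real file.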
I've done a bit more testing by putting up various versions of the CCES data on a separate demo.dataverse.org account: https://demo.dataverse.org/dataset.xhtml?persistentId=doi%3A10.70122%2FFK2%2FV14QR6

I haven't found why this fails, but it might be related to ingesting. Interestingly, subsets of the problematic data (2014 and 2016) do work: I made 51 partitions of the data, one for each state, and uploaded them to the above link. They all ingested fine and I could read them. I'll have to ask the Dataverse folks for more details about their API in the new year.
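The per-state partition test can be sketched as follows (a toy sketch with hypothetical column names, using haven's `write_dta()`; the real test used the full CCES frame and 51 states):

```r
library(haven)

# Split a survey frame by a grouping column and write one .dta per group,
# mirroring the per-state uploads described above.
write_partitions <- function(df, by, dir = tempdir()) {
  vapply(split(df, df[[by]]), function(grp) {
    path <- file.path(dir, paste0("part_", grp[[by]][1], ".dta"))
    write_dta(grp, path)
    path
  }, character(1))
}

toy <- data.frame(state = c("AL", "AL", "AK"), value = 1:3,
                  stringsAsFactors = FALSE)
paths <- write_partitions(toy, "state")
all(file.exists(paths))  # TRUE: one file per state
```

That the partitions ingest and read fine while the full file does not suggests the failure is tied to file size or to something the partitioning strips out, rather than to the data values themselves.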
@kuriwaki it looks like CCES16_Common_OUTPUT_Feb2018_VV.dta failed ingest, possibly due to its large size (546.7 MB). In IQSS/dataverse#8271 (not merged yet) we're proposing to give the researcher more information about upload limits and ingest limits. Right now it looks like this: (These limits are arbitrary and random, by the way, but I hope you get the idea.) You are welcome to leave feedback on that pull request. Thanks!
Me again. I tried it on demo; the size is allowed, but I got a strange error. Here it is in context:
@pdurbin: great. If it's not failing because of size, that could be a hint as to why we are having trouble reading in the original version, uploaded in 2018 (here). For example, I think this is not a pure Stata file but instead a Stata file translated from an SPSS .sav file via StatTransfer or other software, which might explain the error. I'll look into it.

Does the Harvard Dataverse file that I linked to above show any traces of ingest issues, especially ones that the similarly sized CCES 2012 (here) does not?

And thanks for mentioning the PR; the updated error message would certainly be useful. As you probably know, the message I got in December read "ingest process has finished with errors. ...Ingested files: CCES16_Common_OUTPUT_Feb2018_VV.dta (Error)".
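Since haven's error complains about the .dta format version, one can peek at the version a file claims without haven at all. Format 117+ files (Stata 13 and later) begin with an ASCII header that names the release; older formats put a version byte first. A sketch:

```r
# Read the first bytes of a .dta file and report the Stata release it
# declares. Modern files start with an XML-like header such as
# "<stata_dta><header><release>118</release>...".
peek_dta_release <- function(path) {
  raw_head <- readBin(path, "raw", n = 200)
  # drop NUL bytes so rawToChar() does not error on binary padding
  txt <- rawToChar(raw_head[raw_head != as.raw(0)])
  m <- regmatches(txt, regexpr("<release>[0-9]+</release>", txt))
  if (length(m) == 1) m
  else sprintf("pre-117 format, version byte %d", as.integer(raw_head[1]))
}

# Example with a file written by haven:
tmp <- tempfile(fileext = ".dta")
haven::write_dta(data.frame(x = 1), tmp)
peek_dta_release(tmp)
```

A file exported by StatTransfer or converted from SPSS could declare a release that the installed haven does not support, which would produce exactly the "This version of the file format is not supported" error.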
Ok, I finally have an answer for the narrow issue. I believe it fails because the detection of whether a file is ingested misfires. If we can rewrite is_ingested, that would be great, but it may take a while; I posed the question in #113.
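For reference, one heuristic for ingest detection is that the native API's metadata for an ingested tabular file carries an `originalFileFormat` field, while a plain binary file does not. This is a hedged sketch of that idea only, not necessarily what the package or #113 implements:

```r
# Sketch: decide whether a Dataverse file is ingested from its native-API
# metadata. Ingested tabular files report the original (pre-ingest)
# format; plain binary files do not. The list layout below mimics the
# native API JSON and is illustrative.
is_ingested_sketch <- function(file_metadata) {
  !is.null(file_metadata$dataFile$originalFileFormat)
}

ingested <- list(dataFile = list(originalFileFormat = "application/x-stata-14"))
plain    <- list(dataFile = list(contentType = "application/pdf"))
is_ingested_sketch(ingested)  # TRUE
is_ingested_sketch(plain)     # FALSE
```

A metadata-based check like this avoids relying on side artifacts (such as a generated metadata file) that a partially failed ingest may never produce.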
This will cause some unexpected reading issues in testthat/tests-get_dataframe-dataframe-basketball.R.
This should be fixed in the new CRAN version now. The short answer is that the particular CCES files we were dealing with seem to have been somewhat mal-ingested: the ingest did go through, but it did not generate a metadata file, and we had been relying on the latter for detection. By changing to a less hacky approach in #113, we can now avoid this issue. Thanks @pdurbin. And @pdeshlab, thanks for the examples; let me know what you find with the new CRAN version.
The download and parsing for this large file errors out, though I'm not sure why. The get_file stage succeeds; it only fails in the parsing/reading stage. It is a large Stata dta file, and we treat it as not ingested even though in the GUI it appears to be ingested.