Parsing error for some large Stata files #80
So is this a haven problem? Can haven convert it, independently of the dataverse package?
It's relevant for versions before 0.3.0 (but after a patch fixed some get_file errors):

```r
# Version BEFORE 0.3.0, but after a patch fixed some get_file errors
# remotes::install_github("IQSS/dataverse-client-r",
#                         ref = "0b3d67c70cb90dbaf499d65c59972512f4a8eb7a")
library(dataverse)
library(haven)
packageVersion("dataverse")
#> [1] '0.2.1.9001'

out <- get_file(file = "CCES14_Common_Content_Validated.tab",
                dataset = "10.7910/DVN/XFXJVY",
                server = "dataverse.harvard.edu")

# render the raw binary -- should be a tibble
class(read_dta(out))
#> [1] "tbl_df"     "tbl"        "data.frame"
```

Current CRAN version:

```r
library(dataverse)
library(haven)
packageVersion("dataverse")
#> [1] '0.3.0'

out <- get_file(file = "CCES14_Common_Content_Validated.tab",
                dataset = "10.7910/DVN/XFXJVY",
                server = "dataverse.harvard.edu")

# render the raw binary -- should be a tibble
class(read_dta(out))
#> Error in df_parse_dta_raw(spec, encoding, cols_skip, n_max, skip, name_repair = .name_repair): Failed to parse file: This version of the file format is not supported.
```

Created on 2021-02-04 by the reprex package (v0.3.0)
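For context on what `class(read_dta(out))` is exercising: haven's `read_dta()` accepts either a file path or a raw vector, so the raw bytes that `get_file()` returns can be parsed directly or written to disk first. A minimal local round-trip with toy data (no network, not the CCES file) shows both paths behaving the same:

```r
library(haven)

# Round-trip locally: write a small .dta file, then read it back both
# from the path and from a raw vector (the latter mirrors what
# get_file() hands back).
df  <- data.frame(x = 1:3, y = c("a", "b", "c"))
tmp <- tempfile(fileext = ".dta")
write_dta(df, tmp)

from_path <- read_dta(tmp)                        # read from disk
raw_bytes <- readBin(tmp, "raw", file.size(tmp))  # raw vector, as from get_file()
from_raw  <- read_dta(raw_bytes)                  # haven also accepts raw vectors

identical(dim(from_path), dim(from_raw))  # TRUE
```

When both paths parse a file written by haven itself but the downloaded bytes do not, the problem is in the bytes being delivered, not in how they are handed to haven.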
Hey all! I wanted to document some issues I've been having with these files. We use get_file to pull the data, but when I try to use read_dta on the result it fails. If I download the original DTA file onto my local computer, I can then use read_dta without problems.
This is consistent with the first post. Your post is helpful. One possible explanation: httr::GET is corrupting some complex parsing/delimiter details of the data. The "Warning: 1581 parsing failures." messages are warnings, so they may not be relevant to this thread, which is about the error; of course, it might be that the parsing failures are what is driving this. Many of the parse failures here are tab delimiters not working in a few dozen places in the dataset. However, the CCES12_Common_VV.tab that does work also has parsing failure warnings as well.
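One way to test the corruption hypothesis is to compare checksums of the same file downloaded two different ways, and to compare both against the MD5 that Dataverse displays on the file landing page. This is a sketch, not part of the package; the file id in the usage comment is illustrative:

```r
# Helper for checking the "httr::GET corrupts the bytes" hypothesis:
# compare MD5 checksums of two local copies of the same download.
same_md5 <- function(path_a, path_b) {
  sums <- unname(tools::md5sum(c(path_a, path_b)))
  sums[1] == sums[2]
}

# Usage sketch (network; "<id>" is a placeholder, not the actual file id):
# library(httr)
# url  <- "https://dataverse.harvard.edu/api/access/datafile/<id>"
# tmp1 <- tempfile(); tmp2 <- tempfile()
# writeBin(content(GET(url), as = "raw"), tmp1)  # buffered in memory first
# GET(url, write_disk(tmp2, overwrite = TRUE))   # streamed straight to disk
# same_md5(tmp1, tmp2)  # FALSE => the transfer itself mangles bytes
```

If the two local copies match each other but not the MD5 shown in the Dataverse UI, the served file differs from the deposited one, which points away from httr and toward the server side.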
Shiro, when you write "the CCES12_Common_VV.tab that does work also has parsing failure warnings as well" -- when do you see those warnings? When I pull this data with the code I provided in my post, my R console gives me no parsing warnings. My apologies if this is taking us too far afield; I just want to make sure I follow your logic.
To get the warning, I read the 2012 file in as a TSV. For more on ingest, a Dataverse term, see here.
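The parsing-failure warnings from reading a file as a TSV can be reproduced and inspected locally with readr's `problems()`. This toy example (hypothetical data, not the CCES file) has a stray tab that shifts one row's columns:

```r
library(readr)

# A stray tab inside a row yields more fields than the header declares;
# read_tsv() warns about parsing failures and problems() locates them.
bad_tsv <- I("id\tname\tage\n1\tAlice\t30\n2\tBob\textra\t40\n")
df <- read_tsv(bad_tsv, show_col_types = FALSE)
problems(df)  # flags the row that has 4 columns where 3 were expected
```

The `problems()` tibble gives the row, column, and expected/actual values, which is how one can count the "few dozen places" where tab delimiters misbehave in a real file.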
I've done a bit more testing by putting up various versions of the CCES data on a separate demo.dataverse.org account: https://demo.dataverse.org/dataset.xhtml?persistentId=doi%3A10.70122%2FFK2%2FV14QR6

I haven't found why this fails, but it might be related to ingesting. Interestingly, subsets of the problematic data (2014 and 2016) do work: I made 51 partitions of the data, one for each state, and uploaded them to the above link. They all ingested fine and I could read them. I'll have to ask the Dataverse folks for more details about their API in the new year.
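The per-state partition test can be sketched as follows (a toy sketch with hypothetical column names, using haven's `write_dta()`; the real test used the full CCES frame and 51 states):

```r
library(haven)

# Split a survey frame by a grouping column and write one .dta per group,
# mirroring the per-state uploads described above.
write_partitions <- function(df, by, dir = tempdir()) {
  vapply(split(df, df[[by]]), function(grp) {
    path <- file.path(dir, paste0("part_", grp[[by]][1], ".dta"))
    write_dta(grp, path)
    path
  }, character(1))
}

toy <- data.frame(state = c("AL", "AL", "AK"), value = 1:3,
                  stringsAsFactors = FALSE)
paths <- write_partitions(toy, "state")
all(file.exists(paths))  # TRUE: one file per state
```

That the partitions ingest and read fine while the full file does not suggests the failure is tied to file size or to something the partitioning strips out, rather than to the data values themselves.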
@kuriwaki it looks like CCES16_Common_OUTPUT_Feb2018_VV.dta failed ingest, possibly due to its large size (546.7 MB). In IQSS/dataverse#8271 (not merged yet) we're proposing to give the researcher more information about upload limits and ingest limits. Right now it looks like this: (These limits are arbitrary and random, by the way, but I hope you get the idea.) You are welcome to leave feedback on that pull request. Thanks!
Me again. I tried it on demo; the size is allowed, but I got a strange error. Here it is in context:
@pdurbin: great. If it's not failing because of size, that could be a hint as to why we are having trouble reading in the original version, uploaded in 2018 (here). For example, I think this is not a pure Stata file but instead a Stata file translated from an SPSS .sav file via StatTransfer or other software, which might explain the error. I'll look into it.

Does the Harvard Dataverse file that I linked to above show any traces of ingest issues, especially ones that the similarly sized CCES 2012 (here) does not?

And thanks for mentioning the PR; the updated error message would certainly be useful. As you probably know, the message I got in December read "ingest process has finished with errors. ...Ingested files: CCES16_Common_OUTPUT_Feb2018_VV.dta (Error)".
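Since haven's error complains about the .dta format version, one can peek at the version a file claims without haven at all. Format 117+ files (Stata 13 and later) begin with an ASCII header that names the release; older formats put a version byte first. A sketch:

```r
# Read the first bytes of a .dta file and report the Stata release it
# declares. Modern files start with an XML-like header such as
# "<stata_dta><header><release>118</release>...".
peek_dta_release <- function(path) {
  raw_head <- readBin(path, "raw", n = 200)
  # drop NUL bytes so rawToChar() does not error on binary padding
  txt <- rawToChar(raw_head[raw_head != as.raw(0)])
  m <- regmatches(txt, regexpr("<release>[0-9]+</release>", txt))
  if (length(m) == 1) m
  else sprintf("pre-117 format, version byte %d", as.integer(raw_head[1]))
}

# Example with a file written by haven:
tmp <- tempfile(fileext = ".dta")
haven::write_dta(data.frame(x = 1), tmp)
peek_dta_release(tmp)
```

A file exported by StatTransfer or converted from SPSS could declare a release that the installed haven does not support, which would produce exactly the "This version of the file format is not supported" error.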
Ok, I finally have an answer for the narrow issue. I believe it fails because the detection of whether a file is ingested misfires. If we can rewrite is_ingested, that would be great, but it may take a while; I posed the question in #113.
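For reference, one heuristic for ingest detection is that the native API's metadata for an ingested tabular file carries an `originalFileFormat` field, while a plain binary file does not. This is a hedged sketch of that idea only, not necessarily what the package or #113 implements:

```r
# Sketch: decide whether a Dataverse file is ingested from its native-API
# metadata. Ingested tabular files report the original (pre-ingest)
# format; plain binary files do not. The list layout below mimics the
# native API JSON and is illustrative.
is_ingested_sketch <- function(file_metadata) {
  !is.null(file_metadata$dataFile$originalFileFormat)
}

ingested <- list(dataFile = list(originalFileFormat = "application/x-stata-14"))
plain    <- list(dataFile = list(contentType = "application/pdf"))
is_ingested_sketch(ingested)  # TRUE
is_ingested_sketch(plain)     # FALSE
```

A metadata-based check like this avoids relying on side artifacts (such as a generated metadata file) that a partially failed ingest may never produce.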
This will cause some unexpected reading issues in testthat/tests-get_dataframe-dataframe-basketball.R.
This should be fixed in the new CRAN version now. The short answer is that the particular CCES files we were dealing with seem to have been somewhat mal-ingested: the ingest did go through, but it did not generate a metadata file, and we had been relying on the latter for detection. By changing to a less hacky approach in #113, we can now avoid this issue. Thanks @pdurbin. And @pdeshlab, thanks for the examples; let me know what you find with the new CRAN version.
The download and parsing for this large file errors out, though I'm not sure why. The get_file stage succeeds; it only fails in the parsing/reading stage. It is a large Stata dta file, and we treat it as not ingested even though in the GUI it appears to be ingested.