SPIKE: Improve how Dataverse labels shapefiles to prevent mislabelling of zip files that aren't shapefiles #8945
Comments
I tried the redetect file type API endpoint. It reported that it worked, but the file is still labelled as a "Shapefile as ZIP Archive". Lastly, I downloaded the zip file, double zipped it again and uploaded it to Demo Dataverse to see if Demo would label it as a "Shapefile as ZIP Archive". It did. (The dataset was deleted along with other datasets older than 30 days.)
A .zip file gets labeled as a Shapefile if any of the included files has an extension in ["shp", "shx", "dbf", "prj"]. I can't see your example file - does it have one of these? If so, we could/should tighten up the logic to test for all four, since all four are required and someone may have a .prj or other single extension for some other reason. If there are no files with these extensions, then something else is happening.
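To make the suggested tightening concrete, here is a minimal sketch of an "all four" check. It is not the Dataverse implementation (which is Java); the function name `complete_shapefile_sets` is hypothetical, and the rule that the four files must share a directory and base name is an assumption of this sketch.

```python
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath

# Extensions named in the comment above. Treating all four as mandatory
# follows that comment and is an assumption of this sketch.
REQUIRED_EXTENSIONS = frozenset({"shp", "shx", "dbf", "prj"})


def complete_shapefile_sets(zip_file):
    """Return sorted 'dir/basename' stems that have all four required extensions.

    zip_file may be a path or a binary file-like object.
    """
    found = defaultdict(set)
    with zipfile.ZipFile(zip_file) as archive:
        for name in archive.namelist():
            path = PurePosixPath(name)
            ext = path.suffix.lower().lstrip(".")
            if ext in REQUIRED_EXTENSIONS:
                # Group by path without the extension, e.g. "data/roads".
                found[str(path.with_suffix(""))].add(ext)
    # Keep only stems for which every required extension was seen.
    return sorted(stem for stem, exts in found.items() if exts >= REQUIRED_EXTENSIONS)
```

Under this rule a zip would only be a shapefile candidate when the returned list is non-empty, so a stray .prj on its own would no longer trigger the label.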
Hi @qqmyers. There are no shape files in the zip file.
@pdurbin found files like “pointZ.dbf pointZ.prj pointZ.shp pointZ.shx” in a hidden directory inside of the zip file. "They seem to come from an R package called “maptools”. The path in the zip is replication/rpkgs/.checkpoint/2020-07-30/lib/x86_64-w64-mingw32/4.0.2/maptools/shapes." The depositor wrote that "the zip file does not contain any shape files." I'm not sure if the depositor's scripts use the maptools package. I've asked the depositor:
I haven't heard from the depositor yet. I just sent a follow-up email. I also took a look at the R files in the zip file and didn't see a maptools package being imported, but I'm not very familiar with R, so I asked the depositor some clarifying questions about that too.
The depositor let me know that they don't think they directly used maptools in the replication, but it's possible that other packages require maptools, and it's tough to figure out which packages depend on which, so they'd rather keep the current files as they are. This sounds to me like the zip file as a whole is not a shapefile, and it shouldn't be labelled as one only because a library used by the code files includes hidden directories with shapefiles in them. Could the file detection feature in the Dataverse software be adjusted so that it doesn't label this and other zip files like it as "Shapefile as ZIP Archive"? I remember hearing that the API also lets us upload files and specify any file type. I thought that could be a workaround for this depositor, so I tested this on Demo Dataverse:
But it doesn't work for this zip file. The uploaded file is still labelled as "Shapefile as ZIP Archive" and the response in my terminal shows "contentType":"application/zipped-shapefile". (It does work for a PNG file I tried.) @mreekie, in #8816 we wrote about planning to talk with others who know more about the preservation and use of shapefiles. I'm wondering if those folks can also weigh in on this.
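One possible adjustment, given that the shapefile components here sit under a hidden directory (`.checkpoint`), would be to ignore entries below any dot-prefixed directory when looking for shapefile extensions. The sketch below only illustrates that idea; `in_hidden_directory` and `visible_shapefile_members` are hypothetical names, it is not how Dataverse currently works, and it does not look inside nested (double zipped) archives.

```python
import zipfile
from pathlib import PurePosixPath

# Extensions that currently trigger the shapefile label, per the thread.
SHAPEFILE_EXTENSIONS = frozenset({"shp", "shx", "dbf", "prj"})


def in_hidden_directory(name):
    """True if any directory component of the entry path starts with a dot."""
    return any(part.startswith(".") for part in PurePosixPath(name).parts[:-1])


def visible_shapefile_members(zip_file):
    """List entries with a shapefile extension that are not under a hidden directory.

    zip_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_file) as archive:
        return [
            name
            for name in archive.namelist()
            if PurePosixPath(name).suffix.lower().lstrip(".") in SHAPEFILE_EXTENSIONS
            and not in_hidden_directory(name)
        ]
```

With a filter like this, the maptools sample files under `replication/rpkgs/.checkpoint/...` would be skipped, while shapefile components in ordinary directories would still be found.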
Moving this out of the Harvard Dataverse Repository GitHub repo and into the Dataverse software GitHub repo.
I gave this a 10. Hopefully it's straightforward and we have a file to try to reproduce the problem. ^^ We most recently touched this code here:
A depositor uploaded a double zipped file into a dataset in the Harvard Dataverse Repository and the file has been incorrectly labelled as a "Shapefile as ZIP Archive".
The file is in the published dataset at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/HWVUER.
There are no shape files in the zip file and the depositor wrote that it isn't a shapefile. The depositor also wrote that they used the UI (their Chrome browser) to upload the file (and not the Dataverse API). The email conversation with the depositor is at https://help.hmdc.harvard.edu/Ticket/Display.html?id=322790.
The file needs to be correctly labelled as a "ZIP Archive". Having it labelled as a "Shapefile as ZIP Archive" might be confusing to anyone looking to download the data.