This repository has been archived by the owner on Aug 28, 2024. It is now read-only.

proof of concept/reference code for Dataverse out of band uploader


BerkeleyLibrary/deo-volente_old


deo-volente

deo-volente is a collection of scripts and utility code for loading data into Dataverse.

job structure

```mermaid
flowchart
    start(["start(dir, doi)"]) --> walkDirectory["walkDirectory(dir)"]
        --> forEachFile --> fileSuccess{"success?"} -->|true| updateDataverse
        --> success{"success?"} -->|true| End(["end"])

    subgraph forEachFile["forEachFile (map)"]
        handleFile --> copyToDVStorage

        subgraph handleFile["handleFile(filename)"]
            generateStorageIdentifier --> fileMetadata
            getPathname --> splitPathAndFileName & getMimeType & md5Hash --> fileMetadata[/"fileMetadata"/]
        end

        subgraph copyToDVStorage
            dest["dest = dvFilesDir + splitDoi + storageIdentifier"] -->
            copyFile["copyFile(source, dest)"] --> compareMD5["compareMD5(source, dest)"]
        end
    end

    subgraph updateDataverse["updateDataverse (reduce)"]
        direction TB
        note["presumes we send one blob of JSON for all files.
        could post single files or batches."] -.->
        prepareJSON --> postJSONToDataverse --> logDataverseFileID[(logDataverseFileID)]
    end

    fileSuccess -->|false| fileFail(["fail (file job)"]) -.-> Fail
    success -->|false| Fail(["fail (dataset job)"])
```
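The per-file `handleFile` step in the flowchart can be sketched roughly as below. The function names and the exact shape of the `fileMetadata` record are illustrative assumptions, not the repo's actual code:

```python
import hashlib
import mimetypes
import os
import uuid

def generate_storage_identifier() -> str:
    # Stand-in: Dataverse storage identifiers are opaque strings;
    # a random UUID hex serves for illustration.
    return uuid.uuid4().hex

def md5_hash(path: str) -> str:
    # Stream the file in 1 MiB chunks so large files don't load into memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def handle_file(path: str) -> dict:
    # Mirrors handleFile(filename): split path, guess MIME type,
    # checksum, and assemble the fileMetadata record.
    directory, file_name = os.path.split(path)
    mime_type, _ = mimetypes.guess_type(file_name)
    return {
        "storageIdentifier": generate_storage_identifier(),
        "directoryLabel": directory,
        "fileName": file_name,
        "mimeType": mime_type or "application/octet-stream",
        "md5Hash": md5_hash(path),
    }
```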

large file workaround, mk ii

a workaround for files whose uploads fail because of size-related timeouts. this requires `dataverse.files.<id>.upload-out-of-band=true` to be set on the server (see the Dataverse documentation).

for each file in a batch:

  • generate a storageIdentifier for the file
  • upload the file to that storageIdentifier in the correct Dataverse files subdirectory (i.e. `files/$PERSISTENT_IDENTIFIER`)
  • create a JSON blob like so:
```json
{
    "directoryLabel": "${relativePathInDataset}",
    "storageIdentifier": "file://${storageIdentifier}",
    "fileName": "${fileName.ext}",
    "mimeType": "${mime/type}",
    "description": "",
    "md5Hash": "${checksum}"
}
```
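The upload step (`copyToDVStorage` in the flowchart) might look like this sketch. The directory layout and helper names are assumptions based on the `dest = dvFilesDir + splitDoi + storageIdentifier` node:

```python
import hashlib
import os
import shutil

def _md5(path: str) -> str:
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def copy_to_dv_storage(source: str, dv_files_dir: str, persistent_id: str,
                       storage_identifier: str) -> str:
    # Assumed layout: <dvFilesDir>/<persistent identifier>/<storageIdentifier>.
    dest = os.path.join(dv_files_dir, persistent_id, storage_identifier)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.copyfile(source, dest)
    # compareMD5(source, dest): verify the copy before trusting it.
    if _md5(source) != _md5(dest):
        raise IOError(f"md5 mismatch: {source} -> {dest}")
    return dest
```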

then, when you have a batch:

  • make an array of all the JSON objects
  • `POST` to `$SERVER_URL/api/datasets/:persistentId/addFiles?persistentId=$PERSISTENT_IDENTIFIER` with form data `jsonData=$JSON_DATA`
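The batch step above can be sketched with the standard library as follows. The helper name is ours; note the documented Dataverse invocation uses multipart form data (curl `-F`), while this sketch sends a urlencoded body for brevity:

```python
import json
import urllib.parse
import urllib.request

def build_add_files_request(server_url: str, persistent_id: str,
                            files_json: list, api_token: str) -> urllib.request.Request:
    # POST $SERVER_URL/api/datasets/:persistentId/addFiles?persistentId=...
    url = (f"{server_url}/api/datasets/:persistentId/addFiles"
           f"?persistentId={urllib.parse.quote(persistent_id, safe=':/')}")
    # The form field jsonData carries the array of file metadata objects.
    body = urllib.parse.urlencode({"jsonData": json.dumps(files_json)}).encode()
    req = urllib.request.Request(url, data=body, method="POST")
    # Dataverse API tokens go in the X-Dataverse-key header.
    req.add_header("X-Dataverse-key", api_token)
    return req

# sending it: urllib.request.urlopen(build_add_files_request(...))
```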

later

  • sidekiq would be a better job infrastructure
