The archiver service is an ingest component that:
- Converts metadata into the format accepted by each external accessioning authority.
- Submits the converted metadata to the appropriate external authorities. These are currently only EBI authorities (e.g. BioSamples).

In the future it will also:
- Update HCA metadata with accessions provided by the external authorities.
At the moment the process consists of three stages:
1. Running the metadata archiver (MA) script (the one in this repository), which archives the metadata of a submission through the DSP. This script also checks that the files have been submitted by the file uploader (see below).
2. Running the file uploader (FIU), which uploads the archive data to the DSP and runs on the EBI cluster. It needs access to the file submission JSON instructions generated by the metadata archiver.
3. Running the metadata archiver (MA) again to validate and submit the entire submission.
This component is currently invoked manually after an HCA submission:

```
docker run -v $PWD:/output \
  --env INGEST_API_URL=http://api.ingest.archive.data.humancellatlas.org/ \
  --env INGEST_API_GCP='{ "type": "service_account", "project_id": "...", "private_key_id": "...", "private_key": "...", "client_email": "...", "client_id": "...", "auth_uri": "https://accounts.google.com/o/oauth2/auth", "token_uri": "https://oauth2.googleapis.com/token", "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs", "client_x509_cert_url": "..." }' \
  --env DSP_API_URL=https://submission.ebi.ac.uk \
  --env AAP_API_URL=https://api.aai.ebi.ac.uk/auth \
  --env ONTOLOGY_API_URL=https://www.ebi.ac.uk/ols \
  --env AAP_API_DOMAIN=<aap_domain> \
  --env AAP_API_USER=<aap_user> \
  --env AAP_API_PASSWORD=<aap_password> \
  --env VALIDATION_POLL_FOREVER=False \
  --env SUBMISSION_POLL_FOREVER=False \
  quay.io/ebi-ait/ingest-archiver \
  --alias_prefix=HCA \
  --project_uuid=<project_uuid>
```

Note that the `INGEST_API_GCP` JSON value must be quoted so the shell passes it as a single argument.
`INGEST_API_URL` - the ingest environment to pull metadata for submission.
- Production: `http://api.ingest.archive.data.humancellatlas.org/`
- Staging: `http://api.ingest.staging.archive.data.humancellatlas.org/`
`INGEST_API_GCP` (optional) - the service account token to use when connecting to the Ingest API. This is required when completing submissions, in order to post accessions back to ingest, but is otherwise optional. Search the AWS Secrets Manager for `gcp-credentials.json`.
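If you have access to the ingest AWS account, something like the following should retrieve it with the AWS CLI; the exact secret id is an assumption, so check what the search above turns up:

```
# Sketch: pull the GCP service-account JSON out of AWS Secrets Manager.
# "gcp-credentials.json" is a guessed secret id - adjust to match your account.
aws secretsmanager get-secret-value \
  --secret-id gcp-credentials.json \
  --query SecretString \
  --output text > gcp-credentials.json
```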
`DSP_API_URL` - the DSP service on which to create the new submission to the archives. The DSP's old name, USI, is also supported via `USI_API_URL`.
- Production: `https://submission.ebi.ac.uk`
- Test: `https://submission-test.ebi.ac.uk`

The DSP uses an EBI Authentication and Authorization Profile (AAP) account, configured with the following variables.
`AAP_API_URL`
- Production: `https://api.aai.ebi.ac.uk/auth`
- Test: `https://explore.api.aai.ebi.ac.uk/auth`
`AAP_API_DOMAIN`
- Production: `subs.team-2`
- Test: `subs.test-team-21`
`AAP_API_USER`, `AAP_API_PASSWORD` - the AAP user and password. Either create your own user in the domain above, or use the common AAP user `hca-ingest` if archiving on behalf of ingest.
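Passing this many `--env` flags is unwieldy; Docker's standard `--env-file` option is one alternative. A minimal sketch, using the same placeholder values as above (the file name `archiver.env` is just a suggestion):

```
# archiver.env - one VAR=value per line; values are taken verbatim,
# so the INGEST_API_GCP JSON can be pasted in unquoted on one line.
INGEST_API_URL=http://api.ingest.archive.data.humancellatlas.org/
DSP_API_URL=https://submission.ebi.ac.uk
AAP_API_URL=https://api.aai.ebi.ac.uk/auth
ONTOLOGY_API_URL=https://www.ebi.ac.uk/ols
AAP_API_DOMAIN=<aap_domain>
AAP_API_USER=<aap_user>
AAP_API_PASSWORD=<aap_password>
VALIDATION_POLL_FOREVER=False
SUBMISSION_POLL_FOREVER=False
```

```
docker run -v $PWD:/output --env-file archiver.env \
  quay.io/ebi-ait/ingest-archiver \
  --alias_prefix=HCA \
  --project_uuid=<project_uuid>
```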
The two command-line arguments:

```
--alias_prefix=HCA
--project_uuid=2a0faf83-e342-4b1c-bb9b-cf1d1147f3bb
```

The `--alias_prefix` above is prefixed to every DSP entity created by the archiver. The `--project_uuid` is used to download assay manifests from the Ingest API.
You should get output like:

```
GETTING MANIFESTS FOR PROJECT: 2a0faf83-e342-4b1c-bb9b-cf1d1147f3bb
Processing 6 manifests:
https://api.ingest.staging.archive.data.humancellatlas.org/bundleManifests/0d172fd7-f5af-4307-805b-3a421cdabd76
https://api.ingest.staging.archive.data.humancellatlas.org/bundleManifests/9526f387-bb5a-4a1b-9fd1-8ff977c62ffd
https://api.ingest.staging.archive.data.humancellatlas.org/bundleManifests/4d07290e-8bcc-4060-9b67-505133798ab0
https://api.ingest.staging.archive.data.humancellatlas.org/bundleManifests/b6d096f4-239a-476d-9685-2a03c86dc06b
https://api.ingest.staging.archive.data.humancellatlas.org/bundleManifests/985a9cb6-3665-4c04-9b93-8f41e56a2c71
https://api.ingest.staging.archive.data.humancellatlas.org/bundleManifests/19f1a1f8-d563-43a8-9eb3-e93de1563555
* PROCESSING MANIFEST 1/6: https://api.ingest.staging.archive.data.humancellatlas.org/bundleManifests/0d172fd7-f5af-4307-805b-3a421cdabd76
Finding project entities in bundle...
1
Finding study entities in bundle...
1
Finding sample entities in bundle...
17
Finding sequencingExperiment entities in bundle...
1
Finding sequencingRun entities in bundle...
1
...
Entities to be converted: {
    "project": 1,
    "study": 1,
    "sample": 19,
    "sequencingExperiment": 6,
    "sequencingRun": 6
}
Saving Report file...
Saved to /output/ARCHIVER_2019-01-04T115615/REPORT.json!
##################### FILE ARCHIVER NOTIFICATION
Saved to /output/ARCHIVER_2019-01-04T115615/FILE_UPLOAD_INFO.json!
```
In your current directory, the MA will have generated a directory named `ARCHIVER_<timestamp>` containing two files, `REPORT.json` and `FILE_UPLOAD_INFO.json`. Inspect `REPORT.json` for errors. If there are any data files to upload you will always see `FileReference` `dsp_validation_errors` in the `submission_errors` field. You can ignore these - the files will be uploaded in the following steps. For example:

```
"completed": false,
"submission_errors": [
    {
        "error_message": "Failed in DSP validation.",
        "details": {
            "dsp_validation_errors": [
                {
                    "FileReference": [
                        "The file [306982e4-5a13-4938-b759-3feaa7d44a73.bam] referenced in the metadata is not exists on the file storage area."
                    ]
                },
                {
                    "FileReference": [
                        "The file [988de423-1543-4a84-be9a-dd81f5feecff.bam] referenced in the metadata is not exists on the file storage area."
                    ]
                },
                {
                    "FileReference": [
                        "The file [fd226091-9a8f-44a8-b49e-257fffa2b931.bam] referenced in the metadata is not exists on the file storage area."
                    ]
                }
            ]
        }
    }
],
```
If you see problems in the entities added to the submission, i.e. non-empty `errors` or `warnings` fields, please report them to ingest development. This is a small snippet showing a successful entity addition:

```
"entities": {
    "HCA_2019-01-07-13-53__project_2a0faf83-e342-4b1c-bb9b-cf1d1147f3bb": {
        "errors": [],
        "accession": null,
        "warnings": [],
        "entity_url": "https://submission-dev.ebi.ac.uk/api/projects/c26466cd-9551-46c9-b760-72e05cfc51ac"
    },
```
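Rather than eyeballing a large report, a `jq` one-liner can surface any problem entities. A sketch, assuming `jq` is installed and that the `entities` map sits at the top level of `REPORT.json` as in the snippet above:

```
# Print the alias of every entity whose "errors" array is non-empty
jq -r '.entities | to_entries[] | select((.value.errors | length) > 0) | .key' REPORT.json
```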
`FILE_UPLOAD_INFO.json` contains the instructions necessary for the file uploader to convert and upload submission data to the DSP. You need to copy this file to the HCA NFS directory accessible by the cluster. However, you also need to give it a unique name so that it doesn't clash with any existing JSON files there. Therefore, prepend something to the filename to make it unique. This can be anything, but we suggest your username and the dataset, for example `mfreeberg_rsatija_FILE_UPLOAD_INFO.json`.
You will copy the file using the secure copy (`scp`) command. This will need your EBI password and is equivalent to copying a file through ssh. For example:

```
scp FILE_UPLOAD_INFO.json ebi-cli.ebi.ac.uk:/nfs/production/hca/mfreeberg_rsatija_FILE_UPLOAD_INFO.json
```
Login to EBI CLI to access the cluster, again with your EBI password:

```
ssh ebi-cli.ebi.ac.uk
```
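If your local username differs from your EBI one, both commands accept the usual `user@host` form, e.g.:

```
scp FILE_UPLOAD_INFO.json <ebi-username>@ebi-cli.ebi.ac.uk:/nfs/production/hca/mfreeberg_rsatija_FILE_UPLOAD_INFO.json
ssh <ebi-username>@ebi-cli.ebi.ac.uk
```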
Run the file uploader with the `bsub` command below. We will explain more about its components below.

```
bsub 'singularity run -B /nfs/production/hca:/data docker://quay.io/humancellatlas/ingest-file-archiver -d=/data -f=/data/mfreeberg_rsatija_FILE_UPLOAD_INFO.json -l=https://explore.api.aai.ebi.ac.uk/auth -p=<ebi-aap-password> -u=hca-ingest'
```
- `bsub` - the command for submitting a job to the cluster.
- `singularity` - the cluster runs jobs using Singularity containers.
- `-B /nfs/production/hca:/data` - binds the `/nfs/production/hca` directory to `/data` inside the container.
- `docker://quay.io/humancellatlas/ingest-file-archiver` - Singularity can run Docker images directly. This is the image for the file uploader.
- `-d=/data` - workspace used to store downloaded files, metadata and conversions.
- `-f=/data/mfreeberg_rsatija_FILE_UPLOAD_INFO.json` - the location of the `FILE_UPLOAD_INFO.json` you copied in a previous step.
- `-l=https://explore.api.aai.ebi.ac.uk/auth` - the AAP API URL, the same as the `AAP_API_URL` environment variable. As above, this will need to be `-l=https://api.aai.ebi.ac.uk/auth` instead if you are submitting to the production DSP.
- `-p=<ebi-aap-password>` - the test or production AAP password, as used previously.
- `-u=hca-ingest` - the DSP user to use. This will always be `hca-ingest` right now.
On submitting you will see a response along the lines of:

```
Job <894044> is submitted to default queue <research-rh7>.
```
This shows that the job has been submitted to the cluster. To see the status of the job, run:

```
bjobs -W
```
The job should be reported as running but may also be pending if the cluster is busy.
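For fuller detail on a single job, the standard LSF long format can help:

```
bjobs -l <job-id>   # long-format status report for one job
```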
If you want to see the job's current stdout/stderr, run the `bpeek` command:

```
bpeek <job-id>
```
Once the job is running, processing may take a long time, many days in the case where a dataset has many data file conversions to perform. It will continue running after you log out and, on completion or failure, will e-mail you the results. Wait until you receive this e-mail before proceeding with the next step.
Here are some further useful links about using the cluster and associated commands.
- https://sysinf.ebi.ac.uk/doku.php?id=ebi_cluster_good_computing_guide
- https://sysinf.ebi.ac.uk/doku.php?id=introducing_singularity
The e-mail you receive will have a title similar to:

```
Job %JOB-ID%: <singularity run -B /nfs/production/hca/mfreeberg:/data docker://quay.io/humancellatlas/ingest-file-archiver -d=/data -f=/data/FILE_UPLOAD_INFO.json -l=https://explore.api.aai.ebi.ac.uk/auth -p=%PW% -u=hca-ingest> in cluster <EBI> Done
```
This will contain a whole load of detail about the job run. Scroll down to the bottom and you should see a bunch of INFO messages such as:

```
INFO:hca:File process_15.json: GET SUCCEEDED. Stored at fd226091-9a8f-44a8-b49e-257fffa2b931/process_15.json.
```

and

```
INFO:hca:File PBMC_RNA_R1.fastq.gz: GET SUCCEEDED. Stored at fd226091-9a8f-44a8-b49e-257fffa2b931/PBMC_RNA_R1.fastq.gz.
```
If you see any `WARNING` or `ERROR` messages, please re-run the singularity command from the previous step (it will retry the failed steps) and tell ingest development.
For test purposes you can run the data uploader outside of Singularity with the command:

```
docker run --rm -v $PWD:/data quay.io/humancellatlas/ingest-file-archiver -d=/data -f=/data/FILE_UPLOAD_INFO.json -l=https://api.aai.ebi.ac.uk/auth -p=<password> -u=hca-ingest
```
The final stage is to validate and submit the completed DSP submission. To do this you need to run the metadata archiver again, this time passing the DSP submission URL:

```
docker run -v $PWD:/output \
  --env INGEST_API_URL=http://api.ingest.archive.data.humancellatlas.org/ \
  --env DSP_API_URL=https://submission.ebi.ac.uk \
  --env AAP_API_URL=https://api.aai.ebi.ac.uk/auth \
  --env ONTOLOGY_API_URL=https://www.ebi.ac.uk/ols \
  --env AAP_API_DOMAIN=<aap_domain> \
  --env AAP_API_USER=<aap_user> \
  --env AAP_API_PASSWORD=<aap_password> \
  --env VALIDATION_POLL_FOREVER=False \
  --env SUBMISSION_POLL_FOREVER=False \
  quay.io/ebi-ait/ingest-archiver \
  --alias_prefix=HCA \
  --project_uuid=<project_uuid> \
  --submission_url=https://submission.ebi.ac.uk/api/submissions/<submission-uuid>
```

As noted above, `INGEST_API_GCP` must also be set at this stage if accessions are to be posted back to ingest.
You can get the submission UUID either from the output of the initial metadata archiver run, e.g.

```
DSP SUBMISSION: https://submission-dev.ebi.ac.uk/api/submissions/b729f228-d587-440c-ae5b-d0c1f34b8766
```

or from the `submission_url` fields in `REPORT.json` (there will be several). For example:

```
"submission_url": "https://submission-dev.ebi.ac.uk/api/submissions/b729f228-d587-440c-ae5b-d0c1f34b8766"
```
On success you will get the message `SUCCESSFULLY SUBMITTED`. You're done!
To run the unit tests:

```
docker build -f test.Dockerfile -t unittests .
docker run unittests
```

or, outside Docker:

```
python -m unittest
```