Update diversityOfIlealMucosa - user feedback #1336

Open
arschat opened this issue Dec 2, 2024 · 4 comments
Assignees
Labels
dataset All dataset tickets should have this label, only one ticket per dataset HCA NeedsUpdate Release 46 DCP Data Release 46 @ 27/1

Comments

Collaborator

arschat commented Dec 2, 2024

In the hca-support channel (i.e. the contact-us form on the Data Portal) we received the following message:

Check out this new ticket w/ subject: SRR to SRX mismatch.
Ticket description:
----------------------------------------------
in the metadata file diversityOfIlealMucosa 2024-11-28 00.27 SRR19039103 is listed as SRX15110551 but on ncbi SRA it's https://www.ncbi.nlm.nih.gov/sra/?term=SRX15110547.
Similarly SRR19039104 on SRA is https://www.ncbi.nlm.nih.gov/sra/SRX15110546 not 551 like the metadata here.
Are these metadata supplied by the authors or converted from SRA or somewhere else?
------------------
Submitted from: https://explore.data.humancellatlas.org/projects/9dd91b6e-7c62-49d3-a3d4-74f603deffdb/project-metadata.


arschat commented Dec 2, 2024

Using SraRunTable.csv, I checked the following in the metadata spreadsheet (comparison spreadsheet: diversityOfIlealMucosa_metadata_23-08-2024_check_map.xlsx):

  • donor level, based on biosample id
    • age
    • ethnicity
    • sex
    • disease
  • specimen level, based on biosample id
    • disease
    • srx id
    • srs id
  • cell suspension, based on srx id
    • samn id
    • gsm id
  • sequence file, based on srr
    • input biomaterial (srx id)
    • insdc srx id
    • srr in filename
  • analysis file, based on gsm from filename
    • input biomaterial (srx id)
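The run-to-experiment part of this check can be sketched in Python as below. This is only an illustration, not the script used for the comparison spreadsheet. It assumes the SRA export has `Run` and `Experiment` columns; the metadata column names (`run`, `insdc_experiment_accession`) and the in-memory toy CSVs are hypothetical stand-ins. The accessions are the ones quoted in the user's ticket.

```python
import csv
import io


def load_mapping(handle, run_col, srx_col):
    """Return {run accession: experiment accession} from a CSV file handle."""
    return {row[run_col]: row[srx_col] for row in csv.DictReader(handle)}


def find_mismatches(reference, metadata):
    """List (run, expected SRX, SRX found in metadata) for every disagreement.

    Runs absent from the metadata are reported with None as the found value.
    """
    mismatches = []
    for run, expected in sorted(reference.items()):
        found = metadata.get(run)
        if found != expected:
            mismatches.append((run, expected, found))
    return mismatches


# Toy stand-in for SraRunTable.csv (column names are an assumption).
sra_csv = io.StringIO(
    "Run,Experiment\n"
    "SRR19039103,SRX15110547\n"
    "SRR19039104,SRX15110546\n"
)
# Toy stand-in for the sequence file tab (hypothetical column names).
meta_csv = io.StringIO(
    "run,insdc_experiment_accession\n"
    "SRR19039103,SRX15110551\n"
    "SRR19039104,SRX15110551\n"
)

reference = load_mapping(sra_csv, "Run", "Experiment")
metadata = load_mapping(meta_csv, "run", "insdc_experiment_accession")
mismatches = find_mismatches(reference, metadata)

for run, expected, found in mismatches:
    print(f"{run}: SRA says {expected}, metadata says {found}")
```

With real files, the `io.StringIO` objects would be replaced by `open(...)` handles and the column names adjusted to whatever the export actually contains.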

Discrepancies found

  • Two columns on the sequence file tab have SRX mismatches.

    • input biomaterial (srx id) cell_suspension.biomaterial_core.biomaterial_id: 30 values are identical
    • insdc experiment (srx id) process.insdc_experiment.insdc_experiment_accession: 121 values are missing or invalid
  • There are 2 runs that don't have sequence files: SRR19039060, SRR19039068

    • the corresponding SRX IDs are present in both the cell suspension and the analysis file tabs
  • The filenames do not seem to match either the contributor-provided files or the archive-generated files

    • e.g. for SRR19039056 the contributor-provided file is named p21063-s007_GCA29_5GEX_E2_S13_L003_I1_001.fastq.gz, while in the spreadsheet the name is SRR19039056_1.fastq.gz
    • there are 55 archive-generated files, while the spreadsheet lists 127 files. There are 133 contributor-provided files, which is consistent with the two missing runs at 3 files each (133 - 3*2 = 127 files)
  • hca-util upload area 46f49251-6bd0-4d48-8e00-bdbcd5864072
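A minimal sketch of how the missing runs and the file count could be cross-checked, assuming spreadsheet filenames follow the `SRR…_N.fastq.gz` pattern seen above. The `_2` and `_3` suffixes and the short run list are hypothetical toy data; only the accessions and the counts 133, 3 and 2 come from this ticket.

```python
import re

# Spreadsheet filenames are assumed to start with the run accession.
RUN_PATTERN = re.compile(r"^(SRR\d+)_")


def runs_without_files(expected_runs, filenames):
    """Return the sorted runs for which no filename carries their accession."""
    seen = set()
    for name in filenames:
        match = RUN_PATTERN.match(name)
        if match:
            seen.add(match.group(1))
    return sorted(set(expected_runs) - seen)


# Toy inputs: three runs, but files are listed for only one of them.
expected_runs = ["SRR19039056", "SRR19039060", "SRR19039068"]
spreadsheet_files = [
    "SRR19039056_1.fastq.gz",
    "SRR19039056_2.fastq.gz",  # hypothetical suffix
    "SRR19039056_3.fastq.gz",  # hypothetical suffix
]

missing = runs_without_files(expected_runs, spreadsheet_files)
print(missing)

# Reconcile the counts reported above: 133 contributor files, 3 per run.
contributor_files = 133
files_per_run = 3
expected_in_spreadsheet = contributor_files - files_per_run * len(missing)
print(expected_in_spreadsheet)
```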


arschat commented Dec 2, 2024

All SRX mismatches have been fixed and the submission is now graph valid.

  • ncbi cloud delivery requested for two runs
  • files delivered
  • files renamed & moved to hca-util-upload-area
  • files added to submission
  • patch content for each file
  • create 2 processes and link all 6 files
    • SRX15110562 -> SRR19039060
    • SRX15110554 -> SRR19039068
  • replace old file schema & update ontology fields data -> EDAM
  • submission export
  • import form sent

@arschat arschat added dataset All dataset tickets should have this label, only one ticket per dataset HCA labels Dec 4, 2024

arschat commented Dec 5, 2024

File GCA9-22812-5GEXC8_S3_L001_R2_001.fastq.gz is invalid in archive (downloaded twice and got the same error)

fastq_validator.sh GCA9-22812-5GEXC8_S3_L001_R2_001.fastq.gz
Checking integrity of gzip file GCA9-22812-5GEXC8_S3_L001_R2_001.fastq.gz...done.
Checking  GCA9-22812-5GEXC8_S3_L001_R2_001.fastq.gz
fastq_utils 0.24.1
DEFAULT_HASHSIZE=39000001
Scanning and indexing all reads from GCA9-22812-5GEXC8_S3_L001_R2_001.fastq.gz
CASAVA=1.8
57300000
ERROR: Error in file GCA9-22812-5GEXC8_S3_L001_R2_001.fastq.gz: line 229267721: read length too small - 0

Will include "archive generated file" for this run.
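For a quick independent look at this kind of failure, a small Python scan can locate the first empty sequence line in a gzipped FASTQ. This is a rough sketch, not a reimplementation of fastq_utils: it assumes strict 4-line records (no wrapped sequences), and the line number it reports is not guaranteed to match the one in the validator's error message. The demo file and its contents are synthetic.

```python
import gzip
import os
import tempfile


def first_empty_read(path):
    """Return the 1-based line number of the first empty sequence line, or None."""
    with gzip.open(path, "rt") as handle:
        for line_number, line in enumerate(handle, start=1):
            # In 4-line FASTQ records the sequence is the 2nd line of each record.
            if line_number % 4 == 2 and line.strip() == "":
                return line_number
    return None


# Synthetic demo: the second record has a zero-length read.
records = "@r1\nACGT\n+\nIIII\n@r2\n\n+\n\n"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.fastq.gz")
    with gzip.open(demo_path, "wt") as out:
        out.write(records)
    bad_line = first_empty_read(demo_path)

print(bad_line)
```

Streaming line by line keeps memory use flat, which matters for files with hundreds of millions of lines like the one above.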

@arschat arschat added NeedsUpdate Release 46 DCP Data Release 46 @ 27/1 labels Dec 9, 2024

arschat commented Dec 9, 2024

import form sent

@arschat arschat self-assigned this Dec 9, 2024