Proposal to extend format to allow for hybrid (short/long sequencing) data #14

Open
andersgs opened this issue Aug 15, 2022 · 7 comments


andersgs commented Aug 15, 2022

I suggest adding an optional SRArun_acc_long column to the metadata format to support hybrid datasets with short and long sequencing data.

We would need a supporting sha256sumLongRead column too.
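A minimal sketch of how a reader might treat the two proposed columns as optional, so that existing spreadsheets without them still load. Only `SRArun_acc_long` and `sha256sumLongRead` come from the proposal; the other column names, the sample rows, and the `read_rows` helper are assumptions for illustration.

```python
import csv
import io

# Hypothetical two-row spreadsheet. Only SRArun_acc_long and
# sha256sumLongRead come from the proposal; the rest is assumed.
TSV = (
    "biosample_acc\tSRArun_acc\tSRArun_acc_long\tsha256sumLongRead\n"
    "SAMEA123456\tSRR123456\tSRR123457\tabc123\n"
    "SAMEA123457\tSRR123458\t\t\n"
)


def read_rows(handle):
    """Yield one dict per row, treating the long-read columns as optional."""
    for row in csv.DictReader(handle, delimiter="\t"):
        # .get() keeps older files that lack the new columns readable;
        # empty cells are normalised to None.
        long_acc = (row.get("SRArun_acc_long") or "").strip() or None
        long_sum = (row.get("sha256sumLongRead") or "").strip() or None
        yield {
            "biosample": row["biosample_acc"],
            "short_run": row["SRArun_acc"],
            "long_run": long_acc,
            "long_sha256": long_sum,
            "is_hybrid": long_acc is not None,
        }


rows = list(read_rows(io.StringIO(TSV)))
for r in rows:
    print(r["biosample"], "hybrid" if r["is_hybrid"] else "short-read only")
```

A file without the two new columns goes through the same function and simply reports every sample as short-read only.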


retimme commented Aug 15, 2022

ahh, you are proposing an additional column, not to describe long reads specifically, but to capture an additional SRR accession? maybe we make it more generic to maximize its use?

Contributor

lskatz commented Aug 17, 2022

I've been thinking about this more and more, and I think, based on this tweet, Justin is right. Every new technology introduces a new column to the scheme, forcing new versions of this repo. I think I'd rather leave it as is. It will force duplicate rows of metadata, but that would only be overcome by yet another table, which I'd also rather not design. @retimme does that make sense to you too?
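To illustrate the duplicate-row approach, here is a rough sketch of how a consumer could regroup runs that share a sample when a hybrid dataset is written as one row per run. The column names `biosample_acc`, `strain`, and `SRArun_acc`, the row values, and the `runs_by_sample` helper are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical rows in a one-run-per-row layout: the sample metadata is
# repeated, only the run accession differs. Column names are assumed.
rows = [
    {"biosample_acc": "SAMEA123456", "strain": "example-strain", "SRArun_acc": "SRR123456"},
    {"biosample_acc": "SAMEA123456", "strain": "example-strain", "SRArun_acc": "SRR123457"},
]


def runs_by_sample(rows):
    """Collect run accessions per biosample, preserving row order."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["biosample_acc"]].append(row["SRArun_acc"])
    return dict(grouped)


grouped = runs_by_sample(rows)
print(grouped)
```

Note that this layout does not say which run is short-read and which is long-read; that would still have to be looked up from the run accession.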


retimme commented Aug 17, 2022

yeah, i'm torn between solving this for immediate use and coming up with a better standard that will stand up better over time. We could bring this up in the PHA4GE data_structures working group for ideas on modernizing the structure? I don't need to be involved in the solution (my plate is full) - @andersgs, what do you think?

@andersgs
Author

I am happy with that. Possibly a JSON format might be better suited to the task. I am, however, running my own battle with cancer. So, I am not really able to spearhead this. I could check with Emma to see what she thinks.

@andersgs
Author

My goal was to instigate a debate more than arrive at a solution. 😎


retimme commented Aug 19, 2022

understood, @andersgs. I'm sorry to hear that. I agree that this should be moved to JSON. Let's ping Emma for ideas.

@andersgs
Author

Thank you @retimme... I have been thinking about how this may work and came up with this YAML file as perhaps a launching pad for discussions.

---
name: my dataset
n_samples: 1
description: |
 An example dataset
pmid: 123456
test: hybrid assemblies
data:
 - biosample: SAMEA123456
   collection_date: 2000-01-01
   organism: Klebsiella pneumoniae
   reads:
   - accession: SRR123456
     url: ftp://ftp.ena/SRR123456_1.fastq.gz 
     md5sum: gahquhgq174
     bytes: 9999
     read_type: short
     library_format: paired-end
     library_strategy: shotgun
   - accession: SRR123456
     url: ftp://ftp.ena/SRR123456_2.fastq.gz 
     md5sum: hjutd1356
     bytes: 9999
     read_type: short
     library_format: paired-end
     library_strategy: shotgun
   - accession: SRR123457
     url: ftp://ftp.ena/SRR123457.fastq.gz 
     md5sum: gujrsa12367
     bytes: 98989
     read_type: long
     library_format: single-end
     library_strategy: shotgun
   tests:
   - name: assembly length
     expected_value: 5146787
     method: unicycler hybrid assembly 
     tools:
     - name: unicycler
       version: 0.5.7
       cmd_opts: "-m"
   - name: number of circular plasmids
     expected_value: 3
     method: unicycler hybrid assembly
     tools:
     - name: unicycler
       version: 0.5.7
       cmd_opts: "-m"

There are essentially four bits.

A preamble with data about the dataset:

name: my dataset
n_samples: 1
description: |
 An example dataset
pmid: 123456
test: hybrid assemblies

Then a data section that has three elements per sample.

First information about the sample data:

 - biosample: SAMEA123456
   collection_date: 2000-01-01
   organism: Klebsiella pneumoniae

And then information about the read data:

   reads:
   - accession: SRR123456
     url: ftp://ftp.ena/SRR123456_1.fastq.gz 
     md5sum: gahquhgq174
     bytes: 9999
     read_type: short
     library_format: paired-end
     library_strategy: shotgun
   - accession: SRR123456
     url: ftp://ftp.ena/SRR123456_2.fastq.gz 
     md5sum: hjutd1356
     bytes: 9999
     read_type: short
     library_format: paired-end
     library_strategy: shotgun
   - accession: SRR123457
     url: ftp://ftp.ena/SRR123457.fastq.gz 
     md5sum: gujrsa12367
     bytes: 98989
     read_type: long
     library_format: single-end
     library_strategy: shotgun
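As a rough sketch of how a consumer might use the reads list, the snippet below groups files by run accession and checks whether a sample has both short and long reads. The `sample` dict mirrors the YAML above as the nested dicts and lists a YAML parser would typically return; `summarise_reads` and `is_hybrid` are hypothetical helpers, not part of any existing tool.

```python
# Mirrors the reads section above as plain Python data (what a YAML
# parser would typically hand back). Only the fields used here are kept.
sample = {
    "biosample": "SAMEA123456",
    "reads": [
        {"accession": "SRR123456", "url": "ftp://ftp.ena/SRR123456_1.fastq.gz",
         "bytes": 9999, "read_type": "short"},
        {"accession": "SRR123456", "url": "ftp://ftp.ena/SRR123456_2.fastq.gz",
         "bytes": 9999, "read_type": "short"},
        {"accession": "SRR123457", "url": "ftp://ftp.ena/SRR123457.fastq.gz",
         "bytes": 98989, "read_type": "long"},
    ],
}


def summarise_reads(sample):
    """Group read files by run accession and total their sizes."""
    runs = {}
    for read in sample["reads"]:
        run = runs.setdefault(
            read["accession"],
            {"read_type": read["read_type"], "files": [], "bytes": 0},
        )
        run["files"].append(read["url"])
        run["bytes"] += read["bytes"]
    return runs


def is_hybrid(sample):
    """True when the sample has at least one short and one long read set."""
    types = {read["read_type"] for read in sample["reads"]}
    return {"short", "long"} <= types


runs = summarise_reads(sample)
for accession, run in runs.items():
    print(accession, run["read_type"], len(run["files"]), "file(s)", run["bytes"], "bytes")
print("hybrid:", is_hybrid(sample))
```

Because each file is its own entry, paired-end runs show up as two entries with the same accession, which is why the grouping step is needed.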

Finally, information about the test and expected result for the sample:

   tests:
   - name: assembly length
     expected_value: 5146787
     method: unicycler hybrid assembly 
     tools:
     - name: unicycler
       version: 0.5.7
       cmd_opts: "-m"
   - name: number of circular plasmids
     expected_value: 3
     method: unicycler hybrid assembly
     tools:
     - name: unicycler
       version: 0.5.7
       cmd_opts: "-m"

This last tests section is per sample.
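A minimal sketch of how the per-sample tests could be checked against a pipeline's output. The `observed` values, the `as_int` normaliser, and the `check_tests` helper are assumptions for illustration; the normaliser tolerates thousands separators so an expected value written as a string like "5,146,787" still compares as a number.

```python
# Per-sample tests as plain Python data, keeping only the fields used here.
tests = [
    {"name": "assembly length", "expected_value": 5146787},
    {"name": "number of circular plasmids", "expected_value": 3},
]

# Hypothetical results from running the named method on the sample.
observed = {"assembly length": 5146787, "number of circular plasmids": 2}


def as_int(value):
    """Convert 5146787 or '5,146,787' to an int."""
    return int(str(value).replace(",", ""))


def check_tests(tests, observed):
    """Return (test name, status) pairs; status is pass, fail, or missing."""
    results = []
    for test in tests:
        name = test["name"]
        if name not in observed:
            status = "missing"
        elif as_int(observed[name]) == as_int(test["expected_value"]):
            status = "pass"
        else:
            status = "fail"
        results.append((name, status))
    return results


results = check_tests(tests, observed)
for name, status in results:
    print(f"{name}: {status}")
```

Exact integer comparison suits counts; a test such as assembly length might in practice want a tolerance, which this sketch leaves out.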

It is a little more descriptive, but it provides various elements that would allow for comparison. It also allows for multiple tests per sample. One could imagine someone releasing a dataset with typing info (e.g., MLST and serotyping) per sample, or detection of different AMR profiles per sample. This would expand on the initial idea of using the format for phylogenetic-based surveillance.

Curious to hear what you guys think.
