Pooled data #9

Open
jdeck88 opened this issue Oct 17, 2017 · 13 comments

Comments

@jdeck88
Member

jdeck88 commented Oct 17, 2017

Discussion online about how best to accept pooled data. Here is an issue to track comments on pooled data.

The solution for accepting pooled data could be UI changes, changes to existing drop-down boxes, new metadata fields, or simply documentation changes describing how this should be handled, e.g. a new section in the Help doc entitled "How do I handle pooled data?"

@jdeck88
Member Author

jdeck88 commented Oct 17, 2017

Comment From Chris Bird:

All we need to do is accept a demultiplex file that equates DNA sequences to sample IDs. We all discussed this a while back, and I remember devising a way to get GeoMe to accept it. Here are the categories that we need: Read1 Barcode, Read2 Barcode, Sample ID. These can be added in a file or built into the meta-data. Individuals or individual samples can be linked to the appropriate FASTQ files in the meta-data. I think that all of this is possible without mods, but of course it would be better if it’s baked in.

Chris Meyer and I have gone back and forth a bit on accommodating lightly processed metabarcoding data (aligned read1 and read2, demultiplexed). I think the main outcome was that even for the same data type, sometimes different labs (his and my labs) handle the same type of data differently and we may need to add some flexibility. I’m still of the opinion that it can’t get any easier than uploading your raw FASTQs and the demultiplex info.
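As a rough illustration of the demultiplex file described above, here is a minimal Python sketch. The column names mirror the three categories in the comment (Read1 Barcode, Read2 Barcode, Sample ID); the tab-separated layout, the barcodes, the sample IDs, and the assumption that each barcode sits at the start of its read are all hypothetical and not taken from GeOMe.

```python
import csv
import io

# Hypothetical demultiplex table. Column names follow the three categories
# named in the comment; the barcodes and sample IDs are invented.
DEMUX_TSV = """read1_barcode\tread2_barcode\tsample_id
ACGT\tTTAG\tsample_001
GGCA\tCATC\tsample_002
"""


def load_demux_table(handle):
    """Map (read1 barcode, read2 barcode) pairs to sample IDs."""
    table = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        key = (row["read1_barcode"], row["read2_barcode"])
        if key in table:
            raise ValueError(f"duplicate barcode pair: {key}")
        table[key] = row["sample_id"]
    return table


def assign_sample(read1_seq, read2_seq, table):
    """Return the sample ID whose barcodes prefix both reads, or None.

    Assumes inline barcodes at the 5' end of each read (an assumption).
    """
    for (bc1, bc2), sample_id in table.items():
        if read1_seq.startswith(bc1) and read2_seq.startswith(bc2):
            return sample_id
    return None


demux = load_demux_table(io.StringIO(DEMUX_TSV))
```

With a table like this, `assign_sample("ACGTAAAA", "TTAGCCCC", demux)` looks up the sample for one read pair; a real workflow would apply the same lookup while streaming through the paired FASTQ files.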

@mgaither

Indeed, but that "sample" needs to be described as well. For instance, to properly describe the sample we need to know how many individuals were pooled and the geographic range over which those individuals were collected. Were there 30 individuals collected across the island of Oahu, or 30 individuals collected from the same rock?

@cbird808

cbird808 commented Oct 17, 2017 via email

@mgaither

My comment is null and void if each individual is assigned a separate sample ID with its own metadata.

@jdeck88
Member Author

jdeck88 commented Oct 19, 2017

From Michelle Gaither:
I think the stickiest issue is the form in which data is uploaded to GeOMe.

Right now GeOMe is requesting demultiplexed fastq files, one for each individual. In the case of pooled ezRAD data that would equate to one fastq file per pool of individuals. Chris has suggested that we have a separate entry (and sample ID) for each individual in the pool, with all those sample IDs pointing to the same fastq file. I'm wondering if we then need to link those sample IDs in a way that ensures users understand the pooled nature of the data, so that if a user downloads the metadata for one individual in the pool, ALL the metadata for that pool is automatically downloaded.

Alternatively, it has been suggested that GeOMe allow for the uploading of raw fastq files. In this case each sample ID would also include barcode information to allow for demultiplexing, with all relevant sample IDs pointing to the same fastq file. This could be the case for traditional (individually barcoded) RADSeq as well, but it gets quite complex for pools of pools.
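To make the "one entry per individual, all pointing to the same fastq file" idea concrete, here is a small Python sketch of how a download for one individual could pull in the whole pool. The record layout and field names (`sample_id`, `fastq`, `locality`) are hypothetical and are not the GeOMe schema.

```python
from collections import defaultdict

# Hypothetical metadata records: two individuals share one pooled FASTQ file,
# a third has its own file. Field names are illustrative only.
records = [
    {"sample_id": "ind_01", "fastq": "pool_A.fastq.gz", "locality": "Oahu"},
    {"sample_id": "ind_02", "fastq": "pool_A.fastq.gz", "locality": "Oahu"},
    {"sample_id": "ind_03", "fastq": "ind_03.fastq.gz", "locality": "Maui"},
]


def index_by_fastq(records):
    """Group metadata records by the FASTQ file they point to."""
    pools = defaultdict(list)
    for rec in records:
        pools[rec["fastq"]].append(rec)
    return pools


def pool_metadata(sample_id, records):
    """Return every record that shares a FASTQ file with sample_id."""
    for members in index_by_fastq(records).values():
        if any(rec["sample_id"] == sample_id for rec in members):
            return members
    raise KeyError(sample_id)
```

Requesting `pool_metadata("ind_02", records)` returns the records for both `ind_01` and `ind_02`, which is the linked-download behaviour raised in the comment. Pools of pools would need an extra level of grouping that this sketch does not attempt.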

@cbird808

cbird808 commented Oct 19, 2017 via email

@mgaither

I sequence 96 individuals in a lane. With ezRAD I'm guessing you pool several "pools"; that's what I mean by pools of pools. The most unmolested fastq file would then be from a single lane of sequencing and thus a pool of "pools".

@cbird808

cbird808 commented Oct 19, 2017 via email

@mgaither

"With ezrad, there are no barcodes. So every pool has its own fastq or pair of fastq"

but you said earlier

"I’m still of the opinion that it can’t get any easier than uploading your raw FASTQs and the demultiplex info."

I call that a pool of pools that hasn't been demultiplexed. I assume when you say raw FASTQs you mean straight off the Illumina with little processing and no demultiplexing.

Personally, I'm in favor of demultiplexed files that represent either an individual or a pool of individuals, as in ezRAD.

@cbird808

cbird808 commented Oct 19, 2017 via email

@mgaither

I'm catching your vibe

Ezrad data is not demultiplexed, except by the sequencer. The pools are distinguished by Illumina indices only (not user-designed barcodes), which are parsed by the Illumina software; therefore the "raw fastq" is at the level of the pool.

Ddrad data does need demultiplexing: yes. Each Illumina index will also contain individuals with barcodes.

"Thus, I envision each of the individuals barcoded in a pair of fastq files to have their own Geome entry. Each of these entries point to the pair of raw fastq files and a decode (demultiplex) file."

Yep, your point is well taken. We will discuss at the meeting.

@jdeck88
Member Author

jdeck88 commented Oct 27, 2017

After a call on Tuesday Oct. 23rd we decided to handle pooled data by adopting count values in the dwc:individualCount field. Any dwc:individualCount > 1 for a single materialSample is pooled. This way, we can handle any pooled data samples. This comment is, for now, closing this particular issue which was just directed towards coming up with a short-term pooled data strategy.
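The rule above (a materialSample with dwc:individualCount greater than 1 is pooled) can be sketched as a small check. This is only an illustration: the dictionary layout, the `materialSampleID` values, and the choice to treat a missing count as a single individual are assumptions, not documented GeOMe behaviour.

```python
def is_pooled(material_sample):
    """Return True when the record's individualCount is greater than 1.

    Assumption: a missing or empty individualCount is treated as 1,
    i.e. a single, non-pooled individual.
    """
    raw = material_sample.get("individualCount")
    count = int(raw) if raw not in (None, "") else 1
    return count > 1


# Hypothetical materialSample records.
samples = [
    {"materialSampleID": "MS-001", "individualCount": "30"},
    {"materialSampleID": "MS-002", "individualCount": "1"},
    {"materialSampleID": "MS-003"},
]

pooled_ids = [s["materialSampleID"] for s in samples if is_pooled(s)]
```

Note that the count alone does not capture the concern raised earlier in the thread about the geographic range over which the pooled individuals were collected; that would still need its own metadata.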

@jdeck88
Member Author

jdeck88 commented Jun 14, 2019

How to deal with pooled-seq data has come up again, and we realized that the synopsis here did not adequately clarify how to go about representing pooled-seq data. I'm re-opening this thread to come up with a more satisfactory conclusion, in particular a direction about pooled-seq data for our FAQ document at:

https://docs.google.com/document/d/1tEFpclCyJ6aLnypmtdfdjLVhiWQ-rYhGqu5eGhq3s5s/edit#heading=h.9jdy98irwwtj

I'll ping Chris Bird and Michelle Gaither on this again since they were both on the original thread.

@jdeck88 jdeck88 reopened this Jun 14, 2019