Pooled data #9

jdeck88 · 2017-10-17T14:14:50Z

Discussion online about how best to accept pooled data. Here is an issue to track comments on pooled data.

The solution for accepting pooled data could be UI changes, changes to existing drop-down boxes, new metadata fields, or simply changes in documentation to describe how this should be handled, e.g. adding a new section to the Help doc entitled "How do I handle pooled data?"

jdeck88 · 2017-10-17T14:17:34Z

Comment From Chris Bird:

All we need to do is accept a demultiplex file that equates DNA sequences to sample ID. We all discussed this a while back, and I remember devising a way to get GeoMe to accept. Here are the categories that we need: Read1 Barcode, Read2 Barcode, Sample ID. These can be added in a file or built into the meta-data. Individuals or individual samples can be linked to the appropriate FASTQ files in the meta-data. I think that all of this is possible without mods, but of course, would be better if it’s baked in.

Chris Meyer and I have gone back and forth a bit on accommodating lightly processed metabarcoding data (aligned read1 and read2, demultiplexed). I think the main outcome was that even for the same data type, sometimes different labs (his and my labs) handle the same type of data differently and we may need to add some flexibility. I’m still of the opinion that it can’t get any easier than uploading your raw FASTQs and the demultiplex info.
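As an illustration of the demultiplex file described above, here is a minimal Python sketch. It assumes a tab-separated table with the three columns named in the comment (Read1 Barcode, Read2 Barcode, Sample ID) and assumes the barcodes sit at the start of each read. The barcodes, sample IDs, and function names are hypothetical and are not part of GeOMe.

```python
import csv
import io

# Hypothetical demultiplex table. Column names follow the three categories
# listed in the comment; barcodes and sample IDs are made up for illustration.
DEMUX_TSV = """Read1 Barcode\tRead2 Barcode\tSample ID
ACGT\tTTAG\tsample_01
GGCA\tTTAG\tsample_02
ACGT\tCCAT\tsample_03
"""


def load_demux_table(text):
    """Map (read1 barcode, read2 barcode) pairs to a sample ID."""
    table = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        key = (row["Read1 Barcode"], row["Read2 Barcode"])
        if key in table:
            raise ValueError(f"duplicate barcode pair: {key}")
        table[key] = row["Sample ID"]
    return table


def assign_sample(table, read1_seq, read2_seq):
    """Return the sample ID whose barcodes prefix both reads, or None.

    Assumes inline barcodes at the 5' end of each read (an assumption,
    not something the thread specifies).
    """
    for (bc1, bc2), sample_id in table.items():
        if read1_seq.startswith(bc1) and read2_seq.startswith(bc2):
            return sample_id
    return None


demux = load_demux_table(DEMUX_TSV)
```

With a table like this, every sample ID can point at the same pair of raw FASTQ files, and the barcode pair is enough to recover which reads belong to which sample.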

mgaither · 2017-10-17T16:21:24Z

Indeed but that "sample" needs to be described as well. For instance, to properly describe the sample we need to know how many individuals were pooled and the geographic range over which those individuals were collected. Were there 30 individuals collected across the island of Oahu or 30 individuals collected from the same rock?

cbird808 · 2017-10-17T17:08:15Z

I think I’m not understanding something.

mgaither · 2017-10-17T21:14:01Z

My comment is null and void if each individual is assigned a separate sample ID with its own metadata.

jdeck88 · 2017-10-19T16:28:50Z

From Michelle Gaither:
I think the stickiest issue is the form in which data is uploaded to GeOMe.

Right now GeOMe is requesting demultiplexed fastq files, one for each individual. In the case of pooled ezRAD data that would equate to one fastq file per pool of individuals. Chris has suggested that we have a separate entry (and sample ID) for each individual in the pool, with all those sample IDs pointing to the same fastq file. I'm wondering if we need to then link those sample IDs in a way that ensures users understand the pooled nature of the data. That way, if a user downloads the metadata for one individual in the pool, ALL the metadata for that pool is automatically downloaded.

Alternatively, it has been suggested that GeOMe allow for the uploading of raw fastq files. In this case each sample ID would also include barcode information to allow for demultiplexing, with all relevant sample IDs pointing to the same fastq file. This could be the case for traditional (individually barcoded) RADSeq as well, but it gets quite complex for pools of pools.
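The linking idea in the first option (several sample IDs sharing one fastq file, with a download of one individual returning the whole pool) can be sketched as below. This is only an illustration: the record fields `sample_id`, `fastq`, and `locality` and both function names are hypothetical, not GeOMe's actual schema or API.

```python
from collections import defaultdict

# Hypothetical metadata records; field names are illustrative only.
RECORDS = [
    {"sample_id": "ind_01", "fastq": "pool_A.fastq", "locality": "Oahu, site 1"},
    {"sample_id": "ind_02", "fastq": "pool_A.fastq", "locality": "Oahu, site 2"},
    {"sample_id": "ind_03", "fastq": "pool_A.fastq", "locality": "Oahu, site 3"},
    {"sample_id": "ind_04", "fastq": "ind_04.fastq", "locality": "Maui, site 1"},
]


def index_by_fastq(records):
    """Group records by the fastq file they point to."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["fastq"]].append(rec)
    return groups


def metadata_for_download(records, sample_id):
    """Return metadata for every sample sharing a fastq file with sample_id."""
    for members in index_by_fastq(records).values():
        if any(rec["sample_id"] == sample_id for rec in members):
            return members
    raise KeyError(sample_id)


pool = metadata_for_download(RECORDS, "ind_02")
```

Here a request for `ind_02` returns the records for all three individuals in `pool_A.fastq`, while a request for `ind_04` returns only its own record.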

cbird808 · 2017-10-19T18:55:11Z

Pools of pools would be uncommon for RAD.

mgaither · 2017-10-19T19:11:56Z

I sequence 96 individuals in a lane. With ezRAD I'm guessing you pool several "pools"; that's what I mean by pools of pools. The most unmolested fastq file would then be from a single lane of sequencing and thus a pool of "pools".

cbird808 · 2017-10-19T19:19:13Z

With ezrad, there are no barcodes. So every pool has its own fastq or pair of fastq. For ddrad, we have barcoded individuals also. That's not a pool though. I suppose that somebody out there has barcoded pools in ddRAD; I'll ruminate on that.

mgaither · 2017-10-19T19:43:50Z

"With ezrad, there are no barcodes. So every pool has its own fastq or pair of fastq"

but you said earlier

"I’m still of the opinion that it can’t get any easier than uploading your raw FASTQs and the demultiplex info."

I call that a pool of pools that hasn't been demultiplexed. I assume when you say raw FASTQs you mean straight off the Illumina with little processing and no demultiplexing.

Personally I'm in favor of demultiplexed files that represent either an individual or a pool of individuals as in ezRAD.

cbird808 · 2017-10-19T19:59:44Z

We can work this out on Monday. But I can't help myself 😁 Ezrad data is not demultiplexed, except by the sequencer. Ddrad data does need demultiplexing. I support accepting the raw fastq files. Indeed, it can't get easier than that. Thus, I envision each of the individuals barcoded in a pair of fastq files to have their own Geome entry. Each of these entries point to the pair of raw fastq files and a decode (demultiplex) file. I'm wary of accepting demultiplexed data because it involves decisions that won't be the same from person to person, it often involves some sort of trimming, quality filtering, pearing, which will remove data.

mgaither · 2017-10-19T20:09:15Z

I'm catching your vibe

"Ezrad data is not demultiplexed, except by the sequencer."

The pools are distinguished by Illumina indices only (not user-designed barcodes), which are parsed by the Illumina software. Therefore the "raw fastq" is at the level of the pool.

"Ddrad data does need demultiplexing."

Yes. Each Illumina index will also contain individuals with barcodes.

"Thus, I envision each of the individuals barcoded in a pair of fastq files to have their own Geome entry. Each of these entries point to the pair of raw fastq files and a decode (demultiplex) file."

Yep, your point is well taken. We will discuss at the meeting.

jdeck88 · 2017-10-27T21:09:31Z

After a call on Tuesday Oct. 23rd we decided to handle pooled data by adopting count values in the dwc:individualCount field. Any dwc:individualCount > 1 for a single materialSample is pooled. This way, we can handle any pooled data samples. This comment is, for now, closing this particular issue which was just directed towards coming up with a short-term pooled data strategy.
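The rule above is simple enough to express as a check. A minimal Python sketch follows, assuming each materialSample record is a dictionary with an `individualCount` key holding the dwc:individualCount value. The record layout and the function name `is_pooled` are hypothetical; treating a missing count as "not pooled" is also an assumption, not something the thread decided.

```python
def is_pooled(record):
    """Return True when a materialSample record has individualCount > 1.

    A missing or empty count is treated as not pooled (an assumption).
    """
    count = record.get("individualCount")
    if count is None or count == "":
        return False
    return int(count) > 1


# Hypothetical records for illustration only.
samples = [
    {"materialSampleID": "S1", "individualCount": 1},
    {"materialSampleID": "S2", "individualCount": "30"},
    {"materialSampleID": "S3"},
]

pooled_ids = [s["materialSampleID"] for s in samples if is_pooled(s)]
```

The count is converted with `int()` because values loaded from a spreadsheet or CSV upload often arrive as strings.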

jdeck88 · 2019-06-14T15:41:03Z

How to deal with pooled-seq data has come up again, and we realized that the synopsis here did not adequately clarify how to represent pooled-seq data. I'm re-opening this thread to come up with a more satisfactory conclusion, in particular a direction about pooled-seq data for our FAQ document at:

https://docs.google.com/document/d/1tEFpclCyJ6aLnypmtdfdjLVhiWQ-rYhGqu5eGhq3s5s/edit#heading=h.9jdy98irwwtj

I'll ping Chris Bird and Michelle Gaither on this again since they were both on the original thread.

jdeck88 closed this as completed Oct 27, 2017

jdeck88 mentioned this issue Oct 27, 2017

Pooled Data CheckBox on Upload, Query, and Marker Visualiztion #11

Open

jdeck88 reopened this Jun 14, 2019
