ATAC-seq Data Integration #14

ilan-gold · 2020-11-03T15:34:05Z

Cell by bin
Visualize in higlass
Cell by peaks (in BED + snap files)
Annotated peaks (genomic intervals) per cell
Genome-wide (not necessarily tied to a gene)
Our Pipelines:

We need to create all the new pipelines.

TMC:

CalTech
Stanford

Outstanding Issues:

Creation of new pipelines

Notes:
We have many visualization options:

Cell by gene

A result of the alignment? Perhaps less valuable since derived from the cell by bin matrix?
Would be useful if our heatmap viewer can scale

Quality control

Potentially view in a table

Motifs

In higlass: potentially nucleotide track or epilogos?

Variability scores for each motif

Which transcription factors are enriched per cell

Quality control scores for motifs
Probably more helpful to visualize cluster profiles rather than individual cell profiles

Is there a clustering available in the snap output files already?
Also would want to show variability within clusters

Can we summarize each cluster into a single profile (using the bin data for profiles)

Link from cell sets to clusters (e.g. with colors)
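
One possible way to collapse each cluster into a single profile, together with the within-cluster variability mentioned above, is a per-bin mean and variance over the cells in that cluster. This is only a sketch: it assumes the cell-by-bin matrix fits in memory as a dense array and that one cluster label per cell is available from somewhere (both are assumptions, and `cluster_profiles` is a hypothetical helper name, not part of any pipeline).

```python
import numpy as np


def cluster_profiles(cell_by_bin, cluster_labels):
    """Summarize a cell-by-bin matrix into one profile per cluster.

    Returns {cluster: (mean_profile, variance_profile)}, where both
    profiles have one value per bin.
    """
    matrix = np.asarray(cell_by_bin, dtype=float)
    labels = np.asarray(cluster_labels)
    if matrix.shape[0] != labels.shape[0]:
        raise ValueError("need exactly one cluster label per cell (row)")

    profiles = {}
    for cluster in np.unique(labels):
        rows = matrix[labels == cluster]  # all cells in this cluster
        # .item() turns the NumPy scalar into a plain Python key
        profiles[cluster.item()] = (rows.mean(axis=0), rows.var(axis=0))
    return profiles


if __name__ == "__main__":
    toy = [[1, 0, 2], [3, 0, 0], [0, 4, 4]]  # 3 cells x 3 bins
    for name, (mean, var) in cluster_profiles(toy, ["a", "a", "b"]).items():
        print(name, mean, var)
```

The variance profile is what a viewer could use to show variability within a cluster next to the cluster's mean track.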

ngehlenborg · 2021-05-24T13:49:29Z

Related to hubmapconsortium/portal-ui#1334:

Stanford bulk ATAC-seq files (NA_summits.bed, NA_peaks.narrowPeak) can be visualized as a single, overlaid HiGlass track in Vitessce (if HiGlass doesn't support bed files, we can use Gosling)
We need to decide which additional genome annotation tracks we should be displaying. At a minimum, we should have a track with gene locations for the correct genome build.
We need to be able to figure out which genome build is being used.
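
For reference, a rough sketch of what the peak files contain, assuming NA_peaks.narrowPeak follows the standard ENCODE narrowPeak layout (BED6+4, where the tenth column is the summit offset from the peak start, or -1 when no summit was called). `Peak` and `parse_narrowpeak` are illustrative names only.

```python
from typing import Iterable, List, NamedTuple


class Peak(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str
    signal: float
    summit: int  # absolute genomic position, or -1 if no summit was called


def parse_narrowpeak(lines: Iterable[str]) -> List[Peak]:
    """Parse tab-separated narrowPeak records into Peak tuples."""
    peaks = []
    for line in lines:
        line = line.strip()
        # Skip blanks and optional header lines
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 10:
            raise ValueError(f"expected 10 columns, got {len(fields)}")
        start = int(fields[1])
        offset = int(fields[9])  # summit offset relative to start
        summit = start + offset if offset >= 0 else -1
        peaks.append(
            Peak(fields[0], start, int(fields[2]), fields[3], float(fields[6]), summit)
        )
    return peaks
```

Since the summit positions can be derived from the narrowPeak records this way, overlaying the two files in one track should not lose information.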

@mruffalo How can we figure out which genome build was used to process these datasets?

mruffalo · 2021-05-24T17:11:32Z

@ngehlenborg What exactly do you mean by "figure out" -- what type of answer do you have in mind? Me answering in a comment to this GitHub issue? (GRCh38 with GENCODE v32 annotations for all processed ATAC-seq datasets.) Storing a mapping of pipeline versions (commit hashes? tags? both?) to genome and annotation versions in this repository or somewhere else appropriate? Or a programmatic way to obtain the annotations for a derived data set, given the pipeline version that was used to produce that data set?

Something like this could be automated by examining a derived data set, obtaining the pipeline commit that produced that data set, and getting supplementary data from the appropriate Docker image:

$ docker run -it --rm hubmap/sc-atac-seq-grch38:1.2-bulk
root@2ff4069dd1db:/opt# ls -1 supplementary-data/
bwa-index
gencode.v32.annotation.bed
grch38.fasta.fai
hg38.blacklist.bed
hg38.promoters.bed

This would allow accessing the actual genome annotations in BED format -- does something like this seem useful enough to make more convenient?
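
The "mapping of pipeline versions to genome and annotation versions" option could be as small as a lookup table. A minimal sketch, assuming the Docker image tag is used as the pipeline version key; the only entry is the one stated in this comment, and `REFERENCE_BY_IMAGE` and `reference_for_image` are hypothetical names:

```python
# Hypothetical lookup table: pipeline Docker image -> reference data it ships.
# The single entry reflects what is stated in this thread; any further
# entries would have to come from the pipeline maintainers.
REFERENCE_BY_IMAGE = {
    "hubmap/sc-atac-seq-grch38:1.2-bulk": {
        "genome": "GRCh38",
        "annotation_source": "GENCODE",
        "annotation_version": 32,
    },
}


def reference_for_image(image: str) -> dict:
    """Return the genome/annotation record for a pipeline image tag."""
    try:
        return REFERENCE_BY_IMAGE[image]
    except KeyError:
        raise LookupError(f"no reference data recorded for {image!r}") from None


if __name__ == "__main__":
    print(reference_for_image("hubmap/sc-atac-seq-grch38:1.2-bulk"))
```

Failing loudly on an unknown image seems safer than falling back to a default build, since a wrong build would silently misplace every annotation track.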

ngehlenborg · 2021-05-25T16:47:51Z

Sorry, that wasn't very clear. I am wondering how we can figure out which genome build was used programmatically. We should probably have that for each pipeline through an API or a well-defined location in the CWL file?

I am not sure what is best, but I would rather not have to write code that checks file names on disk.

ilan-gold · 2021-05-25T17:14:14Z

I agree with @ngehlenborg - the way this would work ideally is that it would be somewhere that is eminently parse-able (say some sort of metadata.tsv or json file) so that the portal backend can pick it up and throw it in the config for Vitessce, which will then fetch the correct annotation for that genome. I think the CWL file is a good location too - the most important thing will be consistency at least within each assay, if not across assays, that need this sort of thing.

ngehlenborg · 2022-01-20T20:26:42Z

We need to agree on a location for the genome build for a given data set with the IEC and the CMU TC. Added to portal call agenda.

mccalluc · 2022-02-07T15:55:49Z

From the 1/21/2022 minutes:

Genome build info communicated in the output directories to be used with the index

Need to know which reference genome to use and reference genome to display

MR is going to add this feature to the ATAC and RNA pipelines

Vitessce will utilize the information and a .json file should be sufficient

cc @mruffalo : Please update here if that isn't correct.

mccalluc · 2022-02-16T15:41:24Z

Matt posted on hive-developers February 9:

{
    "genome": "grch38",
    "annotations": {
        "source": "GENCODE",
        "version": 35
    }
}

Nils responded:

Confirmed that this is sufficient. Matt Ruffalo, You can go ahead and get this out.
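
A consumer of that JSON could look roughly like the sketch below. It assumes only the three fields shown in the snippet above; the `GenomeBuild` class and `parse_genome_build` function are illustrative names, and where the file lives in the output directory is left open because it is not stated here.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class GenomeBuild:
    genome: str
    annotation_source: str
    annotation_version: int


def parse_genome_build(text: str) -> GenomeBuild:
    """Parse the genome build JSON; raises KeyError if a field is missing."""
    data = json.loads(text)
    annotations = data["annotations"]
    return GenomeBuild(
        genome=str(data["genome"]),
        annotation_source=str(annotations["source"]),
        annotation_version=int(annotations["version"]),
    )


if __name__ == "__main__":
    example = """
    {
        "genome": "grch38",
        "annotations": {"source": "GENCODE", "version": 35}
    }
    """
    print(parse_genome_build(example))
```

Reading the build from a parsed record like this avoids the file-name checks on disk that were a concern earlier in the thread.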

mccalluc · 2022-07-19T15:30:11Z

Ilan says:

This just kind of fell by the wayside. I’ll have a look again, I don’t remember where this was left.

ilan-gold added the enhancement and data-integration labels Nov 3, 2020

ngehlenborg mentioned this issue May 24, 2021

Visualize bulkATACseq? hubmapconsortium/portal-ui#1334

Closed

ngehlenborg mentioned this issue Jun 3, 2021

create Gosling-based genome view vitessce/vitessce#955

Open

mccalluc added the feature: vitessce label Jan 19, 2022

mccalluc transferred this issue from hubmapconsortium/portal-ui Feb 16, 2022
