A continually expanding collection of data-related notes. Please contribute and get in touch! See MDmisc notes for other programming and genomics-related notes.
-
Google dataset search - search for any dataset. Overview
-
Registry of Open Data on AWS - Amazon-hosted datasets. Genomics, satellite imagery, population statistics, many more
-
library(help = "datasets") or data() - shows built-in R datasets
-
A list of over 1,000 datasets available in R packages, curated by @VincentAB.
-
curran/data - A collection of public data sets, primarily in text format
-
Tidy Tuesday - A weekly social data project in R with curated datasets
-
dsbox - Data Science in the Box datasets
-
dslabs - Data Science Labs - Datasets and functions that can be used for data analysis practice, homework and projects in data science courses and workshops. 26 datasets are available for case studies in data visualization, statistical inference, modeling, linear regression, data wrangling and machine learning. Made by Rafael Irizarry and Amy Gill.
-
Multi-omics studies by BIRSBiointegration Hackathon - three single-cell multi-omics hackathons: Spatial transcriptomics, Spatial proteomics and scNMT-seq. Tweet
-
OmicsDI - Omics Discovery Index platform for searching multi-omics datasets. Integrates proteomics, genomics, metabolomics, and transcriptomics datasets. Includes data from multiple databases, from GEO and EGA to LINCS, dbGaP, and more. Search by organism, disease, tissue, gene identifiers, keywords.
- Perez-Riverol, Yasset. “Discovering and Linking Public Omics Data Sets Using the Omics Discovery Index.” Nature Biotechnology 35, no. 5 (2017)
-
FANTOM6 data update. New noncoding RNA annotations, remapped miRNA atlases, new Transcription initiation peaks data, MotifActivity, KD of 285 lncRNAs - expression and phenotyping data. RADICL-seq - RNA-chromatin interaction BEDPE data. RefEx reference expression dataset, and more in the paper.
- Abugessaisa, Imad, Jordan A Ramilowski, Marina Lizio, Jesicca Severin, Akira Hasegawa, Jayson Harshbarger, Atsushi Kondo, et al. “FANTOM Enters 20th Year: Expansion of Transcriptomic Atlases and Functional Annotation of Non-Coding RNAs”
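The RADICL-seq interactions mentioned above are distributed as BEDPE, a paired-interval text format. The following is a minimal sketch of reading one BEDPE record, assuming the common bedtools column convention (chrom1, start1, end1, chrom2, start2, end2, followed by optional columns). The `Interaction` type, the `parse_bedpe_line` helper, and the example line are hypothetical and only illustrate the layout; the actual FANTOM6 files may carry different optional columns.

```python
from typing import NamedTuple


class Interaction(NamedTuple):
    """One BEDPE record: two genomic intervals plus any optional columns."""
    chrom1: str
    start1: int
    end1: int
    chrom2: str
    start2: int
    end2: int
    extra: tuple


def parse_bedpe_line(line):
    """Split a tab-separated BEDPE line into a typed record (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        raise ValueError("BEDPE needs at least 6 tab-separated columns")
    return Interaction(
        fields[0], int(fields[1]), int(fields[2]),
        fields[3], int(fields[4]), int(fields[5]),
        tuple(fields[6:]),  # name, score, strands, ... kept as raw strings
    )


# Made-up example line, not taken from any real dataset.
example = "chr1\t100\t200\tchr5\t5000\t5100\tpair1\t7\t+\t-"
interaction = parse_bedpe_line(example)
print(interaction.chrom1, interaction.start1, interaction.chrom2, interaction.extra)
```

Keeping the optional columns as an untyped tuple avoids guessing their meaning, which varies between producers of BEDPE files.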
-
Go Get Data (GGD) - access to scientific data via Conda recipes. BED files of various genomic features, from cancer genes to structural variants and more. Homo sapiens, Mus musculus, Danio rerio, various genomic assemblies.
-
Cormier, Michael, Jonathan Belyeu, Brent S Pedersen, Joseph Brown, Johannes Koster, and Aaron Quinlan. “Go Get Data (GGD): Simple, Reproducible Access to Scientific Data.” Preprint. Bioinformatics, September 11, 2020
-
Human Pangenome Reference Consortium - HG002 Data Freeze (v1.0). Genome sequencing data using different technologies (PacBio, Oxford Nanopore, Hi-C, Strand-seq, 10X Genomics, BioNano, Illumina). https://github.com/human-pangenomics/HG002_Data_Freeze_v1.0
-
ATAC-seq in mouse tissues. Kundaje lab pipeline, reproducible peaks. https://figshare.com/collections/An_ATAC-seq_atlas_of_chromatin_accessibility_in_mouse_tissues/4436264/1
- Liu, Chuanyu, Mingyue Wang, Xiaoyu Wei, Liang Wu, Jiangshan Xu, Xi Dai, Jun Xia, et al. “An ATAC-Seq Atlas of Chromatin Accessibility in Mouse Tissues.” Scientific Data 6, no. 1 (December 2019): 65. https://doi.org/10.1038/s41597-019-0071-0.
-
b3get - a Python module to download Broad Bioimage Benchmark Collection images. https://github.com/psteinb/b3get
-
PMLB - A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms. https://github.com/EpistasisLab/penn-ml-benchmarks
- Olson, Randal S., William La Cava, Patryk Orzechowski, Ryan J. Urbanowicz, and Jason H. Moore. “PMLB: A Large Benchmark Suite for Machine Learning Evaluation and Comparison.” BioData Mining 10 (2017): 36. https://doi.org/10.1186/s13040-017-0154-4.
-
https://discover.repositive.io/ - Google for genomics
-
PharmacoGx - Analysis of Large-Scale Pharmacogenomic Data, https://bioconductor.org/packages/release/bioc/html/PharmacoGx.html
-
datamicroarray - A collection of small-sample, high-dimensional microarray data sets to assess machine-learning algorithms and models
-
UC Irvine Machine Learning Repository - Classical aggregator of well-curated datasets for machine learning
-
20 Big Data Repositories You Should Check Out. https://www.datasciencecentral.com/profiles/blogs/20-big-data-repositories-you-should-check-out-1
-
OpenML - platform for sharing machine learning data, tasks, and experiments, with an API and an R package to access it. https://www.openml.org/home. mlr R package for machine learning in R. https://mlr.mlr-org.com/
- Casalicchio, Giuseppe, Jakob Bossek, Michel Lang, Dominik Kirchhoff, Pascal Kerschke, Benjamin Hofner, Heidi Seibold, Joaquin Vanschoren, and Bernd Bischl. “OpenML: An R Package to Connect to the Machine Learning Platform OpenML.” Computational Statistics, June 19, 2017. https://doi.org/10.1007/s00180-017-0742-2.
-
Patent analysis using the Google Patents Public Datasets on BigQuery. https://github.com/google/patents-public-data
-
Places to find CC0 photos and the like. https://github.com/jennybc/free-photos
-
CORGIS Datasets Project - The Collection of Really Great, Interesting, Situated Datasets. https://think.cs.vt.edu/corgis/
-
Inter-university Consortium for Political and Social Research (ICPSR) provides leadership and training in data access, curation, and methods of analysis for the social science research community. https://www.icpsr.umich.edu/icpsrweb/ICPSR/
-
U.S. General Services Administration. 2018. The home of the U.S. Government’s open data. https://www.data.gov/
-
Histology (CIMA) dataset - 2D histological microscopy tissue slices, stained with different stains, and landmarks denoting key-points in each slice. The dataset-histology-landmarks GitHub repo has Python code to work with this data
-
Breast cancer image classification. Data from Stanford Tissue Microarray Database (TMAD) and Breast Cancer Histopathological Database (BreakHis), >6K images. Different variants of ResNet and Inception architectures. Data augmentation (resizing, rotation, cropping, flipping). Training details. Classification into malignant and benign, or into subtypes. Can handle images at different magnifications. ResNet performs better. GitHub repository includes crawler to get images
- M. Jannesari, M. Habibzadeh, H. Aboulkheyr, P. Khosravi, O. Elemento, M. Totonchi, and I. Hajirasouliha. “Breast Cancer Histopathological Image Classification: A Deep Learning Approach.” In 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2405–12, 2018
-
BACH, breast cancer histology images - Hematoxylin and eosin (H&E) stained breast histology microscopy and whole-slide images
-
Medical Data for Machine Learning - a curated list of medical data for machine learning, mostly imaging
-
ANTsRNet - Medical image analysis framework merging ANTsR and deep learning. A collection of deep learning architectures ported to the R language and tools for basic medical image processing. Based on keras and tensorflow with cross-compatibility with our python analog ANTsPyNet
-
SARS-CoV-2 database of 21 uniformly processed COVID-19 scRNA-seq datasets (over 3.2 million cells). Table 1 - COVID-19 data obtained with various technologies. GitHub with processing scripts.
- Tian, Yuan, Lindsay N. Carpp, Helen E. R. Miller, Michael Zager, Evan W. Newell, and Raphael Gottardo. “Single-Cell Immunology of SARS-CoV-2 Infection.” Nature Biotechnology, December 20, 2021. https://doi.org/10.1038/s41587-021-01131-y.
-
Mortality Tracker - web app for public mortality timecourse data analysis and visualization. COVID-19 and more. GitHub, 12min video
- Almeida, Jonas S, Meredith Shiels, Praphulla Bhawsar, Bhaumik Patel, Erika Nemeth, Richard Moffitt, Montserrat Garcia Closas, Neal Freedman, and Amy Berrington. “Mortality Tracker: The COVID-19 Case for Real Time Web APIs as Epidemiology Commons.” Bioinformatics (August 4, 2021)
-
owid/covid-19-data - Data on COVID-19 (coronavirus) cases, deaths, hospitalizations, tests. All countries. Updated daily by Our World in Data. Web site
-
COVID-19 RNA-Seq datasets - A repository for sharing information on available COVID-19 RNA-Seq datasets. Links to GEO resources, and more
-
COVID-19 Crowd Generated Gene and Drug Set Library - experimental, computational, and tweeted drug sets and gene sets, collected by Avi Ma'ayan's lab. Analysis with EnrichR, Venn diagrams
-
COVID19: Coronavirus COVID-19 (2019-nCoV) Epidemic Datasets - Unified tidy format datasets of COVID-19 (2019-nCoV) epidemic across several sources. The data are downloaded in real-time, cleaned and matched with exogenous variables. Tweet, COVID-19 Data Hub
-
COVID-19 Open Research Dataset (CORD-19) - full-text of COVID-related papers, PubMed, bioRxiv & medRxiv, JSON format, downloadable https://pages.semanticscholar.org/coronavirus-research
-
Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE, https://systems.jhu.edu/research/public-health/ncov/, https://github.com/CSSEGISandData/COVID-19
-
Useful projects and resources for COVID-19. https://github.com/soroushchehresa/awesome-coronavirus
-
Up-to-date data resources, analysis tools, and visualizations for COVID-19 pandemic data https://seandavi.github.io/sars2pack, https://github.com/seandavi/sars2pack
-
An ongoing repository of data on coronavirus cases and deaths in the U.S. https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html, https://github.com/nytimes/covid-19-data
-
Open-Access Data and Computational Resources to Address COVID-19 https://datascience.nih.gov/covid-19-open-access-resources
-
Novel Coronavirus 2019 time series data on cases https://datahub.io/core/covid-19, https://github.com/datasets/covid-19
-
Repository of COVID-19 forecasts in the US, https://github.com/reichlab/covid19-forecast-hub
-
The coronavirus R package provides a tidy format dataset of the 2019 Novel Coronavirus COVID-19 (2019-nCoV) epidemic. The raw data are pulled from the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) Coronavirus repository. https://github.com/RamiKrispin/coronavirus
-
COVID-19 Italia - Monitoraggio situazione (COVID-19 Italy, situation monitoring). https://github.com/pcm-dpc/COVID-19
-
Time Series of the Covid19 epidemic data in Italy. https://github.com/DavideMagno/ItalianCovidData
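Several of the COVID-19 resources above advertise a "tidy format", meaning one row per region and date rather than one column per date. The following is a minimal sketch of that reshaping, assuming a wide layout with the identifier columns `Province/State`, `Country/Region`, `Lat`, and `Long` followed by one column per date, as in JHU CSSE-style time series. The sample rows, place names, and counts are made up for illustration, and `wide_to_tidy` is a hypothetical helper, not part of any of the listed packages.

```python
import csv
import io

# Hypothetical sample in a wide layout: one row per region, one column per date.
# Place names and counts are invented for illustration only.
WIDE_SAMPLE = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20
,Examplestan,10.0,20.0,0,1,3
North,Samplia,30.0,40.0,2,2,5
"""

# Columns that identify a region; every other column is assumed to be a date.
ID_COLUMNS = ("Province/State", "Country/Region", "Lat", "Long")


def wide_to_tidy(text):
    """Return one dict per (region, date) observation."""
    tidy = []
    for row in csv.DictReader(io.StringIO(text)):
        for column, value in row.items():
            if column in ID_COLUMNS:
                continue
            tidy.append({
                "province": row["Province/State"],
                "country": row["Country/Region"],
                "date": column,  # kept as the original header string
                "cases": int(value),
            })
    return tidy


tidy_rows = wide_to_tidy(WIDE_SAMPLE)
for record in tidy_rows:
    print(record)
```

The long layout makes it straightforward to filter by date or country and to join with other per-day tables, which is the main reason the listed packages ship data in this form.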
-
Gitenberg is a collaborative, open source community curating and publishing highly usable and attractive ebooks in the public domain. Our books are free to use by anyone for any purpose. They contain detailed metadata and are accessible in a wide variety of formats. https://gitenberg.org/
-
Journals publishing data
-
Data Notes - making it easier to publish your hidden data. https://bmcresnotes.biomedcentral.com/about/introducing-data-notes