Releases: IHEC/epiATLAS-metadata-harmonization

Version 1.3

11 Dec 22:01
Version 1.3 Pre-release

Version 1.3

At this time, this repository is only for sample metadata, not experiment metadata.
For more information about experiment metadata check out the IHEC Data Portal and EpiATLAS.
You can also find this metadata on EpiRR.
There is metadata available for 2279 EpiRR entries.
The CSV for the sample metadata can be found at openrefine/v1.3/IHEC_metadata_harmonization.v1.3.csv and the extended version at openrefine/v1.3/IHEC_metadata_harmonization.v1.3.extended.csv

News

  • New column harmonized_sample_label which is a sample label based on sample ontology and sample disease using common terms that might connect multiple ontologies or columns by Martin Hirst.
  • Ordering of rows is now based on the following columns (harmonized_sample_ontology_term_high_order_fig1 and harmonized_sample_ontology_intermediate ordered manually; age sorted as double; other columns sorted ignoring case) in this order: harmonized_sample_ontology_term_high_order_fig1, harmonized_sample_ontology_intermediate, harmonized_sample_label, harmonized_sample_disease_high, harmonized_sample_disease_intermediate, harmonized_donor_sex, automated_harmonized_donor_age_in_years, and EpiRR.
  • Some changes in harmonized_sample_ontology_intermediate.
  • Fixed harmonized_donor_life_stage for 5 entries.
  • Extended version: harmonized_sample_ontology_term_high_order_fig1_color contains a coloring for each value in harmonized_sample_ontology_term_high_order_fig1.
  • Extended version: harmonized_sample_ontology_intermediate_color contains a coloring for harmonized_sample_ontology_intermediate.
  • Extended version: In addition to the columns harmonized_donor_sex and harmonized_donor_life_stage that have been complemented and corrected, based on the high confidence predictions of the EpiClass tool, the extended version now contains the columns without these corrections, i.e., ${column}_uncorrected.
  • Extended version: The columns containing information about whether data is available have been renamed to contain the assay name, e.g., automated_experiments_ChIP-Seq_H3K27ac. WGBS and RNA-Seq columns have been separated by PBAT vs. standard and mRNA-Seq vs. total-RNA-Seq.
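
The ordering rule described in the News list can be sketched as a composite sort key. This is only an illustration: the manual orders for the two ontology columns are not given in these notes, so HIGH_ORDER and INTERMEDIATE_ORDER below are hypothetical placeholders, and sorting missing ages last is an assumption.

```python
import math

# Hypothetical manual orders; the real ones are curated by the maintainers.
HIGH_ORDER = ["Blood", "Brain", "Cell Line"]
INTERMEDIATE_ORDER = ["T cell", "B cell", "neuron"]

# Columns compared ignoring case, in the order given in the release notes.
TEXT_COLUMNS = [
    "harmonized_sample_label",
    "harmonized_sample_disease_high",
    "harmonized_sample_disease_intermediate",
    "harmonized_donor_sex",
]


def manual_rank(value, order):
    """Position in the manual order; unknown values sort after known ones."""
    return order.index(value) if value in order else len(order)


def age_as_float(value):
    """Parse the age as a double; missing or unparsable ages sort last (assumption)."""
    try:
        age = float(value)
    except (TypeError, ValueError):
        return math.inf
    return math.inf if math.isnan(age) else age


def sort_key(row):
    return (
        manual_rank(row["harmonized_sample_ontology_term_high_order_fig1"], HIGH_ORDER),
        manual_rank(row["harmonized_sample_ontology_intermediate"], INTERMEDIATE_ORDER),
        *(row[column].casefold() for column in TEXT_COLUMNS),
        age_as_float(row["automated_harmonized_donor_age_in_years"]),
        row["EpiRR"].casefold(),
    )


def make_row(epirr, high, intermediate, label, age):
    """Build a made-up row; values are illustrative, not taken from the table."""
    return {
        "EpiRR": epirr,
        "harmonized_sample_ontology_term_high_order_fig1": high,
        "harmonized_sample_ontology_intermediate": intermediate,
        "harmonized_sample_label": label,
        "harmonized_sample_disease_high": "Healthy/None",
        "harmonized_sample_disease_intermediate": "None",
        "harmonized_donor_sex": "female",
        "automated_harmonized_donor_age_in_years": age,
    }


rows = [
    make_row("IHECRE00000003.1", "Brain", "neuron", "Neuron", "30"),
    make_row("IHECRE00000002.1", "Blood", "T cell", "t cell", "45.5"),
    make_row("IHECRE00000001.1", "Blood", "T cell", "T Cell", "9"),
]
ordered = sorted(rows, key=sort_key)
print([row["EpiRR"] for row in ordered])
```

Sorting the age as a number matters here: compared as strings, "45.5" would come before "9".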

Raw Files

In case you are interested in the raw files that the harmonization process was based on, those can be found at raw/EpiAtlas_EpiRR_metadata_all.csv.
Note that they contain different columns, as they changed during the harmonization process.

Diff

The overall diff between v1.2 and v1.3 can be found at openrefine/v1.3/diff_v1.2_v1.3.json

Metadata Standard

Please keep in mind that we try to stay as close to the IHEC Metadata Standard as possible.

Column descriptions:

The table below describes the columns included in the metadata table and the extended metadata table.

Column | Examples | Explanation | # Not Null (%)
EpiRR | IHECRE00000001.4 | EpiRR identifier. The number behind the dot (.) is the version. | 2279 (100.0%)
project | CEEHRC BLUEPRINT | The project from which the epigenome originated. | 2279 (100.0%)
harmonized_biomaterial_type | cell line primary cell primary cell culture primary tissue | One of primary cell, primary cell culture, cell line, primary tissue. | 2279 (100.0%)
harmonized_sample_label | B Lymphocyte Acute Lymphoblastic Leukemia | Sample label based on sample ontology and sample disease using common terms that might connect multiple ontologies or columns by Martin Hirst. | 2279 (100.0%)
harmonized_sample_ontology_intermediate | T cell epithelial cell derived cell line | A manually refined higher level annotation describing the samples using ancestors in the ontology. | 2279 (100.0%)
harmonized_sample_ontology_intermediate_color | "143,81,121" | Extended only. A unique color for each unique entry in harmonized_sample_ontology_intermediate. | 2246 (98.6%)
harmonized_sample_disease_high | Healthy/None Cancer Disease | A manually refined higher level annotation describing the disease using only three categories: Healthy/None, Cancer, Disease. | 2279 (100.0%)
harmonized_sample_disease_intermediate | Carcinoma Leukemia | A manually refined higher level annotation describing the disease for this particular sample using ancestors in the NCIT ontology. NCIM CURIEs were mapped to NCIT CURIEs, see version 0.9 for explanation. | 2279 (100.0%)
harmonized_EpiRR_status | Complete Partial | Whether this epigenome is Complete or Partial. | 2279 (100.0%)
epiATLAS_status | Complete Partial Complete_imputed | Equivalent to harmonized_EpiRR_status but referring to the reprocessed data rather than original submitted data, describing the status of the reference epigenome with the additional information of full epigenomes when using imputed data. | 2279 (100.0%)
harmonized_cell_type | myeloid cell effector memory CD8-positive, alpha-beta T cell | The cell type and main sample ontology classification for entries where biomaterial_type is primary cell or primary cell culture. | 1561 (68.5%)
harmonized_cell_line | MCF 10A | The cell line and main sample ontology classification for entries where biomaterial_type is cell line. | 73 (3.2%)
harmonized_tissue_type | skeletal muscle tissue amygdala | ...

Version 1.3

30 Sep 08:07
Version 1.3 Pre-release

Version 1.3

At this time, this repository is only for sample metadata, not experiment metadata.
For more information about experiment metadata check out the IHEC Data Portal and EpiATLAS.
You can also find this metadata on EpiRR.
There is metadata available for 2279 EpiRR entries.
The CSV for the sample metadata can be found at openrefine/v1.3/IHEC_metadata_harmonization.v1.3.csv and the extended version at openrefine/v1.3/IHEC_metadata_harmonization.v1.3.extended.csv

News

  • Fixed harmonized_donor_life_stage for 5 entries.
  • Extended version: harmonized_sample_ontology_intermediate_color contains a coloring for each value in harmonized_sample_ontology_intermediate.
  • Extended version: In addition to the columns harmonized_donor_sex and harmonized_donor_life_stage that have been complemented and corrected, based on the high confidence predictions of the EpiClass tool, the extended version now contains the columns without these corrections, i.e., ${column}_uncorrected.
  • Extended version: The columns containing information about whether data is available have been renamed to contain the assay name, e.g., automated_experiments_ChIP-Seq_H3K27ac. WGBS and RNA-Seq columns have been separated by PBAT vs. standard and mRNA-Seq vs. total-RNA-Seq.
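
As a rough sketch of how the ${column}_uncorrected columns could be used, the snippet below lists the fields of one row whose value differs from its uncorrected counterpart. The example row is made up for illustration and is not taken from the real table.

```python
# Columns that, per the notes, have an "_uncorrected" counterpart in the extended CSV.
CORRECTED_COLUMNS = ["harmonized_donor_sex", "harmonized_donor_life_stage"]


def corrected_fields(row):
    """Return {column: (uncorrected, corrected)} for values that differ."""
    changes = {}
    for column in CORRECTED_COLUMNS:
        before = row.get(f"{column}_uncorrected")
        after = row.get(column)
        if before != after:
            changes[column] = (before, after)
    return changes


# Hypothetical row, for illustration only.
example = {
    "EpiRR": "IHECRE00000001.4",
    "harmonized_donor_sex": "female",
    "harmonized_donor_sex_uncorrected": "unknown",
    "harmonized_donor_life_stage": "adult",
    "harmonized_donor_life_stage_uncorrected": "adult",
}
print(corrected_fields(example))
```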

Raw Files

In case you are interested in the raw files that the harmonization process was based on, those can be found at raw/EpiAtlas_EpiRR_metadata_all.csv.
Note that they contain different columns, as they changed during the harmonization process.

Diff

The overall diff between v1.2 and v1.3 can be found at openrefine/v1.3/diff_v1.2_v1.3.json

Metadata Standard

Please keep in mind that we try to stay as close to the IHEC Metadata Standard as possible.

Column descriptions:

The table below describes the columns included in the metadata table and the extended metadata table.

Column | Examples | Explanation | # Not Null (%)
EpiRR | IHECRE00000001.4 | EpiRR identifier. The number behind the dot (.) is the version. | 2279 (100.0%)
project | CEEHRC BLUEPRINT | The project from which the epigenome originated. | 2279 (100.0%)
harmonized_biomaterial_type | cell line primary cell primary cell culture primary tissue | One of primary cell, primary cell culture, cell line, primary tissue. | 2279 (100.0%)
harmonized_sample_ontology_intermediate | T cell epithelial cell derived cell line | A manually refined higher level annotation describing the samples using ancestors in the ontology. | 2279 (100.0%)
harmonized_sample_ontology_intermediate_color | 182,26,57 | Extended only. A unique color for each unique entry in harmonized_sample_ontology_intermediate. | 2279 (100.0%)
harmonized_sample_disease_high | Healthy/None Cancer Disease | A manually refined higher level annotation describing the disease using only three categories: Healthy/None, Cancer, Disease. | 2279 (100.0%)
harmonized_sample_disease_intermediate | Carcinoma Leukemia | A manually refined higher level annotation describing the disease for this particular sample using ancestors in the NCIT ontology. NCIM CURIEs were mapped to NCIT CURIEs, see version 0.9 for explanation. | 2279 (100.0%)
harmonized_EpiRR_status | Complete Partial | Whether this epigenome is Complete or Partial. | 2279 (100.0%)
epiATLAS_status | Complete Partial Complete_imputed | Equivalent to harmonized_EpiRR_status but referring to the reprocessed data rather than original submitted data, describing the status of the reference epigenome with the additional information of full epigenomes when using imputed data. | 2279 (100.0%)
harmonized_cell_type | myeloid cell effector memory CD8-positive, alpha-beta T cell | The cell type and main sample ontology classification for entries where biomaterial_type is primary cell or primary cell culture. | 1561 (68.5%)
harmonized_cell_line | MCF 10A | The cell line and main sample ontology classification for entries where biomaterial_type is cell line. | 73 (3.2%)
harmonized_tissue_type | skeletal muscle tissue amygdala | The tissue type and main sample ontology classification for entries where biomaterial_type is primary tissue. | 2008 (88.1%)
harmonized_sample_ontology_curie | CL:0000990 UBERON:0001876 EFO:0001200 | The CURIE identifying the sample ontology term. Different ontologies are used, depending on the biomaterial_type: 'CL' for primary cell or primary cell culture, 'EFO' for cell line and 'UBERON' for primary tissue. | 2279 (100.0%)
harmonized_cell_markers | CD3+ CD4+ CD45RA+ CD3- CD19- CD56- | Markers used to isolate and identify the cell type, when applicable. | 1144 (50.2%)
automated_harmonized_sample_ontology | CL UBERON EFO | ...

Version 1.2

16 Feb 17:57

Version 1.2

At this time, this repository is only for sample metadata, not experiment metadata.
There is metadata available for 2279 EpiRR entries.
The CSV for the sample metadata can be found at openrefine/v1.2/IHEC_metadata_harmonization.v1.2.csv and the extended version at openrefine/v1.2/IHEC_metadata_harmonization.v1.2.extended.csv

News

  • Added 63 entries that had erroneously been removed in v1.1.
  • The columns harmonized_donor_sex and harmonized_donor_life_stage have been complemented and corrected, based on
    the prediction of the EpiClass tool. For more information on this, please contact Pierre-Étienne Jacques.
  • Some minor changes to sample_disease and donor_health_status columns.
  • Added column epiATLAS_status which is equivalent to harmonized_EpiRR_status but referring to the reprocessed data
    rather than original submitted data, describing the status of the reference epigenome with the additional information
    of full epigenomes when using imputed data.
  • Extended version: Added columns for each assay type (histone marks, wgbs, and
    rna-seq) automated_experiments_${assay} containing the uuid for observed data, or imputed if only imputed data is
    available.
  • Extended version: Added column harmonized_sample_ontology_term_high_order_fig1
  • Extended version: Columns sample_ontology_term_high_order_JeffreyHyacinthe
    and sample_ontology_term_high_order_JonathanSteif have been removed and replaced
    by harmonized_sample_ontology_term_high_order_fig1 containing the sample labels corresponding to the annotations in
    the overview figure.
  • Extended version: Added columns harmonized_sample_[...]_order_AnetaMikulasova containing labels manually assigned by Aneta Mikulasova, which contain information about organ, cell, and cancer (sub-)types.
  • Extended version: Removed columns automated_harmonized_($column)_($order)(_unique)?, e.g., automated_harmonized_sample_ontology_term_intermediate_order_unique, containing the automatically extracted higher order annotations as described in v0.9. These columns were used to derive the harmonized_sample_ontology_intermediate and harmonized_sample_disease_intermediate columns, but this was based on older versions of these columns. The columns are still generated internally for checking purposes, but they could confuse users and are not necessary for the metadata.
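
The automated_experiments_${assay} columns described in the News list can be summarized per row roughly as follows. This is a sketch under assumptions: the literal cell value "imputed", the treatment of an empty cell as missing data, and the assay names and uuid in the example row are all illustrative and not confirmed by these notes.

```python
PREFIX = "automated_experiments_"


def assay_availability(row):
    """Map assay name -> 'observed', 'imputed', or 'missing' for one row."""
    status = {}
    for column, value in row.items():
        if not column.startswith(PREFIX):
            continue
        assay = column[len(PREFIX):]
        value = (value or "").strip()
        if not value:
            status[assay] = "missing"   # assumption: empty cell means no data
        elif value == "imputed":
            status[assay] = "imputed"   # only imputed data is available
        else:
            status[assay] = "observed"  # any other value is treated as a uuid
    return status


# Hypothetical row; assay names and the uuid are made up.
example = {
    "EpiRR": "IHECRE00000001.4",
    "automated_experiments_H3K27ac": "123e4567-e89b-12d3-a456-426614174000",
    "automated_experiments_H3K9me3": "imputed",
    "automated_experiments_WGBS": "",
}
print(assay_availability(example))
```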

Diff

The overall diff between v1.1 and v1.2 can be found at openrefine/v1.2/diff_v1.1_v1.2.json

Metadata Standard

Please keep in mind that we try to stay as close to the IHEC Metadata Standard as possible.

Column descriptions:

The table below describes the columns included in the metadata table and the extended metadata table.

Column | Examples | Explanation | # Not Null (%)
EpiRR | IHECRE00000001.4 | EpiRR identifier. The number behind the dot (.) is the version. | 2279 (100.0%)
project | CEEHRC BLUEPRINT | The project from which the epigenome originated. | 2279 (100.0%)
harmonized_biomaterial_type | cell line primary cell primary cell culture primary tissue | One of primary cell, primary cell culture, cell line, primary tissue. | 2279 (100.0%)
harmonized_sample_ontology_intermediate | T cell epithelial cell derived cell line | A manually refined higher level annotation describing the samples using ancestors in the ontology. | 2279 (100.0%)
harmonized_sample_disease_high | Healthy/None Cancer Disease | A manually refined higher level annotation describing the disease using only three categories: Healthy/None, Cancer, Disease. | 2279 (100.0%)
harmonized_sample_disease_intermediate | Carcinoma Leukemia | A manually refined higher level annotation describing the disease for this particular sample using ancestors in the NCIT ontology. NCIM CURIEs were mapped to NCIT CURIEs, see version 0.9 for explanation. | 2279 (100.0%)
harmonized_EpiRR_status | Complete Partial | Whether this epigenome is Complete or Partial. | 2279 (100.0%)
epiATLAS_status | Complete Partial Complete_imputed | Equivalent to harmonized_EpiRR_status but referring to the reprocessed data rather than original submitted data, describing the status of the reference epigenome with the additional information of full epigenomes when using imputed data. | 2279 (100.0%)
harmonized_cell_type | myeloid cell effector memory CD8-positive, alpha-beta T cell | The cell type and main sample ontology classification for entries where biomaterial_type is primary cell or primary cell culture. | 1561 (68.5%)
harmonized_cell_line | MCF 10A | The cell line and main sample ontology classification for entries where biomaterial_type is cell line. | 73 (3.2%)
harmonized_tissue_type | skeletal muscle tissue amygdala | The tissue type and main sample ontology classification for entries where biomaterial_type is primary tissue. | 2008 (88.1%)
harmonized_sample_ontology_curie | CL:0000990 UBERON:0001876 EFO:0001200 | The CURIE identifying the sample ontology term. Different ontologies are used, depending on the biomaterial_type: 'CL' for primary cell or primary cell culture, 'EFO' for cell line and 'UBERON' for primary tissue. | 2279 (100.0%)
harmonized_cell_markers | CD3+ CD4+ CD45RA+ CD3- CD19- CD56- | ...

Version 1.1

01 Aug 09:23

Version 1.1

At this time, this repository is only for sample metadata, not experiment metadata.
There is metadata available for 2216 EpiRR entries.
The CSV for the sample metadata can be found at openrefine/v1.1/IHEC_metadata_harmonization.v1.1.csv
Based on tag v1.1.1 because of a change in column order of the extended version.

News

  • Removed entries if no corresponding datasets were reprocessed or all datasets corresponding to an EpiRR entry were pruned.
  • Removed column harmonized_donor_life_status which doesn't contain any information after some entries have been removed (see above).
  • Added column epirr_id_without_version for natural joins with the epimap_metadata.csv, which provides metadata about the reprocessed datasets #85.
  • Extended version: Added column automated_harmonized_donor_age_in_years based on harmonized_donor_age as explained in #86.
  1. Intervals are split by - and the mean is computed.
  2. Values with 'week' or 'day' as harmonized_donor_age_unit are divided by 52 or 365, respectively.
    Note: unknown is converted to nan and 90+ is just converted to 90
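
The conversion rules above can be sketched as a small function. This is a minimal illustration, not the code used to build the column; treating every unit other than 'week' and 'day' as years is an assumption, since the notes only name those two conversions.

```python
import math

# Divisors for the units the notes say are converted.
UNIT_DIVISORS = {"week": 52, "day": 365}


def age_in_years(age, unit):
    """Convert harmonized_donor_age and harmonized_donor_age_unit to years."""
    age = age.strip()
    if age == "unknown":
        return math.nan
    if age == "90+":
        return 90.0
    # Intervals such as "60-65" are split by "-" and averaged.
    parts = [float(part) for part in age.split("-")]
    mean = sum(parts) / len(parts)
    # Assumption: units without a listed divisor are already in years.
    return mean / UNIT_DIVISORS.get(unit, 1)


print(age_in_years("60-65", "year"))
print(age_in_years("26", "week"))
```

With these rules, the interval 60-65 becomes 62.5 and 26 weeks becomes 0.5.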

Diff

The overall diff between v1.0 and v1.1 can be found at openrefine/v1.1/diff_v1.0_v1.1.json

Extended Version:

For more information on the columns from the extended version at openrefine/v1.1/IHEC_metadata_harmonization.v1.1.extended.csv, please also see version 0.9.

Metadata Standard

Please keep in mind that we try to stay as close to the IHEC Metadata Standard as possible.

Column descriptions:

The table below describes the columns included in the metadata table at IHEC_metadata_harmonization.v1.1.csv and the extended metadata table at IHEC_metadata_harmonization.v1.1.extended.csv.

Column | Examples | Explanation | # Not Null (%)
EpiRR | IHECRE00000001.4 | EpiRR identifier. The number behind the dot (.) is the version. | 2216 (100.0%)
project | CEEHRC BLUEPRINT | The project from which the epigenome originated. | 2216 (100.0%)
harmonized_biomaterial_type | cell line primary cell primary cell culture primary tissue | One of primary cell, primary cell culture, cell line, primary tissue. | 2216 (100.0%)
harmonized_sample_ontology_intermediate | T cell epithelial cell derived cell line | A manually refined higher level annotation describing the samples using ancestors in the ontology. | 2216 (100.0%)
harmonized_sample_disease_high | Healthy/None Cancer Disease | A manually refined higher level annotation describing the disease using only three categories: Healthy/None, Cancer, Disease. | 2216 (100.0%)
harmonized_sample_disease_intermediate | Carcinoma Leukemia | A manually refined higher level annotation describing the disease for this particular sample using ancestors in the NCIT ontology. NCIM CURIEs were mapped to NCIT CURIEs, see version 0.9 for explanation. | 2216 (100.0%)
harmonized_EpiRR_status | Complete Partial | Whether this epigenome is Complete or Partial. | 2216 (100.0%)
harmonized_cell_type | myeloid cell effector memory CD8-positive, alpha-beta T cell | The cell type and main sample ontology classification for entries where biomaterial_type is primary cell or primary cell culture. | 1498 (67.6%)
harmonized_cell_line | MCF 10A | The cell line and main sample ontology classification for entries where biomaterial_type is cell line. | 73 (3.3%)
harmonized_tissue_type | skeletal muscle tissue amygdala | The tissue type and main sample ontology classification for entries where biomaterial_type is primary tissue. | 1958 (88.4%)
harmonized_sample_ontology_curie | CL:0000990 UBERON:0001876 EFO:0001200 | The CURIE identifying the sample ontology term. Different ontologies are used, depending on the biomaterial_type: 'CL' for primary cell or primary cell culture, 'EFO' for cell line and 'UBERON' for primary tissue. | 2216 (100.0%)
harmonized_cell_markers | CD3+ CD4+ CD45RA+ CD3- CD19- CD56- | Markers used to isolate and identify the cell type, when applicable. | 1082 (48.8%)
automated_harmonized_sample_ontology | CL UBERON EFO | Extended only. Automatic extraction from harmonized_sample_ontology_curie. The ontology corresponding to the curie, mostly used for other automatic extractions. | 2216 (100.0%)
automated_harmonized_sample_ontology_term | myeloid cell MCF 10A amygdala | Extended only. Automatic extraction from harmonized_sample_ontology_curie. The term corres...

Version 1.0

18 Oct 14:24
99de552

Version 1.0

At this time, this repository is only for sample metadata, not experiment metadata.
The CSV for the sample metadata can be found at openrefine/v1.0/IHEC_metadata_harmonization.v1.0.csv

News

  • The prefix harm has been renamed to harmonized for all columns where at least one cell was changed compared to the original data from EpiRR.
  • The prefix automated was added afterward for all columns that are generated completely automatically and lack manual curation. They are available in the extended version only.
  • The column originally called line has been renamed to cell_line, i.e., now harmonized_cell_line.
  • The column originally called markers has been renamed to cell_markers, i.e., now harmonized_cell_markers.
  • In all columns originally containing disease it has been renamed to sample_disease, to emphasize that this attribute reflects the disease for this particular sample, not the donor health condition.
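
As an illustration of the renaming rules above, the following sketch maps a v0.11 column name (harm_ prefix) to its v1.0 name. It is an assumption that the disease rule applies to every harm_ column whose remainder starts with "disease"; columns that changed for other reasons are not handled here.

```python
def rename_v011_column(name):
    """Apply the v1.0 renaming rules to a v0.11 column name (sketch)."""
    if not name.startswith("harm_"):
        return name  # columns without the harm_ prefix are left untouched here
    rest = name[len("harm_"):]
    if rest == "line":
        rest = "cell_line"
    elif rest == "markers":
        rest = "cell_markers"
    elif rest.startswith("disease"):
        rest = "sample_" + rest  # emphasizes that the disease refers to the sample
    return "harmonized_" + rest


for old in ["harm_line", "harm_markers", "harm_disease_high", "harm_donor_sex", "EpiRR"]:
    print(old, "->", rename_v011_column(old))
```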

Diff

The overall diff between v0.11 and v1.0 can be found at diff_v0.11_v1.0.json

Extended Version:

For more information on the columns from the extended version at IHEC_metadata_harmonization.v1.0.extended.csv, please also see version 0.9.

Metadata Standard

Please keep in mind that we try to stay as close to the IHEC Metadata Standard as possible.

Column descriptions:

The table below describes the columns included in the metadata table at IHEC_metadata_harmonization.v1.0.csv and the extended metadata table at IHEC_metadata_harmonization.v1.0.extended.csv.

Column | Examples | Explanation
EpiRR | IHECRE00000001.4 | EpiRR identifier. The number behind the dot (.) is the version.
project | CEEHRC BLUEPRINT | The project from which the epigenome originated.
harmonized_biomaterial_type | cell line primary cell primary cell culture primary tissue | One of primary cell, primary cell culture, cell line, primary tissue.
harmonized_sample_ontology_intermediate | T cell epithelial cell derived cell line | A manually refined higher level annotation describing the samples using ancestors in the ontology.
harmonized_sample_disease_high | Healthy/None Cancer Disease | A manually refined higher level annotation describing the disease using only three categories: Healthy/None, Cancer, Disease.
harmonized_sample_disease_intermediate | Carcinoma Leukemia | A manually refined higher level annotation describing the disease for this particular sample using ancestors in the NCIT ontology. NCIM CURIEs were mapped to NCIT CURIEs, see version 0.9 for explanation.
harmonized_EpiRR_status | Complete Partial | Whether this epigenome is Complete or Partial.
harmonized_cell_type | myeloid cell effector memory CD8-positive, alpha-beta T cell | The cell type and main sample ontology classification for entries where biomaterial_type is primary cell or primary cell culture.
harmonized_cell_line | MCF 10A | The cell line and main sample ontology classification for entries where biomaterial_type is cell line.
harmonized_tissue_type | skeletal muscle tissue amygdala | The tissue type and main sample ontology classification for entries where biomaterial_type is primary tissue.
harmonized_sample_ontology_curie | CL:0000990 UBERON:0001876 EFO:0001200 | The CURIE identifying the sample ontology term. Different ontologies are used, depending on the biomaterial_type: 'CL' for primary cell or primary cell culture, 'EFO' for cell line and 'UBERON' for primary tissue.
harmonized_cell_markers | CD3+ CD4+ CD45RA+ CD3- CD19- CD56- | Markers used to isolate and identify the cell type, when applicable.
automated_harmonized_sample_ontology | CL UBERON EFO | Extended only. Automatic extraction from harmonized_sample_ontology_curie. The ontology corresponding to the curie, mostly used for other automatic extractions.
automated_harmonized_sample_ontology_term | myeloid cell MCF 10A amygdala | Extended only. Automatic extraction from harmonized_sample_ontology_curie. The term corresponding to the curie, mostly used for detecting inconsistencies.
sample_ontology_term_high_order_JeffreyHyacinthe | Cell Line Blood | Extended only. Semi-manual annotation by Jeffrey Hyacinthe. Had been applied to v0.8 ...

Version 0.11

03 Oct 17:35

Version 0.11

The CSV for the metadata can be found at openrefine/v0.11/IHEC_metadata_harmonization.v0.11.csv

News

For all columns that were changed, i.e., “harmonized” in this effort, we added the prefix harm_ to clearly mark that this column has been changed in comparison to the original EpiRR data.
For the columns describing manual annotations, we removed the suffix _order_manual. They can still be distinguished from the automatically extracted higher order annotations in the extended version, because the latter columns still have the suffix _order or _order_unique.

In this version, we rearranged the column order, such that it made more sense to us:

  • First, six columns that describe the most important information about the entry:
    EpiRR, project, harm_biomaterial_type, harm_sample_ontology_intermediate, harm_disease_high, harm_disease_intermediate,
  • Next, the EpiRR_status and the columns describing the sample ontology (cell type, cell line or tissue)
    EpiRR_status, harm_cell_type, harm_line, harm_tissue_type, harm_sample_ontology_curie, harm_markers,
  • Afterwards, two columns stating the disease of this particular sample
    harm_disease, harm_disease_ontology_curie
  • Lastly, nine columns with information about the donor(s) of this sample
    donor_type, harm_donor_id, harm_donor_age, harm_donor_age_unit, harm_donor_life_stage, harm_donor_sex, harm_donor_health_status, harm_donor_health_status_ontology_curie, harm_donor_life_status

Additionally, we added the donor_type column, which describes whether the reference epigenome is from Single donor, Composite or Pooled samples. This information was downloaded from EpiRR directly.

Diff

The overall diff between v0.10 and v0.11 can be found at openrefine/v0.11/diff_v0.10_v0.11.json

Explanations

For a table that describes the columns included in the metadata table, please refer to version 0.10

Extended Version

For explanations concerning the extended version, openrefine/v0.11/IHEC_metadata_harmonization.v0.11.extended.csv, please see version 0.9.

Metadata Standard

Please always keep in mind that we try to stay as close to the IHEC Metadata Standard as possible.

Version 0.10

20 Sep 12:52

The CSV for the metadata can be found at openrefine/v0.10/IHEC_metadata_harmonization.v0.10.csv

The overall diff between v0.9 and v0.10 can be found at openrefine/v0.10/diff_v0.9_v0.10.json

The table below describes the columns included in the metadata table.

For explanations concerning the extended version, please see version 0.9.

Please always keep in mind that we try to stay as close to the IHEC Metadata Standard as possible.

Column | Examples | Explanation
EpiRR | IHECRE00000001.4 | EpiRR identifier. The number behind the dot (.) is the version.
EpiRR_status | Complete Partial | Whether this epigenome is Complete or Partial.
project | CEEHRC BLUEPRINT | The project from which the epigenome originated.
biomaterial_type | cell line primary cell primary cell culture primary tissue | One of primary cell, primary cell culture, cell line, primary tissue.
cell_type | myeloid cell effector memory CD8-positive, alpha-beta T cell | The cell type and main sample ontology classification for entries where biomaterial_type is primary cell or primary cell culture.
line | MCF 10A | The cell line and main sample ontology classification for entries where biomaterial_type is cell line.
tissue_type | skeletal muscle tissue amygdala | The tissue type and main sample ontology classification for entries where biomaterial_type is primary tissue.
sample_ontology_curie | CL:0000990 UBERON:0001876 EFO:0001200 | The CURIE identifying the sample ontology term. Different ontologies are used, depending on the biomaterial_type: 'CL' for primary cell or primary cell culture, 'EFO' for cell line and 'UBERON' for primary tissue.
sample_ontology_term_high_order_manual | other T cell | A manually refined higher level annotation describing the samples using ancestors in the ontology.
markers | CD3+ CD4+ CD45RA+ CD3- CD19- CD56- | Markers used to isolate and identify the cell type, when applicable.
disease | Breast Carcinoma Acute Promyelocytic Leukemia with PML-RARA | This attribute reflects the disease for this particular sample, not the donor health condition.
disease_ontology_curie | NCIM:C0678222 NCIM:C0023487 | The CURIE identifying the NCIM disease ontology term.
disease_high_order_manual | Healthy/None Cancer Disease | A manually refined higher level annotation describing the diseases using only three categories: Healthy/None, Cancer, Disease.
disease_intermediate_order_manual | Carcinoma Leukemia | A manually refined higher level annotation describing the disease for this particular sample using ancestors in the NCIT ontology. NCIM CURIEs were mapped to NCIT CURIEs, see version 0.9 for explanation.
donor_id | CEMT0007 C07015 | Identifier for donors within their projects.
donor_age | 60-65 unknown 46 | Age of donor. Can be an interval.
donor_age_unit | year day | Age unit of donor.
donor_life_stage | embryonic adult | Life stage of donor.
sex | female male | Sex of donor.
donor_health_status | Breast Carcinoma Acute Promyelocytic Leukemia with PML-RARA | Links to the health status of the donor that provided the sample. Does not describe the disease for this particular sample.
donor_health_status_ontology_curie | NCIM:C0023487 NCIM:C0678222 | The CURIE identifying the NCIM donor health status ontology term.
health_state | dead alive | Health state of donor: dead or alive.
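
Since the EpiRR identifier carries its version after the dot, splitting it is a common first step, for example when joining against tables keyed by the unversioned accession. A minimal sketch, with split_epirr being a hypothetical helper name:

```python
def split_epirr(epirr):
    """Split 'IHECRE00000001.4' into the accession and the integer version."""
    accession, _, version = epirr.rpartition(".")
    if not accession or not version.isdigit():
        raise ValueError(f"not a versioned EpiRR identifier: {epirr!r}")
    return accession, int(version)


print(split_epirr("IHECRE00000001.4"))
```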

Version 0.9

17 Aug 12:10

The CSV for the metadata can be found at openrefine/v0.9/IHEC_metadata_harmonization.v0.9.csv

The overall diff between v0.8 and v0.9 can be found at openrefine/v0.9/diff_v0.8_v0.9.json

This version comes with the first “extended” version openrefine/v0.9/IHEC_metadata_harmonization.v0.9.extended.csv that includes higher level annotations for the three ontology columns.
The following columns have been added in comparison to the normal v0.9:

  • donor_health_status_ontology_curie_ncit: mapping from NCIM to NCIT curies for the donor_health_status_ontology_curie
  • disease_ontology_curie_ncit: mapping from NCIM to NCIT curies for the disease_ontology_curie
  • sample_ontology: ontology to use based on the biomaterial_type
  • sample_ontology_term: the ontology term extracted from disease_ontology_curie that should reflect either line, tissue_type or cell_type, depending on the sample_ontology
  • sample_ontology_term_high_order_JeffreyHyacinthe: semi-manual annotation by Jeffrey Hyacinthe. Had been applied to v0.8
  • sample_ontology_term_high_order_JonathanSteif: semi-manual annotation by Jonathan Steif. Had been applied to v0.9 draft
  • sample_ontology_term_high_order_manual: semi-manual annotation using the automatic extraction columns below and the manual annotation above. Created by some members of the IHEC IA metadata group (Pierre-Etienne Jacques, Gabriella Frosi and Quirin Manz). Had been applied to v0.9. Although this is the current higher level annotation for sample_ontology_term, it should be handled with caution, since it's still preliminary and should be checked by others.
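
The sample_ontology column is derived from the biomaterial_type. Using the mapping given in the column descriptions ('CL' for primary cell and primary cell culture, 'EFO' for cell line, 'UBERON' for primary tissue), a consistency check between biomaterial type and CURIE prefix could look like the sketch below; the function names are hypothetical.

```python
# Ontology expected for each biomaterial type, as stated in the column descriptions.
EXPECTED_ONTOLOGY = {
    "primary cell": "CL",
    "primary cell culture": "CL",
    "cell line": "EFO",
    "primary tissue": "UBERON",
}


def curie_prefix(curie):
    """Return the ontology prefix of a CURIE such as 'CL:0000990'."""
    return curie.split(":", 1)[0]


def is_consistent(biomaterial_type, sample_ontology_curie):
    """True if the CURIE prefix matches the ontology expected for the biomaterial type."""
    return EXPECTED_ONTOLOGY.get(biomaterial_type) == curie_prefix(sample_ontology_curie)


print(is_consistent("primary cell", "CL:0000990"))
print(is_consistent("primary tissue", "CL:0000990"))
```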

Note that the sample_ontology_term columns were grouped by their sample_ontology in the automatic extraction.
The following columns are a result of the automatic extraction:

  • ($column)_($order)(_unique)?:
    $column describes the ontology column that the automatic extraction was performed on. One of [sample_ontology_term, donor_health_status_ontology_term, disease_ontology_term]
    $order describes the number of unique terms that are overall allowed in the column (or group for sample_ontology_term). For intermediate_order the maximum number of terms is 30, for high_order it is 15
    _unique suffix is attached if the automatic extraction considered only unique terms for counting before the automatic extraction. If not attached, the extraction was performed on all entries and duplicates were counted as well. This basically reflects the underlying dataset in which the extraction was performed, allowing duplicates or not.
    This results in the following 12 additional columns:
  • sample_ontology_term_intermediate_order_unique:
  • sample_ontology_term_high_order_unique:
  • sample_ontology_term_intermediate_order:
  • sample_ontology_term_high_order:
  • donor_health_status_ontology_term_intermediate_order_unique:
  • donor_health_status_ontology_term_high_order_unique:
  • donor_health_status_ontology_term_intermediate_order:
  • donor_health_status_ontology_term_high_order:
  • disease_ontology_term_intermediate_order_unique:
  • disease_ontology_term_high_order_unique:
  • disease_ontology_term_intermediate_order:
  • disease_ontology_term_high_order:
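
The 12 column names follow mechanically from the ($column)_($order)(_unique)? pattern: 3 ontology columns × 2 orders × 2 variants. A short sketch that enumerates them in the same order as the list above:

```python
from itertools import product

COLUMNS = [
    "sample_ontology_term",
    "donor_health_status_ontology_term",
    "disease_ontology_term",
]
SUFFIXES = ["_unique", ""]
ORDERS = ["intermediate_order", "high_order"]

# Maximum number of unique terms allowed per order, as stated in the notes.
MAX_TERMS = {"intermediate_order": 30, "high_order": 15}

names = [
    f"{column}_{order}{suffix}"
    for column, suffix, order in product(COLUMNS, SUFFIXES, ORDERS)
]

for name in names:
    print(name)
```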

Version 0.8

18 May 09:35

The CSV for the metadata can be found at openrefine/v0.8/IHEC_metadata_harmonization.v0.8.csv

This version includes significant changes to the structure of the table:

  1. Renamed columns according to metadata standard: sample_ontology_term -> sample_ontology_curie
  2. Split the previously merged information for donor_health_status and disease into 4 columns overall:
    donor_health_status is split into donor_health_status and disease
    disease_ontology_term is split into donor_health_status_ontology_curie and disease_ontology_curie
  3. Removed entries not associated with humans and dropped the taxon_id column.
    A list of the removed IDs can be found in openrefine/v0.8/removed_entries.csv

The overall diff between v0.7 and v0.8 can be found at openrefine/v0.8/diff_v0.7_v0.8.json

Version 0.7

15 Mar 17:37

The CSV for the metadata can be found at openrefine/v0.7/IHEC_metadata_harmonization.v0.7.csv (permanent link)