This python package contains functions to interact with the Genomics England 100,000 Genomes Project data.
The Cohort class takes in a dictionary with icd-10, hpo or disease terms and gathers a list of participants and their age, mortality, ancestry, and sex.
The Survdat class contains functions to extract diagnosis dates, date of death and last follow ups with the clinic to calculate and plot a Kaplan-Meier survival analysis (stratifying on Domain variants).
input a dictionary with keys of: icd-10 codes, hpo terms, cancer terms, disease terms.
Sources:
- ICD-10:
- hes-apc
- hes-ae
- hes-op
- mortality
- mhsds_medical_history_previous_diagnosis
- mhsds_provisional_diagnosis
- mhsds_primary_diagnosis
- mhsds_secondary_diagnosis
- mhsds_care_activity
- mhsds_indirect_activity
- cancer_invest_sample_pathology
- cancer_participant_tumour
- cancer_registry
- rtds
- sact
- av_tumour
- HPO terms
- rare_diseases_participant_phenotype
- cancer terms
- cancer_participant_disease
- disease terms
- rare_diseases_participant_disease
Cancer participants:
- PRIMARY cancers in cancer_analysis
- keeping unique participant_ids + disease_types.
Date of death:
- Earliest date between death_details and mortality tables
Date of diagnosis:
- Sources:
- cancer_participant_tumour
- av_tumour
- cancer_registry
- Approach:
- Translate ICD-10 code to a gel_disease_type
- Sort tables by date and keep only the first diagnosis date for each disease type.
- merge sources and keep unique values per participant.
- merge dod with cancer_analysis matching both the participant_id and disease_types
- inferring missing date of diagnosis from averages only occurs if the --infer flag is True.
Date of last follow up:
- the last known interaction within HES data.
MIT license