Skip to content

Generate interactive, user-defined visualisations to help defining inclusion/exclusion criteria on a metadata table

License

Notifications You must be signed in to change notification settings

FranckLejzerowicz/Xclusion_criteria

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

77 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Xclusion_criteria

Generate interactive, user-defined visualisations to help defining inclusion/exclusion criteria on a metadata table.

Description

Defining inclusion/exclusion criteria can be trouble some and the subsetting of metadata file for non-expert users can be a challenge. This tools allows to apply a series of inclusion/exclusion criteria in a given order and allows retrieving both the included and excluded sample selection along with user-defined visualization hat allows scrutinizing categories that might be of interest for a future study (e.g. to make sure that particular study groups are balanced.)

Installation

git clone https://github.com/FranckLejzerowicz/Xclusion_criteria.git
cd Xclusion_criteria
pip install -e .

and then if there are updates

pip install --upgrade git+https://github.com/FranckLejzerowicz/Xclusion_criteria.git

*Note that python and pip should be python3

Input

  • [REQUIRED] option -m: Path to the metadata file (can read a tab-, comma- or semi-colon-separated table. The names will be lower-cased and the sample IDs column will be renamed sample_name).

  • [REQUIRED] option -c: Path the a yaml file containing the criteria.

    init:
      antibiotic_history,1:
        - 'I have not taken antibiotics in the past year.'
        - 'Year'
      age_cat,0:
        - 'NULLS'
        - '70+'
        - 'child'
        - 'teen'
        - 'baby'
      bmi,2:
        - '18'
        - 'None'
    add:
      alcohol_consumption,1:
        - 'No'
    filter:
      alcohol_types_red_wine,1:
        - 'Yes'
    no_nan:
      - bmi
    

    There are four possible main filtering steps:

    • init
    • add
    • filter
    • no_nan

    For the three first steps (init, add and filter), the format is exactly the same and consists in providing 3 pieces of information for each variable based on which to filter (multi-variable filtering will soon be possible!):

    1. the variable name (as in the metadata but lower case), e.g. antibiotic_history
    2. a numeric indicator telling whether the filtering based on the variable's content should be a "remove it", "keep it" or "must be in range", e.g. the 0 in antibiotic_history,0::
      • "remove it": 0
      • "keep it": 1
      • "must be in range": 2
    3. the list of factors that are considered for the filtering based on the variable (must be exactly as in the table), e.g. for antibiotic_history,0:
      - "I have not taken antibiotics in the past year."
      - "Year"
      

    For the numeric indicator 2, the range must be composed of two values: a minimum and a maximum (in this order), e.g.

    age,2:
      - 10
      - 70
    

    It is possible to not set a minimum or a maximum bound, by writing "None" instead, e.g.

    age,2:
      - 10
      - None
    

    (but there must be 2 items...)

    The fourth key no_nan is special: if present, it lists the variables that will be filtered so that no sample will be left that has a missing value for these variables. These missing values are formal NumPy's "nan" (np.nan), as well as any of these terms:

    • unknown
    • unspecified
    • not provided
    • not applicable
    • missing
    • nan

    (this default missing values vocabulary can be edited, in file ./Xclusion_criteria/resources/nulls.txt)

  • option -p: Path to a yaml file containing the plotting's categorical variables, e.g.:

    categories:
      - bmi_cat
      - age_years
      - types_of_plants
    

    There will be barplot bars for each factor of each of these categorical variables (see image below).

Outputs

  • [REQUIRED] option -in: Metadata table reduced to the samples satisfying all the inclusion criteria (the selecion).
  • option -ex: Metadata table reduced to the samples not satisfying a least one inclusion criteria.
  • option -v: Interactive visualization composed of three panels (see below).

Example

This command:

 Xclusion_criteria \
    -m Xclusion_criteria/tests/metadata/md.tsv \
    -c Xclusion_criteria/examples/criteria_nonempty_output.yml \
    -p Xclusion_criteria/examples/plot.yml \
    -in Xclusion_criteria/tests/output/md_out.tsv \
    -v Xclusion_criteria/tests/output/md_viz.html

Prints out:

- read input metadata... Done.
- get yml content, i.e. all inclusion/exclusion criteria...
Problems encountered during criteria parsing:
[Warning] Subset values for variable age_cat not in table
 - teen
 - baby
- infer dtypes... Done.
- get the numerical and categorical metadata variables... Done.
- apply filtering criteria to subset the metadata... Done.
- write the metadata for criteria-included samples... Done.
- check there are min 3 categorical and 2 numerical variables...
  [numerical] age_years (n=16/16)
  ...
  [numerical] weight_kg (n=16/16)

  [categorical] acne_medication (n=1: No:15)
  ...
  [categorical] whole_grain_frequency (n=2: Never:2,Occasional:1)
- build the three-panel criteria-based filtering figure...
 --> 3 passed "numerical" variables that are not numerical:
        * vioscreen_micromacro__added_sugars__by_total_sugars__in_g
        * vioscreen_micromacro__energy_in_kcal
        * vioscreen_micromacro__alcohol_in_g
Start making the chart (html) figure
   * make filtering figure... Done
 - get numeric melted table... 
 - get categorical melted table... 
 - merge numeric and categorical tables... 
   * make scatter figure... Done
   * make barplots figure...Done
 - Write figure... Done: Xclusion_criteria/tests/output/md_viz.html

Interactive visualization

the -o html output has 3 panels:

  1. Samples selection progression at each inclusion/exclusion criteria step is reported. Selection (interaction: hovering on the steps/dots with the mouse shows the variable and variable's factors used for selection)

  2. Samples on a scatter plot which x and y axes could be changed using a dropdown menu. Dropdown

  3. Barplots showing the number of samples for each of the different factors of the user-defined categories. By default, these are for all the samples of the final selection. This selection can be further refined by selecting samples using click-and-brush on the scatter. brush

Data fetching

It is possible to perform the downloading and filtering of the microbiome data directly, and perform filtering, by subprocessing another unittested package: Xrbfetch. It applies "technical" exclusion criteria on the microbiome data (while the above only appl) yield sample counts change that are also plotted, and are given in command line:

Parameters:

  • option --fetch: fetch data using redbiom using Xrbfetch
  • option -r: data to fetch, default is Deblur-Illumina-16S-V4-150nt-780653
  • option -s: fasta file with sequences to filter out (see default at Xrbfetch)
  • option -f: minimum number of reads for samples to filter out
  • option --unique: keep only one sample per host (most reads / features
  • option --update: update sample names to get rid of Qiita-prep info (e.g. from 10317.000048372.84675 to 10317.000048372)

Output:

  • option -b: Biom file containing fetched and filtered data
  • option -o: Output metadata for the samples which data could be fetched

Example with data fetching

This command:

 Xclusion_criteria \
    -m Xclusion_criteria/tests/metadata/md.tsv \
    -c Xclusion_criteria/examples/criteria_nonempty_output.yml \
    -p Xclusion_criteria/examples/plot.yml \
    -in Xclusion_criteria/tests/output/md_out.tsv \
    -v Xclusion_criteria/tests/output/md_viz.html \
    --fetch \
    -o Xclusion_criteria/tests/output/md_fetch_metadata.tsv \
    -b Xclusion_criteria/tests/output/md_fetch_data.biom \

Prints out:

- read input metadata... Done.
- get yml content, i.e. all inclusion/exclusion criteria...
Problems encountered during criteria parsing:
[Warning] Subset values for variable age_cat not in table
 - teen
 - baby
- infer dtypes... Done.
- get the numerical and categorical metadata variables... Done.
- apply filtering criteria to subset the metadata... Done.
- write the metadata for criteria-included samples... Done.
- check there are min 3 categorical and 2 numerical variables...
  [numerical] age_years (n=16/16)
  ...
  [numerical] weight_kg (n=16/16)

  [categorical] acne_medication (n=1: No:15)
  ...
  [categorical] whole_grain_frequency (n=2: Never:2,Occasional:1)
- build the three-panel criteria-based filtering figure...
 --> 3 passed "numerical" variables that are not numerical:
        * vioscreen_micromacro__added_sugars__by_total_sugars__in_g
        * vioscreen_micromacro__energy_in_kcal
        * vioscreen_micromacro__alcohol_in_g
Start making the chart (html) figure
   * make filtering figure... Done
 - get numeric melted table... 
 - get categorical melted table... 
 - merge numeric and categorical tables... 
   * make scatter figure... Done
   * make barplots figure...Done
 - Write figure... Done: Xclusion_criteria/tests/output/md_viz.html

Optional arguments

Usage: Xclusion_criteria [OPTIONS]

Options:
  -m, --m-metadata-file TEXT    Metadata file on which to apply
                                included/exclusion criteria.  [required]

  -c, --i-criteria TEXT         Must be a yaml file (see README or
                                'examples/criteria.yml').  [required]

  -p, --i-plot-groups TEXT      Must be a yaml file (see README or
                                'examples/criteria.yml').

  -in, --o-included TEXT        Output metadata for the included samples only.
                                [required]

  -ex, --o-excluded TEXT        Output metadata for the excluded samples only.
  -v, --o-visualization TEXT    Output metadata explorer for the included
                                samples only.

  -r, --p-random INTEGER        Reduce visualization to the passed number of
                                (random) samples.  [default: 100]

  --fetch / --no-fetch          Run Xrbfetch (third-party) to get features
                                data.  [default: False]

  -o, --o-metadata-file TEXT    [if --fetch] Path to the output metadata table
                                file. (Default = add _fetched_#s.tsv).

  -b, --o-biom-file TEXT        [if --fetch] Path to the output biom table
                                file. (Default = add _fetched_#s.biom).

  -x, --p-redbiom-context TEXT  [if --fetch] Redbiom context for fetching 16S
                                data from Qiita.  [default: Deblur-
                                Illumina-16S-V4-150nt-780653]

  -s, --p-bloom-sequences TEXT  [if --fetch] Fasta file containing the
                                sequences known to bloom in fecal samples
                                (defaults to 'newblooms.all.fasta' file from
                                Xrbfetch package's folder 'resources').

  -f, --p-reads-filter INTEGER  [if --fetch] Minimum number of reads per
                                sample.  [default: 1500]

  --unique / --no-unique        [if --fetch] Keep a unique sample per host
                                (most read, or most features).  [default:
                                True]

  --update / --no-update        [if --fetch] Update the sample names to remove
                                Qiita-prep info.  [default: True]

  --dim / --no-dim              [if --fetch] Add the number of samples in the
                                final biom file name before extension (e.g.
                                for '-b out.biom' it becomes
                                'out_1000s.biom').  [default: True]

  --version                     Show the version and exit.
  --help                        Show this message and exit.

Bug Reports

contact flejzerowicz@health.ucsd.edu

About

Generate interactive, user-defined visualisations to help defining inclusion/exclusion criteria on a metadata table

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published