add pandas/notebook utilities into genome-grist codebase? #52

ctb · 2021-01-28T14:41:27Z

This seems like generically useful code: it ensures that the various dataframes all cover the same genomes and are sorted in the same order.

from collections import namedtuple

import pandas as pd

SampleDFs = namedtuple('SampleDFs', 'gather_df, all_df, left_df, names_df')

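def load_dfs(outdir, sample_id):
    """Load the mapping, gather, and genome-info CSVs for one sample
    and check that they all cover the same genomes."""
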
    # load mapping CSVs
    all_df = pd.read_csv(f'{outdir}/minimap/depth/{sample_id}.summary.csv')
    left_df = pd.read_csv(f'{outdir}/leftover/depth/{sample_id}.summary.csv')

    # load gather CSV
    gather_df = pd.read_csv(f'{outdir}/genbank/{sample_id}.x.genbank.gather.csv')

    # names!
    names_df = pd.read_csv(f'{outdir}/genbank/{sample_id}.genomes.info.csv')

    # connect gather_df to all_df and left_df using 'genome_id'
    def fix_name(x):
        return "_".join(x.split('_')[:2]).split('.')[0]

    gather_df['genome_id'] = gather_df['name'].apply(fix_name)
    names_df['genome_id'] = names_df['acc'].apply(fix_name)

    # check that all dataframes are copacetic
    in_gather = set(gather_df.genome_id)
    in_left = set(left_df.genome_id)
    assert in_gather == in_left
    assert in_gather == set(names_df.genome_id)
    assert in_gather == set(all_df.genome_id)

    # re-sort all_df, left_df and names_df to match gather_df order, using the
    # matching genome_id column. set_index/reindex/reset_index return new
    # dataframes, so the result must be assigned back.
    all_df = all_df.set_index("genome_id").reindex(index=gather_df["genome_id"]).reset_index()

    left_df = left_df.set_index("genome_id").reindex(index=gather_df["genome_id"]).reset_index()

    #left_df["mapped_bp"] = (1 - left_df["percent missed"]/100) * left_df["genome bp"]
    #left_df["unique_mapped_coverage"] = left_df.coverage / (1 - left_df["percent missed"] / 100.0)

    names_df = names_df.set_index("genome_id").reindex(index=gather_df["genome_id"]).reset_index()
    
    return SampleDFs(gather_df, all_df, left_df, names_df)
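
The reordering step depends on `set_index`, `reindex`, and `reset_index` each returning a new DataFrame rather than modifying the original, so the result has to be assigned back. Here is a minimal sketch of that pattern on toy data. `genome_id_from_name` mirrors `fix_name` above; `reorder_like` and the accession strings are made up for illustration and are not part of genome-grist.

```python
import pandas as pd


def genome_id_from_name(name):
    # Same logic as fix_name above: keep the accession prefix and number,
    # drop the version suffix and any trailing description.
    return "_".join(name.split("_")[:2]).split(".")[0]


def reorder_like(df, order, key="genome_id"):
    # Hypothetical helper: return a copy of df whose rows follow `order`.
    # Each call returns a new DataFrame; nothing is changed in place.
    return df.set_index(key).reindex(list(order)).reset_index()


# Toy gather results (invented accessions), in gather order.
gather_df = pd.DataFrame(
    {"name": ["GCF_000002.1 species B", "GCA_000001.2 species A"]}
)
gather_df["genome_id"] = gather_df["name"].apply(genome_id_from_name)

# Toy mapping summary, deliberately in a different row order.
all_df = pd.DataFrame(
    {"genome_id": ["GCA_000001", "GCF_000002"], "coverage": [1.5, 3.0]}
)

# Assign the result back; otherwise all_df keeps its original order.
all_df = reorder_like(all_df, gather_df["genome_id"])

assert list(all_df["genome_id"]) == list(gather_df["genome_id"])
```

Passing a plain list to `reindex` keeps the index name set by `set_index`, so `reset_index` restores the `genome_id` column under the same name.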

This was referenced Feb 16, 2022:

move pandas mangling for names, taxonomy into testable python code #88 (Open)

[WIP] add parsing code for notebooks into the genome_grist package #176 (Open)
