add pandas/notebook utilities into genome-grist codebase? #52

Open
ctb opened this issue Jan 28, 2021 · 0 comments

ctb commented Jan 28, 2021

this seems like generically useful code that ensures that the various dataframes are all copacetic and synchronized...

from collections import namedtuple

import pandas as pd

SampleDFs = namedtuple('SampleDFs', 'gather_df, all_df, left_df, names_df')

def load_dfs(outdir, sample_id):

    # load mapping CSVs
    all_df = pd.read_csv(f'{outdir}/minimap/depth/{sample_id}.summary.csv')
    left_df = pd.read_csv(f'{outdir}/leftover/depth/{sample_id}.summary.csv')

    # load gather CSV
    gather_df = pd.read_csv(f'{outdir}/genbank/{sample_id}.x.genbank.gather.csv')

    # names!
    names_df = pd.read_csv(f'{outdir}/genbank/{sample_id}.genomes.info.csv')

    # connect gather_df to all_df and left_df using 'genome_id'
    def fix_name(x):
        return "_".join(x.split('_')[:2]).split('.')[0]

    gather_df['genome_id'] = gather_df['name'].apply(fix_name)
    names_df['genome_id'] = names_df['acc'].apply(fix_name)

    # check that all dataframes are copacetic
    in_gather = set(gather_df.genome_id)
    in_left = set(left_df.genome_id)
    assert in_gather == in_left
    assert in_gather == set(names_df.genome_id)
    assert in_gather == set(all_df.genome_id)

    # re-sort left_df and all_df to match gather_df order, using matching genome_id column
    # (set_index/reindex/reset_index return new dataframes, so the results must be assigned)
    all_df = all_df.set_index("genome_id").reindex(index=gather_df["genome_id"]).reset_index()

    left_df = left_df.set_index("genome_id").reindex(index=gather_df["genome_id"]).reset_index()

    #left_df["mapped_bp"] = (1 - left_df["percent missed"]/100) * left_df["genome bp"]
    #left_df["unique_mapped_coverage"] = left_df.coverage / (1 - left_df["percent missed"] / 100.0)

    # same re-sort for names_df; assign the result, since these calls are not in-place
    names_df = names_df.set_index("genome_id").reindex(index=gather_df["genome_id"]).reset_index()
    
    return SampleDFs(gather_df, all_df, left_df, names_df)
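
A quick sanity check of the re-sorting step, as a minimal sketch on made-up data: the accessions, the `coverage` values, and the `reorder_like` helper are hypothetical and not part of genome-grist. It illustrates that `set_index` on its own leaves the dataframe untouched, while the chained and assigned version puts the rows into `gather_df` order.

```python
import pandas as pd


def fix_name(x):
    # same logic as in load_dfs: keep the first two '_'-separated fields,
    # then drop everything after the first '.'
    return "_".join(x.split('_')[:2]).split('.')[0]


def reorder_like(df, order, key="genome_id"):
    # hypothetical helper: return a new dataframe with rows in the order given by `order`
    return df.set_index(key).reindex(index=order).reset_index()


# made-up gather results, in the order gather reported them
gather_df = pd.DataFrame({"name": ["GCF_000002.1 species B", "GCA_000001.2 species A"]})
gather_df["genome_id"] = gather_df["name"].apply(fix_name)

# made-up mapping summary, in a different order
all_df = pd.DataFrame({"genome_id": ["GCA_000001", "GCF_000002"],
                       "coverage": [1.5, 3.0]})

# calling set_index without assigning the result does not change all_df
all_df.set_index("genome_id")

# the chained, assigned version returns rows in gather_df order
reordered = reorder_like(all_df, gather_df["genome_id"])
print(reordered)
```

Note that `reindex` raises an error if `genome_id` contains duplicates, so this approach assumes one row per genome in each CSV.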