Add feature to download google drive datasets #138

Merged on Aug 5, 2021 (37 commits)
Showing changes from 3 commits
Commits
237d4a9
Add feature to download google drive datasets
ajlail98 Mar 25, 2021
c6fd46d
Add gdown to requirements.txt
ajlail98 Mar 25, 2021
751bd7c
:art: pulling from upstream dev
ajlail98 Apr 29, 2021
b65d14e
:art: Formatted script according to template, renamed variables
ajlail98 May 25, 2021
d7afab6
:art: Changed permissions
ajlail98 May 25, 2021
0f35181
:art: Added unique filenames for each file size
ajlail98 May 25, 2021
fbb6c73
:art: Moved to external folder
ajlail98 May 25, 2021
72f0c99
Moved script to validation and renamed
ajlail98 Jun 4, 2021
bd92f5e
Rename function and add type hints
ajlail98 Jun 4, 2021
ad70968
Add file containing fileIDs to reference
ajlail98 Jun 4, 2021
b7df5f5
Add user input options for files/folders
ajlail98 Jun 9, 2021
0abe3a6
Reformat with black
ajlail98 Jun 9, 2021
df63e97
Change targets variable name
ajlail98 Jun 10, 2021
79484a5
Change "folder" to "dataset"
ajlail98 Jun 10, 2021
662d5bf
Update column names
ajlail98 Jun 10, 2021
7678155
Condense logic into one function
ajlail98 Jun 11, 2021
3ffd397
Change logic to input multiple files and multiple output dirs
ajlail98 Jun 11, 2021
46eafc2
Add logger warnings
ajlail98 Jun 15, 2021
d21f825
Add datasets.py info to setup.py
ajlail98 Jun 15, 2021
54d151d
Change internet_is_connected into an import
ajlail98 Jun 24, 2021
3dd9e63
Add internet connection checker and error message
ajlail98 Jun 24, 2021
2a45ab2
Directory structure to organize downloads
ajlail98 Jul 13, 2021
b7c2048
Change variable names and clean up extra bits
ajlail98 Jul 13, 2021
9a932d5
Add __init__.py to validation
ajlail98 Jul 13, 2021
98e356b
Add error for non-existent dir_path
ajlail98 Jul 13, 2021
0d1274b
Add detail to internet_is_connected failure
ajlail98 Jul 14, 2021
7af3c95
Added NotImplementedError
ajlail98 Jul 16, 2021
df317b0
Only read csv once
ajlail98 Jul 16, 2021
85c9387
Change strategy for filtering df
ajlail98 Jul 16, 2021
12afe4b
Using df.loc to retrieve file_id
ajlail98 Jul 16, 2021
e7da939
Argparse and var name refinements
ajlail98 Jul 16, 2021
dceb0f5
Add ability to ping custom IP
ajlail98 Jul 20, 2021
622d934
Reformatting
ajlail98 Jul 20, 2021
ac89c06
Hardcode fileID csv hosted on google drive
ajlail98 Jul 22, 2021
af931bb
Reformatting
ajlail98 Jul 22, 2021
33f75e1
Remove gdown_fileIDs.csv
ajlail98 Jul 22, 2021
7fdf590
Add verbose error message and dockerfile entrypoint
ajlail98 Jul 30, 2021
4 changes: 3 additions & 1 deletion autometa/binning/summary.py
@@ -323,7 +323,9 @@ def main():
     )
     # Now retrieve stats for each metabin
     metabin_stats_df = get_metabin_stats(
-        bin_df=bin_df, markers_fpath=args.markers, cluster_col=args.binning_column,
+        bin_df=bin_df,
+        markers_fpath=args.markers,
+        cluster_col=args.binning_column,
     )
     metabin_stats_df.to_csv(args.output_stats, sep="\t", index=True, header=True)
     logger.info(f"Wrote metabin stats to {args.output_stats}")
4 changes: 3 additions & 1 deletion autometa/binning/unclustered_recruitment.py
@@ -400,7 +400,9 @@ def get_confidence_filtered_predictions(
         raise NotImplementedError(classifier)

     df = pd.DataFrame(
-        predictions, index=test_data.index, columns=train_data.target_names,
+        predictions,
+        index=test_data.index,
+        columns=train_data.target_names,
     )
     # Filter predictions by confidence threshold
     confidence_threshold = num_classifications * confidence
325 changes: 325 additions & 0 deletions autometa/config/project.py
@@ -0,0 +1,325 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
COPYRIGHT
Copyright 2020 Ian J. Miller, Evan R. Rees, Kyle Wolf, Siddharth Uppal,
Shaurya Chanana, Izaak Miller, Jason C. Kwan

This file is part of Autometa.

Autometa is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

Autometa is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License
along with Autometa. If not, see <http://www.gnu.org/licenses/>.
COPYRIGHT

Configuration handling for Autometa User Project.
"""


import logging
import os

from configparser import NoOptionError

from autometa.config import DEFAULT_CONFIG
from autometa.config import get_config
from autometa.config import parse_args
from autometa.config import put_config
from autometa.common.utilities import make_inputs_checkpoints
from autometa.common.utilities import get_existing_checkpoints
from autometa.common.utilities import merge_checkpoints


logger = logging.getLogger(__name__)


class Project:
    """Autometa Project class to configure project directory given `config_fpath`

    Parameters
    ----------
    config_fpath : str
        </path/to/project.config>

    Attributes
    ----------
    dirpath : str
        Path to directory containing `config_fpath`
    config : config.ConfigParser
        interpolated config object parsed from `config_fpath`.
    n_metagenomes : int
        Number of metagenomes contained in project directory
    metagenomes : dict
        metagenomes pertaining to project keyed by number and values of metagenome.config file path.
    new_metagenome_num : int
        Retrieve new minimum metagenome num from metagenomes in project.

    Methods
    -------
    * self.save()
    * self.new_metagenome_directory()
    * self.setup_checkpoints_and_files()
    * self.add()
    * self.update()
    """

    def __init__(self, config_fpath):
        self.config_fpath = config_fpath
        self.dirpath = os.path.dirname(os.path.realpath(config_fpath))
        self.config = get_config(self.config_fpath)
        if not self.config.has_section("metagenomes"):
            self.config.add_section("metagenomes")

    @property
    def n_metagenomes(self):
        """Return the number of metagenome directories present in the project

        Returns
        -------
        int
            Number of metagenomes contained in project.
        """
        return len(self.metagenomes)

    @property
    def metagenomes(self):
        """Retrieve metagenome configs from project.config

        Returns
        -------
        dict
            {metagenome_num:</path/to/metagenome.config>, ...}
        """
        # str.strip() removes a *set of characters*, not a prefix, so the
        # prefix is removed explicitly before converting the number.
        return {
            int(k.replace("metagenome_", "")): v
            for k, v in self.config.items("metagenomes")
            if os.path.exists(v)
        }

    @property
    def new_metagenome_num(self):
        """Retrieve new minimum metagenome num from metagenomes in project.

        Returns
        -------
        int
            New metagenome number in project.

        """
        # I.e. no metagenomes have been added to project yet.
        if not self.metagenomes:
            return 1
        # max corresponds to highest metagenome number recovered in project directory
        max_num = max(self.metagenomes)
        if max_num == self.n_metagenomes:
            return self.n_metagenomes + 1
        # Otherwise metagenome_num in between max and others has been removed
        # Therefore new metagenome may be inserted.
        for mg_num in range(1, max_num):
            if mg_num in self.metagenomes:
                continue
            return mg_num

    def save(self):
        """Save project config in project directory"""
        put_config(self.config, self.config_fpath)

    def new_metagenome_directory(self):
        """Create a new metagenome directory in project

        Returns
        -------
        str
            Path to newly created metagenome directory contained in project

        Raises
        ------
        IsADirectoryError
            Directory that is trying to be created already exists
        """
        metagenome_name = f"metagenome_{self.new_metagenome_num:03d}"
        metagenome_dirpath = os.path.join(self.dirpath, metagenome_name)
        # Check presence of metagenome directory
        if os.path.exists(metagenome_dirpath):
            raise IsADirectoryError(metagenome_dirpath)
        os.makedirs(metagenome_dirpath)
        return metagenome_dirpath

    def setup_checkpoints_and_files(self, config, dirpath):
        """Update config files section with symlinks of existing files to metagenome output directory.
        Also get checkpoints from each existing file and write these to a checkpoints file.

        Note
        ----
        Will write checkpoints to `config.get("files", "checkpoints")` file path. Will skip writing checkpoints
        if "checkpoints" is not available in "files".

        Parameters
        ----------
        config : config.ConfigParser
            metagenome config to be updated
        dirpath : str
            Path to output metagenome directory

        Returns
        -------
        config.ConfigParser
            Updated metagenome config
        """
        # symlink any files that already exist and were specified
        checkpoint_inputs = []
        try:
            checkpoints_fpath = config.get("files", "checkpoints")
        except NoOptionError:
            logger.debug("checkpoints option unavailable, skipping.")
            checkpoints_fpath = None
        for option in config.options("files"):
            default_fname = os.path.basename(DEFAULT_CONFIG.get("files", option))
            option_fpath = os.path.realpath(config.get("files", option))
            if os.path.exists(option_fpath):
                if option_fpath.endswith(".gz") and not default_fname.endswith(".gz"):
                    default_fname += ".gz"
                full_fpath = os.path.join(dirpath, default_fname)
                os.symlink(option_fpath, full_fpath)
                checkpoint_inputs.append(full_fpath)
            else:
                full_fpath = os.path.join(dirpath, default_fname)
            config.set("files", option, full_fpath)
        if checkpoints_fpath:
            logger.debug(
                f"Making {len(checkpoint_inputs)} checkpoints and writing to {checkpoints_fpath}"
            )
            checkpoints = make_inputs_checkpoints(checkpoint_inputs)
            checkpoints.to_csv(checkpoints_fpath, sep="\t", index=False, header=True)
        return config

    def add(self, fpath):
        """Setup Autometa metagenome directory given a metagenome.config file.

        Parameters
        ----------
        fpath : str
            </path/to/metagenome.config>

        Returns
        -------
        argparse.Namespace

        Raises
        ------
        IsADirectoryError
            Metagenome output directory already exists
        """
        metagenome_dirpath = self.new_metagenome_directory()
        metagenome_name = os.path.basename(metagenome_dirpath)
        mg_config = get_config(fpath)
        # Add/Update database and env sections for debugging individual metagenome binning runs.
        for section in ["databases", "environ", "versions"]:
            if not mg_config.has_section(section):
                mg_config.add_section(section)
            for option, value in self.config.items(section):
                if not mg_config.has_option(section, option):
                    mg_config.set(section, option, value)
        # symlink any files that already exist and were specified and checkpoint existing files
        self.setup_checkpoints_and_files(config=mg_config, dirpath=metagenome_dirpath)
        # Set outdir parameter and add config section linking metagenome config to project config
        mg_config.set("parameters", "outdir", metagenome_dirpath)
        mg_config_fpath = os.path.join(metagenome_dirpath, f"{metagenome_name}.config")
        mg_config.add_section("config")
        mg_config.set("config", "project", self.config_fpath)
        mg_config.set("config", "metagenome", mg_config_fpath)
        # Save metagenome config to metagenome directory metagenome_00d.config
        put_config(mg_config, mg_config_fpath)
        self.config.set("metagenomes", metagenome_name, mg_config_fpath)
        # Only write updated project config after successful metagenome configuration.
        self.save()
        logger.debug(
            f"updated {self.config_fpath} metagenome: {metagenome_name} : {mg_config_fpath}"
        )
        return parse_args(mg_config_fpath)

    def update(self, metagenome_num, fpath):
        """Update project config metagenomes section with input metagenome.config file.

        Parameters
        ----------
        metagenome_num : int
            metagenome number to update
        fpath : str
            </path/to/new/metagenome.config> This config will overwrite any values in old config
            that are different

        Returns
        -------
        argparse.Namespace

        Raises
        ------
        ValueError
            `metagenome` must be an int and within project config!
        """
        metagenome = f"metagenome_{metagenome_num:03d}"
        if not self.config.has_option("metagenomes", metagenome):
            raise ValueError(
                f"{metagenome_num} must be an int and within project config!"
            )
        old_config_fp = self.config.get("metagenomes", metagenome)
        old_config = get_config(old_config_fp)
        new_config = get_config(fpath)
        new_checkpoints = []
        for section in new_config.sections():
            if not old_config.has_section(section):
                old_config.add_section(section)
            for option in new_config.options(section):
                new_value = new_config.get(section, option)
                if section == "files":
                    if not os.path.exists(new_value):
                        continue
                    new_checkpoints.append(new_value)
                old_config.set(section, option, new_value)
        checkpoints_fpath = old_config.get("files", "checkpoints")
        checkpoints = merge_checkpoints(
            old_checkpoint_fpath=checkpoints_fpath,
            new_checkpoints=new_checkpoints,
            overwrite=True,
        )
        put_config(old_config, old_config_fp)
        logger.debug(f"Updated {metagenome}.config with {fpath}")
        return parse_args(old_config_fp)


def main():
    import argparse
    import logging as logger

    logger.basicConfig(
        format="[%(asctime)s %(levelname)s] %(name)s: %(message)s",
        datefmt="%m/%d/%Y %I:%M:%S %p",
        level=logger.DEBUG,
    )
    parser = argparse.ArgumentParser(
        description="""
    Contains Project class used to manipulate user's Project.
    main logs status of project.
    """
    )
    parser.add_argument("config", help="</path/to/project.config>")
    args = parser.parse_args()
    project = Project(args.config)
    logger.info(
        f"{project.config_fpath} has {project.n_metagenomes} metagenomes in {project.dirpath}"
    )
    logger.info(f"metagenome config numbers: {','.join(map(str,project.metagenomes))}")


if __name__ == "__main__":
    main()
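
The numbering rule used by `new_metagenome_num` in project.py can be illustrated in isolation. The following is a minimal sketch, not the project's code: `next_metagenome_num` is a hypothetical standalone function that takes any collection of existing metagenome numbers (assumed to be distinct positive integers) and mirrors the three cases handled by the property: no metagenomes yet, a contiguous run, or a gap left by a removed metagenome.

```python
def next_metagenome_num(existing):
    """Return the number a new metagenome would receive.

    `next_metagenome_num` is a hypothetical helper for illustration only;
    `existing` is assumed to hold distinct positive integers.
    """
    # No metagenomes have been added yet.
    if not existing:
        return 1
    max_num = max(existing)
    # Numbers run 1..max_num without gaps, so append at the end.
    if max_num == len(existing):
        return max_num + 1
    # Otherwise a number below max_num is unused; reuse the smallest one.
    for num in range(1, max_num):
        if num not in existing:
            return num


print(next_metagenome_num({1, 3}))
```

Reusing the smallest free number keeps directory names such as `metagenome_002` dense after a metagenome directory has been deleted.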
47 changes: 47 additions & 0 deletions autometa/validation/download_dataset.py
@@ -0,0 +1,47 @@
# in response to issue #110
# Intended: fetch data similarly to scikit-learn API
# pulling data from google drive folder with simulated or synthetic communities

# use gdown to download data from google drive to output directory specified by the user
# create a dictionary of the databases in the google drive
# allow the user to call them based on size (eg '78', '156'...)
# allow the user to specify <some/directory>
# find that corresponding file and download it to <some/directory>

# goal: autometa-download-dataset --community 78 --output <some/directory>

# prepare dependencies
import gdown
import argparse

# take in commands that user input
# including test file for now
parser = argparse.ArgumentParser(
    prog='autometa-download-dataset',
    description='Download a simulated community file from google drive to a specified directory',
)
parser.add_argument('--community',
                    help='specify a size of simulated community in MB',
                    choices=['78', '156', '312', '625', '1250', '2500', '5000', '10000', 'test'],
                    required=True)
parser.add_argument('--output',
                    help='specify the directory to download the file',
                    required=True)
args = parser.parse_args()

# provide list of database options as a dictionary with file_ids from google
simulated = {
    'test': '1fy3M7RnS_HGSQVKidCy-rAwXuxldyOOv',
    '78': '15CB8rmQaHTGy7gWtZedfBJkrwr51bb2y',
    '156': '13bkwFBIUhdWVWlAmVCimDODWF-7tRxgI',
    '312': '1qyAu-m6NCNuVlDFFC10waOD28j15yfV-',
    '625': '1FgMXSD50ggu0UJbZd1PM_AvLt-E7gJix',
    '1250': '1KoxwxBAYcz8Xz9H2v17N9CHOZ-WXWS5m',
    '2500': '1wKZytjC4zjTuhHdNUyAT6wVbuDDIwk2m',
    '5000': '1IX6vLfBptPxhL44dLa6jePs-GRw2XJ3S',
    '10000': '1ON2vxEWC5FHyyPqlfZ0znMgnQ1fTirqG'
}

Sidduppal (Collaborator) commented on Mar 25, 2021:

Looks like a great start. You can try the dictionary's items() method to loop through the dictionary and get the respective "ID" needed to download.
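
A minimal sketch of what that suggestion could look like, assuming the `simulated` dictionary from this file (abbreviated here to two entries). `community_urls` is a hypothetical helper name used only for illustration; iterating with items() yields each community label together with its file ID.

```python
# Abbreviated copy of the `simulated` mapping from download_dataset.py
simulated = {
    "test": "1fy3M7RnS_HGSQVKidCy-rAwXuxldyOOv",
    "78": "15CB8rmQaHTGy7gWtZedfBJkrwr51bb2y",
}


def community_urls(file_ids):
    """Map each community label to its Google Drive download URL.

    `community_urls` is a hypothetical helper, not part of this PR.
    """
    return {
        community: f"https://drive.google.com/uc?id={file_id}"
        for community, file_id in file_ids.items()
    }


# items() yields (key, value) pairs, so label and URL arrive together.
for community, url in community_urls(simulated).items():
    print(f"{community}: {url}")
```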


# construct file id into a url to put into gdown
file_id = simulated[args.community]
url = f'https://drive.google.com/uc?id={file_id}'

# download the specified file with gdown
gdown.download(url, args.output)
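
Because the script above parses arguments and downloads at import time, it cannot be imported or tested without side effects. The following is a hedged sketch of one way to restructure it, not the final code of this PR: the function names `build_url`, `parse_arguments`, and `main` are assumptions for illustration, the file-ID mapping is abbreviated, and gdown is imported lazily so the argument handling can be exercised without it.

```python
import argparse

# Abbreviated copy of the file IDs listed in the script above
SIMULATED = {
    "test": "1fy3M7RnS_HGSQVKidCy-rAwXuxldyOOv",
    "78": "15CB8rmQaHTGy7gWtZedfBJkrwr51bb2y",
}


def build_url(community):
    """Build the Google Drive download URL for a community label."""
    return f"https://drive.google.com/uc?id={SIMULATED[community]}"


def parse_arguments(argv=None):
    """Parse command-line arguments; `argv=None` falls back to sys.argv."""
    parser = argparse.ArgumentParser(
        prog="autometa-download-dataset",
        description="Download a simulated community file from google drive",
    )
    parser.add_argument(
        "--community",
        help="specify a size of simulated community in MB",
        choices=list(SIMULATED),  # choices stay in sync with the mapping
        required=True,
    )
    parser.add_argument(
        "--output",
        help="specify the path to download the file to",
        required=True,
    )
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_arguments(argv)
    # Imported here so the helpers above work without gdown installed.
    import gdown

    gdown.download(build_url(args.community), args.output)
```

With this layout, `main` could serve as the target of a console-script entry point, and deriving `choices` from the mapping avoids keeping two lists of community sizes in sync.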