-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
✨ Dummy data generator #89
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a lot of files are duplicated between tests/
and dataservice/utils/
?
import uuid | ||
import random | ||
|
||
from dataservice.util.data_gen.data import * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
*
should be avoided.
You can look up python start imports for all the reasons it's bad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto
dataservice/util/data_gen/data.py
Outdated
|
||
# NUM_DEMOGRAPHICS = 1 | ||
|
||
NUM_DIAGNOSES = random.randint(0, 35) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Setting these to a random number on import seems counter intuitive.
This way, all samples will have the same number of aliquots, but that number will change between runs.
Using a random number of children for each parent may be a better idea. Perhaps this number could be MAX_DIAGNOSES
then
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. These numbers should either be constant totals for each entity type or maximums so that you can randomize the total (bounded by max) of that entity at runtime not during import
from dataservice.api.genomic_file.models import GenomicFile | ||
|
||
|
||
class DataGenerator(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you have a class to represent the generator's state, why no use class attributes instead of global variables?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I think you could have class attributes to represent the default totals for each entity and then use them as default args for each create entity method.
And then probably the rest of the data/inputs driving the generator should all be outside of the class. I see that this class has some inputs provided from data.py and other inputs hardcoded in. I know it would be tedious to move all the inputs out 😑 ... but then the generator is totally agnostic of the inputs its using to build the data. And then its also nice bc you could have different data input files for the generator. Maybe something for the next version though?
dataservice/util/data_gen/data.py
Outdated
NUM_GENOMIC_FILES = 5 | ||
|
||
# Demographics data | ||
race_list = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be nice to put these in a txt
file to stay consistent and have less clutter here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good! See comments on files
import uuid | ||
import random | ||
|
||
from dataservice.util.data_gen.data import * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto
from dataservice import create_app | ||
from dataservice.extensions import db | ||
from dataservice.api.participant.models import Participant | ||
from dataservice.api.participant.models import Participant |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove duplicate import
from dataservice.api.genomic_file.models import GenomicFile | ||
|
||
|
||
class DataGenerator(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I think you could have class attributes to represent the default totals for each entity and then use them as default args for each create entity method.
And then probably the rest of the data/inputs driving the generator should all be outside of the class. I see that this class has some inputs provided from data.py and other inputs hardcoded in. I know it would be tedious to move all the inputs out 😑 ... but then the generator is totally agnostic of the inputs its using to build the data. And then its also nice bc you could have different data input files for the generator. Maybe something for the next version though?
@@ -0,0 +1,251 @@ | |||
from datetime import datetime |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you meant to remove the tests/dummy_data_generator folder right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remember to remove tests/dummy_data_generator
dataservice/util/data_gen/data.py
Outdated
|
||
# NUM_DEMOGRAPHICS = 1 | ||
|
||
NUM_DIAGNOSES = random.randint(0, 35) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. These numbers should either be constant totals for each entity type or maximums so that you can randomize the total (bounded by max) of that entity at runtime not during import
I think in the future we may want to decouple the data input generation (creating/reading in choices, setting default maximums, etc) from the data generation/population (creating db objects and publishing to db). But this is probably fine for now in terms of an initial MVP |
self.data = { | ||
'external_id': 'diagnosis_{}'.format(i), | ||
'diagnosis': random.choice(self.diagnosis_list), | ||
'progression_or_recurrence': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove progression_or_recurrence. This field is no longer in the Diagnosis model
@@ -22,6 +22,7 @@ def test_create_and_find(self): | |||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a typo in this file's name
dt = datetime.now() | ||
for i in range(total): | ||
|
||
self.e_data = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could do with just assigning to e_data
rather than a class property. I'm guessing you never actually need self.e_data
?
self.s_list.append( | ||
Sample( | ||
**self.sample_data, | ||
aliquots=self._create_aliquots( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks pretty messy and is hard to follow. Maybe try defining the aliquots first, then passing them in?
aliquots = self._create_aliquots()
Sample(**sample_data, aliquots=aliquots)
creates participants with samples, demographics, and diagnoses | ||
""" | ||
for i in range(total): | ||
self.p = Participant( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
p probably doesn't need to be a class property. Same comment as below about assigning samples and other relationships beforehand.
random.randint( | ||
self.min_samples, | ||
self.max_samples)), | ||
demographic=self._create_demographics(i), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you want _create_demographics(1)
since you can only have one demographic per participant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i am just trying to have same suffix for creating demographic.
external_id': 'demo_id_{}'.format(i)
relativedelta.relativedelta(days=random.randint(1, 30)), | ||
'experiment_strategy': | ||
random.choice | ||
(self.experiment_strategy_list), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the function arguments should really be on the same line as the function
Should be sufficient for a first pass, but there's some stuff that could be cleaned up.
You now have a bunch of class properties that follow different naming schemes all over the file, and you'd have to grep through to find the one you're looking for. You could perhaps create some sort of
Now you could do Perhaps worth considering adding a Here's a cool library that may give some inspiration, too. |
c37c9f6
to
ce30012
Compare
ce30012
to
d5b75f2
Compare
|
||
class DataGenerator(object): | ||
|
||
def __init__(self, config_name='testing'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you set the config name to default to FLASK_CONFIG
in the environment?
606bcb4
to
24ccdd5
Compare
|
||
|
||
class DataGenerator(object): | ||
def __init__(self, config_name=os.environ.get('FLASK_CONFIG', 'default')): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably not a good idea to do any logic like this in a function handle. Probably more appropriate to put this in the __init__.py
where the DataGenerator
object get's created?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could get rid of config_name as a param and assume it will be read in from the environ, something like:
class DataGenerator(object):
def __init__(self):
self.setup(os.environ.get('FLASK_CONFIG', 'default'))
OR leave the config_name as an optional parameter and later on it could be passed in as a cmd line arg to the flask cmd:
class DataGenerator(object):
def __init__(self, config_name=None):
if not config_name:
config_name = os.environ.get('FLASK_CONFIG','default')
self.setup(config_name)
24ccdd5
to
2701add
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
No description provided.