Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ Dummy data generator #89

Merged
merged 3 commits into from
Feb 7, 2018
Merged

✨ Dummy data generator #89

merged 3 commits into from
Feb 7, 2018

Conversation

parimalak
Copy link
Contributor

No description provided.

Copy link
Contributor

@dankolbman dankolbman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a lot of files are duplicated between tests/ and dataservice/utils/?

import uuid
import random

from dataservice.util.data_gen.data import *
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

* should be avoided.
You can look up python start imports for all the reasons it's bad.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto


# NUM_DEMOGRAPHICS = 1

NUM_DIAGNOSES = random.randint(0, 35)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting these to a random number on import seems counter intuitive.
This way, all samples will have the same number of aliquots, but that number will change between runs.
Using a random number of children for each parent may be a better idea. Perhaps this number could be MAX_DIAGNOSES then

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. These numbers should either be constant totals for each entity type or maximums so that you can randomize the total (bounded by max) of that entity at runtime not during import

from dataservice.api.genomic_file.models import GenomicFile


class DataGenerator(object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you have a class to represent the generator's state, why no use class attributes instead of global variables?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I think you could have class attributes to represent the default totals for each entity and then use them as default args for each create entity method.

And then probably the rest of the data/inputs driving the generator should all be outside of the class. I see that this class has some inputs provided from data.py and other inputs hardcoded in. I know it would be tedious to move all the inputs out 😑 ... but then the generator is totally agnostic of the inputs its using to build the data. And then its also nice bc you could have different data input files for the generator. Maybe something for the next version though?

NUM_GENOMIC_FILES = 5

# Demographics data
race_list = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be nice to put these in a txt file to stay consistent and have less clutter here.

@dankolbman dankolbman changed the title ✨ Dummy data generator ✨ Dummy data generator Feb 4, 2018
Copy link
Member

@znatty22 znatty22 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good! See comments on files

import uuid
import random

from dataservice.util.data_gen.data import *
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto

from dataservice import create_app
from dataservice.extensions import db
from dataservice.api.participant.models import Participant
from dataservice.api.participant.models import Participant
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove duplicate import

from dataservice.api.genomic_file.models import GenomicFile


class DataGenerator(object):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I think you could have class attributes to represent the default totals for each entity and then use them as default args for each create entity method.

And then probably the rest of the data/inputs driving the generator should all be outside of the class. I see that this class has some inputs provided from data.py and other inputs hardcoded in. I know it would be tedious to move all the inputs out 😑 ... but then the generator is totally agnostic of the inputs its using to build the data. And then its also nice bc you could have different data input files for the generator. Maybe something for the next version though?

@@ -0,0 +1,251 @@
from datetime import datetime
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you meant to remove the tests/dummy_data_generator folder right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remember to remove tests/dummy_data_generator


# NUM_DEMOGRAPHICS = 1

NUM_DIAGNOSES = random.randint(0, 35)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. These numbers should either be constant totals for each entity type or maximums so that you can randomize the total (bounded by max) of that entity at runtime not during import

@baileyckelly baileyckelly added this to the CHOP Sprint 2 milestone Feb 5, 2018
@znatty22
Copy link
Member

znatty22 commented Feb 5, 2018

I think in the future we may want to decouple the data input generation (creating/reading in choices, setting default maximums, etc) from the data generation/population (creating db objects and publishing to db). But this is probably fine for now in terms of an initial MVP

self.data = {
'external_id': 'diagnosis_{}'.format(i),
'diagnosis': random.choice(self.diagnosis_list),
'progression_or_recurrence':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove progression_or_recurrence. This field is no longer in the Diagnosis model

@@ -22,6 +22,7 @@ def test_create_and_find(self):
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a typo in this file's name

dt = datetime.now()
for i in range(total):

self.e_data = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could do with just assigning to e_data rather than a class property. I'm guessing you never actually need self.e_data?

self.s_list.append(
Sample(
**self.sample_data,
aliquots=self._create_aliquots(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty messy and is hard to follow. Maybe try defining the aliquots first, then passing them in?

aliquots = self._create_aliquots()
Sample(**sample_data, aliquots=aliquots)

creates participants with samples, demographics, and diagnoses
"""
for i in range(total):
self.p = Participant(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

p probably doesn't need to be a class property. Same comment as below about assigning samples and other relationships beforehand.

random.randint(
self.min_samples,
self.max_samples)),
demographic=self._create_demographics(i),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you want _create_demographics(1) since you can only have one demographic per participant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am just trying to have same suffix for creating demographic.
external_id': 'demo_id_{}'.format(i)

relativedelta.relativedelta(days=random.randint(1, 30)),
'experiment_strategy':
random.choice
(self.experiment_strategy_list),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the function arguments should really be on the same line as the function

@dankolbman
Copy link
Contributor

Should be sufficient for a first pass, but there's some stuff that could be cleaned up.
I'm not too big a fan of being heavily reliant on the class properties that are assigned within functions, eg:

class DataMocker:
    def __init__(self):
        self._genomic_files()

    def _genomic_files(self):
        self.max_gen_files = 10
        self.min_gen_files = 1
        self.file_format_list = ['.txt', '.csv']

You now have a bunch of class properties that follow different naming schemes all over the file, and you'd have to grep through to find the one you're looking for. You could perhaps create some sort of MockEntity class with entity_class, min, max, sample_values attributes for each entity, but that may be over abstracting in this case. You may be better off trying to use some sort of dict to do it instead:

entities = [ Participant, GenomicFile, ... ]
self.entity_ranges =  {c: {
    'min': 5,
    'max': 25,
    'choices': open(c.__class__.__name__+'.txt').read().split('\n')
} for c in entities }

Now you could do self.entity_ranges[Participant] and get back the min, max, and choices in a consistent manner.

Perhaps worth considering adding a fake() static function on each of the entities that returns a new entity with appropriate fields set.

Here's a cool library that may give some inspiration, too.

@parimalak parimalak force-pushed the dummy-data-generator branch from c37c9f6 to ce30012 Compare February 6, 2018 19:48
@parimalak parimalak force-pushed the dummy-data-generator branch from ce30012 to d5b75f2 Compare February 7, 2018 13:24

class DataGenerator(object):

def __init__(self, config_name='testing'):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you set the config name to default to FLASK_CONFIG in the environment?

@parimalak parimalak force-pushed the dummy-data-generator branch 2 times, most recently from 606bcb4 to 24ccdd5 Compare February 7, 2018 15:58


class DataGenerator(object):
def __init__(self, config_name=os.environ.get('FLASK_CONFIG', 'default')):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not a good idea to do any logic like this in a function handle. Probably more appropriate to put this in the __init__.py where the DataGenerator object get's created?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could get rid of config_name as a param and assume it will be read in from the environ, something like:

class DataGenerator(object):
    def __init__(self):
        self.setup(os.environ.get('FLASK_CONFIG', 'default'))

OR leave the config_name as an optional parameter and later on it could be passed in as a cmd line arg to the flask cmd:

class DataGenerator(object):
    def __init__(self, config_name=None):
        if not config_name:
            config_name = os.environ.get('FLASK_CONFIG','default')
        self.setup(config_name)

@parimalak parimalak force-pushed the dummy-data-generator branch from 24ccdd5 to 2701add Compare February 7, 2018 16:37
Copy link
Contributor

@dankolbman dankolbman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@parimalak parimalak merged commit d7dfba9 into master Feb 7, 2018
@znatty22 znatty22 deleted the dummy-data-generator branch February 16, 2018 20:01
@dankolbman dankolbman mentioned this pull request Apr 2, 2018
alubneuski pushed a commit that referenced this pull request Oct 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants