✨ Dummy data generator #89

parimalak · 2018-02-03T02:28:33Z

No description provided.

dankolbman

Looks like a lot of files are duplicated between tests/ and dataservice/utils/?

dankolbman · 2018-02-04T00:18:24Z

dataservice/util/data_gen/data_generator.py

+import uuid
+import random
+
+from dataservice.util.data_gen.data import *


import * should be avoided.
You can look up Python star imports for all the reasons they're bad.

dankolbman · 2018-02-04T00:20:48Z

dataservice/util/data_gen/data.py

+
+# NUM_DEMOGRAPHICS = 1
+
+NUM_DIAGNOSES = random.randint(0, 35)


Setting these to a random number on import seems counterintuitive.
This way, all samples will have the same number of aliquots, but that number will change between runs.
Using a random number of children for each parent may be a better idea. Perhaps this number could be MAX_DIAGNOSES then.

Agreed. These numbers should either be constant totals for each entity type or maximums, so that you can randomize the total (bounded by the max) of that entity at runtime, not during import.
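
A minimal sketch of the "maximum, randomized at runtime" idea, not the PR's actual code. The helper name diagnosis_counts and the rng parameter are assumptions made for illustration; only MAX_DIAGNOSES and the upper bound of 35 come from the discussion above.

```python
import random

# Upper bound only; nothing random happens at import time.
MAX_DIAGNOSES = 35


def diagnosis_counts(num_participants, max_diagnoses=MAX_DIAGNOSES, rng=random):
    """Draw a separate diagnosis count for each participant at call time.

    Hypothetical helper: `rng` can be a seeded random.Random for repeatable runs.
    """
    return [rng.randint(0, max_diagnoses) for _ in range(num_participants)]
```

Calling diagnosis_counts(5, rng=random.Random(42)) gives five independent counts, each between 0 and MAX_DIAGNOSES, so parents no longer all share one count fixed at import.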

dankolbman · 2018-02-04T00:21:35Z

dataservice/util/data_gen/data_generator.py

+from dataservice.api.genomic_file.models import GenomicFile
+
+
+class DataGenerator(object):


Since you have a class to represent the generator's state, why not use class attributes instead of global variables?

Agreed, I think you could have class attributes to represent the default totals for each entity and then use them as default args for each create entity method.

And then probably the rest of the data/inputs driving the generator should all be outside of the class. I see that this class has some inputs provided from data.py and other inputs hardcoded in. I know it would be tedious to move all the inputs out 😑 ... but then the generator is totally agnostic of the inputs it's using to build the data. And then it's also nice because you could have different data input files for the generator. Maybe something for the next version though?
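
One way this could look, as a rough sketch under stated assumptions: plain dicts stand in for the real database models, and the attribute names DEFAULT_PARTICIPANTS and MAX_DIAGNOSES, the seed parameter, and the method names are all hypothetical. Defaults live on the class, and the choice lists are passed in from outside.

```python
import random


class DataGenerator:
    # Default totals live on the class, so they are easy to find and override.
    DEFAULT_PARTICIPANTS = 10
    MAX_DIAGNOSES = 5

    def __init__(self, diagnosis_choices, seed=None):
        # Input values are supplied by the caller rather than hardcoded here.
        self.diagnosis_choices = list(diagnosis_choices)
        self.rng = random.Random(seed)

    def create_diagnoses(self, participant_index, max_total=None):
        # Fall back to the class attribute when no explicit bound is given.
        max_total = self.MAX_DIAGNOSES if max_total is None else max_total
        total = self.rng.randint(0, max_total)
        return [
            {
                'external_id': 'diagnosis_{}_{}'.format(participant_index, i),
                'diagnosis': self.rng.choice(self.diagnosis_choices),
            }
            for i in range(total)
        ]

    def create_participants(self, total=None):
        total = self.DEFAULT_PARTICIPANTS if total is None else total
        return [
            {
                'external_id': 'participant_{}'.format(i),
                'diagnoses': self.create_diagnoses(i),
            }
            for i in range(total)
        ]
```

Because the choices arrive through the constructor, the same generator could be driven by different input files without touching the class.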

dankolbman · 2018-02-04T00:24:22Z

dataservice/util/data_gen/data.py

+NUM_GENOMIC_FILES = 5
+
+# Demographics data
+race_list = [


Might be nice to put these in a txt file to stay consistent and have less clutter here.

znatty22

Looks pretty good! See comments on files

znatty22 · 2018-02-04T03:49:31Z

dataservice/util/data_gen/data_generator.py

+from dataservice import create_app
+from dataservice.extensions import db
+from dataservice.api.participant.models import Participant
+from dataservice.api.participant.models import Participant


Remove duplicate import

znatty22 · 2018-02-04T03:59:08Z

tests/dummy_data_generator/test_data_generator.py

@@ -0,0 +1,251 @@
+from datetime import datetime


I think you meant to remove the tests/dummy_data_generator folder right?

Remember to remove tests/dummy_data_generator

znatty22 · 2018-02-05T22:50:56Z

I think in the future we may want to decouple the data input generation (creating/reading in choices, setting default maximums, etc) from the data generation/population (creating db objects and publishing to db). But this is probably fine for now in terms of an initial MVP

znatty22 · 2018-02-05T22:51:35Z

dataservice/util/data_gen/data_generator.py

+            self.data = {
+                'external_id': 'diagnosis_{}'.format(i),
+                'diagnosis': random.choice(self.diagnosis_list),
+                'progression_or_recurrence':


Remove progression_or_recurrence. This field is no longer in the Diagnosis model

dankolbman · 2018-02-06T00:35:51Z

tests/genomic_file/test_geonmic_file_models.py

@@ -22,6 +22,7 @@ def test_create_and_find(self):
        """


Looks like a typo in this file's name

dankolbman · 2018-02-06T00:37:40Z

dataservice/util/data_gen/data_generator.py

+        dt = datetime.now()
+        for i in range(total):
+
+            self.e_data = {


Could do with just assigning to e_data rather than a class property. I'm guessing you never actually need self.e_data?

dankolbman · 2018-02-06T00:41:07Z

dataservice/util/data_gen/data_generator.py

+            self.s_list.append(
+                Sample(
+                    **self.sample_data,
+                    aliquots=self._create_aliquots(


This looks pretty messy and is hard to follow. Maybe try defining the aliquots first, then passing them in?

aliquots = self._create_aliquots()
Sample(**sample_data, aliquots=aliquots)
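
A fuller sketch of that suggestion, assuming simple dataclasses in place of the real Sample and Aliquot models. The field names and the aliquots_per_sample parameter are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Aliquot:
    """Stand-in for the real Aliquot model."""
    external_id: str


@dataclass
class Sample:
    """Stand-in for the real Sample model."""
    external_id: str
    tissue_type: str
    aliquots: List[Aliquot] = field(default_factory=list)


def create_aliquots(total):
    return [Aliquot(external_id='aliquot_{}'.format(i)) for i in range(total)]


def create_samples(total, aliquots_per_sample=2):
    samples = []
    for i in range(total):
        # A local dict is enough; nothing here needs to live on self.
        sample_data = {
            'external_id': 'sample_{}'.format(i),
            'tissue_type': 'Normal',
        }
        # Build the children first, then pass them in.
        aliquots = create_aliquots(aliquots_per_sample)
        samples.append(Sample(**sample_data, aliquots=aliquots))
    return samples
```

Each step is on its own line, which also makes it easy to inspect the aliquots before the Sample is built.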

dankolbman · 2018-02-06T00:41:44Z

dataservice/util/data_gen/data_generator.py

+        creates participants with samples, demographics, and diagnoses
+        """
+        for i in range(total):
+            self.p = Participant(


p probably doesn't need to be a class property. Same comment as below about assigning samples and other relationships beforehand.

dankolbman · 2018-02-06T01:17:10Z

dataservice/util/data_gen/data_generator.py

+                    random.randint(
+                        self.min_samples,
+                        self.max_samples)),
+                demographic=self._create_demographics(i),


I think you want _create_demographics(1) since you can only have one demographic per participant.

I am just trying to use the same suffix when creating the demographic:
'external_id': 'demo_id_{}'.format(i)

dankolbman · 2018-02-06T01:19:30Z

dataservice/util/data_gen/data_generator.py

+                relativedelta.relativedelta(days=random.randint(1, 30)),
+                'experiment_strategy':
+                random.choice
+                (self.experiment_strategy_list),


The function arguments should really be on the same line as the function.

dankolbman · 2018-02-06T01:30:15Z

Should be sufficient for a first pass, but there's some stuff that could be cleaned up.
I'm not too big a fan of being heavily reliant on the class properties that are assigned within functions, e.g.:

class DataMocker:
    def __init__(self):
        self._genomic_files()

    def _genomic_files(self):
        self.max_gen_files = 10
        self.min_gen_files = 1
        self.file_format_list = ['.txt', '.csv']

You now have a bunch of class properties that follow different naming schemes all over the file, and you'd have to grep through to find the one you're looking for. You could perhaps create some sort of MockEntity class with entity_class, min, max, sample_values attributes for each entity, but that may be over abstracting in this case. You may be better off trying to use some sort of dict to do it instead:

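entities = [Participant, GenomicFile, ...]
self.entity_ranges = {c: {
    'min': 5,
    'max': 25,
    # c is already a class, so use c.__name__; c.__class__.__name__
    # would give the metaclass name, not the entity name
    'choices': open(c.__name__ + '.txt').read().split('\n')
} for c in entities}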

Now you could do self.entity_ranges[Participant] and get back the min, max, and choices in a consistent manner.
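
A self-contained sketch of that lookup, assuming empty stand-in classes and in-memory choice lists instead of the per-entity .txt files. The bounds, the choice values for Participant, and the helpers random_total and random_choice are hypothetical.

```python
import random


class Participant:
    """Stand-in for the real model class."""


class GenomicFile:
    """Stand-in for the real model class."""


# One consistent record per entity, keyed by the class itself.
entity_ranges = {
    Participant: {'min': 5, 'max': 25, 'choices': ['family_1', 'family_2']},
    GenomicFile: {'min': 1, 'max': 10, 'choices': ['.txt', '.csv']},
}


def random_total(entity, rng=random):
    """Pick how many instances of `entity` to create, within its bounds."""
    bounds = entity_ranges[entity]
    return rng.randint(bounds['min'], bounds['max'])


def random_choice(entity, rng=random):
    """Pick one sample value for `entity`."""
    return rng.choice(entity_ranges[entity]['choices'])
```

Every entity is then queried the same way, for example random_total(GenomicFile), instead of through differently named attributes scattered across the file.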

Perhaps worth considering adding a fake() static function on each of the entities that returns a new entity with appropriate fields set.
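
A rough idea of what such a fake() could look like, using a stand-in class rather than the real model. The field names file_name and file_format and the FILE_FORMATS attribute are assumptions for illustration only.

```python
import random
import uuid


class GenomicFile:
    """Stand-in for the real model; only the fields used in this sketch."""

    FILE_FORMATS = ['.txt', '.csv']

    def __init__(self, file_name, file_format):
        self.file_name = file_name
        self.file_format = file_format

    @staticmethod
    def fake(rng=random):
        """Return a new GenomicFile with plausible random field values."""
        file_format = rng.choice(GenomicFile.FILE_FORMATS)
        file_name = 'file_{}{}'.format(uuid.uuid4().hex[:8], file_format)
        return GenomicFile(file_name=file_name, file_format=file_format)
```

The generator would then only need to call GenomicFile.fake() in a loop, and each entity would own the knowledge of how to fill its own fields.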

Here's a cool library that may give some inspiration, too.

dankolbman · 2018-02-07T15:19:23Z

dataservice/util/data_gen/data_generator.py

+
+class DataGenerator(object):
+
+    def __init__(self, config_name='testing'):


Can you set the config name to default to FLASK_CONFIG in the environment?

dankolbman · 2018-02-07T16:09:00Z

dataservice/util/data_gen/data_generator.py

+
+
+class DataGenerator(object):
+    def __init__(self, config_name=os.environ.get('FLASK_CONFIG', 'default')):


Probably not a good idea to do any logic like this in a function signature. Probably more appropriate to put this in the __init__.py where the DataGenerator object gets created?

You could get rid of config_name as a param and assume it will be read in from the environ, something like:

class DataGenerator(object):
    def __init__(self):
        self.setup(os.environ.get('FLASK_CONFIG', 'default'))

OR leave the config_name as an optional parameter and later on it could be passed in as a cmd line arg to the flask cmd:

class DataGenerator(object):
    def __init__(self, config_name=None):
        if not config_name:
            config_name = os.environ.get('FLASK_CONFIG', 'default')
        self.setup(config_name)
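
To illustrate why the lookup is better done inside the function body: Python evaluates default argument values once, when the def statement runs, so an environment lookup in the signature is frozen at import time. The two function names below are hypothetical and exist only for this comparison.

```python
import os


def eager_config(config_name=os.environ.get('FLASK_CONFIG', 'default')):
    # The default above was computed once, when this def statement ran.
    # Later changes to FLASK_CONFIG are not seen.
    return config_name


def lazy_config(config_name=None):
    # The environment is read on every call that omits config_name.
    if not config_name:
        config_name = os.environ.get('FLASK_CONFIG', 'default')
    return config_name
```

If FLASK_CONFIG is changed after the module is imported, lazy_config() picks up the new value while eager_config() keeps returning whatever was set at import.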

dankolbman

Nice!

parimalak requested review from dankolbman, znatty22 and allisonheath February 3, 2018 02:28

dankolbman suggested changes Feb 4, 2018

dankolbman changed the title ~~✨ Dummy data generator~~ ✨ Dummy data generator Feb 4, 2018

znatty22 reviewed Feb 4, 2018

baileyckelly added this to the CHOP Sprint 2 milestone Feb 5, 2018

znatty22 reviewed Feb 5, 2018

dankolbman reviewed Feb 6, 2018

parimalak force-pushed the dummy-data-generator branch from c37c9f6 to ce30012 Compare February 6, 2018 19:48

parimalak and others added 2 commits February 6, 2018 18:14

✨Add dummy_date_generation script

a6fb147

✨ Add dummy data generator

633cdc8

parimalak force-pushed the dummy-data-generator branch from ce30012 to d5b75f2 Compare February 7, 2018 13:24

dankolbman reviewed Feb 7, 2018

parimalak force-pushed the dummy-data-generator branch 2 times, most recently from 606bcb4 to 24ccdd5 Compare February 7, 2018 15:58

dankolbman reviewed Feb 7, 2018

🔥 progress_recurrence field in diagnoses entity

2701add

parimalak force-pushed the dummy-data-generator branch from 24ccdd5 to 2701add Compare February 7, 2018 16:37

dankolbman approved these changes Feb 7, 2018

znatty22 approved these changes Feb 7, 2018

parimalak merged commit d7dfba9 into master Feb 7, 2018

znatty22 deleted the dummy-data-generator branch February 16, 2018 20:01

dankolbman mentioned this pull request Apr 2, 2018

🏷 Release 0.2.0 #217

Merged

alubneuski pushed a commit that referenced this pull request Oct 11, 2024

Merge pull request #89 from kids-first/dummy-data-generator

c370ece

✨ Dummy data generator
