✨ Add Diagnosis db model and tests #67

znatty22 · 2018-01-24T01:14:12Z

Adds the initial database model for the Diagnosis entity, along with updates to the Participant model which include the relationship to the Diagnosis model.

Please see comments on files for explanations on choice of lazy loading parameters.

znatty22 · 2018-01-24T01:18:43Z

dataservice/api/participant/models.py

@@ -14,3 +23,7 @@ class Participant(db.Model, Base):
    """
    __tablename__ = "participant"
    external_id = db.Column(db.String(32))
+    diagnoses = db.relationship('Diagnosis', secondary=diagnoses_assoc_table,


Using lazy=joined for Participant.diagnoses
Chose joined load for Participant.diagnoses to support the MVP use case: get single Participant by id. We decided this query would result in a response that includes the Person’s full demographic info, full diagnoses info, and full samples info. Thus it makes sense to use joined/subquery load parameter here so that we can retrieve all of this information for a person in a single query.

Choice of lazy=joined or lazy=subquery
In terms of choosing joined vs subquery, several sources say that in general (i.e. in most use cases) join is more performant than subquery. However, it seems that for Postgres this should not matter since it is able to intelligently choose whether or not to use a subquery or convert the subquery to a join. Thus, I’ve decided to use lazy=joined rather than subquery. This can always change later based on the evolution of our use cases.

These are good things to think about, however, it would be hard to gauge performance difference without any data or reliable backend yet.
These are also ORM level settings and can be changed without migration, and can even be set for individual queries, so we're not locking ourselves into anything by deciding on something here.

Do you have any links on the clever postgres features? I think the number of queries made are decided by the orm, regardless of how clever postgres is.

I agree that we'll probably want to fetch all the diagnoses with each participant for many use cases, however, what if there is one case where we just want to get all Partipant's ids, and we forget to set lazy=True? Then we've joined and sent all the diagnosis data unnecessarily.

Isn't that the same thing as forgetting to use lazy=joined or joinedload in our get endpoints?
Maybe we could have the default behavior of our get endpoints use lazy=joined/subquery (for demographic, diagnosis, sample) and then once we start supporting partial object retrieval we can explicitly set lazy=True for get all queries.

I agree though probably just easier to figure it out once we can play with data

As for the postgres comment, I had done a quick search and took a glance at this article and this video.

https://robots.thoughtbot.com/postgresql-performance-considerations
https://www.youtube.com/watch?v=dAQhBJMAeqc

Forgetting to do lazy=joined is still equally as possible, but the difference is that executing a statement with lazy=joined that doesn't need to be joined is much more costly than one that is lazy=select.

znatty22 · 2018-01-24T01:21:20Z

dataservice/api/participant/models.py

+    diagnoses = db.relationship('Diagnosis', secondary=diagnoses_assoc_table,
+                                lazy='joined',
+                                backref=db.backref('participants',
+                                                   lazy=True))


Using lazy=True for Diagnosis.participants
In terms of the reverse relationship (how to load participants for a diagnosis or diagnoses), it is tough to tell what to choose here because this will be dependent on the query use cases. For example, when a user queries for a diagnosis will they usually just be interested in the diagnosis shallow attributes (i.e. primary diagnosis) or will they usually also be interested in getting all the participants or a subset of participants (perhaps based on some condition) for that diagnosis. In another example, when a user queries for a collection of diagnoses will they be interested in just the diagnoses shallow attributes for the collection of diagnoses or will they usually want to query for a collection of diagnoses related to a set of participants that meet some condition(s). In each of these examples the lazy loading parameter can make a significant impact on performance. Since this is unclear right now, I am leaving the lazy loading parameter to the default (lazy=True) which simply loads participants when they are needed rather than up front (lazy=joined or lazy=subquery). We can always change this later once we start getting into implementing richer query functionality.

Again, the resource fetching functionality can be set on a per-query basis, so we can set them depending on the needs of whatever is querying.

allisonheath

Mainly a few comments on the actual fields.

allisonheath · 2018-01-24T02:12:01Z

dataservice/api/diagnosis/models.py

+    __tablename__ = 'diagnosis'
+    external_id = db.Column(db.String(32))
+    primary_diagnosis = db.Column(db.Text)
+    progression_or_recurence = db.Column(db.String(32))


Text instead of length limited string for external_id and progression_or_recurrence? Also slight typo, it's recurrence.

Ok and then I assume that the external_id field for any entity that has it, should also be Text? If so, we'll make sure to change that for Demographic and Sample as well

allisonheath · 2018-01-24T02:20:00Z

dataservice/api/diagnosis/models.py

+
+    __tablename__ = 'diagnosis'
+    external_id = db.Column(db.String(32))
+    primary_diagnosis = db.Column(db.Text)


This is what we had previously, but I was thinking today maybe we make this several fields:

raw_diagnosis_text for what comes in from input files

ICD-10 what we map the raw diagnosis text to this standard

MONDO what we map the raw diagnosis text to this standard

Don't think this should hold up this PR, and the general structure between the raw and the mapped data needs more thought. But noticed as the general setup here is to be many to one with participant and primary_diagnosis typically indicates a one to one relationship. So if nothing else should drop the primary part if many to one.

👍
And is it correct to model Participant-Diagnosis as a many-to-many relationship?. I've seen the many to many in the kids first data but I think in the kf-dictionary diagnosis schema, it indicates a one to many (case to diagnosis). Seems like many-to-many makes more sense though?

🤔 well I think for the MVP we could just leave it at primary_diagnosis and have it be one to one and overall that would be simpler. But that might not set us up right because we know shortly forthcoming data sets will have multiple diagnoses.

Hmm just based on the kids first data I think we should leave it as many to many and I can rename primary_diagnosis to diagnosis. I think this will set us up well for the future and its a very minor change

Agree with many-many. Should probably plan on turning this into a relationship to some Disease type later (with appropriate icd codes and what not).

Nit: it doesn't make a difference, but I think Text should be Text() here to have a uniform style

Woops will change Text to Text()

allisonheath · 2018-01-24T02:23:56Z

dataservice/api/diagnosis/models.py

+    external_id = db.Column(db.String(32))
+    primary_diagnosis = db.Column(db.Text)
+    progression_or_recurence = db.Column(db.String(32))
+    days_to_last_followup = db.Column(db.Integer())


I think this should be age_at_diagnosis_days rather than the days_to_last_followup as well.

Thinking of how we may want to model everything in terms of events down the road, we may want to agree on having a common field such as age_at_event on all the tables that are potentially events.

dankolbman · 2018-01-24T15:47:46Z

dataservice/api/diagnosis/models.py

+    """
+
+    __tablename__ = 'diagnosis'
+    external_id = db.Column(db.String(32))


Do diagnoses have external ids? Do we have an example of that somewhere?

Neither demographic or diagnosis had external_ids in the kids first cohorts I looked at, but I just put the field in there just in case. I can remove it if we don't think it makes sense though

dankolbman · 2018-01-24T15:51:27Z

dataservice/api/diagnosis/models.py

+
+    __tablename__ = 'diagnosis'
+    external_id = db.Column(db.String(32))
+    primary_diagnosis = db.Column(db.Text)


Agree with many-many. Should probably plan on turning this into a relationship to some Disease type later (with appropriate icd codes and what not).

Nit: it doesn't make a difference, but I think Text should be Text() here to have a uniform style

dankolbman · 2018-01-24T15:53:22Z

dataservice/api/diagnosis/models.py

+    external_id = db.Column(db.String(32))
+    primary_diagnosis = db.Column(db.Text)
+    progression_or_recurence = db.Column(db.String(32))
+    days_to_last_followup = db.Column(db.Integer())


Thinking of how we may want to model everything in terms of events down the road, we may want to agree on having a common field such as age_at_event on all the tables that are potentially events.

dankolbman · 2018-01-24T15:54:20Z

dataservice/api/participant/models.py

+diagnoses_assoc_table = db.Table('diagnoses',
+                                 db.Column('diagnosis_id', db.Integer,
+                                           db.ForeignKey('diagnosis.kf_id'),
+                                           primary_key=True),


What does specifying each as a primary key here do? Does it eliminate needing an id, or create indicies on both fields?

Yea, so it creates a composite primary key using both columns. I think you have to have a composite primary key here bc its an association table which doesn't really have a real primary key. But I wonder if the default behavior of the association table is to treat the foreign keys as a composite primary key. Like I wonder what would happen if i removed the primary_key params...

dankolbman · 2018-01-24T15:59:22Z

dataservice/api/participant/models.py

@@ -14,3 +23,7 @@ class Participant(db.Model, Base):
    """
    __tablename__ = "participant"
    external_id = db.Column(db.String(32))
+    diagnoses = db.relationship('Diagnosis', secondary=diagnoses_assoc_table,


These are good things to think about, however, it would be hard to gauge performance difference without any data or reliable backend yet.
These are also ORM level settings and can be changed without migration, and can even be set for individual queries, so we're not locking ourselves into anything by deciding on something here.

Do you have any links on the clever postgres features? I think the number of queries made are decided by the orm, regardless of how clever postgres is.

dankolbman · 2018-01-24T16:07:30Z

dataservice/api/participant/models.py

@@ -14,3 +23,7 @@ class Participant(db.Model, Base):
    """
    __tablename__ = "participant"
    external_id = db.Column(db.String(32))
+    diagnoses = db.relationship('Diagnosis', secondary=diagnoses_assoc_table,


I agree that we'll probably want to fetch all the diagnoses with each participant for many use cases, however, what if there is one case where we just want to get all Partipant's ids, and we forget to set lazy=True? Then we've joined and sent all the diagnosis data unnecessarily.

dankolbman · 2018-01-24T16:11:09Z

dataservice/api/participant/models.py

+    diagnoses = db.relationship('Diagnosis', secondary=diagnoses_assoc_table,
+                                lazy='joined',
+                                backref=db.backref('participants',
+                                                   lazy=True))


Again, the resource fetching functionality can be set on a per-query basis, so we can set them depending on the needs of whatever is querying.

dankolbman

lftm. Can you remove that one merge commit before merging?

allisonheath

Just the one thing - leaving as comment so feel free to merge after updating.

allisonheath · 2018-01-24T22:16:58Z

dataservice/api/diagnosis/models.py

+    __tablename__ = 'diagnosis'
+    external_id = db.Column(db.Text())
+    diagnosis = db.Column(db.Text())
+    progression_or_recurrence = db.Column(db.Text())


I think progression_or_recurrence should be removed for the initial pass.

* ✨ Add Diagnosis db model and relationship to Participant db model * ✅ Add tests for Diagnosis db model

znatty22 added 3 commits January 23, 2018 20:05

✨ Add Diagnosis db model and relationship to Participant db model

6c3bec5

✅ Add tests for Diagnosis db model

4a055e6

Merge branch 'master' into diagnosis-model

2595e9b

znatty22 commented Jan 24, 2018

View reviewed changes

znatty22 added the data model Changes to the underlying data representation label Jan 24, 2018

znatty22 self-assigned this Jan 24, 2018

znatty22 requested review from dankolbman, parimalak and allisonheath January 24, 2018 02:02

znatty22 mentioned this pull request Jan 24, 2018

✨ Sample db model #66

Merged

allisonheath requested changes Jan 24, 2018

View reviewed changes

dankolbman reviewed Jan 24, 2018

View reviewed changes

znatty22 added 3 commits January 24, 2018 15:06

🔧 Update column type to Text and rename model attribs

0366118

🔧 Change many-to-many to one-to-many in db models

bf30a79

✅ Update tests to test one-to-many relationship

242945e

parimalak approved these changes Jan 24, 2018

View reviewed changes

dankolbman approved these changes Jan 24, 2018

View reviewed changes

Merge branch 'master' into diagnosis-model

9b158b2

allisonheath reviewed Jan 24, 2018

View reviewed changes

allisonheath approved these changes Jan 24, 2018

View reviewed changes

Merge branch 'master' into diagnosis-model

3a5b45b

znatty22 merged commit aef3b5a into master Jan 24, 2018

dankolbman pushed a commit that referenced this pull request Jan 25, 2018

✨ Add Diagnosis db model and tests (#67)

ad4fd95

* ✨ Add Diagnosis db model and relationship to Participant db model * ✅ Add tests for Diagnosis db model

znatty22 deleted the diagnosis-model branch January 25, 2018 19:49

dankolbman mentioned this pull request Apr 2, 2018

🏷 Release 0.2.0 #217

Merged

alubneuski pushed a commit that referenced this pull request Oct 11, 2024

✨ Add Diagnosis db model and tests (#67)

faca0b0

* ✨ Add Diagnosis db model and relationship to Participant db model * ✅ Add tests for Diagnosis db model

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

✨ Add Diagnosis db model and tests #67

✨ Add Diagnosis db model and tests #67

znatty22 commented Jan 24, 2018

znatty22 Jan 24, 2018

dankolbman Jan 24, 2018

dankolbman Jan 24, 2018

znatty22 Jan 24, 2018

znatty22 Jan 24, 2018

dankolbman Jan 24, 2018 •

edited

Loading

znatty22 Jan 24, 2018

znatty22 Jan 24, 2018

dankolbman Jan 24, 2018

znatty22 Jan 24, 2018

allisonheath left a comment

allisonheath Jan 24, 2018

znatty22 Jan 24, 2018

allisonheath Jan 24, 2018

znatty22 Jan 24, 2018 •

edited

Loading

allisonheath Jan 24, 2018

znatty22 Jan 24, 2018

dankolbman Jan 24, 2018

znatty22 Jan 24, 2018

allisonheath Jan 24, 2018

znatty22 Jan 24, 2018

dankolbman Jan 24, 2018

dankolbman Jan 24, 2018

znatty22 Jan 24, 2018

dankolbman Jan 24, 2018

dankolbman Jan 24, 2018

dankolbman Jan 24, 2018

znatty22 Jan 24, 2018

dankolbman Jan 24, 2018

dankolbman Jan 24, 2018

dankolbman Jan 24, 2018

dankolbman left a comment

allisonheath left a comment

allisonheath Jan 24, 2018

✨ Add Diagnosis db model and tests #67

✨ Add Diagnosis db model and tests #67

Conversation

znatty22 commented Jan 24, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dankolbman Jan 24, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

allisonheath left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

znatty22 Jan 24, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dankolbman left a comment

Choose a reason for hiding this comment

allisonheath left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dankolbman Jan 24, 2018 •

edited

Loading

znatty22 Jan 24, 2018 •

edited

Loading