Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ Add Diagnosis db model and tests #67

Merged
merged 8 commits into from
Jan 24, 2018
Merged

✨ Add Diagnosis db model and tests #67

merged 8 commits into from
Jan 24, 2018

Conversation

znatty22
Copy link
Member

Adds the initial database model for the Diagnosis entity, along with updates to the Participant model which include the relationship to the Diagnosis model.

Please see comments on files for explanations on choice of lazy loading parameters.

@@ -14,3 +23,7 @@ class Participant(db.Model, Base):
"""
__tablename__ = "participant"
external_id = db.Column(db.String(32))
diagnoses = db.relationship('Diagnosis', secondary=diagnoses_assoc_table,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using lazy=joined for Participant.diagnoses
Chose joined load for Participant.diagnoses to support the MVP use case: get single Participant by id. We decided this query would result in a response that includes the Person’s full demographic info, full diagnoses info, and full samples info. Thus it makes sense to use joined/subquery load parameter here so that we can retrieve all of this information for a person in a single query.

Choice of lazy=joined or lazy=subquery
In terms of choosing joined vs subquery, several sources say that in general (i.e. in most use cases) join is more performant than subquery. However, it seems that for Postgres this should not matter since it is able to intelligently choose whether or not to use a subquery or convert the subquery to a join. Thus, I’ve decided to use lazy=joined rather than subquery. This can always change later based on the evolution of our use cases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are good things to think about, however, it would be hard to gauge performance difference without any data or reliable backend yet.
These are also ORM level settings and can be changed without migration, and can even be set for individual queries, so we're not locking ourselves into anything by deciding on something here.

Do you have any links on the clever postgres features? I think the number of queries made are decided by the orm, regardless of how clever postgres is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we'll probably want to fetch all the diagnoses with each participant for many use cases, however, what if there is one case where we just want to get all Partipant's ids, and we forget to set lazy=True? Then we've joined and sent all the diagnosis data unnecessarily.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't that the same thing as forgetting to use lazy=joined or joinedload in our get endpoints?
Maybe we could have the default behavior of our get endpoints use lazy=joined/subquery (for demographic, diagnosis, sample) and then once we start supporting partial object retrieval we can explicitly set lazy=True for get all queries.

I agree though probably just easier to figure it out once we can play with data

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for the postgres comment, I had done a quick search and took a glance at this article and this video.

https://robots.thoughtbot.com/postgresql-performance-considerations
https://www.youtube.com/watch?v=dAQhBJMAeqc

Copy link
Contributor

@dankolbman dankolbman Jan 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgetting to do lazy=joined is still equally as possible, but the difference is that executing a statement with lazy=joined that doesn't need to be joined is much more costly than one that is lazy=select.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

diagnoses = db.relationship('Diagnosis', secondary=diagnoses_assoc_table,
lazy='joined',
backref=db.backref('participants',
lazy=True))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using lazy=True for Diagnosis.participants
In terms of the reverse relationship (how to load participants for a diagnosis or diagnoses), it is tough to tell what to choose here because this will be dependent on the query use cases. For example, when a user queries for a diagnosis will they usually just be interested in the diagnosis shallow attributes (i.e. primary diagnosis) or will they usually also be interested in getting all the participants or a subset of participants (perhaps based on some condition) for that diagnosis. In another example, when a user queries for a collection of diagnoses will they be interested in just the diagnoses shallow attributes for the collection of diagnoses or will they usually want to query for a collection of diagnoses related to a set of participants that meet some condition(s). In each of these examples the lazy loading parameter can make a significant impact on performance. Since this is unclear right now, I am leaving the lazy loading parameter to the default (lazy=True) which simply loads participants when they are needed rather than up front (lazy=joined or lazy=subquery). We can always change this later once we start getting into implementing richer query functionality.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, the resource fetching functionality can be set on a per-query basis, so we can set them depending on the needs of whatever is querying.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@znatty22 znatty22 added the data model Changes to the underlying data representation label Jan 24, 2018
@znatty22 znatty22 self-assigned this Jan 24, 2018
@znatty22 znatty22 mentioned this pull request Jan 24, 2018
Copy link
Member

@allisonheath allisonheath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mainly a few comments on the actual fields.

__tablename__ = 'diagnosis'
external_id = db.Column(db.String(32))
primary_diagnosis = db.Column(db.Text)
progression_or_recurence = db.Column(db.String(32))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Text instead of length limited string for external_id and progression_or_recurrence? Also slight typo, it's recurrence.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok and then I assume that the external_id field for any entity that has it, should also be Text? If so, we'll make sure to change that for Demographic and Sample as well


__tablename__ = 'diagnosis'
external_id = db.Column(db.String(32))
primary_diagnosis = db.Column(db.Text)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what we had previously, but I was thinking today maybe we make this several fields:

  • raw_diagnosis_text for what comes in from input files
  • ICD-10 what we map the raw diagnosis text to this standard
  • MONDO what we map the raw diagnosis text to this standard

Don't think this should hold up this PR, and the general structure between the raw and the mapped data needs more thought. But noticed as the general setup here is to be many to one with participant and primary_diagnosis typically indicates a one to one relationship. So if nothing else should drop the primary part if many to one.

Copy link
Member Author

@znatty22 znatty22 Jan 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍
And is it correct to model Participant-Diagnosis as a many-to-many relationship?. I've seen the many to many in the kids first data but I think in the kf-dictionary diagnosis schema, it indicates a one to many (case to diagnosis). Seems like many-to-many makes more sense though?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 well I think for the MVP we could just leave it at primary_diagnosis and have it be one to one and overall that would be simpler. But that might not set us up right because we know shortly forthcoming data sets will have multiple diagnoses.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm just based on the kids first data I think we should leave it as many to many and I can rename primary_diagnosis to diagnosis. I think this will set us up well for the future and its a very minor change

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with many-many. Should probably plan on turning this into a relationship to some Disease type later (with appropriate icd codes and what not).

Nit: it doesn't make a difference, but I think Text should be Text() here to have a uniform style

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woops will change Text to Text()

external_id = db.Column(db.String(32))
primary_diagnosis = db.Column(db.Text)
progression_or_recurence = db.Column(db.String(32))
days_to_last_followup = db.Column(db.Integer())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be age_at_diagnosis_days rather than the days_to_last_followup as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking of how we may want to model everything in terms of events down the road, we may want to agree on having a common field such as age_at_event on all the tables that are potentially events.

"""

__tablename__ = 'diagnosis'
external_id = db.Column(db.String(32))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do diagnoses have external ids? Do we have an example of that somewhere?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neither demographic or diagnosis had external_ids in the kids first cohorts I looked at, but I just put the field in there just in case. I can remove it if we don't think it makes sense though


__tablename__ = 'diagnosis'
external_id = db.Column(db.String(32))
primary_diagnosis = db.Column(db.Text)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with many-many. Should probably plan on turning this into a relationship to some Disease type later (with appropriate icd codes and what not).

Nit: it doesn't make a difference, but I think Text should be Text() here to have a uniform style

external_id = db.Column(db.String(32))
primary_diagnosis = db.Column(db.Text)
progression_or_recurence = db.Column(db.String(32))
days_to_last_followup = db.Column(db.Integer())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking of how we may want to model everything in terms of events down the road, we may want to agree on having a common field such as age_at_event on all the tables that are potentially events.

diagnoses_assoc_table = db.Table('diagnoses',
db.Column('diagnosis_id', db.Integer,
db.ForeignKey('diagnosis.kf_id'),
primary_key=True),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does specifying each as a primary key here do? Does it eliminate needing an id, or create indicies on both fields?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, so it creates a composite primary key using both columns. I think you have to have a composite primary key here bc its an association table which doesn't really have a real primary key. But I wonder if the default behavior of the association table is to treat the foreign keys as a composite primary key. Like I wonder what would happen if i removed the primary_key params...

@@ -14,3 +23,7 @@ class Participant(db.Model, Base):
"""
__tablename__ = "participant"
external_id = db.Column(db.String(32))
diagnoses = db.relationship('Diagnosis', secondary=diagnoses_assoc_table,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are good things to think about, however, it would be hard to gauge performance difference without any data or reliable backend yet.
These are also ORM level settings and can be changed without migration, and can even be set for individual queries, so we're not locking ourselves into anything by deciding on something here.

Do you have any links on the clever postgres features? I think the number of queries made are decided by the orm, regardless of how clever postgres is.

@@ -14,3 +23,7 @@ class Participant(db.Model, Base):
"""
__tablename__ = "participant"
external_id = db.Column(db.String(32))
diagnoses = db.relationship('Diagnosis', secondary=diagnoses_assoc_table,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we'll probably want to fetch all the diagnoses with each participant for many use cases, however, what if there is one case where we just want to get all Partipant's ids, and we forget to set lazy=True? Then we've joined and sent all the diagnosis data unnecessarily.

diagnoses = db.relationship('Diagnosis', secondary=diagnoses_assoc_table,
lazy='joined',
backref=db.backref('participants',
lazy=True))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, the resource fetching functionality can be set on a per-query basis, so we can set them depending on the needs of whatever is querying.

Copy link
Contributor

@dankolbman dankolbman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lftm. Can you remove that one merge commit before merging?

Copy link
Member

@allisonheath allisonheath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just the one thing - leaving as comment so feel free to merge after updating.

__tablename__ = 'diagnosis'
external_id = db.Column(db.Text())
diagnosis = db.Column(db.Text())
progression_or_recurrence = db.Column(db.Text())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think progression_or_recurrence should be removed for the initial pass.

@znatty22 znatty22 merged commit aef3b5a into master Jan 24, 2018
dankolbman pushed a commit that referenced this pull request Jan 25, 2018
* ✨ Add Diagnosis db model and relationship to Participant db model
* ✅ Add tests for Diagnosis db model
@znatty22 znatty22 deleted the diagnosis-model branch January 25, 2018 19:49
@dankolbman dankolbman mentioned this pull request Apr 2, 2018
alubneuski pushed a commit that referenced this pull request Oct 11, 2024
* ✨ Add Diagnosis db model and relationship to Participant db model
* ✅ Add tests for Diagnosis db model
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data model Changes to the underlying data representation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants