-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update resolve_bill method #149
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM, one tiny recommendation and some thoughts
openstates/importers/base.py
Outdated
date = datetime.fromisoformat(date) | ||
new_date = date - timedelta(days=1) | ||
legislative_session = LegislativeSession.objects.filter( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be legislative_sessions
(plural) since it can return more than one
openstates/importers/base.py
Outdated
@@ -169,15 +169,28 @@ def resolve_bill(self, bill_id: str, *, date: str) -> typing.Optional[_ID]: | |||
if bill_transform_func: | |||
bill_id = bill_transform_func(bill_id) | |||
|
|||
# move the start_date up a bit in case the event is on the last day of a session to compare with end_date | |||
# Some steps here to first find the session that match the incoming event using the event date |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You make a good point about increasing queries: in the past lots of queries in resolve_bill() has caused some real DB issues. So I think it could indeed be worth a bit more refactor to move this logic somewhere else.
The Importer base class has a property already called session_cache
(see here). Here's what I'm thinking:
- We could add a new method to base importer
get_all_sessions()
but there is a possible bad scenario whereget_session()
(which already exists and is used with Bills and Votes) loads a single session or two intosession_cache
. That creates a situation where it is impossible to know if we have all of the sessions or not. So we don't want to set up a scenario whereget_all_sessions()
logic is set up to fail byget_session()
- So then maybe make a change to
get_session()
(here) so that it loads all sessions intosession_cache
instead of just the one requested session (but it would still return the requested session). Basically it is not any less performant to query for 16 sessions than it is to query for one session, so why not load them all anytime some client code wants a session loaded from the DB? resolve_bill()
callsget_session()
or some new methodget_all_sessions()
, meaning that it is going to get back all the sessions as a list of dicts, and there will be only one DB query to get them.- then instead the logic here loops through the list of sessions to make comparisons and find the right one (instead of db query with WHERE clauses).
OR - leave the logic here but implement a new self.current_session_id_cache
property to save it for use by subsequent events.
All that said - there are generally many fewer events returned by an events scraper (vs. bills/votes). And I think only Event importer uses resolve_bill()
. So we definitely could just merge this in and wait to see if performance issues appear before making these kind of changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good point, thank you for the additional insight. I will take a good.
1f4f2e6
to
cd8c5ca
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM, two small changes recommended
@@ -128,6 +128,7 @@ def __init__(self, jurisdiction_id: str, do_postimport=True) -> None: | |||
self.pseudo_id_cache: typing.Dict[str, typing.Optional[_ID]] = {} | |||
self.person_cache: typing.Dict[_PersonCacheKey, typing.Optional[str]] = {} | |||
self.session_cache: typing.Dict[str, LegislativeSession] = {} | |||
self.all_sessions_cache: typing.List[LegislativeSession] = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be worth a comment here explaining the difference between these two
openstates/importers/base.py
Outdated
@@ -139,6 +140,12 @@ def __init__(self, jurisdiction_id: str, do_postimport=True) -> None: | |||
if settings.IMPORT_TRANSFORMERS.get(self._type): | |||
self.cached_transformers = settings.IMPORT_TRANSFORMERS[self._type] | |||
|
|||
def get_all_sessions(self) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this method should return self.all_sessions_cache
(rather than ONLY load a value to that property) - that would make it consistent with get_session()
which returns the requested session
Update
resolve_bill
: find a session that matches the incoming entity using the entity date. If a unique session is not found, then use the session with the latest "start_date". This does increase the number of calls to the database significantly. However, the result in this case was a 100% match for all event : bill relationship.