Opinions / Opinion Cluster Webhook #3779
Conversation
🔍 Existing Issues For Review: Your pull request is modifying functions with the following pre-existing issues: 📄 File: cl/search/models.py
Jordan, to make things explicit, we usually don't do reviews until a review has been requested, but I suspect you don't have the rights to do that, so before we start on this, can I just ask: Is this ready or a draft?
Feedback on the notes may change this PR. Besides that, the PR is ready for review.
Thanks @jordanparker6 for the work on this. Here is some feedback:

Triggering Webhooks
Your current webhook triggering logic resides within custom signal receivers in cl/search/signals.py.

Frontend
After you create your webhook in the frontend, we'd need to add event templates for each new webhook to the view that handles them.

Tests
If you prefer, once this is completed I'm willing to help and add the required tests for these new webhooks, to ensure each webhook contains the payload we expect and is properly triggered.
@albertisfu Thanks for the feedback. I have done the following.
Things that need to be done / need clarification:
- I am yet to implement the webhook update method for the OpinionCluster and Opinion models. I can add them to the post_save signal where created=False, however I am unsure how the FieldTracking comes into it and if they should be mounted to that model or not. Thoughts?
- I can't seem to run tests locally as the tests are linked to someone's personal files ("/home/elliott/freelawmachine/flp/columbia_data/opinions"). Any tips for getting this running locally so I can test the UI?
Thanks @jordanparker6 this is getting closer!
I've left some comments in the code.
And following up on your comments.
I wonder if batching of events is actually needed given the celery implementation.
I believe we can benefit from batching in situations where we add or update thousands of Opinions/OpinionClusters at a time, as we typically do, to avoid triggering an equivalent number of tasks/webhooks. On the other hand, batching could slow down the process of sending webhooks if our ingestion rate is not too high during a given period.
This approach would necessitate additional logic to send the accumulated webhooks even if the batch size has not been reached after a certain threshold time.
Do you have any thoughts about this? @mlissner
I have created an OpinionSerializerOffline and an OpinionClusterSerializerOffline where I have implemented serialisation that avoids the request context. Please review; there are some key decisions on what is / isn't included.
Mike, do you have any suggestions on which fields should be included in the webhook/client context?
- Should we add the same fields as in the API?
- Should we directly include nested fields of related instances like the Docket? If so, which fields from Docket would be worthwhile to add directly?
- Jordan mentioned that it would be nice to include the Person data nested within every Person-related field. To do this, we would need to create a new serializer for Person, since PersonSerializer also uses HyperlinkedRelatedField for some fields. Therefore, we need to define which fields are worth including in this new serializer.
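For context, a minimal sketch of what such an offline serializer could look like, assuming we nest Person data and fall back to primary keys elsewhere (the field choices here are illustrative, not the PR's final decisions):

```python
from rest_framework import serializers

from cl.people_db.models import Person
from cl.search.models import Opinion


class PersonOfflineSerializer(serializers.ModelSerializer):
    """Nested Person data with no HyperlinkedRelatedField, so no request
    context is needed when serializing from a Celery task."""

    class Meta:
        model = Person
        fields = ["id", "name_first", "name_last", "slug"]


class OpinionSerializerOffline(serializers.ModelSerializer):
    # Nest the author's data directly instead of linking to it.
    author = PersonOfflineSerializer(read_only=True)
    # Related objects we don't nest are exposed as plain primary keys.
    cluster_id = serializers.ReadOnlyField()

    class Meta:
        model = Opinion
        fields = "__all__"
```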
I have created the template files for the test webhook functions. These need to be updated with real test data and I don't have my hands on any atm.
Great, thank you. I can update these templates with a real test example once we have defined the structure of the serializers.
I am yet to implement the webhook update method for the OpinionCluster and Opinion models. I can add them to the post_save signal where created=False however I am unsure how the FieldTracking comes into it and if they should be mounted to that model or not. Thoughts?
Yes, as I remember, we wanted the field_tracker to send a payload only with the fields that changed during an update. This approach has some benefits:
- It allows for smaller payloads by only sending the fields that changed.
- It simplifies the logic for the client to determine which fields changed and take appropriate action.

We could also send the same payload as the created webhook and not use the field_tracker. Do you have an opinion on this, Mike?
In a follow-up comment, I'm posting the process we could follow to use the field tracker in case we decide to go with that approach.
I can't seem to run tests locally as the tests are linked to someone's personal files ("/home/elliott/freelawmachine/flp/columbia_data/opinions"). Any tips for getting this running locally so I can test the UI?
I think you're looking to generate some OpinionClusters/Opinions? I did see that the file you mentioned is linked to html_test.py, which seems like a command to import opinions.
If that's the case, instead of using that importer, you could generate fake Opinions/OpinionClusters using the make_dev_data command:
To generate 1 OpinionWithParentsFactory, use the following command:
manage.py make_dev_data --make-objects 103 --count 1
To generate 1 OpinionClusterWithParentsFactory, use the following command:
manage.py make_dev_data --make-objects 102 --count 1
You can also generate some instances using the admin panel in case you want to create Opinions/OpinionClusters with specific data.
If you're interested in real data, maybe Mike can suggest which command you can use to import some real opinions.
cl/search/api_serializers.py (Outdated)
judges = PersonSerializer(many=True, read_only=True)
docket_id = serializers.ReadOnlyField()
# To Delete: CONFIRM THIS
# docket = serializers.PrimaryKeyRelatedField(
By default, when using fields = "__all__", a docket field is generated that shows the related docket ID. Therefore, we need to decide which name makes more sense: it can either be docket_id or docket.
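For illustration, a sketch contrasting the two naming options (not the PR's actual code):

```python
from rest_framework import serializers

from cl.search.models import OpinionCluster


class OpinionClusterSerializerOffline(serializers.ModelSerializer):
    # Option A: DRF's default FK field, named "docket", serialized as its PK.
    docket = serializers.PrimaryKeyRelatedField(read_only=True)
    # Option B: expose the raw FK column explicitly as "docket_id".
    docket_id = serializers.ReadOnlyField()

    class Meta:
        model = OpinionCluster
        fields = ["docket", "docket_id"]
```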
cl/search/signals.py (Outdated)
Send a webhook to the webapp when an opinion is created.
"""
if created:
    return send_opinion_created_webhook(instance)
This works, but it doesn't process the method asynchronously as a Celery task.
It should be invoked as:
send_opinion_created_webhook.delay(instance.id)
Additionally, we prefer to only send the instance ID to the task and retrieve the instance within the task for two reasons: it's more lightweight, and in case the instance is modified in the meantime, the data to be sent is the most recent one from the database.
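As a hedged sketch of that pattern (task and receiver names are illustrative, not necessarily the PR's):

```python
from celery import shared_task
from django.db.models.signals import post_save
from django.dispatch import receiver

from cl.search.models import Opinion


@shared_task
def send_opinion_created_webhook(opinion_id: int) -> None:
    # Fetch the freshest copy inside the task: the row may have changed
    # between signal time and task execution, and passing only the ID
    # keeps the task message lightweight.
    opinion = Opinion.objects.filter(pk=opinion_id).first()
    if opinion is None:
        return  # Deleted before the task ran; nothing to send.
    ...  # Serialize the opinion and POST it to each subscribed endpoint.


@receiver(post_save, sender=Opinion)
def opinion_created_handler(sender, instance, created, **kwargs):
    if created:
        # Enqueue asynchronously with only the primary key.
        send_opinion_created_webhook.delay(instance.pk)
```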
Regarding the delete webhooks, I notice that you're sending a list of IDs. I assume this is in preparation for sending multiple IDs when we implement batching? If so, would it be a good idea to also prepare the creation/update methods to support batching in the future? This way, the JSON structure would be ready to support multiple instances.
For instance, for OpinionCluster we can have an array like:

{
    "payload": {
        "clusters": [
            {}
        ]
    }
}

And something similar for Opinions too.
{
    "payload": {
        "docket_id": 1,
        "panel": [
            {}
        ],
        "non_participating_judges": [
            {}
        ],
        "judges": "J Gallegar",
        "date_filed": "",
        "date_filed_is_approximate": True,
        "slug": "",
        "case_name_short": "",
        "case_name": "",
        "case_name_full": "",
        "scdb_id": "",
        "scdb_decision_direction": "",
        "scdb_votes_majority": "",
        "scdb_votes_majority": "",
        "scdb_votes_minority": "",
        "source": "",
        "procedural_history": "",
        "attorneys": "",
        "nature_of_suit": "",
        "posture": "",
        "syllabus": "",
        "headnotes": "",
        "summary": "",
        "disposition": "",
        "history": "",
        "other_dates": "",
        "cross_reference": "",
        "correction": "",
        "citations": [
            {}
        ]
        "citation_count": "",
        "precedential_status": "",
        "date_blocked": "",
        "blocked": "",
        "filepath_json_harvard": "",
        "arguments": "",
        "headmatter": "",
    }
    "webhook": {
        "version": 1,
        "event_type": 3,
        "date_created": "2024-01-06T14:21:40.855097-07:00",
        "deprecation_date": null
    }
}
Suggested change:
{
    "payload": {
        "docket_id": 1,
        "panel": [],
        "non_participating_judges": [],
        "judges": "J Gallegar",
        "date_filed": "",
        "date_filed_is_approximate": true,
        "slug": "",
        "case_name_short": "",
        "case_name": "",
        "case_name_full": "",
        "scdb_id": "",
        "scdb_decision_direction": "",
        "scdb_votes_majority": "",
        "source": "",
        "procedural_history": "",
        "attorneys": "",
        "nature_of_suit": "",
        "posture": "",
        "syllabus": "",
        "headnotes": "",
        "summary": "",
        "disposition": "",
        "history": "",
        "other_dates": "",
        "cross_reference": "",
        "correction": "",
        "citations": [],
        "citation_count": "",
        "precedential_status": "",
        "date_blocked": "",
        "blocked": "",
        "filepath_json_harvard": "",
        "arguments": "",
        "headmatter": ""
    },
    "webhook": {
        "version": 1,
        "event_type": 8,
        "date_created": "2024-01-06T14:21:40.855097-07:00",
        "deprecation_date": null
    }
}
The webhook dummy templates need some syntax corrections to work, and they could also benefit from an example with more fields filled out so that they have more value from a testing perspective. However, I believe these updates should wait until we have figured out the final JSON payload for every webhook event. At that point, I can help update them while also adding the test suite for the PR.
Here you can find more details about how the field tracker works, in case we decide to go with this approach: if you look at the Opinion model, it already defines a field tracker over certain fields. These fields are tracked for Elasticsearch indexing purposes, meaning we only update the document in ES when one or more of these fields change, ignoring updates that don't affect a field that's indexed. So, we'll need to add a new tracker dedicated to the fields we want the webhook to report. And within the signal, you can check the tracker for changed fields and build the payload from them, as sketched below.
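Assuming we go that route, a hedged sketch of the pattern using django-model-utils (the tracker name, task name, and tracked field list are all assumptions, not the real code):

```python
from celery import shared_task
from django.db.models.signals import post_save
from django.dispatch import receiver
from model_utils import FieldTracker

from cl.search.models import Opinion

# On the model, a second tracker dedicated to webhook-relevant fields might
# look like this (the field list is an assumption):
#
#     class Opinion(models.Model):
#         ...
#         webhook_tracker = FieldTracker(fields=["plain_text", "html"])


@shared_task
def send_opinion_updated_webhook(opinion_id: int, changed_fields: dict) -> None:
    ...  # Build the update payload from changed_fields and deliver it.


@receiver(post_save, sender=Opinion)
def opinion_updated_handler(sender, instance, created, **kwargs):
    if created:
        return
    # tracker.changed() maps each changed field to its previous value; at
    # post_save time the tracker still holds the pre-save state.
    changed = instance.webhook_tracker.changed()
    if changed:
        # Send only what changed; values must be JSON-serializable to pass
        # through the Celery broker.
        payload = {name: getattr(instance, name) for name in changed}
        send_opinion_updated_webhook.delay(instance.pk, payload)
```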
Sorry to be a bit slow here. Thank you both for all the careful work. This is going to be a major new feature. A few replies:
Once again, thank you guys!
Yeah, the webhook Celery tasks are designed to fail if the webhook endpoint is down. The HTTP request has a set timeout of 3 seconds; if the endpoint fails to respond within that time, or there is a connection error, the delivery fails. We use a custom mechanism to retry webhook events instead of relying on Celery task retries.
So in other words, you think that should be fine, Alberto? Sorry I forget the details here...
Correct, I think we'll be fine. Celery tasks won't pile up if an endpoint goes down. The task will be able to handle that gracefully while finishing sending all the events to other endpoints.
For the batching, do you want to use something like this: https://celery-batches.readthedocs.io/en/latest/ It looks pretty straightforward to implement the time- and size-based batching discussed. I have updated all the tasks to be ready for batching.
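For reference, a sketch of what the celery-batches approach would look like, following its documented API (the task name, thresholds, and app import location are assumptions; the next comment explains why this route wasn't taken):

```python
from celery_batches import Batches

from cl.celery_init import app  # Assumed location of the Celery app.


# Buffer calls until 100 have accumulated or 10 seconds pass, whichever
# comes first, then invoke the task body once with the buffered list.
@app.task(base=Batches, flush_every=100, flush_interval=10)
def send_opinion_webhooks(requests):
    # Each SimpleRequest carries the args the caller passed to .delay().
    opinion_ids = [request.args[0] for request in requests]
    ...  # Build one batched payload from opinion_ids and deliver it.
```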
Edited 3/21 to choose PostgreSQL as temporary storage and note how transactions will matter. I'm a little hesitant to use that batching solution. I took a quick look at it, and I didn't like how it required a certain worker configuration or how it relies on workers staying up to work properly (ours come and go pretty often thanks to k8s). But somehow we need to create and send batches of events, so we need some sort of temporary place to store the events and another tool to dispatch them. For storage, I previously noted we could use Redis or the DB and said that Celery was a bad choice. I've firmed up my opinion on this: we should use the DB because it has nearly unlimited storage, and it has transactions that work well. The speed penalty should be minimal vs Redis in this context. Let's use the DB for temporary storage. For dispatching, I think you can do it this way: buffer events in a small table, and have a recurring task sweep that table on a fixed interval, dispatching whatever has accumulated as one batch.
I think this would work nicely. We'd only send tasks once per interval. The compromise is that if we get a lot of updates all at once, the first batch won't trigger until the interval has elapsed. This introduces no new dependencies, is simple to do, and should work well?
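A minimal sketch of that design, assuming a hypothetical buffer model and sweep task (none of this is the PR's actual code):

```python
from celery import shared_task
from django.db import models, transaction


class WebhookEventBuffer(models.Model):
    """Hypothetical temporary storage for events awaiting batch dispatch."""

    object_id = models.PositiveIntegerField()
    event_type = models.SmallIntegerField()
    date_created = models.DateTimeField(auto_now_add=True)


@shared_task
def dispatch_webhook_batches() -> None:
    # Sweep everything buffered so far; row locks plus skip_locked keep
    # concurrent sweeps from sending the same event twice.
    with transaction.atomic():
        events = list(
            WebhookEventBuffer.objects.select_for_update(skip_locked=True)
        )
        if not events:
            return
        ids = [e.object_id for e in events]
        ...  # Build one batched payload from ids and deliver it.
        WebhookEventBuffer.objects.filter(
            pk__in=[e.pk for e in events]
        ).delete()
```

Running the sweep on a fixed interval (Celery Beat, or the self-requeueing task discussed below) is what produces the "first batch waits one interval" tradeoff described above.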
Force-pushed from ce13ff8 to 83cb449.
OK, so I have had a stab at this. I don't have the best testing setup locally yet, so I was just playing around with some mock implementations of this. I have created a general batching mechanism; the meat and potatoes is the recursive dispatch task.
This task will keep running due to the recursion, and it is being fired on startup in the ready method of the Search app's config (is this the best spot?). One thought is whether we want a task per webhook event or just a global task that does batching with a switch statement for the webhooks. Do you guys have a good way to test this implementation?
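A hedged sketch of that recursive pattern (the interval and names are assumptions, not the PR's code):

```python
from celery import shared_task

BATCH_INTERVAL_SECONDS = 60  # Assumed threshold, not from the PR.


@shared_task
def process_webhook_batches() -> None:
    ...  # Drain whatever events have accumulated and send them as a batch.
    # Re-enqueue ourselves so the loop continues without Celery Beat; the
    # first invocation is kicked off once from the app config's ready().
    process_webhook_batches.apply_async(countdown=BATCH_INTERVAL_SECONDS)
```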
Thanks Jordan. I'm going to move this to Alberto's queue so he can take a look at your work. How are you feeling time-wise? More eager to be done, or hanging in there for now?
Haha, I am hanging in! I am 100% keen to see this through. I am allocating a day / half-day a week to this until it's done. My time commitment has to stay at about one day a week, as I am trying to ship a bunch of stuff with lawme in preparation for v1.0.0 and I'm a bit understaffed. This is important for the roadmap of lawme, as the real-time syncing answers a lot of "whataboutism".
Sounds good. Alberto should have time to take another look at it soon.
Sorry for the delay in review here, Jordan. Alberto reviewed your code and he and I had some conversations via email about your approach, mine, and his. Here's the summary:
The other note I made (via an edit to my comment above) is that we should really do this in the DB. The downside is that it's a bit harder on the coding side (we'll need a little model to store stuff), and that it's a bit slower than Redis, but the upsides are that it's more reliable (transactions, ACID guarantees, etc.) and it won't run out of space. I realized that if we have a LOT of opinions piling up faster than we can send them out, we'll be glad those aren't piling up in Redis and threatening to bring down the whole system. Do you think you can take another swing at this, Jordan, with the changes above?
@mlissner sounds like a solid brainstorm! @albertisfu is a Django wizard, I see. Love the feedback! Let me have a stab at the daemon this week. Once we are happy that the daemon is working, I can have a stab at moving the queue to the DB. The DB stuff sounds pretty straightforward.
@albertisfu I have the daemon stuff up. Still lacking a good way to test this though. Is that along the lines of what you had in mind?
@jordanparker6 Great, thanks! I'll be reviewing the new approach. And yes, the daemon tests are part of the tests I have in mind to add.
Adding webhooks to allow users to subscribe to updates to the Opinion and OpinionCluster entities.
Notes: