Working end-to-end transcription integrations. #1774
There's still an open question about what transcription data we should actually store. That doesn't have to be settled in this PR, but we should have at least a tentative answer/plan before we run any backfill.
@@ -158,20 +158,6 @@ class HearingScraper extends EventScraper<HearingListItem, Hearing> {
const hearing = Hearing.check(eventData)
Double-checking here - how does this handle the first time a hearing is scraped? If there is no hearing in the db, `eventData` would be undefined - does `Hearing.check` blow up (and prevent us from returning the non-transcription-related event data)?
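One way to make the first-scrape case explicit is to guard for `undefined` before validating. The sketch below is self-contained and does not use the project's real `Hearing` type: `checkHearing` is a hypothetical stand-in that throws on invalid input, which is an assumption about how `Hearing.check` behaves, and `parseExistingHearing` is an illustrative helper name.

```typescript
// Hypothetical, simplified shape; the real Hearing type has more fields.
interface Hearing {
  id: string
  type: "hearing"
}

// Stand-in for a validator that throws on bad input (assumed behaviour
// of Hearing.check, not confirmed by this PR).
function checkHearing(data: unknown): Hearing {
  if (typeof data !== "object" || data === null) {
    throw new Error("Expected a hearing object")
  }
  const record = data as Record<string, unknown>
  if (typeof record.id !== "string" || record.type !== "hearing") {
    throw new Error("Invalid hearing")
  }
  return { id: record.id, type: "hearing" }
}

// Treat "no document yet" as a normal case instead of a validation error,
// so the rest of the event data can still be returned on the first scrape.
function parseExistingHearing(data: unknown): Hearing | undefined {
  if (data === undefined) return undefined
  return checkHearing(data)
}
```

With this shape, the caller can skip the "existing video" lookup when `parseExistingHearing` returns `undefined` and still return the freshly scraped event data.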
@@ -191,25 +177,32 @@ class HearingScraper extends EventScraper<HearingListItem, Hearing> {
maybeVideoURL = firstVideoSource.src

transcript = await assembly.transcripts.submit({
Now that we've worked out the kinks here, could we refactor `getEvent` a bit?
I feel like there's at least two functions here that could be broken out to make this more readable:
- `shouldScrapeVideo` (handles fetching any previous event + checking for existingVideo + the date cutoff logic, returns a boolean yes/no)
- `getHearingVideoUrl` (does the JSDOM wrangling, returns a `string | undefined` with the videoUrl)
- And maybe `submitTranscription` (takes a `videoUrl`, sends the AssemblyAI request + saves the webhookAuth in the DB, returns `transcriptId`)
Which would leave the new code as something like:
if (await shouldScrapeVideo(EventId)) {
  const maybeVideoUrl = await getHearingVideoUrl(EventId)
  if (maybeVideoUrl) {
    const transcriptId = await submitTranscription(maybeVideoUrl)
    // add video/transcription data to Event
  }
}
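As a rough illustration of the first helper, the decision part of `shouldScrapeVideo` could be a pure function, with the caller doing the Firestore fetch and passing in whatever it found. This is a sketch under assumptions: the `videoURL` field name, the `StoredHearing` shape, and the 8-day cutoff are all hypothetical and not taken from the PR.

```typescript
// Hypothetical subset of the stored hearing that the decision depends on.
interface StoredHearing {
  videoURL?: string
}

// Assumed cutoff; the real value lives in the scraper and may differ.
const VIDEO_CUTOFF_DAYS = 8
const MS_PER_DAY = 24 * 60 * 60 * 1000

/**
 * Decide whether to look for a hearing video.
 * `existing` is undefined the first time a hearing is scraped.
 */
function shouldScrapeVideo(
  existing: StoredHearing | undefined,
  startsAt: Date,
  now: Date = new Date()
): boolean {
  // A video was already found for this hearing; don't submit it again.
  if (existing?.videoURL) return false

  const ageMs = now.getTime() - startsAt.getTime()
  // Hearings in the future cannot have a recording yet.
  if (ageMs < 0) return false

  // Only hearings inside the cutoff window are worth checking.
  return ageMs <= VIDEO_CUTOFF_DAYS * MS_PER_DAY
}
```

Keeping the date and "already has a video" logic free of I/O makes it straightforward to unit test with fixed dates, separately from the JSDOM and AssemblyAI calls.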
})

await db
Ideally, I think we shouldn't actually save the hearing event to Firestore in `getEvent` - the abstract class already handles that for us. The `webhookAuth` is in a subcollection (which wouldn't be caught by the abstract class) and is the exception to that rule - this should be fine since Firestore lets you write to subcollections of documents that don't exist yet.
if (tokenInDbData) {
  return hashedToken === tokenInDbData.videoAssemblyWebhookToken
}
return false
}
)
console.log("authenticatedEventsInDb", authenticatedEventsInDb)

if (authenticatedEventsInDb) {
  try {
    await db
We should start thinking about what parts of the response we actually want to save here - there are a lot of fields in the example response you posted in Slack, but the only ones that look potentially relevant are:
- `id`
- `text`
- `utterances`
- `words`
- `audio_url` (though we already have it on the hearing event, it would still be helpful here)
- `audio_duration` (not sure what we'd need it for though, but it's conceivably useful)
- `confidence`
I'm most interested in `text`/`utterances`/`words` - since those will take up >99% of the data size (and `words` likely takes up >90%). We should chat with Matt V and design about the desired functionality, but I don't think we need all three. I goofed around with some of the example data: it looks like `utterances` gives us mostly sensible, mostly accurate divisions of the content, and `text` is good to have as a fallback.
IMO `words` is likely unnecessary - I believe it would only be useful if we either:
- Find ourselves dissatisfied with the breakpoints/speaker split inherent to `utterances` and want to devise our own
- Want to visually identify specific words that Assembly flagged as low confidence
Want to get @mvictor55's input on the desired functionality here before dropping the axe though (and @mertbagt's take on what we actually need for the front-end). If we do cut `words`, we should also probably cut the `words` array in `utterances`.
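If the decision does go that way, trimming could look something like the sketch below: keep the fields listed above, drop the top-level `words`, and strip the `words` array from each utterance. The interfaces are simplified approximations of the AssemblyAI response, and `trimTranscript` and `StoredTranscript` are hypothetical names, not part of this PR.

```typescript
// Simplified approximations of the response shape; extra fields are allowed.
interface Word {
  text: string
  start: number
  end: number
  confidence: number
}

interface Utterance {
  speaker: string
  text: string
  start: number
  end: number
  confidence: number
  words?: Word[]
}

interface TranscriptResponse {
  id: string
  text?: string | null
  utterances?: Utterance[] | null
  words?: Word[] | null
  audio_url: string
  audio_duration?: number | null
  confidence?: number | null
  [key: string]: unknown
}

// What we would actually persist (hypothetical).
interface StoredTranscript {
  id: string
  text: string | null
  audio_url: string
  audio_duration: number | null
  confidence: number | null
  utterances: Omit<Utterance, "words">[]
}

function trimTranscript(t: TranscriptResponse): StoredTranscript {
  return {
    id: t.id,
    text: t.text ?? null,
    audio_url: t.audio_url,
    audio_duration: t.audio_duration ?? null,
    confidence: t.confidence ?? null,
    // Copy only the utterance-level fields, leaving out per-word data.
    utterances: (t.utterances ?? []).map(
      ({ speaker, text, start, end, confidence }) => ({
        speaker,
        text,
        start,
        end,
        confidence
      })
    )
  }
}
```

Building the stored object from an explicit allow-list, rather than deleting keys from the response, means new fields added to the response later are not saved by accident.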
@@ -48,7 +52,7 @@ export const transcription = functions.https.onRequest(async (req, res) => {

authenticatedEventsInDb.forEach(async d => {
Just double-checking - is it possible that the firebase function will exit before the async function in the `forEach` completes?
If so, it may be worth switching to a batched write here - something like
const batch = db.batch()
batch.set(
db.collection("transcriptions").doc(transcript.id),
{ _timestamp: new Date(), ...transcript }
)
authenticatedEventsInDb.forEach(doc => {
batch.update(doc.ref, {["x-maple-webhook"]: null})
})
await batch.commit()
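The `forEach` concern can be reproduced without Firebase. In the sketch below, `fakeWrite` is a hypothetical stand-in for an async Firestore write; it shows that `forEach` discards the promises its callback returns, so the caller moves on before any write has finished, while `Promise.all` (like `await batch.commit()` above) gives the caller something to await.

```typescript
// Stand-in for an async write: completes on a later microtask.
async function fakeWrite(log: string[], id: string): Promise<void> {
  await Promise.resolve()
  log.push(id)
}

// forEach ignores the promise returned by the async callback, so this
// function returns before any write has been recorded.
function writeWithForEach(ids: string[], log: string[]): void {
  ids.forEach(async id => {
    await fakeWrite(log, id)
  })
}

// Promise.all resolves only once every write has completed, so the
// caller can await it before sending the HTTP response.
async function writeWithPromiseAll(ids: string[], log: string[]): Promise<void> {
  await Promise.all(ids.map(id => fakeWrite(log, id)))
}
```

In an HTTP function, work that is still pending when the response is sent is not guaranteed to finish, so either awaiting `Promise.all` or committing a single batch before responding avoids the race.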
Fixes from webhook debugging work. Follow this test plan to verify:
Local testing guide
The following instructions will let you test the scraping half of the implementation. The webhook half of the implementation needs to be on the development project for an end-to-end test.
In order to perform this test you will need
Instructions
ngrok config add-authtoken your-auth-token
yarn run dev:functions
ngrok http http://localhost:5001
hearing-5091
Edit `functions/src/events/scrapeEvents.ts` to create a local testing state: insert `const EventId = 5091` into the beginning of `getEvent`
https://put-your-ngrok-tunnel-id-here.ngrok-free.app/demo-dtp/us-central1/scrapeHearings-0
yarn firebase-admin -e dev console
const resp = await db.collection("events").doc("hearing-5091").get()
const data = resp.data()
JSON.stringify(data)
Copy the output of `JSON.stringify(data)` without the surrounding quotes (so you have a JS object) and save it somewhere for the next step
yarn dev:functions
yarn firebase-admin -e local console
`await db.collection('events').doc('hearing-5091').set(data)` where `data` is the JS object on your clipboard. It should look something like
await db.collection('events').doc('hearing-5091').set({"startsAt":{"_seconds":1741881600,"_nanoseconds":0},"id":"hearing-5091","type":"hearing","content":{"EventDate":"2025-03-13T12:00:00","HearingAgendas":[{"DocumentsInAgenda":[],"Topic":"Annual Hearing on a Potential Modification of the Health Care Cost Growth Benchmark","StartTime":"2025-03-13T12:00:00","EndTime":"2025-03-13T15:00:00"}],"RescheduledHearing":null,"StartTime":"2025-03-13T12:00:00","EventId":5091,"HearingHost":{"Details":"http://malegislature.gov/api/GeneralCourts/194/Committees/J24","CommitteeCode":"J24","GeneralCourtNumber":194},"Name":"Joint Committee on Health Care Financing","Description":"Joint Committee on Health Care Financing & Health Policy Commission Public Hearing on a Potential Modification of the CY 2026 Health Care Cost Growth Benchmark","Location":{"AddressLine2":null,"AddressLine1":"24 Beacon Street","State":"MA","ZipCode":"02133","City":"Boston","LocationName":"Gardner Auditorium"},"Status":"Completed"},"fetchedAt":{"_seconds":1741967707,"_nanoseconds":728000000}})
Remove `startsAt` and `fetchedAt`, and replace them with timestamp-type fields of the same property names, set to the current time.
yarn dev:functions
emulator run `scrapeHearings()`
const resp = await db.collection("events").doc("hearing-5091").get()
resp.data()
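As a possible alternative to editing the two timestamp fields by hand, the copied object could be converted before calling `.set(...)`. This sketch assumes the `{ _seconds, _nanoseconds }` shape shown in the example data above; `reviveTimestamps` is a hypothetical helper, not something in the repo. It turns those objects into `Date` values, which the Admin SDK stores as timestamp-type fields, and keeps the original times rather than setting them to the current time.

```typescript
type SerializedTimestamp = { _seconds: number; _nanoseconds: number }

function isSerializedTimestamp(value: unknown): value is SerializedTimestamp {
  if (typeof value !== "object" || value === null) return false
  const record = value as Record<string, unknown>
  return (
    typeof record._seconds === "number" &&
    typeof record._nanoseconds === "number"
  )
}

// Recursively replace serialized timestamps with Date objects.
function reviveTimestamps(value: unknown): unknown {
  if (value instanceof Date) return value
  if (isSerializedTimestamp(value)) {
    // Date has millisecond precision, so sub-millisecond nanos are dropped.
    const ms = value._seconds * 1000 + Math.floor(value._nanoseconds / 1e6)
    return new Date(ms)
  }
  if (Array.isArray(value)) return value.map(item => reviveTimestamps(item))
  if (typeof value === "object" && value !== null) {
    const out: Record<string, unknown> = {}
    for (const [key, inner] of Object.entries(value)) {
      out[key] = reviveTimestamps(inner)
    }
    return out
  }
  return value
}
```

In the local console this would be used as `await db.collection('events').doc('hearing-5091').set(reviveTimestamps(data))`, assuming the helper has been pasted into the session first.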