Working end-to-end transcription integrations. #1774

Open
wants to merge 2 commits into
base: main

boazsender
Contributor

Fixes from webhook debugging work. Follow this test plan to verify:

Local testing guide

The following instructions let you test the scraping half of the implementation. The webhook half of the implementation needs to be deployed to the development project for an end-to-end test.

To perform this test you will need:

  • An assembly API key in your local environment (Ask Nathan Sanders or Matt Victor for this)
  • IAM permissions for the maple dev project in the Google Cloud Console (Ask Matt King for this)
  • A logged in and working firebase-admin session
  • A local copy of github.com/codeforboston/maple with installed deps

Instructions

  1. Set up an ngrok tunnel for the local copy of the transcription webhook
    1. Get an account at https://dashboard.ngrok.com
    2. Connect your ngrok account: ngrok config add-authtoken your-auth-token
    3. Start the local function emulator with yarn run dev:functions
    4. Open an ngrok tunnel: ngrok http http://localhost:5001
  2. Find a hearing that has a video within the last 8 days. In this example we’ll use hearing-5091.
  3. Modify functions/src/events/scrapeEvents.ts to create a local testing state
    1. Hardcode const EventId = 5091 into the beginning of getEvent
    2. Change the webhook endpoint in the assembly api call to the ngrok endpoint: https://put-your-ngrok-tunnel-id-here.ngrok-free.app/demo-dtp/us-central1/scrapeHearings-0
  4. Run the firebase admin console on the dev project with yarn firebase-admin -e dev console
    1. From inside the admin repl, run the following commands with the id of your hearing (using 5091 in this example):
      1. const resp = await db.collection("events").doc("hearing-5091").get()
      2. const data = resp.data()
      3. JSON.stringify(data)
    2. Copy the output of JSON.stringify(data) without the surrounding quotes (so you have a JS object) and save it somewhere for the next step
    3. Exit the repl with .exit
  5. Start your local firebase emulator with yarn dev:functions
    1. While that emulator is running, open a second shell
      1. Connect to the local firebase admin with yarn firebase-admin -e local console
      2. Load the data into your local instance with await db.collection('events').doc('hearing-5091').set(data) where data is the js object on your clipboard. It should look something like await db.collection('events').doc('hearing-5091').set({"startsAt":{"_seconds":1741881600,"_nanoseconds":0},"id":"hearing-5091","type":"hearing","content":{"EventDate":"2025-03-13T12:00:00","HearingAgendas":[{"DocumentsInAgenda":[],"Topic":"Annual Hearing on a Potential Modification of the Health Care Cost Growth Benchmark","StartTime":"2025-03-13T12:00:00","EndTime":"2025-03-13T15:00:00"}],"RescheduledHearing":null,"StartTime":"2025-03-13T12:00:00","EventId":5091,"HearingHost":{"Details":"http://malegislature.gov/api/GeneralCourts/194/Committees/J24","CommitteeCode":"J24","GeneralCourtNumber":194},"Name":"Joint Committee on Health Care Financing","Description":"Joint Committee on Health Care Financing & Health Policy Commission Public Hearing on a Potential Modification of the CY 2026 Health Care Cost Growth Benchmark","Location":{"AddressLine2":null,"AddressLine1":"24 Beacon Street","State":"MA","ZipCode":"02133","City":"Boston","LocationName":"Gardner Auditorium"},"Status":"Completed"},"fetchedAt":{"_seconds":1741967707,"_nanoseconds":728000000}})
      3. Exit the repl with .exit
    2. In the local emulator UI, edit hearing-5091 (http://localhost:3010/firestore/data/events/hearing-5091)
      1. Delete startsAt and fetchedAt, then re-create them as timestamp-type fields with the same property names, set to the current time.
    3. Back in the yarn dev:functions emulator run scrapeHearings()
    4. Check the local hearing-5091 document with
      1. const resp = await db.collection("events").doc("hearing-5091").get()
      2. resp.data()
    5. Look for the following props on the hearing-5091 doc:
      1. videoURL: location of the video hosted by the MA Legislature
      2. videoFetchedAt: timestamp of when the video url was fetched
      3. videoAssemblyId: id of the transcript in the Assembly SaaS API
  6. Wait for the assembly job to finish (5-10 minutes usually)
  7. Check the transcriptions collection and look for a new transcript for hearing-5091:
    1. https://console.cloud.google.com/firestore/databases/-default-/data/panel/transcriptions/17c91397-c023-4f28-a621-4cef45c70749?authuser=1&project=digital-testimony-dev
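
Step 5 above copies a document through JSON.stringify, which turns Firestore timestamps into plain {_seconds, _nanoseconds} objects. That is why they have to be re-created by hand in the emulator UI. As an alternative to retyping them, here is a minimal sketch of a helper that converts those objects back to Date values before calling set(). The names reviveTimestamps and isSerializedTimestamp are hypothetical and not part of the maple codebase, and the sketch assumes the Firestore Node SDK stores Date values as timestamp fields.

```typescript
// Shape that JSON.stringify produces for a Firestore timestamp, as seen in
// the pasted hearing document in step 5.
type SerializedTimestamp = { _seconds: number; _nanoseconds: number }

function isSerializedTimestamp(value: unknown): value is SerializedTimestamp {
  if (typeof value !== "object" || value === null) return false
  const candidate = value as { _seconds?: unknown; _nanoseconds?: unknown }
  return (
    typeof candidate._seconds === "number" &&
    typeof candidate._nanoseconds === "number"
  )
}

// Hypothetical helper: recursively replaces serialized timestamps with Date
// objects (millisecond precision) and copies everything else unchanged.
function reviveTimestamps(value: unknown): unknown {
  if (isSerializedTimestamp(value)) {
    return new Date(value._seconds * 1000 + Math.floor(value._nanoseconds / 1e6))
  }
  if (Array.isArray(value)) return value.map(reviveTimestamps)
  if (typeof value === "object" && value !== null) {
    const source = value as { [key: string]: unknown }
    const result: { [key: string]: unknown } = {}
    for (const key of Object.keys(source)) {
      result[key] = reviveTimestamps(source[key])
    }
    return result
  }
  return value
}
```

In the local repl this would be used as await db.collection('events').doc('hearing-5091').set(reviveTimestamps(data)). Note that it restores the original times; if the test needs the fields set to the current time, as step 5 describes, overwrite startsAt and fetchedAt with new Date() afterwards.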

vercel bot commented Mar 26, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
maple-dev ✅ Ready (Inspect) Visit Preview 💬 Add feedback Mar 27, 2025 4:06pm

Collaborator

@Mephistic Mephistic left a comment

There's still an open question around what transcription data we should actually store - that doesn't necessarily have to be part of this PR, but we should get at least a tentative answer/plan on that before we run any backfill

@@ -158,20 +158,6 @@ class HearingScraper extends EventScraper<HearingListItem, Hearing> {
const hearing = Hearing.check(eventData)

Double-checking here - how does this handle the first time a hearing is scraped? If there is no hearing in the db, eventData would be undefined - does Hearing.check blow up (and prevent us from returning the non-transcription related event data)?
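
One way to sidestep the first-scrape case, sketched under assumptions: if Hearing is a validator whose check throws on invalid input (as runtypes-style validators do), a boolean guard can be used instead so that a missing document simply means "no previous hearing". The Hearing shape and the isHearing, checkHearing, and previousHearing functions below are hypothetical stand-ins, not the real maple types.

```typescript
// Hypothetical, heavily reduced stand-in for the real Hearing type.
type Hearing = { id: string; videoURL?: string }

// Boolean validation: never throws.
function isHearing(value: unknown): value is Hearing {
  return (
    typeof value === "object" &&
    value !== null &&
    typeof (value as { id?: unknown }).id === "string"
  )
}

// Throwing validation, standing in for the Hearing.check behavior that the
// comment above asks about.
function checkHearing(value: unknown): Hearing {
  if (!isHearing(value)) throw new Error("Expected a Hearing")
  return value
}

// Returns the stored hearing, or undefined on a first scrape, so the caller
// can still return the non-transcription event data.
function previousHearing(eventData: unknown): Hearing | undefined {
  return isHearing(eventData) ? eventData : undefined
}
```

With this shape, getEvent could treat an undefined result as "no existing video" and continue, rather than relying on a try/catch around check.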

@@ -191,25 +177,32 @@ class HearingScraper extends EventScraper<HearingListItem, Hearing> {
maybeVideoURL = firstVideoSource.src

transcript = await assembly.transcripts.submit({

Now that we've worked out the kinks here, could we refactor getEvent a bit?

I feel like there are at least two functions here that could be broken out to make this more readable:

  • shouldScrapeVideo (handles fetching any previous event + checking for existingVideo + the date cutoff logic, returns a boolean yes/no)
  • getHearingVideoUrl (does the JSDOM wrangling, returns a string | undefined with the videoUrl)
  • And maybe submitTranscription (takes a videoUrl, sends the AssemblyAI request + saves the webhookAuth in the DB, returns transcriptId)

Which would leave the new code as something like:

if (await shouldScrapeVideo(EventId)) {
  const maybeVideoUrl = await getHearingVideoUrl(EventId)
  if (maybeVideoUrl) {
    const transcriptId = await submitTranscription(maybeVideoUrl)
    // add video/transcription data to Event
  }
}
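
To make the proposed split more concrete, here is a rough sketch of just the decision logic that shouldScrapeVideo would wrap, kept as a pure function so it can be tested without Firestore. The name isVideoScrapeDue, the PreviousEvent shape, and the rule that future hearings are skipped are assumptions; the 8-day window is taken from the test plan in the PR description, not from the code.

```typescript
// Assumption: matches the "within the last 8 days" note in the test plan.
const VIDEO_CUTOFF_DAYS = 8

// Hypothetical minimal view of a previously stored event.
type PreviousEvent = { videoURL?: string } | undefined

// Pure decision: scrape only when no video has been stored yet and the
// hearing started within the cutoff window (and not in the future).
function isVideoScrapeDue(
  previous: PreviousEvent,
  hearingStart: Date,
  now: Date = new Date()
): boolean {
  if (previous?.videoURL) return false
  const cutoffMs = VIDEO_CUTOFF_DAYS * 24 * 60 * 60 * 1000
  const ageMs = now.getTime() - hearingStart.getTime()
  return ageMs >= 0 && ageMs <= cutoffMs
}
```

shouldScrapeVideo would then only need to fetch the previous event and delegate to this function, which keeps the date arithmetic out of getEvent.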

})

await db

Ideally, I think we shouldn't actually save the hearing event to Firestore in getEvent - the abstract class already handles that for us. The webhookAuth is in a subcollection (which wouldn't be caught by the abstract class) and is the exception to that rule - this should be fine since Firestore lets you write to subcollections of documents that don't exist yet.

if (tokenInDbData) {
return hashedToken === tokenInDbData.videoAssemblyWebhookToken
}
return false
}
)
console.log("authenticatedEventsInDb", authenticatedEventsInDb)

if (authenticatedEventsInDb) {
try {
await db

We should start thinking about what parts of the response we actually want to save here. There are a lot of fields in the example response you posted in Slack, but the only ones that look potentially relevant are:

  • id
  • text
  • utterances
  • words
  • audio_url (though we already have it on the hearing event, it would still be helpful here)
  • audio_duration (not sure what we'd need it for though, but it's conceivably useful)
  • confidence

I'm most interested in text/utterances/words - since those will take up >99% of the data size (and words likely takes up >90%). We should chat with Matt V and design about the desired functionality, but I don't think we need all three. I goofed around with some of the example data: it looks like utterances gives us mostly sensible, mostly accurate divisions of the content, and text is good to have as a fallback.

IMO words is likely unnecessary - I believe it would only be useful if we either:

  • Find ourselves dissatisfied with the breakpoints/speaker split inherent to utterances and want to devise our own
  • Want to visually identify specific words that Assembly flagged as low confidence

Want to get @mvictor55 's input on the desired functionality here before dropping the axe though (and @mertbagt 's take on what we actually need for the front-end). If we do cut words, we should also probably cut the words array in utterances.
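
If we do go that way, the trimming could look roughly like the sketch below. This is only an illustration of the field list above: the trimTranscript name and the type definitions are hypothetical, the utterance fields other than words are assumed from the example response, and nothing here is a decision about what we store.

```typescript
// Assumed shapes, reduced to the fields discussed above.
type Word = { text: string; start: number; end: number; confidence: number }
type Utterance = {
  speaker: string
  text: string
  start: number
  end: number
  confidence: number
  words?: Word[]
}
type TranscriptResponse = {
  id: string
  text?: string | null
  utterances?: Utterance[] | null
  words?: Word[] | null
  audio_url: string
  audio_duration?: number | null
  confidence?: number | null
  [extra: string]: unknown
}

// Keeps only the fields listed above and drops both the top-level words
// array and the per-utterance words arrays. Missing values become null
// because Firestore rejects undefined by default.
function trimTranscript(transcript: TranscriptResponse) {
  return {
    id: transcript.id,
    text: transcript.text ?? null,
    utterances: (transcript.utterances ?? []).map(
      ({ words: _words, ...rest }) => rest
    ),
    audio_url: transcript.audio_url,
    audio_duration: transcript.audio_duration ?? null,
    confidence: transcript.confidence ?? null
  }
}
```

The webhook would then store trimTranscript(transcript) instead of spreading the whole response into the transcriptions document.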

@@ -48,7 +52,7 @@ export const transcription = functions.https.onRequest(async (req, res) => {

authenticatedEventsInDb.forEach(async d => {

Just double-checking - is it possible that the firebase function will exit before the async function in the forEach completes?

If so, it may be worth switching to a batched write for the writes here - something like

const batch = db.batch()

batch.set(
  db.collection("transcriptions").doc(transcript.id),
  { _timestamp: new Date(), ...transcript }
)

authenticatedEventsInDb.forEach(doc => {
  batch.update(doc.ref, {["x-maple-webhook"]: null})
})

await batch.commit()
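
For context on the forEach concern: Array.prototype.forEach discards the promises returned by an async callback, so the surrounding function can finish before any of them settle. The batch above avoids this because only batch.commit() is awaited. If per-document async work is kept instead, mapping to promises and awaiting Promise.all is the usual fix. A small self-contained sketch of the difference, where clearToken is a hypothetical stand-in for the Firestore update:

```typescript
// Hypothetical stand-in for an async Firestore write: records the id once
// the simulated write has finished.
async function clearToken(id: string, done: string[]): Promise<void> {
  await new Promise<void>(resolve => setTimeout(resolve, 10))
  done.push(id)
}

// forEach does not wait for the async callbacks, so this returns a snapshot
// taken before any simulated write has completed.
async function withForEach(ids: string[]): Promise<string[]> {
  const done: string[] = []
  ids.forEach(async id => {
    await clearToken(id, done)
  })
  return [...done]
}

// Promise.all waits for every simulated write before returning.
async function withPromiseAll(ids: string[]): Promise<string[]> {
  const done: string[] = []
  await Promise.all(ids.map(id => clearToken(id, done)))
  return [...done]
}
```

In an HTTP function the same reasoning applies to sending the response: anything still pending after the handler responds is not guaranteed to complete.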
