Chapter 8: IndexError: list index out of range in transcript_enrich_bucket.py #587

Closed
bmerkle opened this issue Sep 14, 2024 · 1 comment
bmerkle commented Sep 14, 2024

Describe the bug
I tried Chapter 8 with my own data set (a different set of YouTube videos), and it triggers a bug (IndexError: list index out of range) in transcript_enrich_bucket.py. Please see the stack trace below.

I also have a fix for this bug; please see the MR.

To Reproduce
Steps to reproduce the behavior:

  1. Go to Chapter 8.
  2. Use your own data set (e.g. I used https://www.youtube.com/@SICKSensors).
  3. Create an import for all playlists.
  4. Run transcript_enrich_bucket.py; it fails on -7ckbQAqhe4.json.vtt.

Stacktrace:
(.venv) PS C:\work\microsoft\generative-ai-for-beginners\08-building-search-applications\scripts> python transcript_enrich_bucket.py --verbose -f $TRANSCRIPT_FOLDER -m $TRANSCRIPT_BUCKET_MINUTES
DEBUG:main:Transcription folder: transcripts_sick
DEBUG:main:Segment length 3 minutes
Enriching Buckets... ---------------------------------------- 0% -:--:--
DEBUG:main:Processing file: transcripts_sick-7ckbQAqhe4.json.vtt
Enriching Buckets... ---------------------------------------- 0% -:--:--
Traceback (most recent call last):
  File "C:\work\microsoft\generative-ai-for-beginners\08-building-search-applications\scripts\transcript_enrich_bucket.py", line 218, in <module>
    get_transcript(meta)
  File "C:\work\microsoft\generative-ai-for-beginners\08-building-search-applications\scripts\transcript_enrich_bucket.py", line 203, in get_transcript
    parse_json_vtt_transcript(vtt, metadata)
  File "C:\work\microsoft\generative-ai-for-beginners\08-building-search-applications\scripts\transcript_enrich_bucket.py", line 175, in parse_json_vtt_transcript
    previous_segment_tokens = len(tokenizer.encode(segments[-1]["text"]))
                                                   ~~~~~~~~^^^^
IndexError: list index out of range

Expected behavior
transcript_enrich_bucket.py should not fail on this data set.
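
For illustration, the failing expression `segments[-1]["text"]` raises IndexError whenever `segments` is still empty. The sketch below shows one possible guard. It is not the fix from the MR and not the script's actual code: the helper `previous_segment_token_count` and the stand-in tokenizer `whitespace_encode` are hypothetical names, and it assumes that treating "no previous segment" as zero tokens is acceptable for the bucketing logic.

```python
from typing import Callable, Dict, List


def previous_segment_token_count(
    segments: List[Dict[str, str]],
    encode: Callable[[str], List],
) -> int:
    """Return the token count of the last segment, or 0 if there is none."""
    if not segments:
        # segments[-1] raises IndexError on an empty list, so return early.
        return 0
    return len(encode(segments[-1]["text"]))


def whitespace_encode(text: str) -> List[str]:
    """Hypothetical stand-in for tokenizer.encode; splits on whitespace."""
    return text.split()


# Empty list: no IndexError, the count falls back to 0.
print(previous_segment_token_count([], whitespace_encode))

# Non-empty list: behaves like the original expression.
print(previous_segment_token_count([{"text": "hello sensor world"}], whitespace_encode))
```

In the real script the `encode` argument would be the existing `tokenizer.encode`; whether returning 0 is the right fallback depends on how `previous_segment_tokens` is used afterwards.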


👋 Thanks for contributing @bmerkle! We will review the issue and get back to you soon.
