Convert lecture videos to notes using AI & machine learning.
Lecture2Notes is a project that summarizes lecture videos. At a high level, it parses both the visual and auditory components of the video, extracts text from each, combines them, and then summarizes the combined text using automatic summarization algorithms. These pages document the code for the entirety of "Lecture2Notes: Summarizing Lecture Videos by Classifying Slides and Analyzing Text using Machine Learning."
Check out the documentation for usage details.
To get started summarizing text, visit the tutorial.
Note-taking is a universal activity among students because of its benefits to the learning process. This research focuses on end-to-end generation of formatted summaries of lecture videos. Our automated multimodal approach will decrease the time required to create notes, increase quiz scores and content knowledge, and enable faster learning through enhanced previewing. The research is organized into three main components: the slide classifier, the summarization models, and the end-to-end process. The system begins by extracting important keyframes using the slide classifier, a deep CNN. Then, unique slides are determined using a combination of clustering and keypoint matching. The structure of these unique slides is analyzed and converted to a formatted transcript that includes figures present on the slides. The audio is transcribed using one of several methods. We frame the task of combining and summarizing these transcripts in several ways, including as a keyword-based sentence extraction problem and as a temporal audio-slide-transcript association problem. For the summarization stage, we created TransformerSum, a summarization training and inference library that advances the state-of-the-art in long and resource-limited summarization, but other state-of-the-art models, such as BART or PEGASUS, can be used as well. Extractive and abstractive approaches are used in conjunction to summarize the long-form content extracted from the lectures. While the end-to-end process and each individual component yield promising results, key areas of weakness include the speech-to-text algorithm failing to identify certain words and some summarization methods producing sub-par summaries. These areas provide opportunities for further research.
The repository is broken into four main components: the slide classifier (including the dataset), the summarization models (neural, non-neural, extractive, and abstractive), the end-to-end process (one command to convert a video to notes), and the website that enables users to process their own videos.
Process:
- Extract frames from video file
- Classify extracted frames to find frames containing slides
- Perspective crop images containing the presenter and slide to contain only the slide by matching temporal features
- Cluster slides to group transitions and remove duplicates
- Run a Slide Structure Analysis (SSA) using OCR on the slide frames to obtain a formatted transcript of the text on the slides
- Detect and extract figures from the set of unique slide frames
- Transcribe the lecture using a speech-to-text algorithm
- Summarize the visual and auditory transcripts
  - Combine
  - Run some modifications (such as only using complete sentences)
  - Extractive summarization
  - Abstractive summarization
- Convert intermediate outputs to a final notes file (HTML, TXT, markdown, etc.)
The summarization steps can be toggled off and on (see Combination and Summarization).
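As a rough illustration of the "Summarize the visual and auditory transcripts" step, the sketch below combines two transcripts, optionally keeps only complete sentences, and optionally runs a simple word-frequency extractive summarizer. It is a minimal sketch under stated assumptions, not the lecture2notes implementation: the function names (`split_sentences`, `is_complete`, `extractive_summary`, `make_notes`), the stopword list, and the scoring rule are all hypothetical, and the abstractive stage is omitted because it would need a trained model such as BART or PEGASUS.

```python
import re
from collections import Counter

# All names below are hypothetical and exist only for this illustration;
# they are not part of the lecture2notes or TransformerSum APIs.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "that", "it", "so"}


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def is_complete(sentence):
    """Assumed heuristic: starts with a capital letter or digit, ends with punctuation."""
    return bool(re.match(r"^[A-Z0-9].*[.!?]$", sentence))


def content_words(sentence):
    """Lowercased words of a sentence with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]


def extractive_summary(sentences, max_sentences=2):
    """Keep the highest-scoring sentences, preserving their original order."""
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        # Average corpus frequency of the sentence's content words.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:max_sentences])]


def make_notes(slide_text, audio_text, only_complete=True, extractive=True, max_sentences=2):
    """Combine both transcripts, then apply whichever steps are toggled on."""
    sentences = split_sentences(slide_text) + split_sentences(audio_text)  # combine
    if only_complete:
        sentences = [s for s in sentences if is_complete(s)]  # modification step
    if extractive:
        sentences = extractive_summary(sentences, max_sentences)
    return " ".join(sentences)


if __name__ == "__main__":
    slides = "Gradient descent basics"
    audio = ("Gradient descent updates the weights using the gradient. "
             "um so yeah. The learning rate controls the step size.")
    print(make_notes(slides, audio, max_sentences=1))
```

Each stage is behind a boolean flag so that it can be switched off independently, which mirrors the toggling described above. A real pipeline would replace the frequency scorer with a trained extractive model and pass the result on to an abstractive one.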
Hayden Housen – haydenhousen.com
Distributed under the GNU Affero General Public License v3.0 (AGPL). See the LICENSE for more information.
https://github.com/HHousen
- Fork it (https://github.com/HHousen/lecture2notes/fork)
- Create your feature branch (`git checkout -b feature/fooBar`)
- Commit your changes (`git commit -am 'Add some fooBar'`)
- Push to the branch (`git push origin feature/fooBar`)
- Create a new Pull Request