Add slide text and segments to Tobira harvest API #5757

Merged
merged 3 commits into from
May 6, 2024

owi92
Contributor

@owi92 owi92 commented Apr 23, 2024

See commits. With this PR, the generated slide text is exposed so we can use it in Tobira's search index. I'm not entirely sure whether the filter I built for this could produce false positives or false negatives, though in testing this didn't seem to be the case. Please let me know if you think this might be an issue and/or have any suggestions for making it more robust.

Furthermore, the slide segments are passed along with their respective start times so that they can be used for Paella's slide plugin on the Tobira side.

Related Tobira issues: elan-ev/tobira#368, elan-ev/tobira#1065

Your pull request should…

Comment on lines 130 to 136
    .filter(mpe -> {
      final var flavor = mpe.getFlavor();
      final var isCatalog = mpe.getElementType() == MediaPackageElement.Type.Catalog;
      final var isXml = mpe.getMimeType().eq(MimeType.mimeType("text", "xml"));
      final var isText = flavor.getSubtype().equals("text");
      return isCatalog && isXml && isText;
    })
Member

You probably want to search only for elements with the flavor `mpeg-7/text`.

Contributor Author

Oh right, that is much simpler and more accurate. Thank you!
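
The difference between the two filters discussed in this thread can be illustrated with a small, self-contained sketch. It does not use Opencast's real classes: `Element` is a hypothetical stand-in for a media package element, and the `custom/text` flavor is made up purely to show how a check on the subtype alone could match more than intended.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class FlavorFilterSketch {
    // Hypothetical stand-in for a media package element (not Opencast's API).
    static final class Element {
        final String flavorType;
        final String flavorSubtype;
        final boolean catalog;
        final String mimeType;

        Element(String flavorType, String flavorSubtype, boolean catalog, String mimeType) {
            this.flavorType = flavorType;
            this.flavorSubtype = flavorSubtype;
            this.catalog = catalog;
            this.mimeType = mimeType;
        }

        String flavor() {
            return flavorType + "/" + flavorSubtype;
        }
    }

    // Original heuristic: any XML catalog whose flavor subtype is "text".
    static boolean heuristicMatch(Element e) {
        return e.catalog && "text/xml".equals(e.mimeType) && "text".equals(e.flavorSubtype);
    }

    // Suggested check: the full flavor must be exactly "mpeg-7/text".
    static boolean flavorMatch(Element e) {
        return "mpeg-7/text".equals(e.flavor());
    }

    // Returns the flavors of all elements accepted by the given filter.
    static List<String> select(List<Element> elements, Predicate<Element> filter) {
        return elements.stream().filter(filter).map(Element::flavor).collect(Collectors.toList());
    }

    // Sample data; "custom/text" is an invented flavor for illustration only.
    static List<Element> sample() {
        return List.of(
            new Element("mpeg-7", "text", true, "text/xml"),
            new Element("custom", "text", true, "text/xml"),
            new Element("dublincore", "episode", true, "text/xml"));
    }

    public static void main(String[] args) {
        System.out.println(select(sample(), FlavorFilterSketch::heuristicMatch));
        System.out.println(select(sample(), FlavorFilterSketch::flavorMatch));
    }
}
```

With this sample data, the heuristic accepts both `mpeg-7/text` and the invented `custom/text`, while the exact flavor match accepts only `mpeg-7/text`.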

The slide texts are to be added to Tobira's search index,
and to do so, they need to be harvested. This
adds the generated OCR results to the `Item` class
used in the Tobira module.

This adds a function to collect generated slide segments and add
a corresponding timestamp to each. Tobira needs this to supply
a frame list to the Paella slide plugin.
Member

@LukasKalbertodt LukasKalbertodt left a comment

Tested it locally and it works! I only have one note, but that's not too important. I think this can be merged.

final var slideText = Arrays.stream(mp.getElements())
    .filter(mpe -> mpe.getFlavor().eq("mpeg-7/text"))
    .map(element -> element.getURI())
    .findFirst();
Member

I wonder... what if there are multiple of these elements? You just take the first here. Can anyone say whether in the real world, there might be multiple such elements?

But this isn't a blocker. I also just trust Matthias with the mpeg-7/text filter. I don't have enough experience to say whether that filters for less or more than what we want. 🤷
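
If multiple such elements did turn out to occur in practice, one option would be to collect all matching URIs instead of only the first. The following is a minimal sketch of both variants, not what the PR does: `Element` is again a hypothetical stand-in rather than Opencast's class, and the example URIs are invented.

```java
import java.net.URI;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

public class SlideTextSelectionSketch {
    // Hypothetical stand-in for a media package element (not Opencast's API).
    static final class Element {
        final String flavor;
        final URI uri;

        Element(String flavor, URI uri) {
            this.flavor = flavor;
            this.uri = uri;
        }
    }

    // Mirrors the approach in the PR: take the first matching element, if any.
    static Optional<URI> firstSlideText(List<Element> elements) {
        return elements.stream()
            .filter(e -> "mpeg-7/text".equals(e.flavor))
            .map(e -> e.uri)
            .findFirst();
    }

    // Alternative: keep every matching element, in encounter order.
    static List<URI> allSlideTexts(List<Element> elements) {
        return elements.stream()
            .filter(e -> "mpeg-7/text".equals(e.flavor))
            .map(e -> e.uri)
            .collect(Collectors.toList());
    }

    // Invented sample data containing two slide text catalogs.
    static List<Element> sample() {
        return List.of(
            new Element("mpeg-7/text", URI.create("https://example.org/slidetext-1.xml")),
            new Element("dublincore/episode", URI.create("https://example.org/episode.xml")),
            new Element("mpeg-7/text", URI.create("https://example.org/slidetext-2.xml")));
    }

    public static void main(String[] args) {
        System.out.println(firstSlideText(sample()));
        System.out.println(allSlideTexts(sample()));
    }
}
```

Whether collecting all elements is desirable depends on how the consumer would combine several slide text catalogs, which this thread leaves open.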

@gregorydlogan
Member

This seems to work for me. I have not checked the Tobira/UI side of things, but the server-side output shows up just fine.

@gregorydlogan gregorydlogan merged commit 40cd08e into opencast:r/15.x May 6, 2024
3 checks passed
LukasKalbertodt added a commit to elan-ev/tobira that referenced this pull request Jun 3, 2024
…ella slide previews (#1163)

This adds the OCR'd slide texts as well as a list of timestamped frames
to the harvesting sync code and stores them in the DB.
In order to show the slide previews, `paella-slide-plugins` was added
and configured to use the timestamped frames.

Needs opencast/opencast#5757 to work. Once that
is merged, released and used on our test Opencast, the changes can be
tested with fresh uploads. We'll still need some mechanism to apply
segmentation and OCR (and speech-to-text as well) to existing videos.

(Can be reviewed commit by commit, though note that the migration from
the second commit was extended in the third)