Add slide text and segments to Tobira harvest API #5757

owi92 · 2024-04-23T14:40:44Z

See commits. With this PR, the generated slide text is exposed so we can use it in Tobira's search index. I'm not entirely sure whether the filter I built for this could produce false positives or false negatives, though in testing this didn't seem to be the case. Please let me know if you think this might be an issue, and/or have any suggestions for making the filter more robust.

Furthermore, the slide segments are passed along with their respective start times so that Paella's slide plugin can use them on the Tobira side.
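To illustrate why each segment needs a start time, here is a minimal sketch of how a consumer might turn timestamped segments into an ordered frame list and look up the slide for a playback position. The `Frame` type, its fields, and the millisecond unit are assumptions made for this example; they are not taken from the Opencast or Tobira code, and this thread does not show how the start time is actually encoded.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

class SlideFrames {
    // Hypothetical shape of one harvested segment: a preview image URI
    // plus the time (assumed to be in milliseconds) at which the slide appears.
    static final class Frame {
        final String uri;
        final long startTimeMs;

        Frame(String uri, long startTimeMs) {
            this.uri = uri;
            this.startTimeMs = startTimeMs;
        }
    }

    // Returns a copy of the frames ordered by ascending start time.
    static List<Frame> orderedFrames(List<Frame> frames) {
        List<Frame> sorted = new ArrayList<>(frames);
        sorted.sort(Comparator.comparingLong((Frame f) -> f.startTimeMs));
        return sorted;
    }

    // Finds the frame shown at positionMs: the last frame that starts
    // at or before that position. Expects a list ordered by start time.
    static Optional<Frame> frameAt(List<Frame> ordered, long positionMs) {
        Frame current = null;
        for (Frame f : ordered) {
            if (f.startTimeMs > positionMs) {
                break;
            }
            current = f;
        }
        return Optional.ofNullable(current);
    }

    public static void main(String[] args) {
        List<Frame> frames = orderedFrames(List.of(
            new Frame("slide-2.jpg", 65_000L),
            new Frame("slide-1.jpg", 0L)));
        frameAt(frames, 70_000L).ifPresent(f -> System.out.println(f.uri));
    }
}
```

Without the start times, the preview images could still be listed, but there would be no way to tell which slide belongs to which part of the video.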

Related Tobira issues: elan-ev/tobira#368, elan-ev/tobira#1065

Your pull request should…

have a concise title
close an accompanying issue if one exists
be against the correct branch
include migration scripts and documentation, if appropriate
pass automated tests
have a clean commit history
have proper commit messages (title and body) for all commits

mtneug · 2024-04-23T15:52:31Z

modules/tobira/src/main/java/org/opencastproject/tobira/impl/Item.java

+ .filter(mpe -> {
+ final var flavor = mpe.getFlavor();
+ final var isCatalog = mpe.getElementType() == MediaPackageElement.Type.Catalog;
+ final var isXml = mpe.getMimeType().eq(MimeType.mimeType("text", "xml"));
+ final var isText = flavor.getSubtype().equals("text");
+ return isCatalog && isXml && isText;
+ })


You probably want to search only for elements with the flavor mpeg-7/text.

Oh right, that is much simpler and more accurate. Thank you!

LukasKalbertodt

Tested it locally and it works! I only have one note, but that's not too important. I think this can be merged.

LukasKalbertodt · 2024-04-25T11:03:10Z

modules/tobira/src/main/java/org/opencastproject/tobira/impl/Item.java

+ final var slideText = Arrays.stream(mp.getElements())
+ .filter(mpe -> mpe.getFlavor().eq("mpeg-7/text"))
+ .map(element -> element.getURI())
+ .findFirst();


I wonder... what if there are multiple of these elements? You just take the first here. Can anyone say whether in the real world, there might be multiple such elements?

But this isn't a blocker. I also just trust Matthias with the mpeg-7/text filter. I don't have enough experience to say whether that filters for less or more than what we want. 🤷
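If multiple mpeg-7/text elements do turn out to occur in practice, one option would be to collect all matches instead of only the first. The following is a sketch under stated assumptions, not the code from this PR: `Element` is a hypothetical stand-in for Opencast's media package element type, reduced to a flavor string and a URI, so that the example is self-contained.

```java
import java.net.URI;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

class SlideTextLookup {
    // Hypothetical stand-in for a media package element (not the Opencast class).
    static final class Element {
        final String flavor;
        final URI uri;

        Element(String flavor, URI uri) {
            this.flavor = flavor;
            this.uri = uri;
        }
    }

    // Collects the URI of every element whose flavor is exactly "mpeg-7/text",
    // preserving the order in which the elements appear.
    static List<URI> allSlideTexts(List<Element> elements) {
        return elements.stream()
            .filter(e -> "mpeg-7/text".equals(e.flavor))
            .map(e -> e.uri)
            .collect(Collectors.toList());
    }

    // Mirrors the behavior in the diff above: only the first match is kept.
    static Optional<URI> firstSlideText(List<Element> elements) {
        return allSlideTexts(elements).stream().findFirst();
    }

    public static void main(String[] args) {
        List<Element> elements = List.of(
            new Element("mpeg-7/text", URI.create("text-1.xml")),
            new Element("mpeg-7/segments", URI.create("segments.xml")),
            new Element("mpeg-7/text", URI.create("text-2.xml")));
        System.out.println(allSlideTexts(elements));
        System.out.println(firstSlideText(elements));
    }
}
```

Whether returning a list is preferable depends on how the harvest API consumer handles several slide text documents per event; taking the first match keeps the API shape simpler.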

gregorydlogan · 2024-05-06T19:20:26Z

This seems to work for me. I have not checked the Tobira/UI side of things, but the server-side stuff shows up just fine.

…ella slide previews (#1163)

This adds the ocr'd slide texts as well as a list of timestamped frames to the harvesting sync code and stores them in the DB. In order to show the slide previews, `paella-slide-plugins` was added and configured to use the timestamped frames.

Needs opencast/opencast#5757 to work. Once that is merged, released and used on our test Opencast, the changes can be tested with fresh uploads. We'll still need some mechanism to apply segmentation and ocr (and speech-to-text as well) to existing videos.

(Can be reviewed commit by commit, though note that the migration from the second commit was extended in the third.)

mtneug requested changes Apr 23, 2024

owi92 added 3 commits April 24, 2024 00:23

Add ocr'd slide text to Tobira harvest API

4011bd2

The slide texts are to be added to Tobira's search index, and in order to do so, they need to be harvested. This adds the generated ocr results to the `Item` class used in the Tobira module.

Add slide segments to Tobira harvest API

9307da6

This adds a function to collect generated slide segments and add a corresponding timestamp to each. Tobira needs this to supply a frame list to the paella slide plugin.

Bump harvest api version 1.5 to 1.6

a88cfd7

owi92 force-pushed the harvest-ocr branch from bbccee6 to a88cfd7 Compare April 23, 2024 22:24

LukasKalbertodt approved these changes Apr 25, 2024

owi92 mentioned this pull request Apr 26, 2024

Add slide segments and extracted text to harvesting and DB, enable paella slide previews elan-ev/tobira#1163

Merged

LukasKalbertodt requested a review from mtneug April 29, 2024 08:31

gregorydlogan self-assigned this May 6, 2024

gregorydlogan merged commit 40cd08e into opencast:r/15.x May 6, 2024
3 checks passed
