[CHANGE] Content authorship and append-only Retrievables in the face of Aggregation and Segmentation #118

ppanopticon · 2024-11-06T10:57:47Z

Description

One of the recent pull requests to dev introduced the ContentAuthorAttribute, with the idea that ContentElements can be labeled and selectively processed based on the operator that created them. As a side effect, ContentElements are now always appended to an Ingested, so that different execution paths in the pipeline have access to all the necessary information. In other words, Retrievables have become strictly append-only objects that accumulate different content representations (and other data structures such as Descriptors).
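To make the mechanism described above concrete, here is a minimal sketch in Python. It is not the actual implementation: the class names mirror the terms used in this issue, but the fields, the `add_content` method, and the `content_by` helper are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ContentElement:
    """A piece of content labeled with the operator that created it (hypothetical model)."""
    author: str     # name of the producing operator, e.g. "decoder"
    payload: bytes  # stand-in for a decoded frame, audio chunk, etc.


@dataclass
class Retrievable:
    """Append-only container: content is added but never removed or replaced."""
    content: list = field(default_factory=list)

    def add_content(self, element: ContentElement) -> None:
        # Appending is the only supported mutation in this sketch.
        self.content.append(element)

    def content_by(self, author: str) -> list:
        """Select only the content created by the given operator."""
        return [c for c in self.content if c.author == author]


retrievable = Retrievable()
retrievable.add_content(ContentElement("decoder", b"frame-1"))
retrievable.add_content(ContentElement("decoder", b"frame-2"))
retrievable.add_content(ContentElement("aggregator", b"mean-frame"))
```

In this model, a downstream operator that only cares about the aggregated content has to ask for it explicitly, for example via `retrievable.content_by("aggregator")`, while the decoder output stays attached to the Retrievable.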

While I understand the desire for the first part, I consider the second part of this mechanism highly problematic for multiple reasons, not least because I immediately ran into issues even with very simple examples. Here are my observations in a nutshell:

The current approach is taxing on memory. Even for very simple pipelines (Decode > Aggregate > Extract), I run into out-of-memory issues within seconds. The reason is simple: the aggregation step no longer frees memory, since all versions of the content are kept around until the video is fully extracted. Of course, this can be worked around by adding more memory (unreliable) or by using the CachedContentFactory (slow). But I think it is less than ideal that it is no longer possible to construct pipelines with a low memory footprint.
The approach adds a lot of complexity (which is currently poorly documented). Again, even in this simple example, I am forced to specify which of the many ContentElements I actually want to use, even though this is self-evident from the pipeline setup. Without doing so, extraction takes place on all the content.
It is also unclear to me how this mechanism behaves in more complex scenarios where we aggregate and/or segment Retrievables. What happens, for example, if we create new segments (i.e., new Retrievables) that replace the incoming ones? We can of course emit the new Retrievable for processing. But since the source Retrievable's relationships cannot be changed, both segmentations are kept around and, by the current logic, are thus persisted.
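The memory observation can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions: `aggregate` and `retained_bytes` are hypothetical helpers, frames are modeled as plain `bytes` objects, and the aggregator is assumed to emit one mean frame per segment. It compares how many content bytes remain referenced after aggregation when content is appended versus when it is replaced.

```python
def aggregate(frames):
    """Stand-in aggregator: compute the per-byte mean frame of a segment."""
    return bytes(sum(column) // len(frames) for column in zip(*frames))


def retained_bytes(num_segments, frames_per_segment, frame_size, append_only):
    """Return how many content bytes are still held after aggregating all segments."""
    retained = []
    for _ in range(num_segments):
        # Decode step: one segment's worth of (all-zero) frames.
        frames = [bytes(frame_size) for _ in range(frames_per_segment)]
        representative = aggregate(frames)
        if append_only:
            # Decoder output stays attached; the aggregate is appended on top.
            retained.extend(frames)
            retained.append(representative)
        else:
            # Aggregated content replaces the decoder output, which can be freed.
            retained.append(representative)
    return sum(len(frame) for frame in retained)


append_only_total = retained_bytes(10, 25, 1000, append_only=True)
replacing_total = retained_bytes(10, 25, 1000, append_only=False)
```

With 10 segments of 25 frames at 1000 bytes each, the append-only variant holds 26 frames per segment and the replacing variant holds one, so the retained footprint differs by a factor of 26 in this toy setup. Real frame sizes and segment lengths would of course change the absolute numbers.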

Overall, I get the feeling that we have added a lot of complexity to cover a specific edge case. This complexity seems to have a negative impact on the cases we cover on a regular basis. And it seems to me that open questions remain as to how this mechanism should behave in different scenarios.

Therefore, before expanding upon this feature, I would like to pause and think about whether we are headed in the right direction. This issue should be used to document the discussion and come up with a design specification. Maybe this is something that needs to be discussed during one of our meetings.

@lucaro and @faberf: Since this is somehow your brain-child, I added you as assignees.

lucaro · 2024-11-07T08:41:27Z

I agree with the points you raised and am not a fan of the append-only semantic for all operations. I think we need to fundamentally define what types of operators can do what kind of operations and then also put mechanisms in place that ensure a certain level of consistency. Various functionalities are currently handled by each feature separately (e.g., filtering by content source), which can easily lead to unexpected behavior when a single feature implements such common functionality differently for some reason. For several types of operators, there is also no need to operate directly on the flow, as the risks of breaking something, in my view, greatly outweigh the flexibility, which is often not even desired at that point. This would all certainly benefit from some re-thinking.

ppanopticon added the bug, enhancement, and question labels Nov 6, 2024

ppanopticon added this to the Release Candidate #2 milestone Nov 6, 2024

ppanopticon assigned lucaro and faberf Nov 6, 2024
