[CHANGE] Content authorship and append-only Retrievables in the face of Aggregation and Segmentation #118
One of the recent pull requests to `dev` introduced the `ContentAuthorAttribute`, with the idea that `ContentElement`s can be labeled and selectively processed based on the operator that created them. Furthermore, and as a side effect thereof, `ContentElement`s are always appended to an `Ingested` to make sure that different execution paths in the pipeline have access to all the necessary information. In other words, `Retrievable`s have become strictly append-only objects that accumulate different content representations (and other data structures such as `Descriptor`s).

While I understand the desire to have the first part, I see the second part of this mechanism as highly problematic for multiple reasons, not the least of which is that I immediately run into issues even for very simple examples. Here are my observations in a nutshell:

- The current approach is taxing on memory. Even for very simple pipelines (Decode > Aggregate > Extract), I run out of memory within seconds. The reason is simple: the aggregation step no longer frees memory, since all versions of the content are kept around until the video is fully extracted. Of course, this can be worked around by adding more memory (unreliable) or by using the `CachedContentFactory` (slow). But I think it's less than ideal that it is no longer possible to construct pipelines with a low memory footprint.
- The approach adds a lot of complexity (which is currently poorly documented). Again, even in this simple example, I'm forced to specify which of the many `ContentElement`s I actually want to use, even though it is self-evident from the pipeline setup. Without doing so, extraction takes place on all the content.
- It is also unclear to me how this mechanism behaves in more complex scenarios where we do aggregation and/or segmentation of `Retrievable`s. What happens, for example, if we create new segments (i.e., new `Retrievable`s) that replace the incoming ones? We can of course emit the new `Retrievable` for processing. But since the source `Retrievable`'s relationships cannot be changed, both segmentations are kept around and, by the current logic, are thus persisted.

Overall, I get the feeling that we have added a lot of complexity to cover a specific edge case. This complexity seems to have a negative impact on the cases we cover on a regular basis. And it seems to me that there remain open questions as to how this mechanism should behave in different scenarios.

Therefore, before expanding upon this feature, I would like to pause and think about whether we're headed in the right direction here. This issue should be used to document the discussion and come up with a design specification. Maybe this is something that needs to be discussed during one of our meetings.

@lucaro and @faberf: Since this is somehow your brain-child, I added you as assignees.
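
To make the first observation above a bit more tangible, here is a toy model of the memory effect. It is only a sketch under assumptions: the `Content` and `Retrievable` classes, the `decode` and `aggregate` functions, and the author strings are hypothetical stand-ins written in Python for brevity, not the project's actual API. The only thing it is meant to show is how many decoded frames stay reachable per segment once aggregation has run, with append-only versus replace semantics.

```python
from dataclasses import dataclass, field


@dataclass
class Content:
    author: str  # name of the operator that produced this element (hypothetical)
    size: int    # payload size in bytes, a stand-in for real frame data


@dataclass
class Retrievable:
    contents: list = field(default_factory=list)

    def retained_bytes(self) -> int:
        # Everything still referenced by the retrievable cannot be freed.
        return sum(c.size for c in self.contents)


def decode(frame_count: int, frame_size: int) -> Retrievable:
    """Toy decoder: attaches `frame_count` frames of `frame_size` bytes each."""
    return Retrievable([Content("decoder", frame_size) for _ in range(frame_count)])


def aggregate(retrievable: Retrievable, append_only: bool) -> Retrievable:
    """Toy aggregator: picks one representative frame out of the decoded ones."""
    frames = [c for c in retrievable.contents if c.author == "decoder"]
    representative = Content("aggregator", frames[0].size)
    if append_only:
        # Append-only: decoded frames stay reachable next to the aggregate.
        retrievable.contents.append(representative)
    else:
        # Replace: decoded frames are dropped and can be garbage collected.
        retrievable.contents = [representative]
    return retrievable


FRAME_SIZE = 1920 * 1080 * 3  # one uncompressed full-HD RGB frame (assumption)

appended = aggregate(decode(25, FRAME_SIZE), append_only=True)
replaced = aggregate(decode(25, FRAME_SIZE), append_only=False)

print(appended.retained_bytes() // FRAME_SIZE)  # 26 frames kept alive
print(replaced.retained_bytes() // FRAME_SIZE)  # 1 frame kept alive

# With append-only semantics, a downstream extractor also has to filter
# explicitly by author to avoid processing every frame.
to_extract = [c for c in appended.contents if c.author == "aggregator"]
print(len(to_extract))  # 1
```

In this toy setup, a 25-frame segment keeps 26 frames alive with append-only semantics and a single frame with replace semantics. The last lines also hint at the second observation: the extractor has to select content by author even though the pipeline order already implies which content is meant.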