Object with Files and FileSets? #17

Closed
azaroth42 opened this issue Mar 14, 2016 · 66 comments

Comments

@azaroth42
Contributor

@no-reply @mjgiarlo @cbeer @jpstroop

In Scenario 5, there is a digital object (a book) with its own files (a PDF and OCR'd text) as well as a set of components (pages) with their own files (TIFF and OCR).

There seem to be several options here:

  • (1) The book is an Object that hasFile the files, and you dig through tech MD to find out more about them.
  • (2) The book has files, and is thus a FileSet. That implies that FileSets can have member FileSets.
  • (3) The book has a "self" FileSet for the PDF and OCR, and component FileSets for the pages. The self FileSet should then be distinguished and not part of the ordered list of page FileSets.
  • (3b) Further distinguish ComponentFileSet [becomes a Canvas in IIIF] from ObjectFileSet [becomes a rendering of the Manifest in IIIF].
@jpstroop

I think Hydra::Works breaks this--maybe. In pure PCDM, the Book is a pcdm:Object, with files representing it in its entirety, like a PDF, associated via pcdm:hasFile. Each page is also a pcdm:Object, with associated files representing the page only.

I'm not sure how you do this in Hydra::Works. Maybe there's a FileSet that's part of the work but not part of the order? :-(
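
The pure-PCDM shape described above can be sketched roughly as follows. This is only an illustration: the class names `PcdmFile` and `PcdmObject` and the file names are hypothetical, not part of Hydra::PCDM or Hydra::Works, and list order merely stands in for whatever ordering mechanism is actually used.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PcdmFile:
    """Hypothetical stand-in for a binary attached via pcdm:hasFile."""
    name: str


@dataclass
class PcdmObject:
    """Hypothetical stand-in for a pcdm:Object with files and members."""
    label: str
    files: List[PcdmFile] = field(default_factory=list)        # pcdm:hasFile
    members: List["PcdmObject"] = field(default_factory=list)  # pcdm:hasMember


# The book's own files represent it in its entirety.
book = PcdmObject("Book", files=[PcdmFile("book.pdf"), PcdmFile("book-ocr.txt")])

# Each page is also an object, with files representing that page only.
for n in (1, 2):
    page = PcdmObject(
        f"Page {n}",
        files=[PcdmFile(f"p{n}.tiff"), PcdmFile(f"p{n}-ocr.txt")],
    )
    book.members.append(page)
```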

@mjgiarlo
Member

A naïve Hydra::Works-based model (w/o ordering represented...) could look like this:
[Image: diagram of the naïve Hydra::Works-based model; not preserved]

Maybe this is what @jpstroop was 😦 about!

@azaroth42
Contributor Author

Right, pure PCDM is option 1... Object with files and more objects.
In Works, a FileSet that's not part of the order but is a member is (3) or (3b)
Or the Object also becomes a FileSet, as it has a set of files plus the member FileSets ... which is (2). For the record, I don't like (2).

The version from Mike is (4) ... Introduce another Object to represent the page separate from the FileSet that holds the Files. This is what we decided /against/ at the SD HDC ... no need for the object that becomes the Canvas separate from the FileSet that holds the images (and other content).

If (4) is still on the table, it's (still) my preference.

@mjgiarlo
Member

@azaroth42 @jpstroop Can you refresh my memory about the downside of (4)? Is it just that the Page-level Work there is not necessary to make the mapping to IIIF (in which case, OK but ¯\_(ツ)_/¯ ❓) or is there more to it than that?

@azaroth42
Contributor Author

I believe (Jon correct me if I'm misremembering) it's that the Page object would be just another object to maintain a URI for without much value... you can add label and so forth to the FileSet directly, such that it works out at least 95% of the time.

There is some value, but mostly the 5% cases that Shared Canvas deals with:

  • Aligning multiple images or content in a single frame of reference
  • Having multiple filesets for multiple digitizations of the same page (2001 vs 2015)
  • Having multiple filesets for different aspects of the same page (image vs text)
  • Having page specific information about the image -- e.g. cropping boundaries to remove scanning bed
  • Pages without digital content but with notes (e.g. back of the photograph has a signature but no image)
  • Point of reference for external alignment of other filesets (e.g. folio from one institution, miniature from another)

@mjgiarlo
Member

The main value of the extra PCDM object is that it aligns well with the Hydra::Works model -- we can do this using current codebases without making a single modification (right?). I'm not saying that concern overrides all others, but I'd just toss that in as a valuable thing to keep in mind. (Have I mentioned that I'm a fan of models for which there's already working code? 😀 )

@jpstroop

> I believe (Jon correct me if I'm misremembering) it's that the Page object would be just another object to maintain a URI for without much value... you can add label and so forth to the FileSet directly, such that it works out at least 95% of the time.

I honestly can't remember why--maybe the restrictions of H::W just make it feel like kludge? I suppose I agree that (4) is the way to go if we don't want to bend H::W. I think @tpendragon was in on that conversation too. Maybe he remembers something?

@tpendragon

I've got a lot of thoughts and it's hard to organize them in my head, so I'm just gonna stash them all here and let them be commented upon.

  1. I think what @mjgiarlo posted (option 4) is the expanded version of every possible interaction with Hydra Works. It basically reads like this: "Works are representations of the conceptual or physical - if a book has pages, a work (the book) has child works (the PHYSICAL pages). FileSets are representations of the digital file that sits on disk. If the book has a PDF, then it has a FileSet (the digital). Its conceptual pages (works) each have a digital representation (or don't!), each of which has a FileSet. Only digital things make sense to have binaries, so they have Files."
  2. This is both good and bad. It's ridiculously expensive to make a hierarchy that deep for everything, and querying it quickly becomes a nightmare - even if you did have something like SPARQL in your toolset, which we don't. @mjgiarlo says we can do this using current codebases without making a single modification, but we have no UI or interaction for actually doing this. Even in Plum it would be hard - LOTS of clicks. HOWEVER, if everything's modeled this way then a computer can have a generally sound interaction model with all things hydra:works and be able to reason about what to do at each layer.
  3. This probably shouldn't matter, but the IIIF mapping becomes...complex. Before FileSets were canvases, this would mean something more like Works are Canvases if they only have FileSets, and even more (although this probably should have always been the case) Works are Canvases if they only have FileSets which have an original File that's an image according to its technical metadata. Although maybe that's not right either? When you have a Collection which has a Work that has a PDF and images, should the PDF be on the canvas via something like IxIF or should it be its child works?

In summary, I THINK I prefer option 3 for "thing we can do easiest and I can imagine a UI for", but if this expanded model is something we can use everywhere (and my description is accurate) then maybe it's worth the expense, and it would make @cmh2166 happy I think.

@tpendragon

Oh, also

  4. In option 4, is the FileSet ordered? Prooobably not, and maybe that's where the IIIF mapping works out.

@azaroth42
Contributor Author

Thanks @tpendragon !

  1. Agree that 4 is the most expanded version, and with your reading. Given #12 (distinguish physical and digital in the Hybox models), plus the aim of having an extensible model that works with both image based and more traditional institutional repository objects, I wonder if the value of consistency across hybox instances and uses outweighs the cost of the additional objects.
  2. I hear you about the expense! Even in Triannon with relatively simple annotation objects, the number of interactions was very high :( That said, the cost of Fedora4 interactions is something that we (the community) have already noted, and we should work with Duraspace and F4 committers to reduce it. So while implementation efficiency is still important, making choices that will be hard to change later based on the current state of technology seems like it would be good to avoid if there's at least hope that the technology will improve.
  3. I'm not sure that I follow. The Page object would be the Canvas, and the FileSet is more like the annotation that links the content to the canvas? So the page can then maintain the separate height and width of the canvas, without colliding with the height and width of the images. I'll build a diagram :)
    For the PDF of the Collection/Object, there's also the rendering property (http://iiif.io/api/presentation/2.1/#rendering) for resources that represent the entire thing, rather than individual views.
  4. I don't think that filesets would be ordered. What would the order represent, other than perhaps a preference for using the first resource rather than the last?

In terms of UI/UX, I don't think that the system should try to mirror the model directly into the UI. In particular, the distinction between RWO and the digitized object is important for some cases, but the vast majority of the properties can apply to one or the other -- a single form could be used to capture the information, regardless of where that information ends up in the persistence layer that implements the model. It might also be good to have admin configurable templates for these, such as setting up controlled vocabularies to use, fields to hide/show, fields to auto-populate, and so on.

@mjgiarlo
Member

While I catch up on this thread, to be clear about my own intent: I am 💯 OK with bending Hydra::Works or just using Hydra::PCDM, or another Hydra::PCDM-based library, for our needs here. I thought it prudent to have at least one option that is the naïve Hydra::Works model.

@azaroth42
Contributor Author

I think the overall model looks something like:

[Image: "hybox" model diagram; not preserved]

And the mapping for IIIF falls only on the Digital Object side of the line, with metadata copied across from the RWO side. The mapping would be:

Collection --> Collection
Object --> Manifest
(order of object) --> Sequence
Part --> Canvas
FileSet --> Annotation
File --> Content
TechMD --> info.json for images

And then FileSet for an Object or Collection would be rendering, rather than an Annotation.
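
As a rough sketch, the mapping above could be expressed as a lookup with one special case for FileSets attached directly to an Object or Collection. The function name `iiif_role` and the string labels are hypothetical; the Sequence and info.json rows are left out because they depend on ordering and technical metadata rather than on resource type alone.

```python
from typing import Optional

# Type-to-type rows of the mapping: digital-side resource -> IIIF construct.
IIIF_MAPPING = {
    "Collection": "Collection",
    "Object": "Manifest",
    "Part": "Canvas",
    "FileSet": "Annotation",
    "File": "Content",
}


def iiif_role(resource_type: str, parent_type: Optional[str] = None) -> str:
    """Return the IIIF construct for a resource, given its parent's type.

    A FileSet that hangs directly off an Object or Collection represents
    the whole thing, so it is treated as a rendering, not an Annotation.
    """
    if resource_type == "FileSet" and parent_type in ("Object", "Collection"):
        return "rendering"
    return IIIF_MAPPING[resource_type]
```

For example, `iiif_role("FileSet", "Part")` gives the Annotation case, while `iiif_role("FileSet", "Object")` gives the rendering case for something like the book's PDF.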

The provenance/history/versioning features of the repository likely wouldn't be put into the IIIF, but are important to capture for HyBox on the digital side.

@tpendragon

👍. It's just determining when a Work is a Part, yeah? It may just be "if it's ordered". There's also an issue where width/height are required for the canvas, and if there's no FileSet then there's no width/height...probably

@azaroth42
Contributor Author

Yes, determining when a work is a part is key. Given that in IIIF all canvases must be part of a sequence, the ordered-ness is a great way to do that. 👍 from me.

The part should be able to record the h/w in the event that there isn't a fileset. It also provides a resource to hang other properties off, such as the description/note that the back of a photograph has a signature on it, but there's no image that depicts it.
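
A minimal sketch of the two rules discussed here, assuming a hypothetical `Part` class: a work counts as a Canvas when it is part of its parent's order, and the canvas size comes from the Part's own height/width when recorded, falling back to an image's size otherwise. The precedence between the two sources is an assumption for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Part:
    """Hypothetical Part: a work that may or may not be in the order."""
    label: str
    ordered: bool                                  # in the parent's order?
    height: Optional[int] = None                   # recorded on the Part
    width: Optional[int] = None
    image_size: Optional[Tuple[int, int]] = None   # (height, width) of an image, if any


def is_canvas(part: Part) -> bool:
    """A work is treated as a Canvas if it is part of the order."""
    return part.ordered


def canvas_size(part: Part) -> Optional[Tuple[int, int]]:
    """Prefer the Part's own h/w; otherwise use the image's size, if any."""
    if part.height is not None and part.width is not None:
        return (part.height, part.width)
    return part.image_size


# Back of a photograph: a signature is noted, but no image depicts it.
verso = Part("Verso (signature, no image)", ordered=True, height=1000, width=800)
recto = Part("Recto", ordered=True, image_size=(4000, 3200))
```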

@cmharlow

Thanks for cc'ing me, @tpendragon.

I like the overall model @azaroth42 just shared (+1 to option 4 from my viewpoint), and my questions are more about what relationship we use to connect HW:Work/PCDM:Object to RWO (or whatever other instances in other repositories/domain models we want to link to - WEMI Works, CHOs, etc.), and if that RWO/CHO/WEMI instance gets representation somehow in Fedora/PCDM.

FWIW in this discussion, we've got some edge cases currently where we want to make descriptive metadata assertions on a HW:Work/PCDM:Object that stands in for the Page (basically migrating legacy functionalities). @azaroth42 covers similar needs in his comments.

Here is a simple overview of what I'd hoped we would do (we've gone the route of conflating the dpla:SR instance/stand-in for descriptive metadata on the WEMI Work with the Book PCDM:Object). The red writing shows the very specific metadata needs that made me originally look at this idea.
[Image: overview diagram; not preserved]

Hope this helps, and that I didn't misunderstand the discussion so far. Thanks for all you guys do.

@azaroth42
Contributor Author

@cmh2166 -- our graphs look isomorphic, which is encouraging that independent analysis came to the same result :)

My thinking is that for HyBox we would want to include one level of RWO for Object and Collection to act as the resource that maintains the descriptive metadata. This would be a relatively simple resource that can be replaced or extended as desired on a case by case basis, without breaking the digital object model. If someone wants to make a module for BibFrame or CIDOC-CRM or BIBO or ... then there are clear anchoring points across that black line.

We should discuss next week at LDCX :D

@tpendragon

@cmh2166 We should talk about the RWO vs intellectual work split at LDCX, because if you include "file on disk" I think there are three levels there, versus the two that would be here.

@cmharlow

@azaroth42 - Absolutely, and I admittedly am sharing here for a different context than what you've got for Hybox (and I know nothing about IIIF other than it seems pretty rad and I wish I had a reason to be involved with it).

@tpendragon - yes, I agree. From my perspective, it's a question of what metadata domain models we want to bring over to pcdm representation versus deciding that pcdm represents digital objects/collections and building bridges to those other domain models as described/stored elsewhere.

So, I generally like where this is going with option 4, though I feel bad about the performance inefficiencies this can create, and I still have questions for Hydra::Works/PCDM more broadly. We can def discuss next week.

I'll let you guys get on with planning the hybox revolution.

@azaroth42
Contributor Author

@tpendragon Good point! pcdm:File is the file, so the pointer doesn't really exist in @cmh2166's diagram. I skipped over that. Also the OCR file currently on the AF:Book would live inside a FileSet. So maybe not entirely isomorphic yet.

@cmharlow

@azaroth42 OCR is a lingering question (where to put and how). The file pointers to AWS will be added in an upcoming test instance (right now, we're just storing files in fedora for sake of testing all this in a sandbox). Any advice on how to handle those pointers, +1.

@azaroth42
Contributor Author

We have the same issue at Stanford for very large files (video and web archives in particular) where we will need Fedora4 to somehow manage the metadata for content that isn't directly "in" Fedora. It's (IMO) an important question, as the distinction is between an RDFSource and a NonRDFSource (in LDP terms). I've been assuming that this will work itself out, but it needs to be scheduled (tag @cbeer @mjgiarlo @hannahfrost @anarchivist) as a not insignificant piece of work

@escowles

@azaroth42's list of reasons to have a Page object separate from the FileSet seems reasonably convincing to me. I don't think all of those absolutely need a separate Page object, but it would certainly make them more elegant.

Is the intention to always use separate Page objects, or to use them for the 5% of cases that need them?

@azaroth42
Contributor Author

I would prefer to always use them, rather than have to have the developer / admin / whoever make a choice. And then have to test for the results of that choice all the time in the code.

From the SD HDC, if we can use batch ops to F4 to speed up some of these interactions -- both create and retrieve -- I would like to believe that the cost of the additional objects will be relatively low.

@tpendragon

> I would prefer to always use them, rather than have to have the developer / admin / whoever make a choice.

👍 Branches are the death of productivity.

@mjgiarlo
Member

@tpendragon 💬

> Branches are the death of productivity.

👍

@escowles

Though, AFAICT, the current F4 batch operations draft spec doesn't anticipate fewer HTTP requests -- it's really a refinement of the current transaction support to accept or abandon a set of changes, not a way of doing a bunch of operations in a single request.

@mjgiarlo
Member

@tpendragon 💬

> @mjgiarlo says we can do this using current codebases without making a single modification, but we have no UI or interaction for actually doing this.

Fair point. I was assuming a near-term future where these tickets blocking the Sufia 7.0.0 release were already done. Sorry, product owner blinders were on. 😉

@mjgiarlo
Member

@azaroth42 💬

> We have the same issue at Stanford for very large files (video and web archives in particular) where we will need Fedora4 to somehow manage the metadata for content that isn't directly "in" Fedora.

Is this a HyBox need or a different need? Which of our many needs are you referencing here, @azaroth42? 😉

@azaroth42
Contributor Author

Not a need that has been identified for HyBox to my knowledge, but certainly one that has come up at Stanford. However, if the median repo size is on the order of 5TB, I do wonder whether it /is/ actually a HyBox need? We could push it to Business to decide?

@mjgiarlo
Member

Meh, I don't think it is a HyBox need at the moment so we should punt. Thanks for clarifying!

@no-reply

@escowles 💬

> Have a FileSet for the PDF, and a child Object to hold the page image FileSets.

Is this genuinely easier to implement than a "part" object for each page? If so, is the reason mainly that it reduces the number of HTTP round trips?

@escowles

It reduces both the number of Objects and number of HTTP requests. Particularly on objects with many pages, this could be hundreds of extra objects and thousands of extra HTTP requests.

Having an extra object to represent RWO or RW collections or to group the page image FileSets together adds a handful of extra objects. But adding an extra object for every file adds many, many more.
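
To make the scale argument concrete, here is a back-of-the-envelope sketch comparing the two layouts for a single book. The function names are hypothetical, and `requests_per_object` is purely an assumed parameter: the thread does not state how many HTTP requests each object costs.

```python
def object_count(pages: int, part_per_page: bool) -> int:
    """Count the Objects needed for one book under the two layouts.

    part_per_page=True:  the book plus one Part object per page.
    part_per_page=False: the book plus a single child Object that groups
                         all of the page-image FileSets.
    """
    book = 1
    return book + (pages if part_per_page else 1)


def extra_requests(pages: int, requests_per_object: int) -> int:
    """Extra HTTP requests, given an ASSUMED cost per additional object."""
    extra_objects = object_count(pages, True) - object_count(pages, False)
    return extra_objects * requests_per_object
```

With 300 pages, the per-page layout needs 301 Objects against 2 for the grouping layout, so the gap grows linearly with page count whatever per-object request cost is assumed.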

@azaroth42
Contributor Author

To channel @cbeer from courtyard discussion ... the model can make the distinction, and the implementation can (even today) use # URIs to avoid the HTTP request overhead. Then when there's a technology solution for the overhead (e.g. something like LDP-batch), there's a trivial transition rather than an impossible one.

The request overhead issue is well known ... are there other concerns about the separation?

@escowles

@azaroth42 it also seems superfluous to have a separate Page object, because we already have a FileSet which can be used to hold descriptive metadata about the page. But, I fully admit that's based on the use cases I've worked on and not on the prospective HyBox user input.

@tpendragon

I agree that the model and the implementation can be different, and I don't want to get too deep into implementation, but # URIs won't work here (Hash URIs can't have contained resources, so you can't make the FileSet a # URI. They work great for leaf nodes.)
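
One way to see the hash URI limitation is that anything appended after the `#` stays inside the fragment, so there is no distinct server-side resource for a "child" to live under. A small sketch using Python's standard `urllib.parse.urldefrag`, with made-up example.org URIs:

```python
from urllib.parse import urldefrag

# A Page modelled as a hash URI on the book's document (hypothetical URIs).
page_uri = "http://example.org/book1#page1"

# Trying to "contain" a FileSet under the page just extends the fragment.
would_be_child = page_uri + "/fileset1"

# urldefrag splits a URI into the part before "#" and the fragment itself.
document, fragment = urldefrag(would_be_child)

# The document part is still the book; the "child" exists only in the
# fragment, which is not sent to the server in an HTTP request.
print(document)  # the book's URI
print(fragment)  # the page plus the would-be child path
```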

@azaroth42
Contributor Author

> Have a FileSet for each page plus a FileSet for the PDF, and use mime type to separate them.

👎 that works when it's a PDF ... what about a thumbnail of the set of images?

> Have a FileSet for the PDF, and a child Object to hold the page image FileSets.

Which is Lynette's Raven Pages object, no?

> Hash URIs can't have contained resources, so you can't make the FileSet a # URI.

Huh 😿 Tagging @cbeer ...

@azaroth42
Contributor Author

@escowles So ... convincing evidence would be:

  • if there is a needed 1:many relationship between Object and FileSet (e.g. one page, 2 filesets)
  • if there is an object needed without any FileSet -- e.g. if there are no files associated with the object, but the object is needed to associate metadata with it. (Assuming FileSets must have Files)
  • if the same predicate is attached to the fileset as to the object, but with a different value (e.g. identifiedBy different identifiers)

Yes?

@escowles

@azaroth42 Yes, those sound like the things to talk through the details of. The last one in particular sounds like an important issue to consider: if the Page/Part and the FileSet had the same predicate pointing at different values, that would be the kind of thing that would require a separate resource.

@no-reply

Running with the proposal that the FileSet is both a FileSet and a Page. I see two problems this leaves:

  1. How to understand that the FileSet with the PDF and OCR is distinct from the grouping of pages.
  2. Overloading of the kind described by the first and third points in #17 (comment)

The proposed solutions to (1) are:

  1. A distinct (non-FileSet) Page object for each page.
  2. An intermediary ("Page Set" ?) object that groups the pages.
  3. Relying on non-structural metadata (e.g. mimetype); or on orthogonal structure (e.g. ordering)
  4. Use a different predicate for top-level fileset; e.g. ore:aggregates rather than pcdm:hasMember on the theory that it is not a component part.
    • (I think this post is the first appearance of this option)
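
The "different predicate" option in the list above can be sketched with plain triples. The resource names are invented, and the helper `objects_of` is hypothetical; the point is only that the whole-book FileSet and the page FileSets become separable by predicate alone, without consulting mime type or ordering.

```python
# (subject, predicate, object) triples with made-up resource names.
triples = [
    ("book", "pcdm:hasMember", "page1-fileset"),
    ("book", "pcdm:hasMember", "page2-fileset"),
    ("book", "ore:aggregates", "pdf-fileset"),  # whole-book rendering
]


def objects_of(subject: str, predicate: str) -> list:
    """Return the objects linked from `subject` by `predicate`, in input order."""
    return [o for s, p, o in triples if s == subject and p == predicate]


component_parts = objects_of("book", "pcdm:hasMember")
whole_object_sets = objects_of("book", "ore:aggregates")
```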

The proposed solution to (2) is simply that we don't have use cases for those
(EDIT: reading back, I think there's a more generous interpretation: the alternate solution is to point an RWO at the FileSet, wiping away the overloading. This solution seems more compelling for the property overloading issue than the cardinality one. Cardinality is, presumably, part of what we want to capture in core repository structural metadata.)

Is this a fair summary of the state of things?

@azaroth42
Contributor Author

I agree we're unlikely to need the same predicate on both FileSet and Page, as FileSet is primarily a grouping construct for representations (in the webarch sense).

The solution to needing a FileSet without Files is obvious ... allow empty FileSets (which will break current code, but could be implemented).

The distinction between member resource and rendering of complete resource seems mostly orthogonal -- we need to do it regardless.

It's the first one that I think we have several use cases for where dumping all of the files in a single FileSet seems less than ideal (or a big change to what we understand as a FileSet). In order of convincing-ness to me:

  • Derivatives from the master, with their own derivatives (e.g. raw TIFF, plus cropped and color corrected version, each with separate thumbnails)
  • Multiple media types associated with a single page, which need their own FileSet. (e.g. image and video, or image and text, or video and audio, or ...)
  • Multiple digitizations of the same page (e.g. 1970s black and white, 2000s full color or depositor supplied vs done in house) and maintaining the distinguishing features (e.g. date, agent, etc)
  • Storage optimization -- it would be nice to allow infrequently used representations to be managed on slower storage compared to the access copy (plus access derivs).
  • Partial digitizations or distributed digitizations (In the worst/most interesting real case ... BNF holds the excised illumination, BVMM holds the folio ... but wind back down to in-scope scenarios where one institution holds both parts)

Convincing to others?

@no-reply

Note that we are raising some of the use cases in @azaroth42's most recent comment to the business side to determine whether they are in-scope.

@barmintor

FWIW: In the instances in which the FileSet-to-thing mapping has been problematic for me, I've had to work around the issue by spinning off auxiliary works or collections (of FileSets, ahem) that can fake the simpler relationship, sometimes lossily, for the sake of the engine. In particular, paged works that were photographed multiple times at different resolutions, color corrections, detail re-photographs, composite images, etc. I am 👍 @escowles not wanting to default to complexity, but wary that sometimes opt-in complexity isn't actually supported.

edit: the auxiliaries I mention exist exclusively to drive appearance in the site.

@escowles

I'd like to hear more about the use case of re-digitized items. I would expect to either replace the old scans (with versioning to retain them if desired), or to create a new digital object linked with dct:replaces or similar if you wanted to have both more readily available. In the case of auxiliaries solely for display (or for uncropped/uncorrected/etc. variants), couldn't you have a different File Use value to distinguish them?

@barmintor

> I would expect to either replace the old scans (with versioning to retain them if desired)

In the particular case I'm thinking of, it's not really that scans were old, they were photographed multiple times in one digitization. As you might expect, only the auxiliary was made public, but I'll try to

> couldn't you have a different File Use value to distinguish them?

I guess, but at some point the FileSet is really a collection of FileSets, isn't it? I think of the class pretty strongly as an original_file and its related files.

edit: That is, if FileSet becomes "Files that have a relationship to a RWO" rather than "Files that have a relationship to each other", don't we need to recreate FileSet for the latter? Keeping up with "technical metadata/derivative of original_file_X" relationships without the benefit of containment will be bad.

@elrayle

elrayle commented Mar 18, 2016

@barmintor, in your use case where each page has multiple resolution images, do you envision one resolution driving the contents presented in the IIIF viewer? Or would you have a low-res version and a high-res version both of which should be available to a IIIF viewer?

If only one needs to be viewable: It may be acceptable to have all resolutions in a single FileSet.

If both need to be viewable: We have discussed the ability to have multiple orders over the same set of objects (in this case Filesets). Could that be applied here? The low-res and high-res would each reside in their own Fileset. All the low-res would be in one ordering. All the high-res would be in another ordering.
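
A minimal sketch of multiple orderings over the same set of FileSets, assuming invented identifiers and a hypothetical `orderings_are_valid` check. Each ordering is just a sequence drawn from the shared membership, so an ordering can also cover only a subset.

```python
# One low-res and one high-res FileSet per page (made-up identifiers).
file_sets = {"p1-low", "p1-high", "p2-low", "p2-high"}

# Two independent orderings over the same membership.
orderings = {
    "low-res": ["p1-low", "p2-low"],
    "high-res": ["p1-high", "p2-high"],
}


def orderings_are_valid(orderings: dict, members: set) -> bool:
    """Every entry in every ordering must be one of the shared members."""
    return all(fs in members for order in orderings.values() for fs in order)
```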

I know this is complex, but how common is this use case? It seems ok to have a high level of complexity if the use case is rare.

@elrayle

elrayle commented Mar 18, 2016

I just saw @azaroth42's comment on issue #19, which states that the use case of one Page Work per page with 1:m Filesets for the page is the current proposed model.

@barmintor

In this particular case, low/high isn't a perfect fit, but we choose one for the auxiliary, which is an alternate ordering of a subset.

But doing this means we go from "resource is a surrogate for a RWO" to "resource is a surrogate for a presentation of a RWO", which compounds the descriptive problems.

@azaroth42
Contributor Author

Per #22, at least two of the use cases that require one resource with multiple filesets are considered to be important to capture in HyBox. As that requires a 1:many relationship, we can't merge Object and Fileset. So I think (thank you all for your comments!!) we can close the issue?

@elrayle

elrayle commented Mar 18, 2016

@azaroth42 Can you summarize what you consider the resolution to be, so we can be sure we are all on the same page?

@azaroth42
Contributor Author

Sure!

I think it boils down to:

  • Descriptive metadata about a physical object that has been digitized will be associated with a separate resource from the digital object that maintains the binary data structures. (e.g. real world book is not owl:sameAs the pcdm:Object for the scan of that book)
  • Parts of digitized objects, such as the pages of a book, have a separate Object from the FileSet that groups the Non RDF Sources. Each Part object can have 0 .. many FileSets.
  • Objects and Collections can have their own FileSets as well as Parts, for representations of the complete object. (e.g. a PDF of the book, plus the Parts are the Pages, with Image FileSets)
  • FileSets are not ordered internally, or with relation to each other. Only Objects and Collections are ordered (internally, and relative to a parent Object/Collection)
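
The four points above can be sketched as a small set of data structures. All class and field names here are hypothetical and do not come from Hydra::Works or Hydra::PCDM; sets stand in for "unordered" and lists for "ordered".

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Optional, Set


@dataclass(frozen=True)
class FileSet:
    """Groups Non RDF Sources; not ordered internally."""
    label: str
    files: FrozenSet[str] = frozenset()


@dataclass
class RealWorldObject:
    """Carries descriptive metadata about the physical thing."""
    title: str


@dataclass
class Part:
    """A part of a digitized object (e.g. a page): 0..many FileSets."""
    label: str
    file_sets: Set[FileSet] = field(default_factory=set)  # unordered


@dataclass
class DigitalObject:
    """The digital object: a separate resource from the real-world object."""
    label: str
    real_world_object: Optional[RealWorldObject] = None
    file_sets: Set[FileSet] = field(default_factory=set)  # whole-object renderings
    parts: List[Part] = field(default_factory=list)       # ordered


book = DigitalObject("Scan of the book", RealWorldObject("The physical book"))
book.file_sets.add(FileSet("PDF", frozenset({"book.pdf"})))
book.parts.append(Part("Page 1", {FileSet("Image", frozenset({"p1.tiff"}))}))
book.parts.append(Part("Page 2 (note only, no files)"))  # zero FileSets
```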

@azaroth42
Contributor Author

Discussed in depth at LDCX with the community and documentation is in process in other issues. Closing.
