Object with Files and FileSets? #17
Comments
I think Hydra::Works breaks this--maybe. In pure PCDM, the Book is a I'm not sure how you do this in Hydra::Works. Maybe there's a FileSet that's part of the work but not part of the order? :-( |
A naïve Hydra::Works-based model (w/o ordering represented...) could look like this: Maybe this is what @jpstroop was 😦 about! |
Right, pure PCDM is option 1... Object with files and more objects. The version from Mike is (4) ... Introduce another Object to represent the page separate from the FileSet that holds the Files. This is what we decided /against/ at the SD HDC ... no need for the object that becomes the Canvas separate from the FileSet that holds the images (and other content). If (4) is still on the table, it's (still) my preference. |
@azaroth42 @jpstroop Can you refresh my memory about the downside of (4)? Is it just that the Page-level Work there is not necessary to make the mapping to IIIF (in which case, OK but |
I believe (Jon correct me if I'm misremembering) it's that the Page object would be just another object to maintain a URI for without much value... you can add label and so forth to the FileSet directly, such that it works out at least 95% of the time. There is some value, but mostly the 5% cases that Shared Canvas deals with:
|
The main value of the extra PCDM object is that it aligns well with the Hydra::Works model -- we can do this using current codebases without making a single modification (right?). I'm not saying that concern overrides all others, but I'd just toss that in as a valuable thing to keep in mind. (Have I mentioned that I'm a fan of models for which there's already working code? 😀 ) |
I honestly can't remember why--maybe the restrictions of H::W just make it feel like a kludge? I suppose I agree that (4) is the way to go if we don't want to bend H::W. I think @tpendragon was in on that conversation too. Maybe he remembers something? |
I've got a lot of thoughts and it's hard to organize them in my head, so I'm just gonna stash them all here and let them be commented upon.
In summary, I THINK I prefer option 3 for "thing we can do easiest and I can imagine a UI for", but if this expanded model is something we can use everywhere (and my description is accurate) then maybe it's worth the expense, and it would make @cmh2166 happy I think. |
Oh, also
|
Thanks @tpendragon !
In terms of UI/UX, I don't think that the system should try to mirror the model directly into the UI. In particular, the distinction between RWO and the digitized object is important for some cases, but the vast majority of the properties can apply to one or the other -- a single form could be used to capture the information, regardless of where that information ends up in the persistence layer that implements the model. It might also be good to have admin configurable templates for these, such as setting up controlled vocabularies to use, fields to hide/show, fields to auto-populate, and so on. |
While I catch up on this thread, to be clear about my own intent: I am 💯 OK with bending Hydra::Works or just using Hydra::PCDM, or another Hydra::PCDM-based library, for our needs here. I thought it prudent to have at least one option that is the naïve Hydra::Works model. |
I think the overall model looks something like: And the mapping for IIIF falls only on the Digital Object side of the line, with metadata copied across from the RWO side. The mapping would be: Collection --> Collection And then FileSet for an Object or Collection would be rendering, rather than an Annotation. The provenance/history/versioning features of the repository likely wouldn't be put into the IIIF, but are important to capture for HyBox on the digital side. |
👍. It's just determining when a Work is a Part, yeah? It may just be "if it's ordered". There's also an issue where width/height are required for the canvas, and if there's no FileSet then there's no width/height...probably |
Yes, determining when a work is a part is key. Given that in IIIF all canvases must be part of a sequence, the ordered-ness is a great way to do that. 👍 from me. The part should be able to record the h/w in the event that there isn't a fileset. And provides a resource to hang other properties off like the description/note that the back of a photograph has a signature on it, but there's no image that depicts it. |
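A minimal Python sketch of the rule discussed in the two comments above. All class and field names here are hypothetical and are not taken from Hydra::Works or Hydra::PCDM. It assumes a work counts as a part when it appears in its parent's ordering, and that the canvas takes its width/height from the first FileSet when one exists, otherwise from dimensions recorded on the part itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class FileSet:
    # Hypothetical: pixel dimensions of the image held by this FileSet.
    width: int
    height: int


@dataclass
class Work:
    label: str
    file_sets: List[FileSet] = field(default_factory=list)
    ordered_members: List["Work"] = field(default_factory=list)
    # Hypothetical: dimensions recorded directly on the part, used
    # only when there is no FileSet to read them from.
    width: Optional[int] = None
    height: Optional[int] = None


def is_part(parent: Work, work: Work) -> bool:
    """Treat a work as a part if it appears in the parent's ordering."""
    return any(member is work for member in parent.ordered_members)


def canvas_size(part: Work) -> Tuple[int, int]:
    """Prefer FileSet dimensions; fall back to the part's own values."""
    if part.file_sets:
        first = part.file_sets[0]
        return first.width, first.height
    if part.width is not None and part.height is not None:
        return part.width, part.height
    raise ValueError(f"no dimensions available for {part.label!r}")
```

With this shape, the back of a photograph that has a note but no image can still become a canvas, as long as someone records its dimensions on the part.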
Thanks for cc'ing me, @tpendragon. I like the overall model @azaroth42 just shared (+1 to option 4 from my viewpoint), and my questions are more about what relationship we use to connect HW:Work/PCDM:Object to RWO (or whatever other instances in other repositories/domain models we want to link to - WEMI Works, CHOs, etc.), and if that RWO/CHO/WEMI instance gets representation somehow in Fedora/PCDM. FWIW in this discussion, we've got some edge cases currently where we want to make descriptive metadata assertions on a HW:Work/PCDM:Object that stands in for the Page (basically migrating legacy functionalities). @azaroth42 covers similar needs in his comments. Here is a simple overview of what I'd hoped we would do (we've gone the route of conflating the dpla:SR instance/stand-in for descriptive metadata on the WEMI Work with the Book PCDM:Object). The red writing marks the very specific metadata needs that made me originally look at this idea. Hope this helps, and that I didn't misunderstand the discussion so far. Thanks for all you guys do. |
@cmh2166 -- our graphs look isomorphic, which is encouraging that independent analysis came to the same result :) My thinking is that for HyBox we would want to include one level of RWO for Object and Collection to act as the resource that maintains the descriptive metadata. This would be a relatively simple resource that can be replaced or extended as desired on a case by case basis, without breaking the digital object model. If someone wants to make a module for BibFrame or CIDOC-CRM or BIBO or ... then there are clear anchoring points across that black line. We should discuss next week at LDCX :D |
@cmh2166 We should talk about the RWO vs intellectual work split at LDCX, because if you include "file on disk" I think there's three levels, there, versus the two that would be here. |
@azaroth42 - Absolutely, and I admittedly am sharing here for a different context than what you've got for Hybox (and I know nothing about IIIF other than it seems pretty rad and I wish I had a reason to be involved with it). @tpendragon - yes, I agree. From my perspective, it's a question of what metadata domain models we want to bring over to pcdm representation versus deciding that pcdm represents digital objects/collections and building bridges to those other domain models as described/stored elsewhere. So, I like generally where this is going with option 4, feel bad about the performance inefficiencies this can create, though, and I still have questions for Hydra::Works/PCDM more broadly. We can def discuss next week. I'll let you guys get on with planning the hybox revolution. |
@tpendragon Good point! pcdm:File is the file, so the pointer doesn't really exist in @cmh2166's diagram. I skipped over that. Also the OCR file currently on the AF:Book would live inside a FileSet. So maybe not entirely isomorphic yet |
@azaroth42 OCR is a lingering question (where to put and how). The file pointers to AWS will be added in an upcoming test instance (right now, we're just storing files in fedora for sake of testing all this in a sandbox). Any advice on how to handle those pointers, +1. |
We have the same issue at Stanford for very large files (video and web archives in particular) where we will need Fedora4 to somehow manage the metadata for content that isn't directly "in" Fedora. It's (IMO) an important question, as the distinction is between an RDFSource and a NonRDFSource (in LDP terms). I've been assuming that this will work itself out, but it needs to be scheduled (tag @cbeer @mjgiarlo @hannahfrost @anarchivist) as a not insignificant piece of work |
@azaroth42's list of reasons to have a Page object separate from the FileSet seems reasonably convincing to me. I don't think all of those absolutely need a separate Page object, but it would certainly make them more elegant. Is the intention to always use separate Page objects, or to use them for the 5% of cases that need them? |
I would prefer to always use them, rather than have to have the developer / admin / whoever make a choice. And then have to test for the results of that choice all the time in the code. From the SD HDC, if we can use batch ops to F4 to speed up some of these interactions -- both create and retrieve -- I would like to believe that the cost of the additional objects will be relatively low. |
👍 Branches are the death of productivity. |
👍 |
Though, AFAICT, the current F4 batch operations draft spec doesn't anticipate fewer HTTP requests -- it's really a refinement of the current transaction support to accept or abandon a set of changes, not a way to do a bunch of operations in a single request. |
Fair point. I was assuming a near-term future where these tickets blocking the Sufia 7.0.0 release were already done. Sorry, product owner blinders were on. 😉 |
Is this a HyBox need or a different need? Which of our many needs are you referencing here, @azaroth42? 😉 |
Not a need that has been identified for HyBox to my knowledge, but certainly one that has come up at Stanford. However, if the median repo size is on the order of 5TB, I do wonder whether it /is/ actually a HyBox need? We could push it to Business to decide? |
Meh, I don't think it is a HyBox need at the moment so we should punt. Thanks for clarifying! |
Is this genuinely easier to implement than a "part" object for each page? If so, is the reason mainly that it reduces the number of HTTP round trips? |
It reduces both the number of Objects and number of HTTP requests. Particularly on objects with many pages, this could be hundreds of extra objects and thousands of extra HTTP requests. Having an extra object to represent RWO or RW collections or to group the page image FileSets together adds a handful of extra objects. But adding an extra object for every file adds many, many more. |
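A rough back-of-the-envelope sketch of that comparison, in Python. The function name and the `requests_per_object` parameter are assumptions made up for illustration; the real number of HTTP requests per object depends on the implementation and is not stated in this thread. Files are left out because both models hold the same number of them.

```python
def resource_counts(pages: int, requests_per_object: int = 1) -> dict:
    """Compare object counts for a book with `pages` pages.

    merged:   one book object plus one FileSet per page
    separate: one book object plus one Page object and one FileSet per page
    """
    merged = 1 + pages
    separate = 1 + 2 * pages
    extra_objects = separate - merged
    return {
        "merged": merged,
        "separate": separate,
        "extra_objects": extra_objects,
        # Assumed cost model: a fixed number of requests per extra object.
        "extra_requests": extra_objects * requests_per_object,
    }
```

For example, `resource_counts(300, requests_per_object=4)` reports 300 extra objects and 1200 extra requests, which is the "hundreds of objects, thousands of requests" scale mentioned above if each object really does cost several requests.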
To channel @cbeer from courtyard discussion ... the model can make the distinction, and the implementation can (even today) use # URIs to avoid the HTTP request overhead. Then when there's a technology solution for the overhead (e.g. something like LDP-batch), there's a trivial transition rather than an impossible one. The request overhead issue is well known ... are there other concerns about the separation? |
@azaroth42 it also seems superfluous to have a separate Page object, because we already have a FileSet which can be used to hold descriptive metadata about the page. But, I fully admit that's based on the use cases I've worked on and not on the prospective HyBox user input. |
I agree that the model and the implementation can be different, and I don't want to get too deep into implementation, but # URIs won't work here (Hash URIs can't have contained resources, so you can't make the FileSet a # URI. They work great for leaf nodes.) |
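One way to see the point about hash URIs, sketched with Python's standard `urllib.parse`. The `example.org` URIs are placeholders, not real repository paths. Anything appended after the `#` stays inside the fragment, so a "child" of a hash URI still resolves to the same server-side resource as its parent.

```python
from urllib.parse import urldefrag

# Hypothetical URIs: a Page modelled as a hash URI on the book, and an
# attempt to hang a contained FileSet underneath it.
page_uri = "http://example.org/book#page1"
fileset_uri = page_uri + "/fileset1"

page_doc, page_fragment = urldefrag(page_uri)
fileset_doc, fileset_fragment = urldefrag(fileset_uri)

# Both URIs point at the same document on the server; the "child" path
# has only made the fragment longer.
same_document = page_doc == fileset_doc
```

This is why hash URIs are fine for leaf nodes but cannot act as containers in this sketch: there is no separate resource for the FileSet to live in.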
👎 that works when it's a PDF ... what about a thumbnail of the set of images?
Which is Lynette's Raven Pages object, no?
Huh 😿 Tagging @cbeer ... |
@escowles So ... convincing evidence would be:
Yes? |
@azaroth42 Yes, those sound like the things to talk through the details of. The last one in particular sounds like an important issue to consider: if the Page/Part and the FileSet had the same predicate pointing at different values, that would be the kind of thing that would require a separate resource. |
Running with the proposal that the
The proposed solutions to (1) are:
The proposed solution to (2) is simply that we don't have use cases for those Is this a fair summary of the state of things? |
I agree we're unlikely to need the same predicate on both FileSet and Page, as FileSet is primarily a grouping construct for representations (in the webarch sense). The solution to needing a FileSet without Files is obvious ... allow empty FileSets (which will break current code, but could be implemented). The distinction between member resource and rendering of complete resource seems mostly orthogonal -- we need to do it regardless. It's the first one that I think we have several use cases for where dumping all of the files in a single FileSet seems less than ideal (or a big change to what we understand as a FileSet). In order of convincing-ness to me:
Convincing to others? |
Note that we are raising some of the use cases in @azaroth42's most recent comment to the business side to determine whether they are in-scope. |
FWIW: In the instances in which the FileSet-to-thing mapping has been problematic for me, I've had to work around the issue by spinning off auxiliary works or collections (of FileSets, ahem) that can fake the simpler relationship, sometimes lossily, for the sake of the engine. In particular, paged works that were photographed multiple times at different resolutions, color corrections, detail re-photographs, composite images, etc. I am 👍 @escowles not wanting to default to complexity, but wary that sometimes opt-in complexity isn't actually supported. edit: the auxiliaries I mention exist exclusively to drive appearance in the site. |
I'd like to hear more about the use case of re-digitized items. I would expect to either replace the old scans (with versioning to retain them if desired), or to create a new digital object linked with dct:replaces or similar if you wanted to have both more readily available. In the case of auxiliaries solely for display (or for uncropped/uncorrected/etc. variants), couldn't you have a different File Use value to distinguish them? |
In the particular case I'm thinking of, it's not really that scans were old, they were photographed multiple times in one digitization. As you might expect, only the auxiliary was made public, but I'll try to
I guess, but at some point the FileSet is really a collection of FileSets, isn't it? I think of the class pretty strongly as an original_file and its related files. edit: That is, if FileSet becomes "Files that have a relationship to a RWO" rather than "Files that have a relationship to each other", don't we need to recreate FileSet for the latter? Keeping up with "technical metadata/derivative of original_file_X" relationships without the benefit of containment will be bad. |
@barmintor, in your use case where each page has multiple resolution images, do you envision one resolution driving the contents presented in the IIIF viewer? Or would you have a low-res version and a high-res version, both of which should be available to a IIIF viewer? If only one needs to be viewable: It may be acceptable to have all resolutions in a single FileSet. If both need to be viewable: We have discussed the ability to have multiple orders over the same set of objects (in this case FileSets). Could that be applied here? The low-res and high-res would each reside in their own FileSet. All the low-res would be in one ordering. All the high-res would be in another ordering. I know this is complex, but how common is this use case? It seems ok to have a high level of complexity if the use case is rare. |
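A small Python sketch of the "multiple orderings over the same FileSets" idea from the comment above. The `FileSet` fields and the `build_orderings` helper are hypothetical names for illustration only; this is not how Hydra::Works or PCDM ordering is implemented.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class FileSet:
    # Hypothetical fields: which page this FileSet depicts, and which
    # capture it belongs to ("low" or "high" resolution).
    page: int
    resolution: str


def build_orderings(file_sets: Iterable[FileSet]) -> Dict[str, List[FileSet]]:
    """Group FileSets into one page-ordered list per resolution."""
    orderings: Dict[str, List[FileSet]] = {}
    for fs in sorted(file_sets, key=lambda f: f.page):
        orderings.setdefault(fs.resolution, []).append(fs)
    return orderings
```

Each resulting list could then drive its own sequence in a viewer, while the underlying set of FileSets stays shared.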
I just saw @azaroth42's comment on issue #19, which states that the use case of one Page Work per page with 1:m FileSets for the page is the current proposed model. |
In this particular case, low/high isn't a perfect fit, but we choose one But doing this means we go from "resource is a surrogate for a RWO" to
|
Per #22, at least two of the use cases that require one resource with multiple filesets are considered to be important to capture in HyBox. As that requires a 1:many relationship, we can't merge Object and Fileset. So I think (thank you all for your comments!!) we can close the issue? |
@azaroth42 Can you summarize what you consider to be the resolution to be sure we are all on the same page? |
Sure! I think it boils down to:
|
Discussed in depth at LDCX with the community and documentation is in process in other issues. Closing. |
@no-reply @mjgiarlo @cbeer @jpstroop
In Scenario 5, there is a digital object (a book) with its own files (a PDF and OCR'd text) as well as a set of components (pages) with their own files (TIFF and OCR).
There seem to be several options here: