Bulk Ingest #147

Closed
ruebot opened this issue Feb 12, 2016 · 17 comments

Comments

@ruebot
Member

ruebot commented Feb 12, 2016

Issue by daniel-dgi
Tuesday Feb 03, 2015 at 15:31 GMT
Originally opened as https://github.com/islandora-interest-groups/Islandora-Fedora4-Interest-Group/issues/13


Reformatting this to use the Use Case template.

Title (Goal): Bulk Ingest Objects into Fedora
Primary Actor: Repository architect, implementer
Scope: Architecture
Level: Low
Story: As a repository architect, I want to be able to ingest large numbers of files into Fedora using as little programming as possible.

Remarks:

  • Drop box style ingest is often tossed around, and coincides with typical route testing/debugging techniques using Apache Camel.
  • File/Manifest format would need to be decided upon, or be made pluggable for those with custom interests.
  • FOXML could be used to help perform the upgration from Fedora 3 to 4.
  • Probably not achievable for OR deadline, but needs to be worked on before then.
@ruebot
Member Author

ruebot commented Feb 12, 2016

Comment by ruebot
Tuesday Feb 03, 2015 at 15:33 GMT


I really like the idea of using a manifest, and I think it would be great if we stuck with a directory convention like BagIt. That would take care of the manifest, and a user could verify checksums as files are ingested.
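
As a rough illustration of the checksum verification mentioned above, here is a minimal sketch that checks a Bag's payload against its `manifest-<algorithm>.txt` file using only the Python standard library. The function name `verify_bag` and the choice of MD5 as the default are assumptions for the example, not anything decided in this thread; a real ingest route would more likely use an existing BagIt library.

```python
import hashlib
from pathlib import Path


def verify_bag(bag_dir, algorithm="md5"):
    """Return the payload paths that are missing or fail their checksum.

    Hypothetical helper: reads manifest-<algorithm>.txt, where each line
    is "<checksum> <path relative to the bag root>".
    """
    bag = Path(bag_dir)
    manifest = bag / f"manifest-{algorithm}.txt"
    failures = []
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # ignore blank lines
        expected, rel_path = line.split(None, 1)
        rel_path = rel_path.strip()
        target = bag / rel_path
        if not target.is_file():
            failures.append(rel_path)  # listed in the manifest but absent
            continue
        actual = hashlib.new(algorithm, target.read_bytes()).hexdigest()
        if actual != expected.lower():
            failures.append(rel_path)  # content does not match the manifest
    return failures
```

An empty list means every file listed in the manifest is present and matches, so an ingest step could refuse to proceed whenever the list is non-empty.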

@ruebot
Member Author

ruebot commented Feb 12, 2016

Comment by mjordan
Tuesday Feb 03, 2015 at 16:20 GMT


Of course I'd vote for BagIt, for the reasons @ruebot mentions. But, I'd be cautious about requiring it since not all sites will have or want to convert their stuff to Bags. Then again, if we're going to require a manifest, requiring BagIt is not all that different.

@ruebot
Member Author

ruebot commented Feb 12, 2016

Comment by daniel-dgi
Tuesday Feb 03, 2015 at 18:04 GMT


I'm not terribly familiar with BagIt. It's not something that I've dealt with in my work for clients. But at first glance it seems pretty appropriate.

METS is another option, I guess. Or we could just use a simple JSON or YAML manifest, but something tells me an actual metadata standard would make people feel better about things.

Other than BagIt (which I'm assuming contains all the data in one package), we could probably get away with just dropping the manifest in the watch folder, so long as it details the location of files and the user running the camel process has access to those locations.
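
To make the "just drop a manifest in the watch folder" idea concrete, here is a small sketch of what reading such a manifest might look like. The JSON layout (a top-level `files` list whose entries have a `path` key) and the function name `check_manifest` are purely hypothetical; no manifest format has been agreed on here. The sketch only checks that each referenced file exists and is readable by the user running the process.

```python
import json
import os
from pathlib import Path


def check_manifest(manifest_path):
    """Split the files named in a manifest into readable and problem paths.

    Assumed (hypothetical) manifest shape:
        {"files": [{"path": "/some/location/image.tif"}, ...]}
    """
    manifest = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    readable, problems = [], []
    for entry in manifest["files"]:
        path = Path(entry["path"])
        # The ingesting user must be able to read the file where it lives.
        if path.is_file() and os.access(path, os.R_OK):
            readable.append(str(path))
        else:
            problems.append(str(path))
    return readable, problems
```

A watcher could call this when a new manifest appears and only hand the object off for ingest if `problems` is empty.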

@ruebot
Member Author

ruebot commented Feb 12, 2016

Comment by awoods
Tuesday Feb 03, 2015 at 18:50 GMT


@daniel-dgi, "holey" bags are also an option if not all of the data is available in the package, with the optional fetch.txt file.
See: http://tools.ietf.org/html/draft-kunze-bagit-06#section-2.2.3
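
For anyone unfamiliar with holey bags, the sketch below shows one way to read `fetch.txt` and work out which payload files still need to be retrieved. Each line of `fetch.txt` has the form `URL LENGTH FILENAME`, with `-` allowed when the length is unknown. The names `FetchEntry` and `missing_payload` are made up for this example, and the actual downloading is deliberately left out.

```python
from pathlib import Path
from typing import NamedTuple, Optional


class FetchEntry(NamedTuple):
    url: str
    length: Optional[int]  # None when fetch.txt gives "-" for the length
    filename: str          # path relative to the bag root, e.g. data/x.tif


def missing_payload(bag_dir):
    """List fetch.txt entries whose files are not yet present in the bag."""
    bag = Path(bag_dir)
    fetch_file = bag / "fetch.txt"
    if not fetch_file.is_file():
        return []  # not a holey bag, nothing to fetch
    missing = []
    for line in fetch_file.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        url, length, filename = line.split(None, 2)
        filename = filename.strip()
        if not (bag / filename).exists():
            size = None if length == "-" else int(length)
            missing.append(FetchEntry(url, size, filename))
    return missing
```

An ingest process could fetch the returned entries first and then validate the completed bag against its manifest.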

@ruebot
Member Author

ruebot commented Feb 12, 2016

Comment by ruebot
Thursday Feb 19, 2015 at 18:48 GMT


Adding fcrepo and upgration tags since this could also inform the proposed upgration migration tool discussed on today's Fedora Tech call.

@ruebot
Member Author

ruebot commented Feb 12, 2016

Comment by dmoses
Saturday Oct 17, 2015 at 23:16 GMT


I think one of the most common patterns in the Drupal community for batch ingesting is using Feeds. It has a number of support modules for importing XML as well. @mjordan wrote a module a while back. BagIt would be a good choice too and may add predictability to the ingest process.

@ruebot
Member Author

ruebot commented Feb 12, 2016

Comment by daniel-dgi
Tuesday Oct 20, 2015 at 14:21 GMT


Thanks for being awesome, @dmoses. Feeds seem attractive from a Drupal front end point of view. Could maybe parse rdfxml? Would like to hear what @mjordan has to say about pros/cons of using feeds and nodes. His module means he's probably got the most experience in that realm of Drupal land.

Not the first time bags have come up, either. I'm interested in seeing if we can zip them and use them to replace our hand-rolled format for zip importer. Are bags of bags possible, as well? It would be amazing if we could mimic what we're doing in 1.x batch but with a well defined standard.

@ruebot
Member Author

ruebot commented Feb 12, 2016

Comment by ruebot
Tuesday Oct 20, 2015 at 15:52 GMT


Serialized bags are totally a thing. Are you thinking of the book and newspaper batch ingest w/r/t the bags in bags idea?

@ruebot
Member Author

ruebot commented Feb 12, 2016

Comment by mjordan
Tuesday Oct 20, 2015 at 16:18 GMT


@daniel-dgi Bags are agnostic to the content in their 'data' directory and that content's organization, so as @ruebot says, it's legal to have a Bag of Bags. The child Bags would just be serialized into .zip or .tgz files.
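
A Bag of Bags could be handled along these lines: walk the parent Bag's `data` directory, find the serialized child Bags, and unpack each one into its own working directory. This is only a sketch; the function name `unpack_child_bags` and the list of suffixes are assumptions, and `shutil.unpack_archive` should only be pointed at archives from a trusted source.

```python
import shutil
from pathlib import Path

# Serialization formats treated as child Bags (an assumption for this sketch).
SERIALIZED_SUFFIXES = (".zip", ".tgz", ".tar.gz")


def unpack_child_bags(parent_bag, dest_dir):
    """Unpack every serialized child Bag found in the parent's data directory.

    Returns the directories created, one per child archive.
    """
    data_dir = Path(parent_bag) / "data"
    dest = Path(dest_dir)
    unpacked = []
    for archive in sorted(data_dir.iterdir()):
        name = archive.name
        suffix = next((s for s in SERIALIZED_SUFFIXES if name.endswith(s)), None)
        if not archive.is_file() or suffix is None:
            continue  # not a serialized Bag, leave it alone
        target = dest / name[: -len(suffix)]
        shutil.unpack_archive(str(archive), str(target))
        unpacked.append(target)
    return unpacked
```

Each unpacked child could then be validated and ingested as its own object, which is roughly the shape a book or newspaper batch would need.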

To answer your question about nodes in Islandora Feeds, I took that approach because 1) it was easy/I am lazy and 2) it uncouples the steps of importing data and committing that data to the Fedora repo as objects. For example, you can perform various types of QA on the nodes before using Views Batch Operations to create the Islandora objects, add other datastreams, etc.

I wrote that module about two years ago; in fact, I started it at OR2013, with @dmoses, @ruebot and some of the usual suspects sitting right beside me in the back few rows of seats. Now that we have a clear path for Islandora 7.x-2.x, it makes even more sense to create nodes (for obvious reasons) than it did then.

A back-of-the-envelope diagram for using an existing tool like Feeds to manage the import and Bags to wrap file assets might look something like this: Feeds creates Drupal nodes that contain F4 object properties (maybe using a Feeds RDF parser?), with pointers to Bags on the Drupal filesystem. Each Bag contains the file assets for an Islandora object. The organization of the content within each Bag would likely be specific to each content model (basic image, newspaper issue, book, etc.). It is legal to also include a (non-Bag) manifest that represents the content model in some way (e.g., OAI-ORE, METS), so we might want to explore that option as well.

Using both Feeds and Bags like this is probably overkill, and preparing the Bags would put an additional burden on content handlers. But, there are a lot of other benefits to Bags that may justify that burden, like built-in checksum generation and packaging. Using holey Bags as @awoods points out would add even more flexibility.

@ruebot
Member Author

ruebot commented Feb 12, 2016

Comment by daniel-dgi
Tuesday Oct 20, 2015 at 16:24 GMT


Maybe we're really talking about two things here? Just using feeds to import nodes, and then zipped bags as a zip importer replacement? Heck, we could even just accept zip files on our services endpoints and use that to consume entire objects as opposed to the multipart/form-data shenanigans I've got going on right now.

Would be nice to use bags in that way since it's a drupal agnostic fashion to move things around. Within Drupal, feeds definitely seems like a great way to go. Maybe we should make a ticket for someone to dabble around?

This is getting interesting :)

@ruebot
Member Author

ruebot commented Feb 12, 2016

Comment by manez
Tuesday Oct 20, 2015 at 16:28 GMT


My (probably not typical) use case would be vastly improved by a bulk export/ingest interface - some way to pull down a small bunch of objects and their metadata, then upload them back up to another Islandora site. Sounds like that's something in the Bags wheelhouse?

That said, +1 for Feeds being a nice GUI/Drupal-y way to import

@ruebot
Member Author

ruebot commented Feb 12, 2016

Comment by mjordan
Tuesday Oct 20, 2015 at 16:36 GMT


My (recyclable envelope) diagram used both Feeds and Bags because AFAIK Feeds doesn't deal with file assets in any standardized way and I was assuming that the nodes created by Feeds would have some binary files hanging off them. But, the two could be completely separate. Will jump back into the discussion later, must attend all the meetings now 😞

@ruebot
Member Author

ruebot commented Feb 12, 2016

Comment by daniel-dgi
Tuesday Oct 20, 2015 at 19:10 GMT


@mjordan ah, i see. wasn't thinking about feeds not being able to handle files.

@ruebot
Member Author

ruebot commented Feb 12, 2016

Comment by dmoses
Tuesday Oct 20, 2015 at 19:33 GMT


I've got the 7.x.2 vm downloaded ... you can do files with feeds. I will investigate and try a proof of concept. Potentially?? it could be another migration tool by parsing the FOXML xml ... which includes paths to the binaries. Not sure. Will report back.

@ruebot ruebot closed this as completed Feb 12, 2016
@mjordan
Contributor

mjordan commented Feb 12, 2016

I know this issue has gone stale, but why close it?

@ruebot
Member Author

ruebot commented Feb 12, 2016

I'm working on migrating issues over. This was a bad migration. The original one is still here: https://github.com/islandora-interest-groups/Islandora-Fedora4-Interest-Group/issues/13

@mjordan
Contributor

mjordan commented Feb 12, 2016

Ah figured it had moved elsewhere, sorry for the unnecessary ping 😄
