Investigate using OpenRefine as part of the migration process #898
Comments
@dhlamb I don't see how it could be fully automated into a migration workflow. If nothing else, any actual mapping from "strings to things" is going to have a decent number of mistakes, and those will require human intervention. Or are you thinking of breaking up the automation to insert OpenRefine? In that case maybe a PHP-side client would be the bridge. |
I'm like 99% sure this'll have to be something done before kicking off Drupal migrate, and something that you'd manually take to the extent you feel comfortable with, because you can do a lot with it. I don't think it's applicable to every migration, but enough folks have mentioned it that we should find a way to squeeze it into the workflow for those who want to use it. Even if that means "Just run OpenRefine first" and we can provide some guidance. If we're lucky, maybe we make a plugin that uses https://github.com/keboola/openrefine-php-client that people can turn on in their yml if they want it. But full automation is doubtful because everybody's repository is different. |
Ok, cool, that's a much less ambitious / more practical approach than I thought was intended. |
Assigning myself @exsilica - lacking permissions for this repo |
Assigning myself as well - @carakey |
Maybe this tool? https://github.com/LibreCat/Catmandu I had a problem exporting MODS to RDF. |
Under development at LSU for converting from XML to CSV: https://github.com/lsulibraries/xml2csv |
I'm happy to jump in on this one, too -- but don't seem to have the permissions -- @mbolam |
Happy to jump on this @amcshane --- also happy to help compile the notes once folks have a chance to explore. |
Regarding reconciliation using "Conciliator" -- https://github.com/codeforkjeff/conciliator. My troubles turned out to be related to Java versions: my Mac was pointing at an outdated one. I tested with the latest version of Java and it is working on my desktop. No need for developer support, assuming people can get Java 1.8 working on their device. |
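For anyone who wants to script a reconciliation call rather than click through OpenRefine, here is a minimal sketch of the request and response shapes used by the OpenRefine reconciliation service protocol. It only builds the POST body and reads a response, so nothing is sent over the network. The function names and the sample response are hypothetical, and the endpoint in the comment is an assumption about a default local Conciliator install.

```python
import json
from urllib.parse import urlencode, parse_qs

# Assumption: a local Conciliator instance might be reachable at something like
# http://localhost:8080/reconcile/viaf ; check your own install before relying on it.

def build_reconcile_body(names, limit=3):
    """Build the form-encoded body for a batch reconciliation request.

    The protocol expects a 'queries' form field holding a JSON object that
    maps arbitrary keys (q0, q1, ...) to individual query objects.
    """
    queries = {
        f"q{i}": {"query": name, "limit": limit}
        for i, name in enumerate(names)
    }
    return urlencode({"queries": json.dumps(queries)})

def confident_matches(names, response):
    """Map each name to the id of its first candidate flagged match=True.

    Names with no confident candidate map to None so a person can review them.
    """
    matched = {}
    for i, name in enumerate(names):
        results = response.get(f"q{i}", {}).get("result", [])
        hit = next((r for r in results if r.get("match")), None)
        matched[name] = hit["id"] if hit else None
    return matched

names = ["Twain, Mark", "Unknown Person"]
body = build_reconcile_body(names)

# Hypothetical response, shaped like a reconciliation service reply.
sample_response = {
    "q0": {"result": [{"id": "example-id-1", "name": "Twain, Mark",
                       "score": 99, "match": True}]},
    "q1": {"result": [{"id": "example-id-2", "name": "Unknown, Person",
                       "score": 40, "match": False}]},
}
print(confident_matches(names, sample_response))
```

Leaving unmatched names as `None` keeps the human review step explicit, which fits the point made earlier in the thread that strings-to-things mapping will always need some manual checking.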
Are there any particular authority files folks want to make sure work? I've got a bunch of MARC that I can offer if anybody needs a bit of a mess to play with. It will not migrate prettily, I promise. |
@carakey and I were able to get xml2csv running on 15 of the sample MODS files that Islandora 7.x users have provided to MIG. Here's the branch with those resulting files: https://github.com/rtilla1/xml2csv . Point of interest: the 15 well-formed MODS files together contain 285 unique "fields" (every xpath that points to contents and which has a different combination of elements or attributes). There are holes in how these are being counted, but it's an interesting starting point. |
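To make the "unique fields" idea concrete, here is a small sketch of one way to count them: walk a MODS document and record a path for every element that directly holds text, with attributes folded into the path. This is not how xml2csv does it; the function names and the sample record are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def local(name):
    """Strip the '{namespace}' prefix ElementTree puts on tag/attribute names."""
    return name.rsplit("}", 1)[-1]

def field_paths(xml_text):
    """Yield one path string per element that directly contains text.

    Two elements count as the same "field" only if their element chain
    and all attributes along the way are identical.
    """
    root = ET.fromstring(xml_text)

    def walk(elem, prefix):
        attrs = "".join(
            f'[@{local(k)}="{v}"]' for k, v in sorted(elem.attrib.items())
        )
        path = f"{prefix}/{local(elem.tag)}{attrs}"
        if elem.text and elem.text.strip():
            yield path
        for child in elem:
            yield from walk(child, path)

    yield from walk(root, "")

# Hypothetical sample record, not one of the MIG test files.
SAMPLE = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Letters</title></titleInfo>
  <name type="personal"><namePart>Smith, John</namePart></name>
  <name type="corporate"><namePart>Acme</namePart></name>
  <subject authority="lcsh"><topic>Railroads</topic></subject>
</mods>"""

counts = Counter(field_paths(SAMPLE))
for path, n in sorted(counts.items()):
    print(n, path)
```

Because attribute values are part of the path, `name[@type="personal"]` and `name[@type="corporate"]` count as two fields, which is one reason the total climbs quickly across even a few files.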
@rtilla1 -- once you import the xml as a csv and complete the clean-up work, do you export as a csv and use another tool to recreate the MODS or are you using the Templating export feature in OpenRefine? |
The xml2csv project at https://github.com/lsulibraries/xml2csv has been updated to use the current mapping spreadsheet from the MIG - i.e., only the mapped xpaths are included in the csv output. The latest version incorporates most of the action items from the 8/28 call. Example of output: 15 MODS files as CSV in Google Sheets. If anyone takes it for a spin, I'd love feedback. |
@carakey I'll give it a shot tomorrow morning! |
Using OpenRefine as part of the migration process requires a number of steps. Some are complete; others still have work left to do.
Steps 2 (#913), 3 (#914), 4, 5, and 7 need work. Steps 1, 6, 8, and 9 are theoretically ready to go and have been tested with other applications or data. |
So, this might be crazy talk, but could we get OpenRefine to export these records back out as MODS records in a single modsCollection and Agents as MADS XML? Then we won't have to deal with the nested name delimiters we've been talking about in Zoom meetings. I can migrate XML documents as easily as (and sometimes more easily than) CSV. |
@seth-shaw-unlv -- One could potentially do templating in OpenRefine to export as MODS and/or MADS. I've played around with templating, but not used it extensively. It probably wouldn't be too tough to come up with a "basic version" that at least handles the core elements we've been considering for the sprint. |
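As a rough illustration of the target shape (outside OpenRefine's own templating exporter), the sketch below turns spreadsheet rows into one `modsCollection`. The column names `title` and `names` and the `|` delimiter are assumptions made up for the example, and only two core elements are handled.

```python
import csv
import io
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)

def q(name):
    """Qualify a local element name with the MODS namespace."""
    return f"{{{MODS_NS}}}{name}"

def rows_to_mods_collection(csv_text):
    """Build a modsCollection element with one mods child per CSV row."""
    collection = ET.Element(q("modsCollection"))
    for row in csv.DictReader(io.StringIO(csv_text)):
        mods = ET.SubElement(collection, q("mods"))
        title_info = ET.SubElement(mods, q("titleInfo"))
        ET.SubElement(title_info, q("title")).text = row["title"]
        # Assumption: multiple names share one cell, separated by '|'.
        # Each becomes its own <name> element, so no delimiter survives
        # into the exported XML.
        for name in (n.strip() for n in row["names"].split("|")):
            if name:
                name_el = ET.SubElement(mods, q("name"))
                ET.SubElement(name_el, q("namePart")).text = name
    return collection

# Hypothetical rows for illustration only.
CSV_TEXT = '''title,names
Letters,"Smith, John|Doe, Jane"
Diary,
'''

collection = rows_to_mods_collection(CSV_TEXT)
print(ET.tostring(collection, encoding="unicode"))
```

An OpenRefine export template would express the same structure with row templates, but the nesting is the point: repeated names become repeated elements instead of delimited strings.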
Templates do exist; this is the top hit for MODS templating and I’m pretty sure some of our consortium partners have built off of this version: https://gist.github.com/sallain/7604ffb0c155294fcfaf
In reply to Michael Bolam's comment above (August 30, 2018), which linked to http://digitalscholarship.utsc.utoronto.ca/content/blogs/converting-spreadsheets-modsxml-using-open-refine for templating in OpenRefine.
|
Were folks actually able to perform reconciliation with the provided test MODS? The version I'm seeing retains data about MARC subfields in the text, making reconciliation against LOC (for example) impossible. My assumption was that each subfield needed its own column, as well -- not unlike creating MARC records from delimited text files. |
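If the subfield data does need its own columns, a pre-processing pass along these lines might help. This is only a sketch: it assumes the leftover markers look like `$a`, `$d`, and so on, followed by whitespace, which may not match what is actually in the test MODS, and the function name is hypothetical.

```python
import re

# Assumption: subfield markers are a dollar sign, one letter or digit,
# then whitespace (e.g. "$a Smith, John, $d 1900-1980").
SUBFIELD = re.compile(r"\$([a-z0-9])\s+")

def split_subfields(value):
    """Split a string with embedded subfield markers into {code: [texts]}.

    Text before the first marker is kept under the key 'unmarked'.
    """
    # With one capture group, re.split returns
    # [leading_text, code1, text1, code2, text2, ...].
    parts = SUBFIELD.split(value)
    columns = {}
    lead = parts[0].strip(" ,")
    if lead:
        columns["unmarked"] = [lead]
    for code, text in zip(parts[1::2], parts[2::2]):
        columns.setdefault(code, []).append(text.strip(" ,"))
    return columns

print(split_subfields("$a Smith, John, $d 1900-1980"))
print(split_subfields("Railroads $z Louisiana $x History"))
```

Each key could then become a spreadsheet column, so that only the heading text (without dates or subdivisions) is sent for reconciliation.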
Many interested parties have mentioned using http://openrefine.org/ to clean up metadata before migrating to CLAW. We need users to investigate how it can be utilized to find external URIs for existing authorities and how it can clean up our MODS for us.
We also have to find out how to interact with it and where in the migration process it belongs. If we can call out to it over HTTP via an API, we may be able to integrate it using Drupal's migration framework. If not, it will have to be a step done before migration, while the data is still in 7.x or while working on an export.
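As a starting point for the "call out to it over HTTP" question, here is a minimal sketch that lists the projects on a running OpenRefine instance. It assumes OpenRefine is listening on its default local address and that the `get-all-project-metadata` command is available in the installed version; the function names are hypothetical, and the fetcher is injectable so the logic can be tried without a server.

```python
import json
from urllib.request import urlopen

def http_get_json(url):
    """Fetch a URL and decode the JSON body."""
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)

def list_projects(base_url="http://127.0.0.1:3333", fetch=http_get_json):
    """Return {project_id: project_name} for an OpenRefine instance.

    Assumption: the server answers
    GET /command/core/get-all-project-metadata
    with a JSON object containing a 'projects' mapping.
    """
    url = f"{base_url.rstrip('/')}/command/core/get-all-project-metadata"
    data = fetch(url)
    return {
        project_id: meta.get("name")
        for project_id, meta in data.get("projects", {}).items()
    }

# Usage against a running instance (not executed here):
#   for project_id, name in list_projects().items():
#       print(project_id, name)
```

A PHP client such as the one linked earlier in the thread would wrap the same kind of endpoints; whether write operations can be driven this way from a migration depends on the OpenRefine version in use.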
Download and install OpenRefine, then try it out and let us know what you think! More than one person can tackle this. No need for one person to hog all the metadata glory.