Finish ERA data migration prep work. #3151

Open
1 of 4 tasks
pgwillia opened this issue Jul 5, 2023 · 12 comments
@pgwillia
Member

pgwillia commented Jul 5, 2023

Finish ERA data migration prep work.

Started https://gist.github.com/pgwillia/eed7dd858e17a9a67f9d90cb1d703adb

  • One CSV per collection
  • Ensure that we're mapping to the "standard metadata schema"
  • Add the list of files as a column in the CSV
  • One zip file per CSV file, containing the files listed in that CSV
@pgwillia
Member Author

pgwillia commented Oct 29, 2024

Create a rake task that packages all items in a collection. Scholaris expects imports to be packaged per collection. For each collection in ERA we will manually set up a collection in Scholaris, then use that collection ID to import all the items that belong to it.

  • determine the largest package this would create and whether it will fit on disk on era-app-prd-4
  • determine whether we can export metadata as Dublin Core XML (work with @ualbertalib/metadata-team)

@pgwillia
Member Author

  • AIP description contains Markdown - is this what we want?

@lagoan lagoan self-assigned this Nov 29, 2024
@pgwillia
Member Author

pgwillia commented Dec 5, 2024

| Collection link | Collection title | Collection size (bytes) | Human-readable size |
|---|---|---|---|
| https://era.library.ualberta.ca/communities/680db7d9-e196-408f-8ea2-243c545ccfa3/collections/30741075-3f88-4e1c-8036-1ba1ecf1392b | Research Materials (Linguistics) | 22802934879 | 21.2 GB |
| https://era.library.ualberta.ca/communities/b41cdbfd-6af2-4a13-8ba6-59725565d445/collections/82fbdb4f-1202-4123-acf5-96d7aeb874a5 | BoardEx Reports | 171641887893 | 160 GB |
| https://era.library.ualberta.ca/communities/017b5983-6bca-47d1-9760-0435ed3aedd8/collections/bcf586ac-ab61-49e1-839f-b9f517839e9e | Bryan/Gruhn Archaeology Collection | 222760099446 | 207 GB |
| https://era.library.ualberta.ca/communities/db9a4e71-f809-4385-a274-048f28eb6814/collections/f42f3da6-00c3-4581-b785-63725c33c7ce | Theses and Dissertations | 316812040004 | 295 GB |

These are the four largest collections. I requested that ~600 GB be added to era-app-prd-4 to accommodate the thesis collection export. We can start small and use /era_tmp.
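As a rough check of the sizes above, here is a minimal Ruby sketch that reproduces the human-readable column and tests whether the largest collection fits in the requested space. The helper names (`human_size`, `fits?`) are hypothetical. The reading of "GB" as binary units (1024³ bytes) is an assumption that happens to match the figures in the table, and it does not account for any extra space a zip step might need.

```ruby
# Byte counts copied from the table of the four largest collections.
COLLECTION_SIZES = {
  "Research Materials (Linguistics)" => 22_802_934_879,
  "BoardEx Reports" => 171_641_887_893,
  "Bryan/Gruhn Archaeology Collection" => 222_760_099_446,
  "Theses and Dissertations" => 316_812_040_004
}.freeze

# Assumption: the "GB" figures are binary units (1024**3 bytes).
BYTES_PER_UNIT = 1024**3

# Hypothetical helper: format a byte count with three significant digits.
# Only meant for values of at least one unit, as in the table.
def human_size(bytes)
  value = bytes.to_f / BYTES_PER_UNIT
  integer_digits = value.floor.to_s.length
  decimals = [3 - integer_digits, 0].max
  rounded = value.round(decimals)
  rounded = rounded.to_i if decimals.zero?
  "#{rounded} GB"
end

# Hypothetical helper: true when a package of `bytes` fits in `capacity_bytes`.
def fits?(bytes, capacity_bytes)
  bytes <= capacity_bytes
end

largest_title, largest_bytes = COLLECTION_SIZES.max_by { |_title, bytes| bytes }
requested_bytes = 600 * BYTES_PER_UNIT # assumption: the ~600 GB request in binary units

puts "#{largest_title}: #{human_size(largest_bytes)}, fits: #{fits?(largest_bytes, requested_bytes)}"
```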

@anayram
Member

anayram commented Dec 9, 2024

Metadata preparation specs for SAF package creation:

  • Header format: express headers as namespace.element.qualifier (qualifier optional)
  • Header mappings: the specific namespace.element.qualifier to use for each field can be found in UAL Metadata Mappings
  • remove unmapped fields (see UAL Metadata Mappings)
  • merge fields as indicated (see UAL Metadata Mappings) - http://terms.library.ualberta.ca/ingestBatch and batch_ingest_id
  • add field prefixes for provenance information (see UAL Metadata Mappings)
  • Multiple value fields: use || (two pipe characters) as the separator for isversionof, creator, contributor, temporal, spatial, language, subject, and provenance
  • include a 'filename' column listing all files included in an ERA resource
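To illustrate the specs above, here is a minimal Ruby sketch that writes one CSV row per resource, with headers in namespace.element.qualifier form, multi-value fields joined with `||`, and a `filename` column. The header names and sample values are hypothetical stand-ins, since the actual names come from UAL Metadata Mappings. Joining multiple filenames with `||` is also an assumption.

```ruby
require "csv"

MULTI_VALUE_SEPARATOR = "||"

# Hypothetical sample resource; real headers come from UAL Metadata Mappings.
SAMPLE_ITEMS = [
  {
    "dc.title" => "Sample item",
    "dc.contributor.author" => ["Doe, Jane", "Roe, Richard"],
    "dc.language" => ["en"],
    "filename" => ["paper.pdf", "data.csv"] # assumption: filenames also use ||
  }
].freeze

# Build CSV text with one row per item. Array values are joined with the
# multi-value separator; missing values become empty cells.
def saf_metadata_csv(items)
  headers = items.flat_map(&:keys).uniq
  CSV.generate do |csv|
    csv << headers
    items.each do |item|
      csv << headers.map { |header| Array(item[header]).join(MULTI_VALUE_SEPARATOR) }
    end
  end
end

puts saf_metadata_csv(SAMPLE_ITEMS)
```

Values containing commas, such as inverted author names, are quoted by the CSV library, so the `||` separator remains unambiguous when the file is parsed again.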

@lagoan @pgwillia @sfarnel

@lagoan
Contributor

lagoan commented Dec 11, 2024

@anayram @pgwillia @sfarnel

When looking into adding the file names to the SAF, I noticed a potential problem when different Items/Theses in a collection contain files with the same name. The files need to be in the same folder when creating the SAF, so moving them there can overwrite one file with another of the same name.

I generated the report duplicate_filenames_report.txt to help analyse potential file name conflicts.

Please let me know if a different format would simplify this work.
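For reference, this kind of conflict can be detected with a minimal Ruby sketch like the one below. It is not the code behind duplicate_filenames_report.txt. The method name, the item identifiers, and the input shape (item ID mapped to its filenames) are all hypothetical.

```ruby
# Hypothetical input: each item ID maps to the filenames attached to it.
SAMPLE_FILES_BY_ITEM = {
  "item-1" => ["thesis.pdf", "data.csv"],
  "item-2" => ["thesis.pdf"],
  "item-3" => ["appendix.pdf"]
}.freeze

# Return a hash of filename => item IDs for every filename that appears
# in more than one item of the collection.
def duplicate_filenames(files_by_item)
  owners = Hash.new { |hash, name| hash[name] = [] }
  files_by_item.each do |item_id, filenames|
    # uniq so a name repeated within one item is not counted as a conflict
    filenames.uniq.each { |name| owners[name] << item_id }
  end
  owners.select { |_name, item_ids| item_ids.size > 1 }
end

duplicate_filenames(SAMPLE_FILES_BY_ITEM).each do |name, item_ids|
  puts "#{name}: #{item_ids.join(', ')}"
end
```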

@lagoan
Contributor

lagoan commented Dec 17, 2024

@anayram

This is the schema definition for the users model. This can be used to create the mapping to export the users:

  create_table "users", force: :cascade do |t|
    t.string "email", null: false
    t.string "name", null: false
    t.boolean "admin", default: false, null: false
    t.integer "sign_in_count", default: 0, null: false
    t.datetime "last_sign_in_at"
    t.string "last_sign_in_ip"
    t.datetime "created_at", null: false
    t.datetime "updated_at", null: false
    t.boolean "suspended", default: false, null: false
    t.datetime "previous_sign_in_at"
    t.string "previous_sign_in_ip"
    t.datetime "last_seen_at"
    t.string "last_seen_ip"
    t.string "api_key_digest"
    t.boolean "system", default: false, null: false
    t.index ["email"], name: "index_users_on_email", unique: true
  end

@pgwillia
Member Author

Related to #3662

@anayram
Member

anayram commented Dec 17, 2024

To generate SAFs, we could define migration-order criteria that isolate groups of resources by resource type (items, then theses), by visibility, and by other indicators arising from open questions, such as pending mapping issues or SAF requirements (e.g. the duplicate filenames issue).

Possible selection criteria are listed below. I included what I found through the metadata mapping work, but there may be other criteria I am not aware of, @pgwillia @lagoan.

  1. Start with Item-only collections (no theses) - this mapping work is ready
  2. Collections with no duplicate filenames
  3. By visibility
    1. Collections with only public items (31,555)
    2. embargo (119)
    3. authenticated (14,081) - for now we will assign file permissions to a temporary group called "Authenticated", open to admins only, so that the SAML authentication work does not hold up the migration
    4. draft or private (it looks like we have none at the moment)
  4. Collections in order of storage needs (smaller to larger)
  5. Theses-only collections - mapping work to be completed
  6. Theses by visibility
    1. Public (35,222)
    2. Embargo (256) - this may require some manual work if we can't set up embargo via SAF
  7. Collections with mixed resources (theses and items). Are there any?
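One way to apply criteria like these is a multi-key sort, sketched below in Ruby. The `Collection` struct, its fields, and the sample data are hypothetical. Treating the criteria as a strict priority order (item-only first, then no duplicate filenames, then visibility, then size) and giving each collection a single visibility value are assumptions, not something decided in this thread.

```ruby
# Hypothetical record describing one collection to migrate.
Collection = Struct.new(:title, :has_theses, :has_duplicate_filenames,
                        :visibility, :size_bytes, keyword_init: true)

# Assumed visibility order, following the list above.
VISIBILITY_ORDER = { public: 0, embargo: 1, authenticated: 2, draft: 3 }.freeze

# Sort collections so that earlier entries are the simpler ones to migrate.
# Booleans are mapped to integers because true and false are not comparable.
def migration_order(collections)
  collections.sort_by do |collection|
    [
      collection.has_theses ? 1 : 0,
      collection.has_duplicate_filenames ? 1 : 0,
      VISIBILITY_ORDER.fetch(collection.visibility),
      collection.size_bytes
    ]
  end
end

SAMPLE_COLLECTIONS = [
  Collection.new(title: "Theses", has_theses: true, has_duplicate_filenames: false, visibility: :public, size_bytes: 300),
  Collection.new(title: "Large public items", has_theses: false, has_duplicate_filenames: false, visibility: :public, size_bytes: 200),
  Collection.new(title: "Small public items", has_theses: false, has_duplicate_filenames: false, visibility: :public, size_bytes: 100),
  Collection.new(title: "Authenticated items", has_theses: false, has_duplicate_filenames: false, visibility: :authenticated, size_bytes: 50),
  Collection.new(title: "Items with duplicates", has_theses: false, has_duplicate_filenames: true, visibility: :public, size_bytes: 10)
].freeze

puts migration_order(SAMPLE_COLLECTIONS).map(&:title)
```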

@pgwillia suggested we use a spreadsheet to track collection SAFs (from our Nov 26 discussion with Scholaris).
[image]

@anayram
Member

anayram commented Jan 2, 2025

@lagoan @pgwillia when it is time, would it be possible to generate the first SAF package for the following test collection? It includes two embargoed test resources. I can include other resources if useful.

The current items are meant to test all available metadata that can be created from the UI forms as well as file order. Hopefully we can test thumbnails as well at some point.

https://era.library.ualberta.ca/communities/60f42a7e-88bf-4e68-81e9-0bfcd8d587ba/collections/be3f7b49-889a-4af6-9b30-812f89084a86

@lagoan
Contributor

lagoan commented Jan 2, 2025

Certainly @anayram, I can create that SAF package tomorrow.
