NIH AIM:1 YR:2 TASK:1.1.1 | 2.1.1 | Test and apply metadata registry for large datasets and integration with research computing for a few NIH-funded projects, piloting the cost recovery model #36

Closed
mreekie opened this issue Apr 10, 2023 · 4 comments
Labels
  • Dataverse Project (Issues related to Dataverse Project software)
  • GREI 5 (Use Cases)
  • Harvard Dataverse (Issues related to Harvard Dataverse Repository)
  • pm.GREI (https://docs.google.com/document/d/1RdifpHJDFqx8Y8-Dsv_VnnTgezjNHKpSyRei4cw3C-k/edit?usp=sharing)
  • pm.GREI-d-2.1.1 (NIH, yr2, aim1, task1: Metadata reg lg datasets integr w rsrch comp NIH-fund cost recvry mdl)
  • Project: NIH GREI (Tasks related to the NIH GREI project)

Comments

@mreekie
Collaborator

mreekie commented Apr 10, 2023

Planning continues around supporting the Globus endpoint for Dataverse at the Northeast Storage Exchange (NESE) and moving beyond the MVP. The MVP enables a connection from Harvard Dataverse to the Globus endpoint and storage, but it does not yet support real-time browsing of large files because of specific technological characteristics of tape support. The technical plan for this last step is anchored in issue 9123. This activity will be performed in the first half of year 2, as the necessary development resources have been identified.


@mreekie mreekie added pm.GREI https://docs.google.com/document/d/1RdifpHJDFqx8Y8-Dsv_VnnTgezjNHKpSyRei4cw3C-k/edit?usp=sharing pm.GREI-d-2.1.1 NIH, yr2, aim1, task1: Metadata reg lg datasets integr w rsrch comp NIH-fund cost recvry mdl labels Apr 10, 2023
@mreekie
Collaborator Author

mreekie commented Apr 10, 2023

This issue represents a deliverable funded by the NIH.
This deliverable supports the NIH Initiative to Improve Access to NIH-funded Data.

Aim 1: Support the sharing of very large datasets (>TBs) by integrating the metadata in the repository with the data in the research computing storage

An increasing number of research studies deal with very large datasets (>TB to PBs). When the study is completed or ready to be distributed, it is not always feasible or desirable to deposit the data in the repository. Instead, in this project we propose to publish the metadata to the repository for discoverability of the study and to access the data remotely from the research computing cluster or cloud storage. In this scenario, the data does not need to be downloaded to the user’s computer but can be viewed, explored, and analyzed directly in the research computing environment.

The Harvard Dataverse repository will leverage the Northeast Storage Exchange (NESE) and the New England Research Cloud (NERC) to provide storage and compute for these very large datasets, so that they can be found and accessed through the repository while the metadata and data stay connected via a persistent link. These two services, NESE and NERC, are large-scale multi-institutional infrastructure components of the Massachusetts Green High Performance Computing Center (MGHPCC), a five-member public-private partnership between Boston University, Harvard University, Massachusetts Institute of Technology, Northeastern University, and the University of Massachusetts. MGHPCC is a $90 million facility that has the capacity to grow up to 768 racks, 15 MW of power, and 1 terabit of network capacity in the current 95,000 sq. ft. data center.

One of the key integration points to support large data transfers is to incorporate Globus endpoints. Globus is a distributed data transfer technology developed at the University of Chicago that is becoming ubiquitous for research computing services. This will allow the realistic transfer of TBs of data in less than an hour. Globus will also be a front end of NESE Tape, a 100+ PB tape library within MGHPCC.

The integration of the repository with research computing is one of the components of a Data Commons that will facilitate collaboration, dissemination, preservation, and validation of data-centric research.
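
As a rough check on the "TBs of data in less than an hour" figure above, the sustained throughput such a transfer needs can be worked out directly. The sketch below assumes decimal terabytes (10^12 bytes) and ignores protocol overhead, storage speed, and tape staging; the function name `required_gbps` is illustrative and is not part of Globus or Dataverse.

```python
def required_gbps(terabytes: float, hours: float = 1.0) -> float:
    """Sustained throughput in Gbit/s needed to move `terabytes` in `hours`.

    Assumes decimal units: 1 TB = 10**12 bytes, 1 Gbit = 10**9 bits.
    """
    bits = terabytes * 1e12 * 8   # total payload in bits
    seconds = hours * 3600        # transfer window in seconds
    return bits / seconds / 1e9


if __name__ == "__main__":
    # Example volumes chosen for illustration only.
    for tb in (1, 5, 10):
        print(f"{tb} TB in 1 h needs about {required_gbps(tb):.1f} Gbit/s sustained")
```

Under these assumptions, moving 1 TB in an hour needs a little over 2 Gbit/s sustained and 10 TB needs a little over 22 Gbit/s, both well below the quoted network capacity if that figure is read as 1 Tbit/s.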

Related Deliverables:
2.1.1 | 1 | Test and apply metadata registry for large datasets and integration with research computing for a few NIH-funded projects, piloting the cost recovery model | 10
4.1.1 | 1 | Assess cost recovery model | 10
4.1.2 | 1 | Improve metadata registry UX/UI and integration with remote storage and computation based on user feedback from the previous years | 10


This work also represents a deliverable funded internally.

Harvard Data Commons MVP: Objective 1
Objective: Integrate Harvard Research Computing environments and Harvard repositories to facilitate publishing data and/or metadata throughout the research project lifecycle
Publish datasets, w/ data and metadata (GBs size)
Publish metadata only and reference data in RC (TBs size)

  • Work package 1: Review and assess an existing open-source Globus connector tool
  • Work package 2: Implement the rest of the connector tool to support use cases A and B
  • Work package 3: Extend Dataverse UI to support the connector tool in a user-friendly way
  • Work package 4: Beta test with real data and users

This picture shows how the Harvard Data Commons work maps to Dataverse work.

Image

This is a closer look at the Harvard Data Commons work: GDCC DataCommons Objective 1 Task Tracking

  • This is not a public link

@mreekie
Collaborator Author

mreekie commented Apr 10, 2023

March update:

The closing update for 1.1.1 in February 2023 pretty much identifies where we're going to start the work on this for year 2. I put that into the description.

(2.1.1) Planning continues around supporting the Globus endpoint for Dataverse at the Northeast Storage Exchange (NESE) and moving beyond the MVP. The MVP enables a connection from Harvard Dataverse to the Globus endpoint and storage, but it does not yet support real-time browsing of large files because of specific technological characteristics of tape support. The technical plan for this last step is anchored in issue 9123. This activity will be performed in the first half of year 2, as the necessary development resources have been identified.

@mreekie mreekie changed the title from "NIH AIM:1 YR:2 TASK:1 | 2.1.1 | Test and apply metadata registry for large datasets and integration with research computing for a few NIH-funded projects, piloting the cost recovery model" to "NIH AIM:1 YR:2 TASK:1B | 2.1.1B | Test and apply metadata registry for large datasets and integration with research computing for a few NIH-funded projects, piloting the cost recovery model" Apr 17, 2023
@cmbz cmbz changed the title from "NIH AIM:1 YR:2 TASK:1B | 2.1.1B | Test and apply metadata registry for large datasets and integration with research computing for a few NIH-funded projects, piloting the cost recovery model" to "NIH AIM:1 YR:2 TASK:1.1.1 | 2.1.1 | Test and apply metadata registry for large datasets and integration with research computing for a few NIH-funded projects, piloting the cost recovery model" May 2, 2023
@cmbz
Contributor

cmbz commented May 2, 2023

  • Updated title from "NIH AIM:1 YR:2 TASK:1B | 2.1.1B | Test and apply metadata registry for large datasets and integration with research computing for a few NIH-funded projects, piloting the cost recovery model" to "NIH AIM:1 YR:2 TASK:1.1.1 | 2.1.1 | Test and apply metadata registry for large datasets and integration with research computing for a few NIH-funded projects, piloting the cost recovery model"

@cmbz cmbz added GREI 5 Use Cases Harvard Dataverse Issues related to Harvard Dataverse Repository Dataverse Project Issues related to Dataverse Project software Project: NIH GREI Tasks related to the NIH GREI project labels Dec 22, 2023
@cmbz
Contributor

cmbz commented Jan 3, 2024

@cmbz cmbz closed this as completed Jan 3, 2024