recover-parquet-internal

Purpose

This repository hosts code for the pipeline that indexes files from S3 storage in a Synapse project/folder using the STS process. The code is written in R and offers two methods for running the pipeline:

Executing the code in an R environment, e.g., RStudio in an EC2 instance (see Method 1: Running in an R Environment)
Running the pipeline using a Docker container for easy setup and portability (see Method 2: Running with Docker)

Prerequisites

Synapse account with relevant access permissions
Synapse authentication tokens

A Synapse authentication token is required for use of the Synapse APIs and CLI client. For help with Synapse, Synapse APIs, Synapse authentication tokens, etc., please refer to the Synapse documentation.

Installation & Usage

Method 1: Running in an R Environment

Setup your environment

Provision an EC2 Rstudio Notebook instance
Upgrade pip in the terminal of the EC2 instance with python -m pip install --upgrade pip
Install the Synapse CLI client in the terminal with pip install synapseclient

Note: If you are having issues during installation of the Synapse CLI client, consider upgrading pip with python -m pip install --upgrade pip before attempting to reinstall synapseclient. If you still have issues, force re-install synapseclient from the terminal via pip install --upgrade --force-reinstall synapseclient.

Create a .Renviron file in the home folder with the environment variable SYNAPSE_AUTH_TOKEN='<personal-access-token>'. Replace <personal-access-token> with your actual token. Your personal access token should have View, Modify and Download permissions. If you don't have a Synapse personal access token, refer to the instructiocs here to get a new token: Personal Access Token in Synapse.

Setup SSM Access to the S3 bucket

Please follow the instructions to Create a Synapse personal access token (for AWS SSM Access)
Please follow instructions 3-5 to set up SSM Access to an Instance. (Note: AWS CLI version that is installed on the EC2 offering is ver 2.x)

Clone the repo and install required R libraries

Clone this repository and switch to the new project
Modify the parameters in sts_params_internal.R
Run install_requirements.R
Start a new R session (type q() in the R console)

Run the ingress pipeline

Run sts_synindex_internal.R
Set up a job to run sts_synindex_internal.R at your required frequency

Method 2: Running with Docker

Pull the docker image with docker pull ghcr.io/sage-bionetworks/recover-parquet-internal:main
Run a container with docker run -e AWS_SYNAPSE_TOKEN=<aws-cli-token> -e SYNAPSE_AUTH_TOKEN=<synapse-auth-token> <image-name>
If desired, setup a scheduled job (AWS Scheduled Jobs, cron, etc.) using the docker image (ghcr.io/sage-bionetworks/recover-parquet-internal:main) to run the pipeline at your desired frequency

Note: Replace <aws-cli-token> and <synapse-auth-token> with the actual token values. When provisioning a Scheduled Job, <aws-cli-token> and <synapse-auth-token> should be specified in the Secrets and/or EnvVars fields of the provisioning settings page.

Name		Name	Last commit message	Last commit date
Latest commit History 183 Commits
.github/workflows		.github/workflows
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
install_requirements.R		install_requirements.R
misc.R		misc.R
sts_params_internal.R		sts_params_internal.R
sts_synindex_internal.R		sts_synindex_internal.R

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

recover-parquet-internal

Purpose

Prerequisites

Installation & Usage

Method 1: Running in an R Environment

Setup your environment

Setup SSM Access to the S3 bucket

Clone the repo and install required R libraries

Run the ingress pipeline

Method 2: Running with Docker

About

Releases

Packages

Contributors 2

Languages

Sage-Bionetworks/recover-parquet-internal

Folders and files

Latest commit

History

Repository files navigation

recover-parquet-internal

Purpose

Prerequisites

Installation & Usage

Method 1: Running in an R Environment

Setup your environment

Setup SSM Access to the S3 bucket

Clone the repo and install required R libraries

Run the ingress pipeline

Method 2: Running with Docker

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages