CLI, config, and resources for running 10X Genomics pipelines
Configuration is kept in a YAML file whose path is set in the environment variable "TENX_CONFIG_FILE". These values are filled in from the Google deployment YAML, but would need to be provided otherwise. Here are the known config keys; not all keys are always necessary. Configs used for deployments are in resources/config.
- TENX_DATA_PATH: Base path of the local data, ex: /mnt/disks/data
- TENX_REMOTE_URL: GCP base URL of the data
- TENX_NOTIFICATIONS_SLACK: Slack URL for posting notifications
- TENX_SUPERNOVA_SOFTWARE_URL: URL of the supernova tgz to install
- TENX_MACHINE_TYPE: GCP machine type for supernova
- TENX_ASM_PARAMS: Additional params for the supernova run command
- TENX_CROMWELL_PATH: Path for cromwell installation, default is /app/cromwell
- TENX_CROMWELL_VERSION: Cromwell version. Use >= 53
- TENX_LONGRANGER_SOFTWARE_URL: URL for the longranger tgz to install
- TENX_REMOTE_REFS_URL: URL for the longranger references
- TENX_ALN_MODE: longranger aligner mode
- TENX_ALN_CORES: longranger aligner cores to use
- TENX_ALN_MEM: longranger aligner mem to use
- TENX_ALN_VCMODE: longranger variant caller [ex: freebayes]
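For reference, a minimal config file might look like the sketch below. All values are illustrative; only the keys your workflow uses need to be set.

TENX_DATA_PATH: /mnt/disks/data
TENX_REMOTE_URL: gs://my-linked-reads-bucket
TENX_NOTIFICATIONS_SLACK: https://hooks.slack.com/services/MY/SLACK/HOOK
TENX_MACHINE_TYPE: n1-highmem-64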
The data structure is important for successful runs of alignment and assembly. The base paths are kept in the above config and can be local or remote. The local path is needed for assembly and alignment. Reads need to be uploaded to the base-path/sample/reads URL.
base-path/
  sample/
    assembly/
    alignment/
    reads/
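For example, reads for a sample could be uploaded with gsutil (bucket and sample name are illustrative):

$ gsutil -m cp /mnt/disks/data/SAMPLE1/reads/*.fastq.gz gs://my-linked-reads-bucket/SAMPLE1/reads/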
The 10X de novo assembler.
Property | Required | Recommended |
---|---|---|
Cores | 32 | 64 |
Mem | 256+ GB | 400+ GB |
Disk | 2 TB | 3 TB |
GCP Machine recommended: n1-highmem-64
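To confirm the recommended machine type is available in the target zone (zone shown is illustrative):

$ gcloud compute machine-types describe n1-highmem-64 --zone us-central1-a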
These properties need to be set in the YAML configuration (resources/google/supernova.yaml). Check supernova.jinja.schema for documentation of the supernova properties.
Property | Notes |
---|---|
service_account | service account email to have authorized on the supernova VM |
region/zone | area to run instances, should match data location region/zone |
remote_data_url | bucket location of reads, software, and assemblies |
supernova_software_url | supernova software TGZ URL (gs://) to download and untar |
Property | Notes |
---|---|
project_name | project name label to add to instances, useful for accounting |
node_count | number of compute instances to spin up. It is recommended to only run one supernova assemble per instance |
notification | slack url to post message (see making a slack app) |
ssh_source_ranges | whitelist of IP ranges to allow SSH access to supernova compute instance |
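A filled-in supernova.yaml might look like the sketch below. All values are illustrative, and the resource layout is an assumption based on typical Deployment Manager configs; consult supernova.jinja.schema for the authoritative property list.

imports:
- path: supernova.jinja
resources:
- name: supernova
  type: supernova.jinja
  properties:
    service_account: deployer@my-project.iam.gserviceaccount.com
    region: us-central1
    zone: us-central1-a
    remote_data_url: gs://my-linked-reads-bucket
    supernova_software_url: gs://my-linked-reads-bucket/software/supernova.tar.gz
    project_name: linked-reads-pilot
    node_count: 1
    notification: https://hooks.slack.com/services/MY/SLACK/HOOK
    ssh_source_ranges:
    - 203.0.113.0/24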
In an authenticated GCP session, enter the resources/google directory. Run the command below to create a deployment named supernova01. The deployment name will be prepended to all associated assets. Use a different deployment name as needed.
$ gcloud deployment-manager deployments create supernova01 --config supernova.yaml
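Once created, the deployment and its resources can be inspected with:

$ gcloud deployment-manager deployments describe supernova01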
This is the list of assets created in the deployment. All assets are prepended with the deployment name and a '-'. The compute instances will have a number appended to them. The number of compute instances depends on the node_count in the deployment YAML. It is recommended to only run one supernova assembly per compute instance.
Name | Type | Purpose |
---|---|---|
supernova01-1 (to node_count) | compute.v1.instance | the supernova compute instances, run supernova here |
supernova01-network | compute.v1.network | network for compute instance and firewalls |
supernova01-network-subnet | compute.v1.subnetwork | subnet for compute instance and firewalls |
supernova01-network-tenx-ssh-restricted | compute.v1.firewall | firewall of whitelisted IPs for SSH |
supernova01-network-tenx-web-ui | compute.v1.firewall | firewall to allow access to the 10X web UI |
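The created compute instances can be listed by filtering on the deployment name prefix:

$ gcloud compute instances list --filter="name~^supernova01-"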
SSH into the supernova01-1 compute instance.
$ gcloud compute ssh supernova01-1
Create a tmux session. This will allow the pipeline command to persist after logging out of the supernova instance. A name can be provided for the session.
[you@supernova01-1 ~]$ tmux new -s ${SAMPLE_NAME}
Inside the tmux session, run the supernova pipeline using the tenx CLI, providing a sample name. The pipeline expects reads to be in ${REMOTE_DATA_URL}/${SAMPLE_NAME}/reads and will put the resulting assembly and outputs into ${REMOTE_DATA_URL}/${SAMPLE_NAME}/assembly. Use tee to print STDOUT/STDERR while redirecting this output to a file.
[you@supernova01-1 ~]$ tenx asm pipeline ${SAMPLE_NAME} 2>&1 | tee ${SAMPLE_NAME}.log
Detach from the tmux session using Ctrl-B then D to preserve it, then log out of the supernova instance. The tmux session will persist.
To re-attach to the tmux session:
$ gcloud compute ssh supernova01-1
[you@supernova01-1 ~]$ tmux attach -t ${SAMPLE_NAME}
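If the session name is not known, list the open sessions first:

[you@supernova01-1 ~]$ tmux ls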
The 10X aligner suite.
- 8-core Intel or AMD processor per node
- 6 GB RAM per core
- CentOS >= 6
- NFS with 2 TB free disk space
View loupe files created by the longranger WGS pipeline.
- Cores: 2
- Mem: 8G+
- Disk: 32G+ (loupe files are ~4G each)
There is a docker container (ebelter/tenx:latest) to use to interact between the REMOTE and LOCAL data paths. This image does not have supernova or longranger installed. It is intended for uploading/downloading reads and assemblies, and includes the gcloud and gsutil commands.
To use the tenx CLI and the GCP commands, start an interactive docker session:
$ bsub -q docker-interactive -a 'docker(ebelter/tenx:latest)' /bin/bash
Check the config...
$ gcloud config list
If needed, re-authenticate with GCP:
$ gcloud init
Then use the tenx CLI and GCP commands. Jobs can also be submitted to the LSF cluster. This command lists all the remote samples:
$ bsub -q research-hpc -a 'docker(ebelter/tenx:latest)' tenx list
The TenX CLI uses a configuration file to retrieve data locations, both local and remote. These are base directories/URLs, and will have sample names as subdirectories. These sample directories then may have subdirectories of alignment, assembly, and reads. There is more detail about the config file and data structure above.
Create a config file (YAML format) to hold the local MGI disk location and the remote GCP bucket. Create the file in a location on a mounted disk.
$ cd /mnt/disk/data # wherever...
$ vim tenx.yaml # use editor and file name of your liking
Then add these lines, changing the locations to your values.
TENX_DATA_PATH: /mnt/disk/data
TENX_REMOTE_URL: gs://mgi-rg-linked-reads-ccdg-pilot
Set in the environment...
$ TENX_CONFIG_FILE=/mnt/disk/data/tenx.yaml; export TENX_CONFIG_FILE
Use in the CLI...
$ TENX_CONFIG_FILE=/mnt/disk/data/tenx.yaml tenx asm download <SAMPLE_NAME>
There are commands to upload and download assemblies. Set up the TenX config file described above to use with the following commands.
Get an interactive session and set up the environment. You should use the ebelter/tenx:latest docker image.
$ LSF_DOCKER_PRESERVE_ENVIRONMENT=false bsub -Is -q docker-interactive -a 'docker(ebelter/tenx:latest)' /bin/bash
$ TENX_CONFIG_FILE=/mnt/disk/data/tenx.yaml; export TENX_CONFIG_FILE
You will also need to authenticate with GCP. Then run or submit downloads...
$ tenx asm download <SAMPLE_NAME>
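Downloads can also be submitted to LSF rather than run interactively, following the pattern of the earlier tenx list example (queue and image as before; the TENX_CONFIG_FILE environment variable must be available to the job):

$ bsub -q research-hpc -a 'docker(ebelter/tenx:latest)' tenx asm download <SAMPLE_NAME>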