Add ability to start pipeline from cell ranger output, edit documentation for this new input, change pipeline rules to keep condition cell ranger input on the same line as previous input
chenv3 committed Jan 8, 2025
1 parent 4c0a949 commit d2177a9
Showing 10 changed files with 177 additions and 97 deletions.
20 changes: 14 additions & 6 deletions cell-seek
@@ -324,17 +324,25 @@ def parsed_arguments(name, description):
{3}{4}Description:{5}
To run the cell-seek pipeline with your data raw data, please
provide a space seperated list of FastQ (globbing is supported) and an output
provide a space separated list of FastQ (globbing is supported) and an output
directory to store results.
{3}{4}Required arguments:{5}
--input INPUT [INPUT ...]
Input FastQ file(s) to process. The pipeline does NOT
support single-end data. FastQ files for one or more
samples can be provided. Multiple input FastQ files
should be seperated by a space. Globbing for multiple
file is also supported.
Input FastQ file(s) or Cell Ranger output folders to
process. The pipeline does NOT support single-end data.
FastQ files for one or more samples can be provided.
Multiple input FastQ files per sample can be provided.
Multiple input FastQ files should be separated by a
space.
Cell Ranger output folders can be provided. It is
expected that the outs folder is contained within the
Cell Ranger output folders.
Globbing for multiple files/folders is also supported.
FastQ Input:
Example: --input .tests/*.R?.fastq.gz
Cell Ranger Input:
Example: --input .tests/*/
--output OUTPUT
Path to an output directory. This location is where
the pipeline will create all of its output files, also
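The usage string `--input INPUT [INPUT ...]` is the form argparse prints for an option declared with `nargs="+"`. The following is a minimal sketch of such a parser, assuming a hypothetical `build_parser` helper and program name; it is not the actual `parsed_arguments` function from `cell-seek`. It also assumes the shell expands globs such as `.tests/*/` before the program sees them, so the parser receives a plain list of paths.

```python
import argparse


def build_parser():
    """Hypothetical parser illustrating a one-or-more --input option."""
    parser = argparse.ArgumentParser(prog="cell-seek-sketch")
    parser.add_argument(
        "--input",
        nargs="+",          # one or more space separated values
        required=True,
        help="FastQ file(s) or Cell Ranger output folder(s)",
    )
    parser.add_argument("--output", required=True, help="Output directory")
    return parser


if __name__ == "__main__":
    # The shell would normally expand a glob into this list of paths.
    demo = ["--input", "s1.R1.fastq.gz", "s1.R2.fastq.gz", "--output", "results"]
    args = build_parser().parse_args(demo)
    print(args.input, args.output)
```

Because every path arrives in one list, the caller can decide afterwards whether the entries are files or folders.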
93 changes: 67 additions & 26 deletions docs/usage/run.md
@@ -39,12 +39,18 @@ The following is a breakdown of the required and optional arguments for each of
Each of the following arguments are required. Failure to provide a required argument will result in a non-zero exit-code.

`--input INPUT [INPUT ...]`
> **Input FastQ file(s).**
> *type: file(s)*
> **Input FastQ file(s) or Cell Ranger folder(s).**
> *type: file(s) or folder(s)*
>
> One or more FastQ files can be provided. The pipeline does NOT support single-end data. From the command-line, each input file should seperated by a space. Globbing is supported! This makes selecting FastQ files easy. Input FastQ files should always be gzipp-ed.
> FastQ Input: One or more FastQ files can be provided. The pipeline does NOT support single-end data. From the command-line, each input file should be separated by a space. Multiple input FastQ files per sample can be provided. Globbing is supported! This makes selecting FastQ files easy. Input FastQ files should always be gzipped.
>
> ***Example:*** `--input .tests/*.R?.fastq.gz`
>
>
> Cell Ranger Input: One or more Cell Ranger output folders can be provided. Each folder is expected to contain its outs folder and to keep the normal Cell Ranger output folder structure. Globbing is supported!
>
> ***Example:*** `--input .tests/*/`

---
`--output OUTPUT`
@@ -219,12 +225,17 @@ Each of the following arguments are optional, and do not need to be provided.
Each of the following arguments are required. Failure to provide a required argument will result in a non-zero exit-code.
`--input INPUT [INPUT ...]`
> **Input FastQ file(s).**
> *type: file(s)*
> **Input FastQ file(s) or Cell Ranger folder(s).**
> *type: file(s) or folder(s)*
>
> One or more FastQ files can be provided. The pipeline does NOT support single-end data. From the command-line, each input file should seperated by a space. Globbing is supported! This makes selecting FastQ files easy. Input FastQ files should always be gzipp-ed.
> FastQ Input: One or more FastQ files can be provided. The pipeline does NOT support single-end data. From the command-line, each input file should be separated by a space. Multiple input FastQ files per sample can be provided. Globbing is supported! This makes selecting FastQ files easy. Input FastQ files should always be gzipped.
>
> ***Example:*** `--input .tests/*.R?.fastq.gz`
>
>
> Cell Ranger Input: One or more Cell Ranger output folders can be provided. Each folder is expected to contain its outs folder and to keep the normal Cell Ranger output folder structure. Globbing is supported!
>
> ***Example:*** `--input .tests/*/`
---
`--output OUTPUT`
@@ -300,12 +311,17 @@ Each of the following arguments are required. Failure to provide a required argu
Each of the following arguments are required. Failure to provide a required argument will result in a non-zero exit-code.
`--input INPUT [INPUT ...]`
> **Input FastQ file(s).**
> *type: file(s)*
> **Input FastQ file(s) or Cell Ranger folder(s).**
> *type: file(s) or folder(s)*
>
> One or more FastQ files can be provided. The pipeline does NOT support single-end data. From the command-line, each input file should seperated by a space. Globbing is supported! This makes selecting FastQ files easy. Input FastQ files should always be gzipp-ed.
> FastQ Input: One or more FastQ files can be provided. The pipeline does NOT support single-end data. From the command-line, each input file should be separated by a space. Multiple input FastQ files per sample can be provided. Globbing is supported! This makes selecting FastQ files easy. Input FastQ files should always be gzipped.
>
> ***Example:*** `--input .tests/*.R?.fastq.gz`
>
>
> Cell Ranger Input: One or more Cell Ranger output folders can be provided. Each folder is expected to contain its outs folder and to keep the normal Cell Ranger output folder structure. Globbing is supported!
>
> ***Example:*** `--input .tests/*/`
---
`--output OUTPUT`
@@ -347,7 +363,11 @@ Each of the following arguments are required. Failure to provide a required argu
>
> ***Example:*** `--cellranger 7.1.0`
---
#### 2.3.2 Conditionally Required Arguments
The following arguments are only required when FastQ files are used as input. They are not required when Cell Ranger output folders are used as input.
`--libraries LIBRARIES`
> **Libraries file.**
> *type: file*
@@ -407,7 +427,7 @@ Each of the following arguments are required. Failure to provide a required argu
>
> ***Example:*** `--features features.csv`
#### 2.3.2 Analysis Options
#### 2.3.3 Analysis Options
`--exclude-introns`
> **Exclude introns from the count alignment.**
@@ -458,13 +478,18 @@ There are multiple different combinations of library types that may result in th
Each of the following arguments are required. Failure to provide a required argument will result in a non-zero exit-code.
`--input INPUT [INPUT ...]`
> **Input FastQ file(s).**
> *type: file(s)*
`--input INPUT [INPUT ...]`
> **Input FastQ file(s) or Cell Ranger folder(s).**
> *type: file(s) or folder(s)*
>
> One or more FastQ files can be provided. The pipeline does NOT support single-end data. From the command-line, each input file should seperated by a space. Globbing is supported! This makes selecting FastQ files easy. Input FastQ files should always be gzipp-ed.
> FastQ Input: One or more FastQ files can be provided. The pipeline does NOT support single-end data. From the command-line, each input file should be separated by a space. Multiple input FastQ files per sample can be provided. Globbing is supported! This makes selecting FastQ files easy. Input FastQ files should always be gzipped.
>
> ***Example:*** `--input .tests/*.R?.fastq.gz`
>
>
> Cell Ranger Input: One or more Cell Ranger output folders can be provided. Each folder is expected to contain its outs folder and to keep the normal Cell Ranger output folder structure. Globbing is supported!
>
> ***Example:*** `--input .tests/*/`
---
`--output OUTPUT`
@@ -506,7 +531,10 @@ Each of the following arguments are required. Failure to provide a required argu
>
> ***Example:*** `--cellranger 7.1.0`
---
#### 2.4.2 Conditionally Required Arguments
The following arguments are only required when FastQ files are used as input. They are not required when Cell Ranger output folders are used as input.
`--libraries LIBRARIES`
> **Libraries file.**
> *type: file*
@@ -535,7 +563,7 @@ Each of the following arguments are required. Failure to provide a required argu
>
> ***Example:*** `--libraries libraries.csv`
#### 2.4.2 Analysis Options
#### 2.4.3 Analysis Options
Each of the following arguments are optional, and do not need to be provided.
@@ -682,12 +710,17 @@ Each of the following arguments are optional, and do not need to be provided.
Each of the following arguments are required. Failure to provide a required argument will result in a non-zero exit-code.
`--input INPUT [INPUT ...]`
> **Input FastQ file(s).**
> *type: file(s)*
> **Input FastQ file(s) or Cell Ranger folder(s).**
> *type: file(s) or folder(s)*
>
> One or more FastQ files can be provided. The pipeline does NOT support single-end data. From the command-line, each input file should seperated by a space. Globbing is supported! This makes selecting FastQ files easy. Input FastQ files should always be gzipp-ed.
> FastQ Input: One or more FastQ files can be provided. The pipeline does NOT support single-end data. From the command-line, each input file should be separated by a space. Multiple input FastQ files per sample can be provided. Globbing is supported! This makes selecting FastQ files easy. Input FastQ files should always be gzipped.
>
> ***Example:*** `--input .tests/*.R?.fastq.gz`
>
>
> Cell Ranger Input: One or more Cell Ranger output folders can be provided. Each folder is expected to contain its outs folder and to keep the normal Cell Ranger output folder structure. Globbing is supported!
>
> ***Example:*** `--input .tests/*/`
---
`--output OUTPUT`
@@ -776,13 +809,18 @@ Each of the following arguments are required. Failure to provide a required argu
Each of the following arguments are required. Failure to provide a required argument will result in a non-zero exit-code.
`--input INPUT [INPUT ...]`
> **Input FastQ file(s).**
> *type: file(s)*
`--input INPUT [INPUT ...]`
> **Input FastQ file(s) or Cell Ranger folder(s).**
> *type: file(s) or folder(s)*
>
> One or more FastQ files can be provided. The pipeline does NOT support single-end data. From the command-line, each input file should seperated by a space. Globbing is supported! This makes selecting FastQ files easy. Input FastQ files should always be gzipp-ed.
> FastQ Input: One or more FastQ files can be provided. The pipeline does NOT support single-end data. From the command-line, each input file should be separated by a space. Multiple input FastQ files per sample can be provided. Globbing is supported! This makes selecting FastQ files easy. Input FastQ files should always be gzipped.
>
> ***Example:*** `--input .tests/*.R?.fastq.gz`
>
>
> Cell Ranger Input: One or more Cell Ranger output folders can be provided. Each folder is expected to contain its outs folder and to keep the normal Cell Ranger output folder structure. Globbing is supported!
>
> ***Example:*** `--input .tests/*/`
---
`--output OUTPUT`
@@ -816,7 +854,10 @@ Each of the following arguments are required. Failure to provide a required argu
> ***Example:*** `--genome hg38`
---
#### 2.6.2 Conditionally Required Arguments
The following arguments are only required when FastQ files are used as input. They are not required when Cell Ranger output folders are used as input.
`--libraries LIBRARIES`
> **Libraries file.**
> *type: file*
@@ -842,7 +883,7 @@ Each of the following arguments are required. Failure to provide a required argu
> ***Example:*** `--libraries libraries.csv`
#### 2.6.2 Analysis Options
#### 2.6.3 Analysis Options
The multiome pipeline currently does not have any applicable analysis flags.
70 changes: 60 additions & 10 deletions src/run.py
@@ -79,7 +79,7 @@ def sym_safe(input_data, target, link):
as input. If a symlink already exists, it will not try to create a new symlink.
If relative source PATH is provided, it will be converted to an absolute PATH.
It is currently forcing a link to be created for cellranger output folders, even if the provided
link parameter is False
link parameter is False
@param input_data <list[<str>]>:
List of input files to symlink to target location
@param target <str>:
@@ -90,14 +90,26 @@ def sym_safe(input_data, target, link):
input_fastqs = [] # store renamed fastq file names
for file in input_data:
if os.path.isdir(file): #Checking if provided file is a directory. If so, assumes it is a cellranger outs folder
filename = os.path.join(os.path.basename(os.path.dirname(file)), os.path.basename(file))
file = os.path.dirname(file)
link = True
if os.path.exists(os.path.join(file, 'outs')):
#filename = os.path.join(os.path.basename(os.path.dirname(file)), os.path.basename(file))
filename = os.path.basename(file)
link = True
else:
raise NameError("""\n\tFatal: Provided input '{}' does not match expected format!
Cannot determine if existing folder is a cellranger output folder.
Please check the folder name and structure before trying again.
Here is an example of the expected cellranger output folder structure:
input: sampleName structure: sampleName/outs
""".format(file)
)
else:
filename = os.path.basename(file)
try:
renamed = rename(filename)
renamed = os.path.join(target, renamed)
if not link:
renamed = rename(filename)
renamed = os.path.join(target, renamed)
else:
renamed = os.path.join(target, filename)
except NameError as e:
if not link:
# Don't care about creating the symlinks
@@ -107,11 +119,12 @@ def sym_safe(input_data, target, link):
raise e

input_fastqs.append(renamed)
print(filename, file, renamed)

if not exists(renamed) and link:
# Create a symlink if it does not already exist
# Follow source symlinks to resolve any binding issues
os.symlink(os.path.abspath(os.path.realpath(file)), renamed)
os.symlink(os.path.abspath(os.path.realpath(file)), renamed, target_is_directory=True)

return input_fastqs
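
The folder branch of `sym_safe` above can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not the pipeline's implementation: `link_cellranger_folder` is a hypothetical helper, and the `os.path.normpath` call is an addition that assumes inputs may arrive with a trailing slash (as a glob like `.tests/*/` produces), since `os.path.basename` returns an empty string for such paths.

```python
import os
import tempfile


def link_cellranger_folder(folder, target):
    """Symlink a Cell Ranger output folder into target, keeping its name.

    Hypothetical helper for illustration; not the pipeline's sym_safe().
    """
    # Drop any trailing slash so basename() yields the sample folder name.
    folder = os.path.normpath(folder)
    if not os.path.isdir(os.path.join(folder, "outs")):
        # Mirrors the diff's choice of NameError for an unrecognized input.
        raise NameError(
            "'{0}' does not look like a Cell Ranger output folder "
            "(expected '{0}/outs')".format(folder)
        )
    link = os.path.join(target, os.path.basename(folder))
    if not os.path.lexists(link):
        # Resolve source symlinks first so the link points at the real path.
        os.symlink(os.path.realpath(folder), link, target_is_directory=True)
    return link


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "raw", "sampleA")
        os.makedirs(os.path.join(sample, "outs"))
        work = os.path.join(tmp, "work")
        os.makedirs(work)
        linked = link_cellranger_folder(sample + os.sep, work)
        print(os.path.basename(linked), os.path.isdir(os.path.join(linked, "outs")))
```

The link keeps the sample folder's own name rather than passing it through the FastQ `rename` step, which matches the `else` branch added in the hunk.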

@@ -188,6 +201,9 @@ def setup(sub_args, ifiles, repo_path, output_path):
# of FastQ and BAM files
mixed_inputs(ifiles)

# Check if inputs are folders
folder_inputs(ifiles)

# Resolves PATH to reference file
# template or a user generated
# reference genome built via build
@@ -412,6 +428,39 @@ def mixed_inputs(ifiles):
""".format(" ".join(fq_files), " ".join(bam_files), sys.argv[0])
)

def folder_inputs(ifiles):
"""Check if a user has provided directories as input.
@params ifiles list[<str>]:
List containing pipeline input files (renamed symlinks)
"""
folder_files, file_files = [], []
folders = False
files = False
for file in ifiles:
if os.path.isdir(file):
folders = True
folder_files.append(file)
else:
files = True
file_files.append(file)

if folders and files:
# User provided a mix of folders and files
raise TypeError("""\n\tFatal: Detected a mixture of --input data types.
A mixture of folders and files were provided; however, the pipeline
does NOT support processing a mixture of input FastQ files and
cellranger outputs.
Input Folders:
{}
Input Files:
{}
Please do not run the pipeline with a mixture of files and folders.
This feature is currently not supported within '{}'. If you feel like
this functionality should exist, feel free to open an issue on Github.
""".format(" ".join(folder_files), " ".join(file_files), sys.argv[0])
)
return(folders)
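
The check performed by `folder_inputs` can be sketched as a small standalone example. The names `split_inputs` and `inputs_are_folders` are hypothetical and the error text is shortened; the sketch only shows the same idea of partitioning inputs with `os.path.isdir` and rejecting a mixture.

```python
import os
import tempfile


def split_inputs(paths):
    """Partition input paths into (folders, files) using os.path.isdir."""
    folders = [p for p in paths if os.path.isdir(p)]
    files = [p for p in paths if not os.path.isdir(p)]
    return folders, files


def inputs_are_folders(paths):
    """Return True when the inputs are folders, False when they are files.

    Raises TypeError when folders and files are mixed.
    """
    folders, files = split_inputs(paths)
    if folders and files:
        raise TypeError(
            "Detected a mixture of --input data types.\n"
            "Input Folders: {}\nInput Files: {}".format(
                " ".join(folders), " ".join(files)
            )
        )
    return bool(folders)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(inputs_are_folders([tmp]))
        print(inputs_are_folders(["sample.R1.fastq.gz"]))
```

Returning a boolean lets one function serve both as a validator and as the flag later used to relax the conditional argument checks.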

def add_user_information(config):
"""Adds username and user's home directory to config.
@params config <dict>:
@@ -823,18 +872,19 @@ def check_conditional_parameters(config):
Config dictionary containing metadata to run pipeline
"""
errorMessage = []
input_folders = folder_inputs(config['options']['input'])
#Check if cellranger version is provided when required
if config['options']['pipeline'] in ['gex', 'cite', 'multi'] and config['options']['cellranger'] == '':
errorMessage += [
"Error: Version of cellranger to use is required for {} pipeline\n \
└── Please use the --cellranger flag to select one of the available versions: {}".format(
config['options']['pipeline'],
', '.join(['7.1.0', '7.2.0', '8.0.0'])
', '.join(['7.1.0', '7.2.0', '8.0.0', '9.0.0'])
)
]

#Check if libraries file is provided when required
if config['options']['pipeline'] in ['cite', 'multi', 'multiome'] and config['options']['libraries'] == 'None':
if config['options']['pipeline'] in ['cite', 'multi', 'multiome'] and config['options']['libraries'] == 'None' and not input_folders:
errorMessage += [
"Error: Libraries file is required for {} pipeline\n \
└── Please use the --libraries flag to provide the CSV file with the columns: {}".format(
@@ -844,7 +894,7 @@ def check_conditional_parameters(config):
]

#Check if features file is provided when required
if config['options']['pipeline'] in ['cite'] and config['options']['features'] == 'None':
if config['options']['pipeline'] in ['cite'] and config['options']['features'] == 'None' and not input_folders:
errorMessage += [
"Error: Features file is required for {} pipeline\n \
└── Please use the --features flag to provide the CSV file with the columns: {}".format(
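
The effect of the added `and not input_folders` conditions can be summarized in a short sketch. The function name `missing_conditional_args` and its flat arguments are assumptions made for illustration; the real function reads these values from `config['options']`, where an unset option is the string `'None'`. Only the libraries and features checks are modelled here, because the cellranger version check in the hunk is not gated on the input type.

```python
def missing_conditional_args(pipeline, libraries, features, input_folders):
    """Return the flags that are required but missing.

    Hypothetical helper: libraries and features use the string 'None'
    as the "not provided" sentinel, as in the diff.
    """
    missing = []
    if input_folders:
        # Cell Ranger output folders were given, so neither file is needed.
        return missing
    if pipeline in ("cite", "multi", "multiome") and libraries == "None":
        missing.append("--libraries")
    if pipeline == "cite" and features == "None":
        missing.append("--features")
    return missing


if __name__ == "__main__":
    print(missing_conditional_args("cite", "None", "None", input_folders=False))
    print(missing_conditional_args("cite", "None", "None", input_folders=True))
```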
2 changes: 2 additions & 0 deletions workflow/Snakefile
@@ -42,6 +42,8 @@ filter_file = config['options']['filter'] # Filter threshold file for QC analysi
METADATA_FILE = config['options']['metadata'] # Metadata file for QC analysis (not used in all pipelines)
if 'libraries' in config:
lib_samples = list(config['libraries'].keys()) # Libraries file samples
else:
lib_samples = samples # Handling the situation where cellranger outputs is used as input and no libraries file is provided
pipeline_output = []
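
The fallback added to the Snakefile can be expressed as a small function. This is a sketch under the assumption that `config` is a plain dictionary and `samples` is a list of sample names; `library_samples` is a hypothetical name, not something defined in the workflow.

```python
def library_samples(config, samples):
    """Return sample names from the libraries file, else the input samples."""
    if "libraries" in config:
        return list(config["libraries"].keys())
    # No libraries file, e.g. Cell Ranger output folders were used as input.
    return samples


if __name__ == "__main__":
    print(library_samples({"libraries": {"s1": {}, "s2": {}}}, ["a", "b"]))
    print(library_samples({}, ["a", "b"]))
```

With the fallback in place, rules that iterate over `lib_samples` still have a defined list when no libraries file was supplied.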

