CAGEscan-Clustering - Create CAGEscan clusters from BED12, BAM or SAM pairs
CAGEscan-Clustering.pl -i <input bed12, bam or sam file to be clustered>
Create CAGEscan clusters from paired-end data in BED12, BAM or SAM format. Input data can be sent to STDIN, in which case the format (-f) of the input must be specified, or read from a file (-i), in which case the format can be specified using the -f option; in its absence, the format is guessed from the suffix of the input file. Valid suffixes / formats are 'bed', 'sam', 'bam', or gzipped versions of those formats.
Importantly, if the data is sent in BAM or SAM format, it must be sorted by name.
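The format-resolution rule described above can be sketched as follows. This is only an illustration in Python, not the script's own Perl code; the function name `guess_format` and its error messages are hypothetical.

```python
from typing import Optional

# Formats named in the text (plain and gzipped variants).
VALID_FORMATS = ("bed.gz", "sam.gz", "bam.gz", "bed", "sam", "bam")


def guess_format(path: Optional[str], explicit: Optional[str] = None) -> str:
    """Resolve the input format from an explicit value or a file suffix.

    `path` is None when the data comes from STDIN (hypothetical convention).
    """
    if explicit is not None:
        # An explicit -f style value always wins over the suffix.
        if explicit not in VALID_FORMATS:
            raise ValueError(f"unsupported format: {explicit}")
        return explicit
    if path is None:
        # STDIN input carries no suffix, so the format must be given.
        raise ValueError("format must be given when reading from STDIN")
    for fmt in VALID_FORMATS:
        if path.endswith("." + fmt):
            return fmt
    raise ValueError(f"cannot guess format from suffix: {path}")
```

For instance, `guess_format("pairs.bed.gz")` resolves to `bed.gz`, while calling it with no path and no explicit format raises an error.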
An optional BED6 formatted cluster file (-c) can be used to guide the construction of CAGEscan clusters. Note that only paired-end data whose TSS overlaps at least one cluster will be grouped.
In the absence of such an optional BED6 formatted cluster file, "guiding" clusters are created from the paired-end data themselves using a simple "single-linkage clustering" approach with the distance -d (i.e. clustering together all the data whose TSSs lie within -d base pairs of one another). The result of the TSS single-linkage-based clustering can be stored (BED6) if a file path (-s) is provided; otherwise it is discarded.
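The single-linkage idea described above can be sketched in Python as below. The options section indicates that the script relies on the mergeBed binary for this step, so this is a stand-alone illustration rather than the script's implementation. Two points are assumptions: TSSs are grouped per chromosome and strand, and two neighbouring TSSs are linked when their distance is less than or equal to the threshold.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Tss = Tuple[str, str, int]                # (chrom, strand, 0-based TSS position)
Cluster = Tuple[str, str, int, int, int]  # (chrom, strand, start, end, n_tss)


def single_linkage_tss(tss_list: List[Tss], max_dist: int) -> List[Cluster]:
    """Chain neighbouring TSSs into clusters (assumed rule: gap <= max_dist)."""
    by_group: Dict[Tuple[str, str], List[int]] = defaultdict(list)
    for chrom, strand, pos in tss_list:
        by_group[(chrom, strand)].append(pos)

    clusters: List[Cluster] = []
    for (chrom, strand), positions in sorted(by_group.items()):
        positions.sort()
        start = prev = positions[0]
        count = 1
        for pos in positions[1:]:
            if pos - prev <= max_dist:
                # Close enough to the previous TSS: extend the current cluster.
                prev = pos
                count += 1
            else:
                # Gap too large: close the cluster (half-open end) and restart.
                clusters.append((chrom, strand, start, prev + 1, count))
                start = prev = pos
                count = 1
        clusters.append((chrom, strand, start, prev + 1, count))
    return clusters
```

Because linkage is decided only between adjacent sorted positions, a chain of closely spaced TSSs can yield a cluster much wider than the threshold itself, which is the defining behaviour of single-linkage clustering.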
Output data will be BED12 formatted CAGEscan clusters.
The name of each cluster will correspond to the name of the guiding cluster (BED6 formatted cluster name) or, if clusters are created from the paired-end data, to:
Lx_chr_strand_start_end
with the chr, strand, start and end of the TSS single-linkage derived clusters. The score of each cluster will correspond to the number of input paired-end tags composing the resulting CAGEscan cluster.
The hallmark of the TSS/guiding cluster on the CAGEscan cluster will be encoded by a 3'UTR region with:
The cdsStart will correspond to either the start of the CAGEscan cluster or the end of the guiding cluster (when on the plus strand), or to the start of the guiding cluster or the end of the CAGEscan cluster (when on the minus strand).
Similarly, cdsEnd will correspond to either the start of the guiding cluster or the end of the CAGEscan cluster (when on the plus strand), or to the start of the CAGEscan cluster or the end of the guiding cluster (when on the minus strand).
(instead of de-novo clustering from the paired-end TSS)
Whenever the boundaries of the guiding cluster extend beyond those of the intersected paired-end input data, the reported CAGEscan cluster boundaries will be those of the paired-end data, not those of the guiding cluster. As it can be problematic to integrate multiple runs with various input paired-end data, keep in mind that the cluster name is that of the guide and should be used to match clusters across runs.
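The grouping rules above (membership by TSS overlap, name taken from the guide, score equal to the number of pairs, boundaries taken from the paired-end data) can be illustrated with a minimal Python sketch. It is not the script's implementation, which relies on intersectBed according to the options section. The `Guide` and `Pair` types, the `group_pairs` function and the strand-specific matching are assumptions made for the illustration.

```python
from typing import Dict, List, NamedTuple


class Guide(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str
    strand: str


class Pair(NamedTuple):
    chrom: str
    start: int
    end: int
    strand: str
    tss: int  # 0-based TSS position of the pair


def group_pairs(pairs: List[Pair], guides: List[Guide]) -> Dict[str, dict]:
    """Summarise, per guiding cluster, the pairs whose TSS falls inside it."""
    out: Dict[str, dict] = {}
    for guide in guides:
        members = [
            p for p in pairs
            if p.chrom == guide.chrom
            and p.strand == guide.strand           # assumption: same strand only
            and guide.start <= p.tss < guide.end   # half-open BED interval
        ]
        if not members:
            continue  # guides without any overlapping TSS yield no cluster
        out[guide.name] = {
            "chrom": guide.chrom,
            "strand": guide.strand,
            # Boundaries come from the pairs, not from the guide.
            "start": min(p.start for p in members),
            "end": max(p.end for p in members),
            "score": len(members),
        }
    return out
```

In this sketch a guide spanning 90 to 120 that captures pairs starting at 100 and 110 is reported with a start of 100, which mirrors the boundary rule stated above.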
The rate-limiting step of this script is the preparation of the paired-end input data, which need to be transformed into a BED6 file with start and end corresponding to the 0-based pe_tss position and name corresponding to the semicolon-delimited concatenated full BED12 line, and then (more importantly, and more time-consuming) sorted by TSS position.
This step can be precomputed and skipped by providing the Boolean flag -p, which informs the script that the provided input file/STDIN already corresponds to the precomputed data in content, format and sorting. Also, since CAGEscan clusters are by definition assemblies of reads located on the same chromosome and strand, I recommend for large files to split up the input paired-end reads by chromosome and strand and to precompute the sorted BED6 internal temporary file (with start and end corresponding to the 0-based pe_tss position, name corresponding to the semicolon-delimited concatenated full BED12 line, and sorted by TSS position).
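As an illustration of the precomputed layout described above, the following Python sketch converts BED12 lines into TSS-centred BED6 lines. Several details are assumptions not confirmed by the text: the TSS is taken as the BED12 start on the plus strand and end minus one on the minus strand, the BED6 interval is written as [tss, tss + 1), the score and strand columns are copied from the BED12 line, and sorting is by chromosome and then position. The function name `bed12_to_tss_bed6` is hypothetical, and the output should be checked against a file produced by the script itself before relying on it with -p.

```python
from typing import Iterable, List


def bed12_to_tss_bed6(bed12_lines: Iterable[str]) -> List[str]:
    """Build TSS-sorted BED6 lines whose name field embeds the BED12 record."""
    records = []
    for line in bed12_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        score, strand = fields[4], fields[5]
        # Assumption: the TSS is the 5' end of the pair.
        tss = start if strand == "+" else end - 1
        # Name = the full BED12 line with its fields joined by semicolons.
        name = ";".join(fields)
        records.append((chrom, tss, name, score, strand))

    # Assumption: sort by chromosome, then by TSS position.
    records.sort(key=lambda r: (r[0], r[1]))
    return [
        "\t".join([chrom, str(tss), str(tss + 1), name, score, strand])
        for chrom, tss, name, score, strand in records
    ]
```

Splitting the input by chromosome and strand first, as recommended above, would let each chunk be converted and sorted independently.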
- -h
-
Show this message
- -v
-
Verbose level (1-9)
- -i,--input
-
Input paired-end data file. Can be a BAM, SAM or BED12 file, optionally gzip-compressed. It is recommended to use a sensible suffix so that the format of the file can be detected, although the format can also be specified using the -f option. Note that this parameter is not mandatory, as input data can also be read from STDIN.
- -q,--min_map_quality
-
The minimum "mapping quality value" for including the input data in a CAGEscan cluster. If the input is SAM or BAM, this value corresponds to the sum of the MapQ of both pair members. If the input is BED12, this value is extracted from the score column. This value is inclusive (i.e. "greater than or equal to"). [Default: undefined, i.e. no filtering]
- -f,--format
-
Format of the input (paired-end) data. Valid formats are: bed, sam, bam, bed.gz, sam.gz, or bam.gz. -f is mandatory if the input comes from STDIN but optional if the input (-i) is a file ending with one of the valid formats.
- -c,--cluster_file
-
Path to a BED6 formatted cluster file directing the construction of CAGEscan clusters. Note that only paired-end data whose TSS overlaps at least one cluster will be grouped. If none is provided, the input data itself will be used.
- -d,--denovo_cluster_distance
-
The distance for paired-end data TSS single-linkage clustering. If no BED6 formatted cluster file directing the construction of CAGEscan clusters is provided, guiding clusters are created from the paired-end data themselves using a simple "single-linkage clustering" approach that uses this distance to group neighboring TSSs.
- -s,--stored_denovo_cluster_file
-
The path of the BED6 file where the (TSS) cluster file directing the construction of CAGEscan clusters is to be stored (otherwise it is stored in a temporary file).
- -x,--stored_uprocessed_bam_file
-
The path where pairs from BAM or SAM input that are not properly paired, or that do not fulfil the min_map_quality criterion, can be stored.
- -p,--presorted_bed6_tss_input_file
-
Boolean flag that skips the transformation of the input paired-end data into a BED6 whose start and end correspond to the 0-based pe_tss position, whose name corresponds to the semicolon-delimited concatenated full BED12 line, and which is sorted by TSS position. Importantly, this means that the input file (-i) must conform to the above-mentioned specifications.
- --tmp_bed6_tss_input_file
-
Path to the location where a temporary BED12 formatted file holding the input data might be created. This is in particular the case when "guiding" clusters are created from the paired-end data themselves. [default '/tmp/CAGEscan-Clustering.bed12_formatted_input.PID.tmp']
- --tmp_bed6_cluster_file
-
Path to the location where a temporary BED6 formatted file holding the "guiding" clusters can be created. [default '/tmp/CAGEscan-Clustering.bed6_formatted_induced_cluster.PID.tmp']
- --bin_samtools
-
Path to the samtools binary ("samtools view -Sb ..." used only when input is in SAM format).
- --bin_bamtobed
-
Path to the pairedBamToBed12 binary (a re-engineering of BEDTools bamToBed), which transforms properly paired BAM alignments into a single BED entry (used only when input is in BAM or SAM format).
- --bin_mergebed
-
Path to the BEDTools mergeBed binary (used only when "guiding" clusters must be generated from single-linkage clustering of the paired-end data themselves).
- --bin_intersectbed
-
Path to the BEDTools intersectBed binary (always used).
- --fit_to_guiding_cluster_size
-
Enforce the read length to be at least the length of the guiding cluster.
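To show how the options above combine, here is a small Python helper that assembles an argument list for the script. Only flags documented above are used. The helper name `build_command`, the file names in the usage note, and the assumption that the script is reachable as `CAGEscan-Clustering.pl` are hypothetical. The helper only builds the list and does not run anything.

```python
from typing import List, Optional


def build_command(
    input_path: str,
    fmt: Optional[str] = None,
    cluster_file: Optional[str] = None,
    denovo_distance: Optional[int] = None,
    min_map_quality: Optional[int] = None,
    presorted: bool = False,
    script: str = "CAGEscan-Clustering.pl",  # assumption: script is on PATH
) -> List[str]:
    """Assemble an argument list from the documented options."""
    cmd = [script, "-i", input_path]
    if fmt is not None:
        cmd += ["-f", fmt]                   # optional when the suffix is valid
    if cluster_file is not None:
        cmd += ["-c", cluster_file]          # guiding BED6 clusters
    if denovo_distance is not None:
        cmd += ["-d", str(denovo_distance)]  # de novo single-linkage distance
    if min_map_quality is not None:
        cmd += ["-q", str(min_map_quality)]  # inclusive quality threshold
    if presorted:
        cmd.append("-p")                     # input is already the sorted BED6
    return cmd
```

A call such as `build_command("pairs.bam", cluster_file="guides.bed", min_map_quality=20)` yields a list that could be handed to `subprocess.run`; the two file names are placeholders.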