Fastq REad Name DemultiplexER
Next-generation sequencing projects often pool multiple samples in the same sequencing lane to increase throughput and data-to-cost ratio. Reads from such 'multiplexed' sequencing runs must then be deconvolved in a process known as demultiplexing. Often, this will be done by Illumina's bcl2fastq
tool using a 'sample sheet', which provides associations between samples and their index sequences. This tool produces a fastq
file (or files, if paired-end sequencing was performed) for each sample, as well as an Undetermined
file(s) containing reads that could not be unambiguously assigned to a sample.
Generally, these files are the starting point for subsequent analyses. However, you might desire to inspect these files for various reasons:
- You didn't supply a sample sheet and thus all your reads are in an
Undetermined
file, and you want to demultiplex them yourself - You have read about index hopping and are curious whether this affects your data (if you use combinatorial dual indexes, it does 😉 ), and to what extent
- You use a sequencing service or vendor, and want to double-check whether your files are correctly demultiplexed, or (if not), re-demultiplex them yourself
- Some of your samples are mysteriously missing or have nonsensical data, while other samples appear to be just fine
- You are dissatisfied with the proportion of reads in the
Undetermined
file and want to investigate if you could recover more data (for example, by allowing 1 or 2 mismatches when identifying indexes)
frender
is designed to address these needs, and more!
-
Reverse complement mode: some Illumina machines (e.g. HiSeq 4000) read the reverse complement of Index 2 rather than the forward sequence (due to their chemistry). This can be confusing, but some sequencing providers automatically 'correct' if you submit index 2 sequences in the wrong orientation. This is wonderful when it works, but when it doesn't (e.g. low read number per sample), you may get some very confusing results! To address this,
frender
has a-rc
flag, which tries both the forward and reverse-complement sequence of each index 2, and chooses the orientation that makes the most sense in the context of all the sequence data provided. -
Allowing mismatches:
frender
allows you to specify how many mismatches to tolerate when comparing sequencing data to your sample sheet. This allows you to:- Check if your sequencing provider demultiplexed your data according to your specification
- See if you can recover more data by specifying a larger number of mismatches (since most indexes are 3-4 edits away from each other, this could be reasonable in some circumstances)
- Filter your data more stringently by specifying a smaller number of mismatches
-
Full data reporting:
frender
provides a complete report containing all unique index combinations and their frequency in a.csv
format. This can easily be picked up (e.g. byR
scripts) for further analysis of index hopping, sample swaps, contamination, etc. -
No dependencies (besides the Python 3 standard libraries) to reduce installation headaches
-
Multicore support for fast processing
-
Scan a file, list of files, or directory:
python3 ./frender.py scan -n 1 -rc -b barcode_table.csv input.fastq.gz [input2.fastq.gz] [input3.fastq.gz]
python3 ./frender.py scan -n 1 -rc /path/to/sequencing/directory
Option Explanation -n 1 allow 1 mismatch when trying to match indexes -rc consider the reverse complement as well as the forward sequence of index 2 -b barcode association table, .csv
format -
Demultiplex based on scan results:
python3 ./frender.py demux -r frender-scan-results_1-mismatches_sequencing-directory.csv input.fastq.gz [input2.fastq.gz] [input3.fastq.gz]
python3 ./frender.py demux -r frender-scan-results_1-mismatches_sequencing-directory.csv /path/to/sequencing/directory
Option Explanation -r frender
scan result file (contains information necessary for demultiplexing)
-
-b
(barcode association table or sample sheet)- Must be in csv format or contain csv formatted data after the line
[Data]
(Illumina sample sheets use this format) - This file must contain a header row with column names similar to 'index', 'index2', and 'id' or 'name'. All indexes in a column must be the same length (although index 1 and index 2 can be different lengths); this should correspond to the length of the indexes in the
fastq
files to be scanned. - You must specify a barcode association table unless you supply
frender
with a single directory path that already contains such a file.- If this is the case, the directory will be recursively searched for
.csv
or.txt
files that matchbarcode.*association
orsample.*sheet
(case insensitive). If multiple files are found, the one with the shortest path will be selected. - If you specify a barcode association table with the
-b
option and also supply a directory containing such a file, the explicitly specified file takes precedence
- If this is the case, the directory will be recursively searched for
- Must be in csv format or contain csv formatted data after the line
-
input.fastq.gz
or/path/to/dir
(input file(s)/directory)- If a directory is specified, it will be search recursively for
*.fastq.gz
and*.fq.gz
files. For the purposes of thescan
function, only files matching_R1_
(read 1 files) are used, as the barcodes are identical in both read 1 and read 2.
- If a directory is specified, it will be search recursively for
(default: 1
)
0
: use all available cores0 < c < 1
: use a fraction of available cores (e.g.-c 0.5
would use 4 cores on an 8-core machine)c ≥ 1
: use the specified number of cores
- Each barcode is allowed to have this number of mismatches. At this time you cannot specify a different number of mismatches for index 1 and index 2.
- No default; this parameter must be specified
-
If this flag is set, for each sample,
frender
will count the number of reads that could be assigned to that sample (from the provided fastq files) using the forward (supplied) index 2 sequence, and the number of reads that could be assigned with the reverse complement of that sequence. If the reverse complement would yield more demuxable reads than the forward sequence, that sample is flagged as 'reverse complement'. A report is printed at the command line (and saved to afrender-index-2-calls.csv
file):Sample Name Supplied Index 2 Reads supporting (forward) Reverse complement Index 2 Reads supporting (rev comp) Final call FT-SA49175 ACTTGAAT 3374 ATTCAAGT 0 forward FT-SA49189 ATGATCTG 1 CAGATCAT 1021281 reverse complement
- Optional; if specified, only the first
n
reads in the file (s) will be examined.
- Optional; if specified, this string will be added to the output file name
- Optional; if specified, this string will be removed from the sample id's provided in the barcode association table (for the purpose of calling properly/improperly demuxed)
A data file containing the following metrics for each unique combination of barcodes discovered in the scanned fastq
files:
idx1,idx2,reads,matched_idx1,matched_idx2,read_type,sample_name,demux_ok
GATCAGCG,CTTGTAAT,1000000,GATCAGCG,CTTGTAAT,demuxable,FT-SA49186,False
NTCACGTT,NTCACGAT,59,ATCACGTT,ATCACGAT,index_hop,True
NTCCCGTT,NAGATCAT,59,,,undetermined,True
reads
refers to the total number of reads that possess the specified idx1
and idx2
, summed over all the provided input files.
Possible read_type
s are:
demuxable
: this barcode combination can be uniquely assigned to one sample fileindex_hop
: this barcode combination matches at least one of the supplied index 1s and at least one of the supplied index 2s, but none of the matched indexes are associated with the same sample.ambiguous
: this barcode combination matches the indexes assigned to more than one sampleundetermined
: none of the above conditions are met
If idx1
and idx2
can be unambiguously assigned to a sample in the provided barcode association table, sample_name
is populated accordingly. This field is blank for all other reads.
After analyzing all supplied files and the barcodes found in them, frender
will determine if the barcodes are correctly distributed among the provided files according to the following rules:
Index-hop
andambiguous
reads are found only in files whose names matchindex-hop
andambiguous
(case-insensitive), respectively, or in a file whose name matchesundetermined
(case-insensitive).Undetermined
reads are found only in a file whose name matchesundetermined
(case-insensitive).Demuxable
reads are found only in a file whose name matches the sample id in the barcode association file.
If frender
detects any incorrectly demuxed reads, a warning will be issued and the demux_ok
flag will be set to 'False' for that barcode. A list of incorrectly demuxed files will also be printed to the terminal.
Produced only if -rc
is specified. A data file containing the folllowing metrics for each sample in the barcode association file:
sample_name,supplied_index_2,reads_supplied_index_2,rc_index_2,reads_rc_index_2,use_rc
FT-SA49175,ACTTGAAT,3374,ATTCAAGT,0,FALSE
FT-SA49189,ATGATCTG,1,CAGATCAT,1021281,TRUE
reads_supplied_index_2
and reads_rc_index_2
refer to the number of reads that could be demuxed if the forward or reverse complement of the supplied index 2 is used, respectively. use_rc
is FALSE
if frender
used the supplied index 2 sequence for that sample when writing the scan-results
file, and TRUE
if the reverse complement was used. The format of the scan-results
file is unchanged whether or not the -rc
flag was set.
-
input.fastq.gz
or/path/to/dir
(input file(s)/directory)- If a directory is specified, it will be search recursively for
*.fastq.gz
and*.fq.gz
files. To allowfrender
to pair files properly, Read 1 and read 2 files must have identical paths and names except for the single number (1
or2
) denoting the read. - If individual files are specified, you need to include both read 1 and read 2 files.
frender
will pair them up for you automatically based on their names, as above.
- If a directory is specified, it will be search recursively for
-
-r
(frender result file)- A
frender-scan-result
file is required. This file associates index combinations with read types and sample names. The content of this file is affected by the parameters of thefrender scan
command as well as the inputfastq
files provided to that command. - As a result, this file must be generated using the exact same files that are fed to
frender demux
.
- A
By default, frender
produces a pair of output files (read 1 and read 2) for each sample in the sample sheet, as well as a pair of Undetermined
files, a pair of Ambiguous
files, and a pair of Index-hop
files (corresponding to the read types above). You can alter this behavior with the following switches:
Flag | Action |
---|---|
-a |
Don't produce separate files for ambiguous reads (these reads will be sent to the Undetermined file unless -u is used; the name of the Undetermined file will be altered to reflect this) |
-i |
Don't produce separate files for index hopped reads (these reads will be sent to the Undetermined file unless -u is used; the name of the Undetermined file will be altered to reflect this) |
-s |
Don't produce individual sample files |
-u |
Don't produce Undetermined files |
-o
(output infix)
- Optional; if specified, this string will be added to the output file names
-d
(output directory)
- All output files will be collected in this directory. Defaults to
./frender-demux-output_{current date and time}
.
Gzipped fastq files as noted above.
I wrote this script partially to address our lab's specific needs and partially to develop my skillset. I hope to add updates in the future (speedup/optimization, single index support, more options) but no guarantees are made. I hope this may be helpful to others in similar situations.
Thanks to @jamorrison for help with refactoring and polishing.
This software is released under GPL v3 or later.