diff --git a/docs/Novel allele detection with MentaLiST 0.2.ipynb b/docs/Novel allele detection with MentaLiST 0.2.ipynb new file mode 100644 index 0000000..f1edddf --- /dev/null +++ b/docs/Novel allele detection with MentaLiST 0.2.ipynb @@ -0,0 +1,459 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# MentaLiST 0.2\n", + "\n", + "MentaLiST 0.2 has a new calling algorithm and also detects and reconstructs putative novel alleles, also calling non-present loci, allowing the use for wgMLST schemes.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Downloading the new version\n", + "\n", + "MentaLiST 0.2 is on a different branch on the github repository. You can clone the repository as usual and then switch to the new branch: " + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Cloning into 'MentaLiST'...\n", + "remote: Counting objects: 742, done.\u001b[K\n", + "remote: Compressing objects: 100% (103/103), done.\u001b[K\n", + "remote: Total 742 (delta 59), reused 51 (delta 22), pack-reused 611\u001b[K\n", + "Receiving objects: 100% (742/742), 28.62 MiB | 49.27 MiB/s, done.\n", + "Resolving deltas: 100% (379/379), done.\n" + ] + } + ], + "source": [ + "git clone https://github.com/WGS-TB/MentaLiST" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Branch mentalist_v0.2 set up to track remote branch mentalist_v0.2 from origin.\n", + "Switched to a new branch 'mentalist_v0.2'\n" + ] + } + ], + "source": [ + "cd MentaLiST\n", + "git checkout mentalist_v0.2" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can create an alias to make it easier to access this version, specially if you also have the original MentaLiST version installed with bioconda." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "alias mentalist=\"julia --depwarn=no $PWD/src/mentalist\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "MLST calling has some new options, in relation the the previous version: " + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "usage: mentalist call -o O -s S --db DB [-t MUTATION_THRESHOLD]\n", + " [--kt KT] [--output_votes] [--output_special]\n", + " [-h] files...\n", + "\n", + "positional arguments:\n", + " files FastQ input files\n", + "\n", + "optional arguments:\n", + " -o O Output file with MLST call\n", + " -s S Sample name\n", + " --db DB Kmer database\n", + " -t, --mutation_threshold MUTATION_THRESHOLD\n", + " Maximum edit distance (number of mutations)\n", + " when looking for novel alleles. (type: Int64,\n", + " default: 6)\n", + " --kt KT Minimum # of times a kmer is seen, to be\n", + " considered 'solid', meaning actually present\n", + " in the sample. (type: Int64, default: 10)\n", + " --output_votes Also outputs the results for the original\n", + " voting algorithm, without novel.\n", + " --output_special Also outputs a FASTA file with the alleles\n", + " from special cases such as incomplete\n", + " coverage, novel, and multiple alleles. This\n", + " can help for creating a smaller MentaLiST\n", + " database for testing different parameters or\n", + " for using a read mapper to investigate the\n", + " special cases more thoroughly.\n", + " -h, --help show this help message and exit\n", + "\n" + ] + } + ], + "source": [ + "mentalist call -h" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The -t is the maximum distance, in number of mutations (nucleotide substitutions, insertions and/or deletions) that MentaLiST will apply to an existing allele, while searching for a novel allele present in a sample. Larger values might take considerably longer.\n", + "\n", + "The --kt option determines the minimum number of times that a kmer has to be observed in the FASTQ sample to be considered 'solid'. This will be used on the calling phase, to determine if a particular allele is present (all its k-mers have to be present in the sample and above the --kt threshold), and also in the search for novel alleles. You might increase this value if you have larger depth in your sample.\n", + "\n", + "The --output_votes enables the output of the MLST calling files for the old algorithm, based only on the maximum vote count, without considering allele coverage.\n", + "\n", + "When --output_special is given, a FASTA file with all the 'special cases' will be created. This includes loci where no allele has 100% coverage, or there is more than one allele with 100% coverage, and also novel alleles. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Running MentaLiST 0.2\n", + "\n", + "Because of the new calling algorithm, new information has to be stored on the MentaLiST database, so databases created with the previous version are not compatible. The command for creating a database is exactly the same as before." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "usage: mentalist build_db --db DB -k K -f FASTA_FILES [FASTA_FILES...]\n", + " [-p PROFILE] [-c] [-h]\n", + "\n", + "optional arguments:\n", + " --db DB Output file (kmer database)\n", + " -k K Kmer size (type: Int8)\n", + " -f, --fasta_files FASTA_FILES [FASTA_FILES...]\n", + " Fasta files with the MLST scheme\n", + " -p, --profile PROFILE\n", + " Profile file for known genotypes.\n", + " -c, --disable_compression\n", + " Disables the default compression of the\n", + " database, that stores only the most\n", + " informative kmers. Not recommended unless for\n", + " debugging.\n", + " -h, --help show this help message and exit\n", + "\n" + ] + } + ], + "source": [ + "mentalist build_db -h" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The folder ../MTB_scheme has all FASTA files from a M. tuberculosis cgMLST scheme. To build a MentaLiST database for this scheme, run: " + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2017-11-20T14:26:09.925 - info: Opening FASTA files ... \n", + "2017-11-20T14:33:05.367 - info: Combining results for each locus ...\n", + "2017-11-20T14:33:11.98 - info: Saving DB ...\n", + "2017-11-20T14:33:15.073 - info: Done!\n" + ] + } + ], + "source": [ + "mentalist build_db --db mtb_cgMLST.db -k 31 -f ../MTB_scheme/*.fa" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "-rw-r--r-- 1 pfeijao users 46M Nov 20 14:33 mtb_cgMLST.db\n" + ] + } + ], + "source": [ + "ls -lh mtb_cgMLST.db" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2017-11-20T14:42:31.451 - info: Opening kmer database ... \n", + "2017-11-20T14:42:36.915 - info: Opening fastq file(s) and counting kmers ... \n", + "2017-11-20T14:43:24.928 - info: Voting for alleles ... \n", + "2017-11-20T14:43:26.77 - info: Calling alleles and novel alleles ...\n", + "2017-11-20T14:44:21.856 - info: Writing output ...\n", + "2017-11-20T14:44:24.359 - info: Done.\n" + ] + } + ], + "source": [ + "mentalist call -o sample.call -s sample --db mtb_cgMLST.db --kt 10 --output_votes --output_special ../sample.fastq.gz" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "sample.call sample.call.novel.fa sample.call.ties.txt\n", + "sample.call.byvote sample.call.novel.txt sample.call.votes.txt\n", + "sample.call.coverage.txt sample.call.special_cases.fa\n" + ] + } + ], + "source": [ + "ls sample.call*" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + ">Rv0024\n", + "GTGAATACAGCGAGGTCGAGCTGTTGAGTCGCGCTCATCAACTGTTCGCCGGAGACAGTCGGCGACCGGGGTTGGATGCGGGCACCACACCCTACGGGGATCTGCTGTCTCGGGCTGCCGACCTGAATGTGGGTGCGGGCCAGCGCCGGTATCAACTCGCCGTGGACCACAGCCGGGCGGCCTTGCTGTCTGCTGCGCGAACCGATGCCGCGGCCGGGGCCGTCATCACCGGCGCTCAACGGGATCGGGCATGGGCCCGGCGGTCGACCGGAACCGTTCTCGACGAGGCTCGCTCGGATACCACCGTTACTGCGGTTATGCCGATAGCCCAGCGCGAAGCCATACGCCGTCGTGTGGCGCGGCTGCGCGCGCAACGAGCCCATGTGCTGACGGCGCGACGACGGGCACGACGGCACCTGGCGGCGCTGCGTGCGCTGCGGTACCGGGTGGCGCACGGCCCGGGGGTCGCGCTGGCCAAACTTCGGCTGCCGTCGCCGAGCGGTCGCGCCGGCATCGCGGTCCACGCCGCGCTGTCGCGACTTGGCCGTCCCTATGTCTGGGGCGCAACGGGGCCCAACCAGTTCGACTGTTCCGGTTTGGTCCAGTGGGCCTACGCCCAGGCGGGTGTTCACCTGGATCGCACCACCTATCAACAGATCAACGAGGGGATCCCGGTGCCGCGCTCACAGGTCCGGCCGGGCGATCTGGTCTTCCCGCACCCCGGGCACGTGCAGCTGGCGATCGGCAACAATCTGGTCGTCGAGGCGCCCCATGCGGGCGCGTCGGTTCGGGTCAGCTCGCTGGGCAACAACGTGCAGATTCGGCGACCGCTGAGTGGCAGATAA\n", + ">Rv0045c\n", + "TCAGCGTGTGTCGAGCACCCCGCGCACGATCTCGATCAGGGCGCGCGGTTGGTCACTTTGCACCGAGTGGCCTGACTTCTCGACGATGTGAACGCCACGGAAATGCGTTGCACGCCTGTGGAGTTCGGCGGTGTCCTGGTCGGTGACGAAGCCCGACGAGCCGCCGCGCACGAGTGTGATCGGCGCGGACAGGGCGTCGACGTCGTCCCAGAGCCCTGCGAAATCTCCGAACGTGCGGATCGCGTCATAGCGCCACACCCAGTTGCCGTTGTCCAGCCGGCGGGAGTTGTGGAACACGCCGCGGCGCAACGACTTGATATCGCGGTGCGGGGCCGCGGCGATCGTTAGGTCCAGCATGGCCTGAAAGCTGGGGAATTCCCGCTCGCCGTGCATCAGCGCCACCGTGCCGCGCTGCTCGGCGGTCAGCTCGGCGTGCCGTTGCAATGCCGACGGGGTGACGTCGACGAGAACGAGTTCGCCGACCAGGTCGGGTGCCATCGCGGCCAGCCGTATCGCAGTCAACCCGCCCAGCGACATGCCGACCACGAATTCGGCACCCGGCGCAAGCTCGCGTAGCACCGGCGCCAAGGTCTCGGAGTTGAGCTGCGGCGAGTAATTGCCGTCCTCCCGCCAAGCGGAATGGCCGTGCTGGAAGGTCCACCGCCAGCGCCGGCTCACCCAGGCCGACGATCACGGTGTCCCAGGTATGGGCGTTCTGTCCGCCGCCGTGCAGAAAGATCACCCGCGGCGCAGAGCCGCCCCAGCGCAGCGCGCTGATGGCTCCCGCTTGGACCCGCTCGACTTCAGGCAGTGGACCATTGACACCGGCCTGCTCAGCGTTCTCAGCCAGCAGGGCAAACTCGTCCAGTCCGGTCAGTTCGTCGTCAGATAGCAC\n" + ] + } + ], + "source": [ + "# novel alleles:\n", + "head sample.call.novel.fa -n4" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Loci\tMinKmerDepth\tNmut\tDesc\n", + "Rv0024\t20\t1\tDel of len 1 at pos 6\n", + "Rv0045c\t27\t1\tIns of base G at pos 373\n", + "Rv0063\t33\t1\tIns of base C at pos 1417\n", + "Rv0134\t32\t1\tDel of len 1 at pos 386\n", + "Rv0165c\t31\t2\tIns of base C at pos 176, Ins of base G at pos 178\n", + "Rv0174\t26\t1\tSubst C->G at pos 632\n", + "Rv0217c\t30\t1\tDel of len 1 at pos 777\n", + "Rv0266c\t21\t1\tSubst T->C at pos 2164\n", + "Rv0322\t27\t1\tSubst C->G at pos 97\n" + ] + } + ], + "source": [ + "# Description of novel alleles:\n", + "head sample.call.novel.txt" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Locus\tCoverage\tMinKmerDepth\tCall\n", + "Rv0014c\t1\t27\tCalled allele 5.\n", + "Rv0015c\t1\t27\tCalled allele 2.\n", + "Rv0016c\t1\t27\tCalled allele 1.\n", + "Rv0017c\t1\t22\tCalled allele 1.\n", + "Rv0018c\t1\t28\tCalled allele 2.\n", + "Rv0019c\t1\t24\tCalled allele 1.\n", + "Rv0021c\t1\t20\tCalled allele 1.\n", + "Rv0022c\t1\t34\tCalled allele 1.\n", + "Rv0023\t1\t23\tCalled allele 1.\n", + "Rv0024\t1\t20\tNovel, 1 mutation from allele 1: Del of len 1 at pos 6\n", + "Rv0025\t1\t33\tCalled allele 1.\n", + "Rv0033\t1\t42\tCalled allele 1.\n", + "Rv0034\t1\t31\tCalled allele 2.\n", + "Rv0035\t1\t21\tCalled allele 2.\n" + ] + } + ], + "source": [ + "# Description of each call:\n", + "head sample.call.coverage.txt -n15" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Rv0767c\t0.538\t17\tNot \u001b[01;31m\u001b[Kpresent\u001b[m\u001b[K, allele 18 is the best voted but below threshold with 283 missing kmers.\n", + "Rv0768\t0.168\t99\tNot \u001b[01;31m\u001b[Kpresent\u001b[m\u001b[K, allele 20 is the best voted but below threshold with 1195 missing kmers.\n", + "Rv1269c\t0.945\t52\tEither novel or not \u001b[01;31m\u001b[Kpresent\u001b[m\u001b[K; Allele 1 has 19 missing kmers, and no novel was found.\n", + "Rv2017\t0.981\t35\tEither novel or not \u001b[01;31m\u001b[Kpresent\u001b[m\u001b[K; Allele 2 has 19 missing kmers, and no novel was found.\n", + "Rv2947c\t0.986\t109\tEither novel or not \u001b[01;31m\u001b[Kpresent\u001b[m\u001b[K; Allele 2 has 21 missing kmers, and no novel was found.\n", + "Rv3160c\t0.984\t23\tEither novel or not \u001b[01;31m\u001b[Kpresent\u001b[m\u001b[K; Allele 1 has 10 missing kmers, and no novel was found.\n", + "Rv3180c\t0.835\t27\tEither novel or not \u001b[01;31m\u001b[Kpresent\u001b[m\u001b[K; Allele 1 has 67 missing kmers, and no novel was found.\n", + "Rv3181c\t0.922\t19\tEither novel or not \u001b[01;31m\u001b[Kpresent\u001b[m\u001b[K; Allele 1 has 33 missing kmers, and no novel was found.\n", + "Rv3183\t0.753\t28\tEither novel or not \u001b[01;31m\u001b[Kpresent\u001b[m\u001b[K; Allele 2 has 74 missing kmers, and no novel was found.\n", + "Rv3188\t0.739\t57\tEither novel or not \u001b[01;31m\u001b[Kpresent\u001b[m\u001b[K; Allele 1 has 83 missing kmers, and no novel was found.\n" + ] + } + ], + "source": [ + "# Missing alleles:\n", + "grep present sample.call.coverage.txt" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Sample\tRv0014c\tRv0015c\tRv0016c\tRv0017c\tRv0018c\tRv0019c\tRv0021c\tRv0022c\tRv0023\tRv0024\tRv0025\tRv0033\tRv0034\n", + "sample\t5\t2\t1\t1\t2\t1\t1\t1\t1\tN\t1\t1\t2\n" + ] + } + ], + "source": [ + "# All calls: N for novel allele.\n", + "cut -f1-14 sample.call" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Bash", + "language": "bash", + "name": "bash" + }, + "language_info": { + "codemirror_mode": "shell", + "file_extension": ".sh", + "mimetype": "text/x-sh", + "name": "bash" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}