Skip to content

Davo36/bioinformatics-algorithms

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CODE CHALLENGE: Implement BWMATCHING.

Input: A string BWT(Text), followed by a collection of Patterns. Output: A list of integers, where the i-th integer corresponds to the number of substring matches of the i-th member of Patterns in Text.

Extra Dataset

Sample Input: TCCTCTATGAGATCCTATTCTATGAAACCTTCA$GACCAAAATTCTCCGGC CCT CAC GAG CAG ATC Sample Output: 2 1 1 0 1

CODE CHALLENGE: Implement BETTERBWMATCHING.

Input: A string BWT(Text) followed by a collection of strings Patterns. Output: A list of integers, where the i-th integer corresponds to the number of substring matches of the i-th member of Patterns in Text.

Extra Dataset

Sample Input: GGCGCCGC$TAGTCACACACGCCGTA ACC CCG CAG Sample Output: 1 2 1

Time Limit: 5 mins

CODE CHALLENGE: Solve the Multiple Pattern Matching Problem.

Input: A string Text followed by a collection of strings Patterns. Output: All starting positions in Text where a string from Patterns appears as a substring.

Extra Dataset

Sample Input: AATCGGGTTCAATCGGGGT ATCG GGGT Sample Output: 1 4 11 15

Time Limit: 5 mins

You should now be ready to design your own approach to solve the Multiple Approximate Pattern Matching Problem and use this solution to map real sequencing reads.

CODE CHALLENGE: Solve the Multiple Approximate Pattern Matching Problem. Input: A string Text, followed by a collection of strings Patterns, and an integer d. Output: All positions where one of the strings in Patterns appears as a substring of Text with at most d mismatches.

Extra Dataset

Sample Input: ACATGCTACTTT ATT GCC GCTA TATT 1 Sample Output: 2 4 4 6 7 8 9

Time Limit: 5 mins

general test

smnpbnnaaaaa$a ana

Sample Input: 4 0->4:11 1->4:2 2->5:6 3->5:7 4->0:11 4->1:2 4->5:4 5->4:4 5->3:7 5->2:6 Sample Output: 0 13 21 22 13 0 12 13 21 12 0 13 22 13 13 0

Sample Input: 4 1 0 13 21 22 13 0 12 13 21 12 0 13 22 13 13 0 Sample Output: 2

Sample Input: 4 0 13 21 22 13 0 12 13 21 12 0 13 22 13 13 0 Sample Output: 0->4:11 1->4:2 2->5:6 3->5:7 4->0:11 4->1:2 4->5:4 5->4:4 5->3:7 5->2:6

Sample Input: 4 0 20 17 11 20 0 20 13 17 20 0 10 11 13 10 0 Sample Output: 0->5:7.000 1->6:8.833 2->4:5.000 3->4:5.000 4->2:5.000 4->3:5.000 4->5:2.000 5->0:7.000 5->4:2.000 5->6:1.833 6->5:1.833 6->1:8.833

Sample Input: 4 0 23 27 20 23 0 30 28 27 30 0 30 20 28 30 0 Sample Output: 0->4:8.000 1->5:13.500 2->5:16.500 3->4:12.000 4->5:2.000 4->0:8.000 4->3:12.000 5->1:13.500 5->2:16.500 5->4:2.000

CODE CHALLENGE: Implement SmallParsimony to solve the Small Parsimony Problem. Input: An integer n followed by an adjacency list for a rooted binary tree with n leaves labeled by DNA strings. Output: The minimum parsimony score of this tree, followed by the adjacency list of the tree corresponding to labeling internal nodes by DNA strings in order to minimize the parsimony score of the tree.

Note: Remember to run SmallParsimony on each individual index of the strings at the leaves of the tree.

Extra Dataset

Sample Input: 4 4->CAAATCCC 4->ATTGCGAC 5->CTGCGCTG 5->ATGGACGA 6->4 6->5 Sample Output: 16 ATTGCGAC->ATAGCCAC:2 ATAGACAA->ATAGCCAC:2 ATAGACAA->ATGGACTA:2 ATGGACGA->ATGGACTA:1 CTGCGCTG->ATGGACTA:4 ATGGACTA->CTGCGCTG:4 ATGGACTA->ATGGACGA:1 ATGGACTA->ATAGACAA:2 ATAGCCAC->CAAATCCC:5 ATAGCCAC->ATTGCGAC:2 ATAGCCAC->ATAGACAA:2 CAAATCCC->ATAGCCAC:5

CODE CHALLENGE: Solve the Small Parsimony in an Unrooted Tree Problem.

Input: An integer n followed by an adjacency list for an unrooted binary tree with n leaves labeled by DNA strings. Output: The minimum parsimony score of this tree, followed by the adjacency list of the tree corresponding to labeling internal nodes by DNA strings in order to minimize the parsimony score of the tree.

Extra Dataset

Sample Input: 4 TCGGCCAA->4 4->TCGGCCAA CCTGGCTG->4 4->CCTGGCTG CACAGGAT->5 5->CACAGGAT TGAGTACC->5 5->TGAGTACC 4->5 5->4 Sample Output: 17 TCGGCCAA->CCAGGCAC:4 CCTGGCTG->CCAGGCAC:3 TGAGTACC->CAAGGAAC:4 CCAGGCAC->CCTGGCTG:3 CCAGGCAC->CAAGGAAC:2 CCAGGCAC->TCGGCCAA:4 CACAGGAT->CAAGGAAC:4 CAAGGAAC->CACAGGAT:4 CAAGGAAC->TGAGTACC:4 CAAGGAAC->CCAGGCAC:2

Time Limit: 5 mins

Sample Input:

5 4 0->4 4->0 1->4 4->1 2->5 5->2 3->5 5->3 4->5 5->4 Sample Output: 1->4 0->5 3->4 2->5 5->2 5->4 5->0 4->1 4->5 4->3

1->5 0->4 3->4 2->5 5->2 5->4 5->1 4->0 4->5 4->3

Sample Input: 4 CGAAGATTCTAA->4 ATGCCGGGCTCG->4 CTTTTAGAAGCG->5 AACTCATGATAT->5 5->AACTCATGATAT 5->CTTTTAGAAGCG 5->4 4->ATGCCGGGCTCG 4->CGAAGATTCTAA 4->5 Sample Output: 22 AACTCATGATAT->CTATCAGGATCG:6 CTTTTAGAAGCG->CTATCAGGATCG:4 CGAAGATTCTAA->CTAACAGGCTCG:6 CTATCAGGATCG->CTTTTAGAAGCG:4 CTATCAGGATCG->CTAACAGGCTCG:2 CTATCAGGATCG->AACTCATGATAT:6 CTAACAGGCTCG->ATGCCGGGCTCG:4 CTAACAGGCTCG->CGAAGATTCTAA:6 CTAACAGGCTCG->CTATCAGGATCG:2 ATGCCGGGCTCG->CTAACAGGCTCG:4

21 AACTCATGATAT->CTATCATGCTAA:5 CTTTTAGAAGCG->CTTTCAGGCTCG:4 CGAAGATTCTAA->CTATCATGCTAA:4 CTATCATGCTAA->CTTTCAGGCTCG:4 CTATCATGCTAA->CGAAGATTCTAA:4 CTATCATGCTAA->AACTCATGATAT:5 CTTTCAGGCTCG->ATGCCGGGCTCG:4 CTTTCAGGCTCG->CTTTTAGAAGCG:4 CTTTCAGGCTCG->CTATCATGCTAA:4 ATGCCGGGCTCG->CTTTCAGGCTCG:4

Time Limit: 5 mins

CODE CHALLENGE: Implement the FarthestFirstTraversal clustering heuristic.

Input: Integers k and m followed by a set of points Data in m-dimensional space. Output: A set Centers consisting of k points (centers) resulting from applying FarthestFirstTraversal(Data, k), where the first point from Data is chosen as the first center to initialize the algorithm.

Extra Dataset

Sample Input: 3 2 0.0 0.0 5.0 5.0 0.0 5.0 1.0 1.0 2.0 2.0 3.0 3.0 1.0 2.0 Sample Output: 0.0 0.0 5.0 5.0 0.0 5.0

Squared Error Distortion Problem:
 Compute the squared error distortion of a set of data points with respect to a set of centers.

  Input: A set of points Data and a set of centers Centers.
   Output: The squared error distortion Distortion(Data, Centers).

CODE CHALLENGE: Solve the Squared Error Distortion Problem. Input: Integers k and m, followed by a set of centers Centers and a set of points Data. Output: The squared error distortion Distortion(Data, Centers).

Extra Dataset

Sample Input: 2 2 2.31 4.55 5.96 9.08


3.42 6.03 6.23 8.25 4.76 1.64 4.47 4.33 3.95 7.61 8.93 2.97 9.74 4.03 1.73 1.28 9.72 5.01 7.27 3.77 Sample Output: 18.246

CODE CHALLENGE: Implement the Lloyd algorithm for k-means clustering.

Input: Integers k and m followed by a set of points Data in m-dimensional space. Output: A set Centers consisting of k points (centers) resulting from applying the Lloyd algorithm to Data and Centers, where the first k points from Data are selected as the first k centers.

Extra Dataset

Sample Input: 2 2 1.3 1.1 1.3 0.2 0.6 2.8 3.0 3.2 1.2 0.7 1.4 1.6 1.2 1.0 1.2 1.1 0.6 1.5 1.8 2.6 1.2 1.3 1.2 1.0 0.0 1.9 Sample Output: 1.800 2.867 1.060 1.140

Time Limit: 5 mins

CODE CHALLENGE: Implement the expectation maximization algorithm for soft k-means clustering.

Input: Integers k and m, followed by a stiffness parameter β, followed by a set of points Data in m-dimensional space. Output: A set Centers consisting of k points (centers) resulting from applying the expectation maximization algorithm for soft k-means clustering. Select the first k points from Data as the first centers for the algorithm and run the algorithm for 100 E-steps and 100 M-steps. Results should be accurate up to three decimal places.

Extra Dataset

Sample Input: 2 2 2.7 1.3 1.1 1.3 0.2 0.6 2.8 3.0 3.2 1.2 0.7 1.4 1.6 1.2 1.0 1.2 1.1 0.6 1.5 1.8 2.6 1.2 1.3 1.2 1.0 0.0 1.9 Sample Output: 1.662 2.623 1.075 1.148

Time Limit: 5 mins

CODE CHALLENGE: Implement HierarchicalClustering.

Input: An integer n, followed by an n x n distance matrix. Output: The result of applying HierarchicalClustering to this distance matrix (using Davg), with each newly created cluster listed on each line.

Extra Dataset

Sample Input: 7 0.00 0.74 0.85 0.54 0.83 0.92 0.89 0.74 0.00 1.59 1.35 1.20 1.48 1.55 0.85 1.59 0.00 0.63 1.13 0.69 0.73 0.54 1.35 0.63 0.00 0.66 0.43 0.88 0.83 1.20 1.13 0.66 0.00 0.72 0.55 0.92 1.48 0.69 0.43 0.72 0.00 0.80 0.89 1.55 0.73 0.88 0.55 0.80 0.00 Sample Output: 4 6 5 7 3 4 6 1 2 5 7 3 4 6 1 2 5 7 3 4 6

Time Limit: 5 mins

CODE CHALLENGE: Solve the Probability of a Hidden Path Problem.

Given: A hidden path π followed by the states States and transition matrix Transition of an HMM (Σ, States, Transition, Emission). Return: The probability of this path, Pr(π).

Note: You may assume that transitions from the initial state occur with equal probability.

Extra Dataset

Sample Input: ABABBBAAAA


A B


A B A 0.377 0.623 B 0.26 0.74 Sample Output: 0.000384928691755

Time Limit: 5 mins

CODE CHALLENGE: Solve the Probability of an Outcome Given a Hidden Path Problem.

Input: A string x, followed by the alphabet from which x was constructed, followed by a hidden path π, followed by the states States and emission matrix Emission of an HMM (Σ, States, Transition, Emission). Output: The conditional probability Pr(x|π) that x will be emitted given that the HMM follows the hidden path π.

Note: You may assume that transitions from the initial state occur with equal probability. Extra Dataset

Sample Input: zzzyxyyzzx


x y z


BAAAAAAAAA


A B


x y z A 0.176 0.596 0.228 B 0.225 0.572 0.203 Sample Output: 3.59748954746e-06

Time Limit: 5 mins

CODE CHALLENGE: Implement the Viterbi algorithm solving the Decoding Problem.

Input: A string x, followed by the alphabet from which x was constructed, followed by the states States, transition matrix Transition, and emission matrix Emission of an HMM (Σ, States, Transition, Emission). Output: A path that maximizes the (unconditional) probability Pr(x, π) over all possible paths π.

Note: You may assume that transitions from the initial state occur with equal probability. Extra Dataset

Sample Input: xyxzzxyxyy


x y z


A B


A B A 0.641 0.359 B 0.729 0.271


x y z A 0.117 0.691 0.192 B 0.097 0.42 0.483 Sample Output: AAABBAAAAA

Time Limit: 5 mins

CODE CHALLENGE: Solve the Outcome Likelihood Problem.

Input: A string x, followed by the alphabet from which x was constructed, followed by the states States, transition matrix Transition, and emission matrix Emission of an HMM (Σ, States, Transition, Emission). Output: The probability Pr(x) that the HMM emits x.

Note: You may assume that transitions from the initial state occur with equal probability. Extra Dataset

Sample Input: xzyyzzyzyy


x y z


A B


A B A 0.303 0.697 B 0.831 0.169


x y z A 0.533 0.065 0.402 B 0.342 0.334 0.324 Sample Output: 1.1005510319694847e-06

Time Limit: 5 mins

CODE CHALLENGE: Solve the Profile HMM Problem.

Input: A threshold θ, followed by an alphabet Σ, followed by a multiple alignment Alignment whose strings are formed from Σ. Output: The transition matrix followed by the emission matrix of HMM(Alignment, θ).

Note: Your matrices can be either space-separated or tab-separated.

Extra Dataset

Sample Input: 0.289


A B C D E


EBA E-D EB- EED EBD EBE E-D E-D Sample Output: S I0 M1 D1 I1 M2 D2 I2 E S 0 0 1.0 0 0 0 0 0 0 I0 0 0 0 0 0 0 0 0 0 M1 0 0 0 0 0.625 0.375 0 0 0 D1 0 0 0 0 0 0 0 0 0 I1 0 0 0 0 0 0.8 0.2 0 0 M2 0 0 0 0 0 0 0 0 1.0 D2 0 0 0 0 0 0 0 0 1.0 I2 0 0 0 0 0 0 0 0 0 E 0 0 0 0 0 0 0 0 0


A B C D E S 0 0 0 0 0 I0 0 0 0 0 0 M1 0 0 0 0 1.0 D1 0 0 0 0 0 I1 0 0.8 0 0 0.2 M2 0.143 0 0 0.714 0.143 D2 0 0 0 0 0 I2 0 0 0 0 0 E 0 0 0 0 0

Time Limit: 5 mins

CODE CHALLENGE: Solve the Profile HMM with Pseudocounts Problem.

Input: A threshold θ and a pseudocount σ, followed by an alphabet Σ, followed by a multiple alignment Alignment whose strings are formed from Σ. Output: The transition and emission matrices of HMM(Alignment, θ, σ).

Note: Your matrices can be either space-separated or tab-separated.

Extra Dataset

Sample Input: 0.358 0.01


A B C D E


A-A ADA ACA A-C -EA D-A Sample Output: S I0 M1 D1 I1 M2 D2 I2 E S 0 0.01 0.819 0.172 0 0 0 0 0 I0 0 0.333 0.333 0.333 0 0 0 0 0 M1 0 0 0 0 0.398 0.592 0.01 0 0 D1 0 0 0 0 0.981 0.01 0.01 0 0 I1 0 0 0 0 0.01 0.981 0.01 0 0 M2 0 0 0 0 0 0 0 0.01 0.99 D2 0 0 0 0 0 0 0 0.5 0.5 I2 0 0 0 0 0 0 0 0.5 0.5 E 0 0 0 0 0 0 0 0 0


A B C D E S 0 0 0 0 0 I0 0.2 0.2 0.2 0.2 0.2 M1 0.771 0.01 0.01 0.2 0.01 D1 0 0 0 0 0 I1 0.01 0.01 0.327 0.327 0.327 M2 0.803 0.01 0.168 0.01 0.01 D2 0 0 0 0 0 I2 0.2 0.2 0.2 0.2 0.2 E 0 0 0 0 0

Time Limit: 5 mins

CODE CHALLENGE: Solve the Sequence Alignment with Profile HMM Problem.

Input: A string x followed by a threshold θ and a pseudocount σ, followed by an alphabet Σ, followed by a multiple alignment Alignment whose strings are formed from Σ. Output: An optimal hidden path emitting x in HMM(Alignment, θ, σ).

Extra Dataset

Sample Input: AEFDFDC


0.4 0.01


A B C D E F


ACDEFACADF AFDA—CCF A–EFD-FDC ACAEF–A-C ADDEFAAADF Sample Output: M1 D2 D3 M4 M5 I5 M6 M7 M8

Time Limit: 5 mins

CODE CHALLENGE: Solve the HMM Parameter Estimation Problem.

Input: A string x of symbols emitted from an HMM, followed by the HMM’s alphabet Σ, followed by a path π, followed by the collection of states of the HMM. Output: A transition matrix Transition followed by an emission matrix Emission that maximize Pr(x, π) over all possible transition and emission matrices.

Extra Dataset

Sample Input: yzzzyxzxxx


x y z


BBABABABAB


A B C Sample Output: A B C A 0.0 1.0 0.0 B 0.8 0.2 0.0 C 0.333 0.333 0.333


x y z A 0.25 0.25 0.5 B 0.5 0.167 0.333 C 0.333 0.333 0.333

Time Limit: 5 mins

CODE CHALLENGE: Implement Viterbi learning for estimating the parameters of an HMM.

Input: A number of iterations j, followed by a string x of symbols emitted by an HMM, followed by the HMM’s alphabet Σ, followed by the HMM’s states, followed by initial transition and emission matrices for the HMM. Output: Emission and transition matrices resulting from applying Viterbi learning for j iterations.

Extra Dataset

Sample Input: 100


zyzxzxxxzz


x y z


A B


A B A 0.599 0.401 B 0.294 0.706


x y z A 0.424 0.367 0.209 B 0.262 0.449 0.289 Sample Output: A B A 0.5 0.5 B 0.0 1.0


x y z A 0.333 0.333 0.333 B 0.4 0.1 0.5

Time Limit: 5 mins

CODE CHALLENGE: Solve the Soft Decoding Problem.

Input: A string x, followed by the alphabet Σ from which x was constructed, followed by the states States, transition matrix Transition, and emission matrix Emission of an HMM (Σ, States, Transition, Emission). Output: An |x| x |States| matrix whose (i, k)-th element holds the conditional probability Pr(πi = k|x).

Extra Dataset

Sample Input: zyxxxxyxzz


x y z


A B


A B A 0.911 0.089 B 0.228 0.772


x y z A 0.356 0.191 0.453 B 0.040 0.467 0.493 Sample Output: A B 0.5438 0.4562 0.6492 0.3508 0.9647 0.0353 0.9936 0.0064 0.9957 0.0043 0.9891 0.0109 0.9154 0.0846 0.964 0.036 0.8737 0.1263 0.8167 0.1833

Time Limit: 5 mins

CODE CHALLENGE: Construct the graph of a spectrum.

Given: A space-delimited list of integers Spectrum. Return: Graph(Spectrum).

Note: Throughout this chapter, all dataset problems implicitly use the standard integer-valued mass table for the regular twenty amino acids. Examples sometimes use the toy amino acid alphabet {X, Y} whose masses are 4 and 5, respectively.

Extra Dataset

Sample Input: 57 71 154 185 301 332 415 429 486 Sample Output: 0->57:G 0->71:A 57->154:P 57->185:K 71->185:N 154->301:F 185->332:F 301->415:N 301->429:K 332->429:P 415->486:A 429->486:G

Time Limit: 5 mins

CODE CHALLENGE: Solve the Decoding an Ideal Spectrum Problem.

Given: A space-delimited list of integers Spectrum. Return: An amino acid string that explains Spectrum.

Extra Dataset

Sample Input: 57 71 154 185 301 332 415 429 486 Sample Output: GPFNA

Time Limit: 5 mins

CODE CHALLENGE: Solve the Converting a Peptide into a Peptide Vector Problem.

Given: An amino acid string P. Return: The peptide vector of P (in the form of space-separated integers).

Extra Dataset

Sample Input: XZZXX Sample Output: 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1

Time Limit: 5 mins

CODE CHALLENGE: Solve the Converting a Peptide Vector into a Peptide Problem.

Given: A space-delimited binary vector P Return: An amino acid string whose binary peptide vector matches P. For masses with more than one amino acid, any choice may be used.

Extra Dataset

Sample Input: 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 Sample Output: XZZXX

Time Limit: 5 mins

CODE CHALLENGE: Solve the Peptide Sequencing Problem.

Given: A space-delimited spectral vector Spectrum’. Return: An amino acid string with maximum score against Spectrum’. For masses with more than one amino acid, any choice may be used.

Extra Dataset

Sample Input: 0 0 0 0 4 -2 -3 -1 -7 6 5 3 2 1 9 3 -8 0 3 1 2 1 8 Sample Output: XZZXX

Time Limit: 5 mins

CODE CHALLENGE: Solve the Peptide Identification Problem.

Given: A space-delimited spectral vector Spectrum’ and an amino acid string Proteome. Return: A substring of Proteome with maximum score against Spectrum’.

Extra Dataset

Sample Input: 0 0 0 0 4 -2 -3 -1 -7 6 5 3 2 1 9 3 -8 0 3 1 2 1 8 XZZXZXXXZXZZXZXXZ Sample Output: ZXZXX

Time Limit: 5 mins

CODE CHALLENGE: Implement PSMSearch to solve the Peptide Search Problem.

Given: A set of space-delimited spectral vectors SpectralVectors, an amino acid string Proteome, and an integer threshold. Return: The set PSMthreshold(Proteome, SpectralVectors).

Extra Dataset

Sample Input: 0 -1 5 -4 5 3 -1 -4 5 -1 0 0 4 -1 0 1 4 4 4 1 -4 2 -2 -4 4 -5 -1 4 -1 2 5 -3 -1 3 2 -3 XXXZXZXXZXZXXXZXXZX 5 Sample Output: XZXZ

Time Limit: 5 mins

CODE CHALLENGE: Solve the Size of Spectral Dictionary Problem.

Given: A spectral vector Spectrum’, an integer threshold, and an integer max_score. Return: The size of the dictionary Dictionarythreshold(Spectrum’).

Note: Use the provided max_score for the height of your table. Your answer should be the number of peptides whose score is at least T and at most max_score.

Extra Dataset

Sample Input: 0 4 -3 -2 3 3 -4 5 -3 -1 -1 3 4 1 -1 1 5 Sample Output: 3

Time Limit: 5 mins

CODE CHALLENGE: Solve the Probability of Spectral Dictionary Problem.

Given: A spectral vector Spectrum’, an integer threshold, and an integer max_score. Return: The probability of the dictionary Dictionarythreshold(Spectrum’).

Note: Use the provided max_score for the height of your table.

Extra Dataset

Sample Input: 0 4 -3 -2 3 3 -4 5 -3 -1 -1 3 4 1 -1 1 5 Sample Output: 0.375

Time Limit: 5 mins

CODE CHALLENGE: Solve the Spectral Alignment Problem.

Given: A peptide Peptide, a spectral vector Spectrum’, and an integer k. Return: A peptide Peptide’ related to Peptide by up to k modifications with maximal score against Spectrum’ out of all possibilities.

Extra Dataset

Sample Input: XXZ 0 4 -3 -2 3 3 -4 5 -3 -1 -1 3 4 1 -1 2 Sample Output: XX(-1)Z(+2)

Time Limit: 5 mins

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%