- I started to write a parser to analyse the features of the data
- Project file management: creating folders for notes, which is the most important thing for me
- Learning git: I finally figured out how to use git pull. I also learned git rm, git clone, etc. (see gitNotes)
1. I deleted the __.gitignore__ file and lost everything.
git rm -r --cached .    # untrack everything (the files stay on disk)
git add .               # re-stage everything, now respecting .gitignore
git commit -m "Removing all files in .gitignore"
A SAD BRAND NEW START...
2. I couldn't assign two values from a list to one key in a dictionary. I tried dict(zip()) and also dict.append(); the error was TypeError: unhashable type: 'list'.
I am now wondering why I need a dictionary at all, since there is no obvious benefit of a dictionary over lists here. Also, when should we turn the different amino acids into numeric vectors?
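A minimal sketch of what I think was going wrong (the residue and angle values are made up): a list cannot be a dictionary key because lists are unhashable, but a list is fine as a dictionary value, so one key can hold two values.

```python
residues = ["A", "C", "D"]
phi = [-60.0, -75.0, -120.0]      # made-up first value per residue
psi = [-45.0, 160.0, 130.0]       # made-up second value per residue

# One key -> a list of two values: lists are allowed as values.
features = {res: [p, s] for res, p, s in zip(residues, phi, psi)}
print(features["C"])              # [-75.0, 160.0]

# Using lists as keys is what raises the error:
# dict(zip([phi, psi], residues)) # TypeError: unhashable type: 'list'
```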
1. Preparing for the convolutional neural network journal club
I understood the mathematical meaning of convolution, but still couldn't understand CNNs themselves (a small convolution check follows this list). I also found a 3D visualization website of a CNN for image recognition. It was so cool!
2. Writing the project plan
3. I found 5 papers on SVMs for protein structure prediction and read two of them.
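To make the mathematical meaning of convolution concrete, here is a tiny check of the discrete convolution (y * k)[n] = sum_m y[m] k[n-m]; the signal and kernel values are arbitrary. A convolutional layer in a CNN slides many such kernels over the input and learns the kernel weights.

```python
import numpy as np

y = np.array([1, 2, 3])        # arbitrary signal
k = np.array([0, 1, 0.5])      # arbitrary kernel

# Full discrete convolution: (y * k)[n] = sum_m y[m] * k[n - m]
print(np.convolve(y, k))       # [0.  1.  2.5 4.  1.5]
```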
1. Working through the SVM tutorial
Q1: What do x and y represent (3rd block in ski-learn.py)?
Q2: Why use float instead of integer in clf.predict (3rd block in ski-learn.py)?
Q3: Why does my clf.support_ return a reversed list compared to the example code on the website?
Q4: What is multiclass classification?
Q5: What are the differences between 'one vs one' and 'one vs the rest'?
A5: The difference is the number of classifiers you have to train, which strongly affects the decision boundaries they create.
Assume you have N different classes. One-vs-rest trains one classifier per class, N classifiers in total. For class i it treats the samples labelled i as positive and all the rest as negative. This often leads to imbalanced training sets, so a generic SVM might not work well, but there are workarounds.
In one-vs-one you train a separate classifier for each pair of labels, which gives N(N−1)/2 classifiers. This is much less sensitive to imbalanced data but is much more computationally expensive.
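A minimal scikit-learn sketch of the two strategies on a made-up 3-class problem (the points and labels are arbitrary):

```python
from sklearn.svm import SVC
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X = [[0, 0], [1, 1], [2, 2], [0, 1], [1, 2], [2, 0]]   # made-up points
y = [0, 0, 1, 1, 2, 2]                                  # 3 classes

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)   # N classifiers
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)    # N*(N-1)/2 classifiers

print(len(ovr.estimators_))   # 3
print(len(ovo.estimators_))   # 3, i.e. 3*2/2 for three classes
```

Note that SVC itself already uses one-vs-one internally for multiclass problems.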
1. Prepare for the presentation (see /KB8024/notes/journal/)
2. An improved parser
3. Preprocessing
4. OneHotEncoder
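A minimal sketch of how I expect one-hot encoding of residues to work (the three-residue window is made up; fixing the category list keeps the 20 columns in a constant order):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")        # the 20 standard residues

# Fix the categories so each residue always maps to the same column
encoder = OneHotEncoder(categories=[AMINO_ACIDS], handle_unknown="ignore")

window = np.array(list("ACD")).reshape(-1, 1)     # made-up 3-residue window
encoded = encoder.fit_transform(window).toarray() # shape (3, 20)
print(encoded.shape)
print(encoded[0][:5])                             # 'A' -> [1. 0. 0. 0. 0.]
```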
Finish the most basic version of the predictor
- Journal Club
- Presentation
- Write a peer presentation evaluation
- Changed the data input to DataFrame format
- succeeded in reading information from multiple sequences
- succeeded in one-hot encoding and label encoding (in a brute-force way)
- added padding to the head and tail of my data according to the window size
- read paper: Chen, Ke, Lukasz Kurgan, and Jishou Ruan. "Optimization of the sliding window size for protein structure prediction." 2006 IEEE Symposium on Computational Intelligence and Bioinformatics and Computational Biology (CIBCB'06). IEEE, 2006.
- slid the window along the padded sequences (a small sketch follows this entry)
- attended the presentation
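A minimal sketch of the padding-plus-sliding-window step as I understand it (the padding character, window size, and sequence are arbitrary choices):

```python
def windows(sequence, window_size=7, pad_char="X"):
    """Yield one fixed-length window centred on each residue, padding both ends."""
    half = window_size // 2
    padded = pad_char * half + sequence + pad_char * half
    for i in range(len(sequence)):
        yield padded[i:i + window_size]

for w in windows("ACDEFGH", window_size=7):       # made-up sequence
    print(w)
# XXXACDE, XXACDEF, XACDEFG, ACDEFGH, CDEFGHX, DEFGHXX, EFGHXXX
```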
- Cross validation
- tried array, matrix.flatten, and many other methods to adjust the shape of the input data.
- solved the input data shape problems
- built the crossvalidation, predictor, testdataparser, trainsetpreprocess, and windowsize files
- Journal club and group meeting
- solved the windowsize problems
- Tested window-size optimization, windowsize = (3, 21, step=2)
- finished cross-validation, including the K-fold and leave-one-out (LOO) methods; read papers and found that a 7-fold split could be the best choice (a small sketch is below)
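A minimal sketch of the 7-fold cross-validation I have in mind; random arrays stand in for the encoded windows and secondary-structure labels:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

X = np.random.rand(70, 140)            # stand-in for 70 encoded windows
y = np.random.randint(0, 3, size=70)   # stand-in for 3 structure classes

clf = SVC(kernel="rbf", C=1.0)
scores = cross_val_score(clf, X, y, cv=KFold(n_splits=7, shuffle=True, random_state=0))
print(scores.mean(), scores.std())

# LeaveOneOut() can be passed as cv= in the same way, but it is far slower.
```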
- tried different SVM classes: SVC, NuSVC, and LinearSVC (a small comparison sketch is at the end of these notes)
- finished running PSI-BLAST
- read about PSSMs (position-specific scoring matrices)
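A minimal sketch of comparing the three SVM variants on stand-in data (everything here is made up; the point is only the API differences):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC, NuSVC, LinearSVC

np.random.seed(0)
X = np.random.rand(60, 40)             # stand-in features
y = np.random.randint(0, 3, size=60)   # stand-in labels, 3 classes

for clf in (SVC(kernel="rbf", C=1.0),          # penalty via C, any kernel, one-vs-one
            NuSVC(nu=0.5),                      # nu replaces C; bounds the fraction of support vectors
            LinearSVC(C=1.0, max_iter=5000)):   # linear kernel only, one-vs-rest, scales better
    scores = cross_val_score(clf, X, y, cv=5)
    print(type(clf).__name__, round(scores.mean(), 3))
```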