Home
Things we finished by the end of today:
- Finished the whole DSM-5 diagnoses text and generated the cosine matrix with an n-gram size of 3.
- Finished converting samples of the diagnosis cosine matrix into a JSON file (the format still needs fixing).
- Started working on the manually encoded category list.
- Kelly has finished the adjacency matrix in D3 and is waiting for the links and nodes dictionaries.
Things we will finish by the end of tomorrow:
- Try changing the n-gram size and generate the cosine matrix for the whole DSM-5 text again.
- Finish the binary code list for the manually encoded categories and convert the file into a cosine similarity matrix.
- Finish both the links and nodes dictionaries and load them into Kelly's D3 program for comparison (a conversion sketch follows this list).
- Hopefully reach a conclusion by the end of tomorrow on whether to create new categories or delete duplicated ones.
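As a rough idea of how the links and nodes dictionaries could be produced, here is a minimal sketch, assuming the cosine matrix was saved by pandas with diagnosis codes as both the index and the columns; the file names and the 0.5 link threshold are placeholders, not decisions we have made:

```python
# Minimal sketch: turn a cosine similarity matrix into D3-style
# nodes/links dictionaries (file names and threshold are assumptions).
import json

import pandas as pd

sim = pd.read_csv("cosine_matrix.csv", index_col=0)

nodes = [{"name": code} for code in sim.index]
index_of = {code: i for i, code in enumerate(sim.index)}

links = []
for src in sim.index:
    for dst in sim.columns:
        if src < dst and sim.loc[src, dst] > 0.5:  # arbitrary cutoff for a link
            links.append({"source": index_of[src],
                          "target": index_of[dst],
                          "value": float(sim.loc[src, dst])})

with open("graph.json", "w") as f:
    json.dump({"nodes": nodes, "links": links}, f, indent=2)
```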
Things we finished by the end of today:
- Finished copying and pasting all the text from DSM-5 into a text file that the Python program can read (all text except coding notes and references).
- Tested the sample text from Chapters 1 to 15. We removed all the repetitive codes, missing codes, and some ASCII encoding that the Python program on Mac could not read.
- The first 15 chapters take about 22 minutes to run; we will try running the whole text file on a faster computer tomorrow.
Things we will finish by the end of tomorrow:
- Test and check the whole text file and try to generate the cosine similarity matrix from the full text.
- Create the manually tagged categories and rearrange the data to generate another cosine similarity matrix.
Things we finished by the end of today:
Because the Windows system cannot recognize the format of the text file sent from the Mac, I had to copy and paste the online version of the DSM instead. We tried the cosine similarity matrix on the diagnoses from Chapters 1 through 6; it worked well and takes about 2.5 minutes to write the matrix to the CSV file.
Things we will finish by the end of tomorrow:
- Finish the text file of the whole DSM and generate the matrix of all the diagnoses.
- Try different n-gram sizes, maybe from 3 to 6 (see the sketch after this list).
- Convert the matrix to the dictionary needed for the adjacency matrix.
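A hedged sketch of trying n-gram sizes 3 through 6, assuming character n-grams with scikit-learn (whether our n-grams run over characters or words is still to be decided); the two diagnosis texts below are placeholders:

```python
# Sweep the n-gram size and compare the resulting cosine similarity
# between two placeholder diagnosis texts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = {
    "F20.81": "schizophreniform disorder sample criteria text ...",
    "F20.9": "schizophrenia sample criteria text ...",
}

for n in range(3, 7):
    vec = CountVectorizer(analyzer="char", ngram_range=(n, n))
    X = vec.fit_transform(texts.values())
    sim = cosine_similarity(X)
    print(f"n={n}: similarity = {sim[0, 1]:.3f}")
```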
Things we have finished today:
1. We finished everything on yesterday's to-do list. The code is ready to calculate the cosine similarity between any two diagnoses and write the data directly to a CSV file as a data frame.
2. We tested the code with sample lists of diagnoses and it works well.
3. We have the plain text file, but it still contains a lot of unimportant information.
Things we will finish by the end of tomorrow:
1. Convert the CSV file to a JSON file (a minimal sketch follows this list).
2. Clean the text file so it contains only the relevant things we need.
3. Run the whole text file through the Python program and get the data frame we want, then put the data frame into the adjacency matrix and see how well our current category classification holds up.
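A minimal sketch of step 1 (CSV to JSON), assuming the cosine matrix was written by pandas with the diagnosis codes as the index; the file names are assumptions:

```python
# Read the similarity matrix back and dump it as nested JSON.
import pandas as pd

sim = pd.read_csv("cosine_matrix.csv", index_col=0)
# orient="index" gives {row_code: {col_code: similarity, ...}, ...}
sim.to_json("cosine_matrix.json", orient="index")
```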
Finished the simplified model of the cosine similarity calculation between two diagnoses, and finished the test 4x4 cosine similarity matrix; it works well. For example, the cosine value between F20.81 (Schizophreniform Disorder) and F20.9 (Schizophrenia) is 0.7, which makes sense. For the next step, we want to input the whole text into the program and store the matrix in a CSV file as the output. Things we want to finish by the end of tomorrow:
1. Figure out how to store a matrix in a CSV file as the output (see the sketch after this list).
2. Test some other diagnoses and make sure the model makes sense.
3. Study the D3 adjacency matrix and understand how it clusters different elements.
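A sketch of question 1 above (writing the matrix out as a CSV), assuming the similarities sit in a square numpy array and the diagnosis codes are listed in the same order; the codes and values below are placeholders:

```python
# Label a square similarity array with diagnosis codes and write it to CSV.
import numpy as np
import pandas as pd

codes = ["F20.81", "F20.9", "F21", "F22"]  # e.g. a 4x4 test case
sim = np.eye(len(codes))                   # placeholder similarity values
pd.DataFrame(sim, index=codes, columns=codes).to_csv("cosine_matrix.csv")
```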
Still working on the text mining. Kelly found a document on using Python packages to automatically cluster a collection of texts. The tool we will use is K-means clustering: you choose the number of clusters, each observation is assigned to a cluster so as to minimize the within-cluster sum of squares, the mean of each cluster's observations becomes the new centroid, and observations are then reassigned and centroids recalculated iteratively until the algorithm converges (a hedged sketch follows the questions below). The document also covers visualizing the clusters with matplotlib and mpld3; the output is a graph showing the clusters in different colors. However, I was a little confused by the .pkl file type the document's author used while running K-means, so I went back to building the cosine similarity matrix in Python first. I took a sample of 4 diagnoses and am still working on the code; I will finish it on Monday. Some of the questions I have include:
1. There are more than 500 diagnoses, so the matrix would be extremely large. What should we do with such an enormous matrix?
2. Building a matrix does not by itself classify the clusters. What should we do after finishing the matrix? What is the criterion for belonging to one cluster?
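A hedged sketch of the K-means pipeline described above, assuming TF-IDF features and scikit-learn (the real feature choice and number of clusters are still open); the diagnosis texts are placeholders:

```python
# Vectorize placeholder diagnosis texts and cluster them with K-means;
# scikit-learn handles the iterative centroid updates internally.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "delusions and hallucinations ...",
    "depressed mood and loss of interest ...",
    "excessive worry and restlessness ...",
    "elevated mood and decreased need for sleep ...",
]

X = TfidfVectorizer(stop_words="english").fit_transform(texts)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # the cluster assignment for each diagnosis
```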
Started to explore text mining of the different diagnoses. We finished the basic structure of the text-similarity comparison Python program. We use cosine similarity to measure how similar two sentences are: the closer the value is to 1, the more similar the sentences (a minimal sketch follows this paragraph). We tested the program on two sample diagnoses and it worked well. As the next step, we want to find a way to load the whole text of DSM-5 into Python and compare any two diagnoses. Jordan suggested we build a matrix whose rows and columns are the diagnoses and whose entries are the similarity values. For now we ignore synonyms, i.e. the case where different words have the same meaning; this seems reasonable because the writers have consistent word-usage habits and it is quite infrequent for many synonyms to appear within one chapter.
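A minimal sketch of the two-sentence comparison; the sample texts are invented, and TF-IDF weighting is one reasonable choice of features:

```python
# Compare two sentences by the cosine of their TF-IDF vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

a = "Delusions, hallucinations, and disorganized speech for one month."
b = "Delusions, hallucinations, and disorganized speech for six months."

X = TfidfVectorizer().fit_transform([a, b])
print(cosine_similarity(X[0], X[1])[0, 0])  # closer to 1 = more similar
```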
Today we started checking for duplication in the data set. I wrote code that transfers a sample of the symptoms into the dictionary, and I tried to match the same index between the nodes and relationships files. We had a short talk with Jordan and discussed what to consider when handling the data next:
1. Some of the symptom categories are invalid, because we added symptoms that were missing from the original data file but did not record their categories in our new file. I think we might need to use NLTK in Python to match each symptom with its category (or is there an easier way to do this?).
2. We now need to think about the questions to ask. Our goal is to get the end points as close as possible to the correct diagnosis within about 20 questions, so we need to choose our differentiators. If the patient has a rough idea of his/her symptoms, good: we can offer some of the symptoms that distinguish the two possible disorders he/she described. If the patient has no idea what he/she has, we might want to start with the "top" category. Top categories are a relative ranking of the categories by how good a predictor each one is for getting us to the right chapter. After we finish fixing the categories, we might need to calculate the frequency of each category; the category that eliminates the most possibilities becomes our top choice of question to ask. (Probably in R, to figure out the frequency of each category.)
3. Combine the two sheets based on the shared index and convert the CSV file into a JSON file (see the sketch after this list).
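A sketch of item 3, joining the two sheets on their shared index and writing JSON; the file names and the "index" column name are assumptions:

```python
# Join the nodes and relationships sheets on their shared index,
# then dump the combined table as JSON keyed by index.
import pandas as pd

nodes = pd.read_csv("nodes.csv", index_col="index")
rels = pd.read_csv("relationships.csv", index_col="index")

merged = nodes.join(rels, how="left", rsuffix="_rel")  # suffix avoids column clashes
merged.to_json("combined.json", orient="index")
```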
Today everyone finished the data input, and we can finally call the data cleaning part of the 4000+ symptoms done. Things we are going to do next include:
1. Convert the CSV file into a JSON file using a Python dictionary, with a format like the following (a hedged conversion sketch appears at the end of this entry):
{dx: XXX
  [name]
  [chapter]
  [evidence1]
  [evidence2]
  ...
  [age of onset]
  [gen_duration]
  ...
}
2. Check the data set for duplication.
Because everyone is working on a different part and many symptoms are shared by several disorders, we may have put a relationship or detailed description into the two files more than once. We will track down as much duplication as possible.
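A hedged sketch of item 1, grouping the symptom rows by disorder code and emitting one dictionary per dx in roughly the format above; the file name and column names are guesses at our schema:

```python
# Group symptom rows by disorder code and emit one dict per dx.
import json

import pandas as pd

df = pd.read_csv("symptoms.csv")

records = {}
for dx, rows in df.groupby("dx"):
    first = rows.iloc[0]
    records[dx] = {
        "name": first["name"],
        "chapter": first["chapter"],
        "evidence": rows["evidence"].dropna().tolist(),
        "age_of_onset": first["age_of_onset"],
        "gen_duration": first["gen_duration"],
    }

with open("disorders.json", "w") as f:
    json.dump(records, f, indent=2, default=str)  # default=str handles numpy types
```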
Finished my part and am now working on rows 700-800. Tomorrow I will spend part of my time on the mental health project and part on the 3D modeling.
Today Mr. G. Grinstein was invited to review our projects and offer suggestions, and I found it really helpful for our survival center project.
Finished 360 rows in total over today and yesterday.
Some thoughts about today's work:
- Each time we input something into the dataset file, we need to check whether the same information is already in the file (a minimal sketch of such a check follows this list).
- If some features are impossible to encode in the file, we will simplify the encoding and accept minor mistakes.
- Today's presentation was really helpful. It is really important to question existing models and ideas that are inherently biased.
- Still working on data input; hopefully we will finish half of the data by the end of tomorrow.
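A minimal sketch of the duplicate check in the first bullet above; the file name and the key columns are assumptions about our sheet:

```python
# Flag rows that were entered more than once, keyed on assumed columns.
import pandas as pd

df = pd.read_csv("symptoms.csv")
dupes = df[df.duplicated(subset=["dx", "evidence"], keep=False)]
print(dupes.sort_values(["dx", "evidence"]))  # rows entered more than once
```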
We started a new version of the data file. We separated the data into two files: a node file, containing all the information about the symptoms and evidence, and a connection file, containing the information that links the nodes together. This shrinks the width of the data file (but why don't we like wide data files???) and makes it better suited to constructing the decision tree. We now have rows for all the elements from our original version, such as the age of onset, the number of evidence items required, the chapter number, and the codes of all disorders. Each row has its index, its detailed information, and the type of the information. The other file has the columns "From", "To", and "Type". We give different attributes to a symptom by constructing paths from evidence, chapter, general_duration, etc. to the code, which represents the disorder (an illustrative sketch of the two-file layout appears at the end of this entry). We have 4000+ records to reorganize and have finished about 1/4 by the end of today, so there is still a long way to go over the next few days. Some of the questions I have include:
- Encoding some traits of a disorder is still unclear. For example:
  "Among 15 of the criteria, 3 have to be presented in the past 12 months and at least one should be within 6 months..."
  We don't know which 3 will be present, and we are also unclear which one should fall within the 6 months.
- Some of the timeOfOnset values are weird. For example, one says "within 1 minute of injection...", so the time of onset would be 1/1440 days, since we use "day" as our unit.
- We need to be really careful when encoding the data, as there is a lot of duplication among symptoms and we don't want duplication.
To-do list for tomorrow:
- Continue encoding the data and finish more than half of the data input.
- Figure out how to solve the questions I had today.
- Double-check the symptoms for duplication.
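An invented illustration of the two-file layout described in this entry (the rows below are examples, not entries from our real data):

```python
# Invented example rows: a node file (index, detail, type) and a
# connection file (From, To, Type) linking attributes to a disorder code.
import pandas as pd

nodes = pd.DataFrame([
    {"index": 1, "detail": "F20.9", "type": "code"},
    {"index": 2, "detail": "Delusions", "type": "evidence"},
    {"index": 3, "detail": "6 months", "type": "gen_duration"},
])

links = pd.DataFrame([
    {"From": 2, "To": 1, "Type": "evidence"},      # evidence -> disorder code
    {"From": 3, "To": 1, "Type": "gen_duration"},  # duration -> disorder code
])
print(nodes, links, sep="\n\n")
```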
Cleaned up the data file and filled in all the blank cells. The data input was basically finished this week; next week we will go through all the data and fix mistakes. These two days we spent most of our time understanding the material in Chapter 17 and finding better ways to encode all the information in the Excel file. After the data is checked, we will put it into the JSON and work on the interface. Some questions that are still unsolved include:
1. When two episodes are related to each other (e.g. B occurs within 24 hours after A).
We are thinking about creating another column to record such relationships.
2. Specifiers.
We have put almost all the data into the file, but the specifiers for each disorder are still messy and disorganized. We have to think of a way to handle them.
The plan for next week:
- Finish data checking and cleaning.
- Get familiar with coding in D3 and learn about the JSON file and the construction of the interface.
- Finished data input for more than 200 symptoms; we will finish the data input by the end of tomorrow.
- I have marked the symptoms with ambiguous descriptions in yellow; after putting all the data into the file, I will check the yellow cells again to make sure all the ambiguous symptoms get the right classification.
- Some of the problems we had today include:
(a) Some different symptoms have the same code, so we will append something to the code to distinguish the two symptoms (e.g. F16.10 and F16.20).
(b) Some pieces of evidence have a group of evidence as their identifier. We treat that group as the "evidence of the evidence", meaning we put the index of the main evidence they support in the "evidence_for" column.
(c) Normally a symptom has a smallest number of evidence items that serve as identifiers. However, some of these symptoms have different most_recent_episode values. For example: "as manifested by the presence of at least three of the following 15 criteria in the past 12 months from any of the categories below, with at least one criterion present in the past 6 months." We handle this by creating two rows for the different most_recent_episode values. In the quoted case, one row has 2 as the smallest number of evidence items with "within 12 months" as the most recent episode, and the other row has 1 as the smallest number of evidence items with "within 6 months" as the most recent episode (an invented illustration follows below).
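An invented illustration of the two-row encoding in (c); the dx code and column names are placeholders, not our actual schema:

```python
# Two rows encoding one criterion: 3 items total within 12 months,
# of which at least 1 must fall within 6 months (split as 2 + 1).
import pandas as pd

rows = pd.DataFrame([
    {"dx": "FXX.X", "min_evidence": 2, "most_recent_episode": "within 12 months"},
    {"dx": "FXX.X", "min_evidence": 1, "most_recent_episode": "within 6 months"},
])
print(rows)
```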
Today we finalized the data input format and created the data file with the new format. We need to go through every disorder in DSM-5 and use only the information provided in the diagnostic criteria. In the new format, we input the information per symptom rather than per disorder. For each symptom, we need to indicate whether it is a symptom or evidence of a symptom, and if it is evidence, which symptom it belongs to. Also, for some disorders a certain number of evidence items must be present to qualify as that disorder, so we have another column that records the required number of evidence items.
Tasks for tomorrow:
- Finish collecting and inputting the data in the data file;
- Organize the questions we encountered and find a time to discuss them.
These two weeks we updated the duration and the age of onset of each disorder, and we are now revising and improving the data we have on hand. Because some disorders have different episodes/symptoms/severities with different durations, we are now recording the duration of each symptom for each disorder. For age of onset and duration, we could only check the criteria for each disorder, which are only given in the orange boxes in DSM-5.