Discussion Notes
-
Analysis data set ready for some common words and medical hashtags.
-
Classify Medical vs Non-Medical in progress. (Using clustering)
-
Discuss algorithm, data science technique to approach the next goal.
-
Discuss a technique to look at 1% live stream for Medical Information.
-
Spam filtering : We need to discard tweets related to marketing and spam. Hugo is looking into this. Join him if this topic interests you.
-
ICD-9 Codes : Could we potentially use these codes to filter and classify relevant tweets? The general consensus was that Twitter users are unlikely to use such codes.
-
Medical APIs / Medical Dictionaries - Is there a simple dictionary / API which we could use to classify data? Tapan and Chenchal are looking into this. Let them know if you come across something useful. (example : www.dictionaryapi.com)
-
Correlation Analysis for common Diseases : Collect data for common disease terms like Asthma, Flu, Cancer, COPD and Diabetes. (Pratik to do).
Analyse this data and find correlated medical subjects. Sara and Patrick to work on this. Data set will be available by Saturday. Do join them and share your ideas on this analysis.
- Correlate sentiment findings to medical cost, waiting time etc by state / city. A useful resource could be : http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/index.html
-
Data Scientist (Sayali Kulkarni) who has worked on twitter feeds in the past to join us on this Saturday’s conference. Will give pointers for analysis.
-
Meetup organized for Bay area Residents (at 900 E Hamilton Avenue, Campbell) to coincide with Saturday conference. Please RSVP
-
Number of attendees limited to 25. Please RSVP
Vote and discuss on Ideas at Trello. Sign up for tasks at Trello.
-
Data collection is going very well. A couple more ideas are being explored.
-
An approach for classifying Medical vs Non-Medical is underway. The system will only improve over time.
-
Looking ahead to start working with algorithms / models on Medical Data.
-
Inviting a data scientist during next Saturday’s Global conference to give inputs.
-
Also, inviting members of this project to write to me about models that we can implement to solve goals 2,3,4,5 (Trello). Ideally, would like 2 members to present during next weekend an algorithm / tool / model which we should consider implementing.
-
A good amount of data for Hashtags collected (about 1 week, 20 MB).
-
Data for common words (‘Medical’, ‘Doctor’, ‘Hospital’) collected (1 Week - 50 MB - 1 Million Tweets)
Classification of tweets into Medical vs Non-Medical: (A clustering Analysis)
-
Using clustering algorithm (Mallet - Topic Modelling). Refer to Clustering Analysis on Google Drive shared folder. Created 200 themes / topics. Currently 6 of us working to Classify these themes / topics into Medical vs Non-Medical.
-
Idea is to classify clusters into Medical or Non-Medical and hence have a large set of tweets classified, which can then be used in a supervised algorithm to classify future tweets.
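The idea above (label each cluster once, let every tweet inherit its cluster's label, then train a supervised model on the result) can be sketched as follows. This is only a minimal illustration under assumptions: the function names, the topic ids and the sample data are hypothetical, and a small hand-written multinomial Naive Bayes stands in for whichever supervised algorithm the group eventually picks.

```python
import math
from collections import Counter, defaultdict

PUNCT = ".,!?:;\"'()"

def tokenize(text):
    """Lowercase a tweet and split it into simple word tokens."""
    return [w.strip(PUNCT) for w in text.lower().split() if w.strip(PUNCT)]

def label_tweets(tweets_with_topic, topic_labels):
    """Give each tweet the label of its cluster; skip clusters nobody has reviewed yet."""
    return [(text, topic_labels[topic])
            for text, topic in tweets_with_topic if topic in topic_labels]

def train_naive_bayes(labelled):
    """Count words per label (the training step of multinomial Naive Bayes)."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labelled:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def predict(model, text):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in tokenize(text):
            if w in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

With 200 topics, `topic_labels` would be the hand-made mapping from topic id to "medical" / "non-medical", and `tweets_with_topic` would pair each tweet with its dominant topic from the clustering output.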
- Other common words to collect data
(Clinic, Nurse, Cancer, flu, Disease, 911, medicine, fever, surgeon ) Please add to this list.
-
1 % live stream data collection and Cleanup.
-
Find top Hospital and their Usernames / Hashtags to collect data.
-
Find influential medical websites, (These websites might give pointers to collect tweets)
Medical Words Dictionary / Taxonomy :
-
Find a good source of all medical terms, disease names, symptom names, hospital names, etc.
-
This will help filter 1% live tweets into useful or non-useful information.
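A dictionary-based filter of this kind could look roughly like the sketch below. The term list is just the common words and disease names already mentioned in these notes, standing in for a real medical dictionary or taxonomy; the function names and the `min_hits` parameter are assumptions for illustration.

```python
import re

# Placeholder dictionary: words and disease terms listed in these notes.
# A real run would load a much larger medical dictionary / taxonomy.
MEDICAL_TERMS = {
    "clinic", "nurse", "cancer", "flu", "disease", "medicine",
    "fever", "surgeon", "asthma", "copd", "diabetes",
}

def medical_terms_in(text, terms=MEDICAL_TERMS):
    """Return the dictionary terms that appear in a tweet, sorted alphabetically."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return sorted(words & terms)

def is_useful(text, terms=MEDICAL_TERMS, min_hits=1):
    """Treat a tweet as useful when it contains at least `min_hits` dictionary terms."""
    return len(medical_terms_in(text, terms)) >= min_hits
```

Exact word matching like this will miss misspellings and multi-word terms, so it is best seen as a first-pass filter ahead of the classifier.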
-
Algorithms for further challenges
-
Inviting a data scientist for next weekend’s conference.
-
Also inviting presentations from members on useful algorithms for implementation.
Action Items : Project is well underway. Many of the tasks described above need input. Look up tasks at Trello, and subscribe to tasks. Finally, a call for a short presentation during next weekend.
Attendees (Healthcare Twitter 1) : Lorren, Hugo, Rong, Tapan, Chenchal, Pratik
Attendees (Healthcare Twitter 2): Lorren, Omkar, Kartik, Liang, Scott, Joon, Pratik
Step 1 - Collect entire tweet with all attributes in JSON format for
a) Medical Hashtags
b) Simple Medical Words
c) 1% live streaming data for USA
Step 2 - Run cleaning scripts (to be provided soon) to remove unneeded information and to convert the JSON file to a CSV file.
Step 3 - Share this cleaned CSV data, with only the required fields and zipped up, via Google Drive for everyone to download and analyze.
(If more fields are required in the future for analysis, we change the script in step 2 slightly and rerun step 2, 3 to share more fields for analysis)
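Until the actual cleaning scripts are shared, Step 2 can be pictured with the sketch below. It assumes the collected tweets are stored one JSON object per line and that the field names follow the Twitter v1.1 tweet format; the chosen `FIELDS` list and the function names are hypothetical. Adding a field later only means extending `FIELDS` and rerunning.

```python
import csv
import json

# Hypothetical field selection; dotted names reach into nested objects.
FIELDS = ["id_str", "created_at", "text", "retweet_count",
          "user.screen_name", "user.location"]

def pick(tweet, dotted_key):
    """Follow a dotted key such as 'user.location'; return '' if it is absent."""
    value = tweet
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return ""
        value = value[part]
    return "" if value is None else value

def json_lines_to_csv(lines, out, fields=FIELDS):
    """Write one CSV row per valid tweet and return the number of rows kept."""
    writer = csv.writer(out)
    writer.writerow(fields)
    kept = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or corrupt records
        if not isinstance(tweet, dict) or "text" not in tweet:
            continue  # skip non-tweet messages such as delete notices
        row = [pick(tweet, f) for f in fields]
        # Collapse newlines and repeated spaces so each tweet stays on one row.
        writer.writerow([" ".join(v.split()) if isinstance(v, str) else v
                         for v in row])
        kept += 1
    return kept
```

In practice `lines` would be an open file of collected tweets and `out` a file opened with `newline=""`, as the `csv` module recommends.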
-
Distinguish medical tweets based on medical terms used in the tweet: Make a dictionary of medical words / diseases / hospitals etc. Create a taxonomy of medical terms, potentially looking for information through Medical APIs.
-
Directed graphs - Build a wordnet graph for Medicines, Medical words, Hospitals, Symptoms, drugs. Find relevant tweets using this graph. (Example - If two words close in the graph occur together, then the tweets are more medically relevant.)
Challenges : Creating a graph of such medical information could be extremely challenging.
Extension : Create a graph using the twitter data that we have collected, and some clusters will automatically be more medically relevant as compared to others. (This is similar to the concept number 2, proposed at Trello)
- Find patterns using co-occurrence of words -
Joon has uploaded some R scripts on GitHub with the steps to use them. The script finds the co-occurrence of 'n' two-word pairs. It also lists common adjectives and adverbs describing these word pairings.
(Pseudo code and Examples to be uploaded shortly)
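Pending the pseudo code, here is a rough Python illustration of the word-pair co-occurrence idea. It is not a port of Joon's R script: the stopword list, the function name and the choice to count each pair at most once per tweet are all assumptions, and the adjective / adverb listing is left out.

```python
import re
from collections import Counter
from itertools import combinations

# Tiny illustrative stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "the", "and", "or", "to", "of", "in",
             "is", "for", "my", "i", "at", "on"}

def top_word_pairs(tweets, n=10, stopwords=STOPWORDS):
    """Return the n word pairs that appear together in the most tweets."""
    pair_counts = Counter()
    for text in tweets:
        # Unique words per tweet, sorted so each pair has one canonical order.
        words = sorted(set(re.findall(r"[a-z']+", text.lower())) - stopwords)
        pair_counts.update(combinations(words, 2))
    return pair_counts.most_common(n)
```

The same pair counts can double as edge weights for the word graph discussed above: words are nodes, and a heavier edge means two words appear together more often.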
-
We should look at popular medical websites and extract tweets (understand which tweets to follow) from these websites.
-
Can we find out a good source of medical websites? Can we find a list of Hospital hashtags and get information relevant to hospitals in US? At least the top 50?
-
Also, Data collection to continue throughout the duration of the project.
-
Design consideration - Restrict analysis to high-impact tweets, defined as those with 1 or more retweets.
-
Keep track of flu, at the start of flu season.
- What are the attributes of a tweet? How do we find the location of origin for the tweet?
Tapan to create a task list entry for summarizing the Twitter attributes and will work on it.
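As a starting point for the location question, a tweet in the Twitter v1.1 JSON format can carry location in up to three places: exact `coordinates`, a tagged `place`, and the free-text `location` on the user's profile. The sketch below tries them in that order, assuming tweets are already parsed into Python dictionaries; the function name and the returned labels are hypothetical.

```python
def tweet_location(tweet):
    """Return (source, value) for the most precise location found on a tweet."""
    coords = tweet.get("coordinates")
    if coords and coords.get("coordinates"):
        lon, lat = coords["coordinates"]  # GeoJSON order is longitude, latitude
        return ("coordinates", (lat, lon))
    place = tweet.get("place")
    if place and place.get("full_name"):
        return ("place", place["full_name"])
    profile = (tweet.get("user") or {}).get("location")
    if profile:
        return ("profile", profile)  # free text typed by the user, may be unreliable
    return ("unknown", None)
```

Only a small share of tweets tend to be geotagged, so most results are likely to fall back to the profile text, which would then need cleaning before any by-state or by-city analysis.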
- Medical APIs, Medical Websites :
Task list entry created. More hands are needed to research these topics.
- 'Number of tweets' attribute - can we use this to filter out irregular users? (Very low or very high tweets)
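Both filters raised in these notes (keep only tweets with 1 or more retweets, and drop users with a very low or very high number of tweets) can be sketched as below. The field names `retweet_count` and `user.statuses_count` follow the Twitter v1.1 tweet format; the cut-offs of 10 and 100,000 tweets are made-up placeholders that the group would need to tune.

```python
def is_high_impact(tweet, min_retweets=1):
    """High-impact tweet: retweeted at least `min_retweets` times."""
    return tweet.get("retweet_count", 0) >= min_retweets

def is_regular_user(tweet, min_tweets=10, max_tweets=100_000):
    """Regular user: lifetime tweet count inside the (placeholder) bounds."""
    count = (tweet.get("user") or {}).get("statuses_count")
    return count is not None and min_tweets <= count <= max_tweets

def filter_tweets(tweets):
    """Keep tweets that pass both the retweet filter and the user filter."""
    return [t for t in tweets if is_high_impact(t) and is_regular_user(t)]
```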
-
Check out the task list at Trello and assign tasks on which you can help out or which interests you.
-
Also comment on the implementation ideas, Algorithms section and help break down some of them to next steps and approaches.
-
Research on software / algorithms / Analysis is always welcome.
-
Finally, check out the R scripts uploaded on GitHub. Run them to get a feel for executing scripts in R.
Trello : This will be our main collaboration forum. Do create an Id at https://trello.com/ if you have not created one already. Email me your Id to be added onto this forum. Tune into this forum on a regular basis.
We will write down ideas / algorithms / techniques that we want to Implement. We will then break it down into tasks following which, members can pick up tasks out here to work on.
All questions technical / non technical can be posted here and someone will help you out.
Google Drive : For sharing all data
Github : All Source Code will be placed here.
Github Wiki : All Documentation (Individual and Group). Do contribute towards building a strong repository of resources.
Weekly Conference : We will continue our weekly conference, on Tuesday / Wednesday via Google Hangout. Weekend conference on Saturday via Webex.
This is the main forum where we cross-pollinate ideas, break them down into workable tasks and figure out which direction to move in.
I encourage everyone tuned into this project to attend these meetings. As we are all geographically dispersed, the success of this project will depend on meaningful / efficient collaboration.
- Data extraction - Current approach - Consider tweets with certain medical hashtags and common words like ‘Medical’, ‘Doctor’ etc. and build an algorithm to extract meaningful tweets based on our learning of these tweets.
Another suggested approach - Stream all 1% live tweets and then create an algorithm to extract Medical Tweets. Advantages: data collection is quicker, and many medical tweets that we would otherwise miss due to our strict filtering criteria can now be found.
Both techniques have their pros and cons. For now, we will go ahead with both these strategies in mind.
-
Research - Will need a team to research the available algorithms / research work that we can utilize to solve some of the challenges on this project. Researching these topics may give us more ideas on how to move forward. Document the research findings and present them during the weekly conference so we can decide the immediate implementation steps.
-
Collaboration between members from different professional background.
The strength of this project lies in the variety of professional and research background of the team members. We should leverage this to solve the challenges on this project.
Try to complement all work with a simple explanation (in the case of code, I would add English-language pseudo code so that non-programmers in ‘R’, for example, would be able to grasp it and use it).
We can form micro-teams where each member plays out a different responsibility. For instance Joon and I, along with other programmers will work with analysts, researchers, modelers in bringing to life some algorithms and creating insights.
- Miscellaneous -
List of collaborators is almost complete. I will take it down from github now and share it via google docs.
Future Events to be added to google Calendar
Contribute to research on Wiki. Pick up tasks from Trello and list down your ideas and questions, so that they are open for discussion on this community.
Do Run Tweepy* scripts in Python and Tweepy Analysis Scripts in R. (This will get you setup with the environment for future code). If you face any challenges, post at Trello.
-
Github repository will be used only for code, data, analysis results.
-
Github Wiki Pages will be used for all documentation. Members list, Goals, Ideas, Tasks have been moved to Wiki Pages.
-
Feel free to add pages in the Wiki.
-
We need a collaboration tool for continued and focused discussions - a couple of members suggested using Trello. It's extremely simple to use, and a great collaboration tool.
-
Moving forward, we will move all our discussions to Trello. ( https://trello.com )
- We will have a conference call on Saturday, 9AM PDT (GMT-07:00) for everyone who could not make it to the call on Tuesday / Wednesday. (All members are invited, of course.)
- Collaborate to collect twitter stream : From tweepy.py, Tweepy_hashtags_1.py, Tweepy_hashtags_2.py etc.
- Write an algorithm to identify co-occurrence of words with Medical Hashtags, and hence define whether the Hashtag is relevant or not. (Given a Hashtag, define if it yields relevant Medical information or not)
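One simple way to turn this into a number is to score each hashtag by the share of its tweets that also contain a known medical word. The sketch below assumes that scoring rule; the function name, the `min_tweets` parameter and the sample hashtags used when trying it out are hypothetical.

```python
from collections import defaultdict

PUNCT = ".,!?:;\"'()"

def hashtag_relevance(tweets, medical_terms, min_tweets=1):
    """Map each hashtag to the fraction of its tweets containing a medical term."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for text in tweets:
        tokens = text.lower().split()
        hashtags = {t.lstrip("#").strip(PUNCT) for t in tokens if t.startswith("#")}
        # Only plain words count as evidence, so a hashtag cannot vouch for itself.
        words = {t.strip(PUNCT) for t in tokens if not t.startswith("#")}
        is_medical = bool(words & medical_terms)
        for tag in hashtags:
            if tag:
                totals[tag] += 1
                hits[tag] += is_medical
    return {tag: hits[tag] / totals[tag]
            for tag in totals if totals[tag] >= min_tweets}
```

A hashtag could then be called relevant when its score passes a threshold the group agrees on, with `min_tweets` raised so that rarely used hashtags are not judged on a handful of tweets.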
-
If we follow important medical tweets and their re-tweets, we might be able to create a graph of nodes in the medical tweets world. High level – Solve by thinking twitter analysis as a graph problem.
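To make the graph idea concrete, the sketch below builds a directed retweet graph in which an edge from A to B means "A retweeted B", then ranks accounts by how often they were retweeted. It assumes tweets in the Twitter v1.1 JSON format, where a retweet carries the original tweet under `retweeted_status`; the function names are hypothetical.

```python
from collections import Counter, defaultdict

def build_retweet_graph(tweets):
    """Return {retweeter: Counter({original_author: times_retweeted})}."""
    graph = defaultdict(Counter)
    for tweet in tweets:
        original = tweet.get("retweeted_status")
        if not original:
            continue  # not a retweet, so it adds no edge
        author = original["user"]["screen_name"]
        retweeter = tweet["user"]["screen_name"]
        graph[retweeter][author] += 1
    return graph

def most_retweeted(graph, n=5):
    """Rank accounts by weighted in-degree, i.e. total retweets received."""
    indegree = Counter()
    for targets in graph.values():
        indegree.update(targets)
    return indegree.most_common(n)
```

Accounts at the top of this ranking are candidates for the "important medical tweets" to follow; richer graph measures could replace the plain in-degree count later.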
-
We can find out hashtags for all US hospitals and then check the sentiment of those tweets to find the happiest hospital.
-
Extend that concept to find out day of the week when patients are most frustrated.
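A toy version of the sentiment-by-weekday idea is sketched below. The two word lists are made-up placeholders rather than a real sentiment lexicon, the function names are hypothetical, and the weekday is taken from the tweet's `created_at` timestamp (Twitter v1.1 format), which is in UTC rather than the patient's local time.

```python
from collections import defaultdict
from datetime import datetime

# Placeholder lexicons for illustration only.
POSITIVE = {"great", "thanks", "kind", "helpful"}
NEGATIVE = {"waiting", "rude", "frustrated", "worst"}

PUNCT = ".,!?:;\"'()"

def sentiment(text):
    """Score a tweet as (number of positive words) - (number of negative words)."""
    words = [w.strip(PUNCT) for w in text.lower().split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def mean_sentiment_by_weekday(tweets):
    """Average the sentiment score of tweets for each day of the week."""
    scores = defaultdict(list)
    for tweet in tweets:
        created = datetime.strptime(tweet["created_at"],
                                    "%a %b %d %H:%M:%S %z %Y")
        scores[created.strftime("%A")].append(sentiment(tweet["text"]))
    return {day: sum(vals) / len(vals) for day, vals in scores.items()}
```

The day with the lowest average would be the "most frustrated" day; grouping by hospital hashtag instead of weekday gives the happiest-hospital ranking in the same way.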
-
Find flu / disease trends using twitter.
Using past tweets
-
The impact of the Boston bombing on emergency department waiting times / service quality
-
Rating/approval/confidence at the announcement of "ObamaCare"
-
Many ideas / Many application areas – We need to define collectively what challenge we should try to solve, and what should be our direct next steps. Can’t solve everything.
-
Let’s share relevant material, whitepapers and research and, if possible, complement it with a summary and how we can apply that information to this project.
- Create a Trello Id and add your Trello Id to the Members List on the Wiki. I will add all of you to the Trello Project Forum. ( http://trello.com )