Automatic information extraction and classification
A typical writing process involves gathering information from different sources, combining it, and explaining how it all fits together. In almost all cases, the writer does this by researching online, finding the best bits from other sources, and combining these extracts with their own unique take on the topic. However, the research portion can take a long time, which detracts from the more interesting and human part of writing. So I wondered if it'd be possible to automate the sometimes boring task of extracting and categorizing the most important information on a subject, giving the human more time to think about their unique outlook on the topic.
It turns out that a computer can do this, using modern NLP and machine learning techniques. Though the computerized version definitely can't achieve human-level results, it can help a writer gather initial information from a variety of sources and gain basic knowledge of the field.
Here's the process the computer goes through when synthesizing the data:
First, it downloads the Wikipedia article on the topic. This gives the computer initial information, sorted into headings that can easily be extracted.
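In code, that first step might look something like this sketch, which uses the `wikipedia` package and splits the plain-text dump on its `== Heading ==` markers (the actual implementation may differ):

```python
import re
import wikipedia

def get_wikipedia_sections(topic):
    """Download a Wikipedia article and split its plain text by heading."""
    content = wikipedia.page(topic).content
    # Wikipedia's plain-text dump marks headings as "== Heading ==".
    parts = re.split(r"\n==+ (.+?) ==+\n", content)
    sections = {"Introduction": parts[0]}  # text before the first heading
    for heading, body in zip(parts[1::2], parts[2::2]):
        sections[heading] = body
    return sections
```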
It goes through each Wikipedia section, splits it into sentences, and trains an SVM classifier to identify which heading a given sentence belongs under.
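A standard way to build such a classifier is a TF-IDF plus linear-SVM pipeline in scikit-learn; this sketch assumes the `sections` dict from the previous snippet, and the exact features are my guess:

```python
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

nltk.download("punkt", quiet=True)  # sentence tokenizer model

sentences, labels = [], []
for heading, body in sections.items():
    for sentence in nltk.sent_tokenize(body):
        sentences.append(sentence)
        labels.append(heading)  # each sentence is labeled with its heading

# TF-IDF features feeding a linear SVM, one class per Wikipedia heading.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(sentences, labels)
```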
Since the Google web search API is deprecated, I had to create a Google Custom Search Engine and configure it to search the entire web and return JSON.
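Querying the Custom Search JSON API then comes down to a single GET request; `api_key` and `engine_id` below stand in for the credentials Google issues:

```python
import requests

def search_web(query, api_key, engine_id, count=10):
    """Return result URLs from Google's Custom Search JSON API."""
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": api_key, "cx": engine_id, "q": query, "num": count},
    )
    resp.raise_for_status()
    return [item["link"] for item in resp.json().get("items", [])]
```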
While doing initial testing, I noticed that some websites didn't load properly because they contained content injected by JavaScript.
To handle these, each page is loaded in PhantomJS, and once it finishes loading, the script getHTML.js prints out the rendered HTML. The Python script runs this PhantomJS script as a subprocess and captures its output to obtain the fully rendered HTML of the page.
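On the Python side, that hand-off can be a simple subprocess call; the command-line interface of getHTML.js shown here (URL as an argument, rendered HTML on stdout) is an assumption:

```python
import subprocess

def get_rendered_html(url, timeout=30):
    """Run getHTML.js under PhantomJS and capture the rendered HTML it prints."""
    return subprocess.check_output(
        ["phantomjs", "getHTML.js", url],  # assumed CLI: URL passed as an argument
        timeout=timeout,
    ).decode("utf-8")
```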
It loads the extracted HTML into an Article object and parses it to get the main text of the article, while ignoring sidebars and other unrelated content.
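The Article class is presumably the one from the newspaper library; if so, the rendered HTML from the previous step can be handed to it directly rather than letting it re-download the page:

```python
from newspaper import Article

def extract_main_text(url, rendered_html):
    """Parse the PhantomJS-rendered HTML and keep only the article body."""
    article = Article(url)
    article.download(input_html=rendered_html)  # reuse the HTML we already have
    article.parse()
    return article.text  # main text, with sidebars and boilerplate stripped
```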
It runs the trained SVM classifier on each sentence of the article, and adds each sentence to the Wikipedia sentences for its predicted section.
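Reusing the `clf` pipeline from the training sketch, this step is essentially one predict loop:

```python
import nltk

def classify_into_sections(article_text, clf, section_sentences):
    """Assign each article sentence to the heading the SVM predicts for it."""
    for sentence in nltk.sent_tokenize(article_text):
        heading = clf.predict([sentence])[0]
        section_sentences.setdefault(heading, []).append(sentence)
```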
It combines all of the sentences for each section, then uses the sumy module to summarize them, choosing the number of summary sentences based on how many sentences the section currently has.
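Roughly, that looks like the sketch below; the choice of LexRank and the keep-one-in-ten scaling rule are just assumptions here, and other summarizers or ratios would work too:

```python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

def summarize_section(sentences):
    """Condense a section's sentences, keeping roughly one in ten."""
    parser = PlaintextParser.from_string(" ".join(sentences), Tokenizer("english"))
    target = max(1, len(sentences) // 10)  # assumed scaling rule
    return [str(s) for s in LexRankSummarizer()(parser.document, target)]
```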
It outputs each section heading, followed by each sentence of the section in a bulleted list. Originally, the sentences were output in paragraph form, but when I read through it, the sentences didn't "flow" well together, which made the writing sound jerky and artificial. Once I put them in a bulleted list, the output felt much more natural, like a typical list of notes.
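Assembling the final output is then just string formatting, roughly like this:

```python
def format_output(summaries):
    """Render each section heading followed by its summary as a bulleted list."""
    lines = []
    for heading, sentences in summaries.items():
        lines.append(heading)
        lines.extend("- " + sentence for sentence in sentences)
        lines.append("")  # blank line between sections
    return "\n".join(lines)
```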
When Google first came out with its search engine in the late 1990s, knowledge became much more easily accessible to students and researchers everywhere. I feel like this project could be the beginning of a new type of search engine, one that goes a level further than what Google has already done. This could be useful not just for writers, but for anyone who wants to get quick information from a variety of sources.
Of course, the current implementation is nowhere near this ideal, so a lot of work remains. Here are some specific areas I'd like to focus on:
- Source selection. I'd like to develop some sort of algorithm to cull sources, so that when I search for "pizza", it gives me research information on pizza, not the website for pizzahut.com. I'm not exactly sure how I'd do this yet.
- Sentence classification. Though I feel like the SVM classifier does a surprisingly good job with limited data, it doesn't always get it perfectly right. I'm thinking of trying something like doc2vec so the classifier has some semantic knowledge built in.
- Summarization. I'd like it to eliminate irrelevant sentences and sentences that need more context.
- Speed. Currently it takes about 2 minutes to generate the synthesis. Most of this time is spent downloading the web articles, so it'd really help if I could pre-download webpages, kind of like search engines do. This isn't really feasible for me, though.
- Images. I'd like to intersperse images with the rest of the content, probably one image per category.