Fixes and improvements #6

Merged
merged 1 commit into from
Oct 14, 2013
4 changes: 4 additions & 0 deletions .gitignore
@@ -13,3 +13,7 @@ project/plugins/project/

# Scala-IDE specific
.scala_dependencies

# Idea-specific
.idea
.idea_modules
11 changes: 9 additions & 2 deletions build.sbt
@@ -19,7 +19,14 @@ libraryDependencies ++= Seq(
"com.foursquare" %% "rogue-index" % "2.2.0" intransitive(),
"net.liftweb" %% "lift-mongodb-record" % "2.5.1",
"org.slf4j" % "slf4j-api" % "1.7.5",
"ch.qos.logback" % "logback-classic" % "1.0.13"
"ch.qos.logback" % "logback-classic" % "1.0.13",
"com.google.guava" % "guava" % "15.0",
"com.google.code.findbugs" % "jsr305" % "2.0.2"
)

resolvers += "OpenNLP Repository" at "http://opennlp.sourceforge.net/maven2/"
resolvers += "OpenNLP Repository" at "http://opennlp.sourceforge.net/maven2/"

resolvers += "Sonatype Snapshots" at "http://oss.sonatype.org/content/repositories/snapshots"

resolvers += "Sonatype Releases" at "http://oss.sonatype.org/content/repositories/releases"

7 changes: 6 additions & 1 deletion project/plugins.sbt
@@ -1,3 +1,8 @@
addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "2.2.0")

addSbtPlugin("com.typesafe.sbt" % "sbt-start-script" % "0.9.0")
addSbtPlugin("com.typesafe.sbt" % "sbt-start-script" % "0.9.0")

resolvers += "Sonatype snapshots" at "http://oss.sonatype.org/content/repositories/snapshots/"

addSbtPlugin("com.github.mpeltonen" % "sbt-idea" % "1.6.0-SNAPSHOT")

4 changes: 2 additions & 2 deletions src/main/scala/com/textteaser/summarizer/Main.scala
@@ -27,12 +27,12 @@ object Main extends App {

val id = "anythingyoulikehere"
val title = "Astronomic news: the universe may not be expanding after all"
val text = "Now that conventional thinking has been turned on its head in a paper by Prof Christof Wetterich at the University of Heidelberg in Germany. He points out that the tell-tale light emitted by atoms is also governed by the masses of their constituent particles, notably their electrons. The way these absorb and emit light would shift towards the blue part of the spectrum if atoms were to grow in mass, and to the red if they lost it. Because the frequency or �pitch� of light increases with mass, Prof Wetterich argues that masses could have been lower long ago. If they had been constantly increasing, the colours of old galaxies would look red-shifted � and the degree of red shift would depend on how far away they were from Earth. �None of my colleagues has so far found any fault [with this],� he says. Although his research has yet to be published in a peer-reviewed publication, Nature reports that the idea that the universe is not expanding at all � or even contracting � is being taken seriously by some experts, such as Dr HongSheng Zhao, a cosmologist at the University of St Andrews who has worked on an alternative theory of gravity. �I see no fault in [Prof Wetterich�s] mathematical treatment,� he says. �There were rudimentary versions of this idea two decades ago, and I think it is fascinating to explore this alternative representation of the cosmic expansion, where the evolution of the universe is like a piano keyboard played out from low to high pitch.� Prof Wetterich takes the detached, even playful, view that his work marks a change in perspective, with two different views of reality: either the distances between galaxies grow, as in the traditional balloon picture, or the size of atoms shrinks, increasing their mass. Or it�s a complex blend of the two. One benefit of this idea is that he is able to rid physics of the singularity at the start of time, a nasty infinity where the laws of physics break down. 
Instead, the Big Bang is smeared over the distant past: the first note of the ''cosmic piano�� was long and low-pitched. Harry Cliff, a physicist working at CERN who is the Science Museum�s fellow of modern science, thinks it striking that a universe where particles are getting heavier could look identical to one where space/time is expanding. �Finding two different ways of thinking about the same problem often leads to new insights,� he says. �String theory, for instance, is full of 'dualities� like this, which allow theorists to pick whichever view makes their calculations simpler.� If this idea turns out to be right � and that is a very big if � it could pave the way for new ways to think about our universe. If we are lucky, they might even be as revolutionary as Edwin Hubble�s, almost a century ago. Roger Highfield is director of external affairs at the Science Museum"
val text = "Now that conventional thinking has been turned on its head in a paper by Prof Christof Wetterich at the University of Heidelberg in Germany. He points out that the tell-tale light emitted by atoms is also governed by the masses of their constituent particles, notably their electrons. The way these absorb and emit light would shift towards the blue part of the spectrum if atoms were to grow in mass, and to the red if they lost it. Because the frequency or “pitch” of light increases with mass, Prof Wetterich argues that masses could have been lower long ago. If they had been constantly increasing, the colours of old galaxies would look red-shifted – and the degree of red shift would depend on how far away they were from Earth. “None of my colleagues has so far found any fault [with this],” he says. Although his research has yet to be published in a peer-reviewed publication, Nature reports that the idea that the universe is not expanding at all – or even contracting – is being taken seriously by some experts, such as Dr HongSheng Zhao, a cosmologist at the University of St Andrews who has worked on an alternative theory of gravity. “I see no fault in [Prof Wetterich’s] mathematical treatment,” he says. “There were rudimentary versions of this idea two decades ago, and I think it is fascinating to explore this alternative representation of the cosmic expansion, where the evolution of the universe is like a piano keyboard played out from low to high pitch.” Prof Wetterich takes the detached, even playful, view that his work marks a change in perspective, with two different views of reality: either the distances between galaxies grow, as in the traditional balloon picture, or the size of atoms shrinks, increasing their mass. Or it’s a complex blend of the two. One benefit of this idea is that he is able to rid physics of the singularity at the start of time, a nasty infinity where the laws of physics break down. 
Instead, the Big Bang is smeared over the distant past: the first note of the ''cosmic piano’’ was long and low-pitched. Harry Cliff, a physicist working at CERN who is the Science Museum’s fellow of modern science, thinks it striking that a universe where particles are getting heavier could look identical to one where space/time is expanding. “Finding two different ways of thinking about the same problem often leads to new insights,” he says. “String theory, for instance, is full of 'dualities’ like this, which allow theorists to pick whichever view makes their calculations simpler.” If this idea turns out to be right – and that is a very big if – it could pave the way for new ways to think about our universe. If we are lucky, they might even be as revolutionary as Edwin Hubble’s, almost a century ago. Roger Highfield is director of external affairs at the Science Museum"

val article = Article(id, title, text)
val summary = summarizer.summarize(article.article, article.title, article.id, article.blog, article.category)

println(summary)

log.info("Summarization completed.")
}
}
19 changes: 10 additions & 9 deletions src/main/scala/com/textteaser/summarizer/Parser.scala
@@ -1,8 +1,9 @@
package com.textteaser.summarizer

import java.io.FileInputStream
import opennlp.tools.sentdetect._
import com.google.inject.Inject
import com.google.common.base.{CharMatcher, Splitter}
import scala.collection.JavaConverters

class Parser @Inject() (sentenceDetector: SentenceDetectorME, config: Config) {
val stopWords = getStopWords
@@ -17,13 +18,13 @@ class Parser @Inject() (sentenceDetector: SentenceDetectorME, config: Config) {
/*
* Split Words: Split words via white space and new lines. Then remove whites space in the resulting array.
*/
def splitWords(source: String) = source.split("\\s+|\\r?\\n+")
.map(_.replaceAll("\\W", "").toLowerCase())
.filter(_ != "")
def splitWords(source: String) = JavaConverters.iterableAsScalaIterableConverter(
Splitter.on(CharMatcher.JAVA_LETTER_OR_DIGIT.negate())
.trimResults().omitEmptyStrings()
.split(source)).asScala.toArray

def titleScore(titleWords: Array[String], sentence: Array[String]) = sentence
.filter(w => !stopWords.contains(w) && titleWords.contains(w)) // Removing stop words and only returning title words
.size / titleWords.size.asInstanceOf[Double]
def titleScore(titleWords: Array[String], sentence: Array[String]) =
sentence.count(w => !stopWords.contains(w) && titleWords.contains(w)) / titleWords.size.toDouble

def getKeywords(text: String) = {
val words = splitWords(text).filter(w => !stopWords.contains(w)) // removing stop words
@@ -58,11 +59,11 @@ class Parser @Inject() (sentenceDetector: SentenceDetectorME, config: Config) {
else if (normalized > 0.7 && normalized <= 0.8)
0.04
else if (normalized > 0.8 && normalized <= 0.9)
0.04;
0.04
else if (normalized > 0.9 && normalized <= 1.0)
0.15
else
0
0d
}

def getStopWords = Array("-", " ", ",", ".", "a", "e", "i", "o", "u", "t", "about", "above", "above", "across", "after", "afterwards", "again", "against", "all", "almost", "alone", "along", "already", "also", "although", "always", "am", "among", "amongst", "amoungst", "amount", "an", "and", "another", "any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are", "around", "as", "at", "back", "be", "became", "because", "become", "becomes", "becoming", "been", "before", "beforehand", "behind", "being", "below", "beside", "besides", "between", "beyond", "both", "bottom", "but", "by", "call", "can", "cannot", "can't", "co", "con", "could", "couldn't", "de", "describe", "detail", "did", "do", "done", "down", "due", "during", "each", "eg", "eight", "either", "eleven", "else", "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone", "everything", "everywhere", "except", "few", "fifteen", "fifty", "fill", "find", "fire", "first", "five", "for", "former", "formerly", "forty", "found", "four", "from", "front", "full", "further", "get", "give", "go", "got", "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter", "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his", "how", "however", "hundred", "i", "ie", "if", "in", "inc", "indeed", "into", "is", "it", "its", "it's", "itself", "just", "keep", "last", "latter", "latterly", "least", "less", "like", "ltd", "made", "make", "many", "may", "me", "meanwhile", "might", "mill", "mine", "more", "moreover", "most", "mostly", "move", "much", "must", "my", "myself", "name", "namely", "neither", "never", "nevertheless", "new", "next", "nine", "no", "nobody", "none", "noone", "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on", "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our", "ours", "ourselves", "out", "over", "own", "part", "people", "per", "perhaps", "please", "put", "rather", "re", "said", "same", "see", "seem", "seemed", 
"seeming", "seems", "several", "she", "should", "show", "side", "since", "sincere", "six", "sixty", "so", "some", "somehow", "someone", "something", "sometime", "sometimes", "somewhere", "still", "such", "take", "ten", "than", "that", "the", "their", "them", "themselves", "then", "thence", "there", "thereafter", "thereby", "therefore", "therein", "thereupon", "these", "they", "thickv", "thin", "third", "this", "those", "though", "three", "through", "throughout", "thru", "thus", "to", "together", "too", "top", "toward", "towards", "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us", "use", "very", "via", "want", "was", "we", "well", "were", "what", "whatever", "when", "whence", "whenever", "where", "whereafter", "whereas", "whereby", "wherein", "whereupon", "wherever", "whether", "which", "while", "whither", "who", "whoever", "whole", "whom", "whose", "why", "will", "with", "within", "without", "would", "yet", "you", "your", "yours", "yourself", "yourselves", "the", "reuters", "news", "monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday", "mon", "tue", "wed", "thu", "fri", "sat", "sun", "rappler", "rapplercom", "inquirer", "yahoo", "home", "sports", "1", "10", "2012", "sa", "says", "tweet", "pm", "home", "homepage", "sports", "section", "newsinfo", "stories", "story", "photo", "2013", "na", "ng", "ang", "year", "years", "percent", "ko", "ako", "yung", "yun", "2", "3", "4", "5", "6", "7", "8", "9", "0", "time", "january", "february", "march", "april", "may", "june", "july", "august", "september", "october", "november", "december", "philippine", "government", "police", "manila")
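Review note on the splitWords change in Parser.scala: the removed version split on whitespace, stripped non-word characters inside each token, and lower-cased the result. The Guava-based replacement splits on every character that is not a letter or digit and leaves case untouched. The sketch below compares the two behaviours using only the Scala standard library. splitWordsOld is copied from the diff; splitWordsNew is a regex approximation of the Splitter/CharMatcher call (an assumption, not the PR's code), and the object name and sample sentence are hypothetical.

```scala
object SplitWordsComparison {
  // Removed implementation, as shown in the diff:
  // split on whitespace, strip non-word characters, lower-case, drop empties.
  def splitWordsOld(source: String): Array[String] =
    source.split("\\s+|\\r?\\n+")
      .map(_.replaceAll("\\W", "").toLowerCase())
      .filter(_ != "")

  // Approximation (assumption) of the Guava-based replacement:
  // split on runs of characters that are neither letters nor digits,
  // drop empty tokens, and keep the original case.
  def splitWordsNew(source: String): Array[String] =
    source.split("[^\\p{L}\\p{Nd}]+").filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    val sample = "The Universe can't expand" // hypothetical input
    println(splitWordsOld(sample).mkString(", ")) // the, universe, cant, expand
    println(splitWordsNew(sample).mkString(", ")) // The, Universe, can, t, expand
  }
}
```

The stop-word list in getStopWords is all lower case and is checked with stopWords.contains(w), so a capitalised token such as "The" would no longer match "the" under the new splitting. If that is unintended, an explicit toLowerCase after the split may be needed; this is inferred from the visible hunks only, since the rest of Parser.scala is collapsed.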