hamlet term frequency counting

this tiny little project was inspired by an assignment to "test your programming skills" at my university.

task description

A frequent task in Information Retrieval (IR) is the calculation of term frequencies. For all terms it is to be counted how often they occur in a text. For this a term is defined as the stem of a word.
Examples:
word --> word stem (term)
going --> go
apple --> appl
apples --> appl

Within this task the document in question (the first scene of Shakespeare's Hamlet hamlet.xml) is in XML format. Therefore first that file must be downloaded and parsed. After this only the contents of the <LINE> is to be taken.

From this, the term frequencies (after stemming) are to be calculated. The output of the program is a list of all terms, together with the respective occurrence frequencies within elements. The output should look like this:
word --> count
go --> 7
appl --> 2
situat --> 5

For word stem reduction there is a variant of the famous Porter Stemming Algorithm available. http://www.ils.unc.edu/keyes/java/porter/PorterStemmer.java

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
resources		resources
.gitignore		.gitignore
HamletParser.java		HamletParser.java
HamletTermCount.java		HamletTermCount.java
README.md		README.md
Stemmer.java		Stemmer.java
WordCounter.java		WordCounter.java

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

hamlet term frequency counting

task description

About

Releases

Packages

Languages

BabsBerlin/IR_termFrequency

Folders and files

Latest commit

History

Repository files navigation

hamlet term frequency counting

task description

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages