# Naive context sentiment analysis

## Aim

This R script addresses a problem shared by many sentiment analysis scripts: they ignore valence shifters (e.g. "hardly difficult", "not great at all"). For a good outline of the issue, see trinker's argument and the sentimentr package here.

The sentimentr package does a remarkable job of handling valence shifters, but it requires "good" text data that is properly punctuated, because the valence-shifter weighting is done on "polarized context clusters" within sentences (i.e., you get one sentiment value per sentence).

Many text data are not suitable for that pipeline because they are

- not punctuated at all (e.g., auto-generated YouTube transcripts),
- badly punctuated (e.g., data from blogs, where punctuation is not necessarily a given),
- or very brief: Twitter data, for example, even if properly annotated for sentence boundary disambiguation, would return only one or two sentiment values.

## Why "naive context sentiment analysis"?

Our approach is based on the sentimentr idea of creating a "cluster" around sentiment words. Within that cluster, we look for valence shifters (taken from the brilliant lexicon package), weight the original sentiment accordingly, and return a vector of sentiments of size v (where v = the number of tokens that are not punctuation marks).
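To make the idea concrete, here is a minimal sketch of window-based valence-shifter weighting (an illustration in Python, not the package's actual R code; the toy lexicons, the `window` parameter, and the weighting factors are all invented for demonstration):

```python
# Illustrative sketch: naive context sentiment with a fixed token window
# around each polarized word, instead of sentence-level clusters.
NEGATORS = {"not", "hardly", "never"}                        # toy valence shifters
AMPLIFIERS = {"very", "really"}
POLARITY = {"great": 1.0, "difficult": -0.75, "good": 0.75}  # toy sentiment lexicon

def naive_context_sentiment(tokens, window=2):
    """Return one sentiment value per token (punctuation removed upstream)."""
    scores = []
    for i, tok in enumerate(tokens):
        score = POLARITY.get(tok, 0.0)
        if score != 0.0:
            # inspect the surrounding cluster of `window` tokens on each side
            cluster = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            for neighbour in cluster:
                if neighbour in NEGATORS:
                    score *= -1.0     # negator flips the polarity
                elif neighbour in AMPLIFIERS:
                    score *= 1.5      # amplifier strengthens it
        scores.append(score)
    return scores

print(naive_context_sentiment(["hardly", "difficult", "at", "all"]))
# -> [0.0, 0.75, 0.0, 0.0]  ("difficult" is negated by "hardly")
```

Because the window is defined over tokens rather than sentences, this works on unpunctuated or fragmentary text and always yields a vector of length v.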

Our approach does not rely on sentences or punctuation and is therefore "naive" towards the broader structure of texts.

Note: We are still developing this tool.

## Development wish list

- speed improvements (in particular in the length standardisation, e.g. switching to a different discrete cosine transformation or a Fourier transformation)
- multi-dimensionality implementation for other lexicon-based approaches (needed: "lexicon" as a function parameter)
- multi-language support (needs lexicon databases in different languages)
- Python implementation
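The length standardisation mentioned in the wish list could, for instance, be sketched as spectral resampling: keep the low-frequency coefficients of the sentiment vector and invert the transform at the target length. This is a hypothetical illustration using a Fourier transform (one of the alternatives the wish list names), not the package's actual code; `standardise_length` and its parameters are invented here.

```python
import numpy as np

def standardise_length(sentiments, n_out=100):
    """Resample a sentiment vector to n_out points via Fourier coefficients."""
    x = np.asarray(sentiments, dtype=float)
    spec = np.fft.rfft(x)                      # frequency-domain representation
    target = n_out // 2 + 1                    # rfft length for an n_out signal
    out = np.zeros(target, dtype=complex)
    k = min(len(spec), target)
    out[:k] = spec[:k]                         # truncate or zero-pad the spectrum
    # rescale so amplitudes stay comparable across input lengths
    return np.fft.irfft(out, n=n_out) * (n_out / len(x))

v = standardise_length([0.0, 0.75, 0.0, -0.5, 0.25], n_out=16)
print(len(v))  # 16
```

Standardising every document's sentiment vector to a common length makes the resulting trajectories directly comparable across texts of very different sizes.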