Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate emoji parsing; #74

Merged
merged 21 commits into from
Apr 1, 2017
Merged

Conversation

thorbert
Copy link
Contributor

No description provided.

@thorbert
Copy link
Contributor Author

thorbert commented Aug 3, 2016

It might be better to separate the AFINN and the emoji lists and let build/index.js build the combined product.

It'd probably also be better to preserve the token order instead of tacking emojis on the end.

@thorbert thorbert changed the title Integrage emoji parsing; Add emoji-regex to dependencies; Integrage emoji parsing; Aug 5, 2016
@thorbert thorbert changed the title Integrage emoji parsing; Integrate emoji parsing; Aug 5, 2016
@thisandagain thisandagain self-assigned this Oct 7, 2016
@peterpme
Copy link

👋 Can I help move this along?

@thisandagain
Copy link
Owner

thisandagain commented Mar 21, 2017

@peterpme Sure! Here is what I think we need:

  • Add peer reviewed validation dataset which includes emojis (see Add Twitter validation dataset #110)
  • Check to ensure that changes to the tokenizer included in this PR don't adversely impact performance too much
  • Resolve merge conflicts
  • Clean-up a few nitpicks from the PR

lib/index.js Outdated
@@ -34,12 +34,14 @@ module.exports = function (phrase, inject, callback) {
}

// Storage objects
var tokens = tokenize(phrase),
var tokens = [],
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the change here?

package.json Outdated
@@ -19,7 +19,8 @@
"test": "make test"
},
"dependencies": {
"lodash.assign": "4.1.0"
"lodash.assign": "4.1.0",
"spliddit": "^2.1.1"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use absolute version numbers (remove the ^)

build/index.js Outdated
@@ -5,35 +5,68 @@
* @author Andrew Sliwinski <andrew@diy.org>
*/

/*globals Promise:true*/
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather not introduce promises to this. This should be simple enough to be broken down into smaller functions to avoid callback hell.

lib/tokenize.js Outdated
var spliddit = require('spliddit');

module.exports = function(input) {
if (input === undefined) throw new Error('input undefined');
Copy link
Owner

@thisandagain thisandagain Mar 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This module should never throw. I think the expected behavior here would be for the tokenizer to return an empty array.

lib/tokenize.js Outdated
module.exports = function(input) {
if (input === undefined) throw new Error('input undefined');
if (input.trim() === '') return [];
var tokens = spliddit(input.toLowerCase()
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code path could use some documentation / line comments.

@thisandagain
Copy link
Owner

Resolves #24

package.json Outdated
@@ -19,7 +19,10 @@
"scripts": {
"test": "make test"
},
"dependencies": {},
"dependencies": {
"async": "^2.1.5",
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because async is only used by the build script it should only be included in devDependencies.

@thisandagain
Copy link
Owner

I was able to find a listing of emoji sentiment that we may want to use as the source (similar to how we use AFINN) for the sentiment module. I think I'd be more comfortable landing support with this as the backing dataset.

Overview

http://kt.ijs.si/data/Emoji_sentiment_ranking/index.html

Dataset

https://www.clarin.si/repository/xmlui/handle/11356/1048

Paper

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0144296

build/emoji.txt Outdated
@@ -0,0 +1,79 @@
😁 5
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should drop this file now that we are including Emoji_Sentiment_Data_v1.0.csv

@thisandagain
Copy link
Owner

thisandagain commented Mar 22, 2017

Thanks @thorbert ! This is really getting close. I think I just have two remaining comments and then we should do final testing.

@thisandagain
Copy link
Owner

Thanks @thorbert ! Just pulling everything down for testing now. Two things I'm observing:

  • Performance: the new tokenizer has reduced performance by almost 50%
  • Test coverage: test coverage has dropped from 100% to 87.5% (also due to changes in the tokenizer)

I'll keep digging to see if this might be easy to resolve.

@thisandagain
Copy link
Owner

thisandagain commented Mar 25, 2017

It appears that the use of spliddit.hasPair is the main performance culprit. Particularly where you are attempting to tokenize emoji that are not delimited by a space (e.g. 😀💩👻). This also happens to be where you don't have test coverage.

I don't really see a way around this completely yet, but having better unicode support for the tokenizer might be worth the performance cost. Here is a quick first shot at attempting to optimize and improve readability using a flat map:

var spliddit = require('spliddit');

/**
 * Applys a "flatten" reducer to the standard JS "map" array method.
 * @param  {Function} lambda Reducer
 * @return {array}           Flattened array
 */
Array.prototype.flatMap = function(lambda) {
    return Array.prototype.concat.apply([], this.map(lambda));
};

/**
 * Remove special characters and return an array of tokens (words).
 * @param  {string} input Input string
 * @return {array}        Array of tokens
 */
module.exports = function(input) {
    // Normalize input
    input = input
        .toLowerCase()
        .replace(/[.,\/#!$%\^&\*;:{}=_`\"~()]/g, '')
        .replace('/ {2,}/',' ');

    // Split on spaces
    input = spliddit(input, ' ');

    // Iterate over tokens and split unicode pairs
    return input.flatMap(function (t) {
        if (!spliddit.hasPair(t)) return t;
        return spliddit(t, '');
    });
};

Another major issue I noticed while testing is that something is amiss with the build script. Common emoji like 😀 and 💩 are included in Emoji_Sentiment_Data_v1.0.csv but are not present in the resulting JSON used by sentiment.

@thorbert
Copy link
Contributor Author

Yes, benchmarks don't look good. I can expand tests but I'm not sure my toy use case makes sense generally (SMS messages from a 4 year old which contain only undelimited emoji). Maybe we can toss that out if doing so helps to restore the performance measures.

@thisandagain
Copy link
Owner

@thorbert Sure. Let's remove that change to the tokenizer. We could always restore it later as a configuration option, but I wouldn't want all users to take on that performance hit.

lib/tokenize.js Outdated
var spliddit = require('spliddit');

module.exports = function(input) {
if (input === undefined) return [];
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we check if the input if undefined in the containing module (https://github.com/thisandagain/sentiment/blob/develop/lib/index.js#L42), this can be removed.

lib/tokenize.js Outdated

module.exports = function(input) {
if (input === undefined) return [];
if (input.trim() === '') return [''];
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is your thinking that this is a performance optimization? If so, I think I'd rather track it in a separate PR.

var test = require('tap').test;
var sentiment = require('../../lib/index');

var dataset = 'This is so cool ✓';
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to also include a test case with some more common emoji such as 😄 (that is how I found the build issue).

@thisandagain
Copy link
Owner

This is great! Thanks @thorbert for all your patience as we worked through the details. I'll land this now and then prep a release. 😄 👍

@thisandagain thisandagain merged commit 5d415b2 into thisandagain:develop Apr 1, 2017
@thisandagain thisandagain mentioned this pull request Apr 1, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants