-
Notifications
You must be signed in to change notification settings - Fork 311
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integrate emoji parsing; #74
Conversation
It might be better to separate the AFINN and the emoji lists and let build/index.js build the combined product. It'd probably also be better to preserve the token order instead of tacking emojis on the end. |
👋 Can I help move this along? |
@peterpme Sure! Here is what I think we need:
|
lib/index.js
Outdated
@@ -34,12 +34,14 @@ module.exports = function (phrase, inject, callback) { | |||
} | |||
|
|||
// Storage objects | |||
var tokens = tokenize(phrase), | |||
var tokens = [], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the change here?
package.json
Outdated
@@ -19,7 +19,8 @@ | |||
"test": "make test" | |||
}, | |||
"dependencies": { | |||
"lodash.assign": "4.1.0" | |||
"lodash.assign": "4.1.0", | |||
"spliddit": "^2.1.1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use absolute version numbers (remove the ^
)
build/index.js
Outdated
@@ -5,35 +5,68 @@ | |||
* @author Andrew Sliwinski <andrew@diy.org> | |||
*/ | |||
|
|||
/*globals Promise:true*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather not introduce promises to this. This should be simple enough to be broken down into smaller functions to avoid callback hell.
lib/tokenize.js
Outdated
var spliddit = require('spliddit'); | ||
|
||
module.exports = function(input) { | ||
if (input === undefined) throw new Error('input undefined'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This module should never throw. I think the expected behavior here would be for the tokenizer to return an empty array.
lib/tokenize.js
Outdated
module.exports = function(input) { | ||
if (input === undefined) throw new Error('input undefined'); | ||
if (input.trim() === '') return []; | ||
var tokens = spliddit(input.toLowerCase() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code path could use some documentation / line comments.
Resolves #24 |
package.json
Outdated
@@ -19,7 +19,10 @@ | |||
"scripts": { | |||
"test": "make test" | |||
}, | |||
"dependencies": {}, | |||
"dependencies": { | |||
"async": "^2.1.5", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because async
is only used by the build script it should only be included in devDependencies
.
I was able to find a listing of emoji sentiment that we may want to use as the source (similar to how we use AFINN) for the Overviewhttp://kt.ijs.si/data/Emoji_sentiment_ranking/index.html Datasethttps://www.clarin.si/repository/xmlui/handle/11356/1048 Paperhttp://journals.plos.org/plosone/article?id=10.1371/journal.pone.0144296 |
…NN and Emoji data into the same hash
build/emoji.txt
Outdated
@@ -0,0 +1,79 @@ | |||
😁 5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should drop this file now that we are including Emoji_Sentiment_Data_v1.0.csv
Thanks @thorbert ! This is really getting close. I think I just have two remaining comments and then we should do final testing. |
Thanks @thorbert ! Just pulling everything down for testing now. Two things I'm observing:
I'll keep digging to see if this might be easy to resolve. |
It appears that the use of I don't really see a way around this completely yet, but having better unicode support for the tokenizer might be worth the performance cost. Here is a quick first shot at attempting to optimize and improve readability using a flat map: var spliddit = require('spliddit');
/**
* Applys a "flatten" reducer to the standard JS "map" array method.
* @param {Function} lambda Reducer
* @return {array} Flattened array
*/
Array.prototype.flatMap = function(lambda) {
return Array.prototype.concat.apply([], this.map(lambda));
};
/**
* Remove special characters and return an array of tokens (words).
* @param {string} input Input string
* @return {array} Array of tokens
*/
module.exports = function(input) {
// Normalize input
input = input
.toLowerCase()
.replace(/[.,\/#!$%\^&\*;:{}=_`\"~()]/g, '')
.replace('/ {2,}/',' ');
// Split on spaces
input = spliddit(input, ' ');
// Iterate over tokens and split unicode pairs
return input.flatMap(function (t) {
if (!spliddit.hasPair(t)) return t;
return spliddit(t, '');
});
}; Another major issue I noticed while testing is that something is amiss with the build script. Common emoji like 😀 and 💩 are included in |
Yes, benchmarks don't look good. I can expand tests but I'm not sure my toy use case makes sense generally (SMS messages from a 4 year old which contain only undelimited emoji). Maybe we can toss that out if doing so helps to restore the performance measures. |
@thorbert Sure. Let's remove that change to the tokenizer. We could always restore it later as a configuration option, but I wouldn't want all users to take on that performance hit. |
lib/tokenize.js
Outdated
var spliddit = require('spliddit'); | ||
|
||
module.exports = function(input) { | ||
if (input === undefined) return []; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we check if the input if undefined
in the containing module (https://github.com/thisandagain/sentiment/blob/develop/lib/index.js#L42), this can be removed.
lib/tokenize.js
Outdated
|
||
module.exports = function(input) { | ||
if (input === undefined) return []; | ||
if (input.trim() === '') return ['']; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is your thinking that this is a performance optimization? If so, I think I'd rather track it in a separate PR.
var test = require('tap').test; | ||
var sentiment = require('../../lib/index'); | ||
|
||
var dataset = 'This is so cool ✓'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to also include a test case with some more common emoji such as 😄 (that is how I found the build issue).
This is great! Thanks @thorbert for all your patience as we worked through the details. I'll land this now and then prep a release. 😄 👍 |
No description provided.