
Add support for additional languages #128

Merged: 10 commits into thisandagain:develop on Apr 25, 2018

Conversation

@elyas-bhy (Collaborator) commented Dec 11, 2017

This PR resolves issue #104 in a backwards-compatible way.
It adds the ability to specify the language of the input text to analyze.

It builds on changes introduced in PR #93, such as using the afinn-165-multilingual repository to retrieve AFINN-165 translations for more than 100 languages.

Although the accuracy of these translations is yet to be determined (as discussed in this comment), this change adds the ability to easily swap the translations if needed, with minimal refactoring.
It also enables developers with lower precision requirements to use these translations directly from this module instead of having to fork the project (see, for example, https://github.com/AlexGustafsson/sentiment-swedish and https://github.com/kubawolanin/sentiment-polish, mentioned in the comments of issue #104).

Finally, this PR introduces the ability to define language-specific negation strategies, since negation varies in complexity across languages and is not simply a matter of translating negators from one language to another.
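To illustrate the idea, a language definition could bundle its translated labels with a negation hook along these lines (a rough sketch only; the property names labels, negators, and scoringStrategy are illustrative, not necessarily the names used in this PR):

// Hypothetical shape of a language definition (illustrative sketch).
const french = {
  // AFINN-165 translations for this language (truncated here).
  labels: { 'bon': 3, 'mauvais': -3 },
  // Language-specific negators.
  negators: { 'pas': true, 'jamais': true },
  // Negation strategy: flip a token's score when the previous token is a negator.
  scoringStrategy: {
    apply: function (tokens, cursor, tokenScore) {
      if (cursor > 0 && french.negators[tokens[cursor - 1]]) {
        return -tokenScore;
      }
      return tokenScore;
    }
  }
};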

@thisandagain (Owner)

@elyas-bhy Thanks! I'll take a look at this over the next week. 😄

@elyas-bhy (Collaborator, Author)

@thisandagain have you had the time to review this PR?

@thisandagain (Owner) commented Jan 15, 2018

Thanks @elyas-bhy! I think this is an interesting and exciting approach. A couple of comments and questions that I'd love to get your thoughts on:

  1. This feels like such a fundamental change that I'd be ok to make this breaking and bump the major version number. In some places, you are jumping through hoops to maintain compatibility (such as here) – which I greatly appreciate – but I think I'd be ok to rethink the API to help things stay simple.

  2. In general, I don't feel great about bringing in the translations from https://github.com/dkocich/afinn-165-multilingual. I don't speak many languages, but from the few that I do, I can tell that some of the translations provided by Google Translate are quite poor. I think I'd rather think about a structure where contributors can submit a single "package" that covers the AFINN translation dictionary, negation words, and a labeled dataset for validation. In that way, perhaps we could rethink the existing dictionary, negation list, and UCI datasets as a default (en) "package".

  3. Totally a detail / nitpick: I suppose I feel about "helper" the same way I do about having a module called "util". I think I'd prefer that any modules added have names that make it immediately clear what they do.

  4. Also a detail: I would want to make sure that any new code landed maintains the current level (100%) of test coverage. You can run the coverage report using make coverage. It's not too bad ... just a few lines in the helper that aren't covered!

@elyas-bhy (Collaborator, Author)

  1. This feels like such a fundamental change that I'd be ok to make this breaking and bump the major version number. In some places, you are jumping through hoops to maintain compatibility (such as here) – which I greatly appreciate – but I think I'd be ok to rethink the API to help things stay simple.

Sure thing. What kind of API do you have in mind?
I was thinking that the most flexible API would be one that uses setters, as in object-oriented programming:

const Sentiment = require('sentiment');  // a constructor is exported
const sentiment = new Sentiment();  // create a new instance
sentiment.setLanguage('en');
sentiment.setExtras({ ... });  // inject custom keywords
sentiment.addExtras({ ... });  // extend the existing keywords
sentiment.ignoreEmojis(true);  // ignore emojis
// etc. for other options
const result = sentiment.analyze('Hello world');

This has the benefit of being flexible enough to allow adding further options without breaking the existing API. Thoughts?

  2. In general, I don't feel great about bringing in the translations from https://github.com/dkocich/afinn-165-multilingual. I don't speak many languages, but from the few that I do, I can tell that some of the translations provided by Google Translate are quite poor. I think I'd rather think about a structure where contributors can submit a single "package" that covers the AFINN translation dictionary, negation words, and a labeled dataset for validation. In that way, perhaps we could rethink the existing dictionary, negation list, and UCI datasets as a default (en) "package".

I agree that the translations from Google Translate are quite lacking, but at least they give developers a starting point to test, iterate on, and refine later.
A good compromise for this PR is to include the necessary refactoring for supporting additional languages, but only expose English for now, since it is the only language that is thoroughly tested. What are your thoughts on this?
Also, there was a discussion in issue #104 about using CrowdFlower for establishing labeled corpora for different languages. Is this something that is still planned? What can we do to push this forward?

  3. Totally a detail / nitpick: I suppose I feel about "helper" the same way I do about having a module called "util". I think I'd prefer that any modules added have names that make it immediately clear what they do.

Agreed. I'll fix this once we agree on the more important changes mentioned above.

  4. Also a detail: I would want to make sure that any new code landed maintains the current level (100%) of test coverage. You can run the coverage report using make coverage. It's not too bad ... just a few lines in the helper that aren't covered!

Same as above.

@elyas-bhy (Collaborator, Author)

@thisandagain thoughts?

@thisandagain (Owner)

@elyas-bhy Thanks for being so thoughtful about this.

Using an OOP setter paradigm seems reasonable to me and could help keep the implementation clean. I can also see an argument for using a configuration object paradigm, which feels more JavaScript-y and less verbose:

const sentiment = require('sentiment');
const result = sentiment({
    text: 'Hello world',  // required
    extras: [],           // optional
    language: 'en',       // optional
    emoji: true           // optional
});

Performance-wise it's a little hard to say what the impact of either approach would be. I can see pros and cons to each.


"A good compromise for this PR is to include the necessary refactoring for supporting additional languages, but only expose english for now, since it is the only language that is thoroughly tested. What are your thoughts on this?"

👍. That feels like a perfectly reasonable compromise. Thanks!

@elyas-bhy (Collaborator, Author) commented Feb 15, 2018

@thisandagain I like your suggestion. However, since the text is always required, I feel it should be its own argument, separate from the optional ones, like so:

const sentiment = require('sentiment');
const result = sentiment('Hello world', {
    extras: {},         // optional
    language: 'en',     // optional
    emoji: true         // optional
});

This feels more sensible to me, and is in line with common JavaScript practices.
Thoughts?

@thisandagain (Owner)

@elyas-bhy Great point. That looks good to me! 👍

@elyas-bhy (Collaborator, Author) commented Feb 27, 2018

@thisandagain Regarding returning the result, do we keep the callbacks? If so, would something like sentiment.analyze(phrase, opts, callback) make sense? Or should we return promises? Or should we support both paradigms (i.e. return a promise if no callback was specified)?
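For reference, supporting both paradigms could look roughly like this (a rough sketch; the synchronous computeScore helper is a hypothetical stand-in for the actual scoring logic):

// Hypothetical synchronous core that produces the analysis result.
function computeScore(phrase, opts) {
  return { phrase: phrase, score: 0, opts: opts };  // placeholder
}

// Sketch: call back when a callback is provided, otherwise return a promise.
function analyze(phrase, opts, callback) {
  // Support analyze(phrase, callback) by shifting arguments.
  if (typeof opts === 'function') {
    callback = opts;
    opts = {};
  }
  const result = computeScore(phrase, opts || {});
  if (typeof callback === 'function') {
    callback(null, result);
    return;
  }
  return Promise.resolve(result);
}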

@elyas-bhy (Collaborator, Author) commented Feb 28, 2018

@thisandagain I finished integrating the changes we discussed. Please check out the updated README file as it summarizes the new features and modifications (view a rendered version here). Note that merging this PR would require releasing a new major version as it has breaking changes.

@thisandagain (Owner)

This is wonderful. Thanks @elyas-bhy. I'll take a deeper look this weekend. Agreed re: releasing in a major version bump.

@thisandagain (Owner) left a comment

This looks great. Thanks a ton @elyas-bhy. Tests look good, docs look good, everything is looking really solid. My only concern is the impact this PR has on performance.

On my dev machine running npm run benchmark:

sentiment (Latest) - Short  x 5,575 ops/sec ±1.56% (88 runs sampled)
sentiment (Latest) - Long   x 2,237 ops/sec ±1.26% (92 runs sampled)
Sentimental (1.0.1) - Short x 409,364 ops/sec ±1.74% (93 runs sampled)
Sentimental (1.0.1) - Long  x 1,753 ops/sec ±1.54% (91 runs sampled)

The "short" benchmark with this PR applied is over 100X slower (develop runs at 711,005 ops/sec). Interestingly (and potentially helpfully!), this PR appears to have little to no impact on the "long" benchmark.

This would lead me to suspect that the "getLabels" lookup for the language processor may be the culprit. Luckily, I think this may be easy to resolve by simply making sure that the English labels are loaded at "require-time" rather than every time that the sentiment.analyze function is run. In addition, it also points to a potential benefit of holding the labels in a cache object (particularly when users are leveraging something other than English).
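For illustration, that could be as simple as a module-level cache along these lines (a rough sketch only; labelData and getLabels are illustrative stand-ins, not code from this PR):

// Hypothetical per-language label data; in the real module this would come from
// AFINN-165 and its translations, likely require()'d from per-language files.
const labelData = {
  en: { good: 3, bad: -3 },
  fr: { bon: 3, mauvais: -3 }
};

// Sketch: cache label dictionaries so analyze() never re-reads label data.
const labelCache = {};

function getLabels(languageCode) {
  if (!labelCache[languageCode]) {
    labelCache[languageCode] = labelData[languageCode];
  }
  return labelCache[languageCode];
}

// Warm the English labels at require time so the common path pays no lookup cost.
getLabels('en');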

This is so close. Thanks a ton @elyas-bhy. Can't wait to get this landed! Let me know if you want to talk through any of the performance stuff in more detail.

@thisandagain (Owner)

@elyas-bhy Related: I'm going to add you as a contributor to this repo. It's the very least I can do to recognize how great your contributions have been. 😄

@elyas-bhy (Collaborator, Author) commented Mar 3, 2018

Thanks for the feedback. I will look into this when I get back on Monday.
Also thanks for the contributor rights!

@elyas-bhy (Collaborator, Author)

This would lead me to suspect that the "getLabels" lookup for the language processor may be the culprit. Luckily, I think this may be easy to resolve by simply making sure that the English labels are loaded at "require-time" rather than every time that the sentiment.analyze function is run. In addition, it also points to a potential benefit of holding the labels in a cache object (particularly when users are leveraging something other than English).

This is already the case.

After some digging into the profiler, it turns out that the performance hit is due to these lines:

// https://github.com/elyas-bhy/sentiment/blob/develop/lib/index.js#L47-L49
if (opts.emojis !== false) {
  Object.assign(labels, emojis);
}

In other words, providing the ability to enable or disable emoji detection as an option of the sentiment.analyze function leads to a major performance hit.
We should probably rethink this option: is it really necessary after all? Perhaps we can enable emoji detection by default and, if needed, eventually provide an option to disable it at a global level, to avoid merging large objects at runtime.

@thisandagain thoughts?

@elyas-bhy (Collaborator, Author)

@thisandagain ping.

@elyas-bhy (Collaborator, Author)

@thisandagain ping again.

@thisandagain (Owner) commented Mar 31, 2018

Sorry for the delay @elyas-bhy. I see ... that all makes sense. The issue is that the emoji processing is still being applied even if emojis is undefined in opts. Checking the opts object is quite fast; the issue is with Object.assign. By default we omit the emoji processing due to the performance implications (discussion here). I think swapping out that logic with the following should resolve this issue:

// Add emoji labels only when explicitly enabled
if (typeof opts.emojis !== 'undefined') {
    if (opts.emojis) Object.assign(labels, emojis);
}

Resulting benchmark:

sentiment (Latest) - Short  x 867,995 ops/sec ±2.02% (89 runs sampled)
sentiment (Latest) - Long   x 3,793 ops/sec ±1.45% (88 runs sampled)

@elyas-bhy (Collaborator, Author) commented Apr 1, 2018

@thisandagain thanks for the feedback. However, I think I have not properly explained what I had in mind in my last comment.

Sure, if you disable emojis by default, then the benchmarks seem to perform way better, but I find that a bit misleading. The fact is that, even with your proposed fix, we are still calling Object.assign every time we analyze something. This is inefficient, because the labels are re-merged on every subsequent analyze call.

My proposed fix is to merge the emoji labels once, when initializing the sentiment instance, to avoid repeatedly merging large objects at runtime. This removes the overhead from the analyze function, which in turn performs much better.

However, if we go this route, we need to ask ourselves whether we need a way to disable emojis at all (does the community really need this option?). Perhaps requiring the sentiment module should return a constructor, so that the user can pass options when creating a new instance? This approach is a useful, future-proof way to pass global options, and it should be considered since we are releasing a new major version anyway.
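As a rough sketch of that idea (the getLabels helper, the emojis map, and the emojis option name are illustrative stand-ins, not the final implementation):

// Stand-ins for the label data used below (illustrative only).
const emojis = { '😄': 3, '😡': -3 };
function getLabels(language) {
  return language === 'en' ? { good: 3, bad: -3 } : {};
}

// Sketch: merge emoji labels once at construction time instead of on every analyze() call.
function Sentiment(options) {
  options = options || {};
  this.labels = Object.assign({}, getLabels(options.language || 'en'));
  // Global opt-out for emoji detection, evaluated only once.
  if (options.emojis !== false) {
    Object.assign(this.labels, emojis);
  }
}

// analyze() now only reads this.labels; no per-call merging of large objects.
Sentiment.prototype.analyze = function (phrase) {
  const tokens = phrase.toLowerCase().split(/\s+/);
  let score = 0;
  for (const token of tokens) {
    score += this.labels[token] || 0;
  }
  return { score: score, tokens: tokens };
};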

Thoughts?

@thisandagain (Owner) commented Apr 1, 2018

@elyas-bhy Ah! Thanks for clarifying. That makes sense. My hesitation with landing the emoji work initially was that it would be a breaking change and we had poor validation tests at the time. Since this will be a breaking change anyway I think I'm ok to land it as long as the overall performance and validation tests look ok.

Perhaps requiring the sentiment module should return a constructor, so that the user can pass options when creating a new instance? This approach is a useful, future-proof way to pass global options, and it should be considered since we are releasing a new major version anyway.

That seems reasonable and should give some users peace of mind until we have more robust validation available for the emoji sentiment.


I think this is really close. Thanks for working through this with me @elyas-bhy.

@elyas-bhy (Collaborator, Author) commented Apr 3, 2018

Requiring the sentiment module now returns a constructor instead of a singleton instance. We can pass options to the constructor, but none are implemented yet; IMO it is better to create separate issues for these options (such as globally disabling emojis), since the scope of this PR is large enough as it is.
Benchmarks, coverage, and tests are 👍
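The resulting usage looks roughly like this (exact option names are best checked against the updated README):

const Sentiment = require('sentiment');  // the module now exports a constructor

const sentiment = new Sentiment();       // global options could be passed here later
const result = sentiment.analyze('Hello world', { language: 'en' });
console.log(result.score);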

@elyas-bhy (Collaborator, Author)

@thisandagain ping.

@thisandagain merged commit d010408 into thisandagain:develop on Apr 25, 2018
@thisandagain (Owner)

... and merged! Thanks a ton for all your work on this @elyas-bhy. Also thank you for being so patient with me and my crazy schedule this year. 💟

@dkocich commented Apr 26, 2018

@elyas-bhy thanks for your work on this issue :-)
