
Add support for additional languages #128

Merged: 10 commits into thisandagain:develop on Apr 25, 2018

Conversation

@elyas-bhy (Collaborator) commented Dec 11, 2017

This PR resolves issue #104 in a backwards-compatible way.
It adds the ability to specify the language of the input text to analyze.

It builds on changes introduced in PR #93, such as using the afinn-165-multilingual repository to retrieve AFINN-165 translations for more than 100 languages.

Although the accuracy of these translations is yet to be determined (as discussed in this comment), this change adds the ability to easily swap the translations if needed, with minimal refactoring.
It also enables developers with lower precision requirements to use these translations directly from this module instead of having to fork the project (see, for example, https://github.com/AlexGustafsson/sentiment-swedish and https://github.com/kubawolanin/sentiment-polish, mentioned in the comments of issue #104).

Finally, this PR introduces the ability to define language-specific negation strategies, since negation varies in complexity across languages and is not simply a matter of translating negators from one language to another.
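To illustrate the idea, a language definition could bundle its translated labels with a negation hook along these lines (a rough sketch only; the property names labels, negators, and scoringStrategy are illustrative, not necessarily the names used in this PR):

// Hypothetical shape of a language definition (illustrative sketch).
const french = {
  // AFINN-165 translations for this language (truncated here).
  labels: { 'bon': 3, 'mauvais': -3 },
  // Language-specific negators.
  negators: { 'pas': true, 'jamais': true },
  // Negation strategy: flip a token's score when the previous token is a negator.
  scoringStrategy: {
    apply: function (tokens, cursor, tokenScore) {
      if (cursor > 0 && french.negators[tokens[cursor - 1]]) {
        return -tokenScore;
      }
      return tokenScore;
    }
  }
};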

@thisandagain (Owner)

@elyas-bhy Thanks! I'll take a look at this over the next week. 😄

@elyas-bhy (Collaborator, Author)

@thisandagain have you had the time to review this PR?

@thisandagain (Owner) commented Jan 15, 2018

Thanks @elyas-bhy! I think this is an interesting and exciting approach. A couple of comments and questions that I'd love to get your thoughts on:

  1. This feels like such a fundamental change that I'd be ok to make this breaking and bump the major version number. In some places, you are jumping through hoops to maintain compatibility (such as here) – which I greatly appreciate – but I think I'd be ok to rethink the API to help things stay simple.

  2. In general, I don't feel great about bringing in the translations from https://github.com/dkocich/afinn-165-multilingual. I don't speak many languages, but from the few that I do, I can tell that some of the translations provided by Google Translate are quite poor. I think I'd rather think about a structure where contributors can submit a single "package" that covers the AFINN translation dictionary, negation words, and a labeled dataset for validation. In that way, perhaps we could rethink the existing dictionary, negation list, and UCI datasets as a default (en) "package".

  3. Totally a detail / nitpick: I suppose I feel about "helper" the same way I do about having a module called "util". I think I'd prefer that any modules added have names that make it immediately clear what they do.

  4. Also a detail: I would want to make sure that any new code landed maintains the current level (100%) of test coverage. You can run the coverage report using make coverage. It's not too bad ... just a few lines in the helper that aren't covered!

@elyas-bhy (Collaborator, Author)

  1. This feels like such a fundamental change that I'd be ok to make this breaking and bump the major version number. In some places, you are jumping through hoops to maintain compatibility (such as here) – which I greatly appreciate – but I think I'd be ok to rethink the API to help things stay simple.

Sure thing. What kind of API do you have in mind?
I was thinking that the most flexible API would be one that uses setters, as in object-oriented programming:

const Sentiment = require('sentiment');  // a constructor is exported
const sentiment = new Sentiment();  // create a new instance
sentiment.setLanguage('en');
sentiment.setExtras({ ... });  // inject custom keywords
sentiment.addExtras({ ... });  // extend the existing keywords
sentiment.ignoreEmojis(true);  // ignore emojis
// etc. for other options
const result = sentiment.analyze('Hello world');

This has the benefit of being flexible enough to allow adding further options without breaking the existing API. Thoughts?

  2. In general, I don't feel great about bringing in the translations from https://github.com/dkocich/afinn-165-multilingual. I don't speak many languages, but from the few that I do, I can tell that some of the translations provided by Google Translate are quite poor. I think I'd rather think about a structure where contributors can submit a single "package" that covers the AFINN translation dictionary, negation words, and a labeled dataset for validation. In that way, perhaps we could rethink the existing dictionary, negation list, and UCI datasets as a default (en) "package".

I agree that the translations from Google Translate are quite lacking, but at least they give developers a starting point to test, iterate on, and refine later.
A good compromise for this PR is to include the necessary refactoring for supporting additional languages, but only expose English for now, since it is the only language that is thoroughly tested. What are your thoughts on this?
Also, there was a discussion in issue #104 about using CrowdFlower for establishing labeled corpora for different languages. Is this something that is still planned? What can we do to push this forward?

  3. Totally a detail / nitpick: I suppose I feel about "helper" the same way I do about having a module called "util". I think I'd prefer that any modules added have names that make it immediately clear what they do.

Agreed. I'll fix this once we agree on the more important changes mentioned above.

  4. Also a detail: I would want to make sure that any new code landed maintains the current level (100%) of test coverage. You can run the coverage report using make coverage. It's not too bad ... just a few lines in the helper that aren't covered!

Same as above.

@elyas-bhy (Collaborator, Author)

@thisandagain thoughts?

@thisandagain (Owner)

@elyas-bhy Thanks for being so thoughtful about this.

Using an OOP setter paradigm seems reasonable to me and could help keep the implementation clean. I can also see an argument for using a configuration object paradigm, which feels more JavaScript-y and less verbose:

const sentiment = require('sentiment');
const result = sentiment({
    text: 'Hello world',  // required
    extras: [],           // optional
    language: 'en',       // optional
    emoji: true           // optional
});

Performance-wise it's a little hard to say what the impact of either approach would be. I can see pros and cons to each.


"A good compromise for this PR is to include the necessary refactoring for supporting additional languages, but only expose english for now, since it is the only language that is thoroughly tested. What are your thoughts on this?"

👍. That feels like a perfectly reasonable compromise. Thanks!

@elyas-bhy (Collaborator, Author) commented Feb 15, 2018

@thisandagain I like your suggestion. However, since the text is always required, I feel it should be its own argument, separate from the optional ones, like so:

const sentiment = require('sentiment');
const result = sentiment('Hello world', {
    extras: {},         // optional
    language: 'en',     // optional
    emoji: true         // optional
});

This feels more sensible to me, and is in line with common JavaScript practices.
Thoughts?

@thisandagain (Owner)

@elyas-bhy Great point. That looks good to me! 👍

@elyas-bhy (Collaborator, Author) commented Feb 27, 2018

@thisandagain Regarding returning the result, do we keep the callbacks? If so, would something like sentiment.analyze(phrase, opts, callback) make sense? Or should we return promises? Or should we support both paradigms (i.e. return a promise if no callback was specified)?
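For reference, supporting both paradigms could look roughly like this (a rough sketch; the synchronous computeScore helper is a hypothetical stand-in for the actual scoring logic):

// Hypothetical synchronous core that produces the analysis result.
function computeScore(phrase, opts) {
  return { phrase: phrase, score: 0, opts: opts };  // placeholder
}

// Sketch: call back when a callback is provided, otherwise return a promise.
function analyze(phrase, opts, callback) {
  // Support analyze(phrase, callback) by shifting arguments.
  if (typeof opts === 'function') {
    callback = opts;
    opts = {};
  }
  const result = computeScore(phrase, opts || {});
  if (typeof callback === 'function') {
    callback(null, result);
    return;
  }
  return Promise.resolve(result);
}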

@elyas-bhy (Collaborator, Author) commented Feb 28, 2018

@thisandagain I finished integrating the changes we discussed. Please check out the updated README file as it summarizes the new features and modifications (view a rendered version here). Note that merging this PR would require releasing a new major version as it has breaking changes.

@thisandagain (Owner)

This is wonderful. Thanks @elyas-bhy. I'll take a deeper look this weekend. Agreed re: releasing in a major version bump.

@thisandagain (Owner) left a comment

This looks great. Thanks a ton @elyas-bhy. Tests look good, docs look good, everything is looking really solid. My only concern is the impact this PR has on performance.

On my dev machine running npm run benchmark:

sentiment (Latest) - Short  x 5,575 ops/sec ±1.56% (88 runs sampled)
sentiment (Latest) - Long   x 2,237 ops/sec ±1.26% (92 runs sampled)
Sentimental (1.0.1) - Short x 409,364 ops/sec ±1.74% (93 runs sampled)
Sentimental (1.0.1) - Long  x 1,753 ops/sec ±1.54% (91 runs sampled)

The "short" benchmark with this PR applied is over 100X slower (develop runs at 711,005 ops/sec). Interestingly (and potentially helpfully!), this PR appears to have little to no impact on the "long" benchmark.

This would lead me to suspect that the "getLabels" lookup for the language processor may be the culprit. Luckily, I think this may be easy to resolve by simply making sure that the English labels are loaded at "require-time" rather than every time that the sentiment.analyze function is run. In addition, it also points to a potential benefit of holding the labels in a cache object (particularly when users are leveraging something other than English).
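For illustration, that could be as simple as a module-level cache along these lines (a rough sketch only; labelData and getLabels are illustrative stand-ins, not code from this PR):

// Hypothetical per-language label data; in the real module this would come from
// AFINN-165 and its translations, likely require()'d from per-language files.
const labelData = {
  en: { good: 3, bad: -3 },
  fr: { bon: 3, mauvais: -3 }
};

// Sketch: cache label dictionaries so analyze() never re-reads label data.
const labelCache = {};

function getLabels(languageCode) {
  if (!labelCache[languageCode]) {
    labelCache[languageCode] = labelData[languageCode];
  }
  return labelCache[languageCode];
}

// Warm the English labels at require time so the common path pays no lookup cost.
getLabels('en');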

This is so close. Thanks a ton @elyas-bhy. Can't wait to get this landed! Let me know if you want to talk through any of the performance stuff in more detail.

@thisandagain (Owner)

@elyas-bhy Related: I'm going to add you as a contributor to this repo. It's the very least I can do to recognize how great your contributions have been. 😄

@elyas-bhy (Collaborator, Author) commented Mar 3, 2018

Thanks for the feedback. I will look into this when I get back on Monday.
Also thanks for the contributor rights!

@elyas-bhy (Collaborator, Author)

This would lead me to suspect that the "getLabels" lookup for the language processor may be the culprit. Luckily, I think this may be easy to resolve by simply making sure that the English labels are loaded at "require-time" rather than every time that the sentiment.analyze function is run. In addition, it also points to a potential benefit of holding the labels in a cache object (particularly when users are leveraging something other than English).

This is already the case.

After some digging into the profiler, it turns out that the performance hit is due to these lines:

// https://github.com/elyas-bhy/sentiment/blob/develop/lib/index.js#L47-L49
if (opts.emojis !== false) {
  Object.assign(labels, emojis);
}

In other words, providing the ability to enable or disable emoji detection as an option of the sentiment.analyze function leads to a major performance hit.
We should probably rethink this option: is it really necessary after all? Perhaps we can enable emoji detection by default and, if needed, eventually provide an option to disable it at a global level, to avoid merging large objects at runtime.

@thisandagain thoughts?

@elyas-bhy (Collaborator, Author)

@thisandagain ping.

@elyas-bhy (Collaborator, Author)

@thisandagain ping again.

@thisandagain (Owner) commented Mar 31, 2018

Sorry for the delay @elyas-bhy. I see ... that all makes sense. The issue is that the emoji processing is still being applied even if emojis is undefined in opts. Checking the opts object is quite fast; the issue is with Object.assign. By default we omit the emoji processing due to the performance implications (discussion here). I think swapping out that logic with the following should resolve this issue:

// Add emoji labels only when explicitly enabled
if (typeof opts.emojis !== 'undefined') {
    if (opts.emojis) Object.assign(labels, emojis);
}

Resulting benchmark:

sentiment (Latest) - Short  x 867,995 ops/sec ±2.02% (89 runs sampled)
sentiment (Latest) - Long   x 3,793 ops/sec ±1.45% (88 runs sampled)

@elyas-bhy (Collaborator, Author) commented Apr 1, 2018

@thisandagain thanks for the feedback. However, I think I have not properly explained what I had in mind in my last comment.

Sure, if you disable emojis by default, then the benchmarks seem to perform way better, but I find that a bit misleading. The fact is that, even with your proposed fix, we are still calling Object.assign every time we analyze something. This is inefficient, because the labels are re-merged on every subsequent analyze call.

My proposed fix is to merge the emoji labels once, when initializing the sentiment instance, to avoid repeatedly merging large objects at runtime. This removes the overhead from the analyze function, which in turn performs much better.

However, if we go this route, we need to ask ourselves whether we need a way to disable emojis at all (does the community really need this option?). Perhaps requiring the sentiment module should return a constructor, so that the user can pass options when creating a new instance? This approach is a useful, future-proof way to pass global options, and it should be considered since we are releasing a new major version anyway.
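As a rough sketch of that idea (the getLabels helper, the emojis map, and the emojis option name are illustrative stand-ins, not the final implementation):

// Stand-ins for the label data used below (illustrative only).
const emojis = { '😄': 3, '😡': -3 };
function getLabels(language) {
  return language === 'en' ? { good: 3, bad: -3 } : {};
}

// Sketch: merge emoji labels once at construction time instead of on every analyze() call.
function Sentiment(options) {
  options = options || {};
  this.labels = Object.assign({}, getLabels(options.language || 'en'));
  // Global opt-out for emoji detection, evaluated only once.
  if (options.emojis !== false) {
    Object.assign(this.labels, emojis);
  }
}

// analyze() now only reads this.labels; no per-call merging of large objects.
Sentiment.prototype.analyze = function (phrase) {
  const tokens = phrase.toLowerCase().split(/\s+/);
  let score = 0;
  for (const token of tokens) {
    score += this.labels[token] || 0;
  }
  return { score: score, tokens: tokens };
};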

Thoughts?

@thisandagain (Owner) commented Apr 1, 2018

@elyas-bhy Ah! Thanks for clarifying. That makes sense. My hesitation with landing the emoji work initially was that it would be a breaking change and we had poor validation tests at the time. Since this will be a breaking change anyway I think I'm ok to land it as long as the overall performance and validation tests look ok.

Perhaps requiring the sentiment module should return a constructor, so that the user can pass options when creating a new instance? This approach is a useful, future-proof way to pass global options, and it should be considered since we are releasing a new major version anyway.

That seems reasonable and should give some users peace of mind until we have more robust validation available for the emoji sentiment.


I think this is really close. Thanks for working through this with me @elyas-bhy.

@elyas-bhy (Collaborator, Author) commented Apr 3, 2018

Requiring the sentiment module now returns a constructor instead of a singleton instance. We can pass options to the constructor, but none are implemented yet; IMO it is better to create separate issues for these options (such as globally disabling emojis), since the scope of this PR is large enough as it is.
Benchmarks, coverage, and tests are 👍
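The resulting usage looks roughly like this (exact option names are best checked against the updated README):

const Sentiment = require('sentiment');  // the module now exports a constructor

const sentiment = new Sentiment();       // global options could be passed here later
const result = sentiment.analyze('Hello world', { language: 'en' });
console.log(result.score);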

@elyas-bhy (Collaborator, Author)

@thisandagain ping.

@thisandagain merged commit d010408 into thisandagain:develop on Apr 25, 2018
@thisandagain (Owner)

... and merged! Thanks a ton for all your work on this @elyas-bhy. Also thank you for being so patient with me and my crazy schedule this year. 💟

@dkocich commented Apr 26, 2018

@elyas-bhy thanks for your work on this issue :-)
