Add support for additional languages #128
@elyas-bhy Thanks! I'll take a look at this over the next week. 😄 |
@thisandagain have you had the time to review this PR? |
Thanks @elyas-bhy! I think this is an interesting and exciting approach. Couple comments and questions that I'd love to get your thoughts on:
|
Sure thing. What kind of API do you have in mind?

    const Sentiment = require('sentiment'); // a constructor is exported
    const sentiment = new Sentiment(); // create a new instance
    sentiment.setLanguage('en');
    sentiment.setExtras({ ... }); // inject custom keywords
    sentiment.addExtras({ ... }); // extend the existing keywords
    sentiment.ignoreEmojis(true); // ignore emojis
    // etc. for other options
    const result = sentiment.analyze('Hello world');

This has the benefit of being flexible enough to allow adding further options without breaking the existing API. Thoughts?
I agree that the translations from Google Translate are quite lacking, but at least they allow developers to start testing them in order to iterate and refine them later.
Agreed. I'll fix this once we agree on the more important changes mentioned above.
Same as above. |
@thisandagain thoughts? |
@elyas-bhy Thanks for being so thoughtful about this. Using an OOP setter paradigm seems reasonable to me and could help keep the implementation clean. I can see an argument for using a configuration object paradigm (which feels more javascript-y and is less verbose):

    const sentiment = require('sentiment');
    const result = sentiment({
        text: 'Hello world', // required
        extras: [], // optional
        language: 'en', // optional
        emoji: true // optional
    });

Performance-wise it's a little hard to say what the impact of either approach would be. I can see pros and cons to each.
👍. That feels like a perfectly reasonable compromise. Thanks! |
@thisandagain I like your suggestion. However, I feel like since the

    const sentiment = require('sentiment');
    const result = sentiment('Hello world', {
        extras: {}, // optional
        language: 'en', // optional
        emoji: true // optional
    });

This feels more sensible to me, and is in line with common JavaScript practices. |
@elyas-bhy Great point. That looks good to me! 👍 |
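For readers following the thread, the call shape agreed on above (required text first, optional settings object second) could be sketched roughly as follows. This is a minimal illustration, not the module's implementation: the `LABELS` table, its scores, and the tokenizer are assumptions made up for the example.

```javascript
// Hypothetical word scores; the real module uses the AFINN-165 list instead.
const LABELS = { en: { good: 3, bad: -3 } };

// Required text as the first argument, optional settings as the second.
function sentiment(text, options = {}) {
  const { language = 'en', extras = {} } = options;

  // A prototype-less object avoids accidental hits on inherited keys
  // such as "constructor".
  const labels = Object.assign(Object.create(null), LABELS[language], extras);

  const tokens = text.toLowerCase().split(/\W+/).filter(Boolean);
  const score = tokens.reduce((sum, token) => sum + (labels[token] || 0), 0);
  return { score, tokens };
}
```

Because every option has a default, new options can be added later without changing how existing callers invoke the function.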
@thisandagain Regarding returning the result, do we keep the callbacks? In which case, would something like |
@thisandagain I finished integrating the changes we discussed. Please check out the updated README file as it summarizes the new features and modifications (view a rendered version here). Note that merging this PR would require releasing a new major version as it has breaking changes. |
This is wonderful. Thanks @elyas-bhy. I'll take a deeper look this weekend. Agreed re: releasing in a major version bump. |
This looks great. Thanks a ton @elyas-bhy. Tests look good, docs look good, everything is looking really solid. My only concern is the impact this PR has on performance.
On my dev machine running npm run benchmark:
sentiment (Latest) - Short x 5,575 ops/sec ±1.56% (88 runs sampled)
sentiment (Latest) - Long x 2,237 ops/sec ±1.26% (92 runs sampled)
Sentimental (1.0.1) - Short x 409,364 ops/sec ±1.74% (93 runs sampled)
Sentimental (1.0.1) - Long x 1,753 ops/sec ±1.54% (91 runs sampled)
The "short" benchmark with this PR applied is over 100X slower (develop runs at 711,005 ops/sec). Interestingly (and potentially helpfully!), this PR appears to have little to no impact on the "long" benchmark.
This would lead me to suspect that the "getLabels" lookup for the language processor may be the culprit. Luckily, I think this may be easy to resolve by simply making sure that the English labels are loaded at "require-time" rather than every time that the sentiment.analyze function is run. In addition, it also points to a potential benefit of holding the labels in a cache object (particularly when users are leveraging something other than English).
This is so close. Thanks a ton @elyas-bhy. Can't wait to get this landed! Let me know if you want to talk through any of the performance stuff in more detail.
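The caching idea suggested in this comment could look something like the sketch below: the English labels are loaded once when the module is required, and any other language is loaded on first use and then reused. The `loadLabels` helper, the `loadCount` counter, and the sample data are hypothetical stand-ins, not code from the PR.

```javascript
// Hypothetical loader standing in for reading a language's label file.
let loadCount = 0;
function loadLabels(language) {
  loadCount += 1; // counts how often the (expensive) load actually runs
  const data = {
    en: { good: 3, bad: -3 },
    fr: { bon: 3, mauvais: -3 },
  };
  return Object.assign(Object.create(null), data[language]);
}

// Cache keyed by language code; English is loaded eagerly at require-time.
const cache = new Map();
cache.set('en', loadLabels('en'));

// Returns cached labels, loading a language only the first time it is requested.
function getLabels(language) {
  if (!cache.has(language)) {
    cache.set(language, loadLabels(language));
  }
  return cache.get(language);
}
```

With this arrangement, repeated calls for the same language return the same object without touching the loader again.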
@elyas-bhy Related: I'm going to add you as a contributor to this repo. It's the very least I can do to recognize how great your contributions have been. 😄 |
Thanks for the feedback. I will look into this when I get back on Monday. |
This is already the case:
After some digging into the profiler, it turns out that the performance hit is due to these lines:

    // https://github.com/elyas-bhy/sentiment/blob/develop/lib/index.js#L47-L49
    if (opts.emojis !== false) {
        Object.assign(labels, emojis);
    }

In other words, providing the ability to enable or disable emoji detection as an option of the

@thisandagain thoughts? |
@thisandagain ping. |
@thisandagain ping again. |
Sorry for the delay @elyas-bhy. I see ... that all makes sense. The issue is that the emoji processing is still being applied even if

    // Add emojis unless explicitly excluded
    if (typeof opts.emojis !== 'undefined') {
        if (opts.emojis) Object.assign(labels, emojis);
    }

Resulting benchmark:
|
@thisandagain thanks for the feedback. However, I think I have not properly explained what I had in mind in my last comment. Sure, if you disable emojis by default, then the benchmarks seem to perform way better, but I find that a bit misleading. The fact is that, even with your proposed fix, we are still calling

My proposed fix is to merge the emoji labels once, when initializing the

However, if we go this route, we need to ask ourselves if we need a way to disable emojis if needed (does the community really need this option after all?). Perhaps requiring the

Thoughts? |
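The "merge once" proposal in this comment can be illustrated with a small sketch in which the emoji scores are folded into the label table at module load, so `analyze` itself never calls `Object.assign`. The `words` and `emojis` tables and their scores are invented for the example, and the whitespace tokenizer is a simplification.

```javascript
// Hypothetical score tables for illustration only.
const words = { good: 3, bad: -3 };
const emojis = { '😄': 3, '😞': -2 };

// Merge once, when the module is loaded, instead of on every analyze() call.
const labels = Object.assign(Object.create(null), words, emojis);

// Per-call work is now just tokenizing and summing lookups.
function analyze(text) {
  const tokens = text.toLowerCase().split(/\s+/).filter(Boolean);
  return tokens.reduce((sum, token) => sum + (labels[token] || 0), 0);
}
```

The trade-off raised in the comment still applies: with a single merged table there is no per-call switch to turn emoji scoring off.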
@elyas-bhy Ah! Thanks for clarifying. That makes sense. My hesitation with landing the emoji work initially was that it would be a breaking change and we had poor validation tests at the time. Since this will be a breaking change anyway I think I'm ok to land it as long as the overall performance and validation tests look ok.
That seems reasonable and should provide some users peace of mind until we have more robust validation available for the emoji sentiment. I think this is really close. Thanks for working through this with me @elyas-bhy. |
Requiring the |
@thisandagain ping. |
... and merged! Thanks a ton for all your work on this @elyas-bhy. Also thank you for being so patient with me and my crazy schedule this year. 💟 |
@elyas-bhy thanks for your work on this issue :-) |
This PR resolves issue #104 in a backwards-compatible way.
It adds the ability to specify the language of the input text to analyze.
It relies on some changes introduced in PR #93, such as relying on the afinn-165-multilingual repository to retrieve AFINN-165 translations for more than 100 languages. Although the accuracy of these translations is yet to be determined (as discussed in this comment), this change adds the ability to easily swap the translations if needed, with minimal refactoring.
It also enables developers (with lower requirements wrt. precision) to access these translations directly from this module, instead of having to fork the project (as in the comments of issue #104, see https://github.com/AlexGustafsson/sentiment-swedish and https://github.com/kubawolanin/sentiment-polish for example).
Finally, this PR also introduces the ability to define language-specific negation strategies, as negation can be more or less complex depending on the language, and is not simply a matter of translating negators from one language to another.
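As a rough illustration of what a language-specific negation strategy might look like, the sketch below registers one function per language that decides how a token's score changes given its neighbours. The strategy signature, the `negationStrategies` map, the negator list, and the scores are all assumptions for the example and are not taken from the PR's code.

```javascript
// Hypothetical word scores for illustration only.
const labels = Object.assign(Object.create(null), { good: 3, bad: -3 });

// One strategy per language: (tokens, index, score) => adjusted score.
const negationStrategies = new Map([
  // Simplified English rule: flip the score if the previous token is a negator.
  ['en', (tokens, i, score) => {
    const negators = ['not', 'never', "don't"];
    return i > 0 && negators.includes(tokens[i - 1]) ? -score : score;
  }],
]);

// Fallback for languages without a registered strategy: leave scores unchanged.
const noNegation = (tokens, i, score) => score;

function analyze(text, language = 'en') {
  const negate = negationStrategies.get(language) || noNegation;
  const tokens = text.toLowerCase().split(/\s+/).filter(Boolean);
  return tokens.reduce(
    (sum, token, i) => sum + negate(tokens, i, labels[token] || 0),
    0
  );
}
```

A language whose negation wraps or follows the negated word would register its own function under its language code rather than reuse a translated negator list.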