Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[STORM-1565] Multi-Lang Performance Improvements #1136

Closed
wants to merge 4 commits into from

Conversation

vesense
Copy link
Member

@vesense vesense commented Feb 22, 2016

@dan-blanchard
Copy link
Contributor

You may want to look at the Pyleus project's MessagePackSerializer, because I know they had to add some things in to handle weird edge cases. I'm not really familiar enough with implementing custom serializers to recognize why your approach and theirs look rather different.

@dan-blanchard
Copy link
Contributor

Also, this would be amazing, as I was trying to convince the Pyleus folks to contribute their serializer to Storm anyway in Yelp/pyleus#159.

@@ -226,7 +226,7 @@ topology.eventlogger.executors: null
topology.tasks: null
# maximum amount of time a message has to complete before it's considered failed
topology.message.timeout.secs: 30
topology.multilang.serializer: "org.apache.storm.multilang.JsonSerializer"
topology.multilang.serializer: "org.apache.storm.multilang.MessagePackSerializer"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While having the MessagePackSeriailzer built-in would be amazing, changing it to the default serializer would break a lot of multi-lang libraries, so I don't think that's a great idea. Let people opt in.

@vesense
Copy link
Member Author

vesense commented Mar 7, 2016

Thanks @dan-blanchard I'd love to hear from pyleus contributors.

@vesense
Copy link
Member Author

vesense commented Mar 16, 2016

cc @HeartSaVioR You might be interested in this improvement.

@@ -202,6 +202,7 @@
<clj-time.version>0.8.0</clj-time.version>
<curator.version>2.9.0</curator.version>
<json-simple.version>1.1</json-simple.version>
<msgpack.version>0.6.12</msgpack.version>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be much more useful if it were implemented using a newer version of the msgpack-core library—they changed the name after 0.6.12—because 0.7 and above supports the BINARY format, which lets you send arbitrary bytes. Without that, you won't be able to send tuples containing arbitrary bytes with this serializer. This is also a problem with the JSON serializer (because JSON strings can't contain non-Unicode characters), but it would be great if we didn't have the problem here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Annoyingly, they changed the whole API with the change from msgpack to msgpack-core, so the template approach won't work anymore. You can see new usage examples here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dan-blanchard Thanks. I will update.

@HeartSaVioR
Copy link
Contributor

@vesense
Sorry to reach out too late. Some comments for me:

  1. We might want to include this to multilang (like multilang-serializer) module so that it doesn't affect dependency if users don't use multilang feature. But I'm also OK to have this to core if we are OK to just shade msgpack.
  2. As @dan-blanchard stated we may want to have stable implementation of serializer if possible.
  3. I totally agree with @dan-blanchard that changing default behavior should be avoided unless releasing major version.
  4. It requires users to implement custom multi-lang logic to apply deserialization. IMO, I think it makes issue more complicated. We could provide implementation by default, but it will force user to have dependency at all, and more bad thing is that we need to address this at least three languages.

Personally I feel that we're stuck on having less-maintaining and non-library-dependent default implementations. I can take care of python / ruby implementation though I'm not expert on these, but have no idea on nodejs.

@HeartSaVioR
Copy link
Contributor

So it may need to discuss with current state of multi-lang support once rather than adding something continuously.

@vesense
Copy link
Member Author

vesense commented Jun 21, 2016

@HeartSaVioR Agree with you. We should have a discussion about multi-lang support. (Can open a thread on dev@ mailing list)

@dan-blanchard
Copy link
Contributor

I'm not on the Storm dev mailing list yet—I guess I should probably fix that—but we actually already have a Python implementation of msgpack serialization in pystorm, the Python multi-lang implementation that powers streamparse.

I've been meaning to propose for a while that instead of having your own default Python multi-lang implementation in Storm that very few people use (because it's not very Pythonic or production-ready), you should instead point people to use pystorm at the very least. It provides all the functionality that the storm.py provides but with documentation, exception handling, logging, etc. pystorm is intended to be used as a starting point for more full-fledged Python-Storm interop libraries (like streamparse), but if you want to replace the bare bones provdided by storm.py with an enhanced version of the same thing, it seems like pystorm is the way to go.

@HeartSaVioR
Copy link
Contributor

@dan-blanchard
Yeah, agreed. I've addressed several multi-lang issues so I had been interested on multi-lang feature of Storm, but few people have interest on maintaining / improving basic implementation of multilang protocol in various language even python.
Pystorm seems great, and we lack contributors who care multi-lang implementations as one of major module so I'd still rather rely on stable implementation than build from scratch. Just one concern for me is that it seems to be heavily depending on you (I mean individual) so I'm curious we feel safe to rely on.

@HeartSaVioR
Copy link
Contributor

@dan-blanchard And please note that opinions are my own, so we should raise this to discussion and let community give various opinions if we would really want to.

@dan-blanchard
Copy link
Contributor

Pystorm seems great, and we lack contributors who care multi-lang implementations as one of major module so I'd still rather rely on stable implementation than build from scratch. Just one concern for me is that it seems to be heavily depending on you (I mean individual) so I'm curious we feel safe to rely on.

I work full-time for a company that uses pystorm and streamparse in production, and we have a team dedicated to maintaining these projects, so I don't think you need to worry about that much.

@HeartSaVioR
Copy link
Contributor

OK great. Could you subscribe dev@ mailing list? I occasionally initiate discussion from there (multilang, too) so you might want to have a talk regarding to multilang.

@dan-blanchard
Copy link
Contributor

Yup. I subscribed yesterday.
On Wed, Jun 22, 2016 at 8:35 PM Jungtaek Lim notifications@github.com
wrote:

OK great. Could you subscribe dev@ mailing list? I occasionally initiate
discussion from there (multilang, too) so you might want to have a talk
regarding to multilang.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#1136 (comment), or mute
the thread
https://github.com/notifications/unsubscribe/AA7l2aR0yXa0BnItmZom7dzA_50avEOzks5qOdTCgaJpZM4HfaQG
.

@ptgoetz
Copy link
Member

ptgoetz commented Dec 15, 2016

@vesense Any update on this? It looks like pystorm/streamparse are waiting for this to be merged.

cc @dan-blanchard

@mazhechao
Copy link

And another problem is that it is required to specify the encoding character when using JSON serializing, default is utf-8 . It will raise exception if the object contains non-utf8 characters. Will MessagePackSerializer solve this problem?

@mazhechao
Copy link

mazhechao commented Dec 29, 2016

In my opinion, I prefer to have this in multilang module rather than core. Upgrading storm cluster is not convenient.

@dan-blanchard
Copy link
Contributor

Will MessagePackSerializer solve this problem?

It will if this code is updated to use the latest version of the msgpack-core library. As I mentioned in a previous comment:

This would be much more useful if it were implemented using a newer version of the msgpack-core library—they changed the name after 0.6.12—because 0.7 and above supports the BINARY format, which lets you send arbitrary bytes. Without that, you won't be able to send tuples containing arbitrary bytes with this serializer. This is also a problem with the JSON serializer (because JSON strings can't contain non-Unicode characters), but it would be great if we didn't have the problem here.

@mazhechao
Copy link

Thats great. I hope this pull request can be accepted and released soon.

@mazhechao
Copy link

So is there any update about merging this pull request?

@amontalenti
Copy link
Contributor

Was chatting with @roshannaik and @dan-blanchard today, and this PR came up.

Someone on Storm team may benefit from taking a look at this as part of the performance revisions being done for Storm 2.0. As mentioned in #428 in streamparse -- the community project for running Python multi-lang topologies with Storm -- getting this merged somewhere in the Storm codebase would open up the possibility to switch serializer from Kryo & JSON to msgpack throughout, which would speed up multi-lang use cases considerably. This PR includes a pure Java implementation of a msgpack serializer, as well as pointers to the right msgpack library in the Java community; it just needs to be reviewed, tested, and merged.

@rzo1 rzo1 closed this Oct 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants