Alternative data format for matrix shards storage #5
Comments
Thanks @5kg for sharing. This is interesting. |
Well, I was also thinking about this when I wanted to use pb to send a 1GB snapshot. I asked the designer of protobuf and the person who implemented go-protobuf. They told me "never mind, you can do it actually". Then I benchmarked it, and everything worked fine. I suggest you benchmark it first. I am not against using a compression mechanism. However, I am not sure why lmdb and leveldb (kv stores), csv and hdf5 (formats), and snappy (compression) are grouped together. A little confused. |
Hi, the original description is indeed confusing. Sorry for that. I think the data format for matrix shard storage and the data transfer between taskgraph nodes are two separate issues. Of course, if we find a data format that fits both, that will be great.

For data transfer, we may take a look at flatbuffers or Cap’n Proto. Both feature "access to serialized data without parsing/unpacking", which might be favorable for our usage (transferring huge float arrays). I have not used either library before. We need to do some additional research and benchmarking before adopting them. Or maybe just wait until protobuf becomes the bottleneck.

For data storage, there is a 2 GB hard limit on the size of a protobuf message, since I think they encode sizes with a 32-bit integer. This limit has been affecting people who ignore the "limit your message size to 1 MB" rule of thumb -- BVLC/caffe#2006 😄 . The most commonly used data formats for storage are csv and hdf5. Some machine learning libraries like caffe also use lmdb/leveldb for data storage. Their use cases for a kv store are mostly fetching independent data blobs such as images, which I think is irrelevant for bwmf. |
In term of Not a priority for now. |
Several things to think about: if you only access the data once, it would, intuitively and in theory, be slower than encoding/decoding in bulk. It really depends on your data access pattern. If you can provide the common access pattern of bwmf, we can make a better decision.
Reference? On a 64-bit machine, I do not think there is actually a 2GB limit. |
If you use flatbuffers or Cap’n Proto, one nice thing you get for free is that the data is mmap-able. Thus you do not need to save it into any kv store. |
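To make the "access without parsing" idea concrete, here is a minimal Go sketch. It does not use the flatbuffers or Cap’n Proto APIs; it assumes a hypothetical fixed layout of consecutive little-endian float32 values, and `encodeFloats` and `floatAt` are illustrative names, not part of any library. The point is only that with a fixed layout an element can be read at a computed offset from the raw bytes (which could just as well come from an mmap-ed file) without decoding the whole buffer first.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// encodeFloats writes vals as consecutive little-endian float32 values.
// The layout is an assumption made for this sketch, not a real wire format.
func encodeFloats(vals []float32) []byte {
	buf := make([]byte, 4*len(vals))
	for i, v := range vals {
		binary.LittleEndian.PutUint32(buf[4*i:], math.Float32bits(v))
	}
	return buf
}

// floatAt reads element i directly from the serialized bytes.
// No intermediate []float32 is built, so nothing else is decoded.
func floatAt(buf []byte, i int) float32 {
	return math.Float32frombits(binary.LittleEndian.Uint32(buf[4*i:]))
}

func main() {
	buf := encodeFloats([]float32{1.5, -2.25, 3})
	fmt.Println(floatAt(buf, 1)) // reads only bytes 4..7 of buf
}
```

Whether this beats bulk decoding still depends on the access pattern, as noted above.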
I think grpc will support snappy eventually. So you do not need to worry about it for communication. |
They use int explicitly: https://github.com/google/protobuf/blob/4644f99d1af4250dec95339be6a13e149787ab33/src/google/protobuf/message_lite.cc#L243 |
Could we write a test case to try the 2GB limit? |
@5kg int on 32bit = 32 bit; int on 64bit = 64bit. int32=32; int64=64. I may be wrong though... Using go too much. |
That's c++ |
For C++, sizeof(int) = 4 on a 64-bit machine. Maybe it's only a limit for C++? Let me write a test for it. |
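The arithmetic behind the suspected limit can be sketched in Go. This only illustrates what a signed 32-bit byte count can represent; it is not protobuf's actual size check, and `fitsInt32Size` is a hypothetical helper. On common 64-bit platforms a C++ `int` is 32 bits wide, while Go's `int` is 64 bits, which is likely the source of the confusion above.

```go
package main

import (
	"fmt"
	"math"
)

// maxInt32Bytes is the largest byte count a signed 32-bit integer can hold:
// 2147483647, one byte short of 2 GiB.
const maxInt32Bytes = math.MaxInt32

// fitsInt32Size reports whether n elements of elemSize bytes each
// stay within a signed 32-bit byte count. Hypothetical helper for illustration.
func fitsInt32Size(n, elemSize int64) bool {
	return n*elemSize <= maxInt32Bytes
}

func main() {
	fmt.Println(maxInt32Bytes)
	fmt.Println(fitsInt32Size(500000000, 4)) // 2.0e9 bytes of float32 data
	fmt.Println(fitsInt32Size(600000000, 4)) // 2.4e9 bytes of float32 data
}
```

Under this assumption, a shard of float32 values would stop fitting somewhere past roughly 536 million elements.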
Great. Thanks @5kg ! |
@fengjingchao @xiang90 Marshaling/unmarshaling seems to be OK. But I found that memory usage blew up during the process. It uses as much as 20g of memory when unmarshaling a 2.8g protobuf file. |
@5kg Have you tried gogoproto? I think that is the problem with the inefficient goproto library... |
@xiang90 Never heard about All I need to do is change the Before:
After:
|
I also changed |
Can you upload the code to gist with the commands too? On Wed, Jun 17, 2015 at 10:44 PM, Zifei Tong notifications@github.com
- Hongchao Deng |
@fengjingchao Sure. Please see https://gist.github.com/5kg/f27a32f238c376635024/revisions You can clone the gist then checkout an older version. |
@5kg gogoproto will generate marshal code, which is significantly better than reflection based marshaling. |
No luck. It's faster but memory usage is still the same.
Updated gist: https://gist.github.com/5kg/f27a32f238c376635024 |
@5kg Weird... Can you try using Go's runtime pkg to print out the memstats? |
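A minimal sketch of what printing memstats around a suspicious step could look like, using `runtime.ReadMemStats` from the standard library. The `heapSnapshot` helper and the 4 MiB slice standing in for the unmarshaled shard are assumptions for illustration; the real measurement would wrap the unmarshal call from the gist.

```go
package main

import (
	"fmt"
	"runtime"
)

// heapSnapshot returns the bytes of currently allocated heap objects and
// the cumulative bytes allocated on the heap since the program started.
func heapSnapshot() (heapAlloc, totalAlloc uint64) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc, m.TotalAlloc
}

func main() {
	_, before := heapSnapshot()

	// Stand-in for the step under investigation (e.g. unmarshaling a shard).
	buf := make([]float32, 1<<20)

	heap, after := heapSnapshot()
	fmt.Printf("step allocated %d bytes in total, heap now %d bytes, len=%d\n",
		after-before, heap, len(buf))
}
```

Comparing the cumulative allocation against the size of the input file gives a rough idea of how many temporary copies the decoder makes.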
@5kg I can help you to debug this when I have time too. |
@5kg Also you can quickly scan the generated code. It should be very easy to understand. |
Is it because there is a map in the proto message? Xiaoyun On Wed, Jun 17, 2015 at 11:35 PM, Zifei Tong notifications@github.com
|
Quote 1, 2, 3:
Options:
We can probably use snappy for data compression.