Fix optimizer parameter buffer allocation size. #2855
Conversation
The buffer allocation size should be number of bytes, not number of floats.
return fmt.Errorf("Name: %s, parameter and gradient does not have same content len, parameter: %d, gradient: %d", g.Name, o.contentLen, len(g.Content))
}

r := C.paddle_update_parameter(o.opt, C.paddle_element_type(g.ElementType), unsafe.Pointer(&g.Content[0]), C.int(len(g.Content)))
In optimizer/optimizer.cc, Content is cast to a float array, and the length (num_bytes) is passed on to Tensor.h:
int paddle_update_parameter(paddle_optimizer* o,
                            const paddle_element_type data_type,
                            const void* grad_buffer,
                            int num_bytes) {
  auto grad_type = reinterpret_cast<const float*>(grad_buffer);
  Tensor* gradient = new Tensor(const_cast<float*>(grad_type), num_bytes);
In optimizer/Tensor.h:
TensorT(T* data, size_t size)
    : height_(1), width_(size), data_ptr_(nullptr), data_(data) {}
num_bytes is used as the width, which means the count of float numbers, so it is not bytes. Maybe we should rename the paddle_update_parameter parameter num_bytes to array_size.
@typhoonzero helin: move the size conversion to the tensor side; maybe it is clearer to put it next to the type casting.
num_bytes is for const void* grad_buffer, which means the length of the byte buffer. Then paddle_update_parameter needs to calculate the array_size based on the type of the element. That's why I changed new TensorT(buffer, num_bytes) to new Tensor(buffer, num_bytes/sizeof(float)) here.
I think it's paddle_update_parameter's responsibility to determine the meaning and element size of the buffer, not the caller's.
I'd like to know what you guys think. @typhoonzero @dzhwinter
Sorry, I didn't see the change below.
p := paramWithConfigs.Param
c := paramWithConfigs.Config
s := State
- paramBufferSize := C.size_t(len(p.Content) / C.sizeof_float)
+ paramBufferSize := C.size_t(len(p.Content))
Well, the malloc size indeed must be the length of p.Content; the type is unsigned char*, which means a byte buffer.
Related: #2859
if i == numGroups-1 {
	gs = grads[i*paramPerGroup:]
} else {
	gs = grads[i*paramPerGroup : (i+1)*paramPerGroup]
Just curious: if we send grads with sizes like 0, 10000, 0, 10000 (a sawtooth), compared with grouping similar sizes together, like 0, 0, 0, 10000, 10000, does it hurt performance?
Great question. This is for testing concurrent send; I think the system performs best when the traffic to each parameter server is of similar size.
The current implementation sends one entire parameter (no matter how big it is) to one parameter server. In the near future (a TODO item) we will optimize this by partitioning each big parameter into smaller chunks and sharding them across different parameter servers. That way, the traffic to each parameter server is of similar size.
p.Name = "p_" + strconv.Itoa(i)
p.ElementType = pserver.Float32
p.Content = make([]byte, (i+1)*100)
err := c.InitParam(pserver.ParameterWithConfig{Param: p, Config: config})
Do we need multiple client instances to simulate multiple clients initializing parameters at the same time?
Great idea! I will send a follow-up PR for this one.
LGTM!
LGTM++
The buffer allocation size should be the number of bytes, not the number of floats.
Thanks @reyoung for noticing the bug!
Fixes: #2854
Fixes: #2858