
Fix optimizer parameter buffer allocation size. #2855

Merged
2 commits into PaddlePaddle:develop on Jul 14, 2017

Conversation

@helinwang (Contributor) commented Jul 13, 2017:

The buffer allocation size should be the number of bytes, not the number of floats.

Thanks @reyoung for noticing the bug!

Fixes: #2854
Fixes: #2858

@helinwang requested a review from dzhwinter July 13, 2017 19:50
return fmt.Errorf("Name: %s, parameter and gradient does not have same content len, parameter: %d, gradient: %d", g.Name, o.contentLen, len(g.Content))
}

r := C.paddle_update_parameter(o.opt, C.paddle_element_type(g.ElementType), unsafe.Pointer(&g.Content[0]), C.int(len(g.Content)))
Contributor commented:

In optimizer/optimizer.cc, Content is cast to a float array, and the length (num_bytes) is passed to Tensor.h:

int paddle_update_parameter(paddle_optimizer* o,
                            const paddle_element_type data_type,
                            const void* grad_buffer,
                            int num_bytes) {
  auto grad_type = reinterpret_cast<const float*>(grad_buffer);
  Tensor* gradient = new Tensor(const_cast<float*>(grad_type), num_bytes);

In optimizer/Tensor.h:

TensorT(T* data, size_t size)
      : height_(1), width_(size), data_ptr_(nullptr), data_(data) {}

num_bytes is used as the width, which means the count of float values, so it is not a byte count.

Maybe we should rename the paddle_update_parameter parameter num_bytes to array_size.
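
For illustration, here is a small standalone Go sketch (not PaddlePaddle code) of the mismatch described above: a buffer of 1000 float32 values is 4000 bytes, so using the byte count as the tensor width would make the tensor four times too wide.

package main

import (
    "fmt"
    "unsafe"
)

func main() {
    const numFloats = 1000
    // The gradient arrives as a raw byte buffer of float32 values.
    content := make([]byte, numFloats*int(unsafe.Sizeof(float32(0))))

    numBytes := len(content) // 4000
    // Using numBytes directly as the width would claim 4000 floats;
    // the element count has to divide by the element size.
    width := numBytes / int(unsafe.Sizeof(float32(0))) // 1000

    fmt.Println(numBytes, width)
}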

Contributor commented:

@typhoonzero helin, maybe move the size conversion to the tensor side; it may be clearer to put it near the type casting.

@helinwang (Author) commented Jul 14, 2017:

num_bytes is the length of the byte buffer behind const void* grad_buffer. paddle_update_parameter then needs to calculate array_size based on the element type. That's why I changed new TensorT(buffer, num_bytes) to new Tensor(buffer, num_bytes/sizeof(float)) here.
I think it's paddle_update_parameter's responsibility to determine the meaning and element size of the buffer, not the caller's.
What do you think? @typhoonzero @dzhwinter
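
A minimal Go sketch of that convention, with hypothetical names rather than the actual pserver types: the caller deals only in bytes, and the callee derives the element count from the element type.

package main

import "fmt"

// ElementType stands in for the pserver element type; the constants
// below are illustrative, not the real PaddlePaddle values.
type ElementType int

const (
    Float32 ElementType = iota
    Float64
)

// elementCount derives the number of elements from the byte length of
// the buffer, so callers never need to know the element size.
func elementCount(numBytes int, t ElementType) int {
    switch t {
    case Float32:
        return numBytes / 4
    case Float64:
        return numBytes / 8
    }
    return 0
}

func main() {
    fmt.Println(elementCount(4000, Float32)) // 1000
    fmt.Println(elementCount(4000, Float64)) // 500
}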

Contributor commented:

Sorry, I didn't see the change below.

p := paramWithConfigs.Param
c := paramWithConfigs.Config
s := State
- paramBufferSize := C.size_t(len(p.Content) / C.sizeof_float)
+ paramBufferSize := C.size_t(len(p.Content))
Contributor commented:

Well, the malloc size indeed must be the length of p.Content; the type is unsigned char*, which means a byte buffer.
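
As a standalone cgo sketch (again, not the PaddlePaddle code), allocating and filling such a byte buffer looks like this; the allocation size is the byte length of the content, regardless of the element type.

package main

/*
#include <stdlib.h>
#include <string.h>
*/
import "C"

import (
    "fmt"
    "unsafe"
)

func main() {
    // On the Go side the parameter content is a []byte; the C side
    // sees it as unsigned char*, i.e. a plain byte buffer.
    content := []byte{0, 0, 128, 63, 0, 0, 0, 64} // two float32 values (1.0, 2.0), little-endian

    // The allocation size must be the byte length, not the float count.
    buf := C.malloc(C.size_t(len(content)))
    defer C.free(buf)
    C.memcpy(buf, unsafe.Pointer(&content[0]), C.size_t(len(content)))

    fmt.Println("allocated", len(content), "bytes")
}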

@typhoonzero (Contributor) commented:

Related: #2859

if i == numGroups-1 {
gs = grads[i*paramPerGroup:]
} else {
gs = grads[i*paramPerGroup : (i+1)*paramPerGroup]
Contributor commented:

Just curious: if we send grads with sizes like 0, 10000, 0, 10000 (a sawtooth), compared with grouping similar sizes together, like 0, 0, 0, 10000, 10000, does it hurt performance?

@helinwang (Author) commented Jul 14, 2017:

Great question. This is for testing concurrent sends; I think the system performs best when the traffic to each parameter server is of similar size.
The current implementation sends one entire parameter (no matter how big it is) to one parameter server. In the near future (a TODO item) we will partition each big parameter into smaller chunks and shard them across different parameter servers, so that the traffic to each parameter server is of similar size.
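
A minimal Go sketch of that planned partitioning (a hypothetical helper, not the actual implementation): one large parameter buffer is split into roughly equal chunks, one per parameter server, so each server receives a similar amount of traffic.

package main

import "fmt"

// chunkParam splits one parameter buffer into roughly equal chunks,
// one per parameter server.
func chunkParam(content []byte, numPServers int) [][]byte {
    chunkSize := (len(content) + numPServers - 1) / numPServers
    chunks := make([][]byte, 0, numPServers)
    for start := 0; start < len(content); start += chunkSize {
        end := start + chunkSize
        if end > len(content) {
            end = len(content)
        }
        chunks = append(chunks, content[start:end])
    }
    return chunks
}

func main() {
    p := make([]byte, 10000)
    for i, c := range chunkParam(p, 3) {
        fmt.Printf("chunk %d: %d bytes\n", i, len(c))
    }
}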

p.Name = "p_" + strconv.Itoa(i)
p.ElementType = pserver.Float32
p.Content = make([]byte, (i+1)*100)
err := c.InitParam(pserver.ParameterWithConfig{Param: p, Config: config})
Contributor commented:

Do we need multiple client instances to simulate multiple clients initializing parameters at the same time?

@helinwang (Author) commented:

Great idea! I will send a follow-up PR for this.
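
A minimal sketch of that follow-up idea, using a hypothetical initParams stand-in rather than the real pserver client API: several clients initialize parameters concurrently.

package main

import (
    "fmt"
    "sync"
)

// initParams is a hypothetical stand-in for a pserver client calling
// InitParam; it only illustrates running several clients concurrently.
func initParams(clientID int) error {
    fmt.Printf("client %d: init parameters\n", clientID)
    return nil
}

func main() {
    const numClients = 3
    var wg sync.WaitGroup
    for i := 0; i < numClients; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            if err := initParams(id); err != nil {
                fmt.Printf("client %d: %v\n", id, err)
            }
        }(i)
    }
    wg.Wait()
}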

@typhoonzero (Contributor) left a comment:

LGTM!

@dzhwinter (Contributor) left a comment:

LGTM++

@helinwang merged commit f49fda5 into PaddlePaddle:develop Jul 14, 2017
@helinwang deleted the issue/2854 branch July 14, 2017 03:31