
[Dynet-92]. Multi-device support #704

Merged: 46 commits merged into clab:master on Aug 10, 2017

Conversation

@xunzhang (Collaborator) commented Jul 17, 2017

  • Refactor DyNet to support multiple devices more cleanly.
  • Honor the --dynet-devices argument.
  • Implement interfaces (partly) for specifying the device as an argument when defining an expression, or for using cg.change_expr_device before defining an expression.
  • Implement memcpy between devices in the forward pass.
  • Test forward in hybrid CPU/GPU mode with the basic expression: V * tanh(affine_transform(b, W, x)) + a.
  • Implement memcpy between devices in the backward pass.
  • Test backward in hybrid CPU/GPU mode with the basic expression: V * tanh(affine_transform(b, W, x)) + a.
  • Add support for a fall-back-to-CPU mechanism when an operation has no GPU implementation yet.
  • Debug the hang issue when using multiple GPUs.
  • Add more feature tests.

Original Usage:

./a.out --dynet-devices CPU,GPU:0,GPU:1

int main(int argc, char *argv[])
{
  dynet::initialize(argc, argv);

  for (iter) {
    ComputationGraph cg(dynet::devices_map["GPU:0"]); // default device if not specified
    Expression W = parameter(cg, p_W, dynet::devices_map["CPU"]);
    Expression b = parameter(cg, p_b); // default: GPU:0
    Expression x = input(cg, {2}, x_values); // default: GPU:0
    cg.change_expr_device(dynet::devices_map["GPU:1"]); // change default device for future expressions
    Expression h = tanh(affine_transform({b, W, x})); // resides on GPU:1
    
    Expression last = ...;
    cg.forward(last);
    cg.backward(last);
    // update
  }
  return 0;
}

Modified Usage:

./a.out --dynet-devices CPU,GPU:0,GPU:1

int main(int argc, char *argv[])
{
  dynet::initialize(argc, argv);

  for (iter) {
    ComputationGraph cg;
    Expression W = parameter(cg, p_W, dynet::devices_map["GPU:0"]);
    Expression b = parameter(cg, p_b); // defaults to p_b's device (GPU:0)
    Expression x = input(cg, {2}, x_values, dynet::devices_map["CPU"]);
    Expression x_2 = to_device(x, dynet::devices_map["GPU:0"]);
    Expression h = affine_transform({b, W, x_2}); // defaults to b's device (GPU:0)
    Expression h_2 = to_device(h, dynet::devices_map["CPU"]);
    Expression v = tanh(h_2); // defaults to h_2's device (CPU); suppose tanh has no CUDA impl in this case
    
    Expression last = ...;
    cg.forward(last);
    cg.backward(last);
    // update
  }
  return 0;
}

To reviewer @neubig: you can run a quick test using the code below:

// usage: ./a.out --dynet-devices CPU,GPU:0

#include <iostream>
#include "dynet/dynet.h"
#include "dynet/training.h"
#include "dynet/expr.h"
#include "dynet/io.h"
#include "dynet/model.h"
#include "dynet/devices.h"

using namespace std;
using namespace dynet;

int main(int argc, char** argv) {
  dynet::initialize(argc, argv);

  const unsigned ITERATIONS = 30; 

  // ParameterCollection (all the model parameters).
  ParameterCollection m;
  SimpleSGDTrainer sgd(m);

  const unsigned HIDDEN_SIZE = 8;
  Parameter p_W = m.add_parameters({HIDDEN_SIZE, 2});
  Parameter p_b = m.add_parameters({HIDDEN_SIZE});
  Parameter p_V = m.add_parameters({1, HIDDEN_SIZE});
  Parameter p_a = m.add_parameters({1});
  if (argc == 2) {
    // Load the model and parameters from file if given.
    TextFileLoader loader(argv[1]);
    loader.populate(m);
  }

  // Static declaration of the computation graph.
  ComputationGraph cg; 
  Expression W = parameter(cg, p_W);
  Expression b = parameter(cg, p_b);
  Expression V = parameter(cg, p_V);
  Expression a = parameter(cg, p_a);

  // Set x_values to change the inputs to the network.
  vector<dynet::real> x_values(2);
  Expression x = input(cg, {2}, &x_values);
  dynet::real y_value;  // Set y_value to change the target output.
  Expression y = input(cg, &y_value);

  Expression aa = W * x + b;                                        // pre-activation, computed on the parameters' device
  Expression hhh = to_device(aa, dynet::get_global_device("CPU"));  // copy the pre-activation to the CPU
  Expression h = tanh(hhh);                                         // tanh runs on the CPU copy
  Expression hh = to_device(h, dynet::get_global_device("GPU:0"));  // copy the activation back to GPU:0
  Expression y_pred = V * hh + a;
  Expression loss_expr = squared_distance(y_pred, y);

  // Show the computation graph, just for fun.
  cg.print_graphviz();

  // Train the parameters.
  for (unsigned iter = 0; iter < ITERATIONS; ++iter) {
    double loss = 0;
    for (unsigned mi = 0; mi < 4; ++mi) {
      bool x1 = mi % 2;
      bool x2 = (mi / 2) % 2;
      x_values[0] = x1 ? 1 : -1; 
      x_values[1] = x2 ? 1 : -1; 
      y_value = (x1 != x2) ? 1 : -1; 

      loss += as_scalar(cg.forward(loss_expr));
      cg.backward(loss_expr);
      sgd.update();

    }
    loss /= 4;
    cerr << "E = " << loss << endl;
  }

  // Output the model and parameter objects to a file.
  TextFileSaver saver("/tmp/xor.model");
  saver.save(m);
}

@xunzhang changed the title from "Dynet 92 model parallelism" to "Dynet 92 Multi-device support" on Jul 17, 2017
@xunzhang changed the title from "Dynet 92 Multi-device support" to "[WIP] [Dynet-92]. Multi-device support" on Jul 17, 2017
@xunzhang mentioned this pull request on Jul 20, 2017
@neubig (Contributor) commented Jul 20, 2017

In general, this is great: I think multi-device support will be a great feature for DyNet to have. First, I have a high-level comment. In my mind, there are two design decisions here:

How do we specify the "default" device of a graph node when it is not specified explicitly?

  1. Current Implementation: A default is passed to ComputationGraph, and the default is used.
  2. Alternatively, we could have the node default to the device of its first argument.

The first has the advantage of perhaps being easier to understand, but may result in hidden memory moves where people aren't expecting them. It also adds some code complexity.

When some of the inputs are not on the same device, what do you do?

  1. Current Implementation: The ExecutionEngine is responsible for moving memory.

  2. The ExecutionEngine throws an error, telling the user to move the memory themselves (using something like dy.change_device(x, device)).

  3. A combination of 1. and 2., where 2. is on by default, but 1. can be chosen.

Options 1 and 3 have the advantage of not crashing, but they also have the potential to hide memory moves that the user really wouldn't want to be doing. (For example, in the example code, the weight matrix would be passed from CPU to GPU every time it was used, which would be really, really bad.) Option 2 has the advantage of preventing this, but may result in a slightly increased coding burden.

My opinion: I tend to prefer 2./2. respectively, but could be convinced otherwise.
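
For concreteness, a minimal sketch of the difference between 1. and 2. for the second decision, written against the interfaces shown in the PR description above (devices_map, to_device, and the device-taking parameter/input overloads); this is an illustration of the two behaviors, not the final API:

// Assumes ./a.out --dynet-devices CPU,GPU:0, the includes from the quick-test
// example above, and a Parameter p_W already added to a ParameterCollection.
void cross_device_sketch(dynet::ComputationGraph& cg, dynet::Parameter p_W,
                         std::vector<dynet::real>* x_values) {
  using namespace dynet;
  Expression W = parameter(cg, p_W, devices_map["CPU"]);          // weights live on the CPU
  Expression x = input(cg, {2}, x_values, devices_map["GPU:0"]);  // input lives on GPU:0

  // Option 1: the ExecutionEngine silently copies W to GPU:0 on every forward()
  // call -- convenient, but it hides a per-iteration CPU->GPU transfer.
  // Expression y = W * x;

  // Option 2: mixing devices is an error; the user moves the memory explicitly,
  // so the (potentially expensive) copy is visible in the code.
  Expression W_gpu = to_device(W, devices_map["GPU:0"]);
  Expression y = W_gpu * x;
}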

@yoavg (Contributor) commented Jul 20, 2017

This is really great!

I also like options 2 and 2, but would like to propose a variation of the second one:

The name dy.change_device(x, device) is a bit confusing IMO, as we are not so much changing the device of x as copying x to another device (and x can still be used afterwards, on its original device). So I propose to change @neubig's proposed interface slightly to:
Expression y = x.to_device(device)

letting both x and y be used.

Another proposal (maybe it's already there, I didn't look at the code) is to also allow multiple CPU devices. There, the copies would be no-ops, but we could still run different CPU devices as different threads.

How do things look in terms of synchronization in the current implementation?

@xunzhang (Collaborator, Author) commented:

Very helpful comments!! I also prefer 2./2. then. And I think supporting both a copy-like x.to_device(device) and a move-like x.change_device(device) is necessary. But the to_device interface is somewhat hard to implement, since there will be some discrete VariableIndex indices and we would need to refactor the executor code to support it. I think I will first finish change_device and then implement to_device.

Currently, we don't support specifying a CPU id. I will think about that in the near future.

@xunzhang (Collaborator, Author) commented:

My remarks about to_device and change_device in my last comment might be incorrect. Basically, I think to_device is a copy-like operation which creates an additional node, while change_device would change the device assignment of an existing expression (I'm not sure whether that semantic is useful or not).

@neubig (Contributor) commented Jul 22, 2017

@xunzhang Yes, to_device is an operation that will create a new node (whose memory is stored on a different device than its input node). I don't think we should have a function to change the device of a particular node, for the reasons you mentioned: it would complicate things and require special handling in the executor. Regarding Yoav's comment about having multiple CPU devices, I'm not sure that this is necessary. In order to do things on multiple threads, we'll need to have a multi-threaded execution engine anyway, so we can probably have the multi-threaded execution engine perform multiple operations using the same CPU device. Let's save this discussion for a later commit, when we tackle multi-threading the execution engine.
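
As a small illustration of those semantics (a sketch only, assuming the to_device / devices_map interfaces from this PR and a Parameter p_h already added to the model): the copy adds a new node on the target device, and the original expression stays usable on its own device.

void to_device_sketch(dynet::ComputationGraph& cg, dynet::Parameter p_h) {
  using namespace dynet;
  Expression h     = parameter(cg, p_h, devices_map["GPU:0"]);  // resides on GPU:0
  Expression h_cpu = to_device(h, devices_map["CPU"]);          // copy: a new node whose memory lives on the CPU
  Expression a = tanh(h_cpu);   // computed on the CPU copy
  Expression b = h + h;         // the original h is untouched and still usable on GPU:0
}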

@xunzhang (Collaborator, Author) commented:

@neubig Right, cool. I will finish this soon.

@xunzhang (Collaborator, Author) commented Jul 24, 2017

This pull request is review-ready. It does not affect the existing code or interfaces, and I will split the remaining work into future pull requests.

The remaining work includes:

  1. Fix the remaining places that hard-code default_device.
  2. Fix the multi-GPU hang bug: it is not introduced by this pull request and should be addressed together with multi-device support elsewhere.
  3. Python interface: I believe the code in this pull request will not break current usage.
  4. Add tests, refactor the failing GPU unit tests, and update the documentation.

@xunzhang changed the title from "[WIP] [Dynet-92]. Multi-device support" to "[Dynet-92]. Multi-device support" on Jul 24, 2017
@neubig (Contributor) left a review

Thanks, this is great! I have a bunch of small comments, but once they're resolved and I can confirm that this works in my environment, I think we can merge. Also, some of my comments might just be oversights, so if there's anything that you don't think needs to be fixed, just tell me.

dynet/exec.cc Outdated
@@ -122,6 +122,10 @@ const Tensor& SimpleExecutionEngine::incremental_forward(VariableIndex i) {
}
node->aux_mem = aux_mem;

// check consistent device
for (auto & xs_v : xs) {
DYNET_ASSERT(xs_v->device == nfxs[num_nodes_evaluated].device, "Attemp to do tensor forward in different devices");

Attemp -> Attempt. Similarly for all below.

dynet/exec.cc Outdated
@@ -245,15 +254,17 @@ void BatchedExecutionEngine::combine_tensors(std::vector<VariableIndex> batch_id
const size_t sz = node2size[id];

float* my_src = batches[node2batch[id]].nfx.v + node2offset[id];
if (tout.device->type == DeviceType::CPU) {
memcpy(dest, my_src, sz * sizeof(float));
} else {

For this "else", I think we should add a tout.device->type == DeviceType::GPU, then throw an error if we get a device other than the one we're expecting. We might add other device types later, and if we do this logic will break (and it's in a somewhat inconspicuous place, so we should make sure that it's easy to catch through an error).

dynet/expr.cc Outdated
@@ -44,15 +43,17 @@ Expression operator+(const Expression& x, const Expression& y) {
else if (y.dim().batch_size() == 1)
return Expression(x.pg, x.pg->add_function<ScalarAdd>({x.i, y.i}));
else
return Expression(x.pg, x.pg->add_function<Sum>({x.i, y.i}));
return Expression(x.pg, x.pg->add_function<Sum>({x.i, y.i}, x.pg->nodes[x.i]->device));

This shouldn't need a "device" argument, right? It can just inherit its device from its inputs.
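
For illustration, a hedged sketch of what "inherit its device from its inputs" could look like in the graph-construction path; the helper below is hypothetical and simplified, not the actual add_function code, and it assumes the Node::device member and dynet::default_device global used elsewhere in this PR:

// Hypothetical helper: pick a node's device at construction time.
// Nodes with arguments inherit the device of their first input; only
// argument-less nodes (inputs, parameters) need a device spelled out.
dynet::VariableIndex add_node_sketch(dynet::ComputationGraph& cg, dynet::Node* node,
                                     const std::vector<dynet::VariableIndex>& args,
                                     dynet::Device* explicit_device = nullptr) {
  if (explicit_device != nullptr)
    node->device = explicit_device;                 // caller asked for a specific device
  else if (!args.empty())
    node->device = cg.nodes[args.front()]->device;  // inherit from the first input
  else
    node->device = dynet::default_device;           // fall back to the global default
  cg.nodes.push_back(node);
  return dynet::VariableIndex(cg.nodes.size() - 1);
}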

dynet/expr.cc Outdated
@@ -76,7 +77,10 @@ Expression contract3d_1d(const Expression& x, const Expression& y, const Express
Expression sqrt(const Expression& x) { return Expression(x.pg, x.pg->add_function<Sqrt>({x.i})); }
Expression abs(const Expression& x) { return Expression(x.pg, x.pg->add_function<Abs>({x.i})); }
Expression erf(const Expression& x) { return Expression(x.pg, x.pg->add_function<Erf>({x.i})); }
Expression tanh(const Expression& x) { return Expression(x.pg, x.pg->add_function<Tanh>({x.i})); }
Expression tanh(const Expression& x, Device *device) {

Similarly, we should remove the device argument from this and all others below, except the ones that deal with input, or explicitly changing devices.

dynet/expr.h Outdated
const bool is_stale() const {return (get_number_of_active_graphs() != 1 || graph_id != get_current_graph_id());}
Expression() : pg(nullptr), i(0), graph_id(0) {}

Expression(Device *device) : pg(nullptr), i(0), graph_id(0) {}

This is identical to the empty constructor, so we should delete it.

@@ -19,7 +21,7 @@ struct Transpose : public Node {
// y = inv(x)
// x = an invertible matrix
struct MatrixInverse : public Node {
explicit MatrixInverse(const std::initializer_list<VariableIndex>& a) : Node(a) {}
explicit MatrixInverse(const std::initializer_list<VariableIndex>& a, Device *device) : Node(a, device) {}

Again, no device here.


namespace dynet {

// y = x_1 * x_2
struct MatrixMultiply : public Node {
explicit MatrixMultiply(const std::initializer_list<VariableIndex>& a) : Node(a) {}
explicit MatrixMultiply(const std::initializer_list<VariableIndex>& a,

No device here.


namespace dynet {

// y = tanh x_1
struct Tanh : public Node {
explicit Tanh(const std::initializer_list<VariableIndex>& a) : Node(a) {}
explicit Tanh(const std::initializer_list<VariableIndex>& a, Device *device) : Node(a, device) {}

No device here.

dynet/tensor.cc Outdated
@@ -94,11 +95,13 @@ float TensorTools::access_element(const Tensor& v, int index) {
}

float TensorTools::access_element(const Tensor& v, const Dim& index) {
if (v.device->type == DeviceType::CPU) {
return (*v)(index[0], index[1]);
} else {
#if HAVE_CUDA

Similarly here, I think we should check if the device is GPU, and throw an error if we get an unknown device.

dynet/tensor.cc Outdated
cudaMemcpyAsync(v.v, v_src.v, sizeof(real) * v.d.size(), cudaMemcpyDeviceToHost);
#endif
}
} else {

Similarly here, check if device is GPU explicitly.

@neubig merged commit 5903c85 into clab:master on Aug 10, 2017
@neubig (Contributor) commented Aug 10, 2017

Thanks! I confirmed that this is working as expected, so I merged. This is great to have :)

@duyvuleo (Contributor) commented:

Is it actually working? I tried the ./examples/train_xor-multidevice example and got the following error:

terminate called after throwing an instance of 'std::runtime_error'
what(): Invalid device name: GPU:0
Aborted (core dumped)

@neubig (Contributor) commented Aug 17, 2017

The documentation isn't finished yet, but I think you need to add --dynet-devices CPU,GPU:0 to the command line.

@duyvuleo (Contributor) commented:

It works. Thanks!

@xunzhang mentioned this pull request on Aug 25, 2017