
[Dynet-92]. Multi-device support #704

Merged: 46 commits merged into clab:master on Aug 10, 2017

Conversation

@xunzhang (Collaborator) commented Jul 17, 2017

  • Refactor DyNet to support multiple devices more cleanly.
  • Honor the --dynet-devices argument.
  • Implement interfaces (partly) for specifying the device as an argument when defining an expression, or for using cg.change_expr_device before defining an expression.
  • Implement memcpy between devices in the forward pass.
  • Test forward in hybrid CPU/GPU mode with the basic expression: V * tanh(affine_transform(b, W, x)) + a.
  • Implement memcpy between devices in the backward pass.
  • Test backward in hybrid CPU/GPU mode with the basic expression: V * tanh(affine_transform(b, W, x)) + a.
  • Add support for a fall-back-to-CPU mechanism when an operation has no GPU implementation yet.
  • Debug the hang issue when using multiple GPUs.
  • Add more feature tests.

Original Usage:

./a.out --dynet-devices CPU,GPU:0,GPU:1

int main(int argc, char *argv[])
{
  dynet::initialize(argc, argv);

  for (iter) {
    ComputationGraph cg(dynet::devices_map["GPU:0"]); // default device if not specified
    Expression W = parameter(cg, p_W, dynet::devices_map["CPU"]);
    Expression b = parameter(cg, p_b); // default: GPU:0
    Expression x = input(cg, {2}, x_values); // default: GPU:0
    cg.change_expr_device(dynet::devices_map["GPU:1"]); // change default device for future expressions
    Expression h = tanh(affine_transform({b, W, x})); // resides on GPU:1
    
    Expression last = ...;
    cg.forward(last);
    cg.backward(last);
    // update
  }
  return 0;
}

Modified Usage:

./a.out --dynet-devices CPU,GPU:0,GPU:1

int main(int argc, char *argv[])
{
  dynet::initialize(argc, argv);

  for (iter) {
    ComputationGraph cg;
    Expression W = parameter(cg, p_W, dynet::devices_map["GPU:0"]);
    Expression b = parameter(cg, p_b); // defaults to p_b's device (GPU:0)
    Expression x = input(cg, {2}, x_values, dynet::devices_map["CPU"]);
    Expression x_2 = to_device(x, dynet::devices_map["GPU:0"]);
    Expression h = affine_transform({b, W, x_2}); // defaults to b's device (GPU:0)
    Expression h_2 = to_device(h, dynet::devices_map["CPU"]);
    Expression v = tanh(h_2); // defaults to h_2's device (CPU); suppose tanh has no CUDA impl in this case
    
    Expression last = ...;
    cg.forward(last);
    cg.backward(last);
    // update
  }
  return 0;
}

To reviewer @neubig: you can run a quick test using the code below:

// usage: ./a.out --dynet-devices CPU,GPU:0

#include <iostream>
#include "dynet/dynet.h"
#include "dynet/training.h"
#include "dynet/expr.h"
#include "dynet/io.h"
#include "dynet/model.h"
#include "dynet/devices.h"

using namespace std;
using namespace dynet;

int main(int argc, char** argv) {
  dynet::initialize(argc, argv);

  const unsigned ITERATIONS = 30; 

  // ParameterCollection (all the model parameters).
  ParameterCollection m;
  SimpleSGDTrainer sgd(m);

  const unsigned HIDDEN_SIZE = 8;
  Parameter p_W = m.add_parameters({HIDDEN_SIZE, 2});
  Parameter p_b = m.add_parameters({HIDDEN_SIZE});
  Parameter p_V = m.add_parameters({1, HIDDEN_SIZE});
  Parameter p_a = m.add_parameters({1});
  if (argc == 2) {
    // Load the model and parameters from file if given.
    TextFileLoader loader(argv[1]);
    loader.populate(m);
  }

  // Static declaration of the computation graph.
  ComputationGraph cg; 
  Expression W = parameter(cg, p_W);
  Expression b = parameter(cg, p_b);
  Expression V = parameter(cg, p_V);
  Expression a = parameter(cg, p_a);

  // Set x_values to change the inputs to the network.
  vector<dynet::real> x_values(2);
  Expression x = input(cg, {2}, &x_values);
  dynet::real y_value;  // Set y_value to change the target output.
  Expression y = input(cg, &y_value);

  Expression aa = W * x + b;                                        // pre-activation, computed on the parameters' device
  Expression hhh = to_device(aa, dynet::get_global_device("CPU"));  // copy the pre-activation to the CPU
  Expression h = tanh(hhh);                                         // tanh runs on the CPU copy
  Expression hh = to_device(h, dynet::get_global_device("GPU:0"));  // copy the activation back to GPU:0
  Expression y_pred = V * hh + a;
  Expression loss_expr = squared_distance(y_pred, y);

  // Show the computation graph, just for fun.
  cg.print_graphviz();

  // Train the parameters.
  for (unsigned iter = 0; iter < ITERATIONS; ++iter) {
    double loss = 0;
    for (unsigned mi = 0; mi < 4; ++mi) {
      bool x1 = mi % 2;
      bool x2 = (mi / 2) % 2;
      x_values[0] = x1 ? 1 : -1; 
      x_values[1] = x2 ? 1 : -1; 
      y_value = (x1 != x2) ? 1 : -1; 

      loss += as_scalar(cg.forward(loss_expr));
      cg.backward(loss_expr);
      sgd.update();

    }
    loss /= 4;
    cerr << "E = " << loss << endl;
  }

  // Output the model and parameter objects to a file.
  TextFileSaver saver("/tmp/xor.model");
  saver.save(m);
}

@xunzhang changed the title from "Dynet 92 model parallelism" to "Dynet 92 Multi-device support" on Jul 17, 2017
@xunzhang changed the title from "Dynet 92 Multi-device support" to "[WIP] [Dynet-92]. Multi-device support" on Jul 17, 2017
@xunzhang mentioned this pull request on Jul 20, 2017
@neubig (Contributor) commented Jul 20, 2017

In general, this is great: I think multi-device support will be a great feature for DyNet to have. First, I have a high-level comment. In my mind, there are two design decisions here:

How do we specify the "default" device of a graph node when it is not specified explicitly?

  1. Current Implementation: A default is passed to ComputationGraph, and the default is used.
  2. Alternatively, we could have the node default to the device of its first argument.

The first has the advantage of perhaps being easier to understand, but may result in hidden memory moves where people aren't expecting them. It also adds some code complexity.

When some of the inputs are not on the same device, what do you do?

  1. Current Implementation: The ExecutionEngine is responsible for moving memory.

  2. The ExecutionEngine throws an error, telling the user to move the memory themselves (using something like dy.change_device(x, device)).

  3. A combination of 1. and 2., where 2. is on by default, but 1. can be chosen.

Options 1 and 3 have the advantage of not crashing, but they also have the potential to hide memory moves that the user really wouldn't want to be doing. (For example, in the example code, the weight matrix would be passed from CPU to GPU every time it was used, which would be really, really bad.) Option 2 has the advantage of preventing this, but may result in a slightly increased coding burden.

My opinion: I tend to prefer 2./2. respectively, but could be convinced otherwise.
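
For concreteness, a minimal sketch of the difference between 1. and 2. for the second decision, written against the interfaces shown in the PR description above (devices_map, to_device, and the device-taking parameter/input overloads); this is an illustration of the two behaviors, not the final API:

// Assumes ./a.out --dynet-devices CPU,GPU:0, the includes from the quick-test
// example above, and a Parameter p_W already added to a ParameterCollection.
void cross_device_sketch(dynet::ComputationGraph& cg, dynet::Parameter p_W,
                         std::vector<dynet::real>* x_values) {
  using namespace dynet;
  Expression W = parameter(cg, p_W, devices_map["CPU"]);          // weights live on the CPU
  Expression x = input(cg, {2}, x_values, devices_map["GPU:0"]);  // input lives on GPU:0

  // Option 1: the ExecutionEngine silently copies W to GPU:0 on every forward()
  // call -- convenient, but it hides a per-iteration CPU->GPU transfer.
  // Expression y = W * x;

  // Option 2: mixing devices is an error; the user moves the memory explicitly,
  // so the (potentially expensive) copy is visible in the code.
  Expression W_gpu = to_device(W, devices_map["GPU:0"]);
  Expression y = W_gpu * x;
}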

@yoavg (Contributor) commented Jul 20, 2017

This is really great!

I also like options 2 and 2, but would like to propose a variation of the second one:

The name dy.change_device(x, device) is a bit confusing IMO, as we are not so much changing the device of x as copying x to another device (and x can still be used afterwards, on its original device). So I propose to change @neubig's proposed interface slightly to:
Expression y = x.to_device(device)

letting both x and y be used.

Another proposal (maybe it's already there, I didn't look at the code) is to also allow multiple CPU devices. There, the copies would be no-ops, but we could still run different CPU devices as different threads.

How do things look in terms of synchronization in the current implementation?

@xunzhang (Collaborator, Author) commented:

Very helpful comments!! I also prefer 2./2. then. And I think supporting both a copy-like x.to_device(device) and a move-like x.change_device(device) is necessary. But the to_device interface is somewhat hard to implement, since there will be some discrete VariableIndex indices and we would need to refactor the executor code to support it. I think I will first finish change_device and then implement to_device.

Currently, we don't support specifying a CPU id. I will think about that in the near future.

@xunzhang (Collaborator, Author) commented:

My remarks about to_device and change_device in my last comment might be incorrect. Basically, I think to_device is a copy-like operation which creates an additional node, while change_device would change the device assignment of an existing expression (I'm not sure whether that semantic is useful or not).

@neubig (Contributor) commented Jul 22, 2017

@xunzhang Yes, to_device is an operation that will create a new node (whose memory is stored on a different device than its input node). I don't think we should have a function to change the device of a particular node, for the reasons you mentioned: it would complicate things and require special handling in the executor. Regarding Yoav's comment about having multiple CPU devices, I'm not sure that this is necessary. In order to do things on multiple threads, we'll need to have a multi-threaded execution engine anyway, so we can probably have the multi-threaded execution engine perform multiple operations using the same CPU device. Let's save this discussion for a later commit, when we tackle multi-threading the execution engine.
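
As a small illustration of those semantics (a sketch only, assuming the to_device / devices_map interfaces from this PR and a Parameter p_h already added to the model): the copy adds a new node on the target device, and the original expression stays usable on its own device.

void to_device_sketch(dynet::ComputationGraph& cg, dynet::Parameter p_h) {
  using namespace dynet;
  Expression h     = parameter(cg, p_h, devices_map["GPU:0"]);  // resides on GPU:0
  Expression h_cpu = to_device(h, devices_map["CPU"]);          // copy: a new node whose memory lives on the CPU
  Expression a = tanh(h_cpu);   // computed on the CPU copy
  Expression b = h + h;         // the original h is untouched and still usable on GPU:0
}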

@xunzhang (Collaborator, Author) commented:

@neubig Right, cool. I will finish this soon.

@xunzhang (Collaborator, Author) commented Jul 24, 2017

This pull request is review-ready. It does not affect the existing code or interfaces, and I will split the remaining work into future pull requests.

The remaining work includes:

  1. Fix the remaining places that hard-code default_device.
  2. Fix the multi-GPU hang bug: it is not introduced by this pull request and should be addressed together with multi-device support elsewhere.
  3. Python interface: I believe the code in this pull request will not break current usage.
  4. Add tests, refactor the failing GPU unit tests, and update the documentation.

@xunzhang changed the title from "[WIP] [Dynet-92]. Multi-device support" to "[Dynet-92]. Multi-device support" on Jul 24, 2017
@neubig (Contributor) left a review

Thanks, this is great! I have a bunch of small comments, but once they're resolved and I can confirm that this works in my environment, I think we can merge. Also, some of my comments might just be oversights, so if there's anything that you don't think needs to be fixed, just tell me.

dynet/exec.cc Outdated
@@ -122,6 +122,10 @@ const Tensor& SimpleExecutionEngine::incremental_forward(VariableIndex i) {
}
node->aux_mem = aux_mem;

// check consistent device
for (auto & xs_v : xs) {
DYNET_ASSERT(xs_v->device == nfxs[num_nodes_evaluated].device, "Attemp to do tensor forward in different devices");

Attemp -> Attempt. Similarly for all below.

dynet/exec.cc Outdated
@@ -245,15 +254,17 @@ void BatchedExecutionEngine::combine_tensors(std::vector<VariableIndex> batch_id
const size_t sz = node2size[id];

float* my_src = batches[node2batch[id]].nfx.v + node2offset[id];
if (tout.device->type == DeviceType::CPU) {
memcpy(dest, my_src, sz * sizeof(float));
} else {

For this "else", I think we should add a tout.device->type == DeviceType::GPU, then throw an error if we get a device other than the one we're expecting. We might add other device types later, and if we do this logic will break (and it's in a somewhat inconspicuous place, so we should make sure that it's easy to catch through an error).

dynet/expr.cc Outdated
@@ -44,15 +43,17 @@ Expression operator+(const Expression& x, const Expression& y) {
else if (y.dim().batch_size() == 1)
return Expression(x.pg, x.pg->add_function<ScalarAdd>({x.i, y.i}));
else
return Expression(x.pg, x.pg->add_function<Sum>({x.i, y.i}));
return Expression(x.pg, x.pg->add_function<Sum>({x.i, y.i}, x.pg->nodes[x.i]->device));

This shouldn't need a "device" argument, right? It can just inherit its device from its inputs.
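
For illustration, a hedged sketch of what "inherit its device from its inputs" could look like in the graph-construction path; the helper below is hypothetical and simplified, not the actual add_function code, and it assumes the Node::device member and dynet::default_device global used elsewhere in this PR:

// Hypothetical helper: pick a node's device at construction time.
// Nodes with arguments inherit the device of their first input; only
// argument-less nodes (inputs, parameters) need a device spelled out.
dynet::VariableIndex add_node_sketch(dynet::ComputationGraph& cg, dynet::Node* node,
                                     const std::vector<dynet::VariableIndex>& args,
                                     dynet::Device* explicit_device = nullptr) {
  if (explicit_device != nullptr)
    node->device = explicit_device;                 // caller asked for a specific device
  else if (!args.empty())
    node->device = cg.nodes[args.front()]->device;  // inherit from the first input
  else
    node->device = dynet::default_device;           // fall back to the global default
  cg.nodes.push_back(node);
  return dynet::VariableIndex(cg.nodes.size() - 1);
}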

dynet/expr.cc Outdated
@@ -76,7 +77,10 @@ Expression contract3d_1d(const Expression& x, const Expression& y, const Express
Expression sqrt(const Expression& x) { return Expression(x.pg, x.pg->add_function<Sqrt>({x.i})); }
Expression abs(const Expression& x) { return Expression(x.pg, x.pg->add_function<Abs>({x.i})); }
Expression erf(const Expression& x) { return Expression(x.pg, x.pg->add_function<Erf>({x.i})); }
Expression tanh(const Expression& x) { return Expression(x.pg, x.pg->add_function<Tanh>({x.i})); }
Expression tanh(const Expression& x, Device *device) {

Similarly, we should remove the device argument from this and all others below, except the ones that deal with input, or explicitly changing devices.

dynet/expr.h Outdated
const bool is_stale() const {return (get_number_of_active_graphs() != 1 || graph_id != get_current_graph_id());}
Expression() : pg(nullptr), i(0), graph_id(0) {}

Expression(Device *device) : pg(nullptr), i(0), graph_id(0) {}

This is identical to the empty constructor, so we should delete it.

@@ -19,7 +21,7 @@ struct Transpose : public Node {
// y = inv(x)
// x = an invertible matrix
struct MatrixInverse : public Node {
explicit MatrixInverse(const std::initializer_list<VariableIndex>& a) : Node(a) {}
explicit MatrixInverse(const std::initializer_list<VariableIndex>& a, Device *device) : Node(a, device) {}

Again, no device here.


namespace dynet {

// y = x_1 * x_2
struct MatrixMultiply : public Node {
explicit MatrixMultiply(const std::initializer_list<VariableIndex>& a) : Node(a) {}
explicit MatrixMultiply(const std::initializer_list<VariableIndex>& a,

No device here.


namespace dynet {

// y = tanh x_1
struct Tanh : public Node {
explicit Tanh(const std::initializer_list<VariableIndex>& a) : Node(a) {}
explicit Tanh(const std::initializer_list<VariableIndex>& a, Device *device) : Node(a, device) {}

No device here.

dynet/tensor.cc Outdated
@@ -94,11 +95,13 @@ float TensorTools::access_element(const Tensor& v, int index) {
}

float TensorTools::access_element(const Tensor& v, const Dim& index) {
if (v.device->type == DeviceType::CPU) {
return (*v)(index[0], index[1]);
} else {
#if HAVE_CUDA

Similarly here, I think we should check if the device is GPU, and throw an error if we get an unknown device.

dynet/tensor.cc Outdated
cudaMemcpyAsync(v.v, v_src.v, sizeof(real) * v.d.size(), cudaMemcpyDeviceToHost);
#endif
}
} else {

Similarly here, check if device is GPU explicitly.

@neubig merged commit 5903c85 into clab:master on Aug 10, 2017
@neubig (Contributor) commented Aug 10, 2017

Thanks! I confirmed that this is working as expected, so I merged. This is great to have :)

@duyvuleo (Contributor) commented:

Is it actually working? I tried the ./examples/train_xor-multidevice example and got the following error:

terminate called after throwing an instance of 'std::runtime_error'
what(): Invalid device name: GPU:0
Aborted (core dumped)

@neubig (Contributor) commented Aug 17, 2017

The documentation isn't finished yet, but I think you need to add --dynet-devices CPU,GPU:0 to the command line.

@duyvuleo (Contributor) commented:

It works. Thanks!

@xunzhang mentioned this pull request on Aug 25, 2017