
port allocation from majel to paddle #2217

Closed · wants to merge 23 commits

Conversation

@QiJune (Member) commented May 19, 2017:

No description provided.

@QiJune QiJune requested a review from JiayiFeng May 19, 2017 10:47

#include "allocation.h"
#include "hl_cuda.h"
#include "paddle/utils/Logging.h"
@wangkuiyi (Collaborator) commented May 19, 2017:

I think we made an agreement yesterday that Majel shouldn't depend on Paddle. If so, for logging, we should just use google log directly?
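For example, a minimal sketch of what using glog directly could look like (hypothetical snippet, not code from this PR):

    // Hypothetical sketch: depend on google/glog directly, not on Paddle's wrapper.
    #include <cstdlib>
    #include <glog/logging.h>

    void* must_malloc(size_t size) {
      void* ptr = std::malloc(size);
      CHECK(ptr != nullptr) << "failed to allocate " << size << " bytes";
      return ptr;
    }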

@QiJune (Author):

Done.

@@ -44,7 +44,10 @@ function(cc_library TARGET_NAME)
else()
add_library(${TARGET_NAME} STATIC ${cc_library_SRCS})
endif()
add_dependencies(${TARGET_NAME} ${cc_library_DEPS} ${external_project_dependencies})
if(cc_library_DEPS)
target_link_libraries(${TARGET_NAME} ${cc_library_DEPS})
Collaborator:

Do we need to call target_link_libraries when we build a library? Or is all we need here add_dependencies, with the linking happening when we build shared objects or executable binaries?

@QiJune (Author):

Done. Moved the DEPS to add_test.

#include <boost/variant.hpp>

#include "allocation.h"
#include "hl_cuda.h"
Collaborator:

I think we should create malloc.{h,cc}, which calls C runtime malloc and cudaMalloc, but doesn't depend on hl_cuda. In that way, we don't make Majel depend on Paddle.
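For example, such a malloc.h could declare (a hypothetical sketch of the suggested split; the function names follow the renaming discussed below):

    // malloc.h -- hypothetical sketch of the suggested file
    #pragma once
    #include <cstddef>

    namespace majel {
    namespace detail {

    void* malloc_cpu(size_t size);   // wraps the C runtime malloc
    void free_cpu(void* ptr);        // wraps the C runtime free
    void* malloc_cuda(size_t size);  // wraps cudaMalloc
    void free_cuda(void* ptr);       // wraps cudaFree

    }  // namespace detail
    }  // namespace majel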

@QiJune (Author):

Done.

inline bool operator!=(const GpuPlace& o) const { return !(*this == o); }

GpuPlace() : GpuPlace(0) {}
int device;
@wangkuiyi (Collaborator) commented May 19, 2017:

I don't get it -- why should GpuPlace no longer distinguish GPUs? Is that because we want to use the CUDA context to determine the current GPU?

If so, I think what we need is not to remove int device; from GpuPlace, but to redefine get_place to call cuCtxGetDevice?
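For instance, a sketch of such a get_place, using the runtime-API equivalent cudaGetDevice (hypothetical code, not from this PR):

    #include <cuda_runtime.h>

    GpuPlace get_place() {
      int device = 0;
      cudaGetDevice(&device);   // the device bound to the calling thread
      return GpuPlace(device);  // keep int device; inside GpuPlace
    }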

@QiJune (Author):

One CPU thread is bound to one GPU card. Every CPU thread sets its GPU card first, using cudaSetDevice. There is no need for the tensor to hold a place recording which specific GPU card it is on.

Collaborator:

I am afraid we cannot assume this. What if we are going to support OpenCL/FPGA in addition to CUDA? Would this assumption become a bug?

@QiJune (Author):

I think communication between devices is costly, so one neural network will run on one device. If we are going to support OpenCL/FPGA, we can just define Place as follows:

typedef boost::variant<CpuPlace, CudaPlace, OpenclPlace, FpgaPlace> Place;

Then we implement methods such as malloc/free for the corresponding device.
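For example, malloc could dispatch on the variant with a visitor (a sketch; malloc_cpu/malloc_cuda are the hypothetical per-device wrappers from above, and OpenCL/FPGA cases would be added the same way):

    #include <boost/variant.hpp>

    struct MallocVisitor : public boost::static_visitor<void*> {
      explicit MallocVisitor(size_t size) : size_(size) {}
      void* operator()(const CpuPlace&) const { return malloc_cpu(size_); }
      void* operator()(const CudaPlace&) const { return malloc_cuda(size_); }
      size_t size_;
    };

    void* malloc(Place place, size_t size) {
      return boost::apply_visitor(MallocVisitor(size), place);
    }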

@wangkuiyi (Collaborator) commented May 24, 2017:

I agree that we can run a neural network on one device, but when we aggregate gradients/parameters from these devices, it seems that we need to copy data from/to exact places?

@QiJune (Author):

We will create one Context for one device, and then set the device id on that specific Context. The device id will be handled in Paddle, not in the tensor library. We can certainly copy data between different GPU cards. The following is an example:

    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(2, 0 /*flags, must be 0*/);
    cudaSetDevice(2);
    cudaDeviceEnablePeerAccess(1, 0 /*flags, must be 0*/);

    // Allocate some data on each device
    float *gpu1data, *gpu2data;
    cudaSetDevice(1);
    cudaMalloc(&gpu1data, nbytes);
    cudaSetDevice(2);
    cudaMalloc(&gpu2data, nbytes);

    // Do the p2p copy! (cudaMemcpy takes the byte count as well)
    cudaMemcpy(gpu1data, gpu2data, nbytes, cudaMemcpyDefault);

The GPU data block does not hold device id information, but the Context does.
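For illustration, a hypothetical sketch of such a Context:

    // Hypothetical sketch: the Context, not the data block, owns the device id.
    class CudaContext {
    public:
      explicit CudaContext(int device) : device_(device) {
        cudaSetDevice(device_);  // bind the calling CPU thread to this device
      }
      int device() const { return device_; }

    private:
      int device_;
    };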

}

void* operator()(const GpuPlace& p) const {
void* address = hl_malloc_device(size_);
Collaborator:

I think we can move the definition of hl_malloc_device to paddle/majel/malloc.cc. The definition is only 4 lines of code. In this way, Majel doesn't rely on paddle/cuda.

@QiJune (Author):

Done.

@QiJune QiJune requested review from gangliao and wangkuiyi May 22, 2017 02:11
@QiJune QiJune force-pushed the fetaure/tensor_allocation branch from 7b08aad to 7822c86 Compare May 22, 2017 08:46
@QiJune QiJune force-pushed the fetaure/tensor_allocation branch from 3064514 to 2d6a2be Compare May 23, 2017 05:37
return dest_d;
}

void free_mem_device(void* dest_d) {
Collaborator:

It seems that this function is the counterpart of malloc_cuda, so it should be named free_cuda?

@QiJune (Author):

Done.

return cudaGetErrorString((cudaError_t)err);
}

void* malloc_device(size_t size) {
Collaborator:

How about renaming malloc_device to malloc_cuda? We have a plan to support other device interfaces like OpenCL and FPGA.

@QiJune (Author):

Done.

}
#endif

class DefaultAllocator {
Collaborator:

According to its name, class DefaultAllocator should be in allocation.{h,cc} instead of malloc.{h,cc}?

@QiJune (Author):

Majel provides various memory management policies. Every memory management policy can be abstracted into an allocator class. Here, we just implement a simple one first, DefaultAllocator.
malloc is a global method responsible for memory allocation; it chooses a specific memory allocation policy. Allocation is a memory block handled by Array, and it calls the malloc method.
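In other words, a minimal sketch of that call chain (the DefaultAllocator signature matches the header quoted later in this thread):

    // The global malloc picks a policy; DefaultAllocator is the simple first one.
    void* malloc(majel::Place place, size_t size) {
      return DefaultAllocator::malloc(place, size);
    }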

Collaborator:

I agree with every sentence in your comment. And it seems that's the reason we should move class Allocation to allocation.{h,cc}?

#pragma once
#include <memory>

#include "place.h"
Collaborator:

Let us use full include path name -- #include "paddle/majel/place.h". I know everyone might have a different idea about this, but let us just unify. Thanks.

@@ -0,0 +1,37 @@
#pragma once
#include <memory>
Collaborator:

Is <memory> mandatory in this header file? Should we move it to allocation.cc?

@QiJune (Author):

Done.

#include <cuda_runtime.h>
#endif

#define CHECK_CUDA(cudaFunc) \
Collaborator:

I remember that @reyoung and @helinwang both warned some time ago that PaddlePaddle, as a library, mustn't die with a fatal error, but needs to return the error and make sure it can be handled by the caller. I think it's worth confirming with them.

@helinwang (Contributor) commented May 23, 2017:

Agree, I think in general a library should never fatal; the exception is when it is in an unrecoverable state. malloc failure is a good example. Maybe CUDA failure is (almost) unrecoverable as well? If CHECK_CUDA fails, can the client do something to overcome the problem?

@QiJune (Author) commented May 24, 2017:

First, I think that if the malloc method gets an error, the client can hardly do anything to overcome the problem, so just let it fatal.
Second, if we check the result state of malloc, then we also have to check the result state whenever we construct an array. It will be quite fussy.

@wangkuiyi (Collaborator) commented May 24, 2017:

In terms of C library design principles, how about we just follow C's malloc convention -- return NULL when the allocation fails?

@QiJune (Author):

In Majel's code, when allocation fails, a bad allocation exception will be thrown.

if (ptr_ == nullptr) {
    throw std::bad_alloc();
}

If we do not throw an exception or CHECK fatal, we have to check whether the Array has been constructed correctly.
I suggest that we return an error state from most other operations, but not from malloc/free and some CUDA operations, because we can hardly help the client recover if these system-related APIs go down.

Collaborator:

Let us not CHECK_EQ here, and remove the definition of CHECK_CUDA.

It is not hard to handle the error returned by cudaMalloc. Majel did the following in src/gpu/detail/cuda.cu:

void* malloc(size_t size) {
    void* ptr = 0;

    cudaError_t result = cudaMalloc(&ptr, size);

    if (result == cudaSuccess) {
        return ptr;
    }

    // clear last error
    cudaGetLastError();

    return nullptr;
}

@QiJune (Author):

Yes, but in allocation.cu, we will find

Allocation::Allocation(size_t size, Place place) :
    place_(place), size_(size), owned_(true) {

    if (size > 0) {
        majel::detail::Allocator allocator(size_);
        ptr_ = boost::apply_visitor(allocator,
                                    place_);
        if (ptr_ == nullptr) {
            throw std::bad_alloc();
        }
    }
}

So, I think a fatal CHECK and throwing a bad_alloc exception make no practical difference: the server will go down, and the client cannot recover either way.


class DefaultAllocator {
public:
static void* malloc(majel::Place place, size_t size);
Collaborator:

Error __must_check malloc(majel::Place place, size_t size, void** ptr);
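With that signature, the caller checks the error explicitly, e.g. (hypothetical sketch, assuming an Error type with an isOK() query):

    void* ptr = nullptr;
    Error err = DefaultAllocator::malloc(place, size, &ptr);
    if (!err.isOK()) {
      // handle or propagate the allocation failure instead of fataling
    }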

@QiJune (Author):

First, I think that if the malloc method gets an error, the client can hardly do anything to overcome the problem, so just let it fatal.
Second, if we check the result state of malloc, then we also have to check the result state whenever we construct an array. It will be quite fussy.

@wangkuiyi (Collaborator) commented May 31, 2017:

I think it is common to expose the out-of-memory error to the client code, and it is the client code's responsibility to recover from the error, by either printing an error message or trying another device.

But I'd conclude here that we keep @QiJune's original function definition, because it follows C malloc's signature.

typedef boost::variant<CpuPlace, GpuPlace> Place;
#else
typedef boost::variant<CpuPlace> Place;
#endif
@QiJune (Author):

If we build the CPU-only version, Place can only accept CpuPlace; passing a GpuPlace to an Array in that case will raise an error at compile time.
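For example (hypothetical sketch):

    // In the CPU-only build, Place = boost::variant<CpuPlace>, so the line
    // below fails to compile: GpuPlace is not one of the variant's alternatives.
    Place place = GpuPlace(0);  // compile-time error in the CPU-only build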

@QiJune QiJune force-pushed the fetaure/tensor_allocation branch from 666ad98 to 7ae5e38 Compare June 12, 2017 05:06
@luotao1 (Contributor) commented Feb 1, 2019:

Closing because Paddle now uses Eigen instead of Majel.

@luotao1 luotao1 closed this Feb 1, 2019