Pluggable memory manager #1203

Open · 1 task
cjnolet opened this issue May 7, 2020 · 9 comments

Comments

@cjnolet (Contributor) commented May 7, 2020

Running on:

  • [ ] CPU
  • [x] GPU

Interface:

  • [x] C++
  • [x] Python

We have been enabling consistent use of the RMM (RAPIDS memory manager) across the RAPIDS ecosystem and it would be very useful if we could also plug it into FAISS. We would like to start using the GPU-accelerated approximate methods on cuML's NearestNeighbors and this could be a way that we could still use the GpuResources API but be able to guarantee that it will play nicely with the RMMPoolAllocator.

It would be great if FAISS provided a way to plug in a memory manager/allocator. I don't think there needs to be a dependency on RMM here, just the ability for the underlying memory management to be plugged in.

@wickedfoo (Contributor)

I was thinking about overhauling the memory allocation/deallocation in GPU Faiss anyway, to better keep track of where memory is going for users and to allow for optional logging. I'll make sure that every allocation goes through the GpuResources object, and each allocation will fall into one of several different categories.

There are broadly two classes of memory allocations in GPU Faiss: permanent and temporary. Permanent allocations are retained for the lifetime of the index, and are ultimately owned by the index.

Temporary allocations are made out of a memory stack that GpuResources allocates up front, which falls back to the heap (cudaMalloc) when the stack size is exhausted. These allocations do not live beyond the lifetime of a top-level call into a Faiss index (or at least, on the GPU they are ordered with respect to the ordering stream; once all kernels are done on the stream to which the work is ordered, the temporary allocation is no longer needed and can be reused or freed). Generally about 1 GB or so of memory should be reserved in this stack to avoid cudaMalloc/cudaFree calls during many search operations.
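For reference, the size of that temporary stack is already tunable up front; a minimal sketch, assuming the current faiss/gpu headers and the existing setTempMemory knob on StandardGpuResources:

#include <faiss/gpu/StandardGpuResources.h>

int main() {
  faiss::gpu::StandardGpuResources res;
  // Reserve ~1 GiB for the temporary stack so repeated search calls avoid
  // falling back to cudaMalloc/cudaFree.
  res.setTempMemory(size_t(1024) * 1024 * 1024);
  // ... construct GPU indices with &res and run add/search as usual ...
  return 0;
}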

You can then provide your own implementation of the GpuResources object and route those memory allocations wherever you want, provided you maintain the desired lifetimes.

@mdouze (Contributor) commented May 12, 2020

Nice explanation @wickedfoo, would you mind adding it to https://github.com/facebookresearch/faiss/wiki/Faiss-on-the-GPU ?

@mdouze (Contributor) commented May 29, 2020

No activity, closing.

@mdouze mdouze closed this as completed May 29, 2020
@wickedfoo wickedfoo reopened this May 29, 2020
@wickedfoo (Contributor)

I will be working on this in June.

@cjnolet (Contributor, Author) commented Jun 17, 2020

@wickedfoo any chance you have been able to make progress on this? We have been using the RMM pool allocator and managed memory in order to eliminate all device synchronizations from alloc/dealloc and to oversubscribe GPU memory.

We are hoping to integrate the approximate nearest neighbors methods very soon, and plugging in a separate allocator will be even more important since FAISS will retain ownership over that memory.

@wickedfoo (Contributor)

Yes, I have started on the diff on my end; sorry, I work on many things at FB, most of which have nothing to do with Faiss these days. The API looks something like this:

enum AllocType {
  // ... a bunch of categories of memory in here for Faiss internals ...
};

enum MemorySpace {
  /// Managed using cudaMalloc/cudaFree
  Device = 1,
  /// Managed using cudaMallocManaged/cudaFree
  Unified = 2,
  /// Managed using cudaHostAlloc/cudaFreeHost
  HostPinned = 3,
};

struct AllocRequest {
  AllocType type;
  int device;
  MemorySpace space;
  cudaStream_t stream;
  size_t size;
};

with the functions in GpuResources looking something like this:

  virtual void* allocMemory(const AllocRequest& req) = 0;
  virtual void deallocMemory(int device, void* in) = 0;

So, to override the memory allocator, you could extend StandardGpuResources (which provides a default implementation of the rest of GpuResources: streams, cuBLAS handles, etc.) and override these two functions with your own memory allocator code in C++.
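A rough sketch of that shape, using the names from this comment (the shipped headers may differ, and myAlloc / myFree below are just stand-ins for whatever allocator you want to route to):

struct PluggableGpuResources : public faiss::gpu::StandardGpuResources {
  void* allocMemory(const faiss::gpu::AllocRequest& req) override {
    // Hand the request to your own allocator; req carries the allocation
    // category, device, memory space, ordering stream and size.
    return myAlloc(req.device, req.space, req.stream, req.size);
  }

  void deallocMemory(int device, void* in) override {
    // Return the pointer to your allocator; if it needs the original size
    // or memory space, you are responsible for remembering them.
    myFree(device, in);
  }
};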

@wickedfoo (Contributor)

FYI you can currently allocate index objects using managed memory, though this wouldn't come out of your allocator:

https://github.com/facebookresearch/faiss/blob/master/gpu/GpuIndex.h#L30

Not 100% sure that all allocations will go through this, but the major ones would.
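A minimal sketch of what using that looks like, assuming the MemorySpace and GpuIndexFlatConfig definitions from the linked header (buildUnifiedMemoryIndex is just an illustrative name):

#include <faiss/gpu/GpuIndexFlat.h>
#include <faiss/gpu/StandardGpuResources.h>

void buildUnifiedMemoryIndex() {
  faiss::gpu::StandardGpuResources res;
  faiss::gpu::GpuIndexFlatConfig config;
  config.device = 0;
  // Back the index's own storage with cudaMallocManaged (unified memory)
  config.memorySpace = faiss::gpu::MemorySpace::Unified;
  faiss::gpu::GpuIndexFlatL2 index(&res, 128, config);
  // ... add vectors / search as usual ...
}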

@wickedfoo (Contributor) commented Jun 30, 2020

@cjnolet I have finished the diff internally in our FB repo; it is out for review amongst ourselves, so hopefully it will be in your hands before too long.

There are two allocation functions added to faiss::gpu::GpuResources. All GPU memory allocation is guaranteed to go through these two calls, on the instance of the resources object provided to the index:

  /// Memory management
  /// Returns an allocation from the given memory space, ordered with respect to
  /// the given stream (i.e., the first user will be a kernel in this stream).
  /// All allocations are sized internally to be the next highest multiple of 16
  /// bytes, and all allocations returned are guaranteed to be 16 byte aligned.
  virtual void* allocMemory(const AllocRequest& req) = 0;

  /// Returns a previous allocation
  virtual void deallocMemory(int device, void* in) = 0;

As an artifact of the pre-C++11 API and the SWIG restrictions we had at one time (no shared_ptr), the GpuResources object must stay alive for the lifetime of any indices that use it (this was the case before this change and remains the case after it).
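In practice this just means keeping the resources object around at least as long as any index that uses it; a minimal sketch:

{
  faiss::gpu::StandardGpuResources res;       // must outlive any index using it
  faiss::gpu::GpuIndexFlatL2 index(&res, 64); // borrows res, does not own it
  // ... add / search ...
}   // index is destroyed first, then res (reverse declaration order)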

AllocRequest and the related types are defined like this:

enum AllocType {
  /// Unknown allocation type or miscellaneous (not currently categorized)
  Other = 0,

  /// Primary data storage for GpuIndexFlat (the raw matrix of vectors and
  /// vector norms if needed)
  FlatData = 1,

  /// Primary data storage for GpuIndexIVF* (the storage for each individual IVF
  /// list)
  IVFLists = 2,

  /// Quantizer (PQ, SQ) dictionary information
  Quantizer = 3,

  /// For GpuIndexIVFPQ, "precomputed codes" for more efficient PQ lookup
  /// require the use of possibly large tables. These are marked separately from
  /// Quantizer as these can frequently be 100s - 1000s of MiB in size
  QuantizerPrecomputedCodes = 4,

  ///
  /// StandardGpuResources implementation specific types
  ///

  /// When using StandardGpuResources, temporary memory allocations
  /// (MemorySpace::Temporary) come out of a stack region of memory that is
  /// allocated up front for each gpu (e.g., 1.5 GiB upon initialization). This
  /// allocation by StandardGpuResources is marked with this AllocType.
  TemporaryMemoryBuffer = 10,

  /// When using StandardGpuResources, any MemorySpace::Temporary allocations
  /// that cannot be satisfied within the TemporaryMemoryBuffer region fall back
  /// to calling cudaMalloc which are sized to just the request at hand. These
  /// "overflow" temporary allocations are marked with this AllocType.
  TemporaryMemoryOverflow = 11,
};

/// Convert an AllocType to string
std::string allocTypeToString(AllocType t);

/// Memory regions accessible to the GPU
enum MemorySpace {
  /// Temporary device memory (guaranteed to no longer be used upon exit of a
  /// top-level index call, and where the streams using it have completed GPU
  /// work). Typically backed by Device memory (cudaMalloc/cudaFree).
  Temporary = 0,

  /// Managed using cudaMalloc/cudaFree (typical GPU device memory)
  Device = 1,

  /// Managed using cudaMallocManaged/cudaFree (typical Unified CPU/GPU memory)
  Unified = 2,
};

/// Information on what/where an allocation is
struct AllocInfo {
  inline AllocInfo()
      : type(AllocType::Other),
        device(0),
        space(MemorySpace::Device),
        stream(nullptr) {
  }

  inline AllocInfo(AllocType at,
                   int dev,
                   MemorySpace sp,
                   cudaStream_t st)
      : type(at),
        device(dev),
        space(sp),
        stream(st) {
  }

  /// The internal category of the allocation
  AllocType type;

  /// The device on which the allocation is happening
  int device;

  /// The memory space of the allocation
  MemorySpace space;

  /// The stream on which new work on the memory will be ordered (e.g., if a
  /// piece of cached memory to be returned for this call was last used on
  /// stream 3 and the new memory request is for stream 4, the memory manager
  /// will synchronize stream 4 to wait for the completion of stream 3, via
  /// events or other stream synchronization).
  ///
  /// The memory manager guarantees that the returned memory is free to use
  /// without data races on the specified stream.
  cudaStream_t stream;
};

/// Information on what/where an allocation is, along with how big it should be
struct AllocRequest : public AllocInfo {
  inline AllocRequest()
      : AllocInfo(),
        size(0) {
  }

  inline AllocRequest(const AllocInfo& info,
                      size_t sz)
      : AllocInfo(info),
        size(sz) {
  }

  inline AllocRequest(AllocType at,
                      int dev,
                      MemorySpace sp,
                      cudaStream_t st,
                      size_t sz)
      : AllocInfo(at, dev, sp, st),
        size(sz) {
  }

  /// The size in bytes of the allocation
  size_t size;
};

The most naive implementation would simply call cudaMalloc / cudaFree for every allocation and deallocation in MemorySpace::Temporary and MemorySpace::Device, and cudaMallocManaged / cudaFree for MemorySpace::Unified, without keeping track of anything (and depending upon cudaFree to perform all needed stream synchronization).

These are probably best overridden by extending StandardGpuResources in C++ (which implements the non-memory APIs: streams, cuBLAS handles, etc.) and just providing your own implementation of these two virtual functions.

As long as the stream synchronization directives are adhered to, you can cache and reuse memory however you want.
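A minimal sketch of that naive scheme, written against the definitions above (naiveAllocMemory / naiveDeallocMemory are just illustrative names; error handling and saving/restoring the current device are omitted):

#include <cuda_runtime.h>

void* naiveAllocMemory(const faiss::gpu::AllocRequest& req) {
  void* p = nullptr;
  cudaSetDevice(req.device);
  if (req.space == faiss::gpu::MemorySpace::Unified) {
    cudaMallocManaged(&p, req.size);
  } else {
    // MemorySpace::Temporary and MemorySpace::Device both map to plain
    // device memory here
    cudaMalloc(&p, req.size);
  }
  return p;
}

void naiveDeallocMemory(int device, void* in) {
  cudaSetDevice(device);
  // cudaFree synchronizes the device, which covers the stream-ordering
  // requirements described above
  cudaFree(in);
}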

Something that is also convenient is that the default implementation now reports where all the memory is going, something like this:

{0: {'FlatData': (2, 6526368), 'IVFLists': (25296, 205735072), 'Quantizer': (2, 262144), 'TemporaryMemoryBuffer': (1, 1610612736)}, 1: {'FlatData': (2, 6526368), 'IVFLists': (25296, 205735072), 'Quantizer': (2, 262144), 'QuantizerPrecomputedCodes': (1, 103612416), 'TemporaryMemoryBuffer': (1, 1610612736)}}

which is device -> (AllocType -> (# allocations, total size in bytes)).
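If you are using StandardGpuResources, that map is exposed through something like getMemoryInfo() (a sketch; check the header for the exact name and signature, and printMemoryInfo is just an illustrative helper):

#include <cstdio>
#include <faiss/gpu/StandardGpuResources.h>

void printMemoryInfo(faiss::gpu::StandardGpuResources& res) {
  // device -> (AllocType name -> (number of allocations, total bytes))
  auto info = res.getMemoryInfo();
  for (const auto& dev : info) {
    for (const auto& entry : dev.second) {
      std::printf("device %d, %s: %d allocations, %zu bytes\n",
                  dev.first, entry.first.c_str(),
                  entry.second.first, entry.second.second);
    }
  }
}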

@cjnolet (Contributor, Author) commented Jul 12, 2020

@wickedfoo the API looks great. Looking forward to the release.
