Over the past 2 days I've been researching various ways to avoid expensive allocations/deallocations in tight loops on the GPU by reusing memory.

This is also a recurring issue in game programming, so I'd like to share my approach and ask for comments.

My use-case:
  • I allocate and free memory through a custom allocator (cudaMalloc/cudaFree in my case).
  • Allocation/deallocation is very expensive and often occurs in tight loops (12.5 ms for 512MB cudaMalloc vs 1us for system malloc).
  • Memory is at a premium; we cannot afford to hoard unused memory.
  • Memory per object is huge. For example, a VGG neural network (one of the earliest) with typical 224x224x3 RGB images has 138 million parameters and takes about 93MB per image (see here). Typical batch sizes are 32~128 images, and a new batch is required every 200ms~1s (see "OxfordNet", the other name of VGG, in this benchmark).
  • Objects are heterogeneous, but objects requested in a tight loop have the same size most of the time.
My solution:
  • A table to keep track of available cached blocks, with BlockSize as keys and ByteAddress as values. Note that a Nim Table can store several values for the same key (say 90MB), and with take it returns them in FIFO order (see the small demo after this list).
  • A deque (the queues module is deprecated). Each CachedBlock carries an epoch timestamp; after a threshold they are freed for real and returned to the OS/GPU driver.
  • An IntSet (i.e. a set of ints). Blocks reused before their expiration are added there. Deleting them from the middle of the queue at a random position would be too costly (O(n)).
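
For example, here is a tiny demo of that duplicate-key behaviour (the addresses are fake and never dereferenced; note that this add overload for tables is deprecated on recent Nim versions):

import tables

var freeBlocks = initTable[int, ByteAddress]()
freeBlocks.add(90_000_000, ByteAddress(0x1000))  # two cached blocks of the same size
freeBlocks.add(90_000_000, ByteAddress(0x2000))

var blockAddr: ByteAddress
doAssert freeBlocks.take(90_000_000, blockAddr)      # pops one cached block
doAssert freeBlocks.take(90_000_000, blockAddr)      # pops the other one
doAssert not freeBlocks.take(90_000_000, blockAddr)  # nothing left for this size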

Regarding the decay/timed eviction, I'm not sure yet which formula to use: exponential decay (radioactivity-like), fixed-time decay, or another scheme.
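
For concreteness, the two candidate checks could look roughly like this (a sketch only; ttlSeconds, baseSeconds and the halving-per-round scheme are made-up parameters, and it assumes a times module where subtracting two Time values yields a Duration):

import times

proc expiredFixedTime(blockEpoch: Time, ttlSeconds: int64): bool =
  ## Fixed-time decay: a cached block expires ttlSeconds after it became idle.
  (getTime() - blockEpoch).inSeconds > ttlSeconds

proc expiredExponential(blockEpoch: Time, baseSeconds: int64, survivedRounds: int): bool =
  ## Exponential (radioactivity-like) decay: the allowed idle time is halved
  ## for every eviction round the block has already survived without being reused.
  (getTime() - blockEpoch).inSeconds > baseSeconds shr survivedRounds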

Here is the code; I didn't implement the interface yet:

import  tables,
        intsets,
        deques,
        times

type
  BlockSize = int
  Allocator = proc (size: Natural): pointer {.noconv.}
  Deallocator = proc (p: pointer) {.noconv.}
  
  CachedBlock = tuple[epoch: Time, address: ByteAddress, size: BlockSize]
    ## Describe a contiguous allocated and available memory chunk in the pool:
    ##   - starting timestamp of availability,
    ##   - starting address,
    ##   - size in bytes
  
  DecayingObjectPool = object
    freeBlocks: Table[BlockSize, ByteAddress]
    evictionQueue: Deque[CachedBlock]
    lazyReused: IntSet
    allocator: Allocator
    deallocator: Deallocator
    ## Object pool / caching allocator with timed eviction
    ##   - Keep track of available blocksizes in the pool.
    ##   - Free blocks that are too old expire and are returned to the OS/devices via the deallocator.
    ##   - Eviction is managed by a queue; however, reallocated objects must
    ##     be removed from the EvictionQueue as well and can be at arbitrary positions.
    ##     To avoid costly deletion in the middle of the queue, reused objects are tracked
    ##     in lazyReused and will be removed lazily from the queue when they expire but will not
    ##     trigger the deallocator.
    when defined(debug) or defined(test):
      idleMem: Natural
      usedMem: Natural
      nbAllocations: Natural
      nbDeallocations: Natural
      cacheHits: Natural
      cacheMisses: Natural

proc initDecayingObjectPool(proc_alloc: Allocator, proc_dealloc: Deallocator): DecayingObjectPool =
  result.freeBlocks = initTable[BlockSize, ByteAddress]()
  result.evictionQueue = initDeque[CachedBlock]()
  result.lazyReused = initIntSet()
  result.allocator = proc_alloc
  result.deallocator = proc_dealloc


when isMainModule:
  let foo = initDecayingObjectPool(alloc0, dealloc)
  echo foo
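
As a rough, untested sketch (building on the types and imports above, with a fixed time-to-live as a placeholder policy and ignoring corner cases such as a block being reused several times before its old queue entries expire), the interface could look like this:

const CacheTTL = initDuration(milliseconds = 500)  # arbitrary placeholder value

proc removeExact(pool: var DecayingObjectPool, size: BlockSize, address: ByteAddress) =
  ## Remove the exact (size, address) pair from freeBlocks while keeping
  ## the other cached blocks of the same size.
  var kept = newSeq[ByteAddress]()
  var a: ByteAddress
  while pool.freeBlocks.take(size, a):
    if a != address:
      kept.add a
  for cached in kept:
    pool.freeBlocks.add(size, cached)

proc request(pool: var DecayingObjectPool, size: BlockSize): pointer =
  ## Serve from the cache when a block of that size is available,
  ## otherwise fall back to the expensive allocator.
  var address: ByteAddress
  if pool.freeBlocks.take(size, address):
    pool.lazyReused.incl(address)   # its pending queue entry must not free it
    result = cast[pointer](address)
  else:
    result = (pool.allocator)(size)

proc release(pool: var DecayingObjectPool, p: pointer, size: BlockSize) =
  ## Hand the block back to the pool and start its eviction timer.
  let address = cast[ByteAddress](p)
  pool.freeBlocks.add(size, address)  # duplicate sizes are expected here
  pool.evictionQueue.addLast((epoch: getTime(), address: address, size: size))

proc flushExpired(pool: var DecayingObjectPool) =
  ## Deallocate blocks that sat unused in the pool longer than CacheTTL,
  ## skipping queue entries made stale by a reuse (lazy removal).
  while pool.evictionQueue.len > 0 and
        getTime() - pool.evictionQueue.peekFirst.epoch > CacheTTL:
    let expired = pool.evictionQueue.popFirst()
    if expired.address in pool.lazyReused:
      pool.lazyReused.excl(expired.address)  # lazy removal, no deallocation
    else:
      pool.removeExact(expired.size, expired.address)
      (pool.deallocator)(cast[pointer](expired.address))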

Complexity analysis:
  • The table operations used are add, del, take, all O(1)
  • The deque operations used are only the queue part: addLast, peekFirst (to check epoch), popFirst, all O(1)
  • The intset operations used are contains, incl, excl. I'm not sure of the complexity but it should be O(1) as well.
2017-11-10 13:02:24

I had a similar problem in the past and considered a similar solution.

However, I ended up using a custom fixed-size hash table with open addressing. The idea is that once the table fills up, it starts throwing away one element every time another one needs to be inserted; the table never grows. It was not as generic and it was O(1) only for a subset of operations, but it was much faster.

Answer: it depends on which operations you are going to use most. I ended up inserting in 99.99999% of cases, with no del and no lookup.
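
Roughly something like this (a simplified sketch of the idea with made-up names and sizes, not the actual code):

import hashes

const
  TableSize = 1 shl 12   # fixed capacity, power of two, never grows
  ProbeLimit = 8         # how far we linear-probe before evicting someone

type
  Slot[K, V] = object
    used: bool
    key: K
    val: V
  FixedTable[K, V] = object
    slots: array[TableSize, Slot[K, V]]

proc put[K, V](t: var FixedTable[K, V], key: K, val: V) =
  ## O(1): probe a bounded window; if every slot is taken, overwrite the last one probed.
  let start = hash(key) and (TableSize - 1)
  var idx = start
  for i in 0 ..< ProbeLimit:
    idx = (start + i) and (TableSize - 1)
    if not t.slots[idx].used or t.slots[idx].key == key:
      break
  t.slots[idx] = Slot[K, V](used: true, key: key, val: val)

proc get[K, V](t: FixedTable[K, V], key: K, res: var V): bool =
  ## O(1) bounded lookup; the entry may already have been evicted by a later put.
  let start = hash(key) and (TableSize - 1)
  for i in 0 ..< ProbeLimit:
    let idx = (start + i) and (TableSize - 1)
    if t.slots[idx].used and t.slots[idx].key == key:
      res = t.slots[idx].val
      return true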

2017-11-10 15:51:29

@cdome That's funny; I've actually used a container that seems to work just like yours, even though the use case was totally different. It was for an evolutionary program.

@mratsim If I get it right, you need up to 93 MB * 128 images = 11 GB 904 MB per batch?

By the way: did you look at the Rust allocators collection?

2017-11-11 23:39:07

Yes, an Nvidia Titan X is quite a common GPU for deep learning and comes with 12GB of VRAM; the GTX 1080 Ti has 11GB, and the GTX 1070/1080 have 8GB. Whatever the GPU, the goal is to saturate it with the biggest batch it can handle.

There is a lot of research going into more memory-efficient networks, and how to optimize memory usage without impacting accuracy.

I had not come across this Rust allocators collection, great. All my CUDA allocator research is detailed there.

2017-11-12 18:45:23

@mratsim Well, you mentioned game programming at first so "serious deep learning" didn't come to my mind. ^^"

As for Rust allocators: you link to the Linux kernel slab allocator on the page you linked. There is (or probably: are) a slab allocator available for Rust as well.

2017-11-12 21:44:30

I found these useful when I worked on my memory caches:

http://antirez.com/news/109

https://arxiv.org/pdf/1512.00727.pdf

2017-11-13 12:24:20