Transparent Memory Extension for Shared GPUs
Dr.-Ing. Jens Kehne
Dissertation, Fakultät für Informatik, Institut für Technische Informatik (ITEC), Karlsruher Institut für Technologie (KIT)
Date: 07.02.2019
Over the last few years, graphics processing units (GPUs) have become popular for general-purpose computing. Consequently, all major cloud providers have included GPUs in their platforms. These platforms typically use virtualization to share physical resources between users, which increases the utilization of those resources. Utilization can be increased even further through oversubscription: Since users tend to buy more resources than they actually need, providers can offer more resources than are physically available, hoping that customers will not fully utilize all of their promised resources at the same time. If customers do fully utilize their resources, however, the provider must be prepared to keep the customers’ applications running even when resource demands exceed the capacity of the physical hardware.
The memory of modern GPUs can be oversubscribed easily since these GPUs support virtual memory not unlike that found in CPUs. Cloud providers can thus grant large virtual address spaces to their customers, only allocating physical memory if a customer actually uses that memory. Shortages of GPU memory can be mitigated by evicting data from GPU memory into the system’s main memory. However, evicting data from the GPU is complicated by the asynchronous nature of today’s GPUs: Users can submit kernels directly into the command queues of these GPUs, with the GPU handling scheduling and dispatching autonomously. In addition, GPUs assume that all data allocated in GPU memory is accessible at any time, forcefully terminating any GPU kernel that tries to access unavailable data.
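The mechanism described above can be illustrated with a minimal sketch, assuming a toy model in which virtual pages are granted eagerly but physical GPU pages are only backed on first touch, and a shortage is resolved by evicting the oldest resident page into main memory (all names and the FIFO victim choice are illustrative, not the thesis implementation):

```python
GPU_PHYS_PAGES = 4  # tiny physical GPU memory, for illustration only

class OversubscribedGpuMemory:
    def __init__(self):
        self.backing = {}        # virtual page -> None, "gpu", or "host"
        self.gpu_resident = []   # FIFO of pages currently in GPU memory

    def allocate(self, page):
        """Grant a virtual page without consuming physical GPU memory."""
        self.backing[page] = None  # not backed yet

    def touch(self, page):
        """Back the page on first access, evicting to host memory if full."""
        if self.backing[page] == "gpu":
            return
        if len(self.gpu_resident) >= GPU_PHYS_PAGES:
            victim = self.gpu_resident.pop(0)  # evict the oldest page
            self.backing[victim] = "host"      # its data moves to main memory
        self.gpu_resident.append(page)
        self.backing[page] = "gpu"

mem = OversubscribedGpuMemory()
for p in range(6):     # grant 6 virtual pages; only 4 fit physically
    mem.allocate(p)
for p in range(6):     # touching all 6 forces pages 0 and 1 to host memory
    mem.touch(p)
```

After the loop, pages 0 and 1 are backed by host memory while pages 2 through 5 remain in GPU memory, which is exactly the shortage-mitigation behavior described above.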
Previous work typically circumvented this problem by introducing a software scheduler for GPU kernels, which selects the next kernel to execute whenever the previous kernel finishes. If data from the next kernel’s address space has been evicted, the scheduler returns that data to GPU memory before launching the kernel, evicting data belonging to other applications in the process. The main disadvantage of this approach is that scheduling GPU kernels in software bypasses the GPU’s own highly efficient scheduling and context switching, and therefore induces significant overhead even in the absence of memory pressure.
In this thesis, we present GPUswap, a novel approach to oversubscription of GPU memory which does not require software scheduling of GPU kernels. In contrast to previous work, GPUswap evicts data on memory allocation requests instead of kernel launches: When an application attempts to allocate memory, but there is insufficient GPU memory available, GPUswap evicts data from the GPU into the system’s main memory to make room for the allocation request. GPUswap then uses the GPU’s virtual memory to map the evicted data directly into the address space of the application owning the data. Since evicted data is thus directly accessible to the application at any time, GPUswap can allow applications to submit kernels directly to the GPU without the need for software scheduling. Consequently, GPUswap does not induce any overhead as long as sufficient GPU memory is available. In addition, GPUswap eliminates unnecessary copying of data: Only evicted data that is actually accessed by a GPU kernel is transferred over the PCIe bus, while previous work indiscriminately copied all data a kernel might access prior to kernel launch. Overall, GPUswap thus delivers consistently higher performance than previous work, regardless of whether or not a sufficient amount of GPU memory is available.
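The allocation-path eviction described above can be sketched as follows; this is a simplified model, not the actual driver code, and the buffer names and capacity are invented for illustration:

```python
GPU_CAPACITY = 8  # physical GPU pages, tiny for illustration

class GpuSwapSketch:
    def __init__(self):
        self.used = 0
        self.location = {}  # buffer name -> "vram" or "sysram"
        self.size = {}      # buffer name -> size in pages

    def alloc(self, buf, pages):
        # Evict resident buffers until the new allocation fits. Evicted
        # data stays mapped in the owner's GPU address space, now backed
        # by system RAM and reachable by kernels over PCIe, so no
        # software scheduling of kernel launches is needed.
        while self.used + pages > GPU_CAPACITY:
            victim = next(b for b, loc in self.location.items()
                          if loc == "vram")
            self.location[victim] = "sysram"  # remap to host backing
            self.used -= self.size[victim]
        self.location[buf] = "vram"
        self.size[buf] = pages
        self.used += pages

gs = GpuSwapSketch()
gs.alloc("weights", 5)
gs.alloc("inputs", 3)    # GPU memory now full (8/8 pages)
gs.alloc("outputs", 4)   # triggers eviction of "weights" to system RAM
```

Note that eviction happens only inside `alloc`: as long as allocations fit, the hot path (kernel submission) is never touched, which is why no overhead arises while sufficient GPU memory is available.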
Since accessing evicted data over the PCIe bus nonetheless incurs non-trivial overhead, GPUswap should ideally evict rarely accessed pages first. However, the hardware features commonly used to identify such pages on the CPU, such as reference bits, are not available on current GPUs. We therefore rely on offline profiling to identify rarely accessed pages. In contrast to previous work on GPU memory profiling, which relied on compiler modifications, our profiler uses the GPU’s performance monitoring counters to profile an application’s GPU kernels transparently. Our profiler is therefore not limited to specific types of applications and does not require recompilation of third-party code such as shared libraries. Experiments with our profiler show that the number of accesses per page varies mostly between an application’s memory buffers, while pages within the same buffer tend to exhibit similar access counts.
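The profiling result above implies that ranking whole buffers, rather than individual pages, is enough to find good eviction victims. A small sketch with invented access counts (the buffer names and numbers are purely illustrative):

```python
# Hypothetical per-page access counts from a profiling run: counts differ
# strongly *between* buffers, while pages *within* one buffer see similar
# numbers of accesses -- mirroring the observation in the text.
profile = {
    "lookup_table": [9800, 9750, 9900, 9810],  # hot buffer
    "scratch":      [310, 290, 305, 300],      # warm buffer
    "checkpoints":  [2, 1, 3, 2],              # cold buffer
}

def mean(xs):
    return sum(xs) / len(xs)

# Rank buffers by average accesses per page, coldest first: an eviction
# policy would evict "checkpoints" before "scratch" or "lookup_table".
ranking = sorted(profile, key=lambda b: mean(profile[b]))
```

Because intra-buffer counts are nearly uniform, the per-buffer mean is a faithful summary, and per-page bookkeeping can be avoided entirely.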
Based on the results of our profiling, we examine several possible eviction policies and their viability on current GPUs. We then design a prototype policy which allows application developers to assign a priority to each buffer allocated by the application. Based on these priorities, our policy decides which buffer’s contents to evict first. Our policy requires no hardware features beyond those present in current GPUs, and our evaluation shows that it can relocate significant amounts of application data to system RAM with minimal overhead.
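The priority-driven victim selection can be sketched with a min-heap; the interface and buffer names below are hypothetical, intended only to show how developer-assigned priorities translate into an eviction order:

```python
import heapq

class PriorityEvictionPolicy:
    def __init__(self):
        # Min-heap of (priority, buffer): the lowest priority sits on top,
        # so the least important buffer is always evicted first.
        self.heap = []

    def register(self, buf, priority):
        """Record a developer-assigned priority for a buffer."""
        heapq.heappush(self.heap, (priority, buf))

    def pick_victim(self):
        """Return the buffer whose contents should be evicted next."""
        _priority, buf = heapq.heappop(self.heap)
        return buf

policy = PriorityEvictionPolicy()
policy.register("results", priority=10)    # accessed often: keep in VRAM
policy.register("debug_log", priority=1)   # rarely read: fine in sysram
policy.register("temp_state", priority=5)

first = policy.pick_victim()   # "debug_log" -- lowest priority goes first
second = policy.pick_victim()  # "temp_state"
```

Since the ordering depends only on developer-supplied numbers, no reference bits or other hardware access tracking is required, which is what makes such a policy viable on current GPUs.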