Efficient Persistent Memory

Modern systems increasingly need low-latency storage. NVMe-based solid-state drives (SSDs) cannot fully meet this need, because their asynchronous protocol adds latency.

Persistent memory (PM) offers a better solution. PM is byte-addressable and accessed directly from the CPU like main memory, but retains its contents while the system is powered off. PM therefore supports synchronous access with low latency, even for small access sizes.

Synchronous access to PM from the CPU is essential for low latency but becomes expensive once the PM is under load and cannot answer requests immediately. On traditional storage devices with asynchronous access (e.g., HDD, SSD), the operating system can schedule other processes during the wait time or put the CPU in a low-power sleep state. This is not possible with PM, as individual PM accesses are not visible to the operating system. Instead, the CPU pipeline stalls during the wait time, wasting CPU time and energy.

Our work introduces approaches for measuring and improving the efficiency of PM file systems. We provide efficiency metrics that assign a specific cost (CPU time or energy) to accessing a certain amount of storage. We evaluate these metrics on multiple file systems and find that most PM file systems do not access PM efficiently under a parallel write load.
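As a rough illustration of such a metric, the following sketch divides a measured cost by the amount of storage accessed. The class and function names, and the choice of GiB as the unit, are assumptions made for this example and are not taken from our tooling.

```python
from dataclasses import dataclass

GIB = 1024 ** 3  # bytes per GiB


@dataclass
class Measurement:
    """One benchmark run. All field names are hypothetical."""
    bytes_accessed: int   # total bytes read from or written to PM
    cpu_seconds: float    # CPU time summed over all participating cores
    energy_joules: float  # energy consumed during the run


def cpu_time_per_gib(m: Measurement) -> float:
    """CPU seconds spent per GiB of storage accessed (lower is better)."""
    return m.cpu_seconds / (m.bytes_accessed / GIB)


def energy_per_gib(m: Measurement) -> float:
    """Joules consumed per GiB of storage accessed (lower is better)."""
    return m.energy_joules / (m.bytes_accessed / GIB)
```

With placeholder numbers (not measured results), `cpu_time_per_gib(Measurement(4 * GIB, 2.0, 40.0))` yields 0.5 CPU seconds per GiB. Comparing this value between a single-threaded and a parallel run of the same workload shows whether the additional cores did useful work or mostly stalled.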

Three mechanisms for efficient PM access

We then propose three mechanisms for improving the efficiency of PM file systems under parallel load: two software-based approaches that limit parallel accesses and one approach based on hardware offloading.
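The general idea of limiting parallel accesses in software can be sketched with a counting semaphore. This is only a minimal illustration under assumed names (`PMAccessLimiter`, `write_block`); a `bytearray` stands in for a PM mapping, and the sketch does not reproduce the mechanisms proposed in our work.

```python
import threading


class PMAccessLimiter:
    """Admit at most `max_parallel` threads into a PM access section."""

    def __init__(self, max_parallel: int):
        self._slots = threading.BoundedSemaphore(max_parallel)
        self._lock = threading.Lock()
        self.active = 0  # threads currently inside the section
        self.peak = 0    # highest concurrency observed so far

    def __enter__(self):
        self._slots.acquire()  # block while the limit is reached
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        return self

    def __exit__(self, exc_type, exc, tb):
        with self._lock:
            self.active -= 1
        self._slots.release()
        return False  # do not swallow exceptions


def write_block(limiter, storage: bytearray, offset: int, data: bytes) -> None:
    """Copy `data` into `storage` while holding one limiter slot."""
    with limiter:
        storage[offset:offset + len(data)] = data
```

Threads that cannot obtain a slot block in `acquire()` and can be descheduled, instead of occupying a core that stalls on an overloaded device.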

Processes bypass the kernel using DAX.

With direct access (DAX), processes can obtain memory mappings to PM, bypassing the kernel. We therefore need a different solution to avoid overloading PM with accesses to DAX mappings. For effective limiting, we need a global view of the volume of PM accesses from all concurrently running applications. Existing hardware and operating system mechanisms cannot provide this information with association to both applications and PM devices. We introduce a monitoring approach based on memory access instruction sampling that provides the required association.
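The association step can be pictured as follows: each sampled memory access carries a process ID and an address, and the address is matched against the ranges that belong to PM devices. The sketch below assumes that samples are already available as plain tuples and that every sample stands for `sample_period` accesses; the type names, the device names, and this scaling rule are illustrative assumptions, not a description of our implementation.

```python
from collections import defaultdict
from typing import NamedTuple


class Sample(NamedTuple):
    pid: int      # process that issued the sampled access
    address: int  # accessed address
    size: int     # access size in bytes


class PMRange(NamedTuple):
    device: str  # hypothetical device name, e.g. "pmem0"
    start: int   # first address of the range
    end: int     # one past the last address (exclusive)


def attribute_samples(samples, ranges, sample_period):
    """Estimate bytes accessed per (pid, device) from sampled accesses."""
    totals = defaultdict(int)
    for s in samples:
        for r in ranges:
            if r.start <= s.address < r.end:
                # One sample represents `sample_period` similar accesses.
                totals[(s.pid, r.device)] += s.size * sample_period
                break  # ranges are assumed not to overlap
    return dict(totals)
```

Samples that fall outside every PM range (for example, DRAM accesses) are ignored, so the result contains only the per-application, per-device volumes needed for limiting.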

PM threads are isolated on a single core with core specialization.

We make monitoring data available to applications via shared memory, allowing them to react immediately to overload situations. Using this monitoring data, we implement a policy based on core specialization that can limit the number of CPU cores stalling on PM accesses.
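A simplified view of how an application might react to such monitoring data is sketched below. The record layout, the bandwidth threshold as overload signal, and all function names are assumptions for illustration; the buffer can be any writable bytes-like object, such as a shared-memory region.

```python
import os
import struct

# Hypothetical layout of one monitoring record in the shared region:
# PM device id (u32), threads accessing PM (u32), access volume in bytes/s (u64).
RECORD = struct.Struct("<IIQ")


def publish(buf, device_id, pm_threads, bytes_per_second):
    """Monitor side: write the current record at the start of `buf`."""
    RECORD.pack_into(buf, 0, device_id, pm_threads, bytes_per_second)


def is_overloaded(buf, max_bytes_per_second):
    """Application side: compare the published volume with a threshold."""
    _, _, rate = RECORD.unpack_from(buf, 0)
    return rate > max_bytes_per_second


def allowed_cores(overloaded, all_cores, pm_cores):
    """Core specialization: under overload, confine PM work to `pm_cores`."""
    return set(pm_cores) if overloaded else set(all_cores)


def apply_affinity(cores):
    """Pin the calling thread to `cores` (Linux only; no-op elsewhere)."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, cores)
```

A thread about to access PM would call `allowed_cores(is_overloaded(buf, limit), all_cores, pm_cores)` and pass the result to `apply_affinity`, so that under overload only the dedicated cores can stall on PM while the remaining cores stay available for other work.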

Contact: Lukas Werling

Articles
Title Author Source

Lukas Werling

Dissertation - Karlsruhe Institute of Technology (KIT)

Lukas Werling, Yussuf Khalil, Peter Maucher, Thorsten Gröninger, Frank Bellosa

DIMES'23, 1st Workshop on Disruptive Memory Systems, Koblenz, Germany, October 23, 2023