NUMA architectures in in-memory databases

Posted on November 27th, 2016 by Victor Herrero

In the last years, and somehow related to the arise of Big Data, databases technologies have widely enlarged their data management landscape by either, at the highest level, adopting new flexible data structures in comparison to traditional (fix) relational structures or, at the finest level, by new hardware solutions. The second gave birth to the so-called in-memory databases (IMDB), which are databases where data are essentially stored in main memory in order to achieve very high data access speeds. To achieve persistency, some products bet for ultimately storing those data in hard disk before shutdown, whereas some others use data replication across multiple nodes. In any case, IMDB are also tightly related to distributed databases. This is so because of NUMA (non-uniform memory access) architectures in which different memory access speeds can be achieved with regard to the memory socket where the data are actually placed.

NUMA is a memory access paradigm used to build processors with stronger computing power. In multi-socket architectures (the reader must understand a socket as a CPU socket from now on), each socket has its own memory but access to other socket memories is also granted. The pay-off is that accessing local memory is faster than remote. In NUMA architectures built by Intel, sockets are connected by means of QPI (Quick Path Interconnect) [1] links that allow transferring data between any pair of sockets. Sometimes the connection topology falls into a complete graph so that any data transfer can be done in one single hop, but usually the topology is simply a connected graph. This normally depends on the amount of sockets involved, as the cost of physically adding n(n-1)/2 QPI links (being n the number of sockets) in the complete graph case may drastically increment the hardware cost. Contrarily, if the number of sockets is small, several QPI links can connect the same pair of sockets so that transfer speed multiplies. All in all, memory in NUMA can be globally seen as the sum of all socket memories where each provides different access speed. As a consequence, in-memory database performance appears to be subject to the data placement.

Nevertheless, the concurrency problems originated by this memory access paradigm are many. The most relevant is the so-called cache-coherence problem. Despite remote memory can be accessed from any socket, cache memory is not shared and, therefore, several copies of the same datum may co-exist in different memory caches. To solve this, two different approaches exist: snoop-based and directory-based. The former can be simply seen as broadcast, where writes are notified to all sockets. In the latter, writes are logged in a directory (stored in one of the NUMA nodes) and sockets query for it before using a given datum to make sure that they consume the last value. In whatever approach, the QPI links are also used for such communication and, thus, they might overload with non-application data, especially in the case of the snoop-based approach, where the remaining bandwidth available for the database is strongly reduced. For this reason, newer hardware (Intel Haswell generation and next) may address the cache-coherence problem in some architectures by means of the directory-based approach. As a matter of fact, we empirically found that memory reads also produce noise on the QPI links. In snoop-based, an average of around 12.5% of the data read was sent to all other NUMA nodes which, in turn, sent back another 12.5% approximately as acknowledgement. Hence, solely reading data turned out to create a noise of 25% of the amount read. Oppositely, in directory-based, this was negligible.

Such values were measured by means of the Intel Performance Counter Monitors (PCM) tool [2], which uses special registers installed in sockets to count how many times certain events occur during a given period of time. There is one register for each type of event, and more specific information can be gathered by enabling or disabling flags in those registers. Unfortunately, registers are at the hardware level where no sense of application exists. Therefore, events are globally captured, which makes it important to reduce the number of running processes so that background noise is minimized. Some examples of events are the number of hits/misses in the cache memory, number of instructions or cycles, number of bytes read and/or written, number of bytes sent through each QPI link, among others. These events are then classified as core or uncore. The former refers to those that actually happen within the CPU itself, as the cache hits/misses, whereas the latter are events taking place in the memory controller such as the QPI traffic. Note that not all the processor generations support all the uncore events, but it actually depends on where the memory controller is physically installed. However, the Haswell generation allows collecting both core and uncore events. This generation was the one we used to measure the previous QPI traffic, first in an snoop-based architecture and, second, in an directory-based.

[1] An Introduction to the Intel® QuickPath Interconnect, http://www.intel.com/content/www/us/en/io/quickpath-technology/quick-path-interconnect-introduction-paper.html

[2] Intel® Performance Counter Monitor - A better way to measure CPU utilization, https://software.intel.com/en-us/articles/intel-performance-counter-monitor