
A Brief Look at the NUMA Mechanism — Zhihu

By 柴可喵斯基 (cloud computing | university lecturer | programmer)

Overview: this article is for newcomers who know the word NUMA and want to go one step further. It covers the background that produced NUMA, the details of the NUMA architecture, and a few hands-on demonstrations.

The background behind NUMA

Before NUMA appeared, CPUs hit a ceiling in the pursuit of ever-higher clock frequencies and turned toward more cores instead. At first the memory controller still lived in the northbridge, and every CPU reached memory through it, so all CPUs saw memory "uniformly", as shown in the figure below:

[figure: UMA]

This architecture is called UMA (Uniform Memory Access). It is very easy on the software layer: the bus model guarantees that all memory accesses are uniform, i.e., every processor core shares the same memory address space. But as CPU core counts grew, this architecture inevitably ran into problems, such as pressure on bus bandwidth and contention when several cores access the same memory. NUMA was created to solve these problems.

NUMA architecture details

NUMA stands for Non-Uniform Memory Access. Under this architecture, memory devices and CPU cores belong to different Nodes, and each Node has its own integrated memory controller (IMC, Integrated Memory Controller). Within a Node the structure resembles SMP, with the IMC bus used for communication between cores; Nodes communicate with one another over QPI (QuickPath Interconnect), as shown below:

[figure: NUMA]

Generally, one memory socket corresponds to one Node. One property to keep in mind is that QPI latency is higher than IMC bus latency, so CPU memory accesses now come in remote and local varieties — and experiments show the difference is pronounced.

On Linux, several NUMA-related points deserve attention: by default, the kernel does not migrate memory pages from one NUMA node to another; there is, however, a ready-made facility for migrating cold pages to a remote node: NUMA Balancing; and the rules for migrating pages between NUMA nodes are still being debated in the community. For a first acquaintance with NUMA this is enough, and this article's discussion of internals stops here; to dig deeper, see the 开源小站 article linked in the original.

Hands-on demos

NUMA Node layout: the author's machine has two NUMA nodes, each managing 16 GB of memory.

NUMA Node binding: the cost of communication between Nodes is not equal — even two remote nodes may differ in cost — and this information is shown as a matrix under "node distances". A process can be bound to run on a given CPU, or on a given NUMA node's memory, as shown in the screenshot. (A minimal C sketch of such binding follows below.)

NUMA status: [screenshot]

Published 2019-05-30.
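The same binding can be done from inside a program. The following is a minimal C sketch using libnuma — an assumption on my part, since the article itself only demonstrates command-line tools; build with `gcc -o bind bind.c -lnuma`.

/*
 * Minimal sketch (assumes libnuma is installed).
 * Pins the calling thread to NUMA node 0 and allocates memory from that
 * node, mirroring what external node binding does.
 */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {            /* kernel without NUMA support */
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    numa_run_on_node(0);                   /* schedule only on node 0's CPUs */
    size_t len = 64 * 1024 * 1024;
    char *buf = numa_alloc_onnode(len, 0); /* pages come from node 0 */
    if (!buf) { perror("numa_alloc_onnode"); return 1; }
    for (size_t i = 0; i < len; i += 4096) /* touch pages so they are faulted in */
        buf[i] = 1;
    printf("allocated %zu MiB on node 0\n", len >> 20);
    numa_free(buf, len);
    return 0;
}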

CPU Knowledge Every Programmer Should Know: NUMA — Zhihu

By FOCUS

What is NUMA? In early computers the memory controller had not yet been integrated into the CPU, and all memory accesses went through the northbridge chip. As shown in the figure below, the CPU connects to the northbridge over the front-side bus (FSB, Front Side Bus), and the northbridge connects to memory — the memory controller is integrated into the northbridge. This architecture is called UMA[1] (Uniform Memory Access): the bus model guarantees that all of the CPU's memory accesses are uniform, with no need to consider differences between memory addresses.

Under UMA, all traffic between CPU and memory crosses the front-side bus, and the way to raise performance was to keep raising the clock rates of the CPU, the FSB, and memory. Most readers know the rest of the story: physical limits ended the climb in clock frequency, and CPU performance gains shifted from higher clocks to more CPUs (multi-core, multi-socket). With more and more CPUs contending for it, the front-side bus became the bottleneck.

To remove UMA's bottleneck, the NUMA[2] (Non-Uniform Memory Access) architecture was born: CPU vendors integrated the memory controller into the CPU itself, generally one independent memory controller per CPU socket. Each CPU socket connects directly to a portion of memory; this directly attached memory is called "local memory". Sockets are connected to one another by the QPI (Quick Path Interconnect) bus, over which a CPU can reach "remote memory" that is not directly attached to it. Unlike UMA, memory access under NUMA is split into local and remote, and the latency of a remote access is markedly higher than that of a local one.

Configuring NUMA

Linux provides the numactl[3] command for viewing and setting NUMA parameters. Running `numactl --hardware` shows the hardware's NUMA topology:

# numactl --hardware

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 0 size: 96920 MB
node 0 free: 2951 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 1 size: 98304 MB
node 1 free: 33 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

The CPUs are split into two groups, node 0 and node 1 (this machine has two CPU sockets), and each group is attached to 96 GB of memory (the machine has 192 GB in total). `node distances` is a two-dimensional matrix in which node[i][j] gives the relative cost for node i to access node j's memory: here node 0 reaches its own memory at distance 10, but node 1's memory at distance 21. Running `numactl --show` displays the current NUMA settings:

# numactl --show

policy: default

preferred node: current

physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95

cpubind: 0 1

nodebind: 0 1

membind: 0 1

The numactl command has several other important options:

--cpubind=0: run on node 0's CPUs only.
--membind=1: allocate memory on node 1 only.
--interleave=nodes: nodes may be all, N,N,N, or N-N; allocate memory round-robin across those nodes.
--physcpubind=cpus: cpus are the processor (hyperthread) numbers from /proc/cpuinfo, in the same format as --interleave=nodes; bind execution to those cpus.
--preferred=1: prefer to allocate memory from node 1.

A few numactl examples:

# Run test_program with argument `arguments`, bound to node 0's CPUs and node 1's memory

numactl --cpubind=0 --membind=1 test_program arguments

# Run test_program on processors 0-4 and 8-12

numactl --physcpubind=0-4,8-12 test_program arguments

# Interleave memory allocation across all nodes

numactl --interleave=all test_program arguments

# Prefer to allocate memory from node 1

numactl --preferred=1
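For reference, here is a hedged sketch of rough in-process equivalents of these numactl flags, using libnuma (an assumption; the article itself only uses the command-line tool). Policies set this way apply to the process's subsequent allocations.

/* Sketch: rough in-process equivalents of the numactl flags above
 * (assumes libnuma; link with -lnuma). */
#include <numa.h>

void configure_numa_policy(void) {
    if (numa_available() < 0)
        return;
    numa_run_on_node(0);          /* ~ --cpubind=0 */
    numa_set_preferred(1);        /* ~ --preferred=1 */
    /* ~ --interleave=all: spread new allocations across all nodes */
    numa_set_interleave_mask(numa_all_nodes_ptr);
}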

Testing NUMA

#include <cstdlib>
#include <iostream>
#include <string>
#include <sys/time.h>
#include <vector>

int main(int argc, char** argv) {
    int size = std::stoi(argv[1]);
    std::vector<std::vector<int>> data(size, std::vector<int>(size));

    struct timeval b;
    gettimeofday(&b, nullptr);
    // Traverse by column to minimize the CPU cache's influence on the measurement
    for (int col = 0; col < size; ++col) {
        for (int row = 0; row < size; ++row) {
            data[row][col] = rand();
        }
    }
    struct timeval e;
    gettimeofday(&e, nullptr);

    std::cout << "Use time "
              << e.tv_sec * 1000000 + e.tv_usec - b.tv_sec * 1000000 - b.tv_usec
              << "us" << std::endl;
}

# numactl --cpubind=0 --membind=0 ./numa_test 20000

Use time 16465637us

# numactl --cpubind=0 --membind=1 ./numa_test 20000

Use time 21402436us

As the numbers show, the test program runs nearly 30% slower with remote memory than with local memory.

Linux's NUMA policy

Once Linux detects a NUMA architecture, its default memory-allocation scheme is: allocate from the local node first; when local memory runs short, prefer to evict unused local memory; and keep memory pages on the same node as the thread that uses them. This default is usually fine for applications that do not allocate large amounts of memory. But for applications such as databases, whose allocations can exceed the capacity of a single NUMA node, it can cause odd performance problems. A well-known example from the web: because of Linux's default NUMA allocation policy, MySQL suffered heavy page swap-out, and hence performance jitter, even when plenty of memory was free overall:

The MySQL "swap insanity" problem and the effects of the NUMA architecture[4]
A brief update on NUMA and MySQL[5]

References
[1] UMA (Uniform Memory Access): https://en.wikipedia.org/wiki/Uniform_memory_access
[2] NUMA (Non-Uniform Memory Access): https://en.wikipedia.org/wiki/Non-uniform_memory_access
[3] numactl: https://linux.die.net/man/8/numactl
[4] The MySQL "swap insanity" problem and the effects of the NUMA architecture: http://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/
[5] A brief update on NUMA and MySQL: http://blog.jcole.us/2012/04/16/a-brief-update-on-numa-and-mysql/
[6] NUMA架构的CPU -- 你真的用好了么?: http://cenalulu.github.io/linux/numa/
[7] Thread and Memory Placement on NUMA Systems: Asymmetry Matters: https://www.usenix.org/conference/atc15/technical-session/presentation/lepers
[8] NUMA (Non-Uniform Memory Access): An Overview: https://queue.acm.org/detail.cfm?id=2513149
[9] NUMA Memory Policy: https://www.kernel.org/doc/html/latest/admin-guide/mm/numa_memory_policy.html
[10] What is NUMA?: https://www.kernel.org/doc/html/latest/vm/numa.html

Published 2020-12-12.

What Is NUMA, and Why Should We Understand It? — Zhihu

What exactly does NUMA refer to? How can we observe its presence, and how does its existence affect the way we program? Let's take a look.

1. Where NUMA came from

NUMA (Non-Uniform Memory Access) is an architectural model for how multiple CPUs access memory. In early computer systems, CPUs accessed memory like this: [figure] all CPUs reach memory over a single shared bus. This is called the SMP architecture (Symmetric Multi-Processor). SMP has four notable traits: CPUs connect to each other and to memory over one shared bus; all CPUs are equals, with no master/slave relationship; all hardware resources are shared, so every CPU can reach any memory or peripheral; and memory is uniformly organized and uniformly addressed (UMA, Uniform Memory Architecture).

With few cores, SMP's problems are not obvious — experiments showed that CPU utilization in an SMP server is best with 2 to 4 CPUs. But as multi-core technology packed more and more cores into each physical CPU, SMP's performance bottleneck became increasingly clear: all processors share one bus, so the system bus saturates as processors are added, and processor-to-memory latency is also high. To solve the performance problems caused by ever more CPU cores under SMP, the NUMA architecture was born; it rearranges the layout of, and access relationship between, CPUs and memory, as shown below: [figure]

In NUMA, the CPUs are divided into NUMA Nodes, each with its own independent memory space and PCIe bus system; the nodes talk to each other over the QPI bus. The speed of a memory access depends on the distance to the target node: the local node is fastest and remote nodes slowest — the farther the node, the slower the access — hence "non-uniform memory access"; this distance metric is called the Node Distance.

NUMA nicely solved the performance problems of large CPU counts under SMP, but it has its own weakness: when a node's local memory runs out, memory must be accessed across nodes, which is slow and again costs performance. So when writing applications we should play to NUMA's strengths: minimize traffic between CPU modules and avoid accessing remote resources. If an application can be pinned inside one CPU module, its performance can improve substantially. (A sketch of this pattern follows at the end of this article.)

2. CPU and memory layout under NUMA

On a Linux system we can inspect how CPUs and memory are distributed across NUMA nodes, but first some terminology:

Socket: the package of one physical CPU (the physical CPU slot). To avoid confusing logical and physical processors, Intel calls the physical processor a socket; it is the CPU you can actually see.
Core: an independent hardware execution unit inside a physical CPU package — registers, compute units, and so on. Cores within the same physical CPU each have their own L1 and L2 caches and share the L3 cache.
Thread: a logical core created by hyper-threading technology, which requires CPU support. To distinguish them, logical cores are usually listed as "processors". On Intel processors with Hyper-Threading, each core presents two logical processors that share most core resources (such as caches and functional units). During the gaps while one logical core waits for an instruction to execute (for example, waiting to fetch the next instruction from cache or memory), the time slice can be given to the other logical core; switching quickly between the two hides the gaps, so each application believes it owns a full core. Each logical thread has a complete, independent register set and local interrupt logic, while the execution units and the L1/L2/L3 caches are shared. Hyper-threading can yield roughly a 20-30% performance gain.
Node: a NUMA Node, a group containing some number of CPU cores.

The relationship between Socket, Core, and Thread is sketched in the figure. On Linux, lscpu shows the mapping between NUMA nodes and CPUs: [figure] This server has two NUMA nodes and two sockets; each socket (one physical CPU) has 14 cores, and each core runs two threads (hyper-threading is enabled), so the total CPU count, counting hyperthreads, is 2*14*2 = 56. The `numactl -H` command shows the memory layout per NUMA node: [figure] So CPUs and memory on this server are distributed across NUMA nodes as shown. CPUs in a NUMA system are numbered core-first; with hyper-threading enabled, the hyperthread siblings continue the numbering after the core total (for example, in the figure the entries from cpu8 onward are hyperthread siblings).

In practice, we can also use the numastat command to check NUMA memory-access hit rates:

numa_hit: number of pages successfully allocated on this node.
numa_miss: number of pages allocated on this node because the intended node was low on memory. Each numa_miss event has a corresponding numa_foreign event on another node.
numa_foreign: number of pages initially intended for this node but allocated on another node. Each numa_foreign event has a corresponding numa_miss event on the other node.
interleave_hit: number of interleave-policy pages successfully allocated on this node.
local_node: number of pages successfully allocated on this node by processes running on it.
other_node: number of pages allocated on this node by processes running on other nodes.

High miss and foreign values are a signal to look at thread binding and memory placement.

Note that NUMA nodes and sockets are not necessarily one-to-one. On AMD CPUs it is common to see more NUMA nodes than sockets (AMD's NUMA layout can generally be configured in the BIOS), while on Intel CPUs there may be fewer NUMA nodes than sockets. For example, the server below uses AMD EPYC 7001 CPUs and has 8 NUMA nodes but only 2 sockets: [figure]

3. Programming under NUMA

NUMA's defining property is that accessing local memory is fast and accessing remote memory is slow. So when programming under NUMA — especially in multi-core, multi-threaded code — we should exploit CPU core affinity: bind each thread to a CPU, and have that thread allocate from that CPU's local memory; only then does the NUMA architecture deliver its best performance. Put simply: local processors and local memory should handle the data produced by local devices. If a PCI device (say, a NIC) sits on Node0, use Node0's cores to service it, and allocate the data structures and buffers it uses from Node0. In DPDK, the rte_socket_id() function returns the NUMA node the calling thread runs on, and the DPDK memory-allocation interfaces generally take a NUMA id parameter — precisely to improve packet-forwarding performance on NUMA architectures.

That concludes this introduction to NUMA and to CPU/memory layout under NUMA. Published 2023-07-14.
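As a concrete illustration of the thread-affinity advice above, here is a minimal C sketch (assuming libnuma and Linux's sched_setaffinity; the CPU number 3 is an arbitrary example, not from the article). Build with `gcc -o local local.c -lnuma`.

/* Sketch of the "local CPU + local memory" pattern the article recommends. */
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) return 1;

    int cpu = 3;                              /* example CPU to pin to */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof(set), &set);  /* bind this thread to CPU 3 */

    int node = numa_node_of_cpu(cpu);         /* which node owns CPU 3? */
    double *buf = numa_alloc_onnode(1 << 20, node); /* allocate on that node */
    printf("CPU %d lives on node %d\n", cpu, node);
    numa_free(buf, 1 << 20);
    return 0;
}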

Linux内存管理:NUMA技术详解(非一致内存访问架构)_非一致性互连-CSDN博客

>

Linux内存管理:NUMA技术详解(非一致内存访问架构)_非一致性互连-CSDN博客

Linux内存管理:NUMA技术详解(非一致内存访问架构)

Source: https://blog.csdn.net/Rong_Toa/article/details/109137312 (licensed CC 4.0 BY-SA)

Image source: https://zhuanlan.zhihu.com/p/68465952

Related articles:
"Linux Memory Management: How the Translation Lookaside Buffer (TLB) Works"
"Memory Management: Linux Memory Management — MMU, Segmentation, Paging, PAE, Cache, TLB"
"Memory Management Concepts Overview"
"NUMA — Non-Uniform Memory Architecture"
"What is NUMA?"
"NUMA (Non-Uniform Memory Access) and Active NUMA Memory Policy"
"Reading Notes on Deep Dive into DPDK (3): NUMA — Non-Uniform Memory Architecture"
"Linux's NUMA Technology" (two copies)
"Physical Memory Management in Linux [Part 1]"
"Memory and Scheduling in the Linux 2.6.32 NUMA Architecture"

Table of Contents
1. Background
2. NUMA memory management
3. The NUMA scheduler
4. CpuMemSets
5. Tests
Related reading

1. Background

Physical memory is the actual memory hardware installed in the machine (not including hardware caches), accessed by the CPU over a bus. In a multi-core system, if physical memory looks the same to every CPU and every CPU accesses it in the same way, the architecture is called Uniform Memory Access (UMA). If physical memory is distributed, composed of multiple cells (for example, each core having its own local memory), then a CPU accesses its nearby local memory faster than it accesses another CPU's memory or global memory; that architecture is called Non-Uniform Memory Access (NUMA).

The above is NUMA at the hardware level. At the software level, Linux abstracts the concept: even on UMA hardware with one contiguous block of memory, Linux can carve it into several nodes; likewise, NUMA hardware with discontiguous physical memory can be treated as UMA. So on a Linux system you can test an application's NUMA behavior on a UMA platform. Seen another way, UMA is simply the special case of NUMA with a single node, so both can be described uniformly by the NUMA model. (Source: https://zhuanlan.zhihu.com/p/68465952)

In traditional SMP (symmetric multiprocessing), all processors share the system bus; as the number of processors grows, contention on the bus grows with it and the bus becomes the bottleneck, which is why SMP systems typically top out at a few dozen CPUs — their scalability is severely limited. NUMA combines SMP's ease of programming with MPP's (massively parallel processing's) scalability, largely solving SMP's scaling problem, and has become one of today's mainstream architectures for high-performance servers.

On a NUMA system, when the Linux kernel receives a memory-allocation request, it looks for free memory first in the memory node local to (or nearest) the requesting CPU. This approach is called local allocation, and it keeps subsequent memory accesses local to the underlying physical resources. Each node consists of one or more zones (whenever you see "zone" amid all the descriptions of virtual and physical memory, it refers to physical memory), and each zone consists of page frames (a page frame is a physical page). (Source: https://zhuanlan.zhihu.com/p/68465952)

High-performance servers based on NUMA include HP's Superdome, SGI's Altix 3000, IBM's x440, NEC's TX7, and AMD's Opteron-based machines (the reference material dates from July 2004). (Source: https://zhuanlan.zhihu.com/p/68465952)
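A small userspace experiment can confirm this local-allocation behavior. The sketch below (my addition, assuming libnuma's numaif.h is available; link with -lnuma) uses the documented trick of calling move_pages() with a NULL target-node list, which merely reports the node each page currently lives on:

#define _GNU_SOURCE
#include <numa.h>
#include <numaif.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    char *page;
    if (posix_memalign((void **)&page, 4096, 4096)) return 1;
    page[0] = 1;                             /* fault the page in: it now has a home node */

    void *pages[1] = { page };
    int status[1];
    if (move_pages(0, 1, pages, NULL, status, 0) != 0) {
        perror("move_pages");
        return 1;
    }
    int cpu = sched_getcpu();
    printf("running on CPU %d (node %d); page is on node %d\n",
           cpu, numa_node_of_cpu(cpu), status[0]);
    free(page);
    return 0;
}

Under the default policy on a NUMA machine, the reported page node should normally match the node of the CPU the thread ran on when it first touched the page.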

 

1.1 Concepts

(Source: https://www.cnblogs.com/zhenjing/archive/2012/03/21/linux_numa.html)

A NUMA system has multiple nodes; each node can contain multiple CPUs (each CPU with several cores or threads) and uses a shared memory controller within the node, so all memory in a node is equally close to every CPU in that node, but not to the CPUs in other nodes. Nodes fall into three types: local nodes, neighbour nodes, and remote nodes. Local node: for all CPUs within a node, that node is their local node. Neighbour node: a node adjacent to the local node. Remote node: any node that is neither local nor a neighbour. Neighbour and remote nodes together are called off-nodes.

CPU access speed differs by node type: local > neighbour > remote. Access to the local node is fastest and to remote nodes slowest; that is, access speed falls as node distance grows. This distance is called the Node Distance. (A small sketch for reading these distances programmatically follows below.)

In commonly used NUMA systems, the hardware design guarantees that all caches in the system are coherent (Cache Coherent, ccNUMA). Cache-synchronization time differs between node types, however, which can make resource contention unfair; for certain specialized applications, FIFO spinlocks can be used to guarantee fairness.
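Here is the promised sketch: the Node Distance values can be read programmatically with libnuma (assumed available; link with -lnuma). The matrix printed matches the node distances shown by `numactl --hardware`.

#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) return 1;
    int n = numa_max_node();
    for (int i = 0; i <= n; i++) {
        for (int j = 0; j <= n; j++)
            printf("%4d", numa_distance(i, j)); /* 10 = local, larger = farther */
        printf("\n");
    }
    return 0;
}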


2. NUMA memory management

A NUMA system is built from multiple nodes connected by a high-speed interconnect; Figure 1 shows two nodes of an SGI Altix 3000 ccNUMA system.

[Figure 1: two nodes of the SGI Altix 3000 system]

A NUMA node typically consists of a set of CPUs (in the SGI Altix 3000, two Itanium 2 CPUs) and local memory; some nodes may also have an I/O subsystem. Because every node has its own local memory, the system's memory is physically distributed, and the latency of accessing local memory differs from that of accessing another node's remote memory. To reduce the impact of non-uniform access on the system, the hardware design should minimize remote-access latency (for example, through the cache-coherence design), and the operating system must be aware of the hardware topology so that it can optimize memory accesses.

On IA64 Linux, the physical topology of supported NUMA servers is described through ACPI (Advanced Configuration and Power Interface). ACPI is a BIOS specification jointly defined by Compaq, Intel, Microsoft, Phoenix, and Toshiba; it defines a very broad range of configuration and power management. At the time of writing, the specification had reached version 2.0, with 3.0 in preparation (details at http://www.acpi.info). ACPI is also widely used in IA-32 Xeon server systems.

Linux obtains the physical memory layout of a NUMA system from the firmware's ACPI tables, most importantly the SRAT (System Resource Affinity Table) and SLIT (System Locality Information Table). SRAT contains two structures: the Processor Local APIC/SAPIC Affinity Structure, which records information about a CPU, and the Memory Affinity Structure, which records information about memory. SLIT records the distances between nodes, kept in the kernel in the node_distance[] array.

Linux describes physical memory with a three-level structure of nodes, zones, and pages, as shown in Figure 2.

[Figure 2: the relationship between Node, Zone, and Page in Linux]

2.1 Nodes

Linux describes each node's memory with a struct pg_data_t; every node in the system is linked into the pgdat_list list. On UMA architectures there is a single static pg_data_t called contig_page_data; for NUMA systems this extends naturally, with one Linux node per NUMA node (see linux/mmzone.h):

typedef struct pglist_data {
    zone_t node_zones[MAX_NR_ZONES];
    zonelist_t node_zonelists[GFP_ZONEMASK+1];
    int nr_zones;
    struct page *node_mem_map;
    unsigned long *valid_addr_bitmap;
    struct bootmem_data *bdata;
    unsigned long node_start_paddr;
    unsigned long node_start_mapnr;
    unsigned long node_size;
    int node_id;
    struct pglist_data *node_next;
} pg_data_t;

The main fields:

node_zones — the node's zones, generally ZONE_HIGHMEM, ZONE_NORMAL, and ZONE_DMA.
node_zonelists — the zone ordering used when allocating memory; set up by free_area_init_core() via build_zonelists() in page_alloc.c.
nr_zones — the number of zones in this node, from 1 to 3 (not every node needs all three).
node_mem_map — the first page of the struct page array describing every physical frame in the node; it sits at some position within the global mem_map array, depending on the node's order in the system.
valid_addr_bitmap — a bitmap describing holes in the node's memory.
node_start_paddr — the node's starting physical address.
node_start_mapnr — the node's page offset within the global mem_map; free_area_init_core() computes the number of page frames between mem_map and the node's local lmem_map.
node_size — the total number of page frames in this node.
node_id — the node's ID; node IDs start from 0 across the system.

All nodes in the system are maintained on the pgdat_list list, initialized in init_bootmem_core().
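From userspace, the per-node totals that pg_data_t tracks in-kernel can be approximated with libnuma's numa_node_size64() — a hedged sketch of my own, not from the original article (link with -lnuma):

#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) return 1;
    for (int node = 0; node <= numa_max_node(); node++) {
        long long free_bytes;
        long long size = numa_node_size64(node, &free_bytes);
        printf("node %d: %lld MB total, %lld MB free\n",
               node, size >> 20, free_bytes >> 20);
    }
    return 0;
}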

How zonelists are ordered

(Source: https://www.cnblogs.com/zhenjing/archive/2012/03/21/linux_numa.html)

With Node-ordered zonelists, each node's list sorts the other nodes by their Node Distance from it, so that allocations fall back to the closest memory first.

zonelist[2]: once NUMA is configured, each node is associated with two zonelists:
1) zonelist[0] holds a zonelist organized by Node or by Zone, including all nodes' zones;
2) zonelist[1] holds only the node's own zones — the "legacy" layout.
zonelist[1] is used to implement allocation restricted to the node's own zones (see the __GFP_THISNODE flag).

Page Frames

Although the smallest unit of memory access is the byte or the word, the MMU walks page tables at page granularity, so the page is the basic unit of Linux memory management: swap-out, reclaim, mapping, and other operations all work on pages. The struct page describing a page frame is therefore one of the kernel's most heavily used and most important structures (shown here from an older kernel for the sake of exposition):

struct page {
    unsigned long flags;
    atomic_t count;
    atomic_t _mapcount;
    struct list_head lru;
    struct address_space *mapping;
    unsigned long index;
    ...
};

flags holds the page frame's state and attributes, including reclaim-related bits such as PG_active, PG_dirty, PG_writeback, PG_reserved, PG_locked, and PG_highmem; flags actually wears several hats and has other uses described below. count is the reference count: when it reaches 0 the page frame can be freed; if it is nonzero, the page is in use by some process or by the kernel (page_count() reads it). _mapcount is the number of times the frame is mapped, i.e., how many page-table entries contain this frame's PFN. lru ("least recently used") links a reclaimable page frame, according to how actively it is used, into either the active_list or the inactive_list doubly linked list, as the basis for reclaim decisions; lru holds the pointers to the neighbouring entries in whichever list the page is on. If a page belongs to a file (i.e., it is in the page cache), mapping points to the file inode's address_space (despite the name, this is not a process's address space), and index is the page's offset within the file in page-size units; with the inode and index, the kernel can find the corresponding position in the file when the page's contents must be synchronized with external disk/flash. If a page is anonymous, mapping points to swapper_space, the address_space representing the swap cache, and index is the offset within swapper_space.

In fact, the latest struct page implementations use unions heavily — the same field means different things in different contexts. The reason: every page frame needs a struct page; with 4 KB frames and a 32-byte struct page, the structs consume 32/4096 of all system memory, a bit under 1% — small, but on a machine with 4 GB of physical memory this one item can cost over 30 MB. Saving 4 bytes in struct page would save roughly 4 MB of memory, so the structure's layout must be designed with great care: rather than adding a field for every new use case, fields are reused wherever possible.

Note that struct page describes and manages the 4 KB of physical memory; it does not track changes to the data stored there. (Source: https://zhuanlan.zhihu.com/p/68465952)

2.2 Zones

Each node's memory is divided into several blocks called zones, each representing a range of memory. A zone is described by struct zone_t; the main types are ZONE_DMA, ZONE_NORMAL, and ZONE_HIGHMEM. ZONE_DMA sits at the low end of memory and serves certain legacy ISA devices.

ZONE_NORMAL memory is mapped directly into the high part of the kernel's linear address space; many kernel operations can only take place in ZONE_NORMAL.

Because of hardware constraints, the kernel cannot treat all page frames alike, so frames with the same properties are grouped into a zone. The division into zones is hardware-dependent and may differ between processor architectures. On i386, for instance, some DMA devices can only address the first 16 MB of physical memory, so 0-16 MB becomes ZONE_DMA; ZONE_HIGHMEM covers physical memory beyond what the virtual address space can map directly; everything else is ZONE_NORMAL. (Source: https://zhuanlan.zhihu.com/p/68465952)

On x86, for example, the zones cover the following physical addresses:

ZONE_DMA — first 16 MB
ZONE_NORMAL — 16 MB to 896 MB
ZONE_HIGHMEM — above 896 MB

A zone is described by struct zone_t, which tracks page-frame usage, free areas, locks, and related information:

typedef struct zone_struct {
    spinlock_t lock;
    unsigned long free_pages;
    unsigned long pages_min, pages_low, pages_high;
    int need_balance;
    free_area_t free_area[MAX_ORDER];
    wait_queue_head_t *wait_table;
    unsigned long wait_table_size;
    unsigned long wait_table_shift;
    struct pglist_data *zone_pgdat;
    struct page *zone_mem_map;
    unsigned long zone_start_paddr;
    unsigned long zone_start_mapnr;
    char *name;
    unsigned long size;
} zone_t;

On some other processor architectures, ZONE_DMA may be unnecessary and ZONE_HIGHMEM may not exist. On 64-bit x64, for example, the kernel virtual address space is large enough that the ZONE_HIGHMEM mapping is no longer needed; but to distinguish DMA devices using 32-bit addresses from those using 64-bit addresses, 64-bit systems define both ZONE_DMA32 and ZONE_DMA.

So the same ZONE_DMA means different things on 32-bit and 64-bit systems, and ZONE_DMA32 is meaningful only on 64-bit systems — on 32-bit it would be identical to ZONE_DMA and has no independent reason to exist.

In addition, there are ZONE_MOVABLE, which helps prevent memory fragmentation, and ZONE_DEVICE, which supports device hot-plug. The kinds of zones present on a system can be listed with `cat /proc/zoneinfo | grep Node`:

[rongtao@toa ~]$ cat /proc/zoneinfo | grep Node
Node 0, zone      DMA
Node 0, zone    DMA32
[rongtao@toa ~]$

The main fields:

lock — a spinlock protecting the zone.
free_pages — the zone's total free pages.
pages_min, pages_low, pages_high — the zone's watermarks.
need_balance — a flag telling kswapd that the zone's pages need to be swapped.
free_area — the free-area bitmaps used by the buddy allocator.
wait_table — a hash table of queues of processes waiting on page release, essential to wait_on_page() and unlock_page(); if all waiters pile onto one queue, they thrash when woken.
zone_mem_map — the first page in the global mem_map that this zone refers to.
zone_start_paddr — analogous to node_start_paddr.
zone_start_mapnr — analogous to node_start_mapnr.
name — the zone's name: "DMA", "Normal", or "HighMem".
size — the zone's size, in pages.

When available memory runs low, kswapd is woken to swap pages out; under severe memory pressure, processes release memory synchronously. As noted above, each zone has three watermarks — pages_low, pages_min, and pages_high — used to track the zone's memory pressure. pages_min is computed at memory-initialization time by free_area_init_core(), in proportion to the zone's page-frame count, with a minimum of 20 pages and a maximum of typically 255. When free pages reach pages_min, the allocator performs kswapd's work synchronously; when the free-page count falls to pages_low, the buddy allocator wakes kswapd to start freeing pages; once woken, kswapd does not consider the zone balanced until pages_high free pages are available. pages_high normally defaults to three times pages_min.

Linux's hierarchical memory-management structure maps ACPI's SRAT and SLIT information onto Nodes and Zones, overcoming the flat structure of traditional Linux, which could not reflect NUMA. When a task requests memory, Linux applies a local-node-first policy: it looks for free pages in the task's own node first; failing that, in neighbouring nodes; and failing that, in remote nodes — optimizing memory-access performance at the operating-system level. (Source: https://zhuanlan.zhihu.com/p/68465952)

Although zones manage physical memory, there is no physical separation between zones; the division is purely a logical convenience for Linux's bookkeeping. The modern struct zone looks like this (field order rearranged for exposition):

struct zone {
    spinlock_t lock;
    unsigned long spanned_pages;
    unsigned long present_pages;
    unsigned long nr_reserved_highatomic;
    atomic_long_t managed_pages;
    struct free_area free_area[MAX_ORDER];
    unsigned long _watermark[NR_WMARK];
    long lowmem_reserve[MAX_NR_ZONES];
    atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
    unsigned long zone_start_pfn;
    struct pglist_data *zone_pgdat;
    struct page *zone_mem_map;
    ...
};

lock is a spinlock guarding concurrent access to struct zone — it protects only the structure itself, not all the pages in the zone. spanned_pages is the total number of page frames the zone spans; on some architectures (such as Sparc) a zone may contain "holes" with no physical pages, and spanned_pages minus the absent pages in those holes gives present_pages.

nr_reserved_highatomic is memory reserved for certain special scenarios, and managed_pages is the number of page frames managed by the buddy allocator — essentially present_pages minus reserved pages.

free_area is the set of free lists recording how many page frames in the zone remain available for allocation. _watermark holds the min (minimum), low, and high marks that serve as triggers for memory reclaim.

lowmem_reserve is memory held back for the benefit of higher zones. vm_stat holds the zone's memory-usage statistics and is the data source for /proc/zoneinfo.

zone_start_pfn is the zone's starting physical frame number; zone_start_pfn + spanned_pages gives its ending frame number. zone_pgdat points to the node this zone belongs to, and zone_mem_map points into the mem_map array of struct page.

Because the kernel accesses zones very frequently, struct zone also contains padding that aligns its members to cache lines for better use of the hardware cache — a sharp contrast with the byte-pinching design of struct page. There are only a few kinds of zones, and a system never has many of them, so a somewhat larger structure costs nothing. (Source: https://zhuanlan.zhihu.com/p/68465952)

Node Distance

(Source: https://www.cnblogs.com/zhenjing/archive/2012/03/21/linux_numa.html)

The earlier example used 2 nodes; with more than 2 nodes, the layout must take the distances between nodes into account. For example, with 4 nodes and 2 zones (such as four cascaded XLP832 physical CPUs), the layout is as shown: [figure] the Node Distance between Node0 and Node2 is 25, between Node1 and Node3 is 25, and 15 for every other pair.

3. The NUMA scheduler

On a NUMA system, local memory has lower access latency than remote memory, so scheduling a process onto a processor near its memory can greatly improve application performance. The Linux 2.4 scheduler had only one run queue and scaled poorly, never performing well on SMP: with many runnable tasks, the CPUs competed for shared resources and throughput suffered. During 2.5 development, Ingo Molnar wrote a multi-queue scheduler known as O(1) — one run queue per processor — which was merged into the 2.5 kernel from 2.5.2 onward. But the O(1) scheduler has no good awareness of the node level of a NUMA system and so cannot guarantee that a process still runs on the same node after rescheduling. To fix this, Erich Focht developed a node-affine NUMA scheduler built on top of Ingo Molnar's O(1) scheduler and back-ported it to 2.4.x kernels; it was originally written for IA64-based NUMA machines on 2.4 kernels, and Matt Dobson later ported it to x86-based NUMA-Q hardware.

3.1 Initial load balancing

Every task is assigned a HOME node at creation — the node from which the task gets its initial memory allocation — chosen as the least-loaded node in the whole system at that moment. Because Linux (at the time) did not support migrating a task's memory from one node to another, the HOME node stays fixed for the task's lifetime. The initial balancing (i.e., choosing the task's HOME node) is done by default in the exec() system call, or optionally in fork(); the node_policy field in the task structure selects the mechanism:

node_policy 0 (default) — balanced in do_execve(): the task is created by fork(), but exec() need not run it on the same node.
node_policy 1 — balanced in do_fork(): if the child gets a new mm structure, a new HOME node is chosen.
node_policy 2 — balanced in do_fork(): a new HOME node is always chosen.

3.2 Dynamic load balancing

Within a node, this NUMA scheduler behaves just like the O(1) scheduler. Dynamic load balancing on an idle processor is triggered by the 1 ms scheduler tick: it tries to find a heavily loaded processor and migrate tasks from it to the idle one. On a heavily loaded node, balancing runs only every 200 ms. The scheduler searches only the processors within its own node, and only tasks not currently running may be moved out of the cache pool to other idle processors.

If the node itself is already well balanced, the loads of the other nodes are examined: a node whose load exceeds the local node's by more than 25% is selected for balancing. If the local node's load is about average, task migration from it is delayed; if its load is very poor, the delay is very short — the length of the delay depends on the system's topology.

4. CpuMemSets

SGI's Origin 3000 ccNUMA systems are widely used in many fields and have been very successful. To tune the Origin 3000's performance, SGI's IRIX operating system implemented CpuMemSets on it, binding applications to CPUs and memory to exploit the local-access advantage of NUMA systems. Linux also implemented CpuMemSets in its NUMA project, and they saw real use on SGI's Altix 3000 servers.

CpuMemSets give Linux a mechanism by which system services and applications can be scheduled on specified CPUs and allocate memory on specified nodes. CpuMemSets add two layers, cpumemmap and cpumemset, on top of the existing Linux scheduling and resource-allocation code. The lower cpumemmap layer provides a simple mapping pair: it maps the system's CPU numbers to the application's CPU numbers, and the system's memory-block numbers to the application's memory-block numbers. The upper cpumemset layer specifies on which application CPUs a process may schedule its tasks, and from which application memory blocks the kernel or a virtual memory area may allocate.

4.1 cpumemmap

The kernel's task scheduler and memory allocator use system numbers; every CPU and memory block in the system has one. Application programs use application CPU and memory-block numbers, which specify CPU and memory affinity in the cpumemmap. Every process, every virtual memory area, and the Linux kernel has a cpumemmap; these mappings are inherited across fork(), exec(), and virtual-memory-area creation, and a process with root privileges can extend a cpumemmap to include additional system CPUs and memory blocks. Modifying a map causes the kernel scheduler to begin using the new system CPUs and the allocator to allocate pages from the new memory blocks, but memory already allocated on the old blocks is not migrated. A cpumemmap may not contain holes: if the map's size is n, it must map application numbers 0 through n-1. The mapping between system and application numbers is not one-to-one: several application numbers may map to the same system number.

4.2 cpumemset

At boot, the Linux kernel creates a default cpumemmap and cpumemset containing all of the system's current CPUs and memory blocks.

The kernel schedules a task only on the CPUs in that task's cpumemset, and selects memory for user virtual-memory areas only from that set's memory list; kernel allocations come only from the cpumemset memory list attached to the CPU executing the allocation request.

A newly created virtual memory area takes its setting from the creating task's current cpumemset. Attachment to an existing virtual memory area is more complicated: memory-mapped objects and Unix System V shared memory segments can be attached to multiple processes, or attached several times at different places in the same process. When attaching to an existing area, the new virtual memory area by default inherits the cpumemset of the currently attaching process; if the CMS_SHARE flag is set, the new virtual memory area is instead linked to the same cpumemset.

When allocating a page, if the CPU the task is running on has a corresponding memory region in its cpumemset, the kernel selects from that CPU's memory list; otherwise it selects from the default CPU's cpumemset memory list.

4.3 Hard partitioning and CpuMemSets

On a large NUMA system, users often want to dedicate a portion of the CPUs and memory to particular applications. There are two main approaches: hard partitioning and soft partitioning; CpuMemSets is a soft-partitioning technique. Hard-partitioning a big NUMA system conflicts with the single-system-image advantage such systems offer, whereas CpuMemSets allow far more flexible control: they can overlap and subdivide the system's CPUs and memory, allow multiple processes to view the system as a single image, require no reboot, and guarantee that certain CPU and memory resources go to specified applications at different times.

SGI's CpuMemSets soft partitioning effectively addresses the shortcomings of hard partitioning. A single-system SGI ProPack Linux server can be divided into multiple distinct systems, each with its own console, root filesystem, and IP network address. Each software-defined group of CPUs acts as a partition that can be rebooted, have software installed, be shut down, and be upgraded independently. Partitions communicate over the SGI NUMAlink interconnect, and global shared memory across partitions is supported by the XPC and XPMEM kernel modules, which let a process in one partition access another partition's physical memory.

5. Tests

To properly validate the performance and efficiency of Linux NUMA systems, we tested the effect of the NUMA architecture on SGI Altix 350 performance at SGI's Shanghai office.

The system's configuration: CPUs: 8 × 1.5 GHz Itanium 2; memory: 8 GB; interconnect: as shown in Figure 3.

[Figure 3: ring topology of the SGI Altix 350's four compute modules]

Test cases:

1. The Presta MPI benchmark suite (from the ASCI Purple benchmarks)

From the interconnect topology, memory accesses within a compute module need no interconnect hop and have the lowest latency; the remaining accesses need one or two interconnect hops to reach another compute module. Using the Presta MPI suite, we focused on the impact of each interconnect hop, with these results:

Minimum latency: 1.6 us; one-hop latency: 1.8 us; two-hop latency: 2.0 us.

2. NASA's NPB benchmarks

Wall-clock time (seconds) and speedup by process count (1, 2, 4, 8):

IS wall clock: 10.25 | 5.38 | 3.00 | 1.66; speedup: 1 | 1.9 | 3.4 | 6.17
EP wall clock: 144.26 | 72.13 | 36.12 | 18.09; speedup: 1 | 2 | 3.9 | 7.97
FT wall clock: 138.29 | 90.39 | 47.46 | 22.21; speedup: 1 | 1.52 | 2.91 | 6.25
CG wall clock: 131.65 | 67.34 | 36.79 | 21.58; speedup: 1 | 1.9 | 3.6 | 6.1
LU wall clock: 584.14 | 368.92 | 144.73 | 66.38; speedup: 1 | 1.6 | 4.0 | 8.7
SP wall clock: 627.73 | — | 248.22 | —; speedup: 1 | — | 2.5 | —
BT wall clock: 1713.89 | — | 521.63 | —; speedup: 1 | — | 3.2 | —

These tests show that the SGI Altix 350 delivers strong memory-access and compute performance, and that Linux NUMA technology has reached practical maturity.

The Linux NUMA testing in this article received generous support from lecturer Zhou Enqiang of the NUDT School of Computer Science and engineer Sun Xiao of SGI's Shanghai office, for which we express our sincere thanks.

Related reading

[1] Matthew Dobson, Patricia Gaughen, Michael Hohnbaum, Erich Focht, "Linux Support for NUMA Hardware", Linux Symposium 2003
[2] Kazuto Miyoshi, Jun'ichi Nomura, Hiroshi Aono, Erich Focht, Takayoshi Kochi, "IPF Linux Feature Enhancements for TX7", NEC Res. & Develop. Vol. 44 No. 1, 2003
[3] Mel Gorman, "Understanding The Linux Virtual Memory Manager", 2003
[4] Erich Focht, "Node affine NUMA scheduler", 2002
[5] Paul Larson, "Kernel comparison: improved memory management in the 2.6 kernel", developerWorks China, 2004
[6] Steve Neuner, "Scaling Linux to New Heights: the SGI Altix 3000 System", Linux Journal, Feb 2003

 


NUMA — Baidu Baike

NUMA (Non-Uniform Memory Access) technology lets large numbers of servers operate like a single system while retaining the advantages of small systems: ease of programming and management. Driven by the demands that e-commerce applications place on memory access, NUMA also poses challenges for complex structural design. Chinese name: NUMA; English name: Non-Uniform Memory Access; category: a memory design for multiprocessor computers.

Origin

Non-uniform memory access (NUMA) is a memory design for multiprocessor computers in which memory access time depends on the memory's location relative to the processor: under NUMA, a processor accesses its own local memory somewhat faster than non-local memory (memory local to another processor, or shared between processors). NUMA logically follows on from the symmetric multiprocessing (SMP) architecture. It was developed in the 1990s by companies including Burroughs (Unisys), Convex Computer (HP), Honeywell Information Systems Italy (HISI, later Groupe Bull), Silicon Graphics (later SGI), Sequent Computer Systems (later IBM), Data General (EMC), and Digital (later Compaq, then HP). The techniques these companies developed later flourished in Unix-like operating systems and were applied to some extent in Windows NT. The first commercial Unix system based on NUMA was the XPS-100 family of symmetric multiprocessing servers, designed by Dan Gielan of VAST Corporation for HISI; the architecture's great success made HISI a top Unix vendor in Europe.

Basic concepts

A modern computer's CPU is considerably faster than its main memory. In early computing and data processing, the CPU was usually slower than its memory, but with the arrival of supercomputers, processor and memory performance reached balance in the 1960s. Since then, CPUs have frequently been starved for data and forced to wait for it to arrive from memory. To solve this, many supercomputer designs of the 1980s and 1990s focused on providing fast memory access, allowing those machines to work at high speed on large data sets that other systems could not handle. Limiting the number of memory accesses became the key to high performance in modern computers; for commodity processors, this meant an ever-growing hierarchy of caches and increasingly sophisticated algorithms to avoid cache misses. But the dramatic growth of operating systems and applications overwhelmed those caching improvements. Multiprocessor systems without NUMA make the problem worse: because only one processor can access memory at a time, several processors in a system may be left waiting for memory. NUMA attempts to solve this by giving each processor separate memory, avoiding the performance loss when multiple processors hit the same memory. For applications dealing with partitioned data (common in server and server-like workloads), NUMA can improve performance over a single shared memory by roughly a factor of n, where n is approximately the number of processors (or separate memories). Of course, not all data is confined to one task, so more than one processor may need the same data; to handle this, NUMA systems include extra software or hardware to move data between memories. This slows down the processors attached to those memories, so the overall speedup depends heavily on the nature of the running workloads.

Introduction

Today's dominant applications and models fall roughly into three classes: online transaction processing (OLTP), decision support systems (DSS), and business communications. Designers of such systems can choose among small standalone servers, SMP (symmetric multiprocessing), MPP (massively parallel processing), and NUMA as the architecture of their computing platform. To fully appreciate NUMA's advantages, consider how these modes differ in processor-memory organization. SMP connects multiple processors to one centralized memory. Under SMP, all processors can access the same physical system memory, which means an SMP system runs only one copy of the operating system; SMP systems are therefore sometimes called uniform memory access (UMA) architectures — uniform in the sense that at any moment the processors can hold or share only one value for each datum in memory. SMP's obvious drawback is limited scalability: once the memory interface saturates, adding processors yields no more performance.

MPP is a distributed-memory mode that can bring more processors into one system's memory. A distributed-memory design has multiple nodes, each with its own memory, and each node may or may not itself be an SMP; the individual nodes are interconnected to form the overall system. MPP is attractive to hardware vendors because its problems are easier and cheaper to solve: with no hardware support needed for shared memory or cache coherence, connecting large numbers of processors is comparatively easy. The key difference between a single SMP and MPP: in SMP, data coherence is managed entirely by hardware — easier to use but costly; in MPP, coherence between nodes is managed by software — relatively slower, but much cheaper.

NUMA, first proposed in a research project at an American university, also uses distributed memory, but with the difference that every processor in every node can access all of the system's physical memory. However, the time each processor needs to access memory within its own node can be far less than the time it needs for memory in some remote nodes. In other words, memory access time is non-uniform — which is why the mode is called "NUMA". In short, NUMA keeps SMP's single operating-system copy, simple application programming model, and ease of management while inheriting MPP's expandability, allowing the system's scale to grow effectively. That is precisely NUMA's advantage.

Cache coherence

Nearly all CPU architectures use a small amount of very fast, non-shared memory — cache — and exploit locality of reference in memory accesses. On NUMA systems, keeping caches coherent across shared memory carries significant overhead. Although simpler to design and build, non-cache-coherent NUMA systems are forbiddingly complex to program under the standard von Neumann programming model. Typically, ccNUMA uses inter-processor communication among the cache controllers to maintain a consistent memory image when more than one cache stores the same memory location. For this reason, ccNUMA may perform poorly when several processors attempt to access the same memory area in rapid succession. NUMA-aware operating systems try to reduce the frequency of this kind of access by allocating processors and memory in NUMA-friendly ways and by avoiding scheduling and locking algorithms that make NUMA-unfriendly accesses inevitable. In addition, cache-coherence protocols such as MESIF attempt to reduce the communication required to maintain coherence. The Scalable Coherent Interface (SCI) is an IEEE-standard, directory-based cache-coherence protocol designed to avoid the scalability limits found in earlier multiprocessor systems; SCI is the base technology of Numascale's NumaConnect. As of 2011, ccNUMA systems were multiprocessor systems based on the AMD Opteron (which can implement ccNUMA without external logic) or the Intel Itanium (which requires chipset support). Examples of ccNUMA-capable chipsets are SGI's SHUB (Super Hub), the Intel E8870, HP's sx2000 (used in Integrity and Superdome servers), and those found in NEC Itanium-based systems. Early ccNUMA systems, such as those from Silicon Graphics, were based on MIPS processors and the DEC Alpha 21364 (EV7).

History

Sequent was the world's acknowledged leader in NUMA technology. As early as 1986, Sequent pioneered the use of microprocessors as building blocks for large systems, developing a Unix-based SMP architecture and leading the industry into SMP. In September 1999, IBM acquired Sequent, integrating NUMA technology into IBM's Unix camp and launching the NUMA-Q systems and solutions that could support and scale the Intel platform, offering large enterprise customers in the fast-growing e-commerce market a more diverse, highly scalable, and manageable choice; IBM became a leading developer and innovator of NUMA technology. IBM later introduced a multi-tier system called NUMACenter, integrating the strengths of Unix and Windows NT: it provided a pre-integrated environment for Windows NT applications, allowing customers to freely run NT applications against a highly scalable, highly available Unix data tier, effectively achieving Unix/Windows NT interoperation.

The basic building block of the NUMA-Q architecture is Intel's four-processor "Quad" design; NUMA-Q's two key technologies are the Quad design and the IQ-Link interconnect. A NUMA-Q Quad consists of four processors, some memory, and seven PCI slots on a PCI channel. NUMA-Q uses Quads to implement a CC-NUMA structure, scaling massively while preserving the SMP programming model, and offers a fault-tolerant Fibre Channel I/O subsystem, so SMP applications can run on it; NUMA-Q supports up to 256 processors in a single node. The IQ-Link interconnect joins the NUMA-Q Quad buses; its coherence is implemented strictly in hardware and needs no software maintenance. IQ-Link allows multiple low-latency buses, combining low latency with high throughput, and provides strong system scalability and overall performance.

The strengths of this architecture are clear. First, NUMA's breakthrough frees multiprocessing from the constraints of a traditional giant bus, greatly increasing the processors, memory, and I/O slots a single operating system can manage. Second, NUMA is designed so that processors access memory within the same unit quickly — NUMA-Q processors access same-Quad memory about twice as fast as in the typical SMP mode — and the NUMA-Q operating system makes full use of processor caches, achieving very high hit rates. SMP is simpler than NUMA, but all its processors see memory uniformly and slowly; the SMP bus has a physical limit, so system scalability steadily declines; and in large SMP-based systems it becomes harder to grow processors, I/O, and memory in balance. In addition, NUMA's hardware memory interconnect enables a new kind of dynamic partitioning: administrators can run multiple operating systems (such as Unix and Windows NT) in one machine and, according to user workloads, easily manage and shift CPU and memory resources between the operating environments, achieving the best performance and highest resource utilization.

NUMA-Q became a pillar product of IBM's Internet server division, strengthening IBM's server competitiveness in e-commerce. Clearly, NUMA-Q's target market is the "mission-critical" commercial data center. Such data centers' computer systems share common traits: high availability, high reliability, and the high scalability needed to absorb ever-growing performance demands. The NUMA-Q architecture helps designers of OLTP, decision support, and enterprise communications systems build such large-scale mission-critical solutions; NUMA-Q is thus broadly suited to environments with heavy I/O computation, business intelligence, customer relationship management, and enterprise resource planning. It gives enterprises the flexibility to build multiple architectures from one set of parts, along with highly available and manageable tool sets suited to many solutions; it supports more users and larger throughput, reduces customer downtime, improves I/O capability, enables larger online storage and backup, and scales strongly, protecting the customer's investment to the greatest extent. Many internationally known enterprises — including the US Nasdaq automated quotation system, Boeing, and Ford — chose servers with the IBM NUMA-Q architecture, and eToys, then the world's largest Internet retailer of children's products, ran its e-commerce successfully on NUMA-Q. In China, many large organizations — including Bank of China, China Construction Bank, the Postal Savings administration, Beijing Xidan Shopping Mall, and the General Office of the State Council — built their system environments on IBM NUMA-Q. Many server vendors — Sun, HP, Compaq, Unisys, SGI, Data General, and others — also moved their hardware structures toward NUMA, with many planning or developing NUMA-based computer systems, and IBM planned a more competitive fourth-generation NUMA-Q architecture to meet NUMA's challenges in complex design and multi-channel I/O.

Cache Coherent Non-Uniform Memory Access (CC-NUMA) is one type of NUMA. In a CC-NUMA system, the distributed memories are connected to form a single memory: there is no page copying or data copying between memories and no software message passing. CC-NUMA has a single memory image; the storage components are physically connected with copper cabling and certain intelligent hardware. "Cache coherent" means that no software is needed to keep multiple data copies consistent, nor to transfer data between the operating system and applications: as in SMP, a single operating system and multiple processors are managed entirely at the hardware level. The Cache-Only Memory Architecture (COMA) is CC-NUMA's competitor; the two share the same goals but differ in implementation: COMA nodes do not distribute memory components and do not keep the whole system coherent through the interconnect — COMA nodes have no memory, only a large-capacity cache configured in each Quad.

Cluster computing

NUMA can be viewed as a tightly coupled form of cluster computing. Adding virtual-memory paging to a cluster architecture even allows NUMA to be implemented entirely in software — though software-based NUMA's inter-node latency remains several orders of magnitude larger than hardware-based NUMA's.


What Is Non-Uniform Memory Access (NUMA)? — The Linux Kernel documentation


Started November 1999, by:

This question can be answered from a couple of perspectives: the hardware view and the Linux software view.

From the hardware perspective, a NUMA system is a computer platform comprising multiple components or assemblies, each of which may contain zero or more CPUs, local memory, and/or IO buses. For brevity, and to distinguish the hardware view of these physical components/assemblies from the software abstraction, we call these components/assemblies "cells" in this document.

Each cell can be viewed as an SMP [symmetric multiprocessor] subset of the system — although some components needed for a standalone SMP system may not be populated on any given cell. The cells of a NUMA system are connected together by some sort of system interconnect — for example, a crossbar or point-to-point link are common types of NUMA system interconnect. Both of these types of interconnect can be aggregated to create NUMA platforms in which cells are at multiple distances from other cells.

For Linux, the NUMA platforms of interest are primarily what is known as cache-coherent NUMA — ccNUMA systems for short. In ccNUMA systems, all memory is visible to, and accessible from, any CPU attached to any cell, and cache coherence is handled in hardware by the processor caches and/or the system interconnect.

Memory access time and effective memory bandwidth depend on how far the cell containing the CPU or IO bus making the memory access is from the cell containing the target memory. For example, accesses to memory by CPUs attached to the same cell will experience faster access times and higher bandwidth than accesses to memory on other, remote cells. NUMA platforms can have (other) cells at multiple remote distances from any given cell.

Platform vendors do not build NUMA systems just to make software developers' lives interesting; rather, this architecture is a means of providing scalable memory bandwidth. However, to achieve scalable memory bandwidth, system and application software must arrange for a large majority of memory references [cache misses] to go to "local" memory — memory on the same cell, if any — or to the nearest cell that has memory.

This naturally leads to the Linux software view of a NUMA system:

Linux divides the system's hardware resources into software abstractions called "nodes". Linux maps nodes onto the physical cells of the hardware platform, abstracting away some of the details of certain architectures. Like physical cells, software nodes may contain zero or more CPUs, memory, and/or IO buses. Again, memory accesses to "closer" nodes — nodes that map to closer cells — will generally experience faster access times and higher effective bandwidth than accesses to more distant cells.

For some architectures, such as x86, Linux will "hide" any node representing a physical cell that has no memory attached, and reassign any CPUs attached to that cell to a node representing a cell that does have memory. Thus, on these architectures, one cannot assume that all CPUs Linux associates with a given node will see the same local memory access times and bandwidth.

In addition, for some architectures — again, x86 is an example — Linux supports emulation of additional nodes. For NUMA emulation, Linux carves up the existing nodes, or the system memory of a non-NUMA platform, into multiple nodes; each emulated node manages a fraction of the underlying cells' physical memory. NUMA emulation is very useful for testing NUMA kernel and application features on non-NUMA platforms, and, when used together with cpusets, can serve as a memory resource management mechanism. [see CPUSETS]

For each node with memory, Linux constructs an independent memory management subsystem with its own free page lists, in-use page lists, usage statistics, and locks to mediate access. In addition, Linux constructs for each memory zone [one or more of DMA, DMA32, NORMAL, HIGH_MEMORY, MOVABLE] an ordered "zonelist". A zonelist specifies the zones/nodes to visit when a selected zone/node cannot satisfy the allocation request. The situation in which a zone has no available memory to satisfy a request is called "overflow" or "fallback".

Because some nodes contain multiple zones containing different types of memory, Linux must decide whether to order the zonelists so that allocations fall back to the same zone type on a different node, or to a different zone type on the same node. This is an important consideration because some zones, such as DMA or DMA32, represent relatively scarce resources. Linux chooses a default Node-ordered zonelist. This means it tries to fall back to other zones of the same node before using remote nodes, which are ordered by NUMA distance.

By default, Linux attempts to satisfy a memory allocation request from the node to which the CPU executing the request is assigned. Specifically, Linux attempts to allocate from the first node in the appropriate zonelist of the node where the request originates. This is called "local allocation". If the "local" node cannot satisfy the request, the kernel examines the zones of the other nodes in the selected zonelist, looking for the first zone in the list that can satisfy the request.

Local allocation tends to keep subsequent accesses to the allocated memory "local" to the underlying physical resources and off the system interconnect — as long as the task on whose behalf the kernel allocated the memory does not later migrate away from it. The Linux scheduler is aware of the platform's NUMA topology — embodied in the "scheduling domains" data structures [see Scheduler Domains] — and tries to minimize task migration to distant scheduling domains. However, the scheduler does not directly take a task's NUMA footprint into account. Thus, under sufficient imbalance, tasks can migrate between nodes, away from their initial node and kernel data structures.

System administrators and application designers can restrict a task's migration to improve NUMA locality using the various CPU-affinity command-line interfaces, such as taskset(1) and numactl(1), and programming interfaces such as sched_setaffinity(2). In addition, one can modify the kernel's default local allocation behavior using Linux NUMA memory policy. [see NUMA Memory Policy]

System administrators can use control groups and CPUsets to restrict the CPUs and the node memories that non-privileged users may specify in scheduling or NUMA commands and functions. [see CPUSETS]

On architectures that do not hide memoryless nodes, Linux includes in the zonelists only the zones [nodes] that have memory. This means that for a memoryless node, the "local memory node" — the node of the first zone in the CPU's node's zonelist — will not be the node itself. Instead, it will be the nearest node with memory that the kernel selected when it built the zonelists. So, by default, local allocation will be satisfied by the kernel supplying the nearest available memory. This is a consequence of the same mechanism that allows such allocations to fall back to other nearby nodes when a node that does contain memory overflows.

Some kernel allocations do not want, or cannot tolerate, this fallback behavior. Instead, they want to be sure they get memory from the specified node, or to be notified that the node has no free memory. This is usually the case, for example, when a subsystem allocates per-CPU memory resources.

A typical allocation pattern is to obtain the node ID of the node the "current CPU" is attached to, using the kernel's numa_node_id() or CPU_to_node() functions, and then request memory only from the returned node ID. When such an allocation fails, the requesting subsystem may fall back to its own fallback path. The slab kernel memory allocator is one example of this. Alternatively, the subsystem may choose to disable itself, or not enable itself, on allocation failure; the kernel profiling subsystem is an example of that.

If the architecture supports — does not hide — memoryless nodes, then CPUs attached to memoryless nodes will always incur the fallback-path overhead, or some subsystems will fail to initialize if they attempt to allocate memory exclusively from a node without memory. To support such architectures transparently, kernel subsystems can use the numa_mem_id() or cpu_to_mem() functions to locate the "local memory node" of the calling or specified CPU. Again, this is the same node from which default local page allocation will be attempted.
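To make the taskset(1)/numactl(1)/sched_setaffinity(2) pattern above concrete, here is a hedged userspace C sketch (assuming libnuma's numaif.h for set_mempolicy; the CPU range 0-3 is an assumed example of node 0's CPUs, not something stated in the document). Link with -lnuma.

#define _GNU_SOURCE
#include <numaif.h>
#include <sched.h>
#include <stdio.h>

int main(void) {
    /* ~ taskset: run only on CPUs 0-3 (assumed here to belong to node 0) */
    cpu_set_t cpus;
    CPU_ZERO(&cpus);
    for (int c = 0; c < 4; c++)
        CPU_SET(c, &cpus);
    if (sched_setaffinity(0, sizeof(cpus), &cpus))
        perror("sched_setaffinity");

    /* ~ numactl --membind=0: future allocations must come from node 0 */
    unsigned long nodemask = 1UL << 0;
    if (set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask) * 8))
        perror("set_mempolicy");
    return 0;
}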



NUMA (Non-Uniform Memory Access): An Overview — ACM Queue

August 9, 2013, Volume 11, issue 7

NUMA becomes more common because memory controllers get close to execution units on microprocessors.

Christoph Lameter, Ph.D.

NUMA (non-uniform memory access) is the phenomenon that memory at various points in the address space of a processor have different performance characteristics. At current processor speeds, the signal path length from the processor to memory plays a significant role. Increased signal path length not only increases latency to memory but also quickly becomes a throughput bottleneck if the signal path is shared by multiple processors. The performance differences to memory were noticeable first on large-scale systems where data paths were spanning motherboards or chassis. These systems required modified operating-system kernels with NUMA support that explicitly understood the topological properties of the system's memory (such as the chassis in which a region of memory was located) in order to avoid excessively long signal path lengths. (Altix and UV, SGI's large address space systems, are examples. The designers of these products had to modify the Linux kernel to support NUMA; in these machines, processors in multiple chassis are linked via a proprietary interconnect called NUMALINK.)

Today, processors are so fast that they usually require memory to be directly attached to the socket that they are on. A memory access from one socket to memory from another has additional latency overhead to accessing local memory—it requires the traversal of the memory interconnect first. On the other hand, accesses from a single processor to local memory not only have lower latency compared to remote memory accesses but do not cause contention on the interconnect and the remote memory controllers. It is good to avoid remote memory accesses. Proper placement of data will increase the overall bandwidth and improve the latency to memory.

As the trend toward improving system performance by bringing memory nearer to processor cores continues, NUMA will play an increasingly important role in system performance. Modern processors have multiple memory ports, and the latency of access to memory varies depending even on the position of the core on the die relative to the controller. Future generations of processors will have increasing differences in performance as more cores on chip necessitate more sophisticated caching. As the access properties of these different kinds of memory continue to diverge, operating systems may need new functionality to provide good performance.

NUMA systems today (2013) are mostly encountered on multisocket systems. A typical high-end business-class server today comes with two sockets and will therefore have two NUMA nodes. Latency for a memory access (random access) is about 100 ns. Access to memory on a remote node adds another 50 percent to that number.

Performance-sensitive applications can require complex logic to handle memory with diverging performance characteristics. If a developer needs explicit control of the placement of memory for performance reasons, some operating systems provide APIs for this (for example, Linux, Solaris, and Microsoft Windows provide system calls for NUMA). However, various heuristics have been developed in the operating systems that manage memory access to allow applications to transparently utilize the NUMA characteristics of the underlying hardware.

A NUMA system classifies memory into NUMA nodes (which Solaris calls locality groups). All memory available in one node has the same access characteristics for a particular processor. Nodes have an affinity to processors and to devices. These are the devices that can use memory on a NUMA node with the best performance since they are locally attached. Memory is called node local if it was allocated from the NUMA node that is best for the processor. For example, the NUMA system exhibited in Figure 1 has one node belonging to each socket, with four cores each.

The process of assigning memory from the NUMA nodes available in the system is called NUMA placement. As placement influences only performance and not the correctness of the code, heuristic approaches can yield acceptable performance. In the special case of noncache-coherent NUMA systems, this may not be true since writes may not arrive in the proper sequence in memory. However, there are multiple challenges in coding for noncache-coherent NUMA systems. We restrict ourselves here to the common cache-coherent NUMA systems.

The focus in these discussions will be mostly on Linux since this operating system has refined NUMA facilities and is widely used in performance-critical environments today. The author was involved with the creation of the NUMA facilities in Linux and is most familiar with those.

Solaris has somewhat comparable features (see http://docs.oracle.com/cd/E19963-01/html/820-1691/gevog.html; http://docs.oracle.com/cd/E19082-01/819-2239/6n4hsf6rf/index.html; and http://docs.oracle.com/cd/E19082-01/819-2239/madv.so.1-1/index.html), but the number of systems deployed is orders of magnitude less. Work is under way to add support to other Unix-like operating systems, but that support so far has been mostly confined to operating-system tuning parameters for placing memory accesses. Microsoft Windows also has a developed NUMA subsystem that allows placing memory structures effectively, but the software is used mostly for enterprise applications A System with Two NUMA Nodes and Eight Processors NUMA node 0 NUMA node 1 core core core core interconnect core core core core rather than high-performance computing. Memory-access speed requirements for enterprise-class applications are frequently more relaxed than in high-performance computing, so less effort is spent on NUMA memory handling in Windows than in Linux.

How Operating Systems Handle NUMA Memory

There are several broad categories in which modern production operating systems allow for the management of NUMA: accepting the performance mismatch, hardware memory striping, heuristic memory placement, static NUMA configurations, and application-controlled NUMA placement.

Ignore The Difference

Since NUMA placement is a best-effort approach, one option is simply to ignore the possible performance benefit and just treat all memory as if no performance differences exist. This means that the operating system is not aware of memory nodes. The system is functional, but performance varies depending on how memory happens to be allocated. The smaller the differences between local and remote accesses, the more viable this option becomes.

This approach allows software and the operating system to run unmodified. Frequently, this is the initial approach for system software when systems with NUMA characteristics are first used. The performance will not be optimal and will likely be different each time the machine and/or application runs, because the allocation of memory to performance-critical segments varies depending on the system configuration and timing effects on boot-up.

Memory Striping In Hardware

Some machines can set up the mapping from memory addresses to the cache lines in the nodes in such a way that consecutive cache lines in an address space are taken from different memory controllers (interleaving at the cache-line level). As a result, the NUMA effects are averaged out (since structures larger than a cache line will then use cache lines on multiple NUMA nodes). Overall system performance is more deterministic compared with the approach of just ignoring the difference, and the operating system still does not need to know about the difference in memory performance, meaning no NUMA support is needed in the operating system. The danger of overloading a node is reduced since the accesses are spread out among all available NUMA nodes.

The drawback is that the interconnect is in constant use. Performance will never be optimal since the striping means that cache lines are frequently accessed from remote NUMA nodes.

Heuristic Memory Placement For Applications

If the operating system is NUMA-aware (under Linux, NUMA must be enabled at compile time and the BIOS or firmware must provide NUMA memory information for the NUMA capabilities to become active; NUMA can be disabled and controlled at runtime with a kernel parameter), it is useful to have measures that allow applications to allocate memory in ways that minimize signal path length so performance is increased. The operating system has to adopt a policy that maximizes performance for as many applications as possible. Most applications run with improved performance using the heuristic approach, especially compared with the approaches discussed earlier. A NUMA-aware operating system determines memory characteristics from the firmware and can therefore tune its own internal operations to the memory configuration. Such tuning requires coding effort, however, so only performance-critical portions of the operating system tend to get optimized for NUMA affinities, whereas less-performance-critical components tend to operate on the assumption that all memory is equal.

The most common assumptions made by the operating system are that the application will run on the local node and that memory from the local node is to be preferred. If possible, all memory requested by a process will be allocated from the local node, thereby avoiding the use of the cross-connect. The approach does not work, though, if the number of required processors is higher than the number of hardware contexts available on a socket (when processors on both NUMA nodes must be used); if the application uses more memory than available on a node; or if the application programmer or the scheduler decides to move application threads to processors on a different socket after memory allocation has occurred.

In general, small Unix tools and small applications work very well with this approach. Large applications that make use of a significant percentage of total system memory and of a majority of the processors on the system will often benefit from explicit tuning or software modifications that take advantage of NUMA.

Most Unix-style operating systems support this mode of operation. Notably, FreeBSD and Solaris have optimizations to place memory structures to avoid bottlenecks. FreeBSD can place memory round-robin on multiple nodes so that the latencies average out. This allows FreeBSD to work better on systems that cannot do cache-line interleaving on the BIOS or hardware level (additional NUMA support is planned for FreeBSD 10). Solaris also replicates important kernel data structures per locality group.

Special NUMA Configuration For Applications

The operating system provides configuration options that allow the operator to tell the operating system that an application should not be run with the default assumptions regarding memory placement. It is possible to establish memory-allocation policies for an application without modifying code.

Command-line tools exist under Linux that can set up policies to determine memory affinities (taskset, numactl). Solaris has tunable parameters for how the operating system allocates memory from locality groups. These are roughly comparable to Linux's process memory-allocation policies.

Application Control Of NUMA Allocations

The application may want fine-grained control of how the operating system handles allocation for each of its memory segments. For that purpose, system calls exist that allow the application to specify which memory region should use which policies for memory allocations.

The main performance issues typically involve large structures that are accessed frequently by the threads of the application from all memory nodes and that often contain information that needs to be shared among all threads. These are best placed using interleaving so that the objects are distributed over all available nodes.
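As a concrete illustration of such per-region control, here is a hedged C sketch using the mbind(2) system call (declared in libnuma's numaif.h, which is assumed here; link with -lnuma) to interleave a single mapping across nodes 0 and 1 while the rest of the process keeps its default policy:

#include <numaif.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 256 * 1024 * 1024;   /* a large shared table */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    unsigned long nodemask = (1UL << 0) | (1UL << 1);  /* nodes 0 and 1 */
    if (mbind(buf, len, MPOL_INTERLEAVE, &nodemask,
              sizeof(nodemask) * 8, 0))
        perror("mbind");
    /* pages are placed round-robin across the mask as they are faulted in */
    return 0;
}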

How Does Linux Handle NUMA?

Linux manages memory in zones. In a non-NUMA Linux system, zones are used to describe memory ranges required to support devices that are not able to perform DMA (direct memory access) to all memory locations. Zones are also used to mark memory for other special needs such as movable memory or memory that requires explicit mappings for access by the kernel (HIGHMEM), but that is not relevant to the discussion here. When NUMA is enabled, more memory zones are created and they are also associated with NUMA nodes. A NUMA node can have multiple zones since it may be able to serve multiple DMA areas. How Linux has arranged memory can be determined by looking at /proc/zoneinfo. The NUMA node association of the zones allows the kernel to make decisions involving the memory latency relative to cores.

On boot-up, Linux will detect the organization of memory via the ACPI (Advanced Configuration and Power Interface) tables provided by the firmware and then create zones that map to the NUMA nodes and DMA areas as needed. Memory allocation then occurs from the zones. Should memory in one zone become exhausted, then memory reclaim occurs: the system will scan through the least recently used pages trying to free a certain number of pages. Counters that show the current status of memory in various nodes/zones can also be seen in /proc/zoneinfo. Figure 2 shows types of memory in a zone/node.

[Figure 2: Types of memory in a zone/node — free memory; unmapped page cache (e.g., cached disk contents); pages mapped to processes (e.g., text segments, mmapped files); anonymous pages (e.g., stack, heap); dirty or writeback pages (e.g., disk I/O); unevictable pages (e.g., mlock); kernel, driver, and unreclaimable slab memory]

Memory Policies

How memory is allocated under NUMA is determined by a memory policy. Policies can be specified for memory ranges in a process's address space, or for a process or the system as a whole. Policies for a process override the system policy, and policies for a specific memory range override a process's policy.

The most important memory policies are:

NODE LOCAL. The allocation occurs from the memory node local to where the code is currently executing.

INTERLEAVE. Allocation occurs round-robin. First a page will be allocated from node 0, then from node 1, then again from node 0, etc. Interleaving is used to distribute memory accesses for structures that may be accessed from multiple processors in the system in order to have an even load on the interconnect and the memory of each node.

There are other memory policies that are used in special situations, which are not mentioned here for brevity's sake. The two policies just mentioned are generally the most useful and the operating system uses them by default. NODE LOCAL is the default allocation policy if the system is up and running.

The Linux kernel will use the INTERLEAVE policy by default on boot-up. Kernel structures created during bootstrap are distributed over all the available nodes in order to avoid putting excessive load on a single memory node when processes require access to the operating-system structures. The system default policy is changed to NODE LOCAL when the first userspace process (init daemon) is started.
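By way of illustration, a process can also replace the policy it inherited. The sketch below is hypothetical (not from the article): it uses the set_mempolicy(2) wrapper from libnuma to switch the calling process to INTERLEAVE over nodes 0 and 1, so that memory faulted in afterwards alternates between the two nodes. The node set is an arbitrary assumption; build with gcc -lnuma.

/* Sketch: change the calling process's task policy to INTERLEAVE. */
#include <numaif.h>      /* set_mempolicy(), MPOL_INTERLEAVE; -lnuma */
#include <stdio.h>

int main(void)
{
    unsigned long nodemask = (1UL << 0) | (1UL << 1);  /* nodes 0 and 1 */

    if (set_mempolicy(MPOL_INTERLEAVE, &nodemask,
                      8 * sizeof(nodemask)) != 0) {
        perror("set_mempolicy");
        return 1;
    }
    /* Memory first touched from here on alternates between the nodes. */
    return 0;
}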

The active memory allocation policies for all memory segments of a process (and information that shows how much memory was actually allocated from which node) can be seen by determining the process id and then looking at the contents of /proc/<pid>/numa_maps.

Basic Operations On Process Startup

Processes inherit their memory policy from their parent. Most of the time the policy is left at the default, which means NODE LOCAL. When a process is started on a processor, memory is allocated for that process from the local NUMA node. All other allocations of the process (through growing the heap, page faults, mmap, and so on) will also be satisfied from the local NUMA node.

The Linux scheduler will attempt to keep the process cache hot during load balancing. This means the scheduler prefers to leave the process on processors that share the L1 cache with the processor the process last ran on, then on processors that share L2, and then on processors that share L3. If there is an imbalance beyond that, the scheduler will move the process to any other processor on the same NUMA node.

As a last resort the scheduler will move the process to another NUMA node. At that point the code will be executing on the processor of one node, while the memory allocated before the move has been allocated on the old node. Most memory accesses from the process will then be remote, which will cause the performance of the process to degrade.

There has been some recent work in making the scheduler NUMA-aware to ensure that the pages of a process can be moved back to the local node, but that work is available only in Linux 3.8 and later, and is not considered mature. Further information on the state of affairs may be found on the Linux kernel mailing lists and in articles on lwn.net.

Reclaim

Linux typically allocates all available memory in order to cache data that may be used again later. When memory begins to run low, reclaim is used to find pages that are either not in use or unlikely to be used soon. The effort required to evict a page from memory and to get the page back if needed varies by type of page. Linux prefers to evict pages that are backed by files on disk and not mapped into any process's address space, because it is easy to drop all references to such a page: it can simply be reread from disk if it is needed later. Pages that are mapped into a process's address space must first be removed from that address space before they can be reused. Pages that are not copies of disk contents (anonymous pages) can be evicted only if they are first written out to swap space (an expensive operation). There are also pages that cannot be evicted at all, such as memory locked with mlock() or pages in use for kernel data.

The impact of reclaim on the system can therefore vary. In a NUMA system multiple types of memory will be allocated on each node. The amount of free space on each node will vary. So if there is a request for memory and using memory on the local node would require reclaim but another node has enough memory to satisfy the request without reclaim, the kernel has two choices:

• Run a reclaim pass on the local node (causing kernel processing overhead) and then allocate node-local memory to the process.

• Just allocate from another node that does not need a reclaim pass. Memory will not be node local, but we avoid frequent reclaim passes. Reclaim will be performed when all zones are low on free memory. This approach reduces the frequency of reclaim and allows more of the reclaim work to be done in a single pass.

For small NUMA systems (such as the typical two-node servers) the kernel defaults to the second approach. For larger NUMA systems (four or more nodes) the kernel will perform a reclaim in order to get node-local memory whenever possible because the latencies have higher impacts on process performance.

There is a knob in the kernel, /proc/sys/vm/zone_reclaim_mode, that determines how this situation is to be treated. A value of 0 means that no local reclaim should take place. A value of 1 tells the kernel that a reclaim pass should be run in order to avoid allocations from the other node. On boot-up a mode is chosen based on the largest NUMA distance in the system.

If zone reclaim is switched on, the kernel still attempts to keep the reclaim pass as lightweight as possible. By default, reclaim will be restricted to unmapped page-cache pages. The frequency of reclaim passes can be further reduced by setting /proc/sys/vm/min_unmapped_ratio to the percentage of memory that must contain unmapped pages for the system to run a reclaim pass. The default is 1 percent.

Zone reclaim can be made more aggressive by enabling write-back of dirty pages or the swapping of anonymous pages, but in practice doing so has often resulted in significant performance issues.

Basic NUMA Command-Line Tools

The main tool used to set up the NUMA execution environment for a process is numactl. Numactl can be used to display the system NUMA configuration, and to control shared memory segments. It is possible to restrict processes to a set of processors, as well as to a set of memory nodes. Numactl can be used, for example, to avoid task migration between nodes or restrict the memory allocation to a certain node. Note that additional reclaim passes may be required if the allocation is restricted. Those cases are not influenced by zone-reclaim mode because the allocation is restricted by a memory policy to a specific set of nodes, so the kernel cannot simply pick memory from another NUMA node.

Another tool that is frequently used for NUMA is taskset. It allows only the binding of a task to processors and therefore provides just a subset of numactl's capabilities. Taskset is heavily used in non-NUMA environments, and its familiarity leads developers to prefer it over numactl on NUMA systems.

NUMA Information

There are numerous ways to view information about the NUMA characteristics of the system and of various processes currently running. The hardware NUMA configuration of a system can be viewed by using numactl --hardware. This includes a dump of the SLIT (system locality information table), which shows the cost of accesses to different nodes in a NUMA system. The example below shows a NUMA system with two nodes. The distance for a local access is 10; a remote access costs twice as much on this system (20). That is the convention, but some vendors (especially for two-node systems) simply report 10 and 20 without regard to the actual latency differences to memory.

$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 131026 MB
node 0 free: 588 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 131072 MB
node 1 free: 169 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

Numastat is another tool; it shows how many allocations were satisfied from the local node. Of particular interest is the numa_miss counter, which indicates that the system assigned memory from a different node in order to avoid reclaim. Such allocations also contribute to the other_node counter; the remainder of that count is intentional off-node allocations. The amount of off-node memory can be used as a guide to figure out how effectively memory was assigned to processes running on the system.

$ numastat
                node0        node1
numa_hit        13273229839  4595119371
numa_miss       2104327350   6833844068
numa_foreign    6833844068   2104327350
interleave_hit  52991        52864
local_node      13273229554  4595091108
other_node      2104327635   6833872331

How memory is allocated to a process can be seen via a status file in /proc/<pid>/numa_maps:

# cat /proc/1/numa_maps
7f830c175000 default anon=1 dirty=1 active=0 N1=1
7f830c177000 default file=/lib/x86_64-linux-gnu/ld-2.15.so anon=1 dirty=1 active=0 N1=1
7f830c178000 default file=/lib/x86_64-linux-gnu/ld-2.15.so anon=2 dirty=2 active=0 N1=2
7f830c17a000 default file=/sbin/init mapped=18 N1=18
7f830c39f000 default file=/sbin/init anon=2 dirty=2 active=0 N1=2
7f830c3a1000 default file=/sbin/init anon=1 dirty=1 active=0 N1=1
7f830dc56000 default heap anon=223 dirty=223 active=0 N0=52 N1=171
7fffb6395000 default stack anon=5 dirty=5 active=1 N1=5

Each line of the output shows the starting virtual address of a memory range, the policy in effect for it, and some information about the NUMA characteristics of the range. Anon means that the pages have no associated file on disk, and N<node> shows the number of pages of the range on the respective node.

The information about how memory is used in the system as a whole is available in /proc/meminfo. The same information is also available for each NUMA node in /sys/devices/system/node/node<id>/meminfo. Numerous other bits of information are available from the directory where meminfo is located: it is possible to compact memory, get distance tables, and manage huge pages and mlocked pages by inspecting and writing values to key files in that directory.

First-Touch Policy

Specifying memory policies for a process or address range does not by itself cause any allocation of memory, which is often confusing to newcomers. Memory policies specify what should happen when the system needs to allocate memory for a virtual address. Pages in a process's memory space that have not been touched or that are zero do not have memory assigned to them. The processor generates a hardware fault (a page fault) when a process touches or writes to an address that is not yet populated. During page-fault handling, the kernel allocates the page. The instruction that caused the fault is then restarted and will be able to access the memory as needed.

What matters, therefore, is the memory policy in effect when the allocation occurs. This is called the first touch. The first-touch policy refers to the fact that a page is allocated based on the effective policy when some process first uses a page in some fashion.
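To make first touch concrete, here is a hypothetical sketch (not from the article) in which two threads fault in different halves of one buffer. Under the default NODE LOCAL policy, each half should end up on the node where the touching thread runs. It assumes a machine with at least two nodes, libnuma, and linking with gcc -lnuma -lpthread.

/* Sketch: first-touch placement under the default NODE LOCAL policy. */
#include <numa.h>        /* numa_alloc(), numa_run_on_node(); -lnuma */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define SZ (64UL * 1024 * 1024)     /* arbitrary example size */
static char *buf;

static void *touch_half(void *arg)
{
    int node = (int)(long)arg;
    numa_run_on_node(node);                    /* run on the given node */
    memset(buf + node * (SZ / 2), 0, SZ / 2);  /* first touch allocates */
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    if (numa_available() < 0) { fprintf(stderr, "no NUMA\n"); return 1; }
    buf = numa_alloc(SZ);            /* untouched: no pages assigned yet */
    for (long n = 0; n < 2; n++)
        pthread_create(&t[n], NULL, touch_half, (void *)n);
    for (int n = 0; n < 2; n++)
        pthread_join(t[n], NULL);
    /* /proc/<pid>/numa_maps should now show the range split N0/N1. */
    numa_free(buf, SZ);
    return 0;
}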

The effective memory policy on a page depends on memory policies assigned to a memory range or on a memory policy associated with a task. If a page is in use by only a single thread, there is no ambiguity as to which policy will be followed. However, pages are often used by multiple threads, and any one of them may cause the page to be allocated. If the threads have different memory policies, the page may appear to have been allocated in surprising ways from the perspective of a process that accesses the same page later.

For example, it is fairly common that text segments are shared by all processes that use the same executable. The kernel will use the page from the text segment if it is already in memory regardless of the memory policy set on a range. The first user of a page in a text segment will therefore determine its location. Libraries are frequently shared among binaries, and especially the C library will be used by almost all processes on the system. Many of the most-used pages are therefore allocated during boot-up when the first binaries run that use the C library. The pages will at that point become established on a particular NUMA node and will stay there for the time that the system is running.

First-touch phenomena limit the placement control that a process has over its data. If the distance to a text segment has a significant impact on process performance, then dislocated pages will have to be moved in memory. Memory could appear to have been allocated on NUMA nodes not permitted by the memory policy of the current task because an earlier task has already brought the data into memory.

Moving Memory

Linux has the capability to move memory. The virtual address of the memory in the process space stays the same. Only the physical location of the data is moved to a different node. The effect can be observed by looking at /proc//numa_maps before and after a move.
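Within a program, this capability is exposed by the move_pages(2) system call (the command-line tool described below works at whole-process granularity). The following hypothetical sketch, which is not from the article and assumes a node 1 exists and linking with gcc -lnuma, moves one freshly touched page to node 1 and reports where it ended up:

/* Sketch: migrate one page of the calling process to node 1. */
#include <numaif.h>      /* move_pages(), MPOL_MF_MOVE; link with -lnuma */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    void *mem;
    if (posix_memalign(&mem, pagesize, pagesize)) return 1;
    memset(mem, 0, pagesize);       /* first touch: allocated node-local */

    void *pages[1]  = { mem };
    int   nodes[1]  = { 1 };        /* desired destination node */
    int   status[1];

    /* pid 0 means "the calling process" */
    if (move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE) != 0)
        perror("move_pages");
    else
        printf("page is now on node %d\n", status[0]);
    free(mem);
    return 0;
}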

Migrating all of a process's memory to a node can optimize application performance by avoiding cross-connect accesses if the system has placed pages on other NUMA nodes. However, a regular user can move only pages of a process that are referenced only by that process (otherwise, the user could interfere with performance optimization of processes owned by other users). Only root has the capability to move all pages of a process.

It can be difficult to ensure that all pages are local to a process since some text segments are heavily shared and there can be only one page backing an address of a text segment. This is particularly an issue with the C library and other heavily shared libraries.

Linux has a migratepages command-line tool to manually move pages around by specifying a pid and the source and destination nodes. The memory of the process will be scanned for pages currently allocated on the source node. They will be moved to the destination node.

NUMA Scheduling

The Linux scheduler had no notion of the page placement of memory in a process until Linux 3.8. Decisions about migrating processes were made based on an estimate of the cache hotness of a process's memory. If the Linux scheduler moved the execution of a process to a different NUMA node, the performance of that process could be harmed because its memory now needed access via the cross-connect. Once that move was complete the scheduler would estimate that the process memory was cache hot on the remote node and leave the process there as long as possible. As a result, administrators who wanted the best performance felt it best not to let the Linux scheduler interfere with memory placement. Processes were often pinned to a specific set of processors using taskset, or the system was partitioned using the cpusets feature to keep applications within the NUMA node boundaries.

In Linux 3.8 the first steps were taken to address this situation by merging a framework that will eventually enable the scheduler to consider the page placement and perhaps automatically migrate pages from remote nodes to the local node. However, a significant development effort is still needed, and the existing approaches do not always enhance load performance. This was the state of affairs in April 2013, when this section was written. More recent information may be found on the Linux kernel mailing list on http://vger.kernel.org or in articles on Linux Weekly News (http://lwn.net). See, for example, http://lwn.net/Articles/486858/.

Conclusion

NUMA support has been around for a while in various operating systems. NUMA support in Linux has been available since early 2000 and is continually being refined. Kernel NUMA support frequently optimizes process execution without the need for user intervention, and in most use cases an operating system can simply be run on a NUMA system, providing decent performance for typical applications.

Special NUMA configuration through tools and kernel configuration comes into play when the heuristics provided by the operating system do not provide satisfactory application performance to the end user. This is typically the case in high-performance computing, high-frequency trading, and for realtime applications, but these issues recently have become more significant for regular enterprise-class applications. Traditionally, NUMA support required special knowledge about the application and hardware for proper tuning using the knobs provided by the operating systems. Recent developments (especially around the Linux NUMA scheduler) will likely enable operating systems to automatically balance a NUMA application load properly over time.

The use of NUMA needs to be guided by the increase in performance that is possible: the larger the difference between local and remote memory access, the greater the benefits that arise from NUMA placement. NUMA latency differences are incurred only on actual memory accesses. If the application does not rely on frequent memory accesses (because, for example, the processor caches absorb most of the memory operations), NUMA optimizations will have no effect. Also, for I/O-bound applications the bottleneck is typically the device and not memory access. An understanding of the characteristics of the hardware and software is required in order to optimize applications using NUMA.

Additional Reading

McCormick, P. S., Braithwaite, R. K., Feng, W. 2011. Empirical memory-access cost models in multicore NUMA architectures. Virginia Tech Department of Computer Science.

Hacker, G. 2012. Using NUMA on RHEL 6; http://www.redhat.com/summit/2012/pdf/2012-DevDay-Lab-NUMA-Hacker.pdf.

Kleen, A. 2005. A NUMA API for Linux. Novell; http://developer.amd.com/wordpress/media/2012/10/LibNUMA-WP-fv1.pdf.

Lameter, C. 2005. Effective synchronization on Linux/NUMA systems. Gelato Conference; http://www.lameter.com/gelato2005.pdf.

Lameter, C. 2006. Remote and local memory: memory in a Linux/NUMA system. Gelato Conference; SGI.

Li, Y., Pandis, I., Mueller, R., Raman, V., Lohman, G. 2013. NUMA-aware algorithms: the case of data shuffling. University of Wisconsin-Madison / IBM Almaden Research Center.

Love, R. 2004. Linux Kernel Development. Indianapolis: Sams Publishing.

Oracle. 2010. Memory and Thread Placement Optimization Developer's Guide; http://docs.oracle.com/cd/E19963-01/html/820-1691/.

Schimmel, K. 1994. Unix Systems for Modern Architectures: Symmetric Multiprocessing and Caching for Kernel Programmers. Addison-Wesley.


Christoph Lameter specializes in high-performance computing and high-frequency trading technologies. As an operating-system designer and developer, he has been developing memory management technologies for Linux to enhance performance and reduce latencies. He is fond of new technologies and new ways of thinking that disrupt existing industries and cause new development communities to emerge.

© 2013 ACM 1542-7730/13/0700 $10.00

Originally published in ACM Queue vol. 11, no. 7.


What is NUMA? — The Linux Kernel documentation


Started Nov 1999 by Kanoj Sarcar

What is NUMA?

This question can be answered from a couple of perspectives: the hardware view and the Linux software view.

From the hardware perspective, a NUMA system is a computer platform that comprises multiple components or assemblies, each of which may contain 0 or more CPUs, local memory, and/or IO buses. For brevity and to disambiguate the hardware view of these physical components/assemblies from the software abstraction thereof, we'll call the components/assemblies 'cells' in this document.

Each of the 'cells' may be viewed as an SMP (symmetric multiprocessor) subset of the system, although some components necessary for a stand-alone SMP system may not be populated on any given cell. The cells of the NUMA system are connected together with some sort of system interconnect; e.g., a crossbar or point-to-point link are common types of NUMA system interconnects. Both of these types of interconnects can be aggregated to create NUMA platforms with cells at multiple distances from other cells.

For Linux, the NUMA platforms of interest are primarily what is known as Cache Coherent NUMA or ccNUMA systems. With ccNUMA systems, all memory is visible to and accessible from any CPU attached to any cell, and cache coherency is handled in hardware by the processor caches and/or the system interconnect.

Memory access time and effective memory bandwidth vary depending on how far away the cell containing the CPU or IO bus making the memory access is from the cell containing the target memory. For example, access to memory by CPUs attached to the same cell will experience faster access times and higher bandwidths than accesses to memory on other, remote cells. NUMA platforms can have cells at multiple remote distances from any given cell.

Platform vendors don't build NUMA systems just to make software developers' lives interesting. Rather, this architecture is a means to provide scalable memory bandwidth. However, to achieve scalable memory bandwidth, system and application software must arrange for a large majority of the memory references (cache misses) to be to 'local' memory (memory on the same cell, if any) or to the closest cell with memory.

This leads to the Linux software view of a NUMA system:

Linux divides the system's hardware resources into multiple software abstractions called 'nodes'. Linux maps the nodes onto the physical cells of the hardware platform, abstracting away some of the details for some architectures. As with physical cells, software nodes may contain 0 or more CPUs, memory and/or IO buses. And, again, memory accesses to memory on 'closer' nodes (nodes that map to closer cells) will generally experience faster access times and higher effective bandwidth than accesses to more remote cells.

For some architectures, such as x86, Linux will 'hide' any node representing a physical cell that has no memory attached, and reassign any CPUs attached to that cell to a node representing a cell that does have memory. Thus, on these architectures, one cannot assume that all CPUs that Linux associates with a given node will see the same local memory access times and bandwidth.

In addition, for some architectures, again x86 is an example, Linux supports the emulation of additional nodes. For NUMA emulation, Linux will carve up the existing nodes (or the system memory for non-NUMA platforms) into multiple nodes. Each emulated node will manage a fraction of the underlying cells' physical memory. NUMA emulation is useful for testing NUMA kernel and application features on non-NUMA platforms, and as a sort of memory resource management mechanism when used together with cpusets. [see Documentation/cgroup-v1/cpusets.txt]

For each node with memory, Linux constructs an independent memory management subsystem, complete with its own free page lists, in-use page lists, usage statistics and locks to mediate access. In addition, Linux constructs for each memory zone (one or more of DMA, DMA32, NORMAL, HIGH_MEMORY, MOVABLE) an ordered 'zonelist'. A zonelist specifies the zones/nodes to visit when a selected zone/node cannot satisfy the allocation request. This situation, when a zone has no available memory to satisfy a request, is called 'overflow' or 'fallback'.

Because some nodes contain multiple zones containing different types of memory, Linux must decide whether to order the zonelists such that allocations fall back to the same zone type on a different node, or to a different zone type on the same node. This is an important consideration because some zones, such as DMA or DMA32, represent relatively scarce resources. Linux chooses a default node-ordered zonelist. This means it tries to fall back to other zones from the same node before using remote nodes, which are ordered by NUMA distance.

By default, Linux will attempt to satisfy memory allocation requests from the node to which the CPU that executes the request is assigned. Specifically, Linux will attempt to allocate from the first node in the appropriate zonelist for the node where the request originates. This is called 'local allocation.' If the 'local' node cannot satisfy the request, the kernel will examine other nodes' zones in the selected zonelist, looking for the first zone in the list that can satisfy the request.

Local allocation will tend to keep subsequent access to the allocated memory 'local' to the underlying physical resources and off the system interconnect, as long as the task on whose behalf the kernel allocated some memory does not later migrate away from that memory. The Linux scheduler is aware of the NUMA topology of the platform (embodied in the 'scheduling domains' data structures [see Documentation/scheduler/sched-domains.txt]) and attempts to minimize task migration to distant scheduling domains. However, the scheduler does not take a task's NUMA footprint into account directly. Thus, under sufficient imbalance, tasks can migrate between nodes, remote from their initial node and kernel data structures.

System administrators and application designers can restrict a task's migration to improve NUMA locality using various CPU affinity command line interfaces, such as taskset(1) and numactl(1), and program interfaces such as sched_setaffinity(2). Further, one can modify the kernel's default local allocation behavior using Linux NUMA memory policy. [see Documentation/admin-guide/mm/numa_memory_policy.rst]
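As a concrete illustration of one of those program interfaces, the short userspace sketch below (an editorial addition, not part of this kernel document) pins the calling task to CPUs 0-3 with sched_setaffinity(2); the CPU numbers are arbitrary examples.

/* Sketch: restrict the calling task to CPUs 0-3. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu < 4; cpu++)
        CPU_SET(cpu, &set);

    /* pid 0 means "the calling task" */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    return 0;
}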

System administrators can restrict the CPUs and nodes' memories that a non-privileged user can specify in the scheduling or NUMA commands and functions using control groups and CPUsets. [see Documentation/cgroup-v1/cpusets.txt]

On architectures that do not hide memoryless nodes, Linux will include only zones (nodes) with memory in the zonelists. This means that for a memoryless node the 'local memory node' (the node of the first zone in the CPU's node's zonelist) will not be the node itself. Rather, it will be the node that the kernel selected as the nearest node with memory when it built the zonelists. So, default, local allocations will succeed with the kernel supplying the closest available memory. This is a consequence of the same mechanism that allows such allocations to fall back to other nearby nodes when a node that does contain memory overflows.

Some kernel allocations do not want or cannot tolerate this allocation fallback behavior. Rather, they want to be sure they get memory from the specified node or get notified that the node has no free memory. This is usually the case when a subsystem allocates per-CPU memory resources, for example.

A typical model for making such an allocation is to obtain the node id of the node to which the 'current CPU' is attached, using one of the kernel's numa_node_id() or cpu_to_node() functions, and then request memory from only the node id returned. When such an allocation fails, the requesting subsystem may revert to its own fallback path. The slab kernel memory allocator is an example of this. Or, the subsystem may choose to disable or not to enable itself on allocation failure. The kernel profiling subsystem is an example of this.
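A kernel-style sketch of that model (an editorial illustration, not from this document; struct foo and the explicit kmalloc() fallback are hypothetical) might look like:

/* Sketch: allocate strictly on the current CPU's node, then fall back. */
#include <linux/slab.h>       /* kmalloc_node(), kmalloc() */
#include <linux/topology.h>   /* numa_node_id() */

struct foo { int data; };

static struct foo *alloc_foo_local(void)
{
    int nid = numa_node_id();        /* node of the current CPU */
    struct foo *f;

    /* __GFP_THISNODE: fail rather than fall back to another node */
    f = kmalloc_node(sizeof(*f), GFP_KERNEL | __GFP_THISNODE, nid);
    if (!f)
        f = kmalloc(sizeof(*f), GFP_KERNEL);  /* subsystem's own fallback */
    return f;
}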

If the architecture supports (does not hide) memoryless nodes, then CPUs attached to memoryless nodes would always incur the fallback path overhead, or some subsystems would fail to initialize if they attempted to allocate memory exclusively from a node without memory. To support such architectures transparently, kernel subsystems can use the numa_mem_id() or cpu_to_mem() function to locate the 'local memory node' for the calling or specified CPU. Again, this is the same node from which default, local page allocations will be attempted.
