Measuring NUMA (Non-Uniform Memory Access): no observable asymmetry. Why?

Views: 62, Date: 2024-03-28


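Ah hah! Mysticial is right! Somehow, hardware prefetching is optimizing my reads/writes.

If it were a cache optimization, then forcing a memory barrier would defeat the optimization: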

c = __sync_fetch_and_add(((char*)x) + j, 1);

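But that makes no difference. What does make a difference is multiplying my iterator index by the prime 1009 to defeat the prefetching optimization: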

*(((char*)x) + ((j * 1009) % N)) += 1;
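The strided pass above can be sketched as a standalone function. This is only an illustration: the names touch_strided and gcd_sz are hypothetical and do not appear in the original program. Because 1009 is prime and does not divide N = 10,000,000, the expression (j * 1009) % N visits every index exactly once, just not in sequential order, which is presumably what stops the prefetcher from streaming ahead.

```cpp
#include <cassert>
#include <cstddef>

// Greatest common divisor (hypothetical helper, used only for the precondition).
static std::size_t gcd_sz(std::size_t a, std::size_t b)
{
    while (b != 0) {
        std::size_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// Increment every byte of buf exactly once, visiting indices in the order
// (j * stride) % n instead of 0, 1, 2, ...
// Requires gcd(n, stride) == 1 so that the index sequence is a permutation.
// Assumes j * stride does not overflow size_t (true for n = 10,000,000 and
// stride = 1009 on a 64-bit system).
void touch_strided(char* buf, std::size_t n, std::size_t stride = 1009)
{
    assert(n > 0 && gcd_sz(n, stride) == 1);
    for (std::size_t j = 0; j < n; ++j) {
        buf[(j * stride) % n] += 1;
    }
}
```

Calling touch_strided((char*)x, N) in place of the inner loop would reproduce the modified access pattern.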

With that change, the NUMA asymmetry is clearly revealed:

numa_available() 0
numa node 0 10101010 12884901888
numa node 1 01010101 12874584064
Elapsed read/write by same thread that allocated on core 0: 00:00:00.961725
Elapsed read/write by thread on core 0: 00:00:00.942300
Elapsed read/write by thread on core 1: 00:00:01.216286
Elapsed read/write by thread on core 2: 00:00:00.909353
Elapsed read/write by thread on core 3: 00:00:01.218935
Elapsed read/write by thread on core 4: 00:00:00.898107
Elapsed read/write by thread on core 5: 00:00:01.211413
Elapsed read/write by thread on core 6: 00:00:00.898021
Elapsed read/write by thread on core 7: 00:00:01.207114

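At least I think that's what's going on.

Thanks, Mysticial!

For anyone just glancing at this post to get a rough idea of the performance characteristics of NUMA, here is the bottom line according to my tests:

Memory access to a non-local NUMA node has about 1.33 times the latency of memory access to a local node.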
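The 1.33 figure can be checked against the timings printed earlier in this answer. The sketch below averages the elapsed times for the cores on node 0 (cores 0, 2, 4, 6, where the array lives) and on node 1 (cores 1, 3, 5, 7) and divides one by the other. The function and array names are illustrative only, and the ratio is of elapsed times for the whole benchmark loop, which is what the latency estimate above is based on.

```cpp
#include <cassert>
#include <cstddef>

// Arithmetic mean of n timings, in seconds.
double mean(const double* t, std::size_t n)
{
    assert(n > 0);
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += t[i];
    }
    return sum / static_cast<double>(n);
}

// Elapsed times copied from the output in this answer.
// Local: threads on cores 0, 2, 4, 6 (same node as the array).
const double local_s[]  = {0.942300, 0.909353, 0.898107, 0.898021};
// Remote: threads on cores 1, 3, 5, 7 (the other node).
const double remote_s[] = {1.216286, 1.218935, 1.211413, 1.207114};

// Mean remote time divided by mean local time; roughly 1.33 for this data.
double remote_to_local_ratio()
{
    return mean(remote_s, 4) / mean(local_s, 4);
}
```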

Question

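I tried to measure the asymmetric memory access effects of NUMA, and failed.

The experiment

Performed on an Intel Xeon X5570 @ 2.93 GHz, 2 CPUs, 8 cores.

On a thread pinned to core 0, I allocate an array x of 10,000,000 bytes on core 0's NUMA node with numa_alloc_local. I then iterate over array x 50 times, reading and writing every byte, and measure the elapsed time for the 50 iterations.

Then, on each of the other cores in the server, I pin a new thread and again measure the elapsed time for 50 iterations of reading and writing every byte in array x.

Array x is large in order to minimize cache effects. We want to measure the speed when the CPU has to go all the way to RAM to load and store, not when the caches are helping.

There are two NUMA nodes in my server, so I would expect the cores with affinity to the node on which array x is allocated to have faster read/write speed. I'm not seeing that.

Why?

Perhaps NUMA is only relevant on systems with more than 8-12 cores, as I've seen suggested elsewhere?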

http://lse.sourceforge.net/numa/faq/

numatest.cpp

#include <numa.h>
#include <iostream>
#include <boost/bind/bind.hpp>
#include <boost/thread/thread.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <pthread.h>

// Pin the calling thread to a single core.
void pin_to_core(size_t core)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}

// Print a libnuma bitmask as a string of 0s and 1s, bit 0 first.
std::ostream& operator<<(std::ostream& os, const bitmask& bm)
{
    for (size_t i = 0; i < bm.size; ++i)
    {
        os << numa_bitmask_isbitset(&bm, i);
    }
    return os;
}

// Allocate N bytes on the local NUMA node of `core`, time M read/write
// passes over them, and hand the buffer back through *x.
void thread1(void** x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    void* y = numa_alloc_local(N);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0); i < M; ++i)
        for (size_t j(0); j < N; ++j)
        {
            c = ((char*)y)[j];
            ((char*)y)[j] = c;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by same thread that allocated on core " << core << ": " << (t2 - t1) << std::endl;

    *x = y;
}

// Time M read/write passes over the existing N-byte buffer x from `core`.
void thread2(void* x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0); i < M; ++i)
        for (size_t j(0); j < N; ++j)
        {
            c = ((char*)x)[j];
            ((char*)x)[j] = c;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by thread on core " << core << ": " << (t2 - t1) << std::endl;
}

int main(int argc, const char **argv)
{
    int numcpus = numa_num_task_cpus();
    std::cout << "numa_available() " << numa_available() << std::endl;
    numa_set_localalloc();

    // Print which cores belong to each NUMA node and the node's memory size.
    bitmask* bm = numa_bitmask_alloc(numcpus);
    for (int i = 0; i <= numa_max_node(); ++i)
    {
        numa_node_to_cpus(i, bm);
        std::cout << "numa node " << i << " " << *bm << " " << numa_node_size(i, 0) << std::endl;
    }
    numa_bitmask_free(bm);

    void* x = NULL;
    size_t N(10000000);
    size_t M(50);

    // Allocate and time on core 0.
    boost::thread t1(boost::bind(&thread1, &x, 0, N, M));
    t1.join();

    // Time the same buffer from every core, one thread at a time.
    for (int i = 0; i < numcpus; ++i)
    {
        boost::thread t2(boost::bind(&thread2, x, i, N, M));
        t2.join();
    }

    numa_free(x, N);

    return 0;
}

Output

g++ -o numatest -pthread -lboost_thread -lnuma -O0 numatest.cpp

./numatest

numa_available() 0 <-- NUMA is available on this system
numa node 0 10101010 12884901888 <-- cores 0,2,4,6 are on NUMA node 0, which is about 12 GB
numa node 1 01010101 12874584064 <-- cores 1,3,5,7 are on NUMA node 1, which is slightly smaller than node 0
Elapsed read/write by same thread that allocated on core 0: 00:00:01.767428
Elapsed read/write by thread on core 0: 00:00:01.760554
Elapsed read/write by thread on core 1: 00:00:01.719686
Elapsed read/write by thread on core 2: 00:00:01.708830
Elapsed read/write by thread on core 3: 00:00:01.691560
Elapsed read/write by thread on core 4: 00:00:01.686912
Elapsed read/write by thread on core 5: 00:00:01.691917
Elapsed read/write by thread on core 6: 00:00:01.686509
Elapsed read/write by thread on core 7: 00:00:01.689928

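It takes about 1.7 seconds to do the 50 iterations of reading and writing over array x, no matter which core is doing the reading and writing.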
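To decide which of the timings are local and which are remote, the node masks printed by numatest have to be read as core lists: character i of the mask is 1 when core i belongs to that node. A small sketch of that decoding follows; cores_in_mask is a hypothetical helper that is not part of numatest.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Convert a mask string as printed by numatest (for example "10101010")
// into the list of core indices whose bit is set.
// Character i of the string corresponds to core i.
std::vector<std::size_t> cores_in_mask(const std::string& mask)
{
    std::vector<std::size_t> cores;
    for (std::size_t i = 0; i < mask.size(); ++i) {
        assert(mask[i] == '0' || mask[i] == '1');
        if (mask[i] == '1') {
            cores.push_back(i);
        }
    }
    return cores;
}
```

For the masks in the output, "10101010" decodes to cores 0, 2, 4, 6 and "01010101" to cores 1, 3, 5, 7, which matches the annotations next to the node lines.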

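Update:

The cache size on my CPUs is 8 MB, so a 10 MB array x may not be big enough to eliminate cache effects. I tried a 100 MB array x, and I tried issuing a full memory fence with __sync_synchronize() inside my innermost loops. It still doesn't reveal any asymmetry between NUMA nodes.

Update 2:

I tried reading and writing array x with __sync_fetch_and_add(). Still nothing.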
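For reference, the variant from Update 2 might look like the sketch below. It assumes a GCC-compatible compiler, since __sync_fetch_and_add is a GCC builtin, and the function name atomic_pass is made up for this illustration. Each call is an atomic read-modify-write that GCC documents as a full barrier in most cases, so the loop both reads and writes every byte, in the same sequential order as before.

```cpp
#include <cassert>
#include <cstddef>

// One sequential pass over buf, atomically incrementing every byte.
// Returns the value the last byte held before it was incremented
// (0 if n is 0).
char atomic_pass(char* buf, std::size_t n)
{
    assert(buf != NULL || n == 0);
    char last = 0;
    for (std::size_t j = 0; j < n; ++j) {
        // GCC builtin: atomically adds 1 and returns the previous value.
        last = __sync_fetch_and_add(buf + j, 1);
    }
    return last;
}
```

Note that the index still advances by one byte per step, so the access order is unchanged by the barrier.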
