原文:Microbenchmarking Nvidia’s RTX 4090
作者:clamchowder

NVIDIA RTX4090,架构代号Ada Lovelace,采用台积电4nm技术,核心代号AD102。RTX4090除了NV官方大肆宣扬的光追性能外,拥有128个SMs(实际上AD102有144个,GA102有84个)。

此文中的对比试验主要是来自OpenCL写的Microbenchmarking Kernel,主要关注缓存和存储子系统。

一些名词的对应
SMs, Streaming Multiprocessors (NV) = WGPs, Workgroup Processors (AMD)
Shared Memory (NV) = Local Data Share (AMD)

Cache and Memory Latency

Ada Lovelace 的L2大小从Ampere的6MB提升到了96MB(AD102),其中RTX 4090使用了72MB。同时,VRAM的带宽没有显著增加,这里NV的策略是通过提高L2的缓存大小来缓解当计算容量扩展时存储带宽的瓶颈。

Both of Nvidia’s recent architectures have a large L1 cache private to each SM, but Ada Lovelace substantially expands the L2 cache. Ampere had a conventionally sized 6 MB L2, but AD102 packs 96 MB of L2. The RTX 4090 has 72 MB of that L2 cache enabled.

AMD 的 RDNA2 架构使用4层设计的缓存结构,最后一层是Infinity Cache堆叠。Ada Lovelace在L2延迟上相对Ampere有所降低,但是还是不及RDNA2的L2,但是要比Infinity Cache的延迟低。NV用一级缓存就填补了AMD的L2和Infinity Cache两级缓存作用。

Bandwidth

Single Group

RDNA2 在进入L2及以后的部分,带宽要优于Ada Lovelace,同时优于Ampere。Ada Lovelace在这部分相较于Ampere还是有提升的。RDNA2 相比较而言,更擅长处理低占用率的情况。

Groups Scale

对于NV的L2 Cache带宽和AMD的Infinity Cache带宽,在低占用的情况下(< 24),RDNA2 保持优势,在高占用的情况下,NV则占据上风。同时在全部的情况下,Ada Lovelace的带宽都要优于Ampere。

At high occupancy, Ampere had a decent advantage over RDNA 2, but Ada Lovelace takes it a step further. RDNA 2’s L2 and Infinity Cache are both at a clear disadvantage against Lovelace’s L2.

当测试超出L2 Cache时,上面描述的情况更显著,只是Ada Lovelace 的scale up 的速度不及Ampere。但是这并不会影响很大,因为Ada Lovelace拥有更大的L2容量,对VRAM的访问会更少。

VRAM bandwidth also hasn’t dramatically increased over Ampere, but both Nvidia cards still have a ton of bandwidth in absolute terms. For perspective, they have almost as much theoretical VRAM bandwidth as AMD’s Radeon VII, which has a HBM2 based VRAM subsystem.

RDNA2 更依赖于Infinity Cache,VRAM的带宽随着计算规模的占用增加并没有显著增加。

在高占用的测试负载下,Ada Lovelace于L1处的带宽超出RDNA2和Ampere很多,大概2.6倍。当在L2及以后时,RDNA2的带宽要优于Ampere,但是不及Ada Lovelace(即使RDNA2有Infinity Cache加持)。这也验证了之前的论断,RDNA2相比较更擅长处理低占用的情况,而NV在高占用的情况下更有优势。

Local Memory Latency

对于shared memory的延迟,NV架构中shared memory和L1Cache共用一套底层SRAM(128KB),相较于RDNA2(单独的L0单元)的LDS大小(128KB)来说是不足的,但是延迟情况要更优。

A small caveat here is that both of Nvidia’s architectures use a single 128 KB block of SRAM to serve as both L1 cache and local memory. The architecture can provide different combinations of L1 cache and local memory sizes by varying how much of that SRAM is tagged. AMD in contrast uses 16 KB of dedicated L0 vector cache for each CU (half of a WGP), and a dedicated 128 KB of local memory.

Compute && Atomics Latency

Ada Lovelace拥有更快的时钟,在指令周期上和Ampere差不多。

At least as far as we can tell, Ada Lovelace performs a lot like Ampere. Latency for common FP32 operations like addition and fused multiply add remain at 4 cycles. RDNA 2 for comparison has 5 cycle latency for those operations.

在原子操作延迟(可以认为是core to core延迟)上,Ada Lovelace还是不及RDNA2

总结

Nvidia’s new architecture improves performance at low occupancy, eroding one of RDNA 2’s key advantages over GA102. At high occupancy, Ampere was already very strong. Ada Lovelace extends that strength, and towers over RDNA 2.