Nvidia and AMD are often compared the way people compare Samsung with Apple or Sony with Microsoft. The rivalry stems from the similarity of their products and services, which usually leads consumers to pick one over the other for reasons of their own.
Rivalry aside, it came as a surprise when Nvidia picked its main competitor to supply the server processors for its new DGX A100 deep learning system, rather than Intel’s Xeon platform as many had expected. While many were left speculating about the reason for such a drastic move, the company itself has come forward to explain it.
Nvidia’s first two DGX systems used Intel’s Xeon CPUs, but the company dropped them in the DGX A100 in favor of two AMD 64-core, Zen 2-based Epyc 7742 CPUs. The new system is built on Nvidia’s new Ampere-based A100 GPU and boasts 5 petaflops of AI compute performance and 320GB of GPU memory with 12.4TB per second of bandwidth. That’s a lot of computing power.
Charlie Boyle, Nvidia’s vice president and general manager of DGX, said the decision came down to the extra features and performance offered by the Epyc processor. “To keep the GPUs in our system supplied with data, we needed a fast CPU with as many cores and PCI lanes as possible. The AMD CPUs we use have 64 cores each, lots of PCI lanes, and support PCIe Gen4,” he explained.
Besides having eight more cores than Intel’s Xeon Platinum 9282, the Epyc 7742 supports eight-channel memory, whereas Intel’s Xeon Scalable processors support just six memory channels. The AMD processor is also much cheaper, at US$6,950 compared to US$25,000, despite having more cache and a lower TDP.
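To put the pricing gap in perspective, here is a rough price-per-core calculation using only the figures quoted above. The Xeon core count of 56 is derived from the "eight more cores" statement, and both prices are the article's numbers, not verified list prices, so treat this as an illustrative sketch.

```python
# Figures as quoted in the article. The Xeon core count (56) is derived
# from the statement that the 64-core Epyc has eight more cores.
chips = {
    "AMD Epyc 7742": {"cores": 64, "price_usd": 6950},
    "Intel Xeon Platinum 9282": {"cores": 56, "price_usd": 25000},
}


def price_per_core(chip):
    """Return the price in US dollars divided by the number of cores."""
    return chip["price_usd"] / chip["cores"]


per_core = {name: price_per_core(spec) for name, spec in chips.items()}
for name, value in per_core.items():
    print(f"{name}: ${value:,.2f} per core")

# How many times more expensive the Xeon is, core for core.
ratio = per_core["Intel Xeon Platinum 9282"] / per_core["AMD Epyc 7742"]
print(f"Xeon costs about {ratio:.1f}x as much per core")
```

On these numbers the Epyc works out to roughly US$109 per core against roughly US$446 for the Xeon, a gap of about four to one.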
PCIe 4.0 support was one of the major factors in choosing Epyc, as Intel’s processors still support only PCIe 3.0. AMD’s CPUs offer 128 lanes and a peak PCIe bandwidth of 512GB/s. “The DGX A100 is the first accelerated system to be all PCIe Gen4, which doubles the bandwidth from PCIe Gen3. All of our IO in the system is Gen4: GPUs, Mellanox CX6 NICs, AMD CPUs, and the NVMe drives we use to stream AI data,” Boyle said.
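The 512GB/s figure and the "doubles the bandwidth" claim can be sanity-checked with a little arithmetic. The sketch below assumes that the 512GB/s peak is a raw, bidirectional total across all 128 lanes, which is one plausible reading and is not stated in the article. The per-lane rates (8 GT/s for Gen3, 16 GT/s for Gen4) and the 128b/130b line coding come from the PCIe specifications.

```python
# Per-lane transfer rates in gigatransfers per second (PCIe specifications).
TRANSFER_RATE_GT_S = {"gen3": 8.0, "gen4": 16.0}

# Gen3 and Gen4 both use 128b/130b line coding: 128 payload bits per 130 sent.
ENCODING_EFFICIENCY = 128 / 130


def lane_bandwidth_gb_s(gen, encoded=True):
    """One-direction bandwidth of a single lane, in GB/s."""
    rate = TRANSFER_RATE_GT_S[gen]
    if encoded:
        rate *= ENCODING_EFFICIENCY
    return rate / 8  # each transfer carries one bit per lane; 8 bits per byte


def total_bandwidth_gb_s(gen, lanes, encoded=True):
    """Bidirectional bandwidth summed over all lanes, in GB/s."""
    return lane_bandwidth_gb_s(gen, encoded) * lanes * 2


LANES = 128  # lane count quoted in the article

raw_gen3 = total_bandwidth_gb_s("gen3", LANES, encoded=False)
raw_gen4 = total_bandwidth_gb_s("gen4", LANES, encoded=False)
eff_gen4 = total_bandwidth_gb_s("gen4", LANES, encoded=True)

print(f"Gen3 raw:  {raw_gen3:.0f} GB/s")
print(f"Gen4 raw:  {raw_gen4:.0f} GB/s")
print(f"Gen4 after 128b/130b encoding: {eff_gen4:.1f} GB/s")
```

Under that assumption the raw Gen4 total comes to 512 GB/s, twice the Gen3 figure, while the usable rate after line-coding overhead is closer to 504 GB/s.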
AMD also has the upper hand in using a 7nm manufacturing process, though Intel’s 10nm Ice Lake server CPUs, which are expected to feature PCIe 4.0 support, are due to arrive later this year.