Crusher - a "warm-up" supercomputer with AMD EPYC and Instinct MI250X


Oak Ridge National Laboratory (ORNL) regularly appears in the news for testing and deploying new supercomputing technologies. Its Oak Ridge Leadership Computing Facility (OLCF) is currently installing Frontier, the first exascale supercomputer in the US, which is built on AMD processors and GPUs.

A number of Frontier's architectural features have now come to light, because Crusher, a small cluster at the National Center for Computational Sciences (NCCS) that uses virtually the same HPE Cray nodes as Frontier, has come online. The system serves as an early-access platform and consists of just two cabinets: the first holds 128 nodes and the second 64. The claimed total peak performance is 40 Pflops.
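The cabinet figures can be sanity-checked with a few lines of arithmetic. The sketch below uses only numbers quoted in the article (four accelerators per node, each visible as two GPUs); the per-node and per-accelerator values are simply the claimed peak divided evenly, an estimate rather than a vendor specification.

```python
# Figures quoted in the article.
CABINET_NODES = (128, 64)      # nodes in the first and second cabinet
ACCELERATORS_PER_NODE = 4      # MI250X accelerators per node
CLAIMED_PEAK_PFLOPS = 40.0     # claimed total peak performance

total_nodes = sum(CABINET_NODES)
total_accelerators = total_nodes * ACCELERATORS_PER_NODE
logical_gpus = total_accelerators * 2  # each MI250X appears as two GPUs

# Even split of the claimed peak: an estimate, not a measured value.
tflops_per_node = CLAIMED_PEAK_PFLOPS * 1000 / total_nodes
tflops_per_accelerator = tflops_per_node / ACCELERATORS_PER_NODE

print(f"nodes: {total_nodes}, accelerators: {total_accelerators}, "
      f"logical GPUs: {logical_gpus}")
print(f"about {tflops_per_node:.1f} Tflops per node, "
      f"{tflops_per_accelerator:.1f} Tflops per accelerator")
```

That gives 192 nodes and 768 accelerators, or roughly 208 Tflops of claimed peak per node.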

The heart of each node is a custom version of the AMD EPYC processor, the 7A53. Its 64 cores (with SMT2) are split into four NUMA domains served by separate memory controllers. Eight DDR4 channels (512 GB in total per node) provide 205 GB/s of throughput.

Each node has only four accelerators, but they are the latest dual-chip AMD Instinct MI250X, so the system sees them as eight separate GPUs. Each accelerator is connected to a single NUMA domain via two Infinity Fabric channels that provide 36 GB/s in each direction. Each NUMA domain is in turn divided into two L3 sub-domains, each tied to a single GPU, which allows for flexible load distribution. The chips inside an MI250X are connected to each other via a faster channel, giving 200 GB/s in both directions. All of the accelerators are connected to one another in an all-to-all arrangement over 50 GB/s channels. They are also attached directly to the network fabric: each one has its own HPE Slingshot adapter (200 Gbps).

Only a pair of 1.92 TB SSDs (4 GB/s write, 1.6 million IOPS in random operations) is connected to the CPU, via a PCIe switch. The primary storage is an external IBM Spectrum Scale system with a total capacity of 250 PB and a peak rate of 2.5 TB/s.
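The node layout described above can be written down as a small model, which makes the relationships between cores, NUMA domains, L3 sub-domains and GPUs easier to follow. This is only an illustrative sketch: the article does not name the memory speed, so DDR4-3200 is an assumption chosen because it reproduces the quoted 205 GB/s, and the sequential GPU and core numbering is hypothetical and need not match the device IDs reported on a real node.

```python
# Figures quoted in the article.
CORES = 64
NUMA_DOMAINS = 4
L3_REGIONS_PER_NUMA = 2
MEMORY_CHANNELS = 8
SLINGSHOT_GBITS = 200

# Assumption (not stated in the article): DDR4-3200, 8 bytes per transfer.
DDR4_MEGATRANSFERS = 3200
BYTES_PER_TRANSFER = 8


def memory_bandwidth_gbytes():
    """Peak DDR4 bandwidth in GB/s summed over all memory channels."""
    return MEMORY_CHANNELS * DDR4_MEGATRANSFERS * BYTES_PER_TRANSFER / 1000


def node_layout():
    """Map each logical GPU to a card, a NUMA domain and a core range.

    The sequential numbering is illustrative only; real IDs may differ.
    """
    l3_regions = NUMA_DOMAINS * L3_REGIONS_PER_NUMA
    cores_per_l3 = CORES // l3_regions
    layout = []
    for gpu in range(l3_regions):  # one logical GPU per L3 sub-domain
        first_core = gpu * cores_per_l3
        layout.append({
            "gpu": gpu,
            "card": gpu // 2,  # two logical GPUs per MI250X
            "numa": gpu // L3_REGIONS_PER_NUMA,
            "cores": range(first_core, first_core + cores_per_l3),
        })
    return layout


if __name__ == "__main__":
    print(f"DDR4 peak: {memory_bandwidth_gbytes():.1f} GB/s")
    print(f"Slingshot per adapter: {SLINGSHOT_GBITS / 8:.1f} GB/s")
    for entry in node_layout():
        cores = entry["cores"]
        print(f"GPU {entry['gpu']} (card {entry['card']}) -> "
              f"NUMA {entry['numa']}, cores {cores.start}-{cores.stop - 1}")
```

Under these assumptions each logical GPU is paired with eight cores, and a 200 Gbps Slingshot adapter works out to 25 GB/s, well below the 36 GB/s CPU link and the 50 GB/s links between accelerators.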

Future Frontier supercomputer

The system also has access to the NCCS network, though not directly. On the shared NFS storage each project can get 50 GB with a retention period of 90 days, while GPFS on Spectrum Scale offers 50 TB. Crusher comes with a large set of preinstalled software. The user environment is modular and based on the Lmod system, which is written in Lua. Job scheduling is handled by Slurm. Authentication uses RSA SecurID hardware tokens.
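With Lmod and Slurm in place, a job on such a node would typically load its modules and then launch one task per GPU. The sketch below only generates the text of a batch script; it is not taken from OLCF documentation. The project ID, module name and command are hypothetical placeholders, and the choice of eight tasks per node simply follows the eight GPUs visible on each node.

```python
def make_batch_script(project, nodes, modules, command, minutes=10):
    """Return the text of a Slurm batch script as a string.

    One task per logical GPU (eight per node) follows the node description
    in the article; project, modules and command are placeholders.
    """
    tasks_per_node = 8
    lines = [
        "#!/bin/bash",
        f"#SBATCH --account={project}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --time={minutes}",  # a bare number means minutes
        "",
    ]
    # Lmod: load each requested module before launching the job step.
    lines += [f"module load {name}" for name in modules]
    lines += [
        "",
        f"srun --ntasks={nodes * tasks_per_node} "
        f"--ntasks-per-node={tasks_per_node} "
        f"--gpus-per-task=1 --gpu-bind=closest {command}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical project ID, module name and executable.
    print(make_batch_script("ABC123", 2, ["rocm"], "./my_app"))
```

The `--gpu-bind=closest` option asks Slurm to bind each task to the GPU nearest its cores, which matters on a node where every GPU is tied to a specific L3 sub-domain.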
