For a long time NVIDIA was the leader in using graphics architectures for computing, but its longtime rival AMD is not about to cede ground. In response to the announcement of the Ampere architecture and the next-generation A100 accelerators, AMD today announced the world's first CDNA-based accelerator, the high-performance Instinct MI100 processor.
For quite a long time the approach to designing graphics chips remained unified, but it soon turned out that what is good for games is not always good for computing, and some of the hardware is simply redundant for applications unrelated to rendering 3D graphics. Raster operation units (RBE/ROP) and texture mapping units are examples. What was bound to happen did happen: the evolutionary branches of "graphics" and "compute" processors merged for a while and then began to diverge again. The new AMD Instinct MI100 processor belongs to the purely computational branch of this kind of chip.
AMD now has two main architectures at its disposal, RDNA and CDNA, which correspond to the two GPU development branches mentioned above. Naturally, the new Instinct MI100 processor has inherited a great deal from its evolutionary relatives, such as the scalar and vector instruction execution units: after all, it makes no difference whether they compute graphics or something else. However, the newcomer also has a number of differences that let it lay claim to the title of the most powerful and versatile GPU-based accelerator in the world.
Graphics processor evolution: the feature sets diverge.
AMD has significantly strengthened its position in recent years, and this shows in its own unified IP infrastructure: the new chip is manufactured on a 7 nm process, and all interconnects in MI100, both internal and external, are based on second-generation AMD Infinity Fabric. The external links are 16 bits wide and operate at 23 GT/s, and whereas previous Instinct models had at most two of them, the number of Infinity Fabric links has now been increased to three. This makes it easy to build systems of four MI100s with all-to-all interprocessor communication, in which every accelerator is directly linked to the other three, which minimizes latency.
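The link figures quoted above can be turned into a rough estimate. The following is a minimal sketch that uses only the width, transfer rate and link count given in the article; it computes a raw one-direction figure with no protocol overhead subtracted (the article does not state any), and the function names are illustrative rather than part of any AMD tool.

```python
from itertools import combinations

LINK_WIDTH_BITS = 16      # from the article
TRANSFER_RATE_GT_S = 23   # from the article
LINKS_PER_GPU = 3         # from the article


def raw_link_bandwidth_gb_s(width_bits=LINK_WIDTH_BITS, rate_gt_s=TRANSFER_RATE_GT_S):
    """Raw one-direction bandwidth of a single link in GB/s (no overhead modeled)."""
    return width_bits * rate_gt_s / 8


def all_to_all_links(num_gpus):
    """Every GPU pair that needs a direct link in a fully connected group."""
    return list(combinations(range(num_gpus), 2))


per_link = raw_link_bandwidth_gb_s()
pairs = all_to_all_links(4)
# Each link has two endpoints, so divide the endpoint count by the GPU count.
links_per_gpu = len(pairs) * 2 // 4

print(f"raw bandwidth per link, one direction: {per_link} GB/s")
print(f"links needed for 4 GPUs: {len(pairs)} in total, {links_per_gpu} per GPU")
```

A fully connected group of four needs exactly three links per device, which is why the third link matters: with two links, four accelerators could only be joined in a ring.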
Instinct MI100 Accelerators have received a third Infinity Fabric channel.
The overall internal organization of the MI100 processor is inherited from the GCN architecture; it is built around 120 compute units (CUs). With AMD's scheme of 64 shader processors per CU, that amounts to 7680 stream processors. At the compute unit level, however, the architecture has been significantly redesigned to better meet the requirements of a modern compute accelerator.
In addition to the standard scalar and vector instruction execution units, a new matrix math module, the Matrix Core Engine, has been added, while all fixed-function blocks have been removed from the MI100 silicon: rasterization, tessellation, graphics caches and, of course, display output. The universal video encode/decode engine, however, has been retained, since it is used quite often in compute workloads that process multimedia data.
Block diagram of the compute units in MI100
Each CU contains one scalar instruction unit with its own register file and data cache, and four vector instruction units optimized for FP32 computation. Each vector unit is 16 lanes wide and processes 64 threads (a wavefront, in AMD terminology) in four cycles. But the most important part of the new processor's architecture is the new matrix operation units.
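The numbers describing the compute units fit together with simple arithmetic. The sketch below only restates figures from the article (120 CUs, 64 shader processors per CU, four 16-wide vector units, 64-thread wavefronts); the variable names are illustrative.

```python
COMPUTE_UNITS = 120          # from the article
SHADERS_PER_CU = 64          # from the article
VECTOR_UNITS_PER_CU = 4      # from the article
VECTOR_UNIT_WIDTH = 16       # lanes per vector unit, from the article
WAVEFRONT_SIZE = 64          # threads per wavefront, from the article

# Total stream processors on the chip.
total_stream_processors = COMPUTE_UNITS * SHADERS_PER_CU

# A 64-thread wavefront pushed through a 16-wide unit takes 64 / 16 passes.
cycles_per_wavefront = WAVEFRONT_SIZE // VECTOR_UNIT_WIDTH

# Four 16-wide units account for the 64 shader processors per CU.
lanes_per_cu = VECTOR_UNITS_PER_CU * VECTOR_UNIT_WIDTH

print(total_stream_processors, cycles_per_wavefront, lanes_per_cu)
```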
The Matrix Core Engines allow MI100 to execute a new family of instructions, MFMA (Matrix Fused Multiply-Add). Operations on KxN matrices can take mixed input data types: INT4, INT8, FP16 and FP32 are supported, as well as the new bfloat16 (bf16) type; the result, however, is produced only in INT32 or FP32. Support for so many data types was introduced for versatility, so that MI100 can deliver high efficiency across a wide range of computing scenarios.
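To make the idea of narrow inputs with a wide result concrete, here is a minimal pure-Python sketch of the arithmetic behind a matrix fused multiply-add, D = A x B + C, with INT8 inputs and an INT32 accumulator. It illustrates the math only: the function `mfma_int8` is hypothetical, it is not an AMD instruction or API, and raising an error on INT32 overflow is an illustrative choice, not a model of what the hardware does.

```python
INT8_MIN, INT8_MAX = -128, 127
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def mfma_int8(a, b, c):
    """Compute D = A x B + C with INT8 inputs and INT32 accumulation.

    a is M x K, b is K x N, c is M x N (lists of lists of ints).
    """
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a) or any(len(row) != n for row in b):
        raise ValueError("matrix dimensions do not match")
    if len(c) != m or any(len(row) != n for row in c):
        raise ValueError("C must be M x N")
    for matrix in (a, b):
        for row in matrix:
            if any(not INT8_MIN <= x <= INT8_MAX for x in row):
                raise ValueError("inputs must fit in INT8")

    d = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = c[i][j]  # start from the accumulator matrix
            for p in range(k):
                acc += a[i][p] * b[p][j]
            if not INT32_MIN <= acc <= INT32_MAX:
                raise OverflowError("result does not fit in INT32")
            d[i][j] = acc
    return d


a = [[127, -128], [1, 2]]
b = [[127, 0], [-128, 1]]
c = [[10, 20], [30, 40]]
result = mfma_int8(a, b, c)
print(result)  # [[32523, -108], [-99, 42]]
```

The first element, 32523, already falls far outside the INT8 range, which shows why the output format has to be wider than the inputs.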
By using Infinity Fabric 2.0, MI100 performance can be further enhanced.
Each CU has its own scheduler, a branch unit, 16 load-store units, and an L1 cache and Local Data Share of 16 and 64 KB respectively. The L2 cache, by contrast, is shared by the entire chip; it is 16-way set-associative and 8 MB in size. The aggregate L2 bandwidth reaches 6 TB/s.
Larger volumes of data fall to the external memory subsystem. In MI100 this is HBM2: the new processor supports HBM2 stacks four or eight dies high running at 2.4 GT/s. Total memory bandwidth can reach 1.23 TB/s, which is 20% more than in AMD's previous compute accelerators. The memory capacity is 32 GB, and error correction is supported.
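The quoted memory bandwidth follows from the transfer rate once a bus width is fixed. The sketch below assumes a 4096-bit interface (four 1024-bit HBM2 stacks); that width is an assumption that does not appear in the article, while the 2.4 GT/s rate and the 20% improvement do.

```python
BUS_WIDTH_BITS = 4096      # ASSUMPTION: four 1024-bit HBM2 stacks, not stated in the article
TRANSFER_RATE_GT_S = 2.4   # from the article
IMPROVEMENT = 1.20         # "20% faster", from the article


def memory_bandwidth_gb_s(bus_width_bits, rate_gt_s):
    """Peak bandwidth in GB/s: bits per transfer times transfers per second, over 8."""
    return bus_width_bits * rate_gt_s / 8


bandwidth = memory_bandwidth_gb_s(BUS_WIDTH_BITS, TRANSFER_RATE_GT_S)
implied_previous = bandwidth / IMPROVEMENT

print(f"peak bandwidth: {bandwidth:.1f} GB/s (about {bandwidth / 1000:.2f} TB/s)")
print(f"implied previous generation: {implied_previous:.0f} GB/s")
```

Under that assumption the result rounds to the 1.23 TB/s given in the text, and the 20% figure implies roughly 1 TB/s for the preceding generation.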
Instinct MI100 general block diagram
The "brain" of the Instinct MI100 chip consists of four command processors (ACE on the block diagram). Their task is to receive the command stream from the API and distribute work to the individual compute units. To connect to the host processor, MI100 has a PCI Express 4.0 controller, which provides 32 GB/s of bandwidth in each direction. The Instinct MI100 accelerator will therefore feel most at home alongside second-generation AMD EPYC CPUs or in systems based on IBM POWER9/10.
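The 32 GB/s figure is the usual rounded number for a sixteen-lane PCI Express 4.0 link. As a rough sketch, assuming an x16 link (the lane count is not stated in the article) and the standard 16 GT/s per lane with 128b/130b encoding:

```python
LANE_RATE_GT_S = 16              # PCIe 4.0 per-lane transfer rate
LANES = 16                       # ASSUMPTION: x16 link, not stated in the article
ENCODING_EFFICIENCY = 128 / 130  # 128b/130b line encoding used by PCIe 3.0 and later


def pcie_bandwidth_gb_s(lane_rate_gt_s, lanes, efficiency):
    """One-direction bandwidth in GB/s after line-encoding overhead."""
    return lane_rate_gt_s * lanes * efficiency / 8


raw = pcie_bandwidth_gb_s(LANE_RATE_GT_S, LANES, 1.0)
effective = pcie_bandwidth_gb_s(LANE_RATE_GT_S, LANES, ENCODING_EFFICIENCY)

print(f"raw: {raw:.1f} GB/s, after encoding: {effective:.1f} GB/s per direction")
```

The raw value is exactly 32 GB/s; after encoding overhead it is about 31.5 GB/s, before any packet-level protocol overhead.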
Having shed unnecessary architectural blocks and optimized the architecture for computing in the widest possible range of formats, Instinct MI100 can claim to be universal. Accelerators with such capabilities, as AMD rightly believes, will become an important building block in the ecosystem of next-generation exascale HPC machines. AMD claims that this is the first accelerator capable of delivering more than 10 Tflops in double-precision FP64 mode, with a peak of 11.5 Tflops.
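The peak figure can be cross-checked against the 7680 stream processors mentioned earlier. The sketch below works backwards to the clock frequency that the 11.5 Tflops peak would imply, assuming each lane completes one fused multiply-add (two floating-point operations) per cycle in FP32 and that FP64 runs at half that rate. The half-rate ratio is an assumption not stated in the article, so the resulting clock is an estimate rather than a quoted specification.

```python
STREAM_PROCESSORS = 7680   # 120 CUs x 64, from the article
PEAK_FP64_TFLOPS = 11.5    # from the article
FLOPS_PER_FMA = 2          # one multiply plus one add
FP64_RATE = 0.5            # ASSUMPTION: FP64 at half the vector FP32 rate


def implied_clock_ghz(peak_tflops, lanes, flops_per_lane_per_clock):
    """Clock (GHz) needed to reach a given peak with the given per-clock throughput."""
    return peak_tflops * 1e12 / (lanes * flops_per_lane_per_clock) / 1e9


clock = implied_clock_ghz(PEAK_FP64_TFLOPS, STREAM_PROCESSORS, FLOPS_PER_FMA * FP64_RATE)
print(f"implied peak clock: {clock:.2f} GHz")
```

Under these assumptions the peak corresponds to a clock of roughly 1.5 GHz.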
Specific and peak performance of MI100
In less precise formats the newcomer is proportionally faster, and it is especially good at matrix calculations: in FP32, performance reaches 46.1 Tflops, and in bf16, the new format optimized for machine learning, 92.3 Tflops; the previous generation of Instinct accelerators cannot perform such calculations at all. Depending on the data type, the advantage of MI100 over MI50 ranges from 1.74x to 6.97x. NVIDIA A100 is still noticeably faster in these matrix tasks, but it loses in FP64/FP32.
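How "proportional" the scaling is can be checked directly from the peak figures quoted in the article. The short sketch below divides each quoted peak by the 11.5 Tflops FP64 peak; the dictionary labels are illustrative.

```python
# Peak figures quoted in the article, in Tflops.
PEAK_TFLOPS = {
    "FP64 vector": 11.5,
    "FP32 matrix": 46.1,
    "bf16 matrix": 92.3,
}

baseline = PEAK_TFLOPS["FP64 vector"]
speedups = {name: value / baseline for name, value in PEAK_TFLOPS.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x the FP64 peak")
```

The matrix peaks come out at roughly four and eight times the FP64 figure, so each step down in precision here doubles throughput.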