Architecture: Skylake-X versus Skylake (1)
You could say that Skylake-X is a variant of Skylake, albeit with more cores, an extended PCI-Express controller and a doubled memory controller. It is true that the Skylake-X CPUs are based on the same architecture that we covered in our Skylake architecture review quite some time ago. That said, we would not do these new CPUs justice if we left it at that. Aside from the number of cores, PCIe lanes and memory channels, there are other differences as well.
Extra instructions: AVX-512
First of all, the floating-point execution units within the Skylake-X cores have been made suitable for the new AVX-512 instructions, the successor to AVX 2.0. Where AVX 2.0 could operate on 256 bits of data in one go, AVX-512 can operate on 512 bits of data at once.
Aside from many new instructions and existing registers that have been widened to 512 bits, AVX-512 also offers more registers. In total, the registers can hold 8x as much data right next to the execution units as they could when SSE was the latest instruction set extension. Extra registers have also been added for masking, and AVX-512 contains all sorts of new instructions to speed up various algorithms.
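To make the 8x figure and the idea of masking a little more concrete, here is a minimal Python sketch. The register counts are our assumption for 64-bit mode (16 SSE registers of 128 bits versus 32 AVX-512 registers of 512 bits), and `masked_add` is a hypothetical helper that only imitates what a masked vector addition does; it is not real AVX-512 code.

```python
# Assumed register files in 64-bit mode (not stated in the article):
# SSE exposes 16 registers of 128 bits, AVX-512 exposes 32 of 512 bits.
SSE_REGS, SSE_BITS = 16, 128
AVX512_REGS, AVX512_BITS = 32, 512


def register_file_bytes(count, width_bits):
    """Total number of bytes that `count` registers of `width_bits` can hold."""
    return count * width_bits // 8


sse_bytes = register_file_bytes(SSE_REGS, SSE_BITS)
avx512_bytes = register_file_bytes(AVX512_REGS, AVX512_BITS)
ratio = avx512_bytes // sse_bytes


def masked_add(dst, a, b, mask):
    """Imitate a masked vector add (hypothetical helper).

    Lane i receives a[i] + b[i] when bit i of `mask` is set;
    otherwise the old value dst[i] is kept.
    """
    return [
        x + y if (mask >> i) & 1 else old
        for i, (old, x, y) in enumerate(zip(dst, a, b))
    ]


if __name__ == "__main__":
    print(f"SSE: {sse_bytes} bytes, AVX-512: {avx512_bytes} bytes, ratio {ratio}x")
    # Only lanes 0 and 2 are updated, because only bits 0 and 2 are set.
    print(masked_add([0, 0, 0, 0], [1, 2, 3, 4], [10, 20, 30, 40], 0b0101))
```

The point of the mask is that a single instruction can update only some of the lanes in a 512-bit register, which otherwise requires separate blend or branch steps.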
From ring bus to mesh
Another new feature is that Intel no longer uses the ring bus architecture for Skylake-X, something they have been using since 2010. Instead, the different components of the processor can communicate with each other via a mesh network.
The ring bus was first introduced in 2010 with the Sandy Bridge generation of processors. For this generation Intel released desktop CPUs with up to 4 cores and server models with up to 8 cores. The ring bus made it relatively easy for Intel to design chips with more or fewer cores, because each extra core could be added, together with its attached piece of L3 cache, as a slice in the chip design. The ring bus connects every component of the chip and works as a train track of sorts, with a station at every core. Data can be sent to every chip component over this bidirectional track, where transport from one station to the next takes one clock cycle.
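The train-track picture can be turned into a small calculation. The sketch below assumes an idealised ring in which every station is one clock cycle from its neighbours and data always takes the shorter direction; the function names and the 8-station example are ours, not Intel's.

```python
def ring_hops(src, dst, stops):
    """Fewest station-to-station steps between two stops on a bidirectional ring."""
    if not (0 <= src < stops and 0 <= dst < stops):
        raise ValueError("stop index out of range")
    clockwise = (dst - src) % stops
    # Data can travel either way round, so take the shorter direction.
    return min(clockwise, stops - clockwise)


def worst_case_ring_hops(stops):
    """Largest distance on the ring: the stop directly opposite."""
    return stops // 2


if __name__ == "__main__":
    stops = 8  # illustrative ring with 8 stations (an assumption)
    for dst in range(stops):
        print(f"0 -> {dst}: {ring_hops(0, dst, stops)} cycle(s)")
    print("worst case:", worst_case_ring_hops(stops))
```

Under these assumptions the worst case grows linearly with the number of stations, which is why a single ring becomes less attractive as core counts rise.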
The ring bus was first introduced with the 2010 Sandy Bridge processors and allowed Intel to easily release chip variants with more or fewer cores.
Over the last few years the number of cores in Intel’s server processors has exploded. The current Broadwell-generation Xeon processors offer up to 24 cores. To support this, Intel had to resort to two ring buses connected via special nodes, as it has done for the last two generations. Crossing these nodes adds a latency of five clock cycles, which means that in the worst case (when the core at the bottom left of the chip needs data from the L3 cache slice attached to the core at the top right) the latency is 14 clock cycles. With even more cores, which is what Intel planned for the current Skylake-generation Xeon processors, the ring bus would have limited performance too much.
While Intel still uses a ring bus for the desktop and laptop Skylake processors (with a maximum of four cores), this was simply not feasible for the Skylake server CPUs, where it wanted to increase the number of cores again. Intel therefore took a new approach for this generation of server CPUs. Within the different versions of the Skylake server chips, the cores are arranged in a sort of matrix structure resembling a chessboard. Communication lines run across this chessboard both horizontally and vertically. Through this so-called mesh network of communication channels, the different components of the chip can communicate with each other. Once again, each station adds one clock cycle of latency. Because the chip has more communication channels than with one or two ring buses, the total bandwidth available for communication between chip components increases significantly.
The chip design is still modular: among the cores, Intel places the memory controllers on the left and right, while all other components, such as the PCI-Express controller and the links for communication with other sockets, are placed at the top of the chip. As long as the desired number of cores fits inside such a matrix (less the one or two positions taken by the memory controllers), Intel can easily produce variants with more or fewer cores.
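As a rough illustration of why a mesh scales better, the sketch below compares a chessboard-style grid with a single ring that has the same number of stations. It follows the simplification used here of one clock cycle per station, and assumes that data moves only horizontally and vertically with no wrap-around links; the 4 x 5 grid is just an example size and the function names are ours.

```python
def mesh_hops(src, dst):
    """Hops between two (row, col) positions on a grid without wrap-around."""
    (r1, c1), (r2, c2) = src, dst
    return abs(r1 - r2) + abs(c1 - c2)


def mesh_links(rows, cols):
    """Number of neighbour-to-neighbour links in a rows x cols grid."""
    return rows * (cols - 1) + cols * (rows - 1)


def ring_worst_case(stops):
    """Worst-case hops on a single bidirectional ring with `stops` stations."""
    return stops // 2


ROWS, COLS = 4, 5  # example grid size (an assumption for illustration)

# Corner to opposite corner is the longest trip on the grid.
mesh_worst = mesh_hops((0, 0), (ROWS - 1, COLS - 1))
ring_worst = ring_worst_case(ROWS * COLS)

if __name__ == "__main__":
    print(f"mesh {ROWS}x{COLS}: worst case {mesh_worst} hops, "
          f"{mesh_links(ROWS, COLS)} links")
    print(f"ring with {ROWS * COLS} stops: worst case {ring_worst} hops, "
          f"{ROWS * COLS} links")
```

In this idealised model the grid has both a shorter worst-case path and more links carrying traffic in parallel than the ring. Real chips add routing and arbitration overhead that the sketch ignores.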
Intel released a die-shot of the 18-core variant of Skylake-X. In it we can clearly see the 4 x 5 matrix, made up of the 18 cores and the two memory controllers (bear in mind that the die-shot is rotated 90 degrees to the left compared with the diagram above). The 12-, 14- and 16-core models will be based on the same chip, but with some cores turned off.
The 6-, 8- and 10-core Skylake-X chips are all based on a smaller chip with a maximum of 10 cores. On this chip the positions are clearly arranged in a 4 x 3 matrix, two of which are again taken by the memory controllers.
A die-shot of the 18-core top model from the Skylake-X series.
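The core counts of the two chips follow directly from the matrix dimensions. The short sketch below redoes that arithmetic, assuming that two positions on each chip go to the memory controllers; the names `DIES`, `cores_in_grid` and `smallest_die_for` are ours and purely illustrative.

```python
def cores_in_grid(rows, cols, reserved=2):
    """Cores in a rows x cols matrix when `reserved` positions hold memory controllers."""
    total = rows * cols
    if reserved > total:
        raise ValueError("more reserved positions than grid positions")
    return total - reserved


# Matrix sizes taken from the article; two memory controllers per chip is assumed.
DIES = {
    "18-core chip": cores_in_grid(4, 5),
    "10-core chip": cores_in_grid(4, 3),
}


def smallest_die_for(core_count):
    """Name of the smallest chip that can supply `core_count` active cores."""
    fitting = [(cores, name) for name, cores in DIES.items() if cores >= core_count]
    if not fitting:
        raise ValueError("no chip has that many cores")
    return min(fitting)[1]


if __name__ == "__main__":
    for model in (6, 8, 10, 12, 14, 16, 18):
        print(f"{model}-core model -> {smallest_die_for(model)}")
```

Run as a script, this maps the 6-, 8- and 10-core models to the smaller chip and the 12- to 18-core models to the larger one, matching the split described above.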