Better performance doesn’t always require the latest technology. Proof of this is Sunway’s OceanLight computer, which is located in the National Supercomputing Center of China and, according to reports last year, has exceeded the limit of 1 ExaFLOPS. In the Linpack benchmark, the computer achieved an Rmax value of 1.05 ExaFLOPS, while the Rpeak performance is said to be 1.3 ExaFLOPS.
Although much faster than the current leader of the Top500 list, Japan's Fugaku (442 PFLOPS Rmax), the supercomputer was never officially submitted to the list. China is even said to have a faster system still in Tianhe-3 in Tianjin, with an estimated 1.3 ExaFLOPS (Rmax) and 1.7 ExaFLOPS of theoretical peak performance (Rpeak).
The website nextplatform.com has dug into the details, small and large, of the OceanLight supercomputer's architecture. The basis was a paper published by Alibaba Group, Tsinghua University, DAMO Academy, Zhejiang Lab and the Beijing Academy of Artificial Intelligence. The actual subject of the paper is a machine-learning model called BaGuaLu, but parts of the OceanLight architecture are discussed along the way. BaGuaLu runs on more than 37 million cores with 14.5 trillion parameters (apparently in single-precision FP32) and can reportedly scale up to 174 trillion parameters, which roughly corresponds to so-called "brain scale": a parameter count comparable to the number of synapses in the human brain.
SW26010-Pro Compute Engine in detail
Last year, nextplatform.com looked at how the exascale system might have been built by the National Research Center of Parallel Computer Engineering and Technology (NRCPC). It concluded that the 14 nm chips used here may be capped at clock speeds similar to those of the 260-core SW26010 processors used in Sunway TaihuLight, which are manufactured at 28 nm, in order to keep temperatures under control. It also speculated that the arithmetic units' vector width would be doubled to 512 bits, as would the memory attached to the respective nodes.
At least on the latter points, nextplatform.com's thesis proved correct; only its estimate of the clock rates of the SW26010-Pro processors was off. Meanwhile, a diagram provides a more detailed look at the processor's compute engine.
Source: via nextplatform.com
Its computing cores are divided into a total of six groups of eight-by-eight cores (Computing Processing Elements, CPEs); each group also contains a larger Management Processing Element (MPE) along with a DDR4 memory interface (16 GB, 51.4 GB/s). Each CPE carries 256 KB of L2 cache and a total of four logic blocks that handle FP64 and FP32 computations on one side and FP16 and BF16 on the other. The six CPE clusters are connected via a ring interconnect with two network mesh interfaces.
As for the performance of a single SW26010-Pro, 14.03 TeraFLOPS are available for FP64 and FP32 calculations, while FP16 and BF16 calculations reach 55.3 TeraFLOPS. For the OceanLight system configuration, the paper mentions a tested size of 107,520 nodes with one SW26010-Pro per node, for a total of 41.93 million computing cores spread across 105 cabinets.
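These figures can be cross-checked with a little arithmetic. The sketch below assumes the layout described above, six groups of 64 CPEs plus one MPE each, which yields 390 cores per chip:

```python
# Back-of-the-envelope check of the OceanLight core counts, based on
# the figures quoted above (assumed layout: 6 groups x (64 CPEs + 1 MPE)).

cpe_groups = 6
cpes_per_group = 8 * 8          # 64 CPEs in an 8x8 grid
mpes_per_group = 1

cores_per_chip = cpe_groups * (cpes_per_group + mpes_per_group)  # 390
nodes = 107_520                  # one SW26010-Pro chip per node
total_cores = nodes * cores_per_chip

print(cores_per_chip)            # 390
print(total_cores / 1e6)         # ~41.93 million cores
print(nodes / 105)               # 1024 nodes per cabinet
```

With 390 cores per chip, 107,520 nodes indeed land at 41.93 million cores, and 105 cabinets work out to 1,024 nodes per cabinet.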
Assumed clock rate of 2.22 GHz and scalability up to 160 cabinets
The performance of the system is illustrated in another chart. The individual nodes are combined into supernodes, which in turn are connected in a three-level, non-blocking fat-tree topology. The interconnect is proprietary, according to the paper, although nextplatform.com believes it may be a modified version of the InfiniBand variant used in the original TaihuLight system. No firm figures on the clock rates of the individual SW26010-Pro chips can be found, but based on the key data shown, including the performance figures, a clock speed of around 2.22 GHz can be calculated for the processors.
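The clock figure can be approximated with a rough back-calculation. The sketch below rests on an illustrative assumption not stated in the paper: that each of the 390 cores retires 16 FP64 FLOPs per cycle, i.e. one 512-bit FMA (8 doubles times 2 operations). Under that assumption the quoted per-chip peak implies a clock close to, though not exactly, 2.22 GHz, so nextplatform.com presumably used slightly different assumptions:

```python
# Rough clock-rate estimate for the SW26010-Pro from its quoted peak.
# Assumption (illustrative, not from the paper): every core retires
# 16 FP64 FLOPs per cycle, e.g. one 512-bit FMA (8 doubles x 2 ops).

peak_fp64 = 14.03e12        # FLOPS per chip, as quoted above
cores = 390                 # 384 CPEs + 6 MPEs
flops_per_cycle = 16

clock_hz = peak_fp64 / (cores * flops_per_cycle)
print(round(clock_hz / 1e9, 2))   # ~2.25 GHz, in the ballpark of 2.22 GHz
```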
Source: BaGuaLu
If you take the previously tested configuration of 107,520 nodes as the basis, a theoretical peak performance of 1.51 ExaFLOPS results. However, the OceanLight system can reportedly also be expanded to 160 cabinets, which would mean 163,840 nodes and a peak of 2.3 ExaFLOPS for FP64 and FP32 calculations, while 120 cabinets would yield a peak of 1.72 ExaFLOPS.
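The cabinet-scaling numbers follow directly from the per-chip peak of 14.03 TFLOPS and 1,024 nodes per cabinet; a minimal sketch:

```python
# Peak FP64/FP32 throughput as a function of cabinet count, using the
# per-chip peak of 14.03 TFLOPS and 1,024 nodes (one chip each) per cabinet.

per_chip_tflops = 14.03
nodes_per_cabinet = 1024

def peak_exaflops(cabinets: int) -> float:
    return cabinets * nodes_per_cabinet * per_chip_tflops / 1e6

print(round(peak_exaflops(105), 2))  # ~1.51 EFLOPS (tested configuration)
print(round(peak_exaflops(120), 2))  # ~1.72 EFLOPS
print(round(peak_exaflops(160), 2))  # ~2.3 EFLOPS
```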
The BaGuaLu model, the paper's main topic, meanwhile ran on an OceanLight partition with 96,000 nodes and 37.44 million cores in total, with 14.5 trillion parameters. If said parameters were switched to BF16 or FP16, the system could already handle 29 trillion parameters, while a setup with 160 cabinets would allow 49.5 trillion. How the 174 trillion parameters mentioned in the paper are to be reached is not entirely settled in the nextplatform.com report; that number could, at least in theory, be achieved by adding support for the INT8 and INT4 data formats. By our own calculations, INT4 would even allow 198 trillion parameters.
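These parameter counts can be reproduced under a simple assumption: capacity scales inversely with the bytes per parameter and linearly with the node count, starting from the reported baseline of 14.5 trillion FP32 parameters on 96,000 nodes. A sketch under those assumptions:

```python
# Parameter capacity as a function of data format and node count,
# scaled from the reported baseline of 14.5 trillion FP32 parameters
# on 96,000 nodes (assumption: capacity ~ nodes / bytes_per_param).

base_params = 14.5e12       # FP32 parameters on the tested partition
base_nodes = 96_000
bytes_per_param = {"fp32": 4, "bf16": 2, "int8": 1, "int4": 0.5}

def capacity_trillions(fmt: str, nodes: int = base_nodes) -> float:
    scale = (4 / bytes_per_param[fmt]) * (nodes / base_nodes)
    return base_params * scale / 1e12   # in trillions of parameters

print(capacity_trillions("bf16"))                     # 29.0 trillion
print(round(capacity_trillions("bf16", 163_840), 1))  # ~49.5 trillion
print(round(capacity_trillions("int4", 163_840)))     # ~198 trillion
```

Halving to BF16 gives the 29 trillion figure, scaling to 163,840 nodes (160 cabinets) gives 49.5 trillion, and INT4 at full build-out lands at roughly 198 trillion, matching the calculation above.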