
China Telecom Completes the Industry's First 1024-GPU Distributed Lossless AI Computing Cluster over Its Field-Deployed 800G WDM Network

China Telecom recently used 800G/λ and C+L band technologies to provide high bandwidth to a distributed cluster of 1024 GPUs over its field-deployed network, completing distributed training of a 175-billion-parameter GPT-3 model (GPT3-175B) across 120 km of field-deployed G.652 fiber.

With both the pipeline parallel (PP) and data parallel (DP) strategies, training performance reached more than 95% of that of centralized training. The trial also showed that high-bandwidth, highly reliable, and efficient optical transmission networks can serve as a solid foundation for AI computing interconnection.

A single AI data center (DC) can now host more than 10,000 or even 100,000 GPUs, pushing interconnection bandwidth demand to 100 Tbit/s and, in some cases, beyond the Pbit/s level. At this scale, the bandwidth, reliability, and efficiency of the optical transmission system are critical to sustaining high computing efficiency in distributed training. To meet the demand for massive data transmission, an 800G/λ line rate with a high-order modulation format is adopted to improve spectral efficiency, and the industry-standard C+L band is used to provide ultra-high transmission bandwidth. China Telecom has deployed an intelligent computing verification network between the Wuqing and Runze AI DCs in Tianjin, supporting high-bandwidth interconnection through multiple loopbacks over 120 km.

To ensure reliable data transmission, tests were conducted on link bit errors, wavelength failures, and fiber faults. The results show that the failure of a single 800G wavelength can reduce computing efficiency by more than 40%, while fiber interruptions longer than 100 ms may significantly degrade performance or even halt training tasks. WSON-based rerouting protection is therefore used to keep the service restoration time between AI DCs within 50 ms, maintaining reliable interconnection for distributed AI services and preserving computing efficiency.

To improve link utilization, China Telecom has proposed a solution that provisions and tears down wavelengths dynamically within minutes. This enables time-based orchestration of computing and network resources and raises overall network resource utilization. The trial lays the groundwork for cross-regional, cross-layer, and cross-domain collaborative scheduling of computing resources, and represents a significant advance in China Telecom's cloud-network integration efforts.
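As a rough back-of-the-envelope illustration of the bandwidth figures above, the following Python sketch estimates how many 800G wavelengths would be needed to carry a given interconnection demand. It is a simplification, not a description of the trial's actual configuration: it assumes the full 800 Gbit/s line rate is available as payload and ignores FEC overhead, protection capacity, and per-fiber channel limits. The function name `wavelengths_needed` is hypothetical and chosen only for this example.

```python
import math

# Per-wavelength line rate quoted in the article (800G/λ), in Gbit/s.
WAVELENGTH_RATE_GBPS = 800


def wavelengths_needed(demand_gbps: float,
                       rate_gbps: float = WAVELENGTH_RATE_GBPS) -> int:
    """Return the minimum number of wavelengths whose combined raw
    line rate covers the demand.

    Assumption: the whole line rate counts as usable capacity
    (no FEC overhead, no spare capacity for protection).
    """
    if demand_gbps < 0 or rate_gbps <= 0:
        raise ValueError("demand must be >= 0 and rate must be > 0")
    return math.ceil(demand_gbps / rate_gbps)


if __name__ == "__main__":
    # Demand levels mentioned in the article, converted to Gbit/s.
    demands = {"100 Tbit/s": 100_000, "1 Pbit/s": 1_000_000}
    for label, gbps in demands.items():
        count = wavelengths_needed(gbps)
        print(f"{label}: at least {count} wavelengths at 800G each")
```

Under these assumptions, 100 Tbit/s corresponds to 125 wavelengths and 1 Pbit/s to 1,250, which gives a sense of why both a high per-wavelength rate and a wide usable band matter for interconnecting AI DCs.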

Going forward, China Telecom will continue to drive innovation and enhance computing capabilities through advanced network connectivity. It aims to establish a robust optical foundation for intelligent computing interconnection, built on high-bandwidth, highly reliable, and efficient optical transmission networks. The company will accelerate the development of integrated digital infrastructure for cloud-network synergy and explore new paradigms for intelligent computing architectures, with the goal of supporting the intelligent transformation of diverse industries.


