Si-Kintsugi: Towards Recovering Golden-Like Performance of Defective Many-Core Spatial Architectures for AI
The growing demand for higher compute and memory capacity driven by artificial intelligence (AI) applications pushes higher core counts in modern systems. Many-core architectures exhibiting spatial interconnects with high on-chip bandwidth are ideal for these workloads due to their data movement flexibility and sheer parallelism. However, the size of such platforms makes them particularly susceptible to manufacturing defects, prompting a need for designs and mechanisms that improve yield. Despite these techniques, nonfunctional cores and links are unavoidable. Although prior works address defective cores by disabling them and only scheduling workload to functional ones, communication latency through spatial interconnects is tightly associated with the locations of defective cores and cores with assigned work. Based on this observation, we present Si-Kintsugi, a defect-aware workload scheduling framework for spatial architectures with mesh topology. First, we design a novel and generalizable workload mapping representation and cost function that integrates defect pattern information. The mapping representation is formed into a 1D vector with simple constraints, making it an ideal candidate for open source heuristic-based optimization algorithms. After a communication latency optimized workload mapping is found, dataflow between the mapped cores is automatically generated to balance communication and computation cost. Si-Kintsugi is extensively evaluated on various workloads (i.e., BERT, ResNet, GEMM) across a wide range of defect patterns and rates. Experiment results show that Si-Kintsugi generates a workload schedule that is on average 1.34 × faster than the industry standard layer-pipelined schedule on defective platforms.