Rethinking Latency-Aware DNN Design With GPU Tail Effect Analysis

Publication, Journal Article

Yu, F; Xu, Z; Shangguan, L; Wang, D; Stamoulis, D; Madhok, R; Karianakis, N; Li, A; Liu, CC; Chen, Y; Chen, X

Published in: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

January 1, 2024

As Deep Neural Networks (DNNs) continue to grow in size, their runtime latency grows as well. While model pruning and Neural Architecture Search (NAS) can effectively reduce the computation workload, that reduction does not consistently translate into lower runtime latency. In this paper, we identify the root cause of this mismatch between workload reduction and latency reduction as the GPU tail effect, a classic system issue caused by resource under-utilization in the last processing wave of the GPU. We conduct detailed DNN workload characterization and demonstrate the prevalence of the GPU tail effect across different DNN architectures, and we reveal that the unique deep structure and light-weight layer workloads of DNNs exacerbate the tail effect for DNN inference. We then propose a tail-awareness design space enhancement and DNN optimization algorithm to optimize existing NAS and pruning designs and achieve better runtime latency and model accuracy. Extensive experiments show 11%-27% latency reduction over SOTA DNN pruning and NAS methods.
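The tail effect described in the abstract can be illustrated with simple wave arithmetic. The sketch below is not the paper's algorithm; it assumes a simplified model in which every thread block takes the same time, so kernel latency is roughly proportional to the number of waves. The function name `wave_stats` and the device figures (80 SMs, one resident block per SM) are hypothetical and chosen only for illustration.

```python
import math
from typing import Tuple


def wave_stats(num_blocks: int, num_sms: int, blocks_per_sm: int) -> Tuple[int, float]:
    """Return (waves, tail_utilization) for one kernel launch.

    Simplified model (an assumption, not the paper's method): the GPU runs
    num_sms * blocks_per_sm thread blocks concurrently, and all blocks
    take the same amount of time.
    """
    if num_blocks < 1 or num_sms < 1 or blocks_per_sm < 1:
        raise ValueError("all arguments must be positive")
    slots = num_sms * blocks_per_sm            # blocks that fit in one wave
    waves = math.ceil(num_blocks / slots)      # full waves plus a partial tail wave
    tail_blocks = num_blocks - (waves - 1) * slots
    return waves, tail_blocks / slots          # fraction of slots busy in the last wave


# Hypothetical device: 80 SMs, 1 resident block per SM.
original = wave_stats(170, 80, 1)  # 3 waves, last wave only 12.5% utilized
pruned = wave_stats(161, 80, 1)    # about 5% fewer blocks, still 3 waves
aligned = wave_stats(160, 80, 1)   # one block fewer again: 2 full waves
print(original, pruned, aligned)
```

Under this model, reducing the workload from 170 to 161 blocks leaves the wave count unchanged, while reducing it to 160 blocks removes the tail wave entirely. This is one way to see why workload reduction and latency reduction can diverge.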

Duke Scholars

Author: Yiran Chen, Electrical and Computer Engineering

Published In

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

DOI

10.1109/TCAD.2024.3404413

EISSN

1937-4151

ISSN

0278-0070

Publication Date

January 1, 2024

Related Subject Headings

Computer Hardware & Architecture
4607 Graphics, augmented reality and games
4009 Electronics, sensors and digital hardware
1006 Computer Hardware
0906 Electrical and Electronic Engineering

Citation

APA: Yu, F., Xu, Z., Shangguan, L., Wang, D., Stamoulis, D., Madhok, R., … Chen, X. (2024). Rethinking Latency-Aware DNN Design With GPU Tail Effect Analysis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. https://doi.org/10.1109/TCAD.2024.3404413

Chicago: Yu, F., Z. Xu, L. Shangguan, D. Wang, D. Stamoulis, R. Madhok, N. Karianakis, et al. “Rethinking Latency-Aware DNN Design With GPU Tail Effect Analysis.” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, January 1, 2024. https://doi.org/10.1109/TCAD.2024.3404413.

ICMJE: Yu F, Xu Z, Shangguan L, Wang D, Stamoulis D, Madhok R, et al. Rethinking Latency-Aware DNN Design With GPU Tail Effect Analysis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. 2024 Jan 1;

MLA: Yu, F., et al. “Rethinking Latency-Aware DNN Design With GPU Tail Effect Analysis.” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Jan. 2024. Scopus, doi:10.1109/TCAD.2024.3404413.

NLM: Yu F, Xu Z, Shangguan L, Wang D, Stamoulis D, Madhok R, Karianakis N, Li A, Liu CC, Chen Y, Chen X. Rethinking Latency-Aware DNN Design With GPU Tail Effect Analysis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. 2024 Jan 1;
