Author : Uday Bhanu Sharma Mallappa
Book Synopsis: AI for Design Optimization and Design for AI Acceleration by Uday Bhanu Sharma Mallappa
Download or read book AI for Design Optimization and Design for AI Acceleration, written by Uday Bhanu Sharma Mallappa and released in 2022. Available in PDF, EPUB and Kindle. Book excerpt:

Integrated circuit (IC) design at the scale of billions of circuit elements would be unimaginable without the software and services of the Electronic Design Automation (EDA) industry. Today, however, designers using these EDA tools and flows are confronted by long runtimes, high design costs, and low power, performance and area (PPA) gains when transitioning to the latest process nodes. Long tool runtimes and high tool-license costs make thorough design-space exploration prohibitively expensive. Furthermore, pessimistic margins introduced at various stages of the EDA flow to balance the accuracy-runtime tradeoff result in suboptimal design implementations. To counter these issues and keep pace with the market's PPA expectations, this dissertation contributes to two promising opportunities at the top of the computing stack: (1) algorithmic improvements and (2) domain-specialized hardware.

For the algorithmic contributions, we exploit AI-based techniques (i) to reduce the design and schedule costs of advanced-node IC design, and (ii) to efficiently search for optimal design implementations. A significant portion of the design cycle is spent on static timing analysis (STA) at multiple corners and multiple modes (MCMM). To address the schedule costs of STA engines, we propose a learning model that accurately predicts expensive path-based analysis (PBA) results from pessimistic graph-based analysis (GBA). We also devise a learning-based MCMM timing model that predicts accurate timing results at unobserved signoff corners from the timing results of a small subset of corners. Our PBA-GBA model reduces the maximum PBA-GBA divergence from 50.78ps to 39.46ps for a 350K-instance design in a 28nm FDSOI foundry enablement. Our MCMM timing prediction model uses timing results from 10 observed corners to predict timing results at the remaining 48 unobserved corners with less than 0.5% relative root mean squared error (RMSE), for a 1M-instance design in a 16nm enablement.

Besides STA, two of the most important and critical phases of the IC design cycle are the placement of standard cells and routing at various abstraction levels. To demonstrate the use of learning-based models for efficient search of optimal placement implementations, we propose RLPlace, a reinforcement learning (RL)-based framework for detailed placement optimization. Starting from the global placement output of two critical IPs, RLPlace achieves up to 1.35% half-perimeter wirelength (HPWL) improvement over a commercial tool's detailed placement results. To efficiently search for optimal routing solutions in network-based communication systems, we propose a satisfiability modulo theories (SMT)-based framework that jointly determines routing and virtual channel (VC) assignment in network-on-chip (NoC) design. Our formulation guarantees deadlock freedom while achieving up to 30% better performance than state-of-the-art application-aware oblivious routing algorithms. Minimal sketches of the PBA-GBA predictor, the RL placement loop, and the SMT routing formulation follow.
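The PBA-from-GBA predictor can be viewed as a supervised regression from per-endpoint GBA features to PBA timing. The sketch below trains a gradient-boosted regressor on synthetic data; the feature set, target construction, and learner are illustrative assumptions, not the dissertation's actual formulation. The MCMM corner-prediction model can be pictured analogously, regressing unobserved-corner timing on observed-corner timing.

```python
# Minimal sketch of a PBA-from-GBA slack predictor on synthetic data.
# Feature names and the learner are assumptions for illustration only.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins for per-endpoint GBA features: arrival time, transition
# (slew), path depth, and GBA slack. A real flow would extract these
# from STA reports.
n = 5000
X = rng.normal(size=(n, 4))
# Synthetic target: PBA slack as GBA slack plus a feature-dependent
# correction (PBA is less pessimistic than GBA).
y = X[:, 3] + 0.2 * np.tanh(X[:, 0]) + 0.05 * rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor().fit(X_tr, y_tr)
print("held-out R^2:", model.score(X_te, y_te))
```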
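RLPlace's detailed-placement refinement can be caricatured as an agent that proposes local cell swaps and is rewarded by HPWL reduction. The bandit-style toy below is a heavily simplified stand-in; RLPlace's actual state, action, and reward design is not reproduced here.

```python
# Bandit-style toy of RL-guided detailed placement on a single row:
# actions swap adjacent slots, reward is the HPWL reduction, and
# worsening swaps are undone. Illustrative only.
import random

random.seed(0)

placement = list(range(8))               # slot -> cell id
nets = [(0, 5), (1, 6), (2, 7), (3, 4)]  # 2-pin nets over cell ids

def hpwl(pl):
    # 1-D half-perimeter wirelength: span of each net's pin positions.
    pos = {cell: slot for slot, cell in enumerate(pl)}
    return sum(abs(pos[a] - pos[b]) for a, b in nets)

q = {a: 0.0 for a in range(len(placement) - 1)}  # value of each swap
eps, alpha = 0.2, 0.5
for _ in range(2000):
    if random.random() < eps:
        a = random.randrange(len(placement) - 1)  # explore
    else:
        a = max(q, key=q.get)                     # exploit
    before = hpwl(placement)
    placement[a], placement[a + 1] = placement[a + 1], placement[a]
    reward = before - hpwl(placement)
    if reward < 0:                                # undo worsening swaps
        placement[a], placement[a + 1] = placement[a + 1], placement[a]
    q[a] += alpha * (reward - q[a])

print("final HPWL:", hpwl(placement))             # starts at 16
```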
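Joint routing and VC assignment can be phrased as an SMT satisfiability query. The toy below uses the z3-solver Python bindings for two flows with two candidate paths each; the topology, the two-VC budget, and the "shared link implies distinct VCs" constraint are simplifying assumptions standing in for the dissertation's full deadlock-freedom formulation.

```python
# Toy SMT formulation (Z3) of joint route selection and VC assignment
# for two flows; a sketch under strong simplifying assumptions.
from z3 import Int, And, Implies, Solver, sat

# Candidate paths per flow, each given as a set of directed links.
paths = {
    "f0": [{"A->B", "B->D"}, {"A->C", "C->D"}],
    "f1": [{"A->B", "B->D"}, {"A->C", "C->D"}],
}

s = Solver()
choice = {f: Int(f + "_path") for f in paths}  # index of chosen path
vc = {f: Int(f + "_vc") for f in paths}        # VC assigned to flow

for f in paths:
    s.add(choice[f] >= 0, choice[f] < len(paths[f]))
    s.add(vc[f] >= 0, vc[f] < 2)               # two virtual channels

# If the chosen paths share a link, force the flows onto different VCs
# (a simplified stand-in for breaking channel-dependency cycles).
for i, p0 in enumerate(paths["f0"]):
    for j, p1 in enumerate(paths["f1"]):
        if p0 & p1:
            s.add(Implies(And(choice["f0"] == i, choice["f1"] == j),
                          vc["f0"] != vc["f1"]))

if s.check() == sat:
    m = s.model()
    print({f: (m[choice[f]].as_long(), m[vc[f]].as_long())
           for f in paths})
```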
To exemplify the performance and energy benefits of domain-specialized hardware, we propose two novel hardware accelerators for image classification tasks. To alleviate the computation and energy burden of neural network inference, we focus on two key areas: (i) skipping unnecessary computations, and (ii) maximizing the reuse of redundant computations. Our TermiNETor framework skips ineffectual computations during the inference of image classification tasks. TermiNETor relies on bit-serial weight processing to dynamically predict and skip computations whose results are unnecessary downstream. TermiNETor achieves up to 1.7x reduction in operation count compared to a non-skipping baseline without accuracy degradation, and its hardware implementation improves average energy efficiency by 3.84x over SCNN [6] and by 1.98x over FuseKNA [7].

Our second accelerator, PatterNet, demonstrates the performance and energy benefits of reusing redundant computations during the inference phase of image classification. PatterNet is based on patterned neural networks for computation reuse and is supported by a novel pattern-stationary architecture. With similar accuracy, PatterNet reduces memory footprint and operation count by up to 80.2% and 73.1%, respectively, and is 107x more energy efficient than an Nvidia GTX 1080. We demonstrate silicon implementations of the PatterNet and TermiNETor accelerators in a TSMC 40nm foundry enablement. Toy sketches of the early-termination and computation-reuse ideas follow.
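The early-termination idea behind TermiNETor can be illustrated with MSB-first bit-serial arithmetic: once even the most optimistic completion of the remaining weight bits cannot lift a pre-activation above zero, the ReLU output is already known to be 0 and the remaining bit-cycles can be skipped. The sketch below captures only this idea; it is not TermiNETor's actual predictor or dataflow.

```python
# Sketch of an MSB-first bit-serial dot product with early termination:
# stop as soon as the ReLU output is provably zero. Illustrative of
# the skipping idea only; not TermiNETor's exact mechanism.
def bitserial_relu_dot(acts, weights, bits=8):
    """acts: non-negative activations; weights: signed ints
    representable in `bits`-bit two's complement."""
    partial = 0
    for b in reversed(range(bits)):
        # Two's complement: the sign bit carries weight -2**(bits-1).
        scale = -(1 << b) if b == bits - 1 else (1 << b)
        partial += scale * sum(a * ((w >> b) & 1)
                               for a, w in zip(acts, weights))
        # Best case for the remaining lower bits: all ones everywhere.
        optimistic = partial + ((1 << b) - 1) * sum(acts)
        if optimistic <= 0:
            return 0, bits - b        # ReLU output is 0; stop early
    return max(partial, 0), bits      # full bit-serial computation

out, cycles = bitserial_relu_dot([3, 1, 2], [-100, -5, -20])
print(out, "computed in", cycles, "of 8 bit-cycles")  # 0 in 1 cycle
```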
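PatterNet's computation reuse can be pictured as many filters sharing a small dictionary of weight patterns, so each pattern's partial result is computed once and reused across filters. The numpy sketch below shows only this reuse arithmetic; the pattern sizes and counts are arbitrary, and the pattern-stationary hardware dataflow is not modeled.

```python
# Sketch of pattern-based computation reuse: filters share a small
# dictionary of weight patterns, so each pattern's dot product with an
# input window is computed once and reused. Sizes are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
n_patterns, k, n_filters = 4, 9, 64

patterns = rng.integers(-2, 3, size=(n_patterns, k))  # shared dictionary
assign = rng.integers(0, n_patterns, size=n_filters)  # filter -> pattern
x = rng.standard_normal(k)                            # one input window

partial = patterns @ x      # compute each distinct pattern's result once
out = partial[assign]       # every filter reuses its pattern's result

naive = patterns[assign] @ x           # per-filter reference computation
assert np.allclose(out, naive)
print(f"{n_patterns} dot products served {n_filters} filters")
```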