CASPA Special Interest Group Seminar II: From 5,000X Model Compression to 50X Acceleration – Achieving Real-Time Execution of ALL DNNs on Mobile Devices – CASPA


Past Seminar Events September 4, 2019

Date: Wednesday, September 4, 2019

Time: 6:30 pm to 9:00 pm

Venue: ITRI International Inc. 2870 Zanker Road, #140 San Jose, CA 95134


Agenda:

6:30 pm – 7:00 pm Registration & Networking

7:00 pm – 7:15 pm CASPA introduction

7:15 pm – 8:15 pm From 5,000X Model Compression to 50X Acceleration – Achieving Real-Time Execution of ALL DNNs on Mobile Devices, Prof. Yanzhi Wang, Northeastern University

8:15 pm – 8:30 pm Questions/Answers

8:30 pm – 9:00 pm Networking

Speaker:

Dr. Yanzhi Wang,

Assistant Professor, Northeastern University

Dr. Yanzhi Wang is currently an assistant professor in the Department of Electrical and Computer Engineering at Northeastern University. He received his Ph.D. degree in Computer Engineering from the University of Southern California (USC) in 2014 and his B.S. degree with distinction in Electronic Engineering from Tsinghua University in 2009.

Dr. Wang’s current research interests mainly focus on DNN model compression and energy-efficient implementation (on various platforms). His research has maintained the highest model compression rates on representative DNNs since September 2018. His work on AQFP superconducting-based DNN acceleration achieves by far the highest energy efficiency among all hardware devices. His work has been published broadly in top conference and journal venues (e.g., ASPLOS, ISCA, MICRO, HPCA, ISSCC, AAAI, ICML, CVPR, ICLR, IJCAI, ECCV, ACM MM, DAC, ICCAD, FPGA, LCTES, CCS, VLDB, ICDCS, TComputer, TCAD, JSAC, Nature SP, etc.) and has been cited more than 4,400 times. He has received four Best Paper Awards, seven additional Best Paper Nominations, and three Popular Paper Awards.

Abstract:

This presentation focuses on two recent contributions to model compression and acceleration of deep neural networks (DNNs). The first is a systematic, unified DNN model compression framework based on the powerful optimization tool ADMM (Alternating Direction Method of Multipliers), which applies to non-structured and various types of structured weight pruning as well as to weight quantization of DNNs. It achieves unprecedented model compression rates on representative DNNs, consistently outperforming competing methods. When weight pruning and quantization are combined, we achieve up to 4,438X weight storage reduction without accuracy loss, which is two orders of magnitude higher than prior methods. Our most recent work presents a comprehensive comparison between non-structured and structured weight pruning with quantization in place, and suggests that non-structured weight pruning is not desirable on any hardware platform.
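To illustrate the general idea of ADMM-based weight pruning, here is a minimal sketch on a toy problem. It is not the speaker's framework: the loss is assumed to be a simple separable quadratic (so the weight update has a closed form), and the names `project_topk`, `admm_prune`, `targets`, and `rho` are hypothetical. In a real DNN the weight update would be a few steps of stochastic gradient descent on the training loss plus the quadratic penalty.

```python
def project_topk(v, k):
    """Euclidean projection onto vectors with at most k nonzero entries."""
    keep = set(sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)[:k])
    return [v[i] if i in keep else 0.0 for i in range(len(v))]


def admm_prune(targets, k, rho=1.0, iters=50):
    """ADMM for: minimize 0.5 * sum((w_i - t_i)^2) s.t. at most k nonzeros.

    W is the unconstrained copy of the weights, Z the sparse copy,
    and U the scaled dual variable tying the two together.
    """
    n = len(targets)
    Z = [0.0] * n
    U = [0.0] * n
    for _ in range(iters):
        # W-step: minimize loss + (rho/2) * ||W - Z + U||^2 (closed form here).
        W = [(targets[i] + rho * (Z[i] - U[i])) / (1.0 + rho) for i in range(n)]
        # Z-step: project W + U onto the sparsity constraint set.
        Z = project_topk([W[i] + U[i] for i in range(n)], k)
        # U-step: accumulate the remaining disagreement between W and Z.
        U = [U[i] + W[i] - Z[i] for i in range(n)]
    return Z


# Toy "pretrained" weights (assumed values); keep only the 2 most important.
targets = [0.05, -2.0, 0.1, 1.5, -0.02, 0.3]
pruned = admm_prune(targets, k=2)
```

The same loop structure extends to structured pruning or quantization by swapping `project_topk` for a projection onto the corresponding constraint set (for example, rounding to a fixed set of quantization levels).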

However, using mobile devices as an example, we show that existing model compression techniques, even when assisted by ADMM, are still difficult to translate into notable acceleration or real-time execution of DNNs. Therefore, we need to go beyond existing model compression schemes and develop novel schemes that are desirable for both algorithm and hardware. Compilers act as the bridge between algorithm and hardware, maximizing parallelism and hardware performance. We develop a combination of pattern pruning and connectivity pruning, which is desirable at the theory, algorithm, compiler, and hardware levels. We achieve 20 ms inference time for the large-scale DNN VGG-16 on a smartphone without accuracy loss, which is 50X faster than TensorFlow-Lite. The proposed framework can potentially enable real-time execution of all DNNs.
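The abstract does not define pattern pruning or connectivity pruning, so the following is only a rough sketch under stated assumptions: pattern pruning is taken to mean that each 3x3 convolution kernel keeps its weights only at positions given by one pattern from a small predefined library, and connectivity pruning is taken to mean zeroing out entire kernels. The pattern library, the selection rule (largest retained squared magnitude), and all function names are hypothetical and chosen for illustration.

```python
# Hypothetical pattern library: each pattern lists the 4 positions
# (row-major indices 0..8) that are kept in a 3x3 kernel.
PATTERNS = [
    (0, 1, 3, 4),
    (1, 2, 4, 5),
    (3, 4, 6, 7),
    (4, 5, 7, 8),
]


def apply_best_pattern(kernel):
    """Keep the pattern that preserves the most squared magnitude; zero the rest."""
    best = max(PATTERNS, key=lambda p: sum(kernel[i] ** 2 for i in p))
    return [w if i in best else 0.0 for i, w in enumerate(kernel)]


def prune_connectivity(kernels, keep):
    """Zero out whole kernels, keeping the `keep` kernels with the largest L2 norm."""
    order = sorted(range(len(kernels)),
                   key=lambda j: sum(w * w for w in kernels[j]),
                   reverse=True)
    kept = set(order[:keep])
    return [k if j in kept else [0.0] * len(k) for j, k in enumerate(kernels)]


# Three toy 3x3 kernels, flattened row-major (assumed values).
kernels = [
    [0.9, 0.8, 0.0, 0.7, 1.0, 0.1, 0.0, 0.1, 0.0],
    [0.01, 0.02, 0.0, 0.0, 0.03, 0.0, 0.0, 0.01, 0.0],
    [0.0, 0.1, 0.0, 0.1, 1.2, 0.9, 0.0, 0.8, 1.1],
]
patterned = [apply_best_pattern(k) for k in kernels]
pruned_kernels = prune_connectivity(patterned, keep=2)
```

Because every surviving kernel follows one of a few known shapes, a compiler could in principle generate specialized code per pattern, which is one plausible reading of why the abstract describes this kind of sparsity as friendly to both compiler and hardware.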