Author : Sunwoo Kim
Publisher :
ISBN 13 :
Total Pages : 0 pages
Book Rating : 4.8/5 (34 downloads)
Book Synopsis Model Compression for Efficient Machine Learning Inference by : Sunwoo Kim
Download or read book Model Compression for Efficient Machine Learning Inference, written by Sunwoo Kim. This book was released in 2022 with a total of 0 pages. Available in PDF, EPUB and Kindle. Book excerpt: This dissertation presents model compression methods that make deep learning and machine learning frameworks practical for real-time applications. Starting from conventional compression techniques such as quantization, which reduces bit widths, we extend toward novel, compact frameworks built on a lossless compression approach.

We begin with an extreme network quantization algorithm that compresses a floating-point deep neural network into single-bit representations. Training is done in two rounds to preserve model performance: first on a weight-compressed real-valued network, and then on a bitwise version with the same topology. The pretrained weights of the first round initialize the weights of the bitwise network, in which the feedforward procedure is redefined with bitwise values and operations. Only the bitwise network is deployed for test-time inference, which not only makes the model easier to fit on small devices but also speeds up inference through bitwise arithmetic. In this study, we target a recurrent neural network architecture for single-channel source separation. Applying extreme quantization to this type of network poses additional challenges because of its recurrent relations: quantization noise can accumulate over multiple time frames. We address this with a more delicate scheme that incrementally binarizes the model parameters, minimizing the loss that a sudden introduction of quantization can cause. Because each step turns only a few randomly chosen parameters into their binary versions, the training procedure can gently adapt to the partly quantized network, and full binarization is eventually reached by increasing the binarized fraction over the iterations, as sketched in the code below.

Binarization can be extended to data compression, providing the same benefits of extreme compression rates and expedited inference on supporting algorithms and hardware. As with binarizing model weights, we propose compressing the bit widths of data down to binary form while minimizing the loss of information. To this end, we introduce locality-sensitive hash (LSH) functions that reduce storage overhead while preserving, in the binary codes, the semantic similarity between the high-dimensional data points in Euclidean space. However, given the random nature of LSH projection vectors, a long bit string is required to form discriminative hash codes that can guarantee high precision. In this dissertation, we propose to learn the locality-sensitive hash functions using boosting theory so that the underlying structure of the data is encoded into the hash codes efficiently. Our adaptive boosting algorithm learns simple logistic regressors as the weak learners. It differs from AdaBoost in that the projections are trained to minimize the distance between the self-similarity matrix of the hash codes and that of the original data points, rather than the misclassification rate. We evaluate our discriminative hash codes on a source separation problem framed as a similarity search task. Once the hash functions are trained, their binary classification results transform each data point into a bit string, on which simple bitwise operations compute Hamming distances to find the nearest neighbors in the hashed dictionary.
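To make the incremental binarization scheme described above concrete, here is a minimal sketch assuming a PyTorch setting; the function name, the use of sign() for the binary values, and the random-selection details are illustrative assumptions rather than the dissertation's exact procedure.

```python
import torch

def incremental_binarize(weight_fp, frac, mask=None):
    """Binarize a growing, randomly chosen subset of the weights.

    weight_fp : full-precision weights (initialized from the round-one network
                and kept for gradient updates)
    frac      : fraction of parameters to binarize at this stage, raised from
                0.0 toward 1.0 over the training iterations
    mask      : boolean mask of already-binarized positions (grows monotonically)
    """
    if mask is None:
        mask = torch.zeros_like(weight_fp, dtype=torch.bool)
    n_target = int(frac * weight_fp.numel())
    n_new = n_target - int(mask.sum())
    if n_new > 0:
        # randomly pick additional positions among the not-yet-binarized ones
        candidates = torch.nonzero(~mask.view(-1)).squeeze(1)
        chosen = candidates[torch.randperm(candidates.numel())[:n_new]]
        mask.view(-1)[chosen] = True
    # binarized positions use sign(w); the remaining weights stay full precision
    weight_mixed = torch.where(mask, torch.sign(weight_fp), weight_fp)
    return weight_mixed, mask
```

In a full training loop, the forward pass would use weight_mixed while gradients update weight_fp, and frac would be scheduled from 0 toward 1 as training proceeds.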
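The learned hashing pipeline just described can likewise be sketched in a few lines. The following NumPy sketch is illustrative, not the dissertation's implementation: hash_bits thresholds learned logistic-regressor projections into bits, similarity_alignment_loss stands in for the boosting objective using cosine-style similarities, and hamming_search performs the bitwise nearest-neighbor lookup; the names and the exact similarity measure are assumptions.

```python
import numpy as np

def hash_bits(X, W, b):
    """Binary codes from learned hash projections.

    X : (n, d) data points; W : (d, m) projection vectors, one per weak
    learner / bit; b : (m,) biases. Each bit is the thresholded output of
    one logistic-regressor weak learner.
    """
    return (X @ W + b > 0).astype(np.uint8)              # (n, m) bits

def similarity_alignment_loss(bits, X):
    """Boosting objective, sketched with cosine-style similarities: the
    self-similarity matrix of the codes should match that of the data."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)     # unit-norm data
    B = 2.0 * bits - 1.0                                  # {0,1} -> {-1,+1}
    return np.mean((Xn @ Xn.T - (B @ B.T) / bits.shape[1]) ** 2)

def hamming_search(query_bits, dictionary_bits, k=5):
    """k nearest neighbors of a hashed query via bitwise XOR and popcount."""
    q = np.packbits(query_bits[None, :], axis=1)          # pack query to bytes
    D = np.packbits(dictionary_bits, axis=1)              # pack dictionary
    dists = np.unpackbits(np.bitwise_xor(D, q), axis=1).sum(axis=1)
    return np.argsort(dists)[:k]
```

Because the dictionary is hashed once up front, test-time search reduces to XOR and popcount over short bit strings.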
Quantization and other model compression methods can achieve good compression rates, but they are applied as a post-training procedure that propagates noise and degrades generalization performance. Quantization-aware training helps minimize the accuracy drop by simulating low-precision inference while still backpropagating in floating point, but there is a limit to how much this fine-tuning can recover. Furthermore, quantized models demand dedicated hardware designs that support bit-level manipulation in memory and computation units before the benefits of model reduction can be realized. We address these generalization and hardware-compatibility issues by improving compact models until they outperform their larger counterparts, a form of lossless compression.

The first approach is personalization, in which small models are fine-tuned to their test-time specificity. Personalized compact models are trained with ordinary floating-point values and without structural modifications, and they do not require any specialized hardware. We target use cases on end-user devices in realistic settings, where we often encounter only a few classes within a target domain and those classes tend to recur in the specific environment. Hence, we postulate that a small personalized model suffices to handle this focused subset of the original universal problem. Our goal in this test-time adaptation is to develop a personalized speech enhancement model for edge devices that performs well for the relevant users' voices and surrounding acoustics (e.g., a family-owned smart assistant device). One major challenge for personalization is data shortage, driven by growing concerns about privacy infringement and data leakage. We therefore perform personalized speech enhancement without using any clean speech target from the test speaker, relying instead on a knowledge distillation framework: we distill the denoising results of an overly large teacher model and use them as pseudo targets to train the small student model (sketched in the code below). Experimental results show that the personalized models outperform larger non-personalized baseline models, demonstrating that personalization achieves model compression with no loss of denoising performance.

Finally, we propose another lossless approach that uses evolutionary algorithms to optimize compact generative adversarial networks. We coordinate the adversarial characteristics with a coevolutionary strategy and evolve a population of models toward high fitness, defined by generative performance and training stability. Our framework exposes individuals to varied as well as fit, stronger adversaries in each generation, so that robust and compact models are learned for efficient, faster inference. Experimental results demonstrate that generative models trained with the proposed coevolutionary strategy can be small yet outperform larger counterparts trained under the regular adversarial framework.
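As a rough illustration of the knowledge distillation setup for personalized speech enhancement described above, the sketch below assumes PyTorch, a frozen pretrained teacher, and a simple reconstruction (MSE) loss; the function name and loss choice are illustrative assumptions, not the dissertation's exact recipe.

```python
import torch
import torch.nn as nn

def personalize_student(student, teacher, noisy_loader, epochs=5, lr=1e-4):
    """Fine-tune a small student enhancer on a user's noisy recordings only.

    The frozen teacher's denoised outputs act as pseudo targets, so no clean
    speech from the test speaker is needed.
    """
    teacher.eval()                                   # teacher stays fixed
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    loss_fn = nn.MSELoss()                           # reconstruction loss (an assumption)
    for _ in range(epochs):
        for noisy in noisy_loader:                   # batches of the user's noisy speech
            with torch.no_grad():
                pseudo_clean = teacher(noisy)        # teacher estimate = pseudo target
            loss = loss_fn(student(noisy), pseudo_clean)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student
```

The same loop runs entirely on the user's device, which is what makes the approach compatible with the privacy constraints mentioned above.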
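The coevolutionary strategy can also be summarized schematically. The sketch below is a generic Python skeleton under stated assumptions (user-supplied train_step and fitness callables, simple truncation selection, and population refill by copying survivors); it conveys the structure of the approach rather than the dissertation's specific algorithm.

```python
import copy
import random

def coevolve(generators, discriminators, train_step, fitness,
             generations=50, keep=2, opponents=2):
    """Schematic coevolutionary loop over a population of small GANs.

    train_step(G, D) : briefly trains the generator-discriminator pair
    fitness(G, Ds)   : scalar score combining generative quality and training
                       stability, measured against the adversary population
    Each generation pairs every generator with several adversaries, ranks the
    generators by fitness, keeps the fittest, and refills the population with
    copies of the survivors (mutation/perturbation omitted for brevity).
    """
    for _ in range(generations):
        # expose each generator to varied, including strong, adversaries
        for G in generators:
            for D in random.sample(discriminators, k=min(opponents, len(discriminators))):
                train_step(G, D)
        # selection: rank by fitness against the whole discriminator population
        generators.sort(key=lambda G: fitness(G, discriminators), reverse=True)
        survivors = generators[:keep]
        refill = [copy.deepcopy(random.choice(survivors))
                  for _ in range(len(generators) - keep)]
        generators[:] = survivors + refill
        # the discriminator population can be evolved symmetrically (omitted)
    return generators[0]          # fittest compact generator
```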