Modern neural networks have come a long way in terms of performance. They excel in various applications like language, math, and vision. But here’s the catch – these networks are huge and demand a lot of computational resources. This poses a problem when it comes to serving these models to users, especially in settings where resources are limited, like wearables and smartphones.
So how do we address this issue? Well, one popular approach is to prune the networks by removing some of their weights. This way, we can reduce the computational resources needed for inference without significantly affecting the network’s utility. In simple terms, pruning means getting rid of connections between neurons that aren’t crucial for the network’s performance.
Now, pruning can be done at different stages of the network’s training process. But in this post, we’ll focus on post-training pruning. This means that we already have a pre-trained network and we want to determine which weights should be pruned. One common method is magnitude pruning, where we remove weights with the smallest magnitude. While this method is efficient, it doesn’t consider the impact of weight removal on the network’s performance.
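To make magnitude pruning concrete, here is a minimal NumPy sketch; the layer shape and the 90% sparsity level are illustrative choices, not values from the paper:

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude entries of a trained weight array."""
    pruned = weights.copy()
    flat = pruned.ravel()                                 # view into `pruned`
    k = int(sparsity * flat.size)                         # how many weights to drop
    if k > 0:
        drop = np.argpartition(np.abs(flat), k - 1)[:k]   # indices of the k smallest magnitudes
        flat[drop] = 0.0
    return pruned

# Example: remove 90% of the weights in a random "pre-trained" layer.
layer = np.random.randn(256, 128)
pruned = magnitude_prune(layer, 0.9)
print(f"fraction of weights kept: {np.count_nonzero(pruned) / pruned.size:.2f}")
```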
Another popular approach is optimization-based pruning, where weights are removed based on how their removal affects the loss function. Sounds good in theory, but most existing optimization-based methods struggle to find the right balance between performance and computational requirements. Some methods make approximations that scale well but have lower performance, while others perform better but are less scalable.
In our recent work, titled “Fast as CHITA: Neural Network Pruning with Combinatorial Optimization,” which we presented at ICML 2023, we introduce CHITA – the Combinatorial Hessian-free Iterative Thresholding Algorithm. CHITA outperforms existing pruning methods in terms of scalability and performance trade-offs. We achieve this by leveraging advancements in high-dimensional statistics, combinatorial optimization, and neural network pruning.
For example, CHITA can be 20 to 1000 times faster than state-of-the-art methods when pruning networks like ResNet. Moreover, it improves accuracy by over 10% in many scenarios. That’s some impressive performance right there.
Now, let’s talk about some technical improvements that CHITA brings to the table. First off, it efficiently uses second-order information, which has been shown to be effective in pruning methods. Typically, computing the Hessian matrix or its inverse is a real pain due to its large size. But CHITA cleverly sidesteps this issue by using second-order information without ever explicitly computing or storing the Hessian matrix. This makes it far more scalable.
Secondly, CHITA utilizes combinatorial optimization, which takes into account the impact of pruning one weight on others. This prevents us from mistakenly pruning important weights just because they seem unimportant in isolation. By considering the relationships between weights, CHITA ensures better performance.
Now, let’s dive into the technical details. Pruning can be seen as a best-subset selection problem: among all pruning candidates of a given size, we want to find the subset of weights to keep that yields the smallest loss. However, solving this problem exactly is computationally intractable. So instead, we approximate the loss function with a quadratic function using a second-order Taylor series, and we approximate the Hessian matrix that appears in this expansion with the empirical Fisher information matrix, which can be built from per-sample gradients.
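Written out in our own notation (not necessarily the paper’s), the quadratic model around the pre-trained weights looks like this, with the Hessian replaced by the empirical Fisher matrix built from per-sample gradients:

```latex
% Local quadratic model of the loss around the pre-trained weights \bar{w};
% g_i denotes the gradient of the loss on training sample i (our notation).
L(w) \;\approx\; L(\bar{w})
  + \nabla L(\bar{w})^\top (w - \bar{w})
  + \tfrac{1}{2}\,(w - \bar{w})^\top H\,(w - \bar{w}),
\qquad
H \;\approx\; \hat{F} \;=\; \frac{1}{n}\sum_{i=1}^{n} g_i\, g_i^\top .
```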
But here’s the twist – instead of explicitly computing the Hessian matrix, we reformulate the problem to exploit the low-rank structure of the empirical Fisher information matrix: its rank is at most the number of gradient samples used to build it, which is typically far smaller than the number of weights. This reformulation treats pruning as a sparse linear regression problem, where each regression coefficient corresponds to a weight in the network. By solving this regression problem, we can identify which weights should be pruned.
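Roughly, and in the same notation as above, if G is the n × p matrix whose i-th row is the per-sample gradient g_i, then the empirical Fisher is (1/n)GᵀG, and (dropping the first-order term, on the usual assumption that the pre-trained network sits near a local minimum) the quadratic objective turns into an ordinary least-squares problem in w:

```latex
% Sparse regression view of pruning (a sketch of the reformulation, in our notation):
\tfrac{1}{2}\,(w - \bar{w})^\top \hat{F}\,(w - \bar{w})
  \;=\; \frac{1}{2n}\,\bigl\lVert G\,(w - \bar{w}) \bigr\rVert_2^2
  \;=\; \frac{1}{2n}\,\bigl\lVert y - G\,w \bigr\rVert_2^2,
\qquad y := G\,\bar{w},
\qquad \text{subject to } \lVert w \rVert_0 \le k .
```

Because G has only n rows (one per gradient sample), we never need to form the full p × p Hessian at all.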
Now, let’s talk about the optimization algorithms behind CHITA. We reduce pruning to a linear regression problem with a sparsity constraint, meaning that only a certain number of regression coefficients can be nonzero. To solve this problem, we use an iterative hard thresholding (IHT) algorithm that alternates a gradient descent step with a thresholding step that sets all coefficients outside the top k (those with the largest magnitude) to zero. This algorithm explores different pruning candidates and optimizes over the remaining weights at the same time.
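Here is a minimal sketch of IHT on that regression problem, assuming the G and pre-trained weights from the reformulation above; the fixed step size, starting point, and iteration count are our simplifications, not the paper’s implementation:

```python
import numpy as np

def hard_threshold(w, k):
    """Keep the k largest-magnitude entries of w and zero out the rest."""
    out = np.zeros_like(w)
    if k > 0:
        keep = np.argpartition(np.abs(w), -k)[-k:]
        out[keep] = w[keep]
    return out

def iht_prune(G, w_bar, k, step=None, iters=200):
    """Iterative hard thresholding for min ||y - G w||^2 with at most k nonzeros,
    where y = G @ w_bar. A sketch only, not the paper's exact implementation."""
    n, _ = G.shape
    y = G @ w_bar
    if step is None:
        step = n / (np.linalg.norm(G, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    w = hard_threshold(w_bar, k)                 # start from magnitude pruning
    for _ in range(iters):
        grad = G.T @ (G @ w - y) / n             # gradient of (1/2n)||y - G w||^2
        w = hard_threshold(w - step * grad, k)   # gradient step, then re-threshold
    return w
```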
To ensure faster convergence, we developed a new line-search method that finds an appropriate learning rate for the algorithm. This leads to a significant improvement in convergence speed. We also implemented various computational schemes to enhance CHITA’s efficiency and the quality of the second-order approximation, resulting in an improved version called CHITA++.
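The paper’s line search is tailored to the IHT update; purely to illustrate the idea, here is a generic backtracking rule that shrinks the step until the thresholded update actually decreases the loss. It reuses hard_threshold (and the NumPy import) from the sketch above, and it is our simplification rather than CHITA’s actual scheme:

```python
def backtracking_step(G, y, w, grad, k, step0=1.0, beta=0.5, max_tries=20):
    """Shrink the step size until the thresholded update decreases the loss.
    Generic backtracking for illustration only; CHITA uses its own line search."""
    n = G.shape[0]
    loss = lambda v: 0.5 * np.sum((G @ v - y) ** 2) / n
    base, step = loss(w), step0
    for _ in range(max_tries):
        w_new = hard_threshold(w - step * grad, k)
        if loss(w_new) < base:
            return w_new, step                   # accept the first improving step
        step *= beta                             # otherwise shrink and retry
    return w, step                               # no improvement found: keep w
```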
We conducted experiments to compare the performance of CHITA with other state-of-the-art pruning methods on different network architectures like ResNet and MobileNet. The results speak for themselves. CHITA is not only much more scalable but also achieves better accuracy compared to other methods.
In conclusion, CHITA is a game-changer when it comes to pruning pre-trained neural networks. It offers scalability and competitive performance by efficiently leveraging second-order information and incorporating ideas from combinatorial optimization and high-dimensional statistics. With CHITA, we can significantly reduce computational requirements while maintaining accuracy. This opens up possibilities for deploying neural networks in resource-constrained settings. However, there are still limitations to be addressed, and we look forward to future work in this exciting field.