Channel pruning (He et al., 2017) aims to reduce the number of input channels of each convolutional layer while minimizing the reconstruction loss of its output feature maps, using the preserved input channels only. As with other model compression components based on channel pruning, this leads to a direct reduction in both model size and computational complexity (in terms of FLOPs).
In PocketFlow, we provide ChannelPrunedRmtLearner as the remastered version of the previous ChannelPrunedLearner, with a simplified and easier-to-understand implementation. The underlying algorithm is based on (He et al., 2017), with a few modifications. However, support for RL-based hyper-parameter optimization is not yet ready and will be provided in the near future.
For a convolutional layer, we denote its input feature map as
The convolutional operation can be understood as standard matrix multiplication between two matrices, one from im2col operator to produce a matrix
The matrix multiplication can be decomposed along the dimension of input channels. We divide
In (He et al., 2017), a
The above problem can be tackled by firstly solving
The coefficient of
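The two-step procedure described above (channel selection via LASSO regression, followed by least-square regression on the preserved channels) can be illustrated with a small NumPy sketch. This is not PocketFlow's implementation: the helper names `soft_threshold` and `select_channels_ista`, the synthetic data, and the fixed penalty value are all assumptions made for illustration, and the least-square step uses NumPy's closed-form solver rather than an iterative optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, c_in, c_out = 200, 9, 8, 4  # samples, k_h * k_w, input / output channels

# Hypothetical per-channel slices of the im2col matrix and the reshaped kernel.
# Channels 4..7 get tiny weights, so they are the natural candidates for pruning.
X_parts = [rng.standard_normal((n, k)) for _ in range(c_in)]
scales = [1.0] * 4 + [0.01] * 4
W_parts = [s * rng.standard_normal((k, c_out)) for s in scales]

# Decomposition along input channels: X @ W equals the sum of X_i @ W_i.
X_full = np.concatenate(X_parts, axis=1)  # shape (n, k * c_in)
W_full = np.concatenate(W_parts, axis=0)  # shape (k * c_in, c_out)
Y = X_full @ W_full
Y_sum = sum(X @ W for X, W in zip(X_parts, W_parts))


def soft_threshold(v, t):
    """Proximal operator of the l1 norm: shrink each entry towards zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def select_channels_ista(X_parts, W_parts, Y, penalty, nb_iters=500):
    """Minimize 0.5 * ||Y - sum_i beta_i X_i W_i||^2 + penalty * ||beta||_1 with ISTA."""
    # Column i of Z is the flattened contribution of input channel i.
    Z = np.stack([(X @ W).ravel() for X, W in zip(X_parts, W_parts)], axis=1)
    y = Y.ravel()
    # Step size 1 / L, where L is the largest eigenvalue of Z^T Z.
    lrn_rate = 1.0 / np.linalg.norm(Z, 2) ** 2
    beta = np.ones(Z.shape[1])
    for _ in range(nb_iters):
        grad = Z.T @ (Z @ beta - y)
        beta = soft_threshold(beta - lrn_rate * grad, lrn_rate * penalty)
    return beta


# Step 1: channel selection. Channels whose beta is exactly zero are pruned.
beta = select_channels_ista(X_parts, W_parts, Y, penalty=100.0)
kept = np.flatnonzero(beta)

# Step 2: refit the weights of the preserved channels (closed-form least squares).
X_kept = np.concatenate([X_parts[i] for i in kept], axis=1)
W_new, *_ = np.linalg.lstsq(X_kept, Y, rcond=None)

err_masked = np.linalg.norm(Y - sum(X_parts[i] @ W_parts[i] for i in kept))
err_refit = np.linalg.norm(Y - X_kept @ W_new)
print("kept channels:", kept.tolist())
print("reconstruction error before / after refit:", err_masked, err_refit)
```

In this sketch the penalty is fixed by hand, so the number of preserved channels is a by-product of that choice rather than a target. The refit can never increase the reconstruction error, because the original weights of the preserved channels are themselves a feasible solution of the least-square problem.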
Below is the full list of hyper-parameters used in ChannelPrunedRmtLearner:
| Name | Description |
|---|---|
| `cpr_save_path` | model's save path |
| `cpr_save_path_eval` | model's save path for evaluation |
| `cpr_save_path_ws` | model's save path for warm-start |
| `cpr_prune_ratio` | target pruning ratio |
| `cpr_skip_frst_layer` | skip the first convolutional layer for channel pruning |
| `cpr_skip_last_layer` | skip the last convolutional layer for channel pruning |
| `cpr_skip_op_names` | comma-separated Conv2D operation names to be skipped |
| `cpr_nb_smpls` | number of cached training samples for channel pruning |
| `cpr_nb_crops_per_smpl` | number of random crops per sample |
| `cpr_ista_lrn_rate` | ISTA's learning rate |
| `cpr_ista_nb_iters` | number of iterations in ISTA |
| `cpr_lstsq_lrn_rate` | least-square regression's learning rate |
| `cpr_lstsq_nb_iters` | number of iterations in least-square regression |
| `cpr_warm_start` | use a channel-pruned model for warm start |
Here, we provide a detailed description (and some analysis) of the above hyper-parameters:

* `cpr_save_path`: save path for the model created in the training graph. The resulting checkpoint files can be used to resume training from a previous run and to compute the model's loss and other evaluation metrics.
* `cpr_save_path_eval`: save path for the model created in the evaluation graph. The resulting checkpoint files can be used to export GraphDef & TensorFlow Lite model files.
* `cpr_save_path_ws`: save path for the model used for warm-start. This learner supports loading a previously-saved channel-pruned model, so that channel selection does not need to be performed again. This is only used when `cpr_warm_start` is `True`.
* `cpr_prune_ratio`: target pruning ratio for the input channels of each convolutional layer. The larger `cpr_prune_ratio` is, the more input channels will be pruned. If `cpr_prune_ratio` equals 0, then no input channels will be pruned and the model remains the same; if `cpr_prune_ratio` equals 1, then all input channels will be pruned.
* `cpr_skip_frst_layer`: whether to skip the first convolutional layer for channel pruning. The first convolutional layer may be directly related to input images, and pruning its input channels may harm the performance significantly.
* `cpr_skip_last_layer`: whether to skip the last convolutional layer for channel pruning. The last convolutional layer may be directly related to final outputs, and pruning its input channels may harm the performance significantly.
* `cpr_skip_op_names`: comma-separated Conv2D operation names to be skipped. For instance, if `cpr_skip_op_names` is set to "aaa,bbb", then any Conv2D operation whose name contains either "aaa" or "bbb" will be skipped and no channel pruning will be applied to it.
* `cpr_nb_smpls`: number of cached training samples for channel pruning. Increasing this may lead to smaller performance degradation after channel pruning, but also requires more training time.
* `cpr_nb_crops_per_smpl`: number of random crops per sample. Increasing this may lead to smaller performance degradation after channel pruning, but also requires more training time.
* `cpr_ista_lrn_rate`: ISTA's learning rate for LASSO regression. If `cpr_ista_lrn_rate` is too large, then the optimization process may become unstable; if `cpr_ista_lrn_rate` is too small, then the optimization process may require many iterations to converge.
* `cpr_ista_nb_iters`: number of iterations for LASSO regression.
* `cpr_lstsq_lrn_rate`: Adam's learning rate for least-square regression. If `cpr_lstsq_lrn_rate` is too large, then the optimization process may become unstable; if `cpr_lstsq_lrn_rate` is too small, then the optimization process may require many iterations to converge.
* `cpr_lstsq_nb_iters`: number of iterations for least-square regression.
* `cpr_warm_start`: whether to use a previously-saved channel-pruned model for warm-start.
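The substring rule described for `cpr_skip_op_names` can be sketched in a few lines of Python. The helper `should_skip` and the operation names below are hypothetical and only illustrate the matching behaviour stated above; they are not taken from PocketFlow's source code.

```python
def should_skip(op_name, skip_op_names):
    """Return True if op_name contains any of the comma-separated patterns."""
    patterns = [p.strip() for p in skip_op_names.split(',') if p.strip()]
    return any(p in op_name for p in patterns)


# Hypothetical Conv2D operation names, for illustration only.
conv_ops = ['model/aaa/Conv2D', 'model/block1/Conv2D', 'model/bbb_head/Conv2D']
pruned = [name for name in conv_ops if not should_skip(name, 'aaa,bbb')]
print(pruned)  # ['model/block1/Conv2D']
```

Because the match is on substrings rather than exact names, a short pattern can exclude more operations than intended, so it is safer to pick patterns that are specific to the layers in question.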
In this section, we present some of our results from applying ChannelPrunedRmtLearner to compress image classification and object detection models.
For image classification, we use ChannelPrunedRmtLearner to compress the ResNet-18 model on the ILSVRC-12 dataset:
| Model | Prune Ratio | FLOPs | Distillation? | Top-1 Acc. | Top-5 Acc. |
|---|---|---|---|---|---|
| ResNet-18 | 0.2 | 73.32% | No | 69.43% | 88.97% |
| ResNet-18 | 0.2 | 73.32% | Yes | 68.78% | 88.71% |
| ResNet-18 | 0.3 | 61.31% | No | 68.44% | 88.30% |
| ResNet-18 | 0.3 | 61.31% | Yes | 68.85% | 88.53% |
| ResNet-18 | 0.4 | 50.70% | No | 67.17% | 87.48% |
| ResNet-18 | 0.4 | 50.70% | Yes | 67.35% | 87.83% |
| ResNet-18 | 0.5 | 41.27% | No | 65.73% | 86.38% |
| ResNet-18 | 0.5 | 41.27% | Yes | 65.98% | 86.98% |
| ResNet-18 | 0.6 | 32.07% | No | 63.38% | 84.62% |
| ResNet-18 | 0.6 | 32.07% | Yes | 63.65% | 85.47% |
| ResNet-18 | 0.7 | 24.28% | No | 60.26% | 82.70% |
| ResNet-18 | 0.7 | 24.28% | Yes | 60.43% | 82.96% |
For object detection, we use ChannelPrunedRmtLearner to compress the SSD-VGG16 model on the Pascal VOC 07-12 dataset:
| Model | Prune Ratio | FLOPs | Pruned Layers | mAP |
|---|---|---|---|---|
| SSD-VGG16 | 0.2 | 67.34% | Backbone | 77.53% |
| SSD-VGG16 | 0.2 | 66.50% | All | 77.22% |
| SSD-VGG16 | 0.3 | 53.58% | Backbone | 76.94% |
| SSD-VGG16 | 0.3 | 52.32% | All | 76.90% |
| SSD-VGG16 | 0.4 | 41.63% | Backbone | 75.81% |
| SSD-VGG16 | 0.4 | 39.96% | All | 75.80% |
| SSD-VGG16 | 0.5 | 31.56% | Backbone | 74.42% |
| SSD-VGG16 | 0.5 | 29.47% | All | 73.76% |
In this section, we provide some usage examples to demonstrate how to use ChannelPrunedRmtLearner under different execution modes and hyper-parameter combinations:
To compress a ResNet-20 model for the CIFAR-10 classification task in the local mode, use:

    # set the target pruning ratio to 0.50
    ./scripts/run_local.sh nets/resnet_at_cifar10_run.py \
        --learner=chn-pruned-rmt \
        --cpr_prune_ratio=0.50

To compress a ResNet-18 model for the ILSVRC-12 classification task in the docker mode with 4 GPUs, use:

    # do not apply channel pruning to the last convolutional layer
    ./scripts/run_docker.sh nets/resnet_at_ilsvrc12_run.py -n=4 \
        --learner=chn-pruned-rmt \
        --cpr_skip_last_layer=True

To compress a MobileNet-v1 model for the ILSVRC-12 classification task in the seven mode with 8 GPUs, use:

    # use a channel-pruned model for warm-start, so no channel selection is needed
    ./scripts/run_seven.sh nets/mobilenet_at_ilsvrc12_run.py -n=8 \
        --learner=chn-pruned-rmt \
        --cpr_warm_start=True \
        --cpr_save_path_ws=./models_cpr_ws/model.ckpt