Date Published: July 20, 2017
Publisher: Public Library of Science
Author(s): Hui-zhen Zhao, Fu-xian Liu, Long-yue Li, Yilun Shang.
Motivated by insights from the maxout-units-based deep Convolutional Neural Network (CNN) that “non-maximal features are unable to deliver” and “feature mapping subspace pooling is insufficient,” we present a novel mixed variant of the recently introduced maxout unit called a mixout unit. Specifically, we calculate the exponential probabilities of the feature mappings gained by applying different convolutional transformations over the same input and then calculate their expected values according to these exponential probabilities. Moreover, we introduce the Bernoulli distribution to balance the maximum values with the expected values of the feature mapping subspace. Finally, we design a simple model to verify the pooling ability of mixout units and a mixout-units-based Network-in-Network (NiN) model to analyze the feature learning ability of the mixout models. We argue that our proposed units improve the pooling ability and that mixout models can achieve better feature learning and classification performance.
In recent years, the regularization of deep learning models through stochastic model averaging has become an effective tool to ameliorate overfitting in supervised classification tasks. Proposed by Hinton in 2012, dropout was the first model regularization method to use a stochastic model-averaging technique to improve the performance of deep learning models. The basic idea behind dropout is to sample half of the neurons to act on the output by weighting the fully connected layer with a Bernoulli-distributed mask. The stochastic selection of neurons makes the classification less reliant on any individual unit, thus reducing overfitting [2,3]. Krizhevsky applied dropout to benchmark datasets of several different scales and verified its good performance. Because the “model-averaging ability” of dropout can greatly improve the performance of a convolutional neural network (CNN), various scholars have proposed improved stochastic model-averaging methods to gain further improvements. Wang sped up the dropout training procedure through a Gaussian approximation method. Ba set the dropout probability in every hidden layer with a 2-layer belief network, which shared its parameters with the CNN, to improve the learning effect of the network. Tompson applied dropout to the entire feature space, forming the space-dropout method. Based on dropout, Wan proposed the DropConnect method, which randomly drops connections between units rather than their activations. By training unit models that share millions of parameters and averaging the impact of units on the entire model output, dropout was shown to be effective against overfitting and to improve the feature learning ability of the model. The dropout regime can be viewed as making a significant update to a different model on a different subset of the training set at each back-propagation step; therefore, a model combined with dropout appears to perform better when it takes relatively large steps in parameter space.
Thus, the ideal regime for dropout is when the overall training procedure resembles training an ensemble with bagging under parameter-sharing constraints.
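The dropout mechanism described above can be illustrated with a minimal sketch. It is not taken from the paper: the function name `dropout`, the parameter `keep_prob`, and the use of plain Python lists are assumptions made for illustration. During training each activation is kept with probability 0.5 according to a Bernoulli draw; at test time no units are sampled, and the activations are instead scaled by the keep probability, which is the usual way of approximating the average over the sampled sub-models.

```python
import random


def dropout(activations, keep_prob=0.5, training=True, rng=None):
    """Illustrative dropout over a list of activations (hypothetical helper).

    Training: each activation is kept with probability keep_prob (Bernoulli mask).
    Testing: nothing is sampled; activations are scaled by keep_prob instead.
    """
    if not training:
        # Deterministic test-time pass: scale instead of sampling a mask.
        return [a * keep_prob for a in activations]
    rng = rng or random.Random()
    # One independent Bernoulli(keep_prob) draw per unit.
    return [a if rng.random() < keep_prob else 0.0 for a in activations]


if __name__ == "__main__":
    hidden = [0.8, -1.2, 2.0, 0.4]
    print(dropout(hidden, rng=random.Random(0)))   # some units zeroed out
    print(dropout(hidden, training=False))         # every unit scaled by 0.5
```

Because every training step draws a new mask, each update acts on a different thinned network that shares its parameters with all the others, which is the sense in which the procedure resembles bagging under parameter-sharing constraints.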
In this section, we first introduce the maxout unit and summarize its characteristics. We then present the dropout strategy and the “mean network”.
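As a point of reference, a maxout unit takes the element-wise maximum over several feature mappings computed from the same input. The sketch below assumes the feature mappings have already been computed and are given as equal-length Python lists; the function name `maxout` and this data layout are illustrative choices, not the paper's notation.

```python
def maxout(feature_maps):
    """Element-wise maximum over k feature mappings of equal length.

    feature_maps: list of k lists, one per linear or convolutional transformation
    of the same input (assumed layout, for illustration only).
    """
    # zip(*feature_maps) groups the k candidate values at each position.
    return [max(values) for values in zip(*feature_maps)]


if __name__ == "__main__":
    z1 = [1.0, -2.0, 0.5]
    z2 = [0.0, 3.0, 0.5]
    print(maxout([z1, z2]))
```

Only the largest value at each position reaches the output, so the non-maximal feature mappings contribute nothing to the forward pass at that position.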
We propose adopting a modified maxout unit, namely, the mixout unit, and we demonstrate that the mixout unit improves the pooling ability of the maxout unit.
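One possible reading of the mixout unit, based only on the description in the abstract, is sketched below. The exponential probabilities are taken to be a softmax over the feature mappings at each position, the expected value is the probability-weighted sum of those mappings, and a Bernoulli draw chooses between the maximum and the expected value. The names `softmax`, `mixout`, and `p_max`, the per-position Bernoulli draw, and the default of 0.5 are all assumptions; the paper's exact formulation may differ.

```python
import math
import random


def softmax(values):
    """Exponential probabilities of a sequence of values (numerically stabilized)."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mixout(feature_maps, p_max=0.5, rng=None):
    """Hypothetical mixout sketch: mix the max and the softmax-weighted expectation.

    feature_maps: list of k equal-length lists computed from the same input.
    p_max: assumed Bernoulli probability of emitting the maximum value.
    """
    rng = rng or random.Random()
    out = []
    for values in zip(*feature_maps):
        if rng.random() < p_max:
            # Bernoulli draw selected the maxout-style maximum.
            out.append(max(values))
        else:
            # Otherwise emit the expected value under the exponential probabilities.
            probs = softmax(values)
            out.append(sum(p * v for p, v in zip(probs, values)))
    return out


if __name__ == "__main__":
    z1 = [1.0, -2.0, 0.5]
    z2 = [0.0, 3.0, 0.5]
    print(mixout([z1, z2], rng=random.Random(0)))
```

Under this reading, setting `p_max` to 1 recovers the maxout unit, while setting it to 0 always emits the expected value, so that non-maximal feature mappings also contribute to the output.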
In this section, we assess the performance of the mixout units from two aspects. The first is an analysis of the pooling ability of the mixout units: we designed a simple CNN model and analyzed the pooling ability of mixout units, maxout units, and probout units both qualitatively and quantitatively. The second is a performance analysis of the general mixout model: we designed the mixout-units-based NiN model (M-NiN) based on the frequently used NiN model and compared its performance with models based on maxout units, probout units, ReLU and its variants. We used an Intel(R) Core(TM) i5-4590 processor and an AMD Radeon HD 7000 series graphics card for our experiments.
In this paper, we presented the mixout unit as an adaptive activation function. We designed a simple model for pooling ability analysis as well as a general model for performance analysis of mixout units. The mechanism of mixing the maximum and the average in the mixout units improves the subspace pooling operation, thus leading to better utilization of the model-averaging ability of dropout. We conducted several experiments on three benchmark datasets, and the results revealed the desirable properties of mixout units. Because making sufficient use of the model-averaging technique of dropout already prevents overfitting to some extent, mixout units perform less satisfactorily on large datasets, where the amount of data itself somewhat constrains overfitting. Interesting avenues for future work include applying mixout units to other network architectures such as VGG or ResNet.