How to speed up the learning mechanism in a connectionist model
Szilárd Vajda, Abdel Belaïd
Henri Poincaré University, Nancy 1
Loria Research Center
READ Group
Campus Scientifique, BP. 239 Vandoeuvre les Nancy, 54506, France
In this paper a fast data-driven learning-corpus building algorithm (FDDLCB) is proposed. This generic technique dynamically builds a representative and compact learning corpus for a connectionist model. The constructed dataset contains only a small number of patterns, yet these are sufficiently descriptive to characterize the different classes to be separated. The method is based on a double least mean squares (LMS) error minimization mechanism that tries to find the optimal boundaries of the different pattern classes. In the classical learning process the LMS criterion serves to minimize the error during learning; this process is complemented by a second one, as the selection of new samples is also based on minimizing the recognition error. Reinforcing the class boundaries where recognition fails lets us achieve rapid and good generalization without any loss of accuracy. A modified version of the algorithm is also presented. The experiments were performed on the MNIST (Modified NIST) separated digit dataset. The encouraging result (98.51%), obtained using just 1.85% of the available patterns from the original training dataset, is comparable even with state-of-the-art techniques.
1. Introduction
In the last few decades, among the neural network (NN) based applications that have been proposed for pattern recognition, character recognition has been one of the most successful. The most common approach in this field is the use of a multi-layer perceptron (MLP) with a supervised learning scheme based mainly on the least mean squares (LMS) error minimization rule.
Generally, training such a system requires a huge amount of data in order to cover the different intra-class and inter-class variations. Hence neural network based approaches have both advantages and disadvantages. Among the advantages we can list a good generalization property based on a solid mathematical background, a good convergence rate, a fast recognition process, etc. [2]. Among the disadvantages are restrictions such as the following: the size of the input must be fixed, the nature of the input is not obvious for the different pattern recognition tasks, and it is hard to embed in the network topology a priori knowledge derived from the data corpus. At the same time, convergence can be slow, as adjusting the decision surface (hyperplane) as a function of the network's free parameters and the huge amount of data can be time-consuming. The excessive amount of data, the network architecture and the curse of dimensionality form an endless trade-off in neural network theory [15].
In order to tackle these problems, different techniques have been proposed in the literature. Some of them use a priori knowledge derived from the dataset; others modify the network topology in order to reduce the number of free parameters; others try to reduce the dimension of the input using feature selection techniques; and still others develop so-called active learning techniques, whose goal is to use an optimal training dataset obtained by selecting the most informative patterns from the original dataset.
Our proposal is based on active learning, more precisely on the branch of incremental learning, where the dataset is constructed dynamically during training. Using the FDDLCB algorithm, based mainly on LMS error minimization, we have considerably reduced the training time without any loss of accuracy.
The rest of the paper is organized as follows. Section 2 gives a brief description of the existing improvements and active learning techniques, while Section 3 discusses the proposed algorithm in detail. Section 4 is dedicated to the results and, finally, Section 5 gives some concluding remarks.
2. Related works
Over the years, different classifier systems have been proposed and developed for pattern recognition tasks inspired by real-life applications. Nowadays, more or less every baseline recognition system has reached its limits, as no existing system can realistically model human vision. Hence, research has been oriented toward improving the existing systems by refining the mathematical formalism or by embedding empirical knowledge in the systems. Considering the nature of the improvements, different research axes can be identified.
In order to select the most descriptive features, many feature extraction algorithms have been proposed in the last few years. In character recognition these features can be grouped as statistical features, geometrical features [4, 8, 9, 14], size and rotation invariant features [1, 7, 12, 19], etc. For the combination of these features, many feature subset selection mechanisms have been designed. Feature subset selection is based mainly on neural networks (NN) and genetic algorithms (GA) [21] operating with randomized heuristic search techniques [3].
Concerning the network topology, there is a consensus in the NN community that one or two hidden layers are sufficient for the different pattern classification problems. However, LeCun and his colleagues showed that it is possible to use multiple hidden layers based on multiple maps, using convolution, sub-sampling and weight sharing techniques, in order to achieve excellent results on separated digit recognition [4, 18].
Spirkovska and Reid used a higher-order neural network to introduce position, size and rotation invariant a priori information into the topology [19].
Another solution is optimal brain damage (OBD), proposed by LeCun et al. in [5], which removes the unimportant weights in order to achieve better generalization and to speed up the training and testing procedures. Optimal cell damage (OCD) and its derivatives are also based on the idea of pruning the network structure. All these approaches solve the encountered problems to some extent, but each of them has a considerable time complexity.
The best approach seems to be so-called active learning and its different derivatives. In such an approach the learner, i.e. the classifier, is guided during the training process: an information control mechanism is built into the system. Rather than passively accepting all the available training samples, the classifier guides its own learning process by finding the most informative patterns. With such guided training, where the training patterns are selected dynamically, the training duration can be reduced considerably and better generalization can be obtained, as all uninformative data can be discarded. As stated in [16], the generalization error decreases more rapidly for active learning than for passive learning.
Engelbrecht [6] groups these techniques into two classes according to their mechanism of action:
1. Selective learning, where the classifier selects at each selection interval a new training subset from the original candidate set.
2. Incremental learning, where the classifier starts with an initial subset selected in some way from the candidate set. During the learning process, at specified selection intervals, new samples are selected from the candidate set and these patterns are added to the training set.
While the authors of [6, 13, 16] developed active learning techniques for feedforward neural networks, similar systems have been proposed for SVM approaches in [11, 17, 20]. In this second case all the pattern selection algorithms are based on the idea that the hyperplane constructed by the SVM depends only on a reduced subset of the training patterns, called support vectors, that lie close to the decision surface. In that case the selection methods are mostly based on kNN, clustering, confidence measures, the Hausdorff distance, etc. The drawback of such systems is that it is quite difficult to set their different parameters, as stated by Shin and Cho in [17]. Another limitation of the approach is that a second training procedure is necessary, while for NN the training process is applied just once. This drawback is also found in the different network pruning algorithms proposed by LeCun.
3. FDDLCB algorithm description
Our method is based on incremental learning using error-based selection. The approach relies on an MLP classifier with one hidden layer. The main idea of the FDDLCB algorithm is to build at run time a minimal, data-driven learning corpus based on the LMS error, by adding new patterns to the training corpus at each training level in order to cover the variations of the patterns as fully as possible and to reduce the recognition error.
Let us denote by GlobalLearningCorpus (GLC) the whole set of patterns which can be used during the training procedure, by GlobalTestingCorpus (GTC) the whole set of patterns which can be used for testing, and by DynamicLearningCorpus (DLC) the minimal set of patterns which serves to train the network. Let us also denote by NN the neural network, by N the number of new patterns to be considered at each learning level, and by M the number of classes to be separated.
Algorithm description:
Initialization:
    DLC = {xi ∈ GLC | i = 1, ..., M}
    GLC = GLC − DLC
Database Building:
    repeat {
        repeat {
            TrainNetwork(NN, DLC)
        } until (NetworkError(NN, DLC, ALL) < Threshold1)
        TestNetwork(NN, GTC)
        if (NetworkError(NN, GTC, ALL) < Threshold2) then STOP
        TestNetwork(NN, GLC)
        if (NetworkError(NN, GLC, ALL) < Threshold1) then STOP
        DLC = DLC ∪ {yi ∈ GLC | i = 1, ..., N}
        GLC = GLC − {yi ∈ GLC | i = 1, ..., N}
    } until (|GLC| = 0)
Results:
    NN contains the modified weight set
    DLC contains the minimal number of patterns which is sufficient to train the NN
• TrainNetwork(NN,DATASET) trains the NN with the given DATASET using classical LMS error minimization and error backpropagation
• TestNetwork(NN,DATASET) tests the NN with the given DATASET
• NetworkError(NN,DATASET,SAMPLES_NUMBER) calculates the error given by NN using SAMPLES_NUMBER of patterns from the DATASET using the LMS criterion
• yi denotes the pattern from the GLC giving the ith highest error during the test
• |·| denotes the cardinality of the set
The algorithm starts with an initial DLC set containing one randomly selected representative pattern (xi) per class, so that no class is favoured initially. The network is trained with these samples. Once the training error falls below an empirical threshold value, training stops and the network is tested with the samples belonging to the GTC. If the error criterion is satisfied, the algorithm stops, as the training was successful. Otherwise we continue by adding new samples to the DLC set. To do this, we look in the GLC for the N samples (yi) giving the highest classification error. If this error is below a threshold value, the algorithm stops, as no further helpful information can be added to the network. Otherwise we move these N elements from the GLC to the DLC and restart training on the extended dataset. The algorithm stops when the error criterion is satisfied or when no patterns remain in the GLC set. In the second case we are back to classical training, as the whole dataset is eventually used. The algorithm therefore imposes no restriction: in the worst case we should obtain almost the same results as when using the whole dataset.
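The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the experiments: a single-layer linear model trained with the delta (LMS) rule stands in for the one-hidden-layer MLP, the data are synthetic one-dimensional points, and all function names, the learning rate and the max_epochs cap on the inner loop are hypothetical choices made for the example.

```python
import random

def outputs(w, x):
    """Linear outputs, one per class; the last weight of each row is the bias."""
    xb = list(x) + [1.0]
    return [sum(wj * xj for wj, xj in zip(row, xb)) for row in w]

def predict(w, x):
    out = outputs(w, x)
    return out.index(max(out))

def sample_error(w, x, label):
    """Squared error against a one-hot target (LMS criterion for one pattern)."""
    return sum((o - (1.0 if c == label else 0.0)) ** 2
               for c, o in enumerate(outputs(w, x)))

def network_error(w, data):
    """Mean squared error over a dataset."""
    return sum(sample_error(w, x, y) for x, y in data) / len(data)

def train_network(w, data, lr=0.05):
    """One pass of the delta (LMS) rule over the dataset; updates w in place."""
    for x, y in data:
        xb = list(x) + [1.0]
        for c, o in enumerate(outputs(w, x)):
            delta = (1.0 if c == y else 0.0) - o
            for j, xj in enumerate(xb):
                w[c][j] += lr * delta * xj

def fddlcb(glc, gtc, n_classes, n_new, thr1, thr2, max_epochs=100, seed=0):
    rng = random.Random(seed)
    pool = list(glc)                       # candidates still left in the GLC
    dlc = []
    for c in range(n_classes):             # one random representative per class
        pick = rng.choice([s for s in pool if s[1] == c])
        dlc.append(pick)
        pool.remove(pick)
    w = [[0.0] * (len(glc[0][0]) + 1) for _ in range(n_classes)]
    while True:
        for _ in range(max_epochs):        # inner loop: train on the DLC (capped)
            train_network(w, dlc)
            if network_error(w, dlc) < thr1:
                break
        if network_error(w, gtc) < thr2:   # test criterion satisfied
            break
        if not pool or network_error(w, pool) < thr1:
            break                          # nothing left, or nothing helpful left
        pool.sort(key=lambda s: sample_error(w, s[0], s[1]), reverse=True)
        dlc.extend(pool[:n_new])           # move the N worst patterns to the DLC
        del pool[:n_new]
    return w, dlc

def toy_data(n, seed):
    """Two 1-D classes centred at -1 and +1 (synthetic stand-in for a corpus)."""
    rng = random.Random(seed)
    return [((2 * c - 1 + rng.uniform(-0.3, 0.3),), c)
            for c in (0, 1) for _ in range(n)]

glc, gtc = toy_data(50, seed=1), toy_data(20, seed=2)
weights, dlc = fddlcb(glc, gtc, n_classes=2, n_new=5, thr1=0.02, thr2=0.1)
accuracy = sum(predict(weights, x) == y for x, y in gtc) / len(gtc)
print(len(dlc), "of", len(glc), "patterns used; test accuracy", accuracy)
```

The second stopping test here uses the error over all remaining candidates, following the pseudo-code; testing only the N selected patterns, as the prose suggests, would be an equally reasonable reading.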
A modified version of the FDDLCB algorithm feeds the network with class samples having the same distribution. This precaution is necessary, as stated in [10], in order not to bias the system in one way or another. For that reason we modified the conditions under which the DLC set is created. Now, at each iteration we add N samples for each pattern class, chosen by their highest error contribution within their class, instead of the first N samples of the dataset giving the highest error overall. With this selection process we can guarantee a uniform distribution across the pattern classes.
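The difference between the two selection rules can be illustrated with a small sketch. The helper names select_global and select_per_class, and the error values, are hypothetical and serve only to show how ranking within each class changes which patterns are picked.

```python
from collections import defaultdict

def select_global(errors, n):
    """Original rule: indices of the n patterns with the highest error overall."""
    ranked = sorted(range(len(errors)), key=lambda i: errors[i], reverse=True)
    return ranked[:n]

def select_per_class(errors, labels, n_per_class):
    """Modified rule: the n_per_class highest-error patterns of every class."""
    by_class = defaultdict(list)
    for idx, (err, label) in enumerate(zip(errors, labels)):
        by_class[label].append((err, idx))
    chosen = []
    for label in sorted(by_class):
        ranked = sorted(by_class[label], key=lambda p: p[0], reverse=True)
        chosen.extend(idx for _, idx in ranked[:n_per_class])
    return chosen

# Made-up per-pattern errors: class 0 dominates the high-error end.
errors = [0.9, 0.1, 0.8, 0.7, 0.2, 0.3]
labels = [0, 0, 0, 0, 1, 1]
print(select_global(errors, 2))             # [0, 2]  (both from class 0)
print(select_per_class(errors, labels, 1))  # [0, 5]  (one from each class)
```

With the global rule a class that is hard to learn can absorb every newly added pattern, whereas the per-class rule adds the same number of patterns to each class at every iteration.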
4. Test results
The experiments with the FDDLCB algorithm used the MNIST reference database as input data. This dataset contains 60,000 samples for learning and 10,000 samples for testing. The 28×28 normalized gray-scale images contain separated handwritten digits from 0 to 9. The tests were performed with a fully connected MLP with one hidden layer. Raw images were used as input, so the input vector contains 784 values. To achieve a recognition rate of about 98.6% using the whole learning corpus, at least 30 learning epochs are needed. That means at least 30 × 60,000 = 1,800,000 patterns must be presented to the network. Table 1 shows the different constructed datasets, the number of patterns presented to the system and the results obtained by the FDDLCB algorithm on the test set as a function of the N parameter.
We can thus achieve a comparable result with as few as 1,110 samples out of the 60,000 available, which means that the other patterns can be considered redundant and need not be used. The learning process can be shortened substantially, as almost the same results are obtained by presenting just 58,690 patterns to the network. The learning process is thus sped up 14 times, which is a considerable gain even on a high-end computer.
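The figures quoted in this section can be checked with a few lines of arithmetic. The variable names are illustrative; the numbers are those given in the text.

```python
full_corpus = 60_000   # MNIST training patterns
epochs = 30            # epochs needed with the whole corpus
selected = 1_110       # distinct patterns kept by FDDLCB
presented = 58_690     # patterns presented to the network by FDDLCB

full_presentations = epochs * full_corpus             # 1,800,000
fraction_used = selected / full_corpus                # share of the corpus used
presentation_ratio = full_presentations / presented   # training presentations saved

print(full_presentations)            # 1800000
print(f"{fraction_used:.2%}")        # 1.85%
print(f"{presentation_ratio:.1f}")   # 30.7
```

The ratio of training presentations (about 30.7) is larger than the reported 14-fold speed-up. One plausible reading, which the text does not state explicitly, is that the extra forward passes needed to rank the candidate patterns account for the difference.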
Table 1. Results obtained with different datasets constructed by the FDDLCB algorithm (columns: learning set, recognition rate)
As the authors of [17] report results of their pattern selection method on the MNIST benchmark dataset, a comparison can be made. Nine binary SVM classifiers were used: class 8 is paired with each of the other classes. The reported recognition error, averaged over the nine classifiers, is 0.28% using all the available patterns and 0.38% for the pattern selection based technique. The loss of accuracy is similar to that in our case. Unfortunately, no results are reported concerning the actual recognition accuracy for each separate digit class, so a direct comparison with our method cannot be made.
The training time is reduced by a factor of 11.8, which is less than the factor of 14 obtained in our case. Similarly, the proportion of patterns used (16.76%), serving as support vectors, is considerably larger than the 1.85% of patterns we selected to train the network.
The modified FDDLCB algorithm reaches 98.01%, which is close to the result produced by the original algorithm, but it needs many more iterations and samples (9,000 different samples were selected, while 864,600 patterns were presented to the network).
Figure 1 presents the class distribution for the different datasets built by FDDLCB as a function of the N parameter. The x-axis shows the different classes, and the y-axis shows the percentage of each class in the dataset. The variance of the class distribution is not significant across the different datasets, so the N parameter controls only the learning convergence speed and the size of the built dataset.
The empirical value N = 50 was established after some trial runs performed with different N values. We found this to be the best value for achieving a considerable speed gain.
Similarly, the results presented in Table 1 show that changing the parameter N has no major influence on the results. It influences only the size of the built dataset and the speed of the building process.
Figure 1. The sample distribution over the classes for the different constructed datasets
Using the same pattern distribution as in Figure 1 but selecting the patterns at random to create the dataset, the average recognition accuracy does not exceed 91.01%.
Analyzing the dynamic learning corpus, we can also comment on the intra-class and inter-class variance. In the MNIST database the class "0" shows the least variance and the class "9" the most, so many more samples belonging to class "9" are needed in order to achieve a good recognition score.
In terms of pattern complexity, the classes "0", "1" and "6" are the simplest, while the classes "3", "8" and "9" are the most complex, which is natural as they can be confused with one another.
5. Conclusion
We proposed in this paper a generic, simple and fast active learning algorithm that builds a minimal learning corpus at run time, based on an MLP classifier.
The algorithm is based on a dual LMS error estimation, which can guarantee the convergence of the algorithm. The first LMS minimization is used during training, in the error backpropagation. The second is used when the LMS error is calculated for the samples during recognition. The misclassified patterns are added to the DLC set in order to minimize the recognition error by learning these new items, which have contributed to the error accumulation. The method substantially reduces the learning period and discards redundant information in order to avoid overfitting.
The tests performed on MNIST showed that it is possible to achieve 98.51% recognition accuracy using just 1,110 different samples, and that the learning time can be reduced by a factor of 14, a gain which is considerable given the complexity of the algorithm.
The mechanism cannot be used to improve the system presented in [18], which relies on data redundancy.
The algorithm tries to enlarge the different class boundaries by using the extreme patterns in learning. It increases the number of forward steps (propagation) but substantially decreases the number of backward steps (error backpropagation), which are computationally much more costly.
The FDDLCB algorithm can also be used to address the challenge described by Japkowicz in [10], namely the class imbalance problem, which often occurs in real-world applications.
We often deal with learning corpora in which the distribution of the samples over the different classes is not uniform: some classes are over-represented and others under-represented. The methods presented in [10], based on down-sizing and re-sampling, are restrictive, as there is no rigorous selection criterion for choosing which elements should be discarded or re-sampled.
The FDDLCB can avoid the overfitting effect caused by those methods, as it uses a rigorous selection criterion.
References
[1] S. Adam, J. M. Ogier, C. Cariou, R. Mullot, J. Gardes, and Y. Lecourtier. Multi-scaled and multi oriented character recognition: An original strategy. In ICDAR, pages 45–48.
[2] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[3] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press, 1995.
[4] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Intelligent Signal Processing, pages 306–351, 2001.
[5] Y. LeCun, J. S. Denker, and S. A. Solla. Optimal brain damage. Advances in Neural Information Processing Systems 2 (NIPS*89), 1990.
[6] A. P. Engelbrecht. Selective learning for multilayer feedforward neural networks. In J. Mira and A. Prieto (eds.), Lecture Notes in Computer Science, 2084:386–393, 2001.
[7] A. Goshtasby. Description and discrimination of planar shapes using shapes matrices. IEEE Transactions of Pattern Recognition and Machine Intelligence, 7(6):738–743, 1984.
[8] I. Guyon. Application of neural networks to character recognition. International Journal of Pattern Recognition and Artificial Intelligence, 5(1):353–382, 1991.
[9] M. S. Hoque and M. C. Fairhurst. A moving window classifier for off-line character recognition. In Proceedings of 7th International Workshop on Frontiers in Handwriting Recognition, pages 595–600, 1998.
[10] N. Japkowicz. The class imbalance problem: significance and strategies. In Proceedings of International Conference on Artificial Intelligence 2000 (IC-AI2000), 2000.
[11] R. Koggalage and S. Halgamuge. Reducing the number of training samples for support vector machine classification. Neural Information Processing - Letters and Reviews, 2(3):57–65, 2004.
[12] S.-W. Lee, H.-S. Park, and Y. Y. Tang. Translation-, scale-, and rotation-invariant recognition of hangul characters with ring projection. Proceeding of 1st International Conference on Document Analysis and Recognition, pages 829–836, 1991.
[13] R. Polikar, L. Udpa, S. S. Udpa, and V. Honavar. Learn++: An incremental learning algorithm for supervised neural networks. IEEE Transactions on Systems, Man and Cybernetics - Part C: Application and Reviews, 31(4):497–508, 2001.
[14] R. Romero, R. Berger, R. Thibadeau, and D. Touretzky. Neural network classifiers for optical chinese character recognition. Proceedings of the 4th annual Symposium on Document Analysis and Information Retrieval, pages 385–389, 1995.
[15] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. pages 318–362, 1986.
[16] H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. Proceeding of 5th Annual ACM Workshop on Computational Learning Theory, pages 287–299, 1992.
[17] H. Shin and S. Cho. Fast pattern selection for support vector classifier. Proceedings of the 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining, LNCS 2637.
[18] P. Y. Simard, D. Steinkraus, and J. C. Platt. Best practices for convolutional neural networks applied to visual document analysis. ICDAR, pages 958–962, 2003.
[19] L. Spirkovska and M. B. Reid. Robust position, scale and rotation invariant object recognition using higher-order neural networks. Pattern Recognition, 25(9):975–985, 1992.
[20] J. Wang, P. Neskovic, and L. N. Cooper. Training data selection for support vector machines. International Conference on Neural Computation, 2005.
[21] J. Yang and V. Honavar. Feature subset selection using genetic algorithms. IEEE Transactions, Intelligent Systems, (34):45–49, 1998.