New Scaled Conjugate Gradient Algorithm for Training Artificial Neural Networks Based on Pure Conjugacy Condition

Conjugate gradient methods constitute excellent neural network training methods, characterized by their simplicity, efficiency, and very low memory requirements. In this paper, we propose a new scaled conjugate gradient neural network training algorithm which guarantees the descent property under the standard Wolfe conditions. Encouraging numerical experiments verify that the proposed algorithm provides fast and stable convergence.


1. INTRODUCTION
Learning systems, such as multilayer feed-forward neural networks (FNNs), are parallel computational models comprised of densely interconnected, adaptive processing units, characterized by an inherent propensity for learning from experience and also discovering new knowledge. Due to their excellent capability of self-learning and self-adapting, they have been successfully applied in many areas of artificial intelligence [1,2,3] and are often found to be more efficient and accurate than other classification techniques [4]. The operation of an FNN is usually based on the following equations:

net_j^l = Σ_i w_{j,i} x_i^{l-1} + b_j^l,   x_j^l = f(net_j^l),

where net_j^l is the sum of the weighted inputs for the j-th node in the l-th layer (j = 1, 2, …, N_l), w_{j,i} is the weight from the i-th neuron in layer l-1 to the j-th neuron in layer l, b_j^l is the bias of the j-th neuron in the l-th layer, x_j^l is the output of the j-th neuron in the l-th layer, f(·) is the activation function, and O_j is the output of node j at the output layer.
Recently, many learning algorithms for feed-forward neural networks have been proposed [4,5,6]. Several of these algorithms are based on a well-known method in optimization theory, the gradient descent algorithm. They usually have a poor convergence rate and depend on parameters which have to be specified by the user, since no theoretical basis for choosing them exists. The values of these parameters are often crucial for the success of the algorithm. An example is the Standard Back Propagation (SBP) algorithm [7], which often behaves very badly on large-scale problems and whose success depends on the user-specified learning rate parameter.
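The forward pass described by these equations can be sketched as follows. This is a minimal illustration only: the logistic activation, the 2-3-1 layer sizes, and the random weights are illustrative assumptions, not values from the paper.

```python
import numpy as np

def logistic(z):
    # logistic (sigmoid) activation f(net) = 1 / (1 + exp(-net))
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Forward pass of a feed-forward network:
    net^l_j = sum_i w_{j,i} x^{l-1}_i + b^l_j,  x^l_j = f(net^l_j)."""
    for W, b in zip(weights, biases):
        net = W @ x + b          # weighted inputs plus bias for layer l
        x = logistic(net)        # activation gives this layer's outputs
    return x                     # outputs O_j of the final layer

# Hypothetical 2-3-1 network with fixed random weights, for illustration
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
biases = [rng.standard_normal(3), rng.standard_normal(1)]
O = forward(np.array([0.5, -1.0]), weights, biases)
```

With a logistic output neuron, each component of O lies in (0, 1).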
The problem of training a neural network consists of iteratively adjusting its weights in order to minimize the difference between the actual output of the network and the desired output over the training set. Finding such a minimum is equivalent to minimizing the error function defined by

E(W) = (1/2) Σ_j Σ_i (O_i^j - t_i^j)²,

where O_i^j and t_i^j are the actual and the desired (target) outputs of the i-th output neuron, respectively, and the index j denotes the particular learning pattern. The vector W is composed of all weights in the net [4].
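Under the same notation, the error function can be evaluated as below; the two toy patterns and targets are made-up values for illustration.

```python
import numpy as np

def error_function(outputs, targets):
    """E(W) = 1/2 * sum_j sum_i (O^j_i - t^j_i)^2, where j indexes
    training patterns and i indexes output neurons."""
    outputs = np.asarray(outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return 0.5 * np.sum((outputs - targets) ** 2)

# Toy example: two patterns, two output neurons each
E = error_function([[0.9, 0.1], [0.2, 0.8]], [[1.0, 0.0], [0.0, 1.0]])
```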
From an optimization point of view, learning in a neural network is equivalent to minimizing a global error function, which is a multivariate function that depends on the weights in the network. This perspective gives some advantages in the development of effective learning algorithms, because the problem of minimizing a function is well known in other fields of science, such as conventional numerical analysis [8]. Since learning in realistic neural network applications often involves the adjustment of several thousand weights, only optimization methods that are applicable to large-scale problems are relevant as alternative learning algorithms. The general opinion in the numerical analysis community is that only one class of optimization methods is able to handle large-scale problems in an effective way: these are often referred to as the Conjugate Gradient (CG) methods [8]. Several conjugate gradient algorithms have recently been introduced as learning algorithms in neural networks [5,9,10].

2. CONJUGATE GRADIENT METHODS
Conjugate gradient methods are probably the most famous iterative methods for efficiently training neural networks, due to their simplicity, numerical efficiency, and very low memory requirements. These methods generate a sequence of weights {w_k} using the iterative formula

w_{k+1} = w_k + α_k d_k,

where k is the current iteration (usually called an epoch), α_k > 0 is the learning rate, and d_k is a descent search direction (by descent, we mean g_k^T d_k < 0, where g_k denotes the gradient of the error function at w_k), generated by

d_1 = -g_1,   d_{k+1} = -g_{k+1} + β_k d_k,   (4)

where β_k is a scalar. Conjugate gradient methods differ in their way of defining the multiplier β_k. The most famous approaches were proposed by Fletcher-Reeves (FR) and Polak-Ribière (PR):

β_k^FR = ‖g_{k+1}‖² / ‖g_k‖²,   β_k^PR = g_{k+1}^T y_k / ‖g_k‖²,

where y_k = g_{k+1} - g_k. Methods using the FR update were shown to be globally convergent [8]; however, the corresponding methods using the PR or HS update are generally more efficient, even without satisfying the global convergence property. In the convergence analysis and implementation of CG methods, one often requires an inexact line search, such as the Wolfe line search. The standard Wolfe line search requires α_k to satisfy

f(w_k + α_k d_k) ≤ f(w_k) + ρ α_k g_k^T d_k,   (6)
g(w_k + α_k d_k)^T d_k ≥ σ g_k^T d_k,   (7)

with 0 < ρ < σ < 1, while the strong Wolfe line search replaces (7) by

|g(w_k + α_k d_k)^T d_k| ≤ -σ g_k^T d_k.

Moreover, an important issue of CG algorithms is that when the search direction (4) fails to be a descent direction, we restart the algorithm using the negative gradient direction to guarantee convergence. A more sophisticated and popular restarting criterion is the Powell restart

|g_{k+1}^T g_k| ≥ 0.2 ‖g_{k+1}‖²,   (10)

where ‖·‖ denotes the Euclidean norm. Another important issue for CG methods is that the search directions generated from equation (4) are conjugate if the objective function is convex quadratic and the line search is exact, i.e.

d_i^T G d_j = 0,   i ≠ j,   (11)

where G is the Hessian matrix of the objective function. The conjugacy condition given in (11) can be replaced [12] by the equation

y_k^T d_{k+1} = 0,   (12)

which is called pure conjugacy. [13] showed that if the line search is not exact, the condition in (12) can be written as

y_k^T d_{k+1} = -t g_{k+1}^T s_k,   t ≥ 0,

where s_k = w_{k+1} - w_k.
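One CG direction update with the FR and PR multipliers, a Powell restart test, and a descent safeguard can be sketched as follows; the gradient vectors below are toy values, not produced by any network.

```python
import numpy as np

def beta_fr(g_new, g_old):
    # Fletcher-Reeves: beta = ||g_{k+1}||^2 / ||g_k||^2
    return g_new @ g_new / (g_old @ g_old)

def beta_pr(g_new, g_old):
    # Polak-Ribiere: beta = g_{k+1}^T (g_{k+1} - g_k) / ||g_k||^2
    return g_new @ (g_new - g_old) / (g_old @ g_old)

def next_direction(g_new, g_old, d_old, beta=beta_pr):
    """d_{k+1} = -g_{k+1} + beta_k d_k, restarted along -g_{k+1} when the
    Powell test |g_{k+1}^T g_k| >= 0.2 ||g_{k+1}||^2 fires or when the new
    direction fails to be a descent direction."""
    if abs(g_new @ g_old) >= 0.2 * (g_new @ g_new):
        return -g_new                      # Powell restart
    d = -g_new + beta(g_new, g_old) * d_old
    if g_new @ d >= 0:                     # not a descent direction
        return -g_new
    return d

# Toy gradients at two consecutive iterates
g_old = np.array([1.0, -2.0])
g_new = np.array([0.1, 0.05])
d = next_direction(g_new, g_old, d_old=-g_old)
```

The resulting direction satisfies the descent condition g_{k+1}^T d_{k+1} < 0 by construction.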

3. SCALED CONJUGATE GRADIENT ALGORITHMS (SCG)
This type of algorithm assumes a more general form of the CG search direction:

d_1 = -g_1,   d_{k+1} = -θ_{k+1} g_{k+1} + β_k s_k,   (16)

where θ_{k+1} is a scalar and s_k = w_{k+1} - w_k. If θ_{k+1} is the inverse of the Hessian matrix, or an approximation of it, we get the Newton or quasi-Newton (QN) algorithms, respectively [14], while θ_{k+1} = 1 recovers the classical CG algorithm according to the value of β_k. Therefore, we see that in the general case the scaled direction (16) represents a combination between the QN and CG methods. Different scaled CG methods have been introduced [14], for example the scaled Fletcher-Reeves (SFR) and scaled Polak-Ribière (SPR) methods.

3.1. New Scaled CG Method (N1SCG)
Abbo and Mohammed in [6] suggested a new CG algorithm with multiplier β_k^NA based on Aitken's process; in this section we generalize that method to the more general form known as scaled conjugate gradient methods. Consider the search direction of the form

d_{k+1} = -θ_{k+1} g_{k+1} + β_k^NA s_k.   (19)

If we multiply both sides of equation (19) by y_k, we get

y_k^T d_{k+1} = -θ_{k+1} y_k^T g_{k+1} + β_k^NA y_k^T s_k.   (20)

By using the pure conjugacy condition (12), i.e. y_k^T d_{k+1} = 0, we get

θ_{k+1} = β_k^NA (y_k^T s_k) / (y_k^T g_{k+1}).   (21)

To avoid division by zero, we define θ_{k+1} by a safeguarded form of (21).   (22)

Then the search direction for the new scaled conjugate gradient (N1SCG) algorithm can be written as

d_{k+1} = -θ_{k+1} g_{k+1} + β_k^NA s_k,  with θ_{k+1} given by (22).   (23)

We summarize our scaled conjugate gradient (N1SCG) algorithm as follows.
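The derivation above can be checked numerically: imposing y_k^T d_{k+1} = 0 on the scaled direction yields the scaling parameter θ_{k+1}. In the sketch below, the vectors are toy values and β^NA is passed in as a given number, since its Aitken-based formula from [6] is not reproduced here; the eps safeguard is likewise only an illustrative stand-in for equation (22).

```python
import numpy as np

def n1scg_direction(g_new, s, y, beta_na, eps=1e-10):
    """Sketch of the N1SCG direction d_{k+1} = -theta g_{k+1} + beta s_k,
    with theta chosen from the pure conjugacy condition y_k^T d_{k+1} = 0:
        theta = beta * (y_k^T s_k) / (y_k^T g_{k+1}).
    eps is an illustrative safeguard against division by zero."""
    denom = y @ g_new
    theta = beta_na * (y @ s) / (denom if abs(denom) > eps else eps)
    return theta, -theta * g_new + beta_na * s

g_new = np.array([0.3, -0.4])  # toy gradient g_{k+1}
s = np.array([0.5, 0.2])       # s_k = w_{k+1} - w_k
y = np.array([0.7, -0.1])      # y_k = g_{k+1} - g_k
theta, d = n1scg_direction(g_new, s, y, beta_na=0.8)
```

When the safeguard does not fire, the returned direction satisfies pure conjugacy y_k^T d_{k+1} = 0 exactly.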

The Descent Property of the Suggested Algorithm
In this section, we show that our new scaled conjugate gradient (N1SCG) algorithm satisfies the descent property under the standard Wolfe conditions, as stated in the following theorem:

Theorem (3.1)
Consider the N1SCG method, where the learning rate α_k satisfies the standard Wolfe conditions (6) and (7). Then the search direction d_{k+1} generated by the method is a descent direction, i.e. g_{k+1}^T d_{k+1} < 0. The proof proceeds by induction on k, using the second Wolfe condition (7), which guarantees y_k^T s_k > 0.

4. EXPERIMENTAL RESULTS
In this section, we present experimental results in order to evaluate the performance of our proposed N1SCG on two problems: the iris classification problem and a continuous function approximation problem. The implementation code was written in Matlab 7.9, based on the SCG code of Birgin and Martínez [15]. All methods are implemented with the line search proposed in CONMIN [16], which employs various polynomial interpolation schemes and safeguards in satisfying the strong Wolfe line search conditions. The heuristic parameters were set as ρ = 10⁻⁴ and σ = 0.5, as in [10]. All networks received the same sequence of input patterns, and the initial weights were generated using the Nguyen-Widrow method [17]. The results have been averaged over 500 simulations.
The cumulative total for a performance metric over all simulations is not very informative, since a small number of simulations can dominate these results. For this reason, we use the performance profiles proposed by Dolan and Moré [18], which present perhaps the most complete information in terms of robustness, efficiency, and solution quality. A performance profile plots the fraction P of simulations for which a given method is within a factor x of the best training method: the value at x = 1 gives the percentage of simulations for which a method is the fastest (efficiency), while the limiting value on the vertical axis (P) gives the percentage of simulations in which the neural networks were successfully trained by each method (robustness). The reported performance profiles have been created using the Libopt environment [19], measuring the efficiency and the robustness of our method in terms of computational time (CPU time) and epochs. The curves in the following figures have the following meaning:
• "N1SCG" stands for the proposed scaled CG method.
• "SPR" stands for the Scaled Polak-Ribière CG method.
• "SFR" stands for the Scaled Fletcher-Reeves CG method.
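The Dolan-Moré performance profile used here can be computed as in the following sketch; the cost matrix is a made-up toy, not the paper's experimental data.

```python
import numpy as np

def performance_profile(costs, taus):
    """costs[i, m]: cost (e.g. epochs or CPU time) of method m on run i;
    np.inf marks a failed run. Returns P[t, m]: fraction of runs where
    method m is within a factor taus[t] of the best method on that run."""
    best = np.min(costs, axis=1, keepdims=True)   # best cost per run
    ratios = costs / best                          # performance ratios
    return np.array([[np.mean(ratios[:, m] <= t)
                      for m in range(costs.shape[1])] for t in taus])

# Toy costs for 3 runs and 2 methods; method 1 fails on the last run
costs = np.array([[10.0, 12.0],
                  [30.0, 15.0],
                  [8.0, np.inf]])
P = performance_profile(costs, taus=[1.0, 2.0])
```

Reading the result: P at tau = 1 is the fraction of runs where a method was fastest (efficiency), and P at large tau tends to the fraction of runs the method solved at all (robustness).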

Iris Classification Problem
This benchmark is perhaps the best known in the pattern recognition literature [20]. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. The network architecture consists of 1 hidden layer with 7 neurons and an output layer of 3 neurons. The training goal was set to E_G ≤ 0.01 within a limit of 1000 epochs, and all networks were tested using 10-fold cross-validation [10].

Continuous Function Approximation Problem
The network architecture for this problem is a 1-15-1 FNN; the network is trained until the sum of the squares of the errors becomes less than the error goal 0.001. The network is based on hidden neurons with logistic activations and biases and on a linear output neuron with bias.

CONCLUSIONS
It can be seen that when the scaling parameter θ contains two positive terms, it avoids a small multiplier for the gradient vector and hence maintains the descent property and performs better than a scaling parameter with only one term.





Volume 10, Issue 3, March 2015, pp. 230-241, ISSN 1992-0849. Web Site: www.kujss.com Email: kirkukjoursci@yahoo.com, kirkukjoursci@gmail.com

The main objective of this work is to find a new and efficient scaled conjugate gradient method with a search direction d_{k+1} having the simple form (16). For this purpose, we use the pure conjugacy condition (12) and β_k^NA.
Step 1. Initialization. Choose an initial weight vector w_1, compute g_1, and set d_1 = -g_1, k = 1.
Step 2. Convergence test. If the stopping criterion is satisfied, stop. Else go to Step 3.
Step 3. Line search. Compute α_k satisfying the Wolfe line search conditions (6) and (7), and set w_{k+1} = w_k + α_k d_k.
Step 4. Conjugate gradient parameter computation. Compute β_k^NA from equation (14).
Step 5. Direction computation. Compute θ_{k+1} from (22) and the new search direction d_{k+1} from (23); set k = k + 1 and go to Step 2.
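The step sequence above can be sketched as a loop. This is only an illustration under stated assumptions: a toy quadratic objective replaces the network error function, a simple Armijo backtracking search stands in for the full Wolfe line search, and the Polak-Ribière multiplier stands in for β^NA, whose Aitken-based formula from [6] is not reproduced here.

```python
import numpy as np

def train_sketch(f, grad, w, max_epochs=2000, tol=1e-8):
    """Loop sketch of the N1SCG steps: convergence test, line search,
    parameter computation, and scaled direction computation."""
    g = grad(w)
    d = -g
    for epoch in range(max_epochs):
        if np.linalg.norm(g) <= tol:            # Step 2: convergence test
            return w, epoch
        alpha = 1.0                              # Step 3: Armijo backtracking
        while f(w + alpha * d) > f(w) + 1e-4 * alpha * (g @ d):
            alpha *= 0.5
        w_new = w + alpha * d
        g_new = grad(w_new)
        s, y = w_new - w, g_new - g
        beta = g_new @ y / (g @ g)               # Step 4: PR stand-in for beta^NA
        denom = y @ g_new
        theta = beta * (y @ s) / denom if abs(denom) > 1e-12 else 1.0
        d_new = -theta * g_new + beta * s        # Step 5: scaled direction
        if g_new @ d_new >= 0:                   # restart if not descent
            d_new = -g_new
        w, g, d = w_new, g_new, d_new
    return w, max_epochs

# Toy strongly convex quadratic in place of a network error function
f = lambda w: 0.5 * (3 * w[0] ** 2 + w[1] ** 2)
grad = lambda w: np.array([3 * w[0], w[1]])
w_star, epochs = train_sketch(f, grad, np.array([2.0, -1.5]))
```

On this toy quadratic, the monotone Armijo decrease together with the descent safeguard drives the iterates toward the minimizer at the origin.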

Figure (1): Performance profiles for the iris classification problem.
Figure (1) presents the performance profiles for the iris classification problem, regarding both performance metrics. N1SCG exhibits the best performance in terms of efficiency and robustness, significantly outperforming the scaled training methods SPR and SFR. Furthermore, the performance profiles show that N1SCG is the only method reporting excellent (100%) probability of being the optimal training method.

Figure (2): Performance profiles for the continuous function approximation problem.
Figure (2) shows the performance profiles for the continuous function approximation problem, investigating the efficiency and robustness of each training method. Clearly, our proposed method N1SCG significantly outperforms the scaled conjugate gradient methods SPR and SFR, since its curve lies above the curves of the latter for both performance metrics. More specifically, the performance profiles show that the probability of N1SCG successfully training a neural network within a factor 3.41 of the best solver is 94%, in contrast with SPR and SFR, which have probabilities 84.3% and 85%, respectively.