Steepest descent has a bad reputation among researchers in optimization, and the literature (e.g., [Gill:81a]) offers a wide selection of more appropriate optimization techniques. Following the ``decision tree'' and considering the characteristics of large supervised learning problems (large memory requirements and time-consuming calculations of the energy and the gradient), the Broyden-Fletcher-Goldfarb-Shanno one-step memoryless quasi-Newton method (all adjectives are necessary to define it) emerged as a good candidate and performed very efficiently on different problems.
Let's define the following vectors: $\mathbf{g}_n \equiv \nabla E(\mathbf{w}_n)$ (the gradient at the current point), $\mathbf{p}_n \equiv \mathbf{w}_n - \mathbf{w}_{n-1}$ (the last change of the weights), and $\mathbf{y}_n \equiv \mathbf{g}_n - \mathbf{g}_{n-1}$ (the last change of the gradient). The one-dimensional search direction for the $n$th iteration is a modification of the gradient $\mathbf{g}_n$, as follows:
\[
\mathbf{d}_n = -\mathbf{g}_n + A_n\,\mathbf{p}_n + B_n\,\mathbf{y}_n .
\]
Every $N$ steps ($N$ being the number of weights in the network), the search is restarted in the direction of the negative gradient.
The coefficients $A_n$ and $B_n$ are combinations of scalar products:
\[
A_n = -\left(1 + \frac{\mathbf{y}_n\cdot\mathbf{y}_n}{\mathbf{p}_n\cdot\mathbf{y}_n}\right)
      \frac{\mathbf{p}_n\cdot\mathbf{g}_n}{\mathbf{p}_n\cdot\mathbf{y}_n}
      + \frac{\mathbf{y}_n\cdot\mathbf{g}_n}{\mathbf{p}_n\cdot\mathbf{y}_n},
\qquad
B_n = \frac{\mathbf{p}_n\cdot\mathbf{g}_n}{\mathbf{p}_n\cdot\mathbf{y}_n}.
\]
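For concreteness, the direction update might be coded along the following lines (a minimal sketch in Python, with illustrative names of our own; all vectors are assumed to be one-dimensional NumPy arrays, and the one-dimensional minimization along the returned direction is performed separately):
\begin{verbatim}
def oss_direction(g, g_prev, w, w_prev, n_iter, N):
    """One-step memoryless BFGS search direction (illustrative sketch).

    g, g_prev : gradients at the current and previous points
    w, w_prev : current and previous weight vectors
    n_iter    : iteration counter
    N         : number of weights in the network
    """
    # Restart with the negative gradient every N steps
    # (and on the first step, when no previous point exists).
    if n_iter % N == 0:
        return -g

    p = w - w_prev          # last change of the weights
    y = g - g_prev          # last change of the gradient
    py = p @ y              # scalar product appearing in both coefficients

    # Coefficients A_n and B_n as combinations of scalar products.
    B = (p @ g) / py
    A = -(1.0 + (y @ y) / py) * B + (y @ g) / py

    return -g + A * p + B * y
\end{verbatim}
In practice, a safeguard is also needed when the scalar product $\mathbf{p}_n\cdot\mathbf{y}_n$ is close to zero; restarting with the negative gradient is the usual remedy.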
The one-dimensional minimization used in this work is based on quadratic interpolation and tuned to back-propagation where, in a single step, both the energy value and the negative gradient can be efficiently obtained. Details on this step are contained in [Williams:87b].
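As a rough illustration of the idea (a generic quadratic-interpolation step in our own notation, not necessarily the exact procedure of [Williams:87b]): given the energy $E_0$ and the directional derivative $E_0'$ at the current point, plus the energy $E_1$ at a trial step $\alpha_1$ along the search direction, the minimizer of the fitted parabola has a closed form.
\begin{verbatim}
def quadratic_step(E0, dE0, E1, alpha1):
    """Return the minimizer of the parabola fitted to one point with
    value and slope (E0, dE0 at step 0) and one point with value only
    (E1 at step alpha1) along the search direction."""
    c = (E1 - E0 - dE0 * alpha1) / alpha1 ** 2   # curvature of the fit
    if c <= 0.0:
        # The parabola has no interior minimum; keep or extend the
        # trial step instead.
        return alpha1
    return -dE0 / (2.0 * c)                      # vertex of the parabola
\end{verbatim}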
The computation during each step requires $O(N)$ operations (the same scaling as standard batch back-propagation), while the CPU time for each step increases by an average factor of three for the problems considered. Because the total number of steps needed for convergence is much smaller, we measured a large net benefit in computing time.
Last but not least, this method can be efficiently implemented on MIMD parallel computers.