Higher-Order Corrections to Optimisers based on Newton’s Method (2024)

Stephen Brooks (sbrooks@bnl.gov)

Abstract

The Newton, Gauss–Newton and Levenberg–Marquardt methods all use the first derivative of a vector function (the Jacobian) to minimise its sum of squares. When the Jacobian matrix is ill-conditioned, the function varies much faster in some directions than others and the space of possible improvement in sum of squares becomes a long narrow ellipsoid in the linear model. This means that even a small amount of nonlinearity in the problem parameters can cause a proposed point far down the long axis of the ellipsoid to fall outside of the actual curved valley of improved values, even though it is quite nearby. This paper presents a differential equation that 'follows' these valleys, based on the technique of geodesic acceleration, which itself provides a 2nd order improvement to the Levenberg–Marquardt iteration step. Higher derivatives of this equation are computed that allow $n^{\mathrm{th}}$ order improvements to the optimisation methods to be derived. These higher-order accelerated methods up to 4th order are tested numerically and shown to provide substantial reduction of both number of steps and computation time.

journal: Journal of Computational and Applied Mathematics

Affiliation: Collider–Accelerator Department, Brookhaven National Laboratory, Upton, NY 11973, USA

1 Definitions and Introduction

Consider finding the value of a vector $\mathbf{x}$ such that the vector-valued function $\mathbf{f}(\mathbf{x})=\mathbf{0}$, noting that the input and output of $\mathbf{f}$ might have different dimensions.

Newton's method solves $J(\mathbf{x})(\mathbf{x}_{\mathrm{new}}-\mathbf{x})=-\mathbf{f}(\mathbf{x})$, where $J$ is the Jacobian matrix. The Gauss–Newton algorithm generalises this to rectangular $J$ using pseudo-inverses that may be calculated using Singular Value Decomposition (SVD). The Levenberg–Marquardt algorithm [1, 2] introduces a damping factor into this pseudo-inverse, which allows progress along 'easier' directions without having to go far in 'difficult' directions that may exhibit nonlinearity.

The remainder of this paper will be written for the simpler Newton method, where $J^{-1}$ is the inverse of the square Jacobian matrix. However, the algorithms derived also work for the Gauss–Newton pseudo-inverse $[J^{-1}]_{GN}=(J^{T}J)^{-1}J^{T}$ and the damped Levenberg–Marquardt version $[J^{-1}]_{LM(\lambda)}=(J^{T}J+\lambda I)^{-1}J^{T}$. The latter is used in the numerical tests.
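For concreteness, these three steps differ only in the (pseudo-)inverse applied to $-\mathbf{f}$. A minimal sketch in Python/NumPy (illustrative only; the Jacobian `J`, residual `f` and damping `lam` are supplied by the surrounding optimiser):

```python
import numpy as np

def newton_step(J, f):
    # Square J: solve J (x_new - x) = -f directly.
    return np.linalg.solve(J, -f)

def gauss_newton_step(J, f):
    # Rectangular J: least-squares pseudo-inverse (J^T J)^{-1} J^T,
    # computed stably via an SVD-based least-squares solver.
    return np.linalg.lstsq(J, -f, rcond=None)[0]

def levenberg_marquardt_step(J, f, lam):
    # Damped pseudo-inverse (J^T J + lambda I)^{-1} J^T.
    JTJ = J.T @ J
    return np.linalg.solve(JTJ + lam * np.eye(JTJ.shape[0]), -J.T @ f)
```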

One common source of slow convergence is that $J$ is ill-conditioned, so the optimisation valley is much narrower in some directions than others, while the problem contains some nonlinearity, which may seem small but is amplified once coordinates are changed to make $J$ well-conditioned. Even a small amount of nonlinearity can then make the long narrow valley stop overlapping with its approximation in the linear model. This reduced range of validity of the linear model means many small steps have to be taken.

2 Natural Optimisation Pathway

The goal of the Newton step is to reduce the error vector $\mathbf{f}$, ideally to zero. For a nonlinear function, the optimisation follows a curved pathway [3, 4], and one natural such pathway is $\mathbf{x}(t)$ defined implicitly by

$$\mathbf{f}(\mathbf{x}(t))=(1-t)\,\mathbf{f}(\mathbf{x}(0))$$

for t[0,1]𝑡01t\in[0,1]italic_t ∈ [ 0 , 1 ]. This scales down all components of the error equally and at t=1𝑡1t=1italic_t = 1 it reaches the true solution.

Taking the first derivative of this equation gives

$$\sum_{i}\partial_{i}\mathbf{f}(\mathbf{x}(t))\,\dot{x}_{i}(t)=J(\mathbf{x}(t))\,\dot{\mathbf{x}}(t)=-\mathbf{f}(\mathbf{x}(0))=-\frac{\mathbf{f}(\mathbf{x}(t))}{1-t},$$

which at $t=0$ makes $\dot{\mathbf{x}}$ equal to the Newton step, and to a scaling of it for all $0<t<1$. So this pathway is always tangent to the Newton step direction and corresponds to the limit of a Newton algorithm run with steps scaled down to be infinitesimally small.
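This limit can be made concrete with a short sketch (assumptions: user-supplied `f` and `jac` callables and plain Euler integration; this is for illustration only and is far more expensive than the corrected steps derived later):

```python
import numpy as np

def follow_natural_pathway(f, jac, x0, steps=1000):
    """Euler-integrate J(x(t)) dx/dt = -f(x(0)) from t = 0 to t = 1,
    i.e. the natural pathway traced by many tiny scaled Newton steps."""
    x = np.asarray(x0, dtype=float).copy()
    f0 = f(x)
    dt = 1.0 / steps
    for _ in range(steps):
        x += dt * np.linalg.solve(jac(x), -f0)
    return x  # approximately solves f(x) = 0
```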

3 Higher-Order Derivatives

If the pathway curves, one may wonder whether longer steps can be taken once the curvature is taken into account. The second and higher derivatives of the equation defining the natural pathway have the form

$$\frac{\mathrm{d}^{n}}{\mathrm{d}t^{n}}\mathbf{f}(\mathbf{x}(t))=\mathbf{0}$$

for $n\geq 2$. Multiple derivatives of a function composition ($\mathbf{f}\circ\mathbf{x}$ here) are given by Faà di Bruno's formula [7, 8, 9]

$$\frac{\mathrm{d}^{n}}{\mathrm{d}t^{n}}\mathbf{f}(\mathbf{x}(t))=\sum_{\pi\in\Pi_{n}}\mathbf{f}^{(|\pi|)}(\mathbf{x}(t))\bigotimes_{p\in\pi}\mathbf{x}^{(|p|)}(t),$$

where $\Pi_{n}$ is the set of all partitions of $\{1,2,\ldots,n\}$. The $d^{\mathrm{th}}$ derivative of the vector function $\mathbf{f}$ is a tensor that takes $d$ vectors as input and outputs a vector, with elements defined by

$$f^{(d)}(\mathbf{x})^{i}_{j_{1}j_{2}\ldots j_{d}}=\frac{\partial^{d}f_{i}(\mathbf{x})}{\partial x_{j_{1}}\,\partial x_{j_{2}}\cdots\partial x_{j_{d}}}.$$

Note that $\mathbf{f}^{(1)}=J$. This paper will adopt a compact notation where tensor products of vectors $\mathbf{u}\otimes\mathbf{v}\otimes\mathbf{w}$ will be written $\mathbf{u}\mathbf{v}\mathbf{w}$, so that $(\mathbf{u}\mathbf{v}\mathbf{w})_{ijk}=u_{i}v_{j}w_{k}$. These may be contracted with the derivative tensor to give a vector written in the form $\mathbf{f}^{(3)}\mathbf{u}\mathbf{v}\mathbf{w}$, where $(\mathbf{f}^{(3)}\mathbf{u}\mathbf{v}\mathbf{w})_{n}=\sum_{i,j,k}f^{(3)n}_{ijk}u_{i}v_{j}w_{k}$.

3.1 Second Order

For $n=2$, $\Pi_{2}=\{\{\{1\},\{2\}\},\{\{1,2\}\}\}$ and

$$\frac{\mathrm{d}^{2}}{\mathrm{d}t^{2}}\mathbf{f}(\mathbf{x}(t))=\mathbf{f}^{(2)}(\mathbf{x}(t))\dot{\mathbf{x}}(t)\dot{\mathbf{x}}(t)+\mathbf{f}^{(1)}(\mathbf{x}(t))\ddot{\mathbf{x}}(t)=\mathbf{0}$$
$$\Rightarrow\qquad\ddot{\mathbf{x}}(t)=-J^{-1}(\mathbf{x}(t))\,\mathbf{f}^{(2)}(\mathbf{x}(t))\dot{\mathbf{x}}(t)\dot{\mathbf{x}}(t).$$

This agrees with the well-known [3, 4, 5, 6] quadratic acceleration term for Levenberg–Marquardt if the $J^{-1}$ is replaced by a damped pseudo-inverse $[J^{-1}]_{LM(\lambda)}$.

3.2 Third Order

For conciseness, $\mathbf{x}$ and its derivatives will be evaluated at $t=0$ unless otherwise stated, and $\mathbf{f}$ and its derivatives at $\mathbf{x}$. For $n=3$,

$$\Pi_{3}=\{\{\{1\},\{2\},\{3\}\},\;\{\{1\},\{2,3\}\},\;\{\{2\},\{1,3\}\},\;\{\{3\},\{1,2\}\},\;\{\{1,2,3\}\}\}$$

and

$$\frac{\mathrm{d}^{3}}{\mathrm{d}t^{3}}\mathbf{f}=\mathbf{f}^{(3)}\dot{\mathbf{x}}\dot{\mathbf{x}}\dot{\mathbf{x}}+3\mathbf{f}^{(2)}\dot{\mathbf{x}}\ddot{\mathbf{x}}+\mathbf{f}^{(1)}\mathbf{x}^{(3)}=\mathbf{0}.$$

This gives the third derivative of $\mathbf{x}$ as

$$\mathbf{x}^{(3)}=-J^{-1}(\mathbf{f}^{(3)}\dot{\mathbf{x}}\dot{\mathbf{x}}\dot{\mathbf{x}}+3\mathbf{f}^{(2)}\dot{\mathbf{x}}\ddot{\mathbf{x}}).$$

3.3 Fourth Order, Recurrence and General Case

Higher-order expressions can be obtained either from the set partitions $\Pi_{n}$ or from the equivalent differentiation chain and product rules that obtain $\frac{\mathrm{d}^{n+1}}{\mathrm{d}t^{n+1}}\mathbf{f}$ from $\frac{\mathrm{d}^{n}}{\mathrm{d}t^{n}}\mathbf{f}$. The formulae

$$\frac{\mathrm{d}}{\mathrm{d}t}\mathbf{f}^{(n)}=\mathbf{f}^{(n+1)}\dot{\mathbf{x}}\qquad\text{and}\qquad\frac{\mathrm{d}}{\mathrm{d}t}\mathbf{x}^{(n)}=\mathbf{x}^{(n+1)}$$

together with the product rule are enough to generate the full sequence. Starting from $n=2$,

$$\begin{aligned}
\mathbf{f}^{(2)}\dot{\mathbf{x}}\dot{\mathbf{x}}+\mathbf{f}^{(1)}\ddot{\mathbf{x}}&=\mathbf{0}\\
\mathbf{f}^{(3)}\dot{\mathbf{x}}\dot{\mathbf{x}}\dot{\mathbf{x}}+3\mathbf{f}^{(2)}\dot{\mathbf{x}}\ddot{\mathbf{x}}+\mathbf{f}^{(1)}\mathbf{x}^{(3)}&=\mathbf{0}\\
\mathbf{f}^{(4)}\dot{\mathbf{x}}\dot{\mathbf{x}}\dot{\mathbf{x}}\dot{\mathbf{x}}+6\mathbf{f}^{(3)}\dot{\mathbf{x}}\dot{\mathbf{x}}\ddot{\mathbf{x}}+4\mathbf{f}^{(2)}\dot{\mathbf{x}}\mathbf{x}^{(3)}+3\mathbf{f}^{(2)}\ddot{\mathbf{x}}\ddot{\mathbf{x}}+\mathbf{f}^{(1)}\mathbf{x}^{(4)}&=\mathbf{0}
\end{aligned}$$

and so on. A computer algebra system can generate these terms based on a rule like

$$\frac{\mathrm{d}}{\mathrm{d}t}\,\mathbf{f}^{(n)}\mathbf{x}^{(a)}\mathbf{x}^{(b)}\mathbf{x}^{(c)}=\mathbf{f}^{(n+1)}\mathbf{x}^{(1)}\mathbf{x}^{(a)}\mathbf{x}^{(b)}\mathbf{x}^{(c)}+\mathbf{f}^{(n)}\mathbf{x}^{(a+1)}\mathbf{x}^{(b)}\mathbf{x}^{(c)}+\mathbf{f}^{(n)}\mathbf{x}^{(a)}\mathbf{x}^{(b+1)}\mathbf{x}^{(c)}+\mathbf{f}^{(n)}\mathbf{x}^{(a)}\mathbf{x}^{(b)}\mathbf{x}^{(c+1)}$$

and collecting like terms, for example by sorting the $\mathbf{x}$ derivatives in increasing order.
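A minimal sketch of such a generator (written here for exposition, not taken from the author's code) stores each term as a coefficient keyed by the order of $\mathbf{f}^{(n)}$ and a sorted tuple of $\mathbf{x}$-derivative orders:

```python
from collections import defaultdict

def differentiate(terms):
    """Apply d/dt once to {(n, x_orders): coeff}, where a key represents
    f^(n) contracted with x^(a), x^(b), ... for (a, b, ...) = x_orders."""
    out = defaultdict(int)
    for (n, orders), coeff in terms.items():
        # chain rule on f^(n): gain an extra x^(1) factor
        out[(n + 1, tuple(sorted(orders + (1,))))] += coeff
        # product rule: raise each x^(k) factor to x^(k+1) in turn
        for i, k in enumerate(orders):
            raised = orders[:i] + (k + 1,) + orders[i + 1:]
            out[(n, tuple(sorted(raised)))] += coeff
    return dict(out)

terms = {(1, (1,)): 1}          # d/dt f = f^(1) x^(1)
for _ in range(3):
    terms = differentiate(terms)
print(terms)  # coefficients 1, 6, 4, 3, 1: matches the n = 4 line above
```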

The highest derivative $\mathbf{x}^{(n)}$ may be moved to the other side to get a formula like

$$\mathbf{x}^{(4)}=-J^{-1}(\mathbf{f}^{(4)}\dot{\mathbf{x}}\dot{\mathbf{x}}\dot{\mathbf{x}}\dot{\mathbf{x}}+6\mathbf{f}^{(3)}\dot{\mathbf{x}}\dot{\mathbf{x}}\ddot{\mathbf{x}}+4\mathbf{f}^{(2)}\dot{\mathbf{x}}\mathbf{x}^{(3)}+3\mathbf{f}^{(2)}\ddot{\mathbf{x}}\ddot{\mathbf{x}}),$$

shown here for the $n=4$ case, which expresses it in terms of lower derivatives of $\mathbf{x}$.

4 Taking Finite Steps

The derivatives $\mathbf{x}^{(n)}$ calculated above can produce a corrected higher-order step using the Taylor series of $\mathbf{x}$ around $t=0$:

$$\mathbf{x}(\epsilon)=\sum_{n=0}^{\infty}\frac{1}{n!}\epsilon^{n}\mathbf{x}^{(n)},$$

where the step is thought of as stopping at a time $t=\epsilon$ in the parameterisation of the natural pathway. This unknown $\epsilon$ may seem like a problem, but it can be made to cancel. Define the correction at order $n$ to be the $n^{\mathrm{th}}$ term of the Taylor series:

$$\mathbf{c}_{n}=\frac{1}{n!}\epsilon^{n}\mathbf{x}^{(n)}.$$

The step begins at $\mathbf{c}_{0}=\mathbf{x}$ and the first-order uncorrected step ends at $\mathbf{c}_{0}+\mathbf{c}_{1}$, so has length $|\mathbf{c}_{1}|$. Now recall that for $n\geq 2$, the derivatives of $\mathbf{f}\circ\mathbf{x}$ are zero, and use Faà di Bruno's formula as before:

$$\frac{\mathrm{d}^{n}}{\mathrm{d}t^{n}}\mathbf{f}(\mathbf{x}(t))=\sum_{\pi\in\Pi_{n}}\mathbf{f}^{(|\pi|)}(\mathbf{x}(t))\bigotimes_{p\in\pi}\mathbf{x}^{(|p|)}(t)=\mathbf{0}.$$

Multiplying both sides by $\epsilon^{n}$ gives

$$\sum_{\pi\in\Pi_{n}}\mathbf{f}^{(|\pi|)}(\mathbf{x}(t))\bigotimes_{p\in\pi}\epsilon^{|p|}\mathbf{x}^{(|p|)}(t)=\mathbf{0},$$

using the fact that $\pi$ is a partition of $\{1,2,\ldots,n\}$, so the sizes $|p|$ of all its elements sum to $n$. Noting that $\epsilon^{n}\mathbf{x}^{(n)}=n!\,\mathbf{c}_{n}$ and evaluating at $t=0$ gives

$$\sum_{\pi\in\Pi_{n}}\mathbf{f}^{(|\pi|)}\bigotimes_{p\in\pi}|p|!\,\mathbf{c}_{|p|}=\mathbf{0}.$$

This formula is the basis for calculating the corrections $\mathbf{c}_{n}$ for finite steps in the following sections.
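As a sketch of how this recurrence can be solved mechanically (illustrative Python; `f_deriv(k, vecs)` is an assumed oracle returning $\mathbf{f}^{(k)}$ contracted with the given vectors, and `J_inv` applies the chosen (pseudo-)inverse to a vector):

```python
import math

def set_partitions(elems):
    """Yield all partitions of a list as lists of blocks."""
    if not elems:
        yield []
        return
    first, rest = elems[0], elems[1:]
    for part in set_partitions(rest):
        yield [[first]] + part                       # first in its own block
        for i in range(len(part)):                   # or joined to a block
            yield part[:i] + [part[i] + [first]] + part[i + 1:]

def correction(n, c, f_deriv, J_inv):
    """Solve the partition identity above for c_n, given c[1..n-1]."""
    rhs = 0
    for part in set_partitions(list(range(n))):
        if len(part) == 1:
            continue                                 # the f^(1) n! c_n term being solved for
        vecs = [math.factorial(len(p)) * c[len(p)] for p in part]
        rhs = rhs + f_deriv(len(part), vecs)
    return -J_inv(rhs) / math.factorial(n)
```

For $n=2$ this returns $-\frac{1}{2}J^{-1}\mathbf{f}^{(2)}\mathbf{c}_{1}\mathbf{c}_{1}$, and for $n=3$ it reproduces the $\mathbf{c}_{3}$ formula derived below, matching the hand-worked expressions.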

4.1 The Meaning of $\epsilon$

Observant readers might have noticed that $\mathbf{c}_{1}=\epsilon\dot{\mathbf{x}}$ and, from an earlier section, $\dot{\mathbf{x}}=-J^{-1}\mathbf{f}$, so taking a full Newton step would imply $\epsilon=1$. This paper treats $\epsilon$ as a small value because, when experiencing slow convergence from the 'narrow curving valleys' problem, the region of validity of the local linear model (the trust region) is much smaller than the distance to the model minimum. The steps that succeed in reducing the function sum of squares are then only a fraction of the Newton step, for example a Levenberg–Marquardt step with $\lambda$ chosen large enough to damp away the longest-range movement axes of the exact Newton scheme.

5 Finite Difference Schemes

The higher-order corrections $\mathbf{c}_{n}$ are expressible in terms of multiple directional derivatives of $\mathbf{f}$. For a numerical method, these derivatives must be calculated from function values or, at most, the Jacobian used by the algorithm. In this paper finite difference schemes are used, some of which have their 'stencils' of sampled points spread along multiple axes to give mixed derivatives.

It would also be possible to use an automatic differentiation scheme here, provided it supports higher-order and mixed derivatives. However, because each $\mathbf{c}_{i}$ is $O(\epsilon^{i})$ by definition and is never divided by $\epsilon$, finite difference schemes are not expected to result in a dramatic loss of accuracy in the following algorithms.

5.1 Second Order

For n=2𝑛2n=2italic_n = 2, the general formula gives

$$\mathbf{f}^{(2)}\mathbf{c}_{1}\mathbf{c}_{1}+\mathbf{f}^{(1)}2\mathbf{c}_{2}=\mathbf{0}$$
$$\Rightarrow\qquad\mathbf{c}_{2}=-\tfrac{1}{2}J^{-1}\mathbf{f}^{(2)}\mathbf{c}_{1}\mathbf{c}_{1}.$$

Taylor expansion of $\mathbf{f}$ in the direction $\mathbf{c}_{1}$ of the original uncorrected step gives

$$\mathbf{f}(\mathbf{x}+\mathbf{c}_{1})=\mathbf{f}+J\mathbf{c}_{1}+\tfrac{1}{2}\mathbf{f}^{(2)}\mathbf{c}_{1}\mathbf{c}_{1}+O(\epsilon^{3})$$
$$\Rightarrow\qquad\tfrac{1}{2}\mathbf{f}^{(2)}\mathbf{c}_{1}\mathbf{c}_{1}=\mathbf{f}(\mathbf{x}+\mathbf{c}_{1})-(\mathbf{f}+J\mathbf{c}_{1})+O(\epsilon^{3}).$$

In other words, the difference between $\mathbf{f}(\mathbf{x}+\mathbf{c}_{1})$ and a linear estimate using the $\mathbf{f}$ and $J$ already calculated at $\mathbf{x}(0)$ is, to leading order, a second derivative term similar to the one required for calculating $\mathbf{c}_{2}$. Thus,

$$\mathbf{c}_{2}=-J^{-1}(\mathbf{f}(\mathbf{x}+\mathbf{c}_{1})-(\mathbf{f}+J\mathbf{c}_{1}))+O(\epsilon^{3}).$$

The evaluations required for this calculation are shown in Figure 1. In this case, only one other point besides the evaluations of $\mathbf{f}$ and $J$ at $\mathbf{x}$ is needed.

[Figure 1: evaluation points required for the second-order corrected step.]
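A minimal sketch of the resulting second-order step (assumptions: user-supplied `f` and `jac` callables; the plain Newton inverse is used, where a damped Levenberg–Marquardt pseudo-inverse could be substituted):

```python
import numpy as np

def second_order_step(f, jac, x, eps=0.1):
    """One step x -> x + c1 + c2 along the natural pathway, where eps is
    the fraction of the full Newton step being attempted."""
    f0, J = f(x), jac(x)
    c1 = eps * np.linalg.solve(J, -f0)       # first-order (scaled Newton) step
    f_nl = f(x + c1) - (f0 + J @ c1)         # nonlinear remainder ~ (1/2) f^(2) c1 c1
    c2 = -np.linalg.solve(J, f_nl)           # second-order correction
    return x + c1 + c2
```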

5.2 Third Order

For n=3𝑛3n=3italic_n = 3, the general formula gives

$$\mathbf{f}^{(3)}\mathbf{c}_{1}\mathbf{c}_{1}\mathbf{c}_{1}+3\mathbf{f}^{(2)}\mathbf{c}_{1}2\mathbf{c}_{2}+\mathbf{f}^{(1)}6\mathbf{c}_{3}=\mathbf{0}$$
$$\Rightarrow\qquad\mathbf{c}_{3}=-\tfrac{1}{6}J^{-1}(\mathbf{f}^{(3)}\mathbf{c}_{1}\mathbf{c}_{1}\mathbf{c}_{1}+6\mathbf{f}^{(2)}\mathbf{c}_{1}\mathbf{c}_{2}).$$

There are a few differences from the second order case:

1. There is a third-order derivative $\mathbf{f}^{(3)}\mathbf{c}_{1}\mathbf{c}_{1}\mathbf{c}_{1}$, which will require an additional stencil point in the direction of $\mathbf{c}_{1}$.

2. Errors will now have to be $O(\epsilon^{4})$, as the main terms have size $O(\epsilon^{3})$.

3. There is a mixed derivative $\mathbf{f}^{(2)}\mathbf{c}_{1}\mathbf{c}_{2}$, requiring a two-dimensional stencil pattern.

The mixed derivative requires knowledge of the direction $\mathbf{c}_{2}$, which must be evaluated first. The second-order stencil for $\mathbf{c}_{2}$ had error $O(\epsilon^{3})$ and now $O(\epsilon^{4})$ is needed, so even the lower-order derivative $\mathbf{f}^{(2)}\mathbf{c}_{1}\mathbf{c}_{1}$ will have to be evaluated using a third-order stencil. Fortunately, this stencil is also needed for evaluating $\mathbf{f}^{(3)}\mathbf{c}_{1}\mathbf{c}_{1}\mathbf{c}_{1}$, so all coefficients of a cubic approximation to $\mathbf{f}$ in this direction can be known.

5.2.1 Phase One: Calculating $\mathbf{c}_{2}$

The additional stencil point in the $\mathbf{c}_{1}$ direction is chosen to be $\mathbf{x}+\frac{1}{2}\mathbf{c}_{1}$ here, although other choices are possible. To third order,

$$\mathbf{f}(\mathbf{x}+\tfrac{1}{2}\mathbf{c}_{1})=\mathbf{f}+\tfrac{1}{2}J\mathbf{c}_{1}+\tfrac{1}{8}\mathbf{f}^{(2)}\mathbf{c}_{1}\mathbf{c}_{1}+\tfrac{1}{48}\mathbf{f}^{(3)}\mathbf{c}_{1}\mathbf{c}_{1}\mathbf{c}_{1}+O(\epsilon^{4})$$
$$\mathbf{f}(\mathbf{x}+\mathbf{c}_{1})=\mathbf{f}+J\mathbf{c}_{1}+\tfrac{1}{2}\mathbf{f}^{(2)}\mathbf{c}_{1}\mathbf{c}_{1}+\tfrac{1}{6}\mathbf{f}^{(3)}\mathbf{c}_{1}\mathbf{c}_{1}\mathbf{c}_{1}+O(\epsilon^{4}).$$

Writing the nonlinear part of $\mathbf{f}$ as $\mathbf{f}_{nl}(\mathbf{x}+\mathbf{a})=\mathbf{f}(\mathbf{x}+\mathbf{a})-(\mathbf{f}+J\mathbf{a})$, the derivatives in the $\mathbf{c}_{1}$ direction can be expressed as

$$\mathbf{f}^{(2)}\mathbf{c}_{1}\mathbf{c}_{1}=16\,\mathbf{f}_{nl}(\mathbf{x}+\tfrac{1}{2}\mathbf{c}_{1})-2\,\mathbf{f}_{nl}(\mathbf{x}+\mathbf{c}_{1})+O(\epsilon^{4})$$
$$\mathbf{f}^{(3)}\mathbf{c}_{1}\mathbf{c}_{1}\mathbf{c}_{1}=12\,\mathbf{f}_{nl}(\mathbf{x}+\mathbf{c}_{1})-48\,\mathbf{f}_{nl}(\mathbf{x}+\tfrac{1}{2}\mathbf{c}_{1})+O(\epsilon^{4})$$

and $\mathbf{c}_{2}$ calculated from the formula $\mathbf{c}_{2}=-\frac{1}{2}J^{-1}\mathbf{f}^{(2)}\mathbf{c}_{1}\mathbf{c}_{1}$ with $O(\epsilon^{4})$ error.
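Phase one as a continuing sketch (same assumptions as the second-order code above):

```python
import numpy as np

def third_order_phase_one(f, jac, x, eps=0.1):
    """Return c1, c2 and the directional derivatives along c1,
    all with O(eps^4) error, from stencil points x + c1/2 and x + c1."""
    f0, J = f(x), jac(x)
    c1 = eps * np.linalg.solve(J, -f0)
    f_nl = lambda a: f(x + a) - (f0 + J @ a)  # nonlinear part of f
    nl_half, nl_full = f_nl(0.5 * c1), f_nl(c1)
    f2c1c1 = 16 * nl_half - 2 * nl_full       # f^(2) c1 c1
    f3c1c1c1 = 12 * nl_full - 48 * nl_half    # f^(3) c1 c1 c1
    c2 = -0.5 * np.linalg.solve(J, f2c1c1)
    return c1, c2, f2c1c1, f3c1c1c1
```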

5.2.2 Phase Two: Calculating $\mathbf{c}_{3}$

This step requires the mixed derivative $\mathbf{f}^{(2)}\mathbf{c}_{1}\mathbf{c}_{2}$. Expressions of the form $\mathbf{f}^{(3)}\mathbf{u}\mathbf{v}\mathbf{w}=(\mathbf{u}\cdot\nabla)(\mathbf{v}\cdot\nabla)(\mathbf{w}\cdot\nabla)\mathbf{f}$ are iterated directional derivatives. Each directional derivative can be approximated to leading order as

$$\begin{aligned}
(\mathbf{u}\cdot\nabla)\mathbf{f}(\mathbf{x})&=\frac{\mathbf{f}(\mathbf{x}+\epsilon\mathbf{u})-\mathbf{f}(\mathbf{x})}{\epsilon}+O(\epsilon)\\
\Rightarrow\qquad(\epsilon\mathbf{u}\cdot\nabla)\mathbf{f}(\mathbf{x})&=\mathbf{f}(\mathbf{x}+\epsilon\mathbf{u})-\mathbf{f}(\mathbf{x})+O(\epsilon^{2})\\
\Rightarrow\qquad(\epsilon^{n}\mathbf{u}\cdot\nabla)\mathbf{f}(\mathbf{x})&=\mathbf{f}(\mathbf{x}+\epsilon^{n}\mathbf{u})-\mathbf{f}(\mathbf{x})+O(\epsilon^{2n}).
\end{aligned}$$

In the last formula above, $\epsilon^{n}\mathbf{u}$ represents an $O(\epsilon^{n})$-sized term such as $\mathbf{c}_{n}$. Using this multiple times allows mixed derivatives to be expressed to leading order as combinations of function evaluations at different points (i.e. finite difference stencils). For example,

$$\begin{aligned}
\mathbf{f}^{(2)}\mathbf{c}_{1}\mathbf{c}_{2}&=(\mathbf{c}_{1}\cdot\nabla)(\mathbf{c}_{2}\cdot\nabla)\mathbf{f}\\
&\simeq(\mathbf{c}_{1}\cdot\nabla)(\mathbf{f}(\mathbf{x}+\mathbf{c}_{2})-\mathbf{f}(\mathbf{x}))\\
&\simeq\mathbf{f}(\mathbf{x}+\mathbf{c}_{2}+\mathbf{c}_{1})-\mathbf{f}(\mathbf{x}+\mathbf{c}_{1})-(\mathbf{f}(\mathbf{x}+\mathbf{c}_{2})-\mathbf{f}(\mathbf{x})).
\end{aligned}$$

Here, all terms have size $O(\epsilon^{3})$ and all approximations are accurate to leading order, meaning the error is no worse than $O(\epsilon^{4})$, as required.

In general, a $d^{\mathrm{th}}$ derivative in $d$ different directions would require $2^{d}$ evaluations. An $n$-times repeated derivative in the same direction only requires $n+1$, as some of the evaluation points are coincident. A mixture like $\mathbf{f}^{(a+b+c)}\mathbf{u}^{\otimes a}\mathbf{v}^{\otimes b}\mathbf{w}^{\otimes c}$ would require evaluation at $(a+1)(b+1)(c+1)$ points.

Some efficiencies may be gained from coincident evaluation points and from the fact that the full first derivative $J$ is usually already evaluated at $\mathbf{x}$. This was used in the previous 'second order' section, which required only one additional evaluation point for $\mathbf{f}^{(2)}\mathbf{c}_{1}\mathbf{c}_{1}$ rather than three.

Now everything is in place to evaluate $\mathbf{c}_{3}=-\frac{1}{6}J^{-1}(\mathbf{f}^{(3)}\mathbf{c}_{1}\mathbf{c}_{1}\mathbf{c}_{1}+6\mathbf{f}^{(2)}\mathbf{c}_{1}\mathbf{c}_{2})$. The full step will be corrected from $\mathbf{c}_{1}$ to $\mathbf{c}_{1}+\mathbf{c}_{2}+\mathbf{c}_{3}$. The evaluations required are shown in Figure 2.

[Figure 2: evaluation points required for the third-order corrected step.]
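Combining the two phases gives the full third-order step as a sketch (re-evaluations of `f(x)` and `jac(x)` are kept for clarity; a real implementation would cache them):

```python
import numpy as np

def third_order_step(f, jac, x, eps=0.1):
    """One corrected step x -> x + c1 + c2 + c3."""
    c1, c2, f2c1c1, f3c1c1c1 = third_order_phase_one(f, jac, x, eps)
    # mixed derivative f^(2) c1 c2 from the 2x2 stencil above
    f2c1c2 = f(x + c2 + c1) - f(x + c1) - (f(x + c2) - f(x))
    c3 = -np.linalg.solve(jac(x), f3c1c1c1 + 6 * f2c1c2) / 6.0
    return x + c1 + c2 + c3
```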

5.3 Fourth Order

For n=4𝑛4n=4italic_n = 4, the general formula gives

$$\mathbf{f}^{(4)}\mathbf{c}_{1}\mathbf{c}_{1}\mathbf{c}_{1}\mathbf{c}_{1}+6\mathbf{f}^{(3)}\mathbf{c}_{1}\mathbf{c}_{1}2\mathbf{c}_{2}+4\mathbf{f}^{(2)}\mathbf{c}_{1}6\mathbf{c}_{3}+3\mathbf{f}^{(2)}2\mathbf{c}_{2}2\mathbf{c}_{2}+\mathbf{f}^{(1)}24\mathbf{c}_{4}=\mathbf{0}$$
$$\Rightarrow\qquad\mathbf{c}_{4}=-\tfrac{1}{24}J^{-1}(\mathbf{f}^{(4)}\mathbf{c}_{1}\mathbf{c}_{1}\mathbf{c}_{1}\mathbf{c}_{1}+12\mathbf{f}^{(3)}\mathbf{c}_{1}\mathbf{c}_{1}\mathbf{c}_{2}+24\mathbf{f}^{(2)}\mathbf{c}_{1}\mathbf{c}_{3}+12\mathbf{f}^{(2)}\mathbf{c}_{2}\mathbf{c}_{2}).$$

As expected, there are more higher-order and mixed derivatives. The double derivative $\mathbf{f}^{(2)}\mathbf{c}_{2}\mathbf{c}_{2}$ can take advantage of the Jacobian to eliminate a point from the stencil, just as previous unidirectional derivatives did. The direction $\mathbf{c}_{3}$ is now involved in the derivatives, so three evaluation phases are required. All errors have to be $O(\epsilon^{5})$, including those of $\mathbf{c}_{2}$ and $\mathbf{c}_{3}$.

5.3.1 Phase One: Calculating $\mathbf{c}_{2}$

An additional stencil point $\mathbf{x}+\frac{3}{2}\mathbf{c}_{1}$ will be added to increase the order of accuracy in the $\mathbf{c}_{1}$ direction. Defining $\mathbf{f}_{nl}(\mathbf{x}+\mathbf{a})=\mathbf{f}(\mathbf{x}+\mathbf{a})-(\mathbf{f}+J\mathbf{a})$ as before,

$$\begin{aligned}
\mathbf{f}^{(2)}\mathbf{c}_{1}\mathbf{c}_{1}&\simeq 24\,\mathbf{f}_{nl}(\mathbf{x}+\tfrac{1}{2}\mathbf{c}_{1})-6\,\mathbf{f}_{nl}(\mathbf{x}+\mathbf{c}_{1})+\tfrac{8}{9}\,\mathbf{f}_{nl}(\mathbf{x}+\tfrac{3}{2}\mathbf{c}_{1})\\
\mathbf{f}^{(3)}\mathbf{c}_{1}\mathbf{c}_{1}\mathbf{c}_{1}&\simeq -120\,\mathbf{f}_{nl}(\mathbf{x}+\tfrac{1}{2}\mathbf{c}_{1})+48\,\mathbf{f}_{nl}(\mathbf{x}+\mathbf{c}_{1})-8\,\mathbf{f}_{nl}(\mathbf{x}+\tfrac{3}{2}\mathbf{c}_{1})\\
\mathbf{f}^{(4)}\mathbf{c}_{1}\mathbf{c}_{1}\mathbf{c}_{1}\mathbf{c}_{1}&\simeq 192\,\mathbf{f}_{nl}(\mathbf{x}+\tfrac{1}{2}\mathbf{c}_{1})-96\,\mathbf{f}_{nl}(\mathbf{x}+\mathbf{c}_{1})+\tfrac{64}{3}\,\mathbf{f}_{nl}(\mathbf{x}+\tfrac{3}{2}\mathbf{c}_{1}),
\end{aligned}$$

all with errors $O(\epsilon^5)$. These formulae come from writing out the Taylor expansions and inverting the resulting system of equations, which can also be done by inverting a matrix since the equations are linear in $\mathbf{f}$ and its derivatives. Calculate $\mathbf{c}_2=-\frac{1}{2}J^{-1}\mathbf{f}^{(2)}\mathbf{c}_1\mathbf{c}_1$ using the first formula above.
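
To make the bookkeeping concrete, here is a minimal C sketch of Phase One, assuming user-supplied routines eval_f, apply_J and apply_Jinv evaluated at the current point (apply_Jinv may equally implement the damped Levenberg–Marquardt pseudo-inverse); these names and the fixed dimension N are illustrative, not taken from the source code at [10].

    #include <stddef.h>

    #define N 2  /* illustrative problem dimension */

    extern void eval_f(const double x[N], double out[N]);      /* user-supplied f      */
    extern void apply_J(const double v[N], double out[N]);     /* J v at current x     */
    extern void apply_Jinv(const double v[N], double out[N]);  /* J^{-1} v (or LM pseudo-inverse) */

    /* f_nl(x + a) = f(x + a) - (f(x) + J a): removes the constant and linear parts */
    static void f_nl(const double x[N], const double f0[N],
                     const double a[N], double out[N])
    {
        double xa[N], fxa[N], Ja[N];
        for (size_t i = 0; i < N; i++) xa[i] = x[i] + a[i];
        eval_f(xa, fxa);
        apply_J(a, Ja);
        for (size_t i = 0; i < N; i++) out[i] = fxa[i] - (f0[i] + Ja[i]);
    }

    /* c2 = -(1/2) J^{-1} f^(2) c1 c1, using the first O(eps^5) stencil formula */
    void compute_c2(const double x[N], const double f0[N],
                    const double c1[N], double c2[N])
    {
        double a[N], g_half[N], g_one[N], g_3half[N], d2[N];
        for (size_t i = 0; i < N; i++) a[i] = 0.5 * c1[i];
        f_nl(x, f0, a, g_half);                 /* f_nl(x + c1/2)  */
        f_nl(x, f0, c1, g_one);                 /* f_nl(x + c1)    */
        for (size_t i = 0; i < N; i++) a[i] = 1.5 * c1[i];
        f_nl(x, f0, a, g_3half);                /* f_nl(x + 3c1/2) */
        for (size_t i = 0; i < N; i++)
            d2[i] = 24.0*g_half[i] - 6.0*g_one[i] + (8.0/9.0)*g_3half[i];
        apply_Jinv(d2, c2);
        for (size_t i = 0; i < N; i++) c2[i] *= -0.5;
    }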

5.3.2 Phase Two: Calculating $\mathbf{c}_3$

The mixed derivative $\mathbf{f}^{(3)}\mathbf{c}_1\mathbf{c}_1\mathbf{c}_2$ needs a grid with three points in the $\mathbf{c}_1$ direction and two in the $\mathbf{c}_2$ direction, for a total of six. Many previous points can be re-used; the only new point required for fourth order in this phase is $\mathbf{x}+\frac{1}{2}\mathbf{c}_1+\mathbf{c}_2$.

\begin{align*}
\mathbf{f}^{(3)}\mathbf{c}_1\mathbf{c}_1\mathbf{c}_2 \simeq{} & \left(4\mathbf{f}(\mathbf{x}+\mathbf{c}_2)-8\mathbf{f}(\mathbf{x}+\tfrac{1}{2}\mathbf{c}_1+\mathbf{c}_2)+4\mathbf{f}(\mathbf{x}+\mathbf{c}_1+\mathbf{c}_2)\right)\\
&-\left(4\mathbf{f}-8\mathbf{f}(\mathbf{x}+\tfrac{1}{2}\mathbf{c}_1)+4\mathbf{f}(\mathbf{x}+\mathbf{c}_1)\right)\\
\mathbf{f}^{(2)}\mathbf{c}_1\mathbf{c}_2 \simeq{} & \left(-3\mathbf{f}(\mathbf{x}+\mathbf{c}_2)+4\mathbf{f}(\mathbf{x}+\tfrac{1}{2}\mathbf{c}_1+\mathbf{c}_2)-\mathbf{f}(\mathbf{x}+\mathbf{c}_1+\mathbf{c}_2)\right)\\
&-\left(-3\mathbf{f}+4\mathbf{f}(\mathbf{x}+\tfrac{1}{2}\mathbf{c}_1)-\mathbf{f}(\mathbf{x}+\mathbf{c}_1)\right)\\
\mathbf{f}^{(2)}\mathbf{c}_2\mathbf{c}_2 \simeq{} & 2\,\mathbf{f}_{nl}(\mathbf{x}+\mathbf{c}_2).
\end{align*}

Again, all errors are $O(\epsilon^5)$, and $\mathbf{c}_3=-\frac{1}{6}J^{-1}(\mathbf{f}^{(3)}\mathbf{c}_1\mathbf{c}_1\mathbf{c}_1+6\mathbf{f}^{(2)}\mathbf{c}_1\mathbf{c}_2)$ can now be calculated to the required accuracy.
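
Continuing the sketch above (same assumed routines and dimension), the mixed derivative is just the difference of two three-point rows, after which $\mathbf{c}_3$ needs only one more application of the inverse:

    /* f^(2) c1 c2 from the six-point grid: the same three-point row of
       coefficients, differenced between the c2-shifted copy and the base copy */
    void f2_c1_c2(const double fx[N],     /* f(x)             */
                  const double f_h[N],    /* f(x + c1/2)      */
                  const double f_1[N],    /* f(x + c1)        */
                  const double f_c2[N],   /* f(x + c2)        */
                  const double f_hc2[N],  /* f(x + c1/2 + c2) */
                  const double f_1c2[N],  /* f(x + c1 + c2)   */
                  double out[N])
    {
        for (size_t i = 0; i < N; i++)
            out[i] = (-3.0*f_c2[i] + 4.0*f_hc2[i] - f_1c2[i])
                   - (-3.0*fx[i]   + 4.0*f_h[i]   - f_1[i]);
    }

    /* c3 = -(1/6) J^{-1} (f^(3) c1 c1 c1 + 6 f^(2) c1 c2) */
    void compute_c3(const double f3_c111[N], const double f2_c12[N], double c3[N])
    {
        double rhs[N];
        for (size_t i = 0; i < N; i++)
            rhs[i] = f3_c111[i] + 6.0*f2_c12[i];
        apply_Jinv(rhs, c3);
        for (size_t i = 0; i < N; i++) c3[i] *= -1.0/6.0;
    }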

5.3.3 Phase Three: Calculating $\mathbf{c}_4$

This phase requires the single mixed derivative $\mathbf{f}^{(2)}\mathbf{c}_1\mathbf{c}_3$, which can be handled analogously to the first appearance of $\mathbf{f}^{(2)}\mathbf{c}_1\mathbf{c}_2$, by extending the stencil in the $\mathbf{c}_3$ direction with the two points $\mathbf{x}+\mathbf{c}_3$ and $\mathbf{x}+\mathbf{c}_1+\mathbf{c}_3$.

\[
\mathbf{f}^{(2)}\mathbf{c}_1\mathbf{c}_3 \simeq \mathbf{f}(\mathbf{x}+\mathbf{c}_1+\mathbf{c}_3)-\mathbf{f}(\mathbf{x}+\mathbf{c}_3)-\left(\mathbf{f}(\mathbf{x}+\mathbf{c}_1)-\mathbf{f}(\mathbf{x})\right).
\]

Now all values are available to evaluate the fourth order correction $\mathbf{c}_4=-\frac{1}{24}J^{-1}(\mathbf{f}^{(4)}\mathbf{c}_1\mathbf{c}_1\mathbf{c}_1\mathbf{c}_1+12\mathbf{f}^{(3)}\mathbf{c}_1\mathbf{c}_1\mathbf{c}_2+24\mathbf{f}^{(2)}\mathbf{c}_1\mathbf{c}_3+12\mathbf{f}^{(2)}\mathbf{c}_2\mathbf{c}_2)$. The full step will be corrected from $\mathbf{c}_1$ to $\mathbf{c}_1+\mathbf{c}_2+\mathbf{c}_3+\mathbf{c}_4$ and the evaluations required are shown in Figure 3.

[Figure 3: the function evaluations required for the 4th order corrected step.]
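
Under the same assumptions as the earlier sketches, Phase Three reduces to a plain second difference for $\mathbf{f}^{(2)}\mathbf{c}_1\mathbf{c}_3$ followed by the weighted combination above:

    /* f^(2) c1 c3 by a four-point second difference, then
       c4 = -(1/24) J^{-1} (f^(4)c1^4 + 12 f^(3)c1^2 c2 + 24 f^(2)c1 c3 + 12 f^(2)c2^2) */
    void compute_c4(const double fx[N],       const double f_c1[N],
                    const double f_c3[N],     const double f_c1c3[N],
                    const double f4_c1111[N], const double f3_c112[N],
                    const double f2_c22[N],   double c4[N])
    {
        double rhs[N];
        for (size_t i = 0; i < N; i++) {
            double f2_c13 = f_c1c3[i] - f_c3[i] - (f_c1[i] - fx[i]);
            rhs[i] = f4_c1111[i] + 12.0*f3_c112[i] + 24.0*f2_c13 + 12.0*f2_c22[i];
        }
        apply_Jinv(rhs, c4);
        for (size_t i = 0; i < N; i++) c4[i] *= -1.0/24.0;
    }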

It is clear that this process could be continued to even higher orders, although the stencils would require more and more points. In practice, automated selection of the stencil points and calculation of their coefficients would also be required, as sketched below.
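
As an illustration of such automation, the Phase One weights can be regenerated by numerically solving the Taylor-moment system $\sum_j w_j\,t_j^m/m! = [m=\text{wanted order}]$. The standalone C sketch below (naive Gaussian elimination, no pivoting, which is adequate for this tiny, well-behaved system) prints the weight sets $(24,-6,\frac{8}{9})$, $(-120,48,-8)$ and $(192,-96,\frac{64}{3})$ used above.

    #include <math.h>
    #include <stdio.h>

    #define NODES 3

    static double factorial(int n)
    {
        double r = 1.0;
        while (n > 1) r *= n--;
        return r;
    }

    /* naive Gaussian elimination without pivoting */
    static void solve(double A[NODES][NODES], double b[NODES])
    {
        for (int k = 0; k < NODES; k++) {
            for (int r = k + 1; r < NODES; r++) {
                double m = A[r][k] / A[k][k];
                for (int c = k; c < NODES; c++) A[r][c] -= m * A[k][c];
                b[r] -= m * b[k];
            }
        }
        for (int k = NODES - 1; k >= 0; k--) {
            for (int c = k + 1; c < NODES; c++) b[k] -= A[k][c] * b[c];
            b[k] /= A[k][k];
        }
    }

    int main(void)
    {
        const double t[NODES] = {0.5, 1.0, 1.5};  /* stencil offsets along c1        */
        const int order[NODES] = {2, 3, 4};       /* Taylor orders surviving in f_nl */
        for (int want = 0; want < NODES; want++) {
            double A[NODES][NODES], b[NODES] = {0.0};
            /* moment equations: sum_j w_j t_j^m / m! = 1 if m is the wanted order, else 0 */
            for (int i = 0; i < NODES; i++)
                for (int j = 0; j < NODES; j++)
                    A[i][j] = pow(t[j], order[i]) / factorial(order[i]);
            b[want] = 1.0;
            solve(A, b);
            printf("order-%d directional derivative weights: %g %g %g\n",
                   order[want], b[0], b[1], b[2]);
        }
        return 0;
    }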

6 Numerical Test Problem

The higher-order algorithms were tested on a simple function to verify their performance. This function had to exhibit the 'narrow curving valleys' problem in its sum of squares, so a very anisotropic function $(x,y)\mapsto(x,Ky)$ for $K\gg 1$ was chosen and then some nonlinearity added in parameter space. The resulting function is

\[
\mathbf{f}(x,y)=\left(x+y^2,\;K(y-x^2)\right).
\]

Typical values of $K=10^6$ were used and the iteration was started from the arbitrary point $(x,y)=(\pi,e)$, moving towards the minimum sum of squares at $\mathbf{f}(0,0)=(0,0)$.
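
For reference, this function and its Jacobian take only a few lines of C. The snippet below is a paraphrase consistent with the formula above, with names chosen to match the earlier sketches; it is not an excerpt from [10].

    /* the test function f(x,y) = (x + y^2, K(y - x^2)) and its Jacobian */
    static const double K = 1.0e6;   /* valley anisotropy */

    void eval_f(const double x[2], double out[2])
    {
        out[0] = x[0] + x[1]*x[1];
        out[1] = K * (x[1] - x[0]*x[0]);
    }

    void eval_J(const double x[2], double J[2][2])
    {
        J[0][0] = 1.0;           J[0][1] = 2.0*x[1];   /* d f1 / d(x,y) */
        J[1][0] = -2.0*K*x[0];   J[1][1] = K;          /* d f2 / d(x,y) */
    }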

Figure 4 shows the improved performance of the higher order corrected methods on the test problem with $K=10^6$. The error norm in the plot is the value of $|\mathbf{f}(x,y)|$ after each iteration. There is a slow convergence region for $0.1\leq|\mathbf{f}|\leq 10$, after which the algorithm converges rapidly. Once the full Newton step using the linear model becomes valid, convergence should be quadratic, with the error norm roughly squaring on each iteration. This rapid convergence appears as the near-vertical descending lines on the graph for $|\mathbf{f}|<0.1$.

[Figure 4: error norm $|\mathbf{f}|$ per iteration for the 1st to 4th order methods on the test problem with $K=10^6$.]

Source code in C for this example is available at [10].

6.1 Varying Valley Anisotropy

Varying $K$ should show the relative performance of the different order algorithms as the valley gets narrower while its curvature is kept constant. Table 1 shows the number of iterations required to converge as $K$ is varied between $1$ and $10^{12}$.

Table 1: Iterations to convergence as the valley anisotropy K varies.

    Anisotropy K | 1st order | 2nd order | 3rd order | 4th order
               1 |         8 |         6 |         5 |         5
              10 |        15 |         8 |         6 |         5
             100 |        47 |        16 |         9 |         8
            1000 |       196 |        30 |        18 |        11
           10000 |       880 |        68 |        24 |        18
          100000 |      4041 |       162 |        50 |        27
            10^6 |     18733 |       397 |        88 |        43
            10^7 |    >20000 |       971 |       166 |        70
            10^8 |    >20000 |      2432 |       312 |       110
            10^9 |    >20000 |      5828 |       631 |       243
           10^10 |    >20000 |    >20000 |      2876 |       968
           10^11 |    >20000 |    >20000 |     10886 |      2706
           10^12 |    >20000 |    >20000 |    >20000 |      9159

A simplified model is that the valley has width $1/K$ while the $n^{\mathrm{th}}$ order method has error term $O(\epsilon^{n+1})$, so this error would push the proposed step out of the valley when $O(\epsilon^{n+1})=1/K$, i.e. $\epsilon=O(K^{-\frac{1}{n+1}})$. This would give $O(1/\epsilon)=O(K^{\frac{1}{n+1}})$ steps to convergence.
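
As a rough numeric check of this model (illustrative arithmetic only), at $K=10^6$ it predicts step counts of order
\[
K^{1/2}=1000,\qquad K^{1/3}=100,\qquad K^{1/4}\approx 32,\qquad K^{1/5}\approx 16
\]
for the first through fourth order methods, against the observed 18733, 397, 88 and 43 iterations in Table 1. The ordering and rough ratios are right, but the observed counts are larger, consistent with the fitted exponents in the next paragraph lying somewhat above $\frac{1}{n+1}$.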

[Figure 5: iterations to convergence versus anisotropy $K$ on a log-log scale.]

Plotting the data on a log-log plot in Figure 5 reveals straight lines in parts of the data, suggesting a power-law relationship. For $K\geq 10^9$ there is an additional increase in convergence time, which may come from the limits of the double precision arithmetic used for the calculation. Taking the gradient through the last three available points with $K\leq 10^8$ gives power-law exponents of $0.660$, $0.392$, $0.265$ and $0.203$ for the first through fourth order methods. This is somewhat similar to the simplified model's $\frac{1}{2},\frac{1}{3},\frac{1}{4},\frac{1}{5}$, although the lower-order methods seem slower than predicted, particularly the first order.

6.2 Combination with Levenberg–Marquardt Method

These numerical experiments were performed with a Levenberg–Marquardt method enhanced with the higher-order corrections. The step length $\epsilon$ is controlled by choosing the damping parameter $\lambda\geq 0$ in the pseudo-inverse $[J^{-1}]_{LM(\lambda)}=(J^TJ+\lambda I)^{-1}J^T$. The note [11] shows that $\lambda=0$ gives the full Gauss–Newton step, while $\lambda\rightarrow\infty$ produces infinitesimal steepest-descent steps, with the values of $\lambda$ in between producing optimal reductions in the linear model for a given step size.

A good value of $\lambda$ must be chosen at each step. In this study the values

\[
\lambda_n=\lambda_{\mathrm{old}}\cdot 10000^{(n/10)^3}\qquad\text{for}\quad -10\leq n\leq 10,
\]

where $\lambda_{\mathrm{old}}$ is the value from the previous step, are run in parallel and the one producing the best reduction in $|\mathbf{f}|$ is chosen. The initial step uses $\lambda_{\mathrm{old}}=1$.
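
A minimal C sketch of this ladder (the function name is illustrative):

    #include <math.h>

    /* the 21 trial damping values; n = 0 reproduces lambda_old exactly */
    void lambda_ladder(double lambda_old, double lambda[21])
    {
        for (int n = -10; n <= 10; n++) {
            double u = n / 10.0;
            lambda[n + 10] = lambda_old * pow(10000.0, u*u*u);
        }
    }

The cubic exponent concentrates the trial values near $\lambda_{\mathrm{old}}$ while still allowing jumps of up to four orders of magnitude in a single iteration.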

Because these steps are run in parallel, the higher order methods calculate many initial Levenberg–Marquardt steps $\mathbf{c}_1(\lambda_n)$, each of which has higher order corrections $\mathbf{c}_i(\lambda_n)$. The function $\mathbf{f}$ is evaluated at all the corrected step points $\mathbf{x}_{\mathrm{out}}(\lambda_n)=\sum_{i=0}^{\mathrm{order}}\mathbf{c}_i(\lambda_n)$ (writing $\mathbf{c}_0=\mathbf{x}$) and the point with the lowest $|\mathbf{f}|$, together with its corresponding $\lambda_n$, is chosen.

The higher order methods enable longer steps and thus smaller values of $\lambda$ to be used. The scheme above is somewhat wasteful by trying 21 values of $\lambda$ each step, but on modern computers these trials can be parallelised, unlike the slow progress along the narrow optimisation valley, which is a serial calculation.

6.3 Combination with Broyden’s Method

As evaluating the Jacobian is more expensive than evaluating the function, several methods exist for estimating the Jacobian based on discovered function values. Broyden’s method [12] updates the approximate Jacobian each iteration via

\[
J_{\mathrm{new}}=J+\frac{\Delta\mathbf{f}-J\,\Delta\mathbf{x}}{|\Delta\mathbf{x}|^2}\,\Delta\mathbf{x}^T,
\]

where $\Delta\mathbf{x}$ and $\Delta\mathbf{f}$ are the change in variables and function value, respectively, over the previous step.
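
A minimal dense-matrix sketch of this update in C (row-major storage; names illustrative):

    /* Broyden rank-1 update J += ((df - J dx) / |dx|^2) dx^T
       for a dense m-by-n Jacobian stored row-major */
    void broyden_update(int m, int n, double *J,
                        const double *dx,  /* step taken, length n  */
                        const double *df)  /* change in f, length m */
    {
        double dx2 = 0.0;
        for (int j = 0; j < n; j++) dx2 += dx[j]*dx[j];
        if (dx2 == 0.0) return;            /* no step: nothing to learn */
        for (int i = 0; i < m; i++) {
            double r = df[i];
            for (int j = 0; j < n; j++) r -= J[i*n + j]*dx[j];
            r /= dx2;                      /* i-th component of (df - J dx)/|dx|^2 */
            for (int j = 0; j < n; j++) J[i*n + j] += r*dx[j];
        }
    }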

[Figure 6: convergence with Broyden-updated Jacobians (dashed) versus exact Jacobians (solid) on the test problem with $K=10^6$.]

Figure 6 shows the results of Broyden's method on the simple test problem with $K=10^6$. The solid lines are the same as in Figure 4, while the dashed lines evaluate the Jacobian only at the start and use the Broyden update on each step thereafter. The higher-order corrections and Levenberg–Marquardt logic discussed in previous sections are still present, with the approximate Jacobian used as input rather than the exact one.

Table 2: Iteration and function-evaluation counts for the Broyden-updated methods on the test problem.

    Order   | Iterations | Evaluations per iteration | Evaluations
    1st     |      36652 |                         1 |       36652
    2nd     |      21571 |                         2 |       43142
    3rd     |       6211 |                         5 |       31055
    4th     |        775 |                         9 |        6975
    4th+3rd |        376 |                        10 |        3760

As expected, using an inexact Jacobian results in convergence taking more iterations, but each iteration is faster by a problem-dependent factor since the Jacobian is not evaluated. Table 2 compares the convergence costs of the Broyden method with different higher-order corrections. The 4th order method significantly reduces not only the number of iterations taken but also the number of function evaluations performed (each 4th order step takes 9 evaluations). A variant of this method that additionally samples the 3rd order point $\mathbf{f}(\mathbf{x}+\mathbf{c}_1+\mathbf{c}_2+\mathbf{c}_3)$, which is not present in the 4th order stencil, and takes the better of the two points each iteration, performs even better.

7 Performance on a Practical Problem

These algorithms have also been used on a more complex optimisation problem (which motivated their development). This problem has 180 parameters and 1200 output variables, and features successively more difficult levels of narrow curved valleys in its optimisation space.

The full details of this problem are not the point of this paper, but a brief summary is given here (for physics background, see [13]). An initial distribution of $N_{\mathrm{ions}}=400$ Ca$^+$ ions is accelerated through a potential of 1 kV and given a $\pm 2$% energy chirp. It is transported through a curved electrostatic channel, where the electric field is produced by 15 rings of 12 configurable electrodes. Each electrode is modelled as a point charge, and these 180 charges are the optimisation variables. The output vector whose norm is to be minimised contains the $(x,y,z)$ position coordinates of the ions on exiting the channel (so $3N_{\mathrm{ions}}$ entries in all), with the bunch centroid subtracted. The ions do not interact in this model and their trajectories are calculated by the 4th order Runge–Kutta method [14, 15] with a timestep $\delta t=10^{-7}$ seconds.

This problem is interesting because focussing the ions to a point in the linear dynamics approximation can be done with standard optics, but optical aberrations at higher order will remain, for example spherical aberration from large angle effects. These higher order aberrations can also be corrected by careful choice of the electrode voltages, although this gets more difficult the smaller the focal point becomes and the more aberrations have to be cancelled simultaneously.

The figure of merit is the RMS focal size of the ion bunch, which is $\frac{1}{\sqrt{N_{\mathrm{ions}}}}$ times the norm $|\mathbf{f}|$. Figure 7 shows how this is reduced by the Levenberg–Marquardt optimisation method with various levels of higher order correction. To maintain accuracy, the Jacobian was calculated here with the chain rule (through all Runge–Kutta steps) rather than by finite differences.

[Figure 7: RMS focal size versus iteration for the practical problem at various correction orders.]

Figure 7 shows that higher order methods significantly accelerate the optimisation progress, even on this more complex problem. All methods eventually slow down here, which is suspected to be because the valley anisotropy continues to increase at lower focal sizes, leading to more steps taken (as in Figure 5).

[Figure 8: optimised RMS focal size as a function of calculation time.]

One potential concern with these higher-order methods is that the additional evaluations of $\mathbf{f}$ in the stencil will cost too much time for the method to be worth using. To measure this effect, the optimised focal size is plotted as a function of calculation time in Figure 8. The differences in total execution time can be seen at the right-hand end of each line, which is after 1000 iterations. The fourth order method takes 44% longer per step but still manages to pull ahead of the other methods in real time for focal sizes below $10^{-7}$ m. Table 3 gives the amount of time for each method to reach certain RMS focal sizes.

Table 3: Iterations and calculation time for each method to reach given RMS focal sizes (a dash means the threshold was not reached within 1000 iterations).

    Order | Iterations to reach RMS focal size
          | <10^-6 m | <10^-7 m | <10^-8 m | <10^-9 m
    1st   |      252 |    >1000 |    >1000 |    >1000
    2nd   |      105 |      189 |      344 |      877
    3rd   |       58 |       93 |      160 |      654
    4th   |       58 |       76 |      101 |      247

    Order | Calculation time (s) to reach RMS focal size
          | <10^-6 m | <10^-7 m | <10^-8 m | <10^-9 m
    1st   | 1707.285 |        - |        - |        -
    2nd   |  742.992 | 1337.178 | 2470.101 | 6384.886
    3rd   |  461.045 |  739.670 | 1308.115 | 5497.770
    4th   |  525.800 |  690.696 |  934.539 | 2358.116

7.1 Practical Problem in Combination with Broyden’s Method

The more complex optimisation problem was attempted using Broyden's method, only evaluating the Jacobian on the first step. This approach was unsuccessful, resulting in a 5% decrease in the figure of merit over the first 5–10 iterations, followed by either coming to a complete halt or a sequence of tiny steps decreasing it by ${\sim}10^{-10}$ each time. These steps are not long enough for the higher order corrections to have any effect, nor are they likely to improve the Jacobian accuracy much. It is suspected that the Jacobian estimate became too far away from the real value to be useful.

[Figure 9: the practical problem with a full Jacobian every 16th iteration and Broyden updates in between.]

The optimisation was attempted again with the full Jacobian evaluated periodically (every 16th iteration here) and Broyden's method used for the other steps. Figure 9 shows that in this case the intermediate Broyden's method steps do provide some benefit over a direct iteration with the same number of Jacobian evaluations. It can also be seen that higher-order corrections improve this combined method significantly, although here 3rd order wins out slightly over 4th. Whether this approach is advantageous will depend on the time taken to evaluate the Jacobian relative to individual function values. The downward jumps in the graph at the Jacobian evaluations suggest these are more important in the earlier stages of the optimisation, where the jumps appear larger relative to the smooth decrease. A sketch of this hybrid scheme is given below.
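
This is a minimal sketch of the hybrid loop, reusing broyden_update from the sketch above and assuming an lm_step routine that performs the corrected Levenberg–Marquardt step and records $\Delta\mathbf{x}$ and $\Delta\mathbf{f}$; the signatures are illustrative and the refresh period 16 is the one used here.

    #define REFRESH 16

    extern void eval_jacobian(const double *x, double *J);  /* exact Jacobian (expensive) */
    extern void lm_step(double *x, const double *J,
                        double *dx, double *df);            /* corrected LM step; records dx, df */

    void optimise_hybrid(double *x, double *J, double *dx, double *df,
                         int m, int n, int iters)
    {
        for (int it = 0; it < iters; it++) {
            if (it % REFRESH == 0)
                eval_jacobian(x, J);              /* full Jacobian, every 16th step */
            else
                broyden_update(m, n, J, dx, df);  /* cheap rank-1 fix-up otherwise  */
            lm_step(x, J, dx, df);
        }
    }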

8 Conclusion

Methods such as Levenberg–Marquardt are not only for curve fitting, but are also powerful optimisers where the function to be minimised is a sum of squares. Like many optimisers, they can get stuck for long periods of time in 'curved narrow valleys'. This paper derives higher-order corrections beyond the already-known second order [3, 4] that further accelerate the optimiser performance in these situations. A general formula for deriving the $n^{\mathrm{th}}$ order correction is given, with suggestions on how to build finite difference stencils to evaluate it.

These successful methods have been derived using the concept of a 'natural pathway' for the optimisation, which is an ordinary differential equation (ODE) that is meant to follow the valleys. The form of this ODE is chosen somewhat arbitrarily here but it appears to work well, perhaps because it is a continuous version of the optimiser's path in the limit where the step size $\epsilon\rightarrow 0$. Using a higher-order step method on this ODE thus makes the optimiser behave 'as if' it had done a large number of very small steps.

This link with ODEs also suggests potential future work in applying well-known higher-order methods for ODEs such as Runge–Kutta [14, 15] or Bulirsch–Stoer [16] to difficult optimisation problems, as well as methods for stiff ODEs. In this paper the RK4 method was not preferred because it would have required four evaluations of the Jacobian, which is still much more expensive than the eight additional function points in the stencil of the fourth order method.

References

  • [1] “A Method for the Solution of Certain Non-Linear Problems in Least Squares”, Kenneth Levenberg, Quarterly of Applied Mathematics 2 (2), pp.164–168 (1944) doi:10.1090/qam/10666
  • [2] “An Algorithm for Least-Squares Estimation of Nonlinear Parameters”, Donald Marquardt, SIAM Journal on Applied Mathematics 11 (2), pp.431–441 (1963) doi:10.1137/0111030
  • [3] “Why are Nonlinear Fits to Data so Challenging?”, Mark K. Transtrum, Benjamin B. Machta and James P. Sethna, Phys. Rev. Lett. 104, 060201 (2010) doi:10.1103/PhysRevLett.104.060201
  • [4] “Geometry of nonlinear least squares with applications to sloppy models and optimization”, Mark K. Transtrum, Benjamin B. Machta, and James P. Sethna, Phys. Rev. E 83, 036701 (2011) doi:10.1103/PhysRevE.83.036701
  • [5] “Improvements to the Levenberg-Marquardt algorithm for nonlinear least-squares minimization”, Mark K. Transtrum and James P. Sethna, arXiv:1201.5885v1 [physics.data-an] (2012).
  • [6] “Geodesic acceleration and the small-curvature approximation for nonlinear least squares”, Mark K. Transtrum and James P. Sethna, arXiv:1207.4999v1 [math.OC] (2012).
  • [7] “Du calcul des derivations” [On the calculus of derivatives] (in French), L.F.A. Arbogast, Strasbourg: Levrault, pp. xxiii+404 (1800). Entirely freely available from Google books https://books.google.com/books?id=YoPq8uCy5Y8C
  • [8] “Sullo sviluppo delle funzioni” [On the development of the functions] (in Italian), F. Faà di Bruno, Annali di Scienze Matematiche e Fisiche, 6: 479–480, LCCN 06036680 (1855). Entirely freely available from Google books https://books.google.com/books?id=ddE3AAAAMAAJ&pg=PA479
  • [9] “Note sur une nouvelle formule de calcul differentiel” [On a new formula of differential calculus] (in French), F. Faà di Bruno, The Quarterly Journal of Pure and Applied Mathematics, 1: 359–360 (1857). Entirely freely available from Google books https://books.google.com/books?id=7BELAAAAYAAJ&pg=PA359
  • [10] https://stephenbrooks.org/ap/src/optimtest.zip contains the source code for the simple optimisation problem, and https://stephenbrooks.org/ap/src/tracker.zip the complex optimisation.
  • [11] “Bounded Approximate Solutions of Linear Systems using SVD”, Stephen Brooks, tech note BNL-223624-2022-TECH, available from https://technotes.bnl.gov/ or https://stephenbrooks.org/ap/report/2015-3/svdboundedsolve.pdf (2015).
  • [12] “A Class of Methods for Solving Nonlinear Simultaneous Equations”, C.G. Broyden, Mathematics of Computation 19, pp.577–593 (1965). doi:10.1090/S0025-5718-1965-0198670-6
  • [13] “Ultra-Low Emittance Bunches from Laser Cooled Ion Traps for Intense Focal Points”, Stephen Brooks, Proc. HB2023. doi:10.18429/JACoW-HB2023-TUC3I1
  • [14] Runge, C., Math. Ann. 46, p.167 (1895); and Kutta, M.W., Z. für Math. u. Phys. 46, p.435 (1901).
  • [15] Numerical Recipes in C, W.H. Press, B.P. Flannery, S.A. Teukolsky and W.T. Vetterling, chapter 15.1: Integration of Ordinary Differential Equations - Runge–Kutta Method.
  • [16] Introduction to Numerical Analysis, J. Stoer and R. Bulirsch, section 7.2.14, New York: Springer-Verlag (1980).