Derivative-free separable quadratic modeling and cubic regularization for unconstrained optimization

A. L. Custódio (alcustodio@fct.unl.pt) · R. Garmanjani (r.garmanjani@fct.unl.pt) · M. Raydan (m.raydan@fct.unl.pt)
Center for Mathematics and Applications (NOVA Math), FCT NOVA, 2829-516 Caparica, Portugal
Department of Mathematics, FCT NOVA, 2829-516 Caparica, Portugal

Abstract  We present a derivative-free separable quadratic modeling and cubic regularization technique for solving smooth unconstrained minimization problems. The derivative-free approach is mainly concerned with building a quadratic model, generated by numerical interpolation or by a minimum Frobenius norm approach when the number of points available does not allow a complete quadratic model to be built. This model plays a key role in generating an approximate gradient vector and Hessian matrix of the objective function at every iteration. We add a specialized cubic regularization strategy, which makes use of separability, to minimize the quadratic model at each iteration. We discuss convergence results, including worst-case complexity, of the proposed schemes to first-order stationary points. Some preliminary numerical results are presented to illustrate the robustness of the specialized separable cubic algorithm.

Keywords  Derivative-free optimization · Fully-linear models · Fully-quadratic models · Cubic regularization · Worst-case complexity

Mathematics Subject Classification  90C30 · 65K05 · 90C56 · 65D05

1 Introduction

We consider unconstrained minimization problems of the form

$$\min_{x \in \mathbb{R}^n} f(x), \qquad (1)$$

where the objective function $f: \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable. We assume, however, that the derivatives of $f$ are not available and cannot be easily approximated by finite-difference methods. This situation frequently arises when $f$ must be evaluated through black-box simulation packages, and each function evaluation may be costly and/or contaminated with noise (Conn et al. 2009b).

Recently (Brás et al. 2020; Martínez and Raydan 2015, 2017), in a derivative-based context, several separable models combined with either a variable-norm trust-region strategy or with a cubic regularization scheme were proposed for solving (1), and their standard asymptotic convergence results were established. The main idea of these separable model approaches is to minimize a quadratic (or a cubic) model at each iteration, in which the quadratic part is the second-order Taylor approximation of the objective function. With a suitable change of variables, based on the Schur factorization, the solution of these subproblems is trivialized, and an adequate choice of the norm at each iteration permits the use of a trust-region reduction procedure that guarantees global convergence to second-order stationary points (Brás et al. 2020; Martínez and Raydan 2015). In that case, the separable model method with a trust-region strategy has the same asymptotic convergence properties as the trust-region Newton method. Later, in Martínez and Raydan (2017), starting from the same modeling introduced in Martínez and Raydan (2015), the trust-region scheme was replaced with a separable cubic regularization strategy. Adding convenient regularization terms, the standard asymptotic convergence results were retained; moreover, the complexity of the cubic strategy for finding approximate first-order stationary points became $O(\varepsilon^{-3/2})$.
For the separable cubic regularization approach used in Martínez and Raydan (2017), complexity results with respect to second-order stationarity were also established. We note that regularization procedures serve the same purpose and are strongly related to trust-region schemes, with the advantage of possessing improved worst-case complexity (WCC) bounds; see, e.g., (Bellavia et al. 2021; Birgin et al. 2017; Cartis et al. 2011a, b; Cartis and Scheinberg 2018; Grapiglia et al. 2015; Karas et al. 2015; Lu et al. 2012; Martínez 2017; Nesterov and Polyak 2006; Xu et al. 2020).

However, as previously mentioned, the separable cubic approaches developed in Brás et al. (2020), Martínez and Raydan (2015), and Martínez and Raydan (2017) are based on the availability of the exact gradient vector and the exact Hessian matrix at every iteration. When exact derivatives are not available, quadratic models based only on objective function values, computed at sample points, can be obtained that retain a good quality of approximation of the gradient and the Hessian of the objective function. These derivative-free models can be constructed by means of polynomial interpolation or regression, or by any other approximation technique. Depending on their accuracy, these models are called fully-linear or fully-quadratic; see (Conn et al. 2008a, b, 2009b) for details. Fully-linear and fully-quadratic models are the basis of derivative-free trust-region methods (Conn et al. 2009a, b; Scheinberg and Toint 2010) and have also been successfully used in the definition of a search step for unconstrained directional direct search algorithms (Custódio et al. 2010). In the latter, minimum Frobenius norm approaches are adopted when the number of points available does not allow the computation of a determined interpolation model.

This state of affairs motivated us to develop a derivative-free separable version of the regularized method introduced in Martínez and Raydan (2017). This means that we will start with a derivative-free quadratic model, which can be obtained by different schemes, to produce an approximate gradient vector and Hessian matrix per iteration, and then we will add the separable cubic regularization terms, associated with an adaptive regularization parameter, to guarantee convergence to stationary points.

The paper is organized as follows. In Sect. 2 we present the main ideas behind the derivative-based separable modeling approaches. Section 3 revises several derivative-free schemes for building quadratic models. In Sect. 4 we describe our proposed derivative-free separable cubic regularization strategy and discuss the associated convergence properties. Section 5 reports numerical results that give further insight into the proposed approach. Finally, in Sect. 6 we present some concluding remarks.

Throughout, unless otherwise specified, we use the Euclidean norm $\|x\| = (x^\top x)^{1/2}$ on $\mathbb{R}^n$, where the inner product is $x^\top x = \sum_{i=1}^n x_i^2$. For a given $\Delta > 0$ we denote the closed ball $B(x; \Delta) = \{y \in \mathbb{R}^n : \|y - x\| \le \Delta\}$.

2 Separable cubic modeling

In the standard derivative-based quadratic modeling approach for solving (1), a quadratic model of $f$ around $x_k$ is constructed by defining the model of the objective function as

$$m_k(s) = f_k + g_k^\top s + \frac{1}{2} s^\top H_k s, \qquad (2)$$

where $f_k = f(x_k)$, $g_k = \nabla f(x_k)$ is the gradient vector at $x_k$, and $H_k$ is either the Hessian of $f$ at $x_k$, $\nabla^2 f(x_k)$, or a symmetric approximation of it. The step $s_k$ is the minimizer of $m_k(s)$.
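As a small illustration of (2) (ours, not from the paper, with hypothetical data), the model can be evaluated directly, and when $H_k$ is positive definite its minimizer solves the Newton system $H_k s = -g_k$:

```python
import numpy as np

def quadratic_model(f_k, g_k, H_k):
    """Model (2): m_k(s) = f_k + g_k^T s + 0.5 * s^T H_k s."""
    return lambda s: f_k + g_k @ s + 0.5 * s @ (H_k @ s)

# Hypothetical data for a two-variable problem.
f_k = 3.0
g_k = np.array([1.0, -2.0])
H_k = np.array([[4.0, 1.0],
                [1.0, 3.0]])       # symmetric positive definite

m_k = quadratic_model(f_k, g_k, H_k)
s_k = np.linalg.solve(H_k, -g_k)   # minimizer of (2) when H_k is SPD
print(s_k, m_k(s_k))
```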
In Martínez and Raydan (2015), instead of the standard quadratic model associated with Newton's method, the equivalent separable quadratic model

$$m_k(y) = f_k + (Q_k^\top g_k)^\top y + \frac{1}{2} y^\top D_k y \qquad (3)$$

was considered to approximate the objective function $f$ around the iterate $x_k$. In (3), the change of variables $y = Q_k^\top s$ is used, where the spectral (or Schur) factorization of $H_k$,

$$H_k = Q_k D_k Q_k^\top, \qquad (4)$$

is computed at every iteration. In (4), $Q_k$ is an orthogonal $n \times n$ matrix whose columns are the eigenvectors of $H_k$, and $D_k$ is a real diagonal $n \times n$ matrix whose diagonal entries are the eigenvalues of $H_k$. Let us note that, since $H_k$ is symmetric, (4) is well-defined for all $k$. We also note that (3) may be non-convex, i.e., some of the diagonal entries of $D_k$ could be negative.

For the separable regularization counterpart in Martínez and Raydan (2017), the model (3) is kept and a cubic regularization term is added:

$$m_k^{SR}(y) = f_k + (Q_k^\top g_k)^\top y + \frac{1}{2} y^\top D_k y + \frac{\sigma_k}{6} \sum_{i=1}^n |y_i|^3, \qquad (5)$$

where $\sigma_k \ge 0$ is dynamically obtained. Note that a $1/6$ factor is included in the last term of (5) to simplify derivative expressions. Notice also that, since $D_k$ is a diagonal matrix, models (3) and (5) are indeed separable. As a consequence, at every iteration $k$ the subproblem

$$\min_{y \in \mathbb{R}^n} m_k^{SR}(y)$$

is solved to compute the vector $y_k$, and then the step is recovered as $s_k = Q_k y_k$.

The gradient of the model $m_k^{SR}(y)$, given by (5), can be written as

$$\nabla m_k^{SR}(y) = Q_k^\top g_k + D_k y + \frac{\sigma_k}{2}\, \hat u,$$

where the $i$-th entry of the $n$-dimensional vector $\hat u$ is equal to $|y_i|\, y_i$. Similarly, the Hessian of (5) is given by

$$\nabla^2 m_k^{SR}(y) = D_k + \sigma_k\, \mathrm{diag}(|y_i|).$$

To solve $\nabla m_k^{SR}(y) = 0$ and find the critical points, we only need to independently minimize $n$ one-dimensional special functions. These one-variable functions are of the form

$$h(z) = c_0 + c_1 z + c_2 z^2 + c_3 |z|^3.$$

The details on how to find the global minimizer of $h(z)$ are fully described in (Martínez and Raydan 2017, Sect. 3).

In the next section, we will describe several derivative-free alternatives to compute a model of type (2), to be incorporated in the separable regularized model (5).
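Assuming $\sigma_k > 0$ (so that each one-dimensional function is coercive), the whole minimization of (5) thus reduces to one symmetric eigendecomposition plus $n$ scalar problems. The following minimal NumPy sketch is ours, not the authors' code; the closed-form candidate enumeration below stands in for the detailed procedure of (Martínez and Raydan 2017, Sect. 3):

```python
import numpy as np

def min_h(c1, c2, c3):
    """Globally minimize h(z) = c1*z + c2*z**2 + c3*|z|**3 over the real line,
    assuming c3 > 0 (so h is coercive).  Since h is C^1, its global minimizer
    is a stationary point of one of the two smooth pieces (or z = 0)."""
    cands = [0.0]
    # Piece z >= 0: h'(z) = c1 + 2*c2*z + 3*c3*z**2.
    for r in np.roots([3.0 * c3, 2.0 * c2, c1]):
        if np.isreal(r) and r.real >= 0.0:
            cands.append(r.real)
    # Piece z <= 0: h'(z) = c1 + 2*c2*z - 3*c3*z**2.
    for r in np.roots([-3.0 * c3, 2.0 * c2, c1]):
        if np.isreal(r) and r.real <= 0.0:
            cands.append(r.real)
    h = lambda z: c1 * z + c2 * z**2 + c3 * abs(z)**3
    return min(cands, key=h)

def separable_cubic_step(g, H, sigma):
    """Minimize model (5) via the factorization H = Q D Q^T (eigh coincides with
    the Schur form for symmetric H) and n independent 1-D minimizations."""
    d, Q = np.linalg.eigh(H)
    gt = Q.T @ g                  # gradient in the rotated coordinates
    y = np.array([min_h(gt[i], d[i] / 2.0, sigma / 6.0) for i in range(len(g))])
    return Q @ y                  # step s = Q y

# Example with an indefinite Hessian approximation (hypothetical data).
g = np.array([1.0, -2.0, 0.5])
H = np.array([[2.0, 0.0, 1.0],
              [0.0, -1.0, 0.0],
              [1.0, 0.0, 3.0]])
s = separable_cubic_step(g, H, sigma=1.0)
```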
3 Fully-linear and fully-quadratic derivative-free models

Interpolation or regression based models are commonly used in derivative-free optimization as surrogates of the objective function. In particular, quadratic interpolation models are used as replacements for Taylor models in derivative-free trust-region approaches (Conn et al. 2009a; Scheinberg and Toint 2010).

The terminology fully-linear and fully-quadratic, describing a derivative-free model that retains Taylor-like bounds, was first proposed in Conn et al. (2009b). Definitions 3.1 and 3.2 provide a slightly modified version of it, suited to the present work. Throughout this section, $\Delta_{\max}$ is a given positive constant that represents an upper bound on the radii of the regions in which the models are built.

Assumption 3.1 Let $f$ be a continuously differentiable function with Lipschitz continuous gradient (with constant $L_g$).

Definition 3.1 (Conn et al. 2009b, Definition 6.1) Let a function $f: \mathbb{R}^n \to \mathbb{R}$, satisfying Assumption 3.1, be given. A set of model functions $\mathcal{M} = \{m: \mathbb{R}^n \to \mathbb{R},\ m \in C^1\}$ is called a fully-linear class of models if:

1. There exist positive constants $\kappa_{ef}$ and $\kappa_{eg}$ such that for any $x \in \mathbb{R}^n$ and $\Delta \in (0, \Delta_{\max}]$ there exists a model function $m(s)$ in $\mathcal{M}$, with Lipschitz continuous gradient, such that
• the error between the gradient of the model and the gradient of the function satisfies
$$\|\nabla f(x+s) - \nabla m(s)\| \le \kappa_{eg} \Delta, \quad \forall s \in B(0; \Delta), \qquad (6)$$
and
• the error between the model and the function satisfies
$$|f(x+s) - m(s)| \le \kappa_{ef} \Delta^2, \quad \forall s \in B(0; \Delta).$$
Such a model $m$ is called fully-linear on $B(x; \Delta)$.

2. For this class $\mathcal{M}$ there exists an algorithm, which we will call a 'model-improvement' algorithm, that in a finite, uniformly bounded (with respect to $x$ and $\Delta$) number of steps can
• either establish that a given model $m \in \mathcal{M}$ is fully-linear on $B(x; \Delta)$ (we will say that a certificate has been provided),
• or find a model $m \in \mathcal{M}$ that is fully-linear on $B(x; \Delta)$.

For fully-quadratic models, stronger assumptions on the smoothness of the objective function are required.

Assumption 3.2 Let $f$ be a twice continuously differentiable function with Lipschitz continuous Hessian (with constant $L_H$).

Definition 3.2 (Conn et al. 2009b, Definition 6.2) Let a function $f: \mathbb{R}^n \to \mathbb{R}$, satisfying Assumption 3.2, be given. A set of model functions $\mathcal{M} = \{m: \mathbb{R}^n \to \mathbb{R},\ m \in C^2\}$ is called a fully-quadratic class of models if:

1. There exist positive constants $\kappa_{ef}$, $\kappa_{eg}$, and $\kappa_{eh}$ such that for any $x \in \mathbb{R}^n$ and $\Delta \in (0, \Delta_{\max}]$ there exists a model function $m(s)$ in $\mathcal{M}$, with Lipschitz continuous Hessian, such that
• the error between the Hessian of the model and the Hessian of the function satisfies
$$\|\nabla^2 f(x+s) - \nabla^2 m(s)\| \le \kappa_{eh} \Delta, \quad \forall s \in B(0; \Delta), \qquad (7)$$
• the error between the gradient of the model and the gradient of the function satisfies
$$\|\nabla f(x+s) - \nabla m(s)\| \le \kappa_{eg} \Delta^2, \quad \forall s \in B(0; \Delta), \qquad (8)$$
and
• the error between the model and the function satisfies
$$|f(x+s) - m(s)| \le \kappa_{ef} \Delta^3, \quad \forall s \in B(0; \Delta).$$
Such a model $m$ is called fully-quadratic on $B(x; \Delta)$.

2. For this class $\mathcal{M}$ there exists an algorithm, which we will call a 'model-improvement' algorithm, that in a finite, uniformly bounded (with respect to $x$ and $\Delta$) number of steps can
• either establish that a given model $m \in \mathcal{M}$ is fully-quadratic on $B(x; \Delta)$ (we will say that a certificate has been provided),
• or find a model $m \in \mathcal{M}$ that is fully-quadratic on $B(x; \Delta)$.

Algorithms for model certification or for improving the quality of a given model can be found in Conn et al. (2009b). This quality is directly related to the geometry of the sample set used in its computation (Conn et al. 2008a, b). However, some practical approaches have reported good numerical results with implementations that do not enforce a strict geometry control (Bandeira et al. 2012; Fasano et al. 2009).
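As one concrete instance of a model of the kind discussed in this section, the sketch below (ours, with hypothetical names; it is a simplified illustration that omits the geometry management Definitions 3.1 and 3.2 require) builds a determined quadratic interpolation model from $(n+1)(n+2)/2$ sample displacements by solving the interpolation system in the monomial basis $\{1, s_i, s_i s_j\}$:

```python
import numpy as np
from itertools import combinations_with_replacement

def quadratic_interpolation_model(Y, fvals):
    """Determined quadratic interpolation: Y holds p = (n+1)(n+2)/2 displacements
    from the current iterate (one per row), fvals the corresponding function
    values.  Returns (c, g, H) with model value c + g^T s + 0.5 s^T H s."""
    p, n = Y.shape
    assert p == (n + 1) * (n + 2) // 2, "determined quadratic interpolation"
    quad_idx = list(combinations_with_replacement(range(n), 2))
    M = np.hstack([np.ones((p, 1)), Y,
                   np.column_stack([Y[:, i] * Y[:, j] for (i, j) in quad_idx])])
    alpha = np.linalg.solve(M, fvals)   # ill-conditioned if the geometry is poor
    c, g = alpha[0], alpha[1:n + 1].copy()
    H = np.zeros((n, n))
    for coef, (i, j) in zip(alpha[n + 1:], quad_idx):
        if i == j:
            H[i, i] = 2.0 * coef        # coefficient of s_i^2 contributes 2 to H
        else:
            H[i, j] = H[j, i] = coef    # coefficient of s_i s_j
    return c, g, H

# Example with n = 2: six sample displacements around the current iterate.
Y = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])
fvals = np.array([np.sin(x[0]) + x[1]**2 for x in Y])
c, g, H = quadratic_interpolation_model(Y, fvals)
```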
4 Derivative-free separable cubic regularization approach

In a derivative-free optimization setting, instead of (2), we will consider the quadratic model

$$\tilde m_k(s) = f_k + \tilde g_k^\top s + \frac{1}{2} s^\top \tilde H_k s,$$

where $\tilde g_k = \nabla \tilde m_k(x_k)$ and $\tilde H_k = \nabla^2 \tilde m_k(x_k)$ are good quality approximations of $g_k$ and $H_k$, respectively, built using interpolation or a minimum Frobenius norm approach (see Chapters 3 and 5 in Conn et al. (2009b)). Hence, analogously to the discussion in Sect. 2, by using the change of variables $y = \tilde Q_k^\top s$, where $\tilde H_k = \tilde Q_k \tilde D_k \tilde Q_k^\top$, with $\tilde Q_k$ an orthogonal $n \times n$ matrix whose columns are the eigenvectors of $\tilde H_k$ and $\tilde D_k$ a real diagonal $n \times n$ matrix whose diagonal entries are the eigenvalues of $\tilde H_k$, the equivalent separable quadratic model

$$\tilde m_k(y) = f_k + (\tilde Q_k^\top \tilde g_k)^\top y + \frac{1}{2} y^\top \tilde D_k y \qquad (9)$$

is used for the approximation of the objective function $f$ around the iterate $x_k$. We then regularize (9) by adding a cubic or a quadratic term, depending on whether a fully-quadratic or a fully-linear model has been computed, respectively:

$$\tilde m_k^{SR}(y) = f_k + (\tilde Q_k^\top \tilde g_k)^\top y + \frac{1}{2} y^\top \tilde D_k y + \frac{\sigma_k}{p!} \sum_{i=1}^n |y_i|^p,$$

where $p \in \{2, 3\}$ and $\sigma_k \ge 0$ is dynamically obtained. As a consequence, at every iteration $k$ the subproblem

$$\min_{y \in \mathbb{R}^n} \tilde m_k^{SR}(y) \quad \text{subject to} \quad \frac{\xi}{\sigma_k} \le \|y\|_\infty \le \Delta \qquad (10)$$

is solved to compute the vector $y_k$, and then the step is recovered as $s_k = \tilde Q_k y_k$.

The constraint $\|y\|_\infty \le \Delta$, where $\Delta > 0$ is a fixed given parameter for all $k$, is necessary to ensure the existence of a solution of problem (10) in some cases. Indeed, since some diagonal entries of $\tilde D_k$ might be negative, for $p = 2$ the existence of an unconstrained minimizer of the objective function in (10) is not guaranteed. In the case of $p = 3$ and any $\sigma_k > 0$, the existence of an unconstrained minimizer of the same function is guaranteed. Nevertheless, if some diagonal entries of $\tilde D_k$ are negative and $\sigma_k$ is still close to zero, imposing the constraint $\|y\|_\infty \le \Delta$ prevents the obtained vector $y_k$ from being too large, and therefore avoids unnecessary numerical difficulties when solving (10).

The additional constraint $\|y\|_\infty \ge \xi/\sigma_k$ relates the stepsize to the regularization parameter and is required to establish WCC results. A similar strategy was used in Cartis and Scheinberg (2018) when building models using a probabilistic approach. As we will see in Sects. 4.1 and 4.2, this additional lower bound does not prevent the iterative process from driving the first-order stationarity measure below any given small positive threshold. In this case, by solving $n$ independent one-dimensional minimization problems on the closed intervals $[-\Delta, -\xi/\sigma_k]$ and $[\xi/\sigma_k, \Delta]$, we are being more demanding than the original constraint. These one-variable functions are of the form

$$h(z) = c_0 + c_1 z + c_2 z^2 + c_3 |z|^3.$$

The details on how to find the global minimizer of $h(z)$ on the closed and bounded intervals $[-\Delta, -\xi/\sigma_k]$ and $[\xi/\sigma_k, \Delta]$, for $\Delta > 0$ and $\xi/\sigma_k > 0$, are similar to those described in (Martínez and Raydan 2017, Sect. 3). A practical approach for the resolution of (10) will be suggested and tested in Sect. 5.

The following algorithm is an adaptation of Algorithm 2.1 in Martínez and Raydan (2017) to the derivative-free case.

Algorithm 1
Let $\alpha > 0$, $\sigma_{small} > 0$, $\eta > 1$, and $\xi > 0$ be algorithmic parameters. Assume that $x_0 \in \mathbb{R}^n$ is a given initial approximation to the solution of problem (1). Initialize $k \leftarrow 0$.

Step 1: Choose $\sigma_k = \sigma_{small}$ and $\Delta > \xi/\sigma_k$.

Step 2: Build a quadratic polynomial model $\tilde m_k(s) = f_k + \tilde g_k^\top s + \frac{1}{2} s^\top \tilde H_k s$ by selecting points in $B(x_k, 1/\sigma_k)$ (fully-linear, minimum Frobenius norm models or fully-quadratic polynomial models can be considered, depending on the number of points available for reuse or on the effort allowed in terms of number of function evaluations). Set $p = 2$ (respectively $p = 3$) if the computed model is fully-linear (respectively fully-quadratic).

Step 3: Compute a solution $s_{trial}$ of

$$\min_{s}\ \tilde g_k^\top s + \frac{1}{2} s^\top \tilde H_k s + \frac{\sigma_k}{p!} \sum_{i=1}^n |[\tilde Q_k^\top s]_i|^p \quad \text{subject to} \quad \frac{\xi}{\sigma_k} \le \|\tilde Q_k^\top s\|_\infty \le \Delta, \qquad (11)$$

where $\tilde H_k = \tilde Q_k \tilde D_k \tilde Q_k^\top$ is a Schur factorization of $\tilde H_k$.

Step 4: Test the sufficient decrease condition

$$f(x_k + s_{trial}) \le f(x_k) - \alpha \sum_{i=1}^n |[\tilde Q_k^\top s_{trial}]_i|^p. \qquad (12)$$

If (12) is fulfilled, define $s_k = s_{trial}$, $x_{k+1} = x_k + s_k$, update $k \leftarrow k + 1$ and go to Step 1. Otherwise set $\sigma_{new} = \eta \sigma_k$, update $\sigma_k \leftarrow \sigma_{new}$, and go to Step 2.
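Algorithm 1 can be prototyped compactly. The following Python sketch is ours, not the paper's Matlab code: it implements the $p = 3$ case with the stricter per-coordinate treatment of the bounds in (11) (the first strategy of Sect. 5), and `build_model` is a hypothetical user-supplied routine returning the approximate gradient and Hessian together with the evaluations spent (e.g., the interpolation sketch above):

```python
import numpy as np

def min_h_on_interval(c1, c2, c3, lo, hi):
    """Global minimizer of h(z) = c1*z + c2*z**2 + c3*|z|**3 on [lo, hi]:
    compare the endpoints with the real stationary points of each smooth piece."""
    h = lambda z: c1*z + c2*z**2 + c3*abs(z)**3
    cands = [lo, hi]
    for sgn in (1.0, -1.0):                          # pieces z >= 0 and z <= 0
        for r in np.roots([3.0*sgn*c3, 2.0*c2, c1]):
            if np.isreal(r) and sgn*r.real >= 0.0 and lo <= r.real <= hi:
                cands.append(r.real)
    return min(cands, key=h)

def algorithm1(f, x0, build_model, alpha=1e-4, sigma_small=0.1, eta=8.0,
               xi=1e-5, Delta=10.0, tol=1e-5, max_evals=1500):
    """Schematic driver for Algorithm 1 with p = 3 throughout (i.e. assuming
    Step 2 always delivers a fully-quadratic model).  build_model(f, x, r)
    must return approximate (g, H, evals_used) from points in B(x; r)."""
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    sigma, nevals = sigma_small, 1                   # Step 1
    while nevals < max_evals:
        g, H, used = build_model(f, x, 1.0/sigma)    # Step 2
        nevals += used
        if np.linalg.norm(g) < tol:                  # stopping rule of Sect. 5
            break
        d, Q = np.linalg.eigh(H)                     # Schur factorization of H
        gt = Q.T @ g
        y = np.empty_like(gt)                        # Step 3: n 1-D solves on
        for i in range(len(gt)):                     # [-D,-xi/s] and [xi/s, D]
            c1, c2, c3 = gt[i], d[i]/2.0, sigma/6.0
            zm = min_h_on_interval(c1, c2, c3, -Delta, -xi/sigma)
            zp = min_h_on_interval(c1, c2, c3, xi/sigma, Delta)
            h = lambda z: c1*z + c2*z**2 + c3*abs(z)**3
            y[i] = zm if h(zm) <= h(zp) else zp
        s = Q @ y
        ftrial = f(x + s)
        nevals += 1
        if ftrial <= fx - alpha*np.sum(np.abs(y)**3):    # Step 4: condition (12)
            x, fx, sigma = x + s, ftrial, sigma_small    # accept; reset (Step 1)
        else:
            sigma *= eta                                 # reject; regularize more
    return x
```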
Remark 4.1 The upper bound constraint in (11) does not affect the separable nature of Step 3, since it can be equivalently replaced by $|[\tilde Q_k^\top s]_i| \le \Delta$ for all $i$. However, the lower bound in (11) does affect the separability of Step 3. Two strategies have been developed to impose the lower bound constraint in (11) while maintaining the separability approach. These strategies will be described in Sect. 5.

In the following subsections, the convergence and worst-case behavior of Algorithm 1 are analyzed independently for the fully-linear and fully-quadratic cases.

4.1 Fully-linear approach

This subsection is devoted to the analysis of the WCC of Algorithm 1 when fully-linear models are used. For that, we need the following technical lemma.

Lemma 4.1 (Nesterov 2004, Lemma 1.2.3) Let Assumption 3.1 hold. Then, we have

$$f(x+s) - f(x) - \nabla f(x)^\top s \le \frac{L_g}{2} \|s\|^2. \qquad (13)$$

As is common in nonlinear optimization, we assume that the norm of the Hessian of each model is bounded.

Assumption 4.1 Assume that the norm of the Hessian of the model is bounded, i.e.,

$$\|\tilde H_k\| \le \kappa_{\tilde H}, \quad \forall k \ge 0, \qquad (14)$$

for some $\kappa_{\tilde H} > 0$.

We also assume that the trial point provides decrease for the current model, i.e., that for $p = 2$ the value of the objective function of (11) at $s_{trial}$ is less than or equal to its value at $s = 0$.

Assumption 4.2 Assume that

$$\tilde g_k^\top s_{trial} + \frac{1}{2} s_{trial}^\top \tilde H_k s_{trial} + \frac{\sigma_k}{2} \sum_{i=1}^n [\tilde Q_k^\top s_{trial}]_i^2 \le 0. \qquad (15)$$

Clearly, (15) holds if $s_{trial}$ is a global solution of (11) for $p = 2$. Hence, taking advantage of our separability approach, the vector $s_{trial}$ obtained at Step 3 of Algorithm 1 satisfies (15).

In the following lemma, we derive an upper bound on the number of function evaluations required to satisfy the sufficient decrease condition (12), which in turn guarantees that every iteration of Algorithm 1 is well-defined. Moreover, we also obtain an upper bound on the regularization parameter.

Lemma 4.2 Let Assumptions 3.1, 4.1, and 4.2 hold, and assume that at Step 2 of Algorithm 1 a fully-linear model is always used. In order to satisfy condition (12), with $p = 2$, Algorithm 1 needs at most

$$\left\lceil \frac{\log\!\left( 2\left(\alpha + \frac{L_g}{2} + \kappa_{eg} + \frac{\kappa_{\tilde H}}{2}\right)\big/\sigma_{small} \right)}{\log \eta} \right\rceil + 1 \qquad (16)$$

function evaluations, not accounting for the ones required for model computation. In addition, the maximum value of $\sigma_k$ for which (12) is satisfied is given by

$$\sigma_{max} = \max\left\{ \sigma_{small},\ 2\eta \left(\alpha + \frac{L_g}{2} + \kappa_{eg} + \frac{\kappa_{\tilde H}}{2}\right) \right\}. \qquad (17)$$

Proof First, we show that if

$$\sigma_k \ge 2\left(\alpha + \frac{L_g}{2} + \kappa_{eg} + \frac{\kappa_{\tilde H}}{2}\right) \qquad (18)$$

then the sufficient decrease condition (12) of Algorithm 1 is satisfied for $p = 2$. In view of (15), we have

$$f(x_k + s_{trial}) - f(x_k) \le f(x_k + s_{trial}) - f(x_k) - \tilde g_k^\top s_{trial} - \frac{1}{2} s_{trial}^\top \tilde H_k s_{trial} - \frac{\sigma_k}{2} \sum_{i=1}^n [\tilde Q_k^\top s_{trial}]_i^2$$
$$\le |f(x_k + s_{trial}) - f(x_k) - \nabla f(x_k)^\top s_{trial}| + |(\nabla f(x_k) - \tilde g_k)^\top s_{trial}| + \frac{1}{2} |s_{trial}^\top \tilde H_k s_{trial}| - \frac{\sigma_k}{2} \sum_{i=1}^n [\tilde Q_k^\top s_{trial}]_i^2.$$

Thus, by using (6), (13), (14), and $\|s_{trial}\| \ge \xi/\sigma_k$ (due to Step 3 of Algorithm 1), we obtain

$$f(x_k + s_{trial}) - f(x_k) \le \kappa_{eg} \|s_{trial}\|^2 + \frac{L_g}{2} \|s_{trial}\|^2 + \frac{\kappa_{\tilde H}}{2} \|s_{trial}\|^2 - \frac{\sigma_k}{2} \sum_{i=1}^n [\tilde Q_k^\top s_{trial}]_i^2$$
$$= \left( \frac{L_g}{2} + \kappa_{eg} + \frac{\kappa_{\tilde H}}{2} - \frac{\sigma_k}{2} \right) \sum_{i=1}^n [\tilde Q_k^\top s_{trial}]_i^2 \le -\alpha \sum_{i=1}^n [\tilde Q_k^\top s_{trial}]_i^2,$$

where the equality follows from the fact that $\tilde Q_k$ is an orthogonal $n \times n$ matrix, so $\|s_{trial}\|^2 = \|\tilde Q_k^\top s_{trial}\|^2 = \sum_{i=1}^n [\tilde Q_k^\top s_{trial}]_i^2$, and the last inequality holds due to (18). Now, from the way $\sigma_k$ is updated at Step 4 of Algorithm 1, it can easily be seen that the fulfillment of (12) with $p = 2$ requires at most the number of function evaluations given in (16); additionally, the upper bound on $\sigma_k$ at (17) is derived from (18).
The following assumption, which holds for global solutions of subproblem (11) (with $p = 2$), is central to establishing our WCC results. For similar assumptions required to obtain worst-case complexity bounds, see (Birgin et al. 2017; Martínez 2017).

Assumption 4.3 Assume that, for all $k \ge 0$,

$$\|\tilde Q_k^\top s_{trial}\|_\infty = \frac{\xi}{\sigma_k}, \quad \text{or} \quad \|\tilde Q_k^\top s_{trial}\|_\infty = \Delta,$$
$$\text{or} \quad \left\| \nabla_s \left( \tilde g_k^\top s + \frac{1}{2} s^\top \tilde H_k s + \frac{\sigma_k}{2} \sum_{i=1}^n [\tilde Q_k^\top s]_i^2 \right) \bigg|_{s = s_{trial}} \right\| \le \beta_1 \|s_{trial}\|, \qquad (19)$$

for some $\beta_1 > 0$.

Under this assumption, we are able to prove that, when the trial point is not on the boundary of the feasible region of (11) (with $p = 2$), the norm of the gradient of the objective function at the new point is of the same order as the norm of the trial point.

Lemma 4.3 Let Assumptions 3.1, 4.1, 4.2, and 4.3 hold. Then, we have

$$\|\tilde Q_k^\top s_{trial}\|_\infty = \frac{\xi}{\sigma_k} \quad \text{or} \quad \|\tilde Q_k^\top s_{trial}\|_\infty = \Delta, \qquad (20)$$

or

$$\|\nabla f(x_k + s_{trial})\| \le \kappa_1 \|s_{trial}\|,$$

where $\kappa_1 = L_g + \kappa_{eg} + \kappa_{\tilde H} + \sigma_{max} + \beta_1$, and $\sigma_{max}$ was defined in Lemma 4.2.

Proof Assume that neither of the equalities in (20) holds. We have $\nabla_s \tilde m_k^{SR}(s_{trial}) = \tilde g_k + \tilde H_k s_{trial} + r(s_{trial})$, where

$$r(s_{trial}) = \sigma_k \tilde Q_k \left( [\tilde Q_k^\top s_{trial}]_1, \ldots, [\tilde Q_k^\top s_{trial}]_n \right)^\top.$$

Now, by using Assumption 3.1, (14), and (6), we have

$$\|\nabla f(x_k + s_{trial}) - \nabla_s \tilde m_k^{SR}(s_{trial})\| = \|\nabla f(x_k + s_{trial}) - (\tilde g_k + \tilde H_k s_{trial} + r(s_{trial}))\|$$
$$\le \|\nabla f(x_k + s_{trial}) - \nabla f(x_k)\| + \|\nabla f(x_k) - \tilde g_k\| + \|\tilde H_k s_{trial}\| + \|r(s_{trial})\|$$
$$\le \left( L_g + \kappa_{eg} + \kappa_{\tilde H} + \sigma_{max} \right) \|s_{trial}\|.$$

Therefore, in view of (19), we have

$$\|\nabla f(x_k + s_{trial})\| \le \|\nabla f(x_k + s_{trial}) - \nabla_s \tilde m_k^{SR}(s_{trial})\| + \|\nabla_s \tilde m_k^{SR}(s_{trial})\| \le \left( L_g + \kappa_{eg} + \kappa_{\tilde H} + \sigma_{max} + \beta_1 \right) \|s_{trial}\|,$$

which completes the proof.

Now, we have all the ingredients to derive an upper bound on the number of iterations required by Algorithm 1 to find a point at which the norm of the gradient is below some given positive threshold.

Theorem 4.1 Given $\varepsilon > 0$, let Assumptions 3.1, 4.1, 4.2, and 4.3 hold. Let $\{x_k\}$ be the sequence of iterates generated by Algorithm 1, and $f_{min} \le f(x_0)$. Then the number of iterations such that $\|\nabla f(x_{k+1})\| > \varepsilon$ and $f(x_{k+1}) > f_{min}$ is bounded above by

$$\frac{f(x_0) - f_{min}}{\alpha \min\left\{ \dfrac{\xi}{\sigma_{max}},\ \dfrac{\varepsilon}{\kappa_1} \right\}^2}, \qquad (21)$$

where $\sigma_{max}$ and $\kappa_1$ were defined in Lemmas 4.2 and 4.3, respectively.

Proof In view of Lemma 4.3, we have

$$\|s_k\| \ge \min\left\{ \frac{\xi}{\sigma_k},\ \Delta,\ \frac{\|\nabla f(x_{k+1})\|}{\kappa_1} \right\}.$$

Hence, since $\|\nabla f(x_{k+1})\| > \varepsilon$ and $\Delta > \xi/\sigma_k$, we obtain

$$\|s_k\| \ge \min\left\{ \frac{\xi}{\sigma_k},\ \frac{\varepsilon}{\kappa_1} \right\} \ge \min\left\{ \frac{\xi}{\sigma_{max}},\ \frac{\varepsilon}{\kappa_1} \right\}.$$

On the other hand, due to the sufficient decrease condition (12), we obtain

$$f(x_{k+1}) \le f(x_k) - \alpha \sum_{i=1}^n [\tilde Q_k^\top s_k]_i^2 = f(x_k) - \alpha \|\tilde Q_k^\top s_k\|^2 \le f(x_k) - \alpha \min\left\{ \frac{\xi}{\sigma_{max}},\ \frac{\varepsilon}{\kappa_1} \right\}^2.$$
By summing up these inequalities, for $0, 1, \ldots, k$, we obtain

$$k + 1 \le \frac{f(x_0) - f_{min}}{\alpha \min\left\{ \dfrac{\xi}{\sigma_{max}},\ \dfrac{\varepsilon}{\kappa_1} \right\}^2},$$

which concludes the proof.

Since $\kappa_{eg} = O(\sqrt{n})$ (see Chapter 2 in Conn et al. 2009b), we have $\kappa_1 = O(\sqrt{n})$. Now, if $\xi$ is chosen such that $\xi/\sigma_{max} = O(\varepsilon/\kappa_1)$, then the dependency on $n$ of the upper bound given in (21) is $O(n)$. Furthermore, building a fully-linear model needs $O(n)$ function evaluations. Combining these facts with Theorem 4.1, we can derive an upper bound on the number of function evaluations that Algorithm 1 needs to drive the first-order stationarity measure below some given positive threshold.

Corollary 4.1 Given $\varepsilon > 0$, let Assumptions 3.1, 4.1, 4.2, and 4.3 hold. Let $\{x_k\}$ be the sequence of iterates generated by Algorithm 1 and assume that $\|\nabla f(x_{k+1})\| > \varepsilon$ and $f(x_{k+1}) > f_{min}$. Then, Algorithm 1 needs at most $O\!\left(n^2 \varepsilon^{-2}\right)$ function evaluations to drive the norm of the gradient below $\varepsilon$.

The complexity bound derived here matches the one derived in Garmanjani et al. (2016) for derivative-free trust-region optimization methods and for direct search methods in Vicente (2013); see also Dodangeh et al. (2016).
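For concreteness, the evaluation-count bookkeeping behind Corollary 4.1 can be spelled out (a sketch, under the stated choices $\kappa_1 = O(\sqrt{n})$ and $\xi/\sigma_{max} = O(\varepsilon/\kappa_1)$):

$$\underbrace{O\!\left( \frac{f(x_0) - f_{min}}{\alpha} \cdot \frac{\kappa_1^2}{\varepsilon^2} \right)}_{\text{iterations, from (21)}} \times \underbrace{O(n)}_{\text{evaluations per fully-linear model}} = O\!\left( n\,\varepsilon^{-2} \right) \times O(n) = O\!\left( n^2\,\varepsilon^{-2} \right),$$

since $\kappa_1^2 = O(n)$.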
4.2 Fully-quadratic approach

In this subsection, we analyze the WCC of Algorithm 1 when fully-quadratic models are built. The following lemma is essential for establishing such bounds.

Lemma 4.4 (Nesterov 2004, Lemma 1.2.4) Let Assumption 3.2 hold. Then, we have

$$\left| f(x+s) - f(x) - \nabla f(x)^\top s - \frac{1}{2} s^\top \nabla^2 f(x) s \right| \le \frac{L_H}{6} \|s\|^3, \qquad (22)$$

and

$$\|\nabla f(x+s) - \nabla f(x) - \nabla^2 f(x) s\| \le \frac{L_H}{2} \|s\|^2. \qquad (23)$$

Similarly to the fully-linear case, we assume that the trial point provides decrease for the current model, i.e., that for $p = 3$ the value of the objective function of (11) at $s_{trial}$ is less than or equal to its value at $s = 0$.

Assumption 4.4 Assume that

$$\tilde g_k^\top s_{trial} + \frac{1}{2} s_{trial}^\top \tilde H_k s_{trial} + \frac{\sigma_k}{6} \sum_{i=1}^n |[\tilde Q_k^\top s_{trial}]_i|^3 \le 0. \qquad (24)$$

We note that (24) is clearly satisfied if $s_{trial}$ is a global solution of (11) when $p = 3$. Therefore, taking advantage of our separability approach, the vector $s_{trial}$ obtained at Step 3 of Algorithm 1 satisfies (24). With this assumption, we are able to obtain upper bounds on the number of function evaluations required to satisfy the sufficient decrease condition (12), and also on the regularization parameter.

Lemma 4.5 Let Assumptions 3.2 and 4.4 hold, and assume that at Step 2 of Algorithm 1 a fully-quadratic model is always used. In order to satisfy condition (12), with $p = 3$, Algorithm 1 needs at most

$$\left\lceil \frac{\log\!\left( 6\left(\alpha + \sqrt{n}\left(\frac{L_H}{6} + \kappa_{eg} + \frac{\kappa_{eh}}{2}\right)\right)\big/\sigma_{small} \right)}{\log \eta} \right\rceil + 1$$

function evaluations, not considering the ones required for model computation. In addition, the maximum value of $\sigma_k$ for which (12) is satisfied is given by

$$\sigma_{max} = \max\left\{ \sigma_{small},\ 6\eta \left(\alpha + \sqrt{n}\left(\frac{L_H}{6} + \kappa_{eg} + \frac{\kappa_{eh}}{2}\right)\right) \right\}. \qquad (25)$$

Proof First, we show that if

$$\sigma_k \ge 6\left(\alpha + \sqrt{n}\left(\frac{L_H}{6} + \kappa_{eg} + \frac{\kappa_{eh}}{2}\right)\right) \qquad (26)$$

then the sufficient decrease condition (12) of Algorithm 1 is satisfied with $p = 3$. In view of (22), we have

$$f(x_k + s_{trial}) - f(x_k) \le \nabla f(x_k)^\top s_{trial} + \frac{1}{2} s_{trial}^\top \nabla^2 f(x_k) s_{trial} + \frac{L_H}{6} \|s_{trial}\|^3$$
$$\le \tilde g_k^\top s_{trial} + \frac{1}{2} s_{trial}^\top \tilde H_k s_{trial} + \frac{L_H}{6} \|s_{trial}\|^3 + |(\nabla f(x_k) - \tilde g_k)^\top s_{trial}| + \frac{1}{2} |s_{trial}^\top (\nabla^2 f(x_k) - \tilde H_k) s_{trial}|.$$

Thus, by using (7), (8), and $\|s_{trial}\| \ge \xi/\sigma_k$ (due to Step 3 of Algorithm 1), we obtain

$$f(x_k + s_{trial}) - f(x_k) \le \tilde g_k^\top s_{trial} + \frac{1}{2} s_{trial}^\top \tilde H_k s_{trial} + \left( \frac{L_H}{6} + \kappa_{eg} + \frac{\kappa_{eh}}{2} \right) \|s_{trial}\|^3.$$

Now, by applying (24), we have

$$f(x_k + s_{trial}) - f(x_k) \le -\frac{\sigma_k}{6} \sum_{i=1}^n |[\tilde Q_k^\top s_{trial}]_i|^3 + \left( \frac{L_H}{6} + \kappa_{eg} + \frac{\kappa_{eh}}{2} \right) \|s_{trial}\|^3,$$

which, in view of the inequality $\|\cdot\|_3 \ge n^{-1/6} \|\cdot\|_2$ (see Theorem 16 on page 26 in Hardy et al. (1934)), leads to

$$f(x_k + s_{trial}) - f(x_k) \le -\frac{\sigma_k}{6} \sum_{i=1}^n |[\tilde Q_k^\top s_{trial}]_i|^3 + \sqrt{n} \left( \frac{L_H}{6} + \kappa_{eg} + \frac{\kappa_{eh}}{2} \right) \|\tilde Q_k^\top s_{trial}\|_3^3$$
$$= \left( \sqrt{n} \left( \frac{L_H}{6} + \kappa_{eg} + \frac{\kappa_{eh}}{2} \right) - \frac{\sigma_k}{6} \right) \sum_{i=1}^n |[\tilde Q_k^\top s_{trial}]_i|^3 \le -\alpha \sum_{i=1}^n |[\tilde Q_k^\top s_{trial}]_i|^3,$$

where the equality follows from the fact that, for any vector $w \in \mathbb{R}^n$, $\|w\|_3^3 = \sum_{i=1}^n |w_i|^3$, so $\|\tilde Q_k^\top s_{trial}\|_3^3 = \sum_{i=1}^n |[\tilde Q_k^\top s_{trial}]_i|^3$, and the last inequality holds due to (26). Now, from the way $\sigma_k$ is updated at Step 4 of Algorithm 1, it can easily be seen that the fulfillment of (26) requires at most the number of function evaluations stated above; additionally, the upper bound on $\sigma_k$ at (25) is derived from (26).

The following assumption is quite similar to condition (14) given in Martínez and Raydan (2017), and it holds for global solutions of subproblem (11) (with $p = 3$). For similar assumptions, required to obtain worst-case complexity bounds associated with cubic regularization, see (Bellavia et al. 2021; Birgin et al. 2017; Cartis et al. 2011b; Cartis and Scheinberg 2018; Martínez 2017; Xu et al. 2020).

Assumption 4.5 Assume that, for all $k \ge 0$,

$$\|\tilde Q_k^\top s_{trial}\|_\infty = \frac{\xi}{\sigma_k}, \quad \text{or} \quad \|\tilde Q_k^\top s_{trial}\|_\infty = \Delta,$$
$$\text{or} \quad \left\| \nabla_s \left( \tilde g_k^\top s + \frac{1}{2} s^\top \tilde H_k s + \frac{\sigma_k}{6} \sum_{i=1}^n |[\tilde Q_k^\top s]_i|^3 \right) \bigg|_{s = s_{trial}} \right\| \le \beta_2 \|s_{trial}\|^2, \qquad (27)$$

for some $\beta_2 > 0$.

Again, we are able to prove that, when the trial point is not on the boundary of the feasible region of (11) (with $p = 3$), the norm of the gradient of the function computed at the new point is of the order of the squared norm of the trial point.

Lemma 4.6 Let Assumptions 3.2, 4.4, and 4.5 hold. Then, we have

$$\|\tilde Q_k^\top s_{trial}\|_\infty = \frac{\xi}{\sigma_k} \quad \text{or} \quad \|\tilde Q_k^\top s_{trial}\|_\infty = \Delta, \qquad (28)$$

or

$$\|\nabla f(x_k + s_{trial})\| \le \kappa_2 \|s_{trial}\|^2,$$

where $\kappa_2 = \frac{L_H}{2} + \kappa_{eg} + \kappa_{eh} + \frac{\sigma_{max}}{2} + \beta_2$, and $\sigma_{max}$ was defined in Lemma 4.5.

Proof Assume that neither of the equalities in (28) holds. We have $\nabla_s \tilde m_k^{SR}(s_{trial}) = \tilde g_k + \tilde H_k s_{trial} + r(s_{trial})$, where

$$r(s_{trial}) = \frac{\sigma_k}{2}\, \tilde Q_k \left( \mathrm{sign}([\tilde Q_k^\top s_{trial}]_1)\, [\tilde Q_k^\top s_{trial}]_1^2, \ldots, \mathrm{sign}([\tilde Q_k^\top s_{trial}]_n)\, [\tilde Q_k^\top s_{trial}]_n^2 \right)^\top.$$

Now, by using (23), (7), and (8), we have

$$\|\nabla f(x_k + s_{trial}) - \nabla_s \tilde m_k^{SR}(s_{trial})\| = \|\nabla f(x_k + s_{trial}) - (\tilde g_k + \tilde H_k s_{trial} + r(s_{trial}))\|$$
$$\le \|\nabla f(x_k + s_{trial}) - \nabla f(x_k) - \nabla^2 f(x_k) s_{trial}\| + \|\nabla f(x_k) - \tilde g_k\| + \|(\nabla^2 f(x_k) - \tilde H_k) s_{trial}\| + \|r(s_{trial})\|$$
$$\le \left( \frac{L_H}{2} + \kappa_{eg} + \kappa_{eh} + \frac{\sigma_{max}}{2} \right) \|s_{trial}\|^2.$$

Therefore, in view of (27), we have

$$\|\nabla f(x_k + s_{trial})\| \le \|\nabla f(x_k + s_{trial}) - \nabla_s \tilde m_k^{SR}(s_{trial})\| + \|\nabla_s \tilde m_k^{SR}(s_{trial})\| \le \left( \frac{L_H}{2} + \kappa_{eg} + \kappa_{eh} + \frac{\sigma_{max}}{2} + \beta_2 \right) \|s_{trial}\|^2,$$

which completes the proof.

Now, we have all the supporting results to establish the WCC bound of Algorithm 1 for the fully-quadratic case.

Theorem 4.2 Given $\varepsilon > 0$, let Assumptions 3.2, 4.4, and 4.5 hold. Let $\{x_k\}$ be the sequence of iterates generated by Algorithm 1, and $f_{min} \le f(x_0)$. Then the number of iterations such that $\|\nabla f(x_{k+1})\| > \varepsilon$ and $f(x_{k+1}) > f_{min}$ is bounded above by

$$\frac{\sqrt{n}\, (f(x_0) - f_{min})}{\alpha \min\left\{ \dfrac{\xi}{\sigma_{max}},\ \left(\dfrac{\varepsilon}{\kappa_2}\right)^{1/2} \right\}^3}, \qquad (29)$$

where $\sigma_{max}$ and $\kappa_2$ were defined in Lemmas 4.5 and 4.6, respectively.

Proof In view of Lemma 4.6, we have

$$\|s_k\| \ge \min\left\{ \frac{\xi}{\sigma_k},\ \Delta,\ \left(\frac{\|\nabla f(x_{k+1})\|}{\kappa_2}\right)^{1/2} \right\}.$$

Hence, since $\|\nabla f(x_{k+1})\| > \varepsilon$ and $\Delta > \xi/\sigma_k$, we obtain

$$\|s_k\| \ge \min\left\{ \frac{\xi}{\sigma_k},\ \left(\frac{\varepsilon}{\kappa_2}\right)^{1/2} \right\} \ge \min\left\{ \frac{\xi}{\sigma_{max}},\ \left(\frac{\varepsilon}{\kappa_2}\right)^{1/2} \right\}.$$
On the other hand, due to the sufficient decrease condition (12) and the inequality $\|\cdot\|_3 \ge n^{-1/6} \|\cdot\|_2$, we obtain

$$f(x_{k+1}) \le f(x_k) - \alpha \sum_{i=1}^n |[\tilde Q_k^\top s_k]_i|^3 \le f(x_k) - \frac{\alpha \|\tilde Q_k^\top s_k\|_2^3}{\sqrt{n}} \le f(x_k) - \frac{\alpha}{\sqrt{n}} \min\left\{ \frac{\xi}{\sigma_{max}},\ \left(\frac{\varepsilon}{\kappa_2}\right)^{1/2} \right\}^3.$$

By summing up these inequalities, for $0, 1, \ldots, k$, we obtain

$$k + 1 \le \frac{\sqrt{n}\, (f(x_0) - f_{min})}{\alpha \min\left\{ \dfrac{\xi}{\sigma_{max}},\ \left(\dfrac{\varepsilon}{\kappa_2}\right)^{1/2} \right\}^3},$$

which concludes the proof.

Similarly to the fully-linear case, since $\kappa_{eg} = O(n)$ and $\kappa_{eh} = O(n)$ (see Chapter 3 in Conn et al. (2009b)), we have $\kappa_2 = O(n^{3/2})$. By choosing $\xi$ such that $\xi/\sigma_{max} = O((\varepsilon/\kappa_2)^{1/2})$, the dependency on $n$ of the upper bound given in (29) becomes of the order $O(n^{11/4})$. Additionally, building a fully-quadratic model needs $O(n^2)$ function evaluations. Combining these facts with Theorem 4.2, we can establish a WCC bound for driving the first-order stationarity measure below some given positive threshold.

Corollary 4.2 Given $\varepsilon > 0$, let Assumptions 3.2, 4.4, and 4.5 hold. Let $\{x_k\}$ be the sequence of iterates generated by Algorithm 1 and assume that $\|\nabla f(x_{k+1})\| > \varepsilon$ and $f(x_{k+1}) > f_{min}$. Then, Algorithm 1 needs at most $O\!\left(n^{19/4} \varepsilon^{-3/2}\right)$ function evaluations to drive the norm of the gradient below $\varepsilon$.

In terms of $\varepsilon$, the derived complexity bound matches the one established in Cartis et al. (2012) for a derivative-free method with adaptive cubic regularization. The dependency on $n$ of the bound derived here is worse than the one derived in Cartis et al. (2012). However, we have explicitly taken into account the dependency of the constants $\kappa_{eg}$ and $\kappa_{eh}$ on $n$.

5 Illustrative numerical experiments

In this section we illustrate the different options to build the quadratic models at Step 2 of Algorithm 1 and two different strategies to address the subproblems (10).

Model computation is a key issue for the success of Algorithm 1. However, in derivative-free optimization, saving function evaluations by reusing previously evaluated points is a main concern. At each evaluation of a new point, the corresponding function value is stored in a list, of maximum size equal to $(n+1)(n+2)$, for possible future use in model computation. If new points need to be generated with the sole purpose of model computation, the center, 'extreme' points, and 'mid-points' of the set defined by $x_k + \delta\, [I\ \ -I]$ (for the sampling radius $\delta$ in use) are considered. Inspired by the works of Bandeira et al. (2012) and Fasano et al. (2009), no explicit control of geometry is kept (in fact, we also tried the approach suggested by Scheinberg and Toint (2010), but the results did not improve). If a new point is evaluated and the maximum number of points allowed in the list has been reached, then the point farthest away from the current iterate is replaced by the new one. Points are always selected in $B(x_k; 1/\sigma_k)$ for model computation. The option for a radius larger than $\xi/\sigma_k$ (since in our numerical implementation $\xi = 10^{-5}$) allows a better reuse of the function values previously computed, avoiding an excessive number of function evaluations just for model computation. Additionally, the definition of the radius as $1/\sigma_k$ ensures that if the regularization parameter increases, the size of the neighborhood in which the points are selected decreases, a mechanism that resembles the behavior of the trust-region radius in derivative-based optimization.
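A minimal sketch of the point-reuse list just described (our illustration; the class name and interface are hypothetical):

```python
import numpy as np

class EvalCache:
    """Store evaluated points for model reuse: capacity (n+1)(n+2); when full,
    the point farthest from the current iterate is replaced by the new one."""
    def __init__(self, n):
        self.capacity = (n + 1) * (n + 2)
        self.points, self.values = [], []

    def add(self, x, fx, x_current):
        if len(self.points) < self.capacity:
            self.points.append(np.array(x)); self.values.append(fx)
        else:
            far = max(range(len(self.points)),
                      key=lambda i: np.linalg.norm(self.points[i] - x_current))
            self.points[far], self.values[far] = np.array(x), fx

    def in_ball(self, center, radius):
        """Points (and values) available for model computation in B(center; radius)."""
        idx = [i for i, p in enumerate(self.points)
               if np.linalg.norm(p - center) <= radius]
        return [self.points[i] for i in idx], [self.values[i] for i in idx]
```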
Fully-linear and fully-quadratic models can be considered at all iterations, as well as hybrid versions, where the option for a fully-linear or a fully-quadratic model is taken depending on the number of points available for reuse inside $B(x_k; 1/\sigma_k)$ (thus, some iterations will use a fully-linear model and others a fully-quadratic model). Fully-quadratic models always require $(n+1)(n+2)/2$ points for computation. Fully-linear models are built using all the points available in $B(x_k; 1/\sigma_k)$, provided that this number is at least $n+2$ and does not exceed $(n+1)(n+2)/2 - 1$. In this case, a minimum Frobenius norm approach is taken to solve the linear system that provides the model coefficients (see Conn et al. 2009b, Section 5.3).

Regarding the solution of subproblem (10), the imposed lower bound causes difficulties for the separability approach. Two strategies were considered to address it. In the first one, every one-dimensional problem considers the corresponding lower and upper bounds. This approach is not equivalent to the original formulation: it imposes a stronger condition, since any vector $y$ computed this way will satisfy $\|y\|_\infty \ge \xi/\sigma_k$, but there could be a vector $y$ satisfying $\|y\|_\infty \ge \xi/\sigma_k$ that does not satisfy $|y_i| \ge \xi/\sigma_k$ for all $i \in \{1, \ldots, n\}$. The second approach disregards the lower bound condition, only considering $\|y\|_\infty \le \Delta$ when solving subproblem (10). After computing $y$, the lower bound condition is tested and, if it is not satisfied, $\max_{i=1,\ldots,n} |y_i|$ is set equal to $\xi/\sigma_k$ to force the obtained vector $y$ to also satisfy the lower bound constraint in (10).

Algorithm 1 was implemented in Matlab 2021a. The experiments were executed on a laptop with an Intel Core i7 1.99 GHz CPU and 16 GB of RAM, running Windows 10 64-bits. As test sets, we considered the smooth collection of 44 problems proposed in Moré and Wild (2009) and 111 unconstrained problems with 40 or fewer variables from OPM, a subset of the CUTEst collection (Gould et al. 2015). Computational codes for the problems and the proposed initial points can be found at https://www.mcs.anl.gov/~more/df and https://github.com/gratton7/OPM, respectively.

Parameters in Algorithm 1 were set to the following values: $\Delta = 10$; for each iteration $k$, $\sigma_{small} = 0.1$; $\eta = 8$; and $\alpha = 10^{-4}$. At each iteration, the process is initialized with the minimization of the quadratic model (9), with no regularization term, computed by selecting points in $B(x_k; 1)$. In this case, no lower bound is considered when solving subproblem (10). If the sufficient decrease condition (12) is not satisfied by the computed solution, then the regularization process is initiated, considering $\sigma_k = \sigma_{small}$. This approach allows taking advantage of the local properties of the "pure" quadratic models. As stopping criteria we consider $\|\tilde g_k\| < \epsilon$, where $\epsilon = 10^{-5}$, or a maximum of 1500 function evaluations.

Regarding model computation, four variants were tested, depending on the use of fully-linear or fully-quadratic models and also on the value of $p$ in the sufficient decrease condition used to accept new points at Step 4 of Algorithm 1. The Fully-quadratic variant always computes a fully-quadratic model, built using $(n+1)(n+2)/2$ points, with a cubic sufficient decrease condition ($p = 3$). The Fully-linear variant always computes a quadratic model, using $n+2$ points, under a minimum Frobenius norm approach; in this case, the sufficient decrease condition considers $p = 2$.
Hybrid versions compute fully-quadratic models, using $(n+1)(n+2)/2$ points, or fully-linear minimum Frobenius norm models, with at least $n+2$ points and a maximum of $(n+1)(n+2)/2 - 1$ points (depending on the number of points available in $B(x_k; 1/\sigma_k)$). In this case, variant Hybrid_p3 always uses a cubic sufficient decrease condition to accept new points, whereas variant Hybrid_p23 selects a quadratic or cubic sufficient decrease condition, depending on the type of model that could be computed at the current iteration ($p = 2$ for fully-linear and $p = 3$ for fully-quadratic).

Results are reported using data profiles (Moré and Wild 2009) and performance profiles (Dolan and Moré 2002). In a simplified way, a data profile provides the percentage of problems solved by a given algorithmic variant within a given computational budget (expressed in sets of $n_p + 1$ function evaluations, where $n_p$ denotes the dimension of problem $p$). Let $S$ and $P$ represent the set of solvers, associated with the different algorithmic variants considered, and the set of problems to be tested, respectively. If $h_{p,s}$ represents the number of function evaluations required by algorithm $s \in S$ to solve problem $p \in P$ (up to a certain accuracy), the data profile cumulative function is given by

$$d_s(\zeta) = \frac{1}{|P|} \left| \left\{ p \in P : \frac{h_{p,s}}{n_p + 1} \le \zeta \right\} \right|. \qquad (30)$$

In this setting, a problem is considered solved to an accuracy level $\tau$ if the decrease obtained from the initial objective function value, $f(x_0) - f(x)$, is at least $1 - \tau$ times the best decrease obtained by any of the solvers considered, $f(x_0) - f_L$, meaning:

$$f(x_0) - f(x) \ge (1 - \tau)\, [f(x_0) - f_L]. \qquad (31)$$

In the numerical experiments reported, the accuracy level was set equal to $10^{-5}$.

[Fig. 1 Data and performance profiles comparing the use of different strategies for the computation of the quadratic models (variants Fully-quadratic, Hybrid_p23, Fully-linear, Hybrid_p3).]

Performance profiles allow evaluating the efficiency and the robustness of a given algorithmic variant. Let $t_{p,s}$ be the number of function evaluations required by solver $s \in S$ to solve problem $p \in P$, according to the criterion (31). The cumulative distribution function corresponding to the performance profile of solver $s \in S$ is given by

$$\rho_s(\varsigma) = \frac{1}{|P|} \left| \{ p \in P : r_{p,s} \le \varsigma \} \right|,$$

with $r_{p,s} = t_{p,s} / \min\{t_{p,\bar s} : \bar s \in S\}$. Thus, the value of $\rho_s(1)$ represents the percentage of problems where solver $s$ required the minimum number of function evaluations, meaning it was the most efficient solver. Large values of $\varsigma$ allow evaluating the capability of the algorithmic variants to solve the complete collection.

Figure 1 reports the results obtained when considering different strategies for building the quadratic models. In this case, the stricter approach is used for solving subproblem (10), always imposing the lower bound on each entry of the vector $y$ in each one-dimensional minimization. It is clear that the hybrid version that adequately adapts the sufficient decrease condition to the type of computed model presents the best performance. The hybrid version that does not adapt the sufficient decrease condition is no better than the fully-linear approach. Even so, both are better than requiring the computation of a fully-quadratic model at every iteration.
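The profile computations in (30) and $\rho_s$ are straightforward to implement; a sketch (our code, with the convention that unsolved problem/solver pairs carry the value `np.inf`):

```python
import numpy as np

def data_profile(h, npdims, zeta):
    """d_s(zeta) of (30): fraction of problems solved by each solver within
    zeta groups of (n_p + 1) evaluations.  h[p, s] = evaluations solver s
    needed on problem p under criterion (31) (np.inf if never satisfied)."""
    budgets = np.asarray(npdims, dtype=float) + 1.0
    return np.mean(h / budgets[:, None] <= zeta, axis=0)

def performance_profile(t, varsigma):
    """rho_s(varsigma): fraction of problems on which each solver's evaluation
    count is within a factor varsigma of the best solver on that problem."""
    best = np.min(t, axis=1, keepdims=True)   # per-problem best solver
    return np.mean(t / best <= varsigma, axis=0)
```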
For the best variant, namely the hybrid version that adapts the sufficient decrease condition, we also considered the second approach to the solution of problem (10), where the lower bound constraint is initially ignored and the computed solution $y$ is modified a posteriori if it does not satisfy the lower bound constraint. We denote this variant by adding the word projection. Results are very similar and can be found in Fig. 2. It is worth mentioning that the modification of the final solution, obtained by ignoring the lower bound constraint, was required only in seven problems, with a maximum of three times in two of those seven problems.

[Fig. 2 Data and performance profiles comparing two different strategies to address the solution of subproblem (10) (variants Hybrid_p23 and Hybrid_p23 projection).]

6 Conclusions and final remarks

We present and analyze a derivative-free separable regularization approach for solving smooth unconstrained minimization problems. At each iteration we build a quadratic model of the objective function using only function evaluations. Several variants have been considered for this task, from a less expensive minimum Frobenius norm approach, to a more expensive fully-quadratic model, to a hybrid version that combines the previous approaches depending on the number of available useful points from previous iterations. For each one of the variants, we add to the model either a separable quadratic or a separable cubic regularization term to guarantee convergence to stationary points. Moreover, for each option we present a WCC analysis and establish that, for driving the norm of the gradient below $\varepsilon > 0$, the fully-quadratic and the minimum Frobenius norm regularized approaches need at most $O\!\left(n^{19/4} \varepsilon^{-3/2}\right)$ or $O\!\left(n^2 \varepsilon^{-2}\right)$ function evaluations, respectively.

The application of a convenient change of variables, based on the Schur factorization of the approximate Hessian matrices, trivializes the computation of the minimizer of the regularized models required at each iteration. In fact, the solution of the subproblem required at each iteration is reduced to the global minimization of $n$ independent one-dimensional simple functions (a polynomial of degree 2 plus a term of the form $|z|^3$) on a closed and bounded set on the real line. It is worth noticing that, for the typical low dimensions used in DFO, the $O(n^3)$ computational cost of Schur factorizations is insignificant compared to the cost associated with the function evaluations required to build the quadratic model. Nevertheless, in addition to its use in Brás et al. (2020), this separability approach can be extended to be efficiently applied in other large-scale scenarios, for example in inexact or probabilistic adaptive cubic regularization; see, e.g., (Bellavia et al. 2021; Cartis and Scheinberg 2018). We would like to point out that the global minimizers of the regularized models can also be obtained using other tractable schemes that, instead of solving $n$ independent one-dimensional problems, solve at each iteration only one problem in $n$ variables; see, e.g., (Cartis et al. 2011a; Cristofari et al. 2019; Nesterov and Polyak 2006).
These non-separable schemes, as well as our separable approach, require an $O(n^3)$ linear algebra computational cost.

We also present a variety of numerical experiments to add understanding and illustrate the behavior of all the different options considered for model computation. In general, we noticed that all the options show a robust performance. However, the worst behavior, concerning the required number of function evaluations, is consistently observed when using the fully-quadratic approach, and the best performance is observed when using the hybrid versions combined with a separable regularization term.

Concerning the worst-case complexity (WCC) results obtained for the considered approaches, a few comments are in order. Even though these results are of a theoretical nature and in general pessimistic in relation to the practical behavior of the methods, it is interesting to analyze which of the two considered approaches produces a better WCC result. For that, it is convenient to use their leading terms, i.e., $n^{19/4} \varepsilon^{-3/2}$ for the one using the fully-quadratic model and $n^2 \varepsilon^{-2}$ for the one using the minimum Frobenius norm model. After some simple algebraic manipulations, we obtain that for the fully-quadratic approach to be better (that is, to require fewer function evaluations in the worst case), it must hold that $n < \varepsilon^{-2/11}$, or equivalently that $\varepsilon < 1/n^{11/2}$. Therefore, if $n$ is relatively small and $\varepsilon$ is not very large (for example $n < 9$ and $\varepsilon \approx 10^{-5}$), then the combined scheme based on the fully-quadratic model has a better WCC result than the scheme based on the minimum Frobenius norm approach. In our numerical experiments, the average dimension was 8.8 and for our stopping criterion we fixed $\varepsilon = 10^{-5}$; hence, from the theoretical WCC point of view, the best option is the one based on the fully-quadratic model. However, in our computational experiments the worst practical performance is clearly associated with the combination that uses the fully-quadratic model. We also note that if we choose a more tolerant stopping criterion (say $\varepsilon = 10^{-2}$), then for most of the same considered small-dimensional problems we have $\varepsilon > 1/n^{11/2}$, and so the scheme that uses the minimum Frobenius norm model simultaneously exhibits the best theoretical WCC result and the best practical performance.

Finally, for future work, it would be interesting to study the practical behavior and the WCC results of the proposed derivative-free approach in the case of convex functions.

Acknowledgements We are thankful for the comments and suggestions of two anonymous referees, which helped us to improve the presentation of our work.

Author Contributions All authors contributed in the same way to the study, conception, design, data collection, conceptualization, methodology, formal analysis, and investigation. All authors read and approved the final manuscript.

Funding Open access funding provided by FCT|FCCN (b-on). The first and second authors are funded by national funds through FCT - Fundação para a Ciência e a Tecnologia, I.P., under the scope of projects PTDC/MAT-APL/28400/2017, UIDP/MAT/00297/2020, and UIDB/MAT/00297/2020 (Center for Mathematics and Applications). The third author is funded by national funds through the FCT - Fundação para a Ciência e a Tecnologia, I.P., under the scope of the projects CEECIND/02211/2017, UIDP/MAT/00297/2020, and UIDB/MAT/00297/2020 (Center for Mathematics and Applications).
Data availability The codes and datasets generated and/or analysed during the current study are available from the corresponding author on reasonable request.

Declarations

Conflict of interest No potential conflict of interest was reported by the authors.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

Bandeira AS, Scheinberg K, Vicente LN (2012) Computation of sparse low degree interpolating polynomials and their application to derivative-free optimization. Math Program 134:223–257
Bellavia S, Gurioli G, Morini B (2021) Adaptive cubic regularization methods with dynamic inexact Hessian information and applications to finite-sum minimization. IMA J Numer Anal 41:764–799
Birgin EG, Gardenghi JL, Martínez JM, Santos SA, Toint PhL (2017) Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models. Math Program 163:359–
Brás CP, Martínez JM, Raydan M (2020) Large-scale unconstrained optimization using separable cubic modeling and matrix-free subspace minimization. Comput Optim Appl 75:169–205
Cartis C, Gould NIM, Toint PhL (2011a) Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Math Program 127:245–295
Cartis C, Gould NIM, Toint PhL (2011b) Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function- and derivative-evaluation complexity. Math Program 130:295–319
Cartis C, Gould NIM, Toint PhL (2012) On the oracle complexity of first-order and derivative-free algorithms for smooth nonconvex minimization. SIAM J Optim 22:66–86
Cartis C, Scheinberg K (2018) Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Math Program 169:337–375
Conn AR, Scheinberg K, Vicente LN (2008a) Geometry of interpolation sets in derivative free optimization. Math Program 111:141–172
Conn AR, Scheinberg K, Vicente LN (2008b) Geometry of sample sets in derivative-free optimization: polynomial regression and underdetermined interpolation. IMA J Numer Anal 28:721–748
Conn AR, Scheinberg K, Vicente LN (2009a) Global convergence of general derivative-free trust-region algorithms to first- and second-order critical points. SIAM J Optim 20:387–415
Conn AR, Scheinberg K, Vicente LN (2009b) Introduction to derivative-free optimization. SIAM, Philadelphia
Cristofari A, Dehghan Niri T, Lucidi S (2019) On global minimizers of quadratic functions with cubic regularization. Optim Lett 13:1269–1283
Custódio AL, Rocha H, Vicente LN (2010) Incorporating minimum Frobenius norm models in direct search. Comput Optim Appl 46:265–278
Dodangeh M, Vicente LN, Zhang Z (2016) On the optimal order of worst case complexity of direct search. Optim Lett 10:699–708
Dolan ED, Moré JJ (2002) Benchmarking optimization software with performance profiles. Math Program 91:201–213
Fasano G, Morales JL, Nocedal J (2009) On the geometry phase in model-based algorithms for derivative-free optimization. Optim Methods Softw 24:145–154
Garmanjani R, Júdice D, Vicente LN (2016) Trust-region methods without using derivatives: worst case complexity and the nonsmooth case. SIAM J Optim 26:1987–2011
Gould NIM, Orban D, Toint PhL (2015) CUTEst: a constrained and unconstrained testing environment with safe threads for mathematical optimization. Comput Optim Appl 60:545–557
Grapiglia GN, Yuan J, Yuan Y-X (2015) On the convergence and worst-case complexity of trust-region and regularization methods for unconstrained optimization. Math Program 152:491–520
Hardy GH, Littlewood JE, Pólya G (1934) Inequalities. Cambridge University Press, New York
Karas EW, Santos SA, Svaiter BF (2015) Algebraic rules for quadratic regularization of Newton's method. Comput Optim Appl 60:343–376
Lu S, Wei Z, Li L (2012) A trust region algorithm with adaptive cubic regularization methods for nonsmooth convex minimization. Comput Optim Appl 51:551–573
Martínez JM (2017) On high-order model regularization for constrained optimization. SIAM J Optim 27:2447–2458
Martínez JM, Raydan M (2015) Separable cubic modeling and a trust-region strategy for unconstrained minimization with impact in global optimization. J Global Optim 53:319–342
Martínez JM, Raydan M (2017) Cubic-regularization counterpart of a variable-norm trust-region method for unconstrained minimization. J Global Optim 68:367–385
Moré JJ, Wild SM (2009) Benchmarking derivative-free optimization algorithms. SIAM J Optim 20:172–
Nesterov Y (2004) Introductory lectures on convex optimization. Kluwer Academic Publishers, Dordrecht
Nesterov Y, Polyak BT (2006) Cubic regularization of Newton method and its global performance. Math Program 108:177–205
Scheinberg K, Toint PhL (2010) Self-correcting geometry in model-based algorithms for derivative-free unconstrained optimization. SIAM J Optim 20:3512–3532
Vicente LN (2013) Worst case complexity of direct search. EURO J Comput Optim 1:143–153
Xu P, Roosta F, Mahoney MW (2020) Newton-type methods for non-convex optimization under inexact Hessian information. Math Program 184:35–70

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Derivative-free separable quadratic modeling and cubic regularization for unconstrained optimization

4OR , Volume OnlineFirst – Apr 8, 2023

Loading next page...
 
/lp/springer-journals/derivative-free-separable-quadratic-modeling-and-cubic-regularization-osFVqtblVJ

References (47)

Publisher
Springer Journals
Copyright
Copyright © The Author(s) 2023
ISSN
1619-4500
eISSN
1614-2411
DOI
10.1007/s10288-023-00541-9
Publisher site
See Article on Publisher Site

Abstract

We present a derivative-free separable quadratic modeling and cubic regularization technique for solving smooth unconstrained minimization problems. The derivative- free approach is mainly concerned with building a quadratic model that could be generated by numerical interpolation or using a minimum Frobenius norm approach, when the number of points available does not allow to build a complete quadratic model. This model plays a key role to generate an approximated gradient vector and Hessian matrix of the objective function at every iteration. We add a specialized cubic regularization strategy to minimize the quadratic model at each iteration, that makes use of separability. We discuss convergence results, including worst case complexity, of the proposed schemes to first-order stationary points. Some preliminary numerical results are presented to illustrate the robustness of the specialized separable cubic algorithm. Keywords Derivative-free optimization · Fully-linear models · Fully-quadratic models · Cubic regularization · Worst-case complexity Mathematics Subject Classification 90C30 · 65K05 · 90C56 · 65D05 B M. Raydan m.raydan@fct.unl.pt A. L. Custódio alcustodio@fct.unl.pt R. Garmanjani r.garmanjani@fct.unl.pt Center for Mathematics and Applications (NOVA Math), FCT NOVA, 2829-516 Caparica, Portugal Department of Mathematics, FCT NOVA, 2829-516 Caparica, Portugal 123 A. L. Custódio et al. 1 Introduction We consider unconstrained minimization problems of the form min f (x ), (1) x ∈R n n where the objective function f : R → R is continuously differentiable in R . However, we assume that the derivatives of f are not available and that cannot be easily approximated by finite difference methods. This situation frequently arises when f must be evaluated through black-box simulation packages, and each function evaluation may be costly and/or contaminated with noise (Conn et al. 2009b). Recently (Brás et al. 2020; Martínez and Raydan 2015, 2017), in a derivative-based context, several separable models combined with either a variable-norm trust-region strategy or with a cubic regularization scheme were proposed for solving (1), and their standard asymptotic convergence results were established. The main idea of these separable model approaches is to minimize a quadratic (or a cubic) model at each iteration, in which the quadratic part is the second-order Taylor approximation of the objective function. With a suitable change of variables, based on the Schur factorization, the solution of these subproblems is trivialized and an adequate choice of the norm at each iteration permits the employment of a trust-region reduction procedure that ensures the fulfillment of global convergence to second-order stationary points (Brás et al. 2020; Martínez and Raydan 2015). In that case, the separable model method with a trust-region strategy has the same asymptotic convergence properties as the trust-region Newton method. Later in Martínez and Raydan (2017), starting with the same modeling introduced in Martínez and Raydan (2015), the trust-region scheme was replaced with a separable cubic regularization strategy. Adding convenient regularization terms, the standard asymptotic convergence results were retained, and moreover the complexity of the cubic strategy for finding approximate first-order −3/2 stationary points became O(ε ). For the separable cubic regularization approach used in Martínez and Raydan (2017), complexity results with respect to second-order stationarity were also established. 
We note that regularization procedures serve the same purpose as, and are strongly related to, trust-region schemes, with the advantage of possessing improved worst-case complexity (WCC) bounds; see, e.g., (Bellavia et al. 2021; Birgin et al. 2017; Cartis et al. 2011a, b; Cartis and Scheinberg 2018; Grapiglia et al. 2015; Karas et al. 2015; Lu et al. 2012; Martínez 2017; Nesterov and Polyak 2006; Xu et al. 2020). However, as previously mentioned, the separable cubic approaches developed in Brás et al. (2020), Martínez and Raydan (2015), and Martínez and Raydan (2017) are based on the availability of the exact gradient vector and the exact Hessian matrix at every iteration. When exact derivatives are not available, quadratic models based only on objective function values, computed at sample points, can be obtained while retaining good quality of approximation of the gradient and the Hessian of the objective function. These derivative-free models can be constructed by means of polynomial interpolation or regression, or by any other approximation technique. Depending on their accuracy, these models are called fully-linear or fully-quadratic; see (Conn et al. 2008a, b, 2009b) for details. Fully-linear and fully-quadratic models are the basis for derivative-free optimization trust-region methods (Conn et al. 2009a, b; Scheinberg and Toint 2010) and have also been successfully used in the definition of a search step for unconstrained directional direct search algorithms (Custódio et al. 2010). In the latter, minimum Frobenius norm approaches are adopted when the number of points available does not allow the computation of a determined interpolation model.

This state of affairs motivated us to develop a derivative-free separable version of the regularized method introduced in Martínez and Raydan (2017). This means that we will start with a derivative-free quadratic model, which can be obtained by different schemes, to obtain an approximated gradient vector and Hessian matrix per iteration, and then we will add the separable cubic regularization terms, associated with an adaptive regularization parameter, to guarantee convergence to stationary points.

The paper is organized as follows. In Sect. 2 we present the main ideas behind the derivative-based separable modeling approaches. Section 3 revises several derivative-free schemes for building quadratic models. In Sect. 4 we describe our proposed derivative-free separable cubic regularization strategy and discuss the associated convergence properties. Section 5 reports numerical results to give further insight into the proposed approach. Finally, in Sect. 6 we present some concluding remarks.

Throughout, unless otherwise specified, we will use the Euclidean norm $\|x\| = (x^\top x)^{1/2}$ on $\mathbb{R}^n$, where the inner product satisfies $x^\top x = \sum_{i=1}^{n} x_i^2$. For a given $\Delta > 0$ we will denote the closed ball $B(x; \Delta) = \{ y \in \mathbb{R}^n : \|y - x\| \le \Delta \}$.

2 Separable cubic modeling

In the standard derivative-based quadratic modeling approach for solving (1), a quadratic model of $f(x)$ around $x_k$ is constructed by defining the model of the objective function as

$$m_k(s) = f_k + g_k^\top s + \frac{1}{2} s^\top H_k s, \qquad (2)$$

where $f_k = f(x_k)$, $g_k = \nabla f(x_k)$ is the gradient vector at $x_k$, and $H_k$ is either the Hessian of $f$ at $x_k$, $\nabla^2 f(x_k)$, or a symmetric approximation of it. The step $s_k$ is the minimizer of $m_k(s)$.
In Martínez and Raydan (2015), instead of using the standard quadratic model associated with Newton’s method, the equivalent separable quadratic model

$$m_k(y) = f_k + (Q_k^\top g_k)^\top y + \frac{1}{2} y^\top D_k y \qquad (3)$$

was considered to approximate the objective function $f$ around the iterate $x_k$. In (3), the change of variables $y = Q_k^\top s$ is used, where the spectral (or Schur) factorization of $H_k$:

$$H_k = Q_k D_k Q_k^\top, \qquad (4)$$

is computed at every iteration. In (4), $Q_k$ is an orthogonal $n \times n$ matrix whose columns are the eigenvectors of $H_k$, and $D_k$ is a real diagonal $n \times n$ matrix whose diagonal entries are the eigenvalues of $H_k$. Let us note that since $H_k$ is symmetric, (4) is well-defined for all $k$. We also note that (3) may be non-convex, i.e., some of the diagonal entries of $D_k$ could be negative.

For the separable regularization counterpart in Martínez and Raydan (2017), the model (3) is kept and a cubic regularization term is added:

$$m_k^{SR}(y) = f_k + (Q_k^\top g_k)^\top y + \frac{1}{2} y^\top D_k y + \frac{\sigma_k}{6} \sum_{i=1}^{n} |y_i|^3, \qquad (5)$$

where $\sigma_k \ge 0$ is dynamically obtained. Note that a $1/6$ factor is included in the last term of (5) to simplify derivative expressions. Notice also that, since $D_k$ is a diagonal matrix, models (3) and (5) are indeed separable. As a consequence, at every iteration $k$ the subproblem

$$\min_{y \in \mathbb{R}^n} m_k^{SR}(y)$$

is solved to compute the vector $y_k$, and then the step is recovered as $s_k = Q_k y_k$.

The gradient of the model $m_k^{SR}(y)$, given by (5), can be written as follows:

$$\nabla m_k^{SR}(y) = Q_k^\top g_k + D_k y + \frac{\sigma_k}{2} \tilde{u}_k,$$

where the $i$-th entry of the $n$-dimensional vector $\tilde{u}_k$ is equal to $|y_i| y_i$. Similarly, the Hessian of (5) is given by

$$\nabla^2 m_k^{SR}(y) = D_k + \sigma_k \, \mathrm{diag}(|y_i|).$$

To solve $\nabla m_k^{SR}(y) = 0$, and find the critical points, we only need to independently minimize $n$ one-dimensional special functions. These special one-variable functions are of the following form:

$$h(z) = c_0 + c_1 z + c_2 z^2 + c_3 |z|^3.$$

The details on how to find the global minimizer of $h(z)$ are fully described in (Martínez and Raydan 2017, Sect. 3).
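To make the separable solve concrete, the following minimal sketch diagonalizes $H_k$ with numpy.linalg.eigh and globally minimizes each one-dimensional piece of (5). This is our own illustrative Python code, not the authors' implementation (which was written in Matlab), the function names are ours, and it assumes $\sigma_k > 0$.

```python
import numpy as np

def min_separable_cubic(g, H, sigma):
    """Globally minimize the separable model (5):
       (Q^T g)^T y + 0.5 y^T D y + (sigma/6) * sum(|y_i|^3),
    for sigma > 0, and return the step s = Q y (illustrative sketch only)."""
    dvals, Q = np.linalg.eigh(H)          # Schur factorization (4): H = Q D Q^T
    a = Q.T @ g                           # linear coefficients in the y variables
    y = np.array([_min_1d(a[i], dvals[i], sigma) for i in range(len(a))])
    return Q @ y, y

def _min_1d(a, d, sigma):
    """Global minimizer over R of phi(z) = a z + 0.5 d z^2 + (sigma/6)|z|^3."""
    phi = lambda z: a*z + 0.5*d*z**2 + (sigma/6.0)*abs(z)**3
    cands = [0.0]
    # Stationary points on z >= 0 solve a + d z + (sigma/2) z^2 = 0
    for r in np.roots([sigma/2.0, d, a]):
        if np.isreal(r) and r.real >= 0.0:
            cands.append(r.real)
    # Stationary points on z <= 0 solve a + d z - (sigma/2) z^2 = 0
    for r in np.roots([-sigma/2.0, d, a]):
        if np.isreal(r) and r.real <= 0.0:
            cands.append(r.real)
    return min(cands, key=phi)
```

Since for $\sigma > 0$ each one-dimensional function is coercive, comparing the (at most five) candidate points above is enough to obtain the global minimizer of each piece, and hence of (5).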
In the next section, we will describe several derivative-free alternatives to compute a model of type (2), to be incorporated in the separable regularized model (5).

3 Fully-linear and fully-quadratic derivative-free models

Interpolation or regression based models are commonly used in derivative-free optimization as surrogates of the objective function. In particular, quadratic interpolation models are used as replacement of Taylor models in derivative-free trust-region approaches (Conn et al. 2009a; Scheinberg and Toint 2010).

The terminology fully-linear and fully-quadratic, to describe a derivative-free model that retains Taylor-like bounds, was first proposed in Conn et al. (2009b). Definitions 3.1 and 3.2 provide a slightly modified version of it, suited for the present work. Throughout this section, $\Delta_{\max}$ is a given positive constant that represents an upper bound on the radii of the regions in which the models are built.

Assumption 3.1 Let $f$ be a continuously differentiable function with Lipschitz continuous gradient (with constant $L_g$).

Definition 3.1 (Conn et al. 2009b, Definition 6.1) Let a function $f: \mathbb{R}^n \to \mathbb{R}$, that satisfies Assumption 3.1, be given. A set of model functions $\mathcal{M} = \{ m: \mathbb{R}^n \to \mathbb{R},\ m \in C^1 \}$ is called a fully-linear class of models if:

1. There exist positive constants $\kappa_{ef}$ and $\kappa_{eg}$ such that for any $x \in \mathbb{R}^n$ and $\Delta \in (0, \Delta_{\max}]$ there exists a model function $m(s)$ in $\mathcal{M}$, with Lipschitz continuous gradient, and such that
• the error between the gradient of the model and the gradient of the function satisfies
$$\|\nabla f(x+s) - \nabla m(s)\| \le \kappa_{eg} \Delta, \quad \forall s \in B(0; \Delta), \qquad (6)$$
and
• the error between the model and the function satisfies
$$| f(x+s) - m(s) | \le \kappa_{ef} \Delta^2, \quad \forall s \in B(0; \Delta).$$
Such a model $m$ is called fully-linear on $B(x; \Delta)$.

2. For this class $\mathcal{M}$ there exists an algorithm, which we will call a ‘model-improvement’ algorithm, that in a finite, uniformly bounded (with respect to $x$ and $\Delta$) number of steps can
• either establish that a given model $m \in \mathcal{M}$ is fully-linear on $B(x; \Delta)$ (we will say that a certificate has been provided),
• or find a model $m \in \mathcal{M}$ that is fully-linear on $B(x; \Delta)$.

For fully-quadratic models, stronger assumptions on the smoothness of the objective function are required.

Assumption 3.2 Let $f$ be a twice continuously differentiable function with Lipschitz continuous Hessian (with constant $L_H$).

Definition 3.2 (Conn et al. 2009b, Definition 6.2) Let a function $f: \mathbb{R}^n \to \mathbb{R}$, that satisfies Assumption 3.2, be given. A set of model functions $\mathcal{M} = \{ m: \mathbb{R}^n \to \mathbb{R},\ m \in C^2 \}$ is called a fully-quadratic class of models if:

1. There exist positive constants $\kappa_{ef}$, $\kappa_{eg}$, and $\kappa_{eh}$ such that for any $x \in \mathbb{R}^n$ and $\Delta \in (0, \Delta_{\max}]$ there exists a model function $m(s)$ in $\mathcal{M}$, with Lipschitz continuous Hessian, and such that
• the error between the Hessian of the model and the Hessian of the function satisfies
$$\|\nabla^2 f(x+s) - \nabla^2 m(s)\| \le \kappa_{eh} \Delta, \quad \forall s \in B(0; \Delta), \qquad (7)$$
• the error between the gradient of the model and the gradient of the function satisfies
$$\|\nabla f(x+s) - \nabla m(s)\| \le \kappa_{eg} \Delta^2, \quad \forall s \in B(0; \Delta), \qquad (8)$$
and
• the error between the model and the function satisfies
$$| f(x+s) - m(s) | \le \kappa_{ef} \Delta^3, \quad \forall s \in B(0; \Delta).$$
Such a model $m$ is called fully-quadratic on $B(x; \Delta)$.

2. For this class $\mathcal{M}$ there exists an algorithm, which we will call a ‘model-improvement’ algorithm, that in a finite, uniformly bounded (with respect to $x$ and $\Delta$) number of steps can
• either establish that a given model $m \in \mathcal{M}$ is fully-quadratic on $B(x; \Delta)$ (we will say that a certificate has been provided),
• or find a model $m \in \mathcal{M}$ that is fully-quadratic on $B(x; \Delta)$.

Algorithms for model certification or for improving the quality of a given model can be found in Conn et al. (2009b). This quality is directly related to the geometry of the sample set used in its computation (Conn et al. 2008a, b). However, some practical approaches have reported good numerical results with implementations that do not enforce a strict geometry control (Bandeira et al. 2012; Fasano et al. 2009).
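As an illustration of how such models can be assembled from function values alone, the following sketch builds a quadratic model whose Hessian has minimum Frobenius norm among all interpolants, in the spirit of the KKT system of (Conn et al. 2009b, Sect. 5.3). The code and its names are ours, not the authors' implementation; np.linalg.lstsq is used for robustness when the sample set is poorly poised.

```python
import numpy as np
from itertools import combinations

def mfn_quadratic_model(Y, fvals):
    """Build m(s) = c + g^T s + 0.5 s^T H s interpolating f on the rows of Y
    (points given relative to the current iterate), choosing, among all
    interpolants, the one whose Hessian has minimum Frobenius norm."""
    m, n = Y.shape
    # Linear block: columns [1, s_1, ..., s_n]
    A_L = np.hstack([np.ones((m, 1)), Y])
    # Quadratic block, scaled so that ||w||_2^2 = ||H||_F^2:
    #   diagonal features s_i^2/2 carry w_ii = H_ii,
    #   off-diagonal features s_i s_j / sqrt(2) carry w_ij = sqrt(2) H_ij.
    cols = [0.5 * Y[:, i] ** 2 for i in range(n)]
    pairs = list(combinations(range(n), 2))
    cols += [Y[:, i] * Y[:, j] / np.sqrt(2.0) for (i, j) in pairs]
    A_Q = np.column_stack(cols)
    # KKT system for min ||w||^2 s.t. A_Q w + A_L u = f:
    #   [A_Q A_Q^T  A_L] [lam]   [f]
    #   [A_L^T       0 ] [u  ] = [0]
    K = np.block([[A_Q @ A_Q.T, A_L],
                  [A_L.T, np.zeros((n + 1, n + 1))]])
    rhs = np.concatenate([fvals, np.zeros(n + 1)])
    sol = np.linalg.lstsq(K, rhs, rcond=None)[0]
    lam, u = sol[:m], sol[m:]
    w = A_Q.T @ lam                       # minimum-norm quadratic coefficients
    c, g = u[0], u[1:]
    H = np.diag(w[:n])
    for k, (i, j) in enumerate(pairs):
        H[i, j] = H[j, i] = w[n + k] / np.sqrt(2.0)
    return c, g, H
```

When the number of points reaches $(n+1)(n+2)/2$ and the sample set is poised, the interpolation conditions determine the model and the minimum Frobenius norm solution coincides with the interpolating quadratic.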
4 Derivative-free separable cubic regularization approach

In a derivative-free optimization setting, instead of (2), we will consider the following quadratic model

$$\tilde{m}_k(s) = f_k + \tilde{g}_k^\top s + \frac{1}{2} s^\top \tilde{H}_k s,$$

where $\tilde{g}_k = \nabla \tilde{m}_k(x_k)$ and $\tilde{H}_k = \nabla^2 \tilde{m}_k(x_k)$ are good quality approximations of $g_k$ and $H_k$, respectively, built using interpolation or a minimum Frobenius norm approach (see Chapters 3 and 5 in Conn et al. (2009b)). Hence, analogous to the discussion in Sect. 2, by using the change of variables $y = \tilde{Q}_k^\top s$, where $\tilde{H}_k = \tilde{Q}_k \tilde{D}_k \tilde{Q}_k^\top$, with $\tilde{Q}_k$ an orthogonal $n \times n$ matrix whose columns are the eigenvectors of $\tilde{H}_k$, and $\tilde{D}_k$ a real diagonal $n \times n$ matrix whose diagonal entries are the eigenvalues of $\tilde{H}_k$, the equivalent separable quadratic model

$$\tilde{m}_k(y) = f_k + (\tilde{Q}_k^\top \tilde{g}_k)^\top y + \frac{1}{2} y^\top \tilde{D}_k y \qquad (9)$$

is used for the approximation of the objective function $f$ around the iterate $x_k$. We then regularize (9) by adding a cubic or a quadratic term, depending on having been able to compute a fully-quadratic or a fully-linear model, respectively:

$$\tilde{m}_k^{SR}(y) = f_k + (\tilde{Q}_k^\top \tilde{g}_k)^\top y + \frac{1}{2} y^\top \tilde{D}_k y + \frac{\sigma_k}{p!} \sum_{i=1}^{n} |y_i|^p,$$

where $p \in \{2, 3\}$ and $\sigma_k \ge 0$ is dynamically obtained. As a consequence, at every iteration $k$ the subproblem

$$\min_{y \in \mathbb{R}^n} \tilde{m}_k^{SR}(y) \quad \text{subject to} \quad \frac{\xi}{\sigma_k} \le \|y\|_\infty \le \Delta \qquad (10)$$

is solved to compute the vector $y_k$, and then the step is recovered as $s_k = \tilde{Q}_k y_k$.

The constraint $\|y\|_\infty \le \Delta$, where $\Delta > 0$ is a fixed given parameter for all $k$, is necessary to ensure the existence of a solution of problem (10) in some cases. Indeed, since some diagonal entries of $\tilde{D}_k$ might be negative, for $p = 2$ the existence of an unconstrained minimizer of the objective function in (10) is not guaranteed. In the case of $p = 3$ and any $\sigma_k > 0$, the existence of an unconstrained minimizer of the same function is guaranteed. Nevertheless, if some diagonal entries of $\tilde{D}_k$ are negative, and $\sigma_k$ is still close to zero, imposing the constraint $\|y\|_\infty \le \Delta$ prevents the obtained vector $y_k$ from being too large, and therefore avoids unnecessary numerical difficulties when solving (10).

The additional constraint $\|y\|_\infty \ge \xi/\sigma_k$ relates the stepsize to the regularization parameter and is required to establish WCC results. A similar strategy has been used in Cartis and Scheinberg (2018) when building models using a probabilistic approach. As we will see in Sect. 4, this additional lower bound does not prohibit the iterative process from driving the first-order stationarity measure below any given small positive threshold. In this case, by solving $n$ one-dimensional independent minimization problems in the closed intervals $[-\Delta, -\xi/\sigma_k]$ and $[\xi/\sigma_k, \Delta]$, we are being more demanding than the original constraint. These one-variable functions are of the form

$$h(z) = c_0 + c_1 z + c_2 z^2 + c_3 |z|^3.$$

The details on how to find the global minimizer of $h(z)$ on the closed and bounded intervals $[-\Delta, -\xi/\sigma_k]$ and $[\xi/\sigma_k, \Delta]$, for $\Delta > 0$ and $\xi/\sigma_k > 0$, are similar to the ones described in (Martínez and Raydan 2017, Sect. 3). A practical approach for the resolution of (10) will be suggested and tested in Sect. 5.
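As an illustration of the first (coordinate-wise) strategy that will be tested in Sect. 5, the sketch below adapts the earlier one-dimensional minimizer to the interval constraints: the candidates are simply the interval endpoints plus the interior stationary points. The code and the helper name min_h_on_box are ours, not the authors' implementation.

```python
import numpy as np

def min_h_on_box(a, d, sigma, lo, hi, p=3):
    """Global minimizer of h(z) = a z + 0.5 d z^2 + (sigma/p!)|z|^p
    over [-hi, -lo] U [lo, hi], with 0 < lo < hi
    (lo = xi/sigma_k and hi = Delta in subproblem (10))."""
    fact = 2.0 if p == 2 else 6.0
    h = lambda z: a*z + 0.5*d*z**2 + (sigma/fact)*abs(z)**p
    cands = [lo, hi, -lo, -hi]
    if p == 2:
        # |z|^2 = z^2, so h'(z) = a + (d + sigma) z on both branches
        if d + sigma != 0.0:
            cands.append(-a / (d + sigma))
    else:
        # h'(z) = a + d z + (sigma/2) z^2 on z > 0, minus sign on z < 0;
        # extra roots are harmless since h is evaluated exactly below
        for sgn in (1.0, -1.0):
            for r in np.roots([sgn * sigma / 2.0, d, a]):
                if np.isreal(r):
                    cands.append(r.real)
    feasible = [z for z in cands if lo <= abs(z) <= hi]
    return min(feasible, key=h)
```

Applying min_h_on_box coordinate-wise, with $a = (\tilde{Q}_k^\top \tilde{g}_k)_i$ and $d = (\tilde{D}_k)_{ii}$, and recovering $s = \tilde{Q}_k y$, yields a feasible point of (10) that is, as discussed above, more demanding than the original lower bound.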
The following algorithm is an adaptation of Algorithm 2.1 in Martínez and Raydan (2017) to the derivative-free case.

Algorithm 1
Let $\alpha > 0$, $\sigma_{small} > 0$, $\eta > 1$, and $\xi > 0$ be algorithmic parameters. Assume that $x_0 \in \mathbb{R}^n$ is a given initial approximation to the solution of problem (1). Initialize $k \leftarrow 0$.

Step 1: Choose $\sigma_k = \sigma_{small}$ and $\Delta > \xi/\sigma_k$.

Step 2: Build a quadratic polynomial model $\tilde{m}_k(s) = f_k + \tilde{g}_k^\top s + \frac{1}{2} s^\top \tilde{H}_k s$, by selecting points in $B(x_k; \xi/\sigma_k)$ (fully-linear, minimum Frobenius norm models or fully-quadratic polynomial models can be considered, depending on the number of points available for reuse or on the effort allowed in terms of number of function evaluations). Set $p = 2$ (respectively $p = 3$) if the computed model is fully-linear (respectively fully-quadratic).

Step 3: Compute a solution $s_{trial}$ of

$$\text{Minimize } \ \tilde{g}_k^\top s + \frac{1}{2} s^\top \tilde{H}_k s + \frac{\sigma_k}{p!} \sum_{i=1}^{n} |[\tilde{Q}_k^\top s]_i|^p \quad \text{subject to} \quad \frac{\xi}{\sigma_k} \le \|\tilde{Q}_k^\top s\|_\infty \le \Delta, \qquad (11)$$

where $\tilde{H}_k = \tilde{Q}_k \tilde{D}_k \tilde{Q}_k^\top$ is a Schur factorization of $\tilde{H}_k$.

Step 4: Test the sufficient decrease condition

$$f(x_k + s_{trial}) \le f(x_k) - \alpha \sum_{i=1}^{n} |[\tilde{Q}_k^\top s_{trial}]_i|^p. \qquad (12)$$

If (12) is fulfilled, define $s_k = s_{trial}$, $x_{k+1} = x_k + s_k$, update $k \leftarrow k+1$ and go to Step 1. Otherwise set $\sigma_{new} = \eta \sigma_k$, update $\sigma_k \leftarrow \sigma_{new}$, and go to Step 2.

Remark 4.1 The upper bound constraint in (11) does not affect the separability nature of Step 3, since it can be equivalently replaced by $|[\tilde{Q}_k^\top s]_i| \le \Delta$ for all $i$. However, the lower bound in (11) affects the separability of Step 3. Two strategies have been developed to impose the lower bound constraint in (11) while maintaining the separability approach. These strategies will be described in Sect. 5.
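A minimal driver that puts Steps 1 to 4 together is sketched below. The helpers build_model and solve_subproblem (which could be assembled from the earlier sketches) are assumptions of this sketch, as is the simple handling of the evaluation budget; the paper's Matlab implementation additionally reuses previously evaluated points and tests a stationarity-based stopping criterion (see Sect. 5).

```python
import numpy as np

def algorithm1(f, x0, build_model, solve_subproblem,
               alpha=1e-4, sigma_small=0.1, eta=8.0, xi=1e-5, Delta=10.0,
               max_evals=1500):
    """Outer loop of Algorithm 1 (illustrative sketch, not the authors' code).
    build_model(x, radius) -> (g, H, p) with p = 2 (fully-linear model)
    or p = 3 (fully-quadratic model); solve_subproblem solves (11) and
    returns the step s and y = Q^T s."""
    x = np.asarray(x0, float)
    fx = f(x); evals = 1
    while evals < max_evals:
        sigma = sigma_small                            # Step 1
        while True:
            g, H, p = build_model(x, xi / sigma)       # Step 2
            s, y = solve_subproblem(g, H, sigma, p, xi, Delta)  # Step 3
            f_trial = f(x + s); evals += 1
            if f_trial <= fx - alpha * np.sum(np.abs(y) ** p):  # Step 4
                x, fx = x + s, f_trial                 # accept, reset sigma
                break
            sigma *= eta                               # reject, regularize more
            if evals >= max_evals:
                return x, fx
    return x, fx
```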
In the following subsections, the convergence and worst-case behavior of Algorithm 1 will be analyzed independently for the fully-linear and fully-quadratic cases.

4.1 Fully-linear approach

This subsection is devoted to the analysis of the WCC of Algorithm 1 when fully-linear models are used. For that, we need the following technical lemma.

Lemma 4.1 (Nesterov 2004, Lemma 1.2.3) Let Assumption 3.1 hold. Then, we have

$$| f(x+s) - f(x) - \nabla f(x)^\top s | \le \frac{L_g}{2} \|s\|^2. \qquad (13)$$

As is common in nonlinear optimization, we assume that the norm of the Hessian of each model is bounded.

Assumption 4.1 Assume that the norm of the Hessian of the model is bounded, i.e.,

$$\|\tilde{H}_k\| \le \kappa_{\tilde{H}}, \quad \forall k \ge 0, \qquad (14)$$

for some $\kappa_{\tilde{H}} > 0$.

We also assume that the trial point provides decrease for the current model, i.e., that for $p = 2$ the value of the objective function of (11) at $s_{trial}$ is less than or equal to its value at $s = 0$.

Assumption 4.2 Assume that

$$\tilde{g}_k^\top s_{trial} + \frac{1}{2} s_{trial}^\top \tilde{H}_k s_{trial} + \frac{\sigma_k}{2} \sum_{i=1}^{n} [\tilde{Q}_k^\top s_{trial}]_i^2 \le 0. \qquad (15)$$

Clearly, (15) holds if $s_{trial}$ is a global solution of (11) for $p = 2$. Hence, taking advantage of our separability approach, the vector $s_{trial}$ obtained at Step 3 of Algorithm 1 satisfies (15).

In the following lemma, we derive an upper bound on the number of function evaluations required to satisfy the sufficient decrease condition (12), which in turn guarantees that every iteration of Algorithm 1 is well-defined. Moreover, we also obtain an upper bound for the regularization parameter.

Lemma 4.2 Let Assumptions 3.1, 4.1, and 4.2 hold and assume that at Step 2 of Algorithm 1 a fully-linear model is always used. In order to satisfy condition (12), with $p = 2$, Algorithm 1 needs at most

$$\left\lceil \frac{\log\left( 2\left( \alpha + \frac{L_g}{2} + \kappa_{eg} + \frac{\kappa_{\tilde{H}}}{2} \right) / \sigma_{small} \right)}{\log \eta} \right\rceil + 1 \qquad (16)$$

function evaluations, not accounting for the ones required for model computation. In addition, the maximum value of $\sigma_k$ for which (12) is satisfied is given by

$$\sigma_{max} = \max\left\{ \sigma_{small},\ 2\eta \left( \alpha + \frac{L_g}{2} + \kappa_{eg} + \frac{\kappa_{\tilde{H}}}{2} \right) \right\}. \qquad (17)$$

Proof First, we will show that if

$$\sigma_k \ge 2\left( \alpha + \frac{L_g}{2} + \kappa_{eg} + \frac{\kappa_{\tilde{H}}}{2} \right) \qquad (18)$$

then the sufficient decrease condition (12) of Algorithm 1 is satisfied for $p = 2$. In view of (15), we have

$$f(x_k + s_{trial}) - f(x_k) \le f(x_k + s_{trial}) - f(x_k) - \tilde{g}_k^\top s_{trial} - \frac{1}{2} s_{trial}^\top \tilde{H}_k s_{trial} - \frac{\sigma_k}{2} \sum_{i=1}^{n} [\tilde{Q}_k^\top s_{trial}]_i^2$$
$$\le | f(x_k + s_{trial}) - f(x_k) - \nabla f(x_k)^\top s_{trial} | + | (\nabla f(x_k) - \tilde{g}_k)^\top s_{trial} | + \frac{1}{2} s_{trial}^\top \tilde{H}_k s_{trial} - \frac{\sigma_k}{2} \sum_{i=1}^{n} [\tilde{Q}_k^\top s_{trial}]_i^2.$$

Thus, by using (6), (13), (14), and $\|s_{trial}\| \ge \xi/\sigma_k$ (due to Step 3 of Algorithm 1), we obtain

$$f(x_k + s_{trial}) - f(x_k) \le \frac{L_g}{2} \|s_{trial}\|^2 + \kappa_{eg} \frac{\xi}{\sigma_k} \|s_{trial}\| + \frac{\kappa_{\tilde{H}}}{2} \|s_{trial}\|^2 - \frac{\sigma_k}{2} \sum_{i=1}^{n} [\tilde{Q}_k^\top s_{trial}]_i^2$$
$$\le \left( \frac{L_g}{2} + \kappa_{eg} + \frac{\kappa_{\tilde{H}}}{2} \right) \|s_{trial}\|^2 - \frac{\sigma_k}{2} \sum_{i=1}^{n} [\tilde{Q}_k^\top s_{trial}]_i^2$$
$$= \left( \frac{L_g}{2} + \kappa_{eg} + \frac{\kappa_{\tilde{H}}}{2} - \frac{\sigma_k}{2} \right) \sum_{i=1}^{n} [\tilde{Q}_k^\top s_{trial}]_i^2$$
$$\le -\alpha \sum_{i=1}^{n} [\tilde{Q}_k^\top s_{trial}]_i^2,$$

where the equality in the third line follows from the fact that $\tilde{Q}_k$ is an orthogonal $n \times n$ matrix, and so $\|s_{trial}\|^2 = \|\tilde{Q}_k^\top s_{trial}\|^2 = \sum_{i=1}^{n} [\tilde{Q}_k^\top s_{trial}]_i^2$, and the last inequality holds due to (18). Now, from the way $\sigma_k$ is updated at Step 4 of Algorithm 1, it can easily be seen that the fulfillment of (12) with $p = 2$ requires at most

$$\left\lceil \frac{\log\left( 2\left( \alpha + \frac{L_g}{2} + \kappa_{eg} + \frac{\kappa_{\tilde{H}}}{2} \right) / \sigma_{small} \right)}{\log \eta} \right\rceil + 1$$

function evaluations, and, additionally, the upper bound on $\sigma_k$ at (17) is derived from (18). $\square$

The following assumption, which holds for global solutions of subproblem (11) (with $p = 2$), is central in establishing our WCC results. For similar assumptions required to obtain worst-case complexity bounds, see (Birgin et al. 2017; Martínez 2017).

Assumption 4.3 Assume that, for all $k \ge 0$,

$$\|\tilde{Q}_k^\top s_{trial}\|_\infty = \frac{\xi}{\sigma_k}, \quad \text{or} \quad \|\tilde{Q}_k^\top s_{trial}\|_\infty = \Delta,$$
$$\text{or} \quad \left\| \nabla_s \left( \tilde{g}_k^\top s + \frac{1}{2} s^\top \tilde{H}_k s + \frac{\sigma_k}{2} \sum_{i=1}^{n} [\tilde{Q}_k^\top s]_i^2 \right) \Bigg|_{s = s_{trial}} \right\| \le \beta_1 \|s_{trial}\|, \qquad (19)$$

for some $\beta_1 > 0$.

Under this assumption, we are able to prove that, when the trial point is not on the boundary of the feasible region of (11) (with $p = 2$), the norm of the gradient of the objective function at the new point is of the same order as the norm of the trial point.

Lemma 4.3 Let Assumptions 3.1, 4.1, 4.2, and 4.3 hold. Then, we have

$$\|\tilde{Q}_k^\top s_{trial}\|_\infty = \frac{\xi}{\sigma_k} \quad \text{or} \quad \|\tilde{Q}_k^\top s_{trial}\|_\infty = \Delta, \qquad (20)$$

or

$$\|\nabla f(x_k + s_{trial})\| \le \kappa_1 \|s_{trial}\|,$$

where $\kappa_1 = L_g + \kappa_{eg} + \kappa_{\tilde{H}} + \sigma_{max} + \beta_1$, and $\sigma_{max}$ was defined in Lemma 4.2.

Proof Assume that none of the equalities at (20) hold. We have $\nabla_s \tilde{m}_k^{SR}(s_{trial}) = \tilde{g}_k + \tilde{H}_k s_{trial} + r(s_{trial})$, where

$$r(s_{trial}) = \sigma_k \tilde{Q}_k \left( [\tilde{Q}_k^\top s_{trial}]_1, \ldots, [\tilde{Q}_k^\top s_{trial}]_n \right)^\top.$$

Now, by using Assumption 3.1, (14), and (6), we have

$$\|\nabla f(x_k + s_{trial}) - \nabla_s \tilde{m}_k^{SR}(s_{trial})\| = \|\nabla f(x_k + s_{trial}) - (\tilde{g}_k + \tilde{H}_k s_{trial} + r(s_{trial}))\|$$
$$\le \|\nabla f(x_k + s_{trial}) - \nabla f(x_k)\| + \|\nabla f(x_k) - \tilde{g}_k\| + \|\tilde{H}_k s_{trial}\| + \|r(s_{trial})\|$$
$$\le \left( L_g + \kappa_{eg} + \kappa_{\tilde{H}} + \sigma_{max} \right) \|s_{trial}\|.$$

Therefore, in view of (19), we have

$$\|\nabla f(x_k + s_{trial})\| \le \|\nabla f(x_k + s_{trial}) - \nabla_s \tilde{m}_k^{SR}(s_{trial})\| + \|\nabla_s \tilde{m}_k^{SR}(s_{trial})\|$$
$$\le \left( L_g + \kappa_{eg} + \kappa_{\tilde{H}} + \sigma_{max} + \beta_1 \right) \|s_{trial}\|,$$

which completes the proof. $\square$

Now, we have all the ingredients to derive an upper bound on the number of iterations required by Algorithm 1 to find a point at which the norm of the gradient is below some given positive threshold.

Theorem 4.1 Given $\varepsilon > 0$, let Assumptions 3.1, 4.1, 4.2, and 4.3 hold. Let $\{x_k\}$ be the sequence of iterates generated by Algorithm 1, and $f_{min} \le f(x_0)$. Then the number of iterations such that $\|\nabla f(x_{k+1})\| > \varepsilon$ and $f(x_{k+1}) > f_{min}$ is bounded above by

$$\frac{f(x_0) - f_{min}}{\alpha \min\left\{ \frac{\xi}{\sigma_{max}}, \frac{\varepsilon}{\kappa_1} \right\}^2}, \qquad (21)$$

where $\sigma_{max}$ and $\kappa_1$ were defined in Lemmas 4.2 and 4.3, respectively.

Proof In view of Lemma 4.3, we have

$$\|s_k\| \ge \min\left\{ \frac{\xi}{\sigma_k}, \Delta, \frac{\|\nabla f(x_{k+1})\|}{\kappa_1} \right\}.$$

Hence, since $\|\nabla f(x_{k+1})\| > \varepsilon$ and $\Delta > \xi/\sigma_k$, we obtain

$$\|s_k\| \ge \min\left\{ \frac{\xi}{\sigma_k}, \frac{\varepsilon}{\kappa_1} \right\} \ge \min\left\{ \frac{\xi}{\sigma_{max}}, \frac{\varepsilon}{\kappa_1} \right\}.$$

On the other hand, due to the sufficient decrease condition (12), we obtain

$$f(x_{k+1}) \le f(x_k) - \alpha \sum_{i=1}^{n} [\tilde{Q}_k^\top s_k]_i^2 = f(x_k) - \alpha \|\tilde{Q}_k^\top s_k\|^2 \le f(x_k) - \alpha \min\left\{ \frac{\xi}{\sigma_{max}}, \frac{\varepsilon}{\kappa_1} \right\}^2.$$
By summing up these inequalities, for $0, 1, \ldots, k$, we obtain

$$k + 1 \le \frac{f(x_0) - f_{min}}{\alpha \min\left\{ \frac{\xi}{\sigma_{max}}, \frac{\varepsilon}{\kappa_1} \right\}^2},$$

which concludes the proof. $\square$

Since $\kappa_{eg} = O(\sqrt{n})$ (see Chapter 2 in Conn et al. 2009b), we have $\kappa_1 = O(\sqrt{n})$. Now, if $\xi$ is chosen such that $\xi/\sigma_{max} = O(\varepsilon/\kappa_1)$, then the dependency on $n$ of the upper bound given at (21) is $O(n)$. Furthermore, for building a fully-linear model we need $O(n)$ function evaluations. Combining these facts with Theorem 4.1, we can derive an upper bound on the number of function evaluations that Algorithm 1 needs for driving the first-order stationarity measure below some given positive threshold.

Corollary 4.1 Given $\varepsilon > 0$, let Assumptions 3.1, 4.1, 4.2, and 4.3 hold. Let $\{x_k\}$ be the sequence of iterates generated by Algorithm 1 and assume that $\|\nabla f(x_{k+1})\| > \varepsilon$ and $f(x_{k+1}) > f_{min}$. Then, Algorithm 1 needs at most $O\left( n^2 \varepsilon^{-2} \right)$ function evaluations for driving the norm of the gradient below $\varepsilon$.

The complexity bound derived here matches the one derived in Garmanjani et al. (2016) for derivative-free trust-region optimization methods and for direct search methods in Vicente (2013); see also Dodangeh et al. (2016).

4.2 Fully-quadratic approach

In this subsection, we will analyze the WCC of Algorithm 1 when we build fully-quadratic models. The following lemma is essential for establishing such bounds.

Lemma 4.4 (Nesterov 2004, Lemma 1.2.4) Let Assumption 3.2 hold. Then, we have

$$\left| f(x+s) - f(x) - \nabla f(x)^\top s - \frac{1}{2} s^\top \nabla^2 f(x) s \right| \le \frac{L_H}{6} \|s\|^3, \qquad (22)$$

and

$$\|\nabla f(x+s) - \nabla f(x) - \nabla^2 f(x) s\| \le \frac{L_H}{2} \|s\|^2. \qquad (23)$$

Similarly to the fully-linear case, we assume that the trial point provides decrease for the current model, i.e., that for $p = 3$ the value of the objective function of (11) at $s_{trial}$ is less than or equal to its value at $s = 0$.

Assumption 4.4 Assume that

$$\tilde{g}_k^\top s_{trial} + \frac{1}{2} s_{trial}^\top \tilde{H}_k s_{trial} + \frac{\sigma_k}{6} \sum_{i=1}^{n} |[\tilde{Q}_k^\top s_{trial}]_i|^3 \le 0. \qquad (24)$$

We note that (24) is clearly satisfied if $s_{trial}$ is a global solution of (11) when $p = 3$. Therefore, taking advantage of our separability approach, the vector $s_{trial}$ obtained at Step 3 of Algorithm 1 satisfies (24).

With this assumption, we are able to obtain upper bounds on the number of function evaluations required to satisfy the sufficient decrease condition (12), and also on the regularization parameter.

Lemma 4.5 Let Assumptions 3.2 and 4.4 hold and assume that at Step 2 of Algorithm 1 a fully-quadratic model is always used. In order to satisfy condition (12), with $p = 3$, Algorithm 1 needs at most

$$\left\lceil \frac{\log\left( 6\left( \alpha + \sqrt{n}\left( \frac{L_H}{6} + \kappa_{eg} + \frac{\kappa_{eh}}{2} \right) \right) / \sigma_{small} \right)}{\log \eta} \right\rceil + 1$$

function evaluations, not considering the ones required for model computation. In addition, the maximum value of $\sigma_k$ for which (12) is satisfied is given by

$$\sigma_{max} = \max\left\{ \sigma_{small},\ 6\eta \left( \alpha + \sqrt{n}\left( \frac{L_H}{6} + \kappa_{eg} + \frac{\kappa_{eh}}{2} \right) \right) \right\}. \qquad (25)$$

Proof First, we will show that if

$$\sigma_k \ge 6\left( \alpha + \sqrt{n}\left( \frac{L_H}{6} + \kappa_{eg} + \frac{\kappa_{eh}}{2} \right) \right) \qquad (26)$$

then the sufficient decrease condition (12) of Algorithm 1 is satisfied, with $p = 3$. In view of (22), we have

$$f(x_k + s_{trial}) - f(x_k) \le \nabla f(x_k)^\top s_{trial} + \frac{1}{2} s_{trial}^\top \nabla^2 f(x_k) s_{trial} + \frac{L_H}{6} \|s_{trial}\|^3$$
$$\le \tilde{g}_k^\top s_{trial} + \frac{1}{2} s_{trial}^\top \tilde{H}_k s_{trial} + \frac{L_H}{6} \|s_{trial}\|^3 + |(\nabla f(x_k) - \tilde{g}_k)^\top s_{trial}| + \frac{1}{2} |s_{trial}^\top (\nabla^2 f(x_k) - \tilde{H}_k) s_{trial}|.$$

Thus, by using (7), (8), and since $\|s_{trial}\| \ge \xi/\sigma_k$ (due to Step 3 of Algorithm 1), we obtain

$$f(x_k + s_{trial}) - f(x_k) \le \tilde{g}_k^\top s_{trial} + \frac{1}{2} s_{trial}^\top \tilde{H}_k s_{trial} + \frac{L_H}{6} \|s_{trial}\|^3 + \kappa_{eg} \left( \frac{\xi}{\sigma_k} \right)^2 \|s_{trial}\| + \frac{\kappa_{eh}}{2} \frac{\xi}{\sigma_k} \|s_{trial}\|^2$$
$$\le \tilde{g}_k^\top s_{trial} + \frac{1}{2} s_{trial}^\top \tilde{H}_k s_{trial} + \left( \frac{L_H}{6} + \kappa_{eg} + \frac{\kappa_{eh}}{2} \right) \|s_{trial}\|^3.$$
Now, by applying (24), we have

$$f(x_k + s_{trial}) - f(x_k) \le -\frac{\sigma_k}{6} \sum_{i=1}^{n} |[\tilde{Q}_k^\top s_{trial}]_i|^3 + \left( \frac{L_H}{6} + \kappa_{eg} + \frac{\kappa_{eh}}{2} \right) \|s_{trial}\|^3,$$

which, in view of the inequality $\|\cdot\|_3 \ge n^{-1/6} \|\cdot\|_2$ (see Theorem 16 on page 26 in Hardy et al. (1934)), leads to

$$f(x_k + s_{trial}) - f(x_k) \le -\frac{\sigma_k}{6} \sum_{i=1}^{n} |[\tilde{Q}_k^\top s_{trial}]_i|^3 + \sqrt{n}\left( \frac{L_H}{6} + \kappa_{eg} + \frac{\kappa_{eh}}{2} \right) \|\tilde{Q}_k^\top s_{trial}\|_3^3$$
$$= \left( \sqrt{n}\left( \frac{L_H}{6} + \kappa_{eg} + \frac{\kappa_{eh}}{2} \right) - \frac{\sigma_k}{6} \right) \sum_{i=1}^{n} |[\tilde{Q}_k^\top s_{trial}]_i|^3$$
$$\le -\alpha \sum_{i=1}^{n} |[\tilde{Q}_k^\top s_{trial}]_i|^3,$$

where the equality in the second line follows from the fact that, for any vector $w \in \mathbb{R}^n$, $\|w\|_3^3 = \sum_{i=1}^{n} |w_i|^3$, and so $\|\tilde{Q}_k^\top s_{trial}\|_3^3 = \sum_{i=1}^{n} |[\tilde{Q}_k^\top s_{trial}]_i|^3$, and the last inequality holds due to (26). Now, from the way $\sigma_k$ is updated at Step 4 of Algorithm 1, it can easily be seen that the fulfillment of (26) requires at most

$$\left\lceil \frac{\log\left( 6\left( \alpha + \sqrt{n}\left( \frac{L_H}{6} + \kappa_{eg} + \frac{\kappa_{eh}}{2} \right) \right) / \sigma_{small} \right)}{\log \eta} \right\rceil + 1$$

function evaluations, and, additionally, the upper bound on $\sigma_k$ at (25) is derived from (26). $\square$

The following assumption is quite similar to condition (14) given in Martínez and Raydan (2017), and it holds for global solutions of subproblem (11) (with $p = 3$). For similar assumptions, required to obtain worst-case complexity bounds associated with cubic regularization, see (Bellavia et al. 2021; Birgin et al. 2017; Cartis et al. 2011b; Cartis and Scheinberg 2018; Martínez 2017; Xu et al. 2020).

Assumption 4.5 Assume that, for all $k \ge 0$,

$$\|\tilde{Q}_k^\top s_{trial}\|_\infty = \frac{\xi}{\sigma_k}, \quad \text{or} \quad \|\tilde{Q}_k^\top s_{trial}\|_\infty = \Delta,$$
$$\text{or} \quad \left\| \nabla_s \left( \tilde{g}_k^\top s + \frac{1}{2} s^\top \tilde{H}_k s + \frac{\sigma_k}{6} \sum_{i=1}^{n} |[\tilde{Q}_k^\top s]_i|^3 \right) \Bigg|_{s = s_{trial}} \right\| \le \beta_2 \|s_{trial}\|^2, \qquad (27)$$

for some $\beta_2 > 0$.

Again, we are able to prove that, when the trial point is not on the boundary of the feasible region of (11) (with $p = 3$), the norm of the gradient of the function computed at the new point is of the order of the squared norm of the trial point.

Lemma 4.6 Let Assumptions 3.2, 4.4, and 4.5 hold. Then, we have

$$\|\tilde{Q}_k^\top s_{trial}\|_\infty = \frac{\xi}{\sigma_k} \quad \text{or} \quad \|\tilde{Q}_k^\top s_{trial}\|_\infty = \Delta, \qquad (28)$$

or

$$\|\nabla f(x_k + s_{trial})\| \le \kappa_2 \|s_{trial}\|^2,$$

where $\kappa_2 = \frac{L_H}{2} + \kappa_{eg} + \kappa_{eh} + \frac{\sigma_{max}}{2} + \beta_2$, and $\sigma_{max}$ was defined in Lemma 4.5.

Proof Assume that none of the equalities at (28) hold. We have $\nabla_s \tilde{m}_k^{SR}(s_{trial}) = \tilde{g}_k + \tilde{H}_k s_{trial} + r(s_{trial})$, where

$$r(s_{trial}) = \frac{\sigma_k}{2} \tilde{Q}_k \left( \mathrm{sign}([\tilde{Q}_k^\top s_{trial}]_1)\, [\tilde{Q}_k^\top s_{trial}]_1^2, \ldots, \mathrm{sign}([\tilde{Q}_k^\top s_{trial}]_n)\, [\tilde{Q}_k^\top s_{trial}]_n^2 \right)^\top.$$

Now, by using (23), (7), and (8), we have

$$\|\nabla f(x_k + s_{trial}) - \nabla_s \tilde{m}_k^{SR}(s_{trial})\| = \|\nabla f(x_k + s_{trial}) - (\tilde{g}_k + \tilde{H}_k s_{trial} + r(s_{trial}))\|$$
$$\le \|\nabla f(x_k + s_{trial}) - \nabla f(x_k) - \nabla^2 f(x_k) s_{trial}\| + \|\nabla f(x_k) - \tilde{g}_k\| + \|(\nabla^2 f(x_k) - \tilde{H}_k) s_{trial}\| + \|r(s_{trial})\|$$
$$\le \left( \frac{L_H}{2} + \kappa_{eg} + \kappa_{eh} + \frac{\sigma_{max}}{2} \right) \|s_{trial}\|^2.$$

Therefore, in view of (27), we have

$$\|\nabla f(x_k + s_{trial})\| \le \|\nabla f(x_k + s_{trial}) - \nabla_s \tilde{m}_k^{SR}(s_{trial})\| + \|\nabla_s \tilde{m}_k^{SR}(s_{trial})\|$$
$$\le \left( \frac{L_H}{2} + \kappa_{eg} + \kappa_{eh} + \frac{\sigma_{max}}{2} + \beta_2 \right) \|s_{trial}\|^2,$$

which completes the proof. $\square$

Now, we have all the supporting results to establish the WCC bound of Algorithm 1 for the fully-quadratic case.

Theorem 4.2 Given $\varepsilon > 0$, let Assumptions 3.2, 4.4, and 4.5 hold. Let $\{x_k\}$ be the sequence of iterates generated by Algorithm 1, and $f_{min} \le f(x_0)$. Then the number of iterations such that $\|\nabla f(x_{k+1})\| > \varepsilon$ and $f(x_{k+1}) > f_{min}$ is bounded above by

$$\frac{\sqrt{n}\,( f(x_0) - f_{min})}{\alpha \min\left\{ \frac{\xi}{\sigma_{max}}, \left( \frac{\varepsilon}{\kappa_2} \right)^{1/2} \right\}^3}, \qquad (29)$$

where $\sigma_{max}$ and $\kappa_2$ were defined in Lemmas 4.5 and 4.6, respectively.

Proof In view of Lemma 4.6, we have

$$\|s_k\| \ge \min\left\{ \frac{\xi}{\sigma_k}, \Delta, \left( \frac{\|\nabla f(x_{k+1})\|}{\kappa_2} \right)^{1/2} \right\}.$$

Hence, since $\|\nabla f(x_{k+1})\| > \varepsilon$ and $\Delta > \xi/\sigma_k$, we obtain

$$\|s_k\| \ge \min\left\{ \frac{\xi}{\sigma_k}, \left( \frac{\varepsilon}{\kappa_2} \right)^{1/2} \right\} \ge \min\left\{ \frac{\xi}{\sigma_{max}}, \left( \frac{\varepsilon}{\kappa_2} \right)^{1/2} \right\}.$$
On the other hand, due to the sufficient decrease condition (12) and the inequality $\|\cdot\|_3 \ge n^{-1/6} \|\cdot\|_2$, we obtain

$$f(x_{k+1}) \le f(x_k) - \alpha \sum_{i=1}^{n} |[\tilde{Q}_k^\top s_k]_i|^3 = f(x_k) - \alpha \|\tilde{Q}_k^\top s_k\|_3^3 \le f(x_k) - \frac{\alpha}{\sqrt{n}} \|s_k\|^3 \le f(x_k) - \frac{\alpha}{\sqrt{n}} \min\left\{ \frac{\xi}{\sigma_{max}}, \left( \frac{\varepsilon}{\kappa_2} \right)^{1/2} \right\}^3.$$

By summing up these inequalities, for $0, 1, \ldots, k$, we obtain

$$k + 1 \le \frac{\sqrt{n}\,( f(x_0) - f_{min})}{\alpha \min\left\{ \frac{\xi}{\sigma_{max}}, \left( \frac{\varepsilon}{\kappa_2} \right)^{1/2} \right\}^3},$$

which concludes the proof. $\square$

Similarly to what we saw before for the fully-linear case, since $\kappa_{eg} = O(n)$ and $\kappa_{eh} = O(n)$ (see Chapter 3 in Conn et al. (2009b)), we have $\kappa_2 = O(n^{3/2})$. By choosing $\xi$ in a way such that $\xi/\sigma_{max} = O((\varepsilon/\kappa_2)^{1/2})$, the dependency on $n$ of the upper bound given at (29) becomes of the order $O(n^{11/4})$. Additionally, for building a fully-quadratic model we need $O(n^2)$ function evaluations. Combining these facts with Theorem 4.2, we can establish a WCC bound for driving the first-order stationarity measure below some given positive threshold.

Corollary 4.2 Given $\varepsilon > 0$, let Assumptions 3.2, 4.4, and 4.5 hold. Let $\{x_k\}$ be the sequence of iterates generated by Algorithm 1 and assume that $\|\nabla f(x_{k+1})\| > \varepsilon$ and $f(x_{k+1}) > f_{min}$. Then, Algorithm 1 needs at most $O\left( n^{19/4} \varepsilon^{-3/2} \right)$ function evaluations for driving the norm of the gradient below $\varepsilon$.

In terms of $\varepsilon$, the derived complexity bound matches the one established in Cartis et al. (2012) for a derivative-free method with adaptive cubic regularization. The dependency of the bound derived here on $n$ is worse than the one derived in Cartis et al. (2012). However, we have explicitly taken into account the dependency of the constants $\kappa_{eg}$ and $\kappa_{eh}$ on $n$.

5 Illustrative numerical experiments

In this section we illustrate the different options to build the quadratic models at Step 2 of Algorithm 1 and two different strategies to address the subproblems (10).

Model computation is a key issue for the success of Algorithm 1. However, in derivative-free optimization, saving function evaluations by reusing previously evaluated points is a main concern. At each evaluation of a new point, the corresponding function value is stored in a list, of maximum size equal to $(n+1)(n+2)$, for possible future use in model computation. If new points need to be generated with the sole purpose of model computation, the center, ‘extreme’ points, and ‘mid-points’ of the set defined by $x_k + [I\ -I]$ are considered. Inspired by the works of Bandeira et al. (2012) and Fasano et al. (2009), no explicit control of geometry is kept (in fact, we also tried the approach suggested by Scheinberg and Toint (2010), but the results did not improve). If a new point is evaluated and the maximum number of points allowed in the list has been reached, then the point farthest away from the current iterate is replaced by the new one. Points are always selected in $B(x_k; 1/\sigma_k)$ for model computation. The option for a radius larger than $\xi/\sigma_k$ (in our numerical implementation, $\xi = 10^{-5}$) allows a better reuse of the function values previously computed, avoiding an excessive number of function evaluations just for model computation. Additionally, the definition of the radius as $1/\sigma_k$ ensures that if the regularization parameter increases, the size of the neighborhood in which the points are selected decreases, a mechanism that resembles the behavior of the trust-region radius in derivative-based optimization.
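A minimal sketch of the point list just described might look as follows; EvalCache and its method names are our hypothetical choices, since the authors' Matlab implementation is not shown in the paper.

```python
import numpy as np

class EvalCache:
    """Stores evaluated points for model reuse, as described above:
    capacity (n+1)(n+2); when full, the point farthest from the current
    iterate is replaced; model points are selected inside B(x_k; radius)."""
    def __init__(self, n):
        self.capacity = (n + 1) * (n + 2)
        self.points, self.values = [], []

    def add(self, x, fx, x_current):
        if len(self.points) >= self.capacity:
            dists = [np.linalg.norm(p - x_current) for p in self.points]
            i = int(np.argmax(dists))         # drop the farthest stored point
            self.points.pop(i); self.values.pop(i)
        self.points.append(np.asarray(x, float)); self.values.append(fx)

    def select(self, x_current, radius):
        idx = [i for i, p in enumerate(self.points)
               if np.linalg.norm(p - x_current) <= radius]
        Y = np.array([self.points[i] - x_current for i in idx])
        f = np.array([self.values[i] for i in idx])
        return Y, f                            # shifted points and their values
```

The select output is exactly the pair required by a model builder such as the minimum Frobenius norm sketch of Sect. 3.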
Fully-linear and fully-quadratic models can be considered at all iterations, as well as hybrid versions, where, depending on the number of points available for reuse inside $B(x_k; 1/\sigma_k)$, the option for a fully-linear or a fully-quadratic model is taken (thus, some iterations will use a fully-linear model and others a fully-quadratic model). Fully-quadratic models always require $(n+1)(n+2)/2$ points for computation. Fully-linear models are built using all the points available in $B(x_k; 1/\sigma_k)$, once this number is at least $n+2$ and does not exceed $(n+1)(n+2)/2 - 1$. In this case, a minimum Frobenius norm approach is taken to solve the linear system that provides the model coefficients (see Conn et al. 2009b, Section 5.3).

Regarding the solution of subproblem (10), the imposed lower bound causes difficulties for the separability approach. Two strategies were considered to address it. In the first one, every one-dimensional problem considers the corresponding lower and upper bounds. This approach is not equivalent to the original formulation. It imposes a stronger condition, since any vector $y$ computed with this approach will satisfy $\|y\|_\infty \ge \xi/\sigma_k$, but there could be a vector $y$ satisfying $\|y\|_\infty \ge \xi/\sigma_k$ which does not satisfy $|y_i| \ge \xi/\sigma_k$, $\forall i \in \{1, \ldots, n\}$. The second approach disregards the lower bound condition, only considering $\|y\|_\infty \le \Delta$ when solving subproblem (10). After computing $y$, the lower bound condition is tested and, if not satisfied, $\max_{i=1,\ldots,n} |y_i|$ is set equal to $\xi/\sigma_k$ to force the obtained vector $y$ to also satisfy the lower bound constraint at (10).

Algorithm 1 was implemented in Matlab 2021a. The experiments were executed on a laptop computer with an Intel Core i7 1.99 GHz CPU and 16 GB of RAM, running Windows 10 64-bits. As test sets, we considered the smooth collection of 44 problems proposed in Moré and Wild (2009) and 111 unconstrained problems with 40 or fewer variables from OPM, a subset of the CUTEst collection (Gould et al. 2015). Computational codes for the problems and the proposed initial points can be found at https://www.mcs.anl.gov/~more/df and https://github.com/gratton7/OPM, respectively.

Parameters in Algorithm 1 were set to the following values: $\Delta = 10$ for each iteration $k$, $\sigma_{small} = 0.1$, $\eta = 8$, and $\alpha = 10^{-4}$. At each iteration, the process is initialized with the minimization of the quadratic model (9), with no regularization term, computed by selecting points in $B(x_k; 1)$. In this case, no lower bound is considered when solving subproblem (10). If the sufficient decrease condition (12) is not satisfied by the computed solution, then the regularization process is initiated, considering $\sigma_k = \sigma_{small}$. This approach allows us to take advantage of the local properties of the “pure” quadratic models. As stopping criteria we consider $\|\tilde{g}_k\| < \varepsilon$, where $\varepsilon = 10^{-5}$, or a maximum of 1500 function evaluations.

Regarding model computation, four variants were tested, depending on using fully-linear or fully-quadratic models and also on the value of $p$ in the sufficient decrease condition used to accept new points at Step 4 of Algorithm 1. The Fully-quadratic variant always computes a fully-quadratic model, built using $(n+1)(n+2)/2$ points, with a cubic sufficient decrease condition ($p = 3$). The Fully-linear variant always computes a quadratic model, using $n+2$ points, under a minimum Frobenius norm approach. In this case, the sufficient decrease condition considers $p = 2$.
Hybrid versions compute fully-quadratic models, using $(n+1)(n+2)/2$ points, or fully-linear minimum Frobenius norm models, with at least $n+2$ points and a maximum of $(n+1)(n+2)/2 - 1$ points (depending on the number of points available in $B(x_k; 1/\sigma_k)$). In this case, the variant Hybrid_p3 always uses a cubic sufficient decrease condition to accept new points, whereas the variant Hybrid_p23 selects a quadratic or cubic sufficient decrease condition, depending on the type of model that could be computed at the current iteration ($p = 2$ for fully-linear and $p = 3$ for fully-quadratic).

Results are reported using data profiles (Moré and Wild 2009) and performance profiles (Dolan and Moré 2002). In a simplified way, a data profile provides the percentage of problems solved by a given algorithmic variant within a given computational budget (expressed in sets of $n_p + 1$ function evaluations, where $n_p$ denotes the dimension of problem $p$). Let $S$ and $P$ represent the set of solvers, associated with the different algorithmic variants considered, and the set of problems to be tested, respectively. If $h_{p,s}$ represents the number of function evaluations required by algorithm $s \in S$ to solve problem $p \in P$ (up to a certain accuracy), the data profile cumulative function is given by

$$d_s(\zeta) = \frac{1}{|P|} \left| \left\{ p \in P : \frac{h_{p,s}}{n_p + 1} \le \zeta \right\} \right|. \qquad (30)$$

With this purpose, a problem is considered to be solved to an accuracy level $\tau$ if the decrease obtained from the initial objective function value, $f(x_0) - f(x)$, is at least $1 - \tau$ of the best decrease obtained for all the solvers considered, $f(x_0) - f_L$, meaning:

$$f(x_0) - f(x) \ge (1 - \tau)\,[\, f(x_0) - f_L \,]. \qquad (31)$$

In the numerical experiments reported, the accuracy level was set equal to $10^{-5}$.

Performance profiles allow the evaluation of the efficiency and the robustness of a given algorithmic variant. Let $t_{p,s}$ be the number of function evaluations required by solver $s \in S$ to solve problem $p \in P$, according to the criterion (31). The cumulative distribution function corresponding to the performance profile of solver $s \in S$ is given by

$$\rho_s(\varsigma) = \frac{1}{|P|} \left| \{ p \in P : r_{p,s} \le \varsigma \} \right|,$$

with $r_{p,s} = t_{p,s} / \min\{ t_{p,\bar{s}} : \bar{s} \in S \}$. Thus, the value of $\rho_s(1)$ represents the percentage of problems where solver $s$ required the minimum number of function evaluations, meaning it was the most efficient solver. Large values of $\varsigma$ allow the evaluation of the capability of the algorithmic variants to solve the complete collection.
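For completeness, a small sketch of how (30) and the performance-profile ratios can be computed from a table of evaluation counts; the array names are ours, entries where the accuracy test (31) was never met are np.inf, and each problem is assumed to be solved by at least one solver.

```python
import numpy as np

def data_profile(H, dims, zetas):
    """Data profile (30): H[p, s] = function evaluations solver s needed on
    problem p (np.inf if (31) was never met); dims[p] = n_p."""
    groups = H / (np.asarray(dims, float)[:, None] + 1.0)  # budgets of n_p + 1
    return {s: [float(np.mean(groups[:, s] <= z)) for z in zetas]
            for s in range(H.shape[1])}

def performance_profile(T, sigmas):
    """Performance profile: T[p, s] = t_{p,s}; ratios r = t / (best t per p)."""
    R = T / T.min(axis=1, keepdims=True)
    return {s: [float(np.mean(R[:, s] <= sig)) for sig in sigmas]
            for s in range(T.shape[1])}
```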
Fig. 1 Data and performance profiles comparing the use of different strategies for the computation of the quadratic models

Figure 1 reports the results obtained when considering different strategies for building the quadratic models. In this case, the stricter approach is used for solving subproblem (10), always imposing the lower bound on each entry of the vector $y$ in each one-dimensional minimization. It is clear that the hybrid version that adequately adapts the sufficient decrease condition to the type of computed model presents the best performance. The hybrid version that does not adapt the sufficient decrease condition is no better than the fully-linear approach. Even so, both are better than requiring the computation of a fully-quadratic model at every iteration.

For the best variant, namely the hybrid version that adapts the sufficient decrease condition, we considered the second approach to the solution of problem (10), where the lower bound constraint is initially ignored, the computed solution $y$ being modified a posteriori if it does not satisfy the lower bound constraint. We denote this variant by adding the word projection. Results are very similar and can be found in Fig. 2. It is worth mentioning that the modification of the final solution, obtained by ignoring the lower bound constraint, was required only in seven problems, with a maximum of three times in two of those seven problems.

Fig. 2 Data and performance profiles comparing two different strategies to address the solution of subproblem (10)

6 Conclusions and final remarks

We present and analyze a derivative-free separable regularization approach for solving smooth unconstrained minimization problems. At each iteration we build a quadratic model of the objective function using only function evaluations. Several variants have been considered for this task, from a less expensive minimum Frobenius norm approach, to a more expensive fully-quadratic model, to a hybrid version that combines the previous approaches depending on the number of available useful points from previous iterations. For each one of the variants, we add to the model either a separable quadratic or a separable cubic regularization term to guarantee convergence to stationary points.

Moreover, for each option we present a WCC analysis and we establish that, for driving the norm of the gradient below $\varepsilon > 0$, the fully-quadratic and the minimum Frobenius norm regularized approaches need at most $O\left( n^{19/4} \varepsilon^{-3/2} \right)$ or $O\left( n^2 \varepsilon^{-2} \right)$ function evaluations, respectively.

The application of a convenient change of variables, based on the Schur factorization of the approximate Hessian matrices, trivializes the computation of the minimizer of the regularized models required at each iteration. In fact, the solution of the subproblem required at each iteration is reduced to the global minimization of $n$ independent one-dimensional simple functions (a polynomial of degree 2 plus a term of the form $|z|^3$) on a closed and bounded set on the real line. It is worth noticing that, for the typical low dimensions used in DFO, the $O(n^3)$ computational cost of Schur factorizations is insignificant compared to the cost associated with the function evaluations required to build the quadratic model. Nevertheless, in addition to its use in Brás et al. (2020), this separability approach can be extended to be efficiently applied in other large-scale scenarios, for example in inexact or probabilistic adaptive cubic regularization; see, e.g., (Bellavia et al. 2021; Cartis and Scheinberg 2018). We would like to point out that the global minimizers of the regularized models can also be obtained using some other tractable schemes that, instead of solving $n$ independent one-dimensional problems, solve at each iteration only one problem in $n$ variables; see, e.g., (Cartis et al. 2011a; Cristofari et al. 2019; Nesterov and Polyak 2006).
These non-separable schemes, as well as our separable approach, require an $O(n^3)$ linear algebra computational cost.

We also present a variety of numerical experiments to add understanding and illustrate the behavior of all the different options considered for model computation. In general, we noticed that all the options show a robust performance. However, the worst behavior, concerning the required number of function evaluations, is consistently observed when using the fully-quadratic approach, and the best performance is observed when using the hybrid versions combined with a separable regularization term.

Concerning the worst-case complexity (WCC) results obtained for the considered approaches, a few comments are in order. Even though these results are of a theoretical nature, and in general pessimistic in relation to the practical behavior of the methods, it is interesting to analyze which of the two considered approaches produces a better WCC result. For that, it is convenient to use their leading terms, i.e., $n^{19/4} \varepsilon^{-3/2}$ for the one using the fully-quadratic model and $n^2 \varepsilon^{-2}$ for the one using the minimum Frobenius norm model. After some simple algebraic manipulations,

$$n^{19/4} \varepsilon^{-3/2} < n^2 \varepsilon^{-2} \iff n^{11/4} < \varepsilon^{-1/2} \iff n < \varepsilon^{-2/11},$$

we obtain that for the fully-quadratic approach to be better (that is, to require fewer function evaluations in the worst case), it must hold that $n < \varepsilon^{-2/11}$, or equivalently that $\varepsilon < 1/n^{11/2}$. Therefore, if $n$ is relatively small and $\varepsilon$ is not very large (for example, $n < 9$ and $\varepsilon \approx 10^{-5}$), then the combined scheme that is based on the fully-quadratic model has a better WCC result than the scheme based on the minimum Frobenius norm approach. In our numerical experiments, the average dimension was 8.8 and for our stopping criterion we fixed $\varepsilon = 10^{-5}$, and hence, from the theoretical WCC point of view, the best option is the one based on the fully-quadratic model. However, in our computational experiments the worst practical performance is clearly associated with the combination that uses the fully-quadratic model. We also note that if we choose a more tolerant stopping criterion (say $\varepsilon = 10^{-2}$), then for most of the same considered small-dimensional problems we have $\varepsilon > 1/n^{11/2}$, and so the scheme that uses the minimum Frobenius norm model exhibits simultaneously the best theoretical WCC result and the best practical performance.

Finally, for future work, it would be interesting to study the practical behavior and the WCC results of the proposed derivative-free approach in the case of convex functions.

Acknowledgements We are thankful for the comments and suggestions of two anonymous referees, which helped us to improve the presentation of our work.

Author Contributions All authors contributed in the same way to the study, conception, design, data collection, conceptualization, methodology, formal analysis, and investigation. All authors read and approved the final manuscript.

Funding Open access funding provided by FCT|FCCN (b-on). The first and second authors are funded by national funds through FCT - Fundação para a Ciência e a Tecnologia, I.P., under the scope of projects PTDC/MAT-APL/28400/2017, UIDP/MAT/00297/2020, and UIDB/MAT/00297/2020 (Center for Mathematics and Applications). The third author is funded by national funds through the FCT - Fundação para a Ciência e a Tecnologia, I.P., under the scope of the projects CEECIND/02211/2017, UIDP/MAT/00297/2020, and UIDB/MAT/00297/2020 (Center for Mathematics and Applications).
Data availability The codes and datasets generated and/or analysed during the current study are available from the corresponding author on reasonable request.

Declarations

Conflict of interest No potential conflict of interest was reported by the authors.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

Bandeira AS, Scheinberg K, Vicente LN (2012) Computation of sparse low degree interpolating polynomials and their application to derivative-free optimization. Math Program 134:223–257
Bellavia S, Gurioli G, Morini B (2021) Adaptive cubic regularization methods with dynamic inexact Hessian information and applications to finite-sum minimization. IMA J Numer Anal 41:764–799
Birgin EG, Gardenghi JL, Martínez JM, Santos SA, Toint PhL (2017) Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models. Math Program 163:359–368
Brás CP, Martínez JM, Raydan M (2020) Large-scale unconstrained optimization using separable cubic modeling and matrix-free subspace minimization. Comput Optim Appl 75:169–205
Cartis C, Gould NIM, Toint PhL (2011a) Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Math Program 127:245–295
Cartis C, Gould NIM, Toint PhL (2011b) Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function- and derivative-evaluation complexity. Math Program 130:295–319
Cartis C, Gould NIM, Toint PhL (2012) On the oracle complexity of first-order and derivative-free algorithms for smooth nonconvex minimization. SIAM J Optim 22:66–86
Cartis C, Scheinberg K (2018) Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Math Program 169:337–375
Conn AR, Scheinberg K, Vicente LN (2008a) Geometry of interpolation sets in derivative free optimization. Math Program 111:141–172
Conn AR, Scheinberg K, Vicente LN (2008b) Geometry of sample sets in derivative-free optimization: polynomial regression and underdetermined interpolation. IMA J Numer Anal 28:721–748
Conn AR, Scheinberg K, Vicente LN (2009a) Global convergence of general derivative-free trust-region algorithms to first- and second-order critical points. SIAM J Optim 20:387–415
Conn AR, Scheinberg K, Vicente LN (2009b) Introduction to derivative-free optimization. SIAM, Philadelphia
Cristofari A, Dehghan Niri T, Lucidi S (2019) On global minimizers of quadratic functions with cubic regularization. Optim Lett 13:1269–1283
Custódio AL, Rocha H, Vicente LN (2010) Incorporating minimum Frobenius norm models in direct search. Comput Optim Appl 46:265–278
Dodangeh M, Vicente LN, Zhang Z (2016) On the optimal order of worst case complexity of direct search. Optim Lett 10:699–708
Dolan ED, Moré JJ (2002) Benchmarking optimization software with performance profiles. Math Program 91:201–213
Fasano G, Morales JL, Nocedal J (2009) On the geometry phase in model-based algorithms for derivative-free optimization. Optim Methods Softw 24:145–154
Garmanjani R, Júdice D, Vicente LN (2016) Trust-region methods without using derivatives: worst case complexity and the nonsmooth case. SIAM J Optim 26:1987–2011
Gould NIM, Orban D, Toint PhL (2015) CUTEst: a constrained and unconstrained testing environment with safe threads for mathematical optimization. Comput Optim Appl 60:545–557
Grapiglia GN, Yuan J, Yuan Y-X (2015) On the convergence and worst-case complexity of trust-region and regularization methods for unconstrained optimization. Math Program 152:491–520
Hardy GH, Littlewood JE, Pólya G (1934) Inequalities. Cambridge University Press, New York
Karas EW, Santos SA, Svaiter BF (2015) Algebraic rules for quadratic regularization of Newton’s method. Comput Optim Appl 60:343–376
Lu S, Wei Z, Li L (2012) A trust region algorithm with adaptive cubic regularization methods for nonsmooth convex minimization. Comput Optim Appl 51:551–573
Martínez JM (2017) On high-order model regularization for constrained optimization. SIAM J Optim 27:2447–2458
Martínez JM, Raydan M (2015) Separable cubic modeling and a trust-region strategy for unconstrained minimization with impact in global optimization. J Global Optim 53:319–342
Martínez JM, Raydan M (2017) Cubic-regularization counterpart of a variable-norm trust-region method for unconstrained minimization. J Global Optim 68:367–385
Moré JJ, Wild SM (2009) Benchmarking derivative-free optimization algorithms. SIAM J Optim 20:172–191
Nesterov Y (2004) Introductory lectures on convex optimization. Kluwer Academic Publishers, Dordrecht
Nesterov Y, Polyak BT (2006) Cubic regularization of Newton method and its global performance. Math Program 108:177–205
Scheinberg K, Toint PhL (2010) Self-correcting geometry in model-based algorithms for derivative-free unconstrained optimization. SIAM J Optim 20:3512–3532
Vicente LN (2013) Worst case complexity of direct search. EURO J Comput Optim 1:143–153
Xu P, Roosta F, Mahoney MW (2020) Newton-type methods for non-convex optimization under inexact Hessian information. Math Program 184:35–70

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
